The Value of Test-first Unit Testing

The value of test-driven development (TDD) has been demonstrated on many projects and by many people. It has been demonstrated on the job, not in a rarefied or artificial environment. It isn’t theoretical; it is borne out in the day-to-day work of developing real applications.

UTs as “Requirements”

But what is the underlying mechanism of unit tests (UTs)? In the human effort required to program a piece of software, what is it that UTs bring to the table? Simply put, UTs form a set of “mini-requirements” around a function (the Code Under Test, or CUT). Either the CUT behaves as the UTs that run it specify, or the asserts within those UTs fail and warn the developer. So UTs give a developer a powerful one-two punch: they can use UTs to specify how a CUT should behave, and they can run the UTs to double-check that the CUT actually behaves as specified.

A small-scale example:

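// assertEquals(expected, actual), as in JUnit 5's org.junit.jupiter.api.Assertions
void test_square()
  {
  assertEquals(0, f(0));

  assertEquals(1, f(1));
  assertEquals(1, f(-1));

  assertEquals(4, f(2));
  assertEquals(4, f(-2));

  assertEquals(9, f(3));
  assertEquals(9, f(-3));
  }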

The function f() returns the square of its parameter. The asserts double-check that behavior for representative positive and negative values and for zero.
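The UT above leans on a test framework’s assertEquals. As a minimal, framework-free sketch of the same “mini-requirements” idea, here is a hypothetical square() CUT and a hand-rolled assertEquals helper (both names are ours, not from any library) that throws as soon as the CUT breaks one of its requirements.

```java
// Hypothetical CUT plus a tiny hand-rolled test harness; no test framework required.
final class SquareExample {
    private SquareExample() {}

    // Code Under Test: returns the square of its parameter.
    static int square(int x) {
        return x * x;
    }

    // Minimal stand-in for a framework assertion: fail loudly on mismatch.
    static void assertEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    // The "mini-requirements": zero plus representative positive and negative values.
    static void testSquare() {
        assertEquals(0, square(0));
        assertEquals(1, square(1));
        assertEquals(1, square(-1));
        assertEquals(4, square(2));
        assertEquals(4, square(-2));
        assertEquals(9, square(3));
        assertEquals(9, square(-3));
    }

    public static void main(String[] args) {
        testSquare();
        System.out.println("test_square passed");
    }
}
```

If someone later changed square() to return x * Math.abs(x), the square(-1) assert would be the first to fail, pointing straight at the broken requirement.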

Using UTs to convert a code base

A large-scale example occurred on a project where we had more than 20K lines of C# code (running on Mono under Linux) and we wanted to convert it to Java. Luckily, we had a full set of UTs for it. The conversion process was:

  • first, comment out all of the existing C# UTs
  • uncomment one UT
  • convert that UT to Java; naturally, it fails, since there’s no Java code for it to run yet
  • add just enough Java code to make the UT pass
  • find other UTs related to this piece of code, convert them to Java one by one, and fix the UT or CUT until they pass
  • uncomment another UT and repeat until the full suite of Java UTs is running successfully
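One pass through these steps might look like this in miniature. Everything in it is hypothetical: the Converter class, its method, and the C# NUnit assert quoted in a comment are illustrative rather than taken from the project, and a hand-rolled check helper stands in for a Java test framework.

```java
// Hypothetical Java port of one C# class, written only after its UT was converted.
final class Converter {
    private Converter() {}

    // Step 4: just enough Java code to make the converted UT pass.
    static double celsiusToFahrenheit(double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }
}

final class ConverterTest {
    private ConverterTest() {}

    // Step 3: the converted UT. Original C# (NUnit) line, kept as a comment while porting:
    //   Assert.AreEqual(212.0, Converter.CelsiusToFahrenheit(100.0), 1e-9);
    static void testBoilingPoint() {
        check(Math.abs(Converter.celsiusToFahrenheit(100.0) - 212.0) < 1e-9, "boiling point");
    }

    // Step 5: a related UT, converted next and run against the same CUT.
    static void testFreezingPoint() {
        check(Math.abs(Converter.celsiusToFahrenheit(0.0) - 32.0) < 1e-9, "freezing point");
    }

    static void check(boolean ok, String name) {
        if (!ok) {
            throw new AssertionError("UT failed: " + name);
        }
    }

    public static void main(String[] args) {
        testBoilingPoint();
        testFreezingPoint();
        System.out.println("all converted UTs pass");
    }
}
```

Before celsiusToFahrenheit() existed, the converted UT could not even compile, which is the Java equivalent of the expected initial failure.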

At the end of this process, our code base was converted to Java, and we had strong confidence that it behaved exactly as the old C# code base did, at least for everything the UTs exercised.

UTs can be difficult to read

But in some ways UTs are worse than a written set of requirements. To find out what the CUT actually does, you have to read the UT (which is yet more code) and infer the expected behavior. Here’s an example:

void test_func()
{
  // assertEquals(expected, actual)
  assertEquals(1, f(1));
  assertEquals(2, f(2));
  assertEquals(3, f(3));
  assertEquals(5, f(4));
}

What is f() supposed to be doing? Hard to tell, but perhaps it can be guessed.

If the UT is written well, this effort can be much simpler.

// assertEquals(expected, actual) and assertThrows(type, code), as in JUnit 5
void test_prime()
{
  // zeroth prime is not implemented
  assertThrows(ParameterRangeExcp.class, () -> f(0));

  // negative indices should not be used
  assertThrows(BadParameterExcp.class, () -> f(-1));

  // try some representative samples
  assertEquals(1, f(1));
  assertEquals(2, f(2));
  assertEquals(3, f(3));
  assertEquals(5, f(4));

  // can only handle the first 10 primes
  assertThrows(ParameterRangeExcp.class, () -> f(11));
}

In this case, it looks like f() returns one of the first 10 primes given an index (counting 1 as the first “prime”, which modern convention does not). Well, maybe it does and maybe it doesn’t, but it looks plausible enough based on the UT.
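To make the reading exercise concrete, here is one hypothetical implementation that would satisfy every assert in test_prime(). It is a sketch, not the original CUT: the lookup table (which counts 1 as the first entry, as the UT implies) and the exception messages are assumptions, and the two exception classes are minimal stand-ins for the names the UT uses.

```java
// Minimal stand-ins for the exception types named in the UT.
class ParameterRangeExcp extends RuntimeException {
    ParameterRangeExcp(String message) { super(message); }
}

class BadParameterExcp extends RuntimeException {
    BadParameterExcp(String message) { super(message); }
}

final class Primes {
    private Primes() {}

    // The first 10 "primes" as the UT implies, counting 1 as the first entry.
    private static final int[] FIRST_TEN = {1, 2, 3, 5, 7, 11, 13, 17, 19, 23};

    // Returns the n-th entry (1-based) of FIRST_TEN.
    static int f(int n) {
        if (n < 0) {
            throw new BadParameterExcp("negative index: " + n);
        }
        if (n == 0 || n > FIRST_TEN.length) {
            throw new ParameterRangeExcp("index out of range: " + n);
        }
        return FIRST_TEN[n - 1];
    }
}
```

Notice how much of this design (which inputs throw which exception, and the upper limit of 10) comes straight from the UT’s comments and asserts rather than from the name f().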

UTs can be ignored

Once a UT is written and passing, it is tempting to ignore it from then on. This can lead to false positives. For example, a change is made to the CUT and, by chance, the existing UTs continue to pass. But because they are passing, there is little impetus for the developer to review them to see whether they should be changed.
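A sketch of how this can happen, reusing the four representative asserts from test_func() above. Both CUTs are hypothetical: the original is a lookup of 1, 2, 3, 5, 7, ..., and the changed version is a Fibonacci-style recurrence 1, 2, 3, 5, 8, .... Both pass the existing asserts even though they disagree from index 5 onward.

```java
import java.util.function.IntUnaryOperator;

final class FalsePositiveDemo {
    private FalsePositiveDemo() {}

    // Hypothetical original CUT: 1-based lookup, counting 1 as the first entry.
    private static final int[] TABLE = {1, 2, 3, 5, 7, 11, 13, 17, 19, 23};

    static int original(int n) {
        return TABLE[n - 1];
    }

    // Hypothetical changed CUT: each value is the sum of the previous two (1, 2, 3, 5, 8, ...).
    static int changed(int n) {
        int current = 1;
        int next = 2;
        for (int i = 1; i < n; i++) {
            int sum = current + next;
            current = next;
            next = sum;
        }
        return current;
    }

    // The existing asserts from test_func(), reduced to a pass/fail result.
    static boolean passesExistingUTs(IntUnaryOperator f) {
        return f.applyAsInt(1) == 1
            && f.applyAsInt(2) == 2
            && f.applyAsInt(3) == 3
            && f.applyAsInt(4) == 5;
    }

    public static void main(String[] args) {
        System.out.println(passesExistingUTs(FalsePositiveDemo::original)); // true
        System.out.println(passesExistingUTs(FalsePositiveDemo::changed));  // true
        System.out.println(original(5) + " vs " + changed(5));              // 7 vs 8
    }
}
```

The suite stays green after the change, so nothing prompts the developer to revisit it, even though f(5) silently moved from 7 to 8.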

UTs don’t guarantee full coverage

A UT only tests what it tests. Does a set of UTs, in fact, test all of the functionality of the CUT? If a UT is missing, that is, if some functionality of the CUT isn’t tested, all of the existing UTs will blithely pass. Nothing indicates that a test is missing or that part of the functionality is untested, and there’s no automatic way to determine it. It is up to the developer to ensure it (see Code Reviews for one way to double-check). Coverage tools can also help: they show which parts of the CUT no UT executes, though not which expected behavior is missing.
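Coverage tools automate a simple idea: record which parts of the CUT actually ran while the UTs executed. The hand-instrumented sketch below is hypothetical (real tools such as JaCoCo instrument the bytecode for you instead of asking you to add counters). It shows a classify() CUT whose negative branch no UT exercises, while every existing assert still passes.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class CoverageProbe {
    private CoverageProbe() {}

    // One hit counter per branch of the CUT: 0 = negative, 1 = zero, 2 = positive.
    static final int[] HITS = new int[3];
    static final String[] BRANCHES = {"negative", "zero", "positive"};

    // Hypothetical CUT, hand-instrumented to record which branch ran.
    static String classify(int n) {
        if (n < 0) {
            HITS[0]++;
            return "negative";
        } else if (n == 0) {
            HITS[1]++;
            return "zero";
        } else {
            HITS[2]++;
            return "positive";
        }
    }

    // Existing UTs: they pass, but nothing exercises the negative branch.
    static void runExistingUTs() {
        check(classify(0).equals("zero"));
        check(classify(5).equals("positive"));
    }

    static void check(boolean ok) {
        if (!ok) {
            throw new AssertionError("UT failed");
        }
    }

    // Report the branches that no UT executed, as a coverage tool would.
    static String[] uncoveredBranches() {
        return IntStream.range(0, HITS.length)
            .filter(i -> HITS[i] == 0)
            .mapToObj(i -> BRANCHES[i])
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        runExistingUTs();
        System.out.println("uncovered: " + Arrays.toString(uncoveredBranches())); // uncovered: [negative]
    }
}
```

The report flags untested code, but it cannot tell you that a requirement nobody wrote code for is missing; that part still falls to the developer.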

You can run UTs

In other ways, UTs are better than a written set of requirements. You can run the UTs and explicitly find out if the CUT is behaving correctly or not. A text-based set of requirements can’t be run that way. Although there are some symbolic requirement systems (e.g. Z) that have verifier utilities that can check for bugs, they only find problems in the requirements themselves. There are none that I am aware of that check the requirements against the actual code.

UTs confirm run-time behavior — almost

UTs create a close simulation of the final run-time environment: they execute the CUT in roughly the same run-time environment it will have in the “real” application. This means we can trust that the CUT behaves as it will in its final resting place in the application. The strong similarity between the two environments gives solid evidence for the validity of the assert results. It is only “almost”, though, because a UT usually runs the CUT without the rest of the application around it.

A set of UTs represents relevant system behavior

If a set of UTs is built up as development progresses, the portion of a CUT’s functionality that is relevant to the rest of the system will tend to be fully tested. Some parts of the CUT’s functionality may be untested, but they are very likely irrelevant because they are unused or trivial. This implies that, on the whole, the application code is well tested, and more specifically, tested for all of the behavior that is relevant to the system as a whole.

Write UTs from the start

But note that this also implies that writing UTs after the fact doesn’t work very well. There is no conclusive way to tell which parts of a CUT’s functionality are relevant. A profiler can give us a clue, and paths through the CUT believed to be unused could be made to raise an exception, but neither technique is conclusive. Still, having too much unit testing is clearly better than not having enough!
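The exception technique mentioned above can be sketched like this. The statusLabel() CUT and its status codes are hypothetical; the idea is that a path you believe no caller reaches fails loudly instead of quietly returning something, so running the application (or its UTs) exposes any caller you did not know about. As noted above, this is a clue rather than proof: silence only means that no run so far has reached the path.

```java
final class UnusedPathProbe {
    private UnusedPathProbe() {}

    // Hypothetical CUT: codes 1 and 2 are known to be used by the application.
    static String statusLabel(int code) {
        switch (code) {
            case 1:
                return "OK";
            case 2:
                return "WARN";
            default:
                // Believed-unused path: fail loudly so any real caller is discovered.
                throw new IllegalStateException("believed-unused path reached: code=" + code);
        }
    }

    public static void main(String[] args) {
        System.out.println(statusLabel(1)); // OK
        try {
            statusLabel(3);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // believed-unused path reached: code=3
        }
    }
}
```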

UTs as Exploration

There are two kinds of unit tests in TDD, and they are closely related to each other.

The first kind of test confirms that the code you’ve just added works as you expect, in and around the existing functionality. It examines slightly different but related parts of the solution space that are important to you. These tests do not have to be comprehensive, but they do have to be relevant. You write these until your anxiety is satisfied. This kind was explored in the first section.

The second kind is a speculative test: it causes you to write code for a very small part of new functionality. The new code satisfies some small part of your spec, requirements, task, Story, etc.

The second kind moves the design forward in an area; the first kind fleshes out the design in that area. Bill Wake calls the second kind “generative” and the first kind “elaborative”.
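As a hedged illustration of the two kinds, consider a hypothetical isLeapYear() written test-first. The first UT below is generative: it demands behavior that does not exist yet (every fourth year is a leap year) and drives the initial code. The second is elaborative: once that code exists, it probes the nearby century rules. The Gregorian rule itself is standard; the split between the two tests is our own example.

```java
// Hypothetical CUT written test-first.
final class LeapYear {
    private LeapYear() {}

    // Gregorian rule: every 4th year, except centuries not divisible by 400.
    static boolean isLeapYear(int year) {
        return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    }
}

final class LeapYearTest {
    private LeapYearTest() {}

    static void check(boolean ok, String what) {
        if (!ok) {
            throw new AssertionError("UT failed: " + what);
        }
    }

    // Generative UT: demands new behavior and drives the first version of the code.
    static void testEveryFourthYearIsLeap() {
        check(LeapYear.isLeapYear(2024), "2024 is a leap year");
        check(!LeapYear.isLeapYear(2023), "2023 is not a leap year");
    }

    // Elaborative UT: fleshes out related cases around the existing behavior.
    static void testCenturyRules() {
        check(!LeapYear.isLeapYear(1900), "1900 is not a leap year");
        check(LeapYear.isLeapYear(2000), "2000 is a leap year");
    }

    public static void main(String[] args) {
        testEveryFourthYearIsLeap();
        testCenturyRules();
        System.out.println("leap-year UTs pass");
    }
}
```

Because the two kinds are closely related, which label a given test deserves is often a judgment call.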

UTs are software oscilloscopes

Hardware folks use oscilloscopes to peek into a circuit while it is running. An oscilloscope shows the behavior of the circuit in real time, in actual operation. UTs perform the same function for software.

Use UTs to explore edge functionality

A simple example is using a UT to explore how a system call works. Don’t run the whole application just to find out how the Date function works in leap years. Isolate the function and test its behavior. Isolate the small, rare conditions and test them directly.
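Here is a sketch of such an exploratory UT in Java, using java.time.LocalDate and java.time.Year as stand-ins for “the Date function” (the text does not say which date API was involved). Each check records something learned about leap years without starting the application.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Year;

final class LeapYearExploration {
    private LeapYearExploration() {}

    static void check(boolean ok, String what) {
        if (!ok) {
            throw new AssertionError("surprise: " + what);
        }
    }

    static void exploreLeapYears() {
        // The day after Feb 28 in a leap year is Feb 29.
        check(LocalDate.of(2024, 2, 28).plusDays(1).equals(LocalDate.of(2024, 2, 29)),
              "2024-02-29 exists");

        // Adding a year to Feb 29 clamps to the last valid day of February.
        check(LocalDate.of(2024, 2, 29).plusYears(1).equals(LocalDate.of(2025, 2, 28)),
              "plusYears clamps to 2025-02-28");

        // Century years are leap years only when divisible by 400.
        check(!Year.isLeap(1900), "1900 is not a leap year");
        check(Year.isLeap(2000), "2000 is a leap year");

        // Constructing Feb 29 in a non-leap year throws rather than rolling over.
        boolean threw = false;
        try {
            LocalDate.of(2023, 2, 29);
        } catch (DateTimeException e) {
            threw = true;
        }
        check(threw, "2023-02-29 is rejected");
    }

    public static void main(String[] args) {
        exploreLeapYears();
        System.out.println("leap-year exploration passed");
    }
}
```

The plusYears() clamping and the exception on an invalid date are exactly the kind of small, rare behavior that is far cheaper to learn here than by stepping through the whole application.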

The odds of your code working correctly in ALL conditions go up considerably, and that knowledge is pure advantage to you. Once you have a solid understanding of the low-level function’s behavior, write the next level of the design.

Use UTs to test next level design

A more sophisticated use of UTs is to write tentative higher-level code and test it piecemeal, as you write it.

These UTs are exploration. If your code is straightforward, you don’t need them. But sometimes complex designs or algorithms are necessary. Use UTs to display or check various aspects of the code as you develop the algorithm: Design X doesn’t work because Y happens when the Date function returns a particular value, but Design Z works correctly in all these cases.
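A hypothetical instance of that exploration: two candidate designs for “the same date one year later”. Design X adds 365 days; Design Z uses LocalDate.plusYears(1). The requirement, method names, and probe dates are our own illustration, chosen so that some probes span a February 29.

```java
import java.time.LocalDate;

final class OneYearLaterExploration {
    private OneYearLaterExploration() {}

    // Design X (hypothetical): assume a year is always 365 days.
    static LocalDate designX(LocalDate d) {
        return d.plusDays(365);
    }

    // Design Z (hypothetical): let the calendar add one year.
    static LocalDate designZ(LocalDate d) {
        return d.plusYears(1);
    }

    public static void main(String[] args) {
        LocalDate[] probes = {
            LocalDate.of(2023, 1, 15), // following year has no Feb 29
            LocalDate.of(2024, 1, 15), // following year spans 2024-02-29
            LocalDate.of(2023, 3, 1)   // following year spans 2024-02-29
        };
        for (LocalDate d : probes) {
            System.out.println(d + " -> X: " + designX(d) + ", Z: " + designZ(d));
        }
    }
}
```

Run as written, X and Z agree for 2023-01-15, but X lands a day early (2025-01-14 and 2024-02-29 instead of 2025-01-15 and 2024-03-01) whenever the year spans February 29, which is the kind of evidence that rules Design X out.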

Once this code is working, write the final “elaborative” UTs. These can be simple tweaks or even rewrites of the “generative” versions. Sometimes, though, the generative UTs are scrapped entirely and new UTs are written from scratch. This can happen if the generative tests do not clearly show the relevant expected behavior of the system: perhaps they show some actual behavior, but not the important or relevant behavior, or they don’t convey the expected behavior as clearly or cleanly as they could.