Quality Matters, #2: Correctness, Robustness, and Reliability

by Matthew Wilson

This column instalment first published in ACCU's Overload, #93, October 2009. All content copyright Matthew Wilson 2009-2010.


In the previous instalment I defined correctness as 'the degree to which a software entity's behaviour matches its specification' [QM-1], but didn't offer definitions of robustness or reliability. This time I'm going to take the plunge and attempt definitions of them. I embark on a (possibly deranged) attempt to equate computing with the worlds of Newtonian and Quantum Physics, along with the somewhat more obvious parallel drawn between the behaviour of software systems and chaos theory.

I'll do my best to keep my feet on planet Earth by using examples from real-world experience, illustrating how some software entities can be established to be correct, but the best we can hope for with most is to ensure adequate levels of robustness. I'll also comment on why correctness may be of no interest to non-programmers, and reliability is not of much interest to programmers.

Weaving together a cogent narrative for this instalment has been exceedingly difficult, and you may well feel that it's escaped me. If so, all I can say is watch out for the instalment on contract programming!

Extant definitions

Before I can begin to pontificate about robustness and reliability, I need to consider the definitions that currently exist in the canon.

The Shorter Oxford English Dictionary (SOED) [SOED] gives the following definitions:

Steve McConnell [CC] gives these definitions:

Bertrand Meyer [OOSC] gives these definitions:

As is probably quite obvious, my definition of correctness is informed by these definitions, which I've examined many times previously. The important aspect taken from Meyer's definition [OOSC] is that correctness is relative to a specification. Indeed, Meyer states this most clearly in his Software Correctness Property [OOSC]:

Software Correctness Property: Correctness is a relative notion.

Without a specification against which to compare behaviour, the notion of correctness is meaningless.

The important aspect taken from McConnell [CC] is that correctness is a variable notion, and that a software entity's behaviour may correspond to a specification to a certain degree. At first blush this may seem a bizarre idea. Certainly, a software entity that is known to fail to meet its specification is defective (aka incorrect), plain and simple. From the perspective of a potential user of a software entity, that its creator (or any other agent) may volunteer that it is 50% correct or 90% correct is of no use, because such figures, even if obtained by repeatable quantitative measurements, e.g. unit-tests, cannot be meaningfully used in the calculation of quantitative failure probabilities of a software system built from the offending entity. We'll discuss why in the next section.

Beyond this, there are other, serious, objections to attempting to make use of a software entity that is known to be defective - something known as the Principle of Irrecoverability - but those discussions will have to wait until another time.

So what is the purpose of considering correctness as a quantitative concept (in addition to its being a relative one)? Well, there are the practical benefits to the producers of a software entity in being able to quantify its degree of divergence from a specification. Of course, we all know that any given defect can, upon cursory examination, appear to be of the same magnitude (of corrective development effort) as another, and yet take two, five, ten, sometimes hundreds of times longer to correct. But averaged out over the course of a project, team, career, there is a usefulness to being able to quantify. Certainly, when I'm developing software libraries, I can take the temperature for the project velocity (if you'll pardon the atrocious mixing of my metaphors) via the defect fix rate in a new (or regressive) group of tests.

But all of the foregoing paragraph, while having some utility to our consideration of the subject, is pretty pedestrian stuff, infused with more than a whiff of equivocation. It could even be taken as an invitation to debates I'm not much interested in having. In point of fact, we needn't care about this stuff, because there's a point of far greater significance in eschewing the speciously attractive binary notion of correctness. Simply, a given software entity can exist in three apparent states of correctness:

  • correct
  • defective
  • of unknown correctness

The third state is somewhat like poor old Schrödinger's cat [GRIBBIN], who is neither dead nor alive until examined. So too, software can be correct, or defective, or neither (known to be) correct nor defective. The latter state collapses into exactly one of the former when it is evaluated against a specification.

In this instalment I'm going to consider the notion that most, perhaps all, software systems are built up from layers of abstraction most of which are in the disconcerting third state of uncertain correctness. Furthermore, I'm going to argue that software has to be like this, and that's what makes it challenging, fun, and not a little frightening.

(Note: I'm still not going to discuss the definitions of what a specification is in this instalment. What a tease ...)

Exactitude, non-linearity, Newtonian software, quantum execution environments, and why software development is not an engineering discipline

A perennial debate within (and without) the software community is whether software development is an engineering discipline, and, if not, why not. Well, despite plentiful (mis)use of the term 'software engineer' in my past, I'm increasingly moving over to the camp of those whose opinion is that it is not an engineering discipline. To illustrate why, I'm going to draw from three of my favourite branches of science: Newtonian physics, Chaos theory, and Quantum physics, with a modicum of logic thrown in for good measure.

The Unpredictable Exactitude Paradox

As my career has progressed - both as practitioner (programmer, consultant) and as (imperfect) philosopher (author, columnist) - the issues around software quality have grown in importance to me. The one that confounds and drives me more than all others is (what I believe to be) the central dichotomy of software system behaviour:

The Unpredictable Exactitude Paradox: Software entities are exact and act precisely as they are programmed to do, yet the behaviour of (non-trivial) computer systems cannot be precisely understood, predicted, nor relied upon to refrain from exhibiting deleterious behaviour.

Note that I say programmed to do, not designed to do, because a design and its reification into program form are often, perhaps mostly, perhaps always, divergent. Hence the purpose of this column, and, to a large extent, the purposes of our careers. (The issue of the commonly defective transcription of requirements to design to code will have to wait for another time.)

Consider the behaviour of the computer on which I'm writing this epistle. Assuming perfect hardware, it's still the case that the sequence of actions - involving processor, memory, disk, network - carried out on this machine during the time I've written this paragraph have never been performed before, and that it is impossible to rely on the consequences of those actions. And that is despite the fact that the software is doing exactly what it's been programmed to do.

I mentioned earlier that the relationship between the size/number of defects and the effort involved to remedy them is not linear. This non-linearity is also to be seen in the relationship between the size/number of defects and their effects. Essentially, this is because software operates on the certain, strict interpretation of binary states, and there are no natural attenuating mechanisms in software at the scale of these states. If one iron atom in a bridge is replaced by, say, a molybdenum atom, the bridge will not collapse, nor exhibit any measurable difference in its ability to be a bridge. Conversely, an errant bit in a given process may have no effect whatsoever, or may manifest benignly (e.g. a slightly different hue in one pixel in a picture), or may have major consequences (e.g. sending confidential information to the wrong customer).

We, as software developers, need language to support our reasoning and communication about software, and it must address this paradox, otherwise we'll be stuck in fruitless exchanges, often between programmers and non-programmers (clients, users, project managers), each of whom, I believe, tend to think and see the software world at different scales. I will continue the established use of the term correctness to represent exactitude. And I will, influenced somewhat by Meyer and McConnell, use the terms robustness and reliability in addressing the inexact, unpredictable, real behaviour of software entities.

Bet-Your-Life?: review

Let's look at some code. Remember the first of the Bet-Your-Life? Test cases from the previous instalment [QM-1]:

    bool invert(bool value);

We can implement this easily, as follows:

    bool invert(bool value)
    {
      return !value;
    }

In fact, it'd be pretty hard to write any implementation other than this. Certainly there are plenty of (possibly apocryphal) screeds of long-winded alternative implementations available on the web (such as on www.thedailywtf.com), but pretty much any functionally correct implementation that does not involve fatuous complexity/dependencies - such as converting value to string and then using strcmp() against "0" or "1" - will evaluate to the pseudo-machine code shown in Listing 1.

Listing 1

    bool invert(register bool value)
    {
      register bool result;
      if(0 != value)
      {
        result = 0;
      }
      else
      {
        result = 1;
      }
      return result;
    }

With languages that have a bona-fide Boolean type, such as Java and C#, the value may not need to be compared against 0, and may well be implemented as equal to true (or to false). Other languages such as C and C++ represent (for historical and performance reasons [IC++]) a notional Boolean false value as being 0, and a notional Boolean true value as being all non-0 values. In those, comparison against zero is necessary, even for their built-in bool types! In either case, it's almost impossible to implement this function incorrectly.

If we permit ourselves the luxury of assuming a correctly functioning execution environment, then without recourse to any automated techniques, or even to a detailed written specification, we may reasonably assert the correctness of this function by visual inspection.

Now consider the definition of strcmp(), the second Bet-Your-Life? Test case:

    int strcmp(char const* lhs, char const* rhs);

Here's an implementation I knocked up during the preparation of this instalment, without recourse to any I've written in the past (or to various open-source and commercial implementations).

    int strcmp(char const* s1, char const* s2)
    {
      for(; '\0' != *s1 && *s1 == *s2; ++s1, ++s2)
      {}
      return (int)(unsigned char)*s1 - (int)(unsigned char)*s2; /* C99: 7.21.4(1) */
    }

Notwithstanding an issue I had with the signedness of the comparison (see sidebar), I intended to use the example of strcmp() as a (modest) stepping-stone in complexity - up from invert(), and down from b64_encode() - which relies on more assumptions about the execution environment:

  • that each of s1 and s2 points into valid, readable memory
  • that each pointed-to block of memory contains a terminating NUL character ('\0') within its accessible range

If this smells suspiciously like a contract pre-condition [OOSC, IC++], well, that's something we'll examine in a later instalment.

This additional reliance on external factors is a significant part of the increased complexity over invert(). In languages such as C#(/.NET) and Java, it is reasonable to assume that an object reference is valid (or is the sentinel value, null), but in C (and C++) where pointers have free range, it is possible for strcmp() to receive pointers that:

  • point to valid NUL-terminated strings
  • are NULL
  • point to invalid memory, or to memory lacking a NUL terminator within its accessible range

The possibility of the latter two options makes reasoning about the correctness of strcmp() and software entities built in terms of it more complicated than is the case for invert(). Specifically, it is possible for strcmp() to be passed invalid arguments (as a result of a defect elsewhere within the program), whereas all 'physically' possible arguments to invert() are valid.

The next Bet-Your-Life? Test case is b64_encode() (see Listing 2).

Listing 2

    size_t b64_encode(
      void const* src
    , size_t      srcSize
    , b64_char_t* dest
    , size_t      destLen)
    {
      . . .
      b64_char_t* p   =   dest;
      b64_char_t* end =   dest + destLen;
      size_t      len =   0;

      for(; NUM_PLAIN_DATA_BYTES <= srcSize; srcSize -= NUM_PLAIN_DATA_BYTES)
      {
        characters[0] = (b64_char_t)((src[0] & 0xfc) >> 2);
        characters[1] = (b64_char_t)(((src[0] & 0x03) << 4) + ((src[1] & 0xf0) >> 4));
        characters[2] = (b64_char_t)(((src[1] & 0x0f) << 2) + ((src[2] & 0xc0) >> 6));
        characters[3] = (b64_char_t)(src[2] & 0x3f);
        src += NUM_PLAIN_DATA_BYTES;
        *p++ = b64_chars[(unsigned char)characters[0]];
        *p++ = b64_chars[(unsigned char)characters[1]];
        *p++ = b64_chars[(unsigned char)characters[2]];
        *p++ = b64_chars[(unsigned char)characters[3]];
        if( ++len == lineLen &&
            p != end)
        {
          *p++ = '\r';
          *p++ = '\n';
          len = 0;
        }
      }

      if(0 != srcSize)
      {
        unsigned char dummy[NUM_PLAIN_DATA_BYTES];
        size_t        i;
        for(i = 0; i < srcSize; ++i)
        {
          dummy[i] = *src++;
        }
        for(; i < NUM_PLAIN_DATA_BYTES; ++i)
        {
          dummy[i] = '\0';
        }
        b64_encode_(&dummy[0], NUM_PLAIN_DATA_BYTES,
           p, NUM_ENCODED_DATA_BYTES * (1 + 2),
           0, rc);
        for(p += 1 + srcSize; srcSize++ < NUM_PLAIN_DATA_BYTES; )
        {
          *p++ = '=';
        }
      }
      . . .
    }

I'm not going to show the full implementation of this for brevity's sake. (If you're interested you can download the library [B64] and see for yourself.) Like strcmp() (and invert()), the b64 library has no dependencies on any other software libraries, not even on the C runtime library (except when contracts are being enforced, e.g. in debug builds). This permits a substantial level of confidence in behaviour, because only the b64 software entities themselves are involved in such considerations. Broadly speaking, it means that behaviour, once 'established', can be relied on regardless of other activities in the execution environment. However, it's fair to say that the internal complexity of b64_encode() is substantially increased over that of strcmp(). Consequently, I think it is impossible in a library such as this to stipulate its correctness based on visual inspection of the code; anyone who would do so would be rightly seen as reckless (at best).

Thus we can see that increasing complexity acts strongly against human-assessed correctness. But there's more to this than correctness. Let's now consider the final member of the Bet-Your-Life? Test cases, Recls_Search() from the recls library [RECLS]:

    RECLS_API Recls_Search(
      char const* searchRoot
    , char const* patterns
    , int         flags
    , hrecls_t*   phSrch
    );

An incomplete description of the semantics of this function is as follows:

Clearly the recls library (or at least this part of it) has substantial behavioural complexity. That alone makes it, in my opinion, impossible for any reasonable developer to stipulate its correctness. But that's only part of it. Of greater significance is that recls is implemented in terms of a great many other software entities, including library components (from STLSoft) and operating system facilities (e.g. the opendir() API on UNIX and the FindFirstFile() API on Windows). And even that is not the major issue. The predominant concern is that recls interacts with the file-system, whose structure and contents can (and do) change independently of the current process. By definition, it is impossible to establish correctness for recls or any other software entities that interact with aspects of the execution environment that are subject to change from other, independent software entities.

By now you're probably starting to worry that I'm asserting that correctness cannot be stipulated. Am I saying that software cannot be correct?

Newtonian software, quantum execution environment

At the risk of embarrassing myself, because it's been 20 years since I did any formal study of the subject, I will now draw parallels between software+hardware and Newtonian+quantum physics.

Consider a point object travelling through an empty universe. In Newtonian physics, the object will continue to travel in the same line, at the same speed, forever more. If there are two point objects, they will influence each other's travel in predictable ways, based on their masses, positions and velocities. But add in a third, fourth, ... trillionth object, and the behaviour of the universe becomes complex, and therefore unpredictable (beyond small timescales within which simplifying assumptions may be used to form reasonable approximate results). As is the case in reality, if the objects are non-point, then we have to consider rotation of the bodies, and heat, and a whole lot more besides, including chemistry, biology, even sociology and technology! Thus, even in a Newtonian universe, behaviour is non-linear (and unpredictable) due to the interactions of entities (in part because some of the quantities involved are irrational, and calculations thereby require infinite precision).

In a quantum universe, there are two challenges to our understanding even in the case of a single point object. For one thing, it is, in principle, impossible to state with certainty the position and momentum of the object. Second, it's possible that a virtual particle will spring into existence in any part of the otherwise empty universe at any time. (Here my inadequate training lets me down in understanding whether a virtual particle can have a net effect on our single travelling particle, but I think you get enough of the picture for us to have a working analogy.)

I contend that software is conceived in a Newtonian frame, where we imagine we can rely on perfect (non-defective) execution environments, and that hardware, necessarily, introduces a quantum aspect, due to the imperfect reliability of hardware systems (and the occasional cosmic ray that might flip a bit inside your processor) and the actions of other operating entities (programs, hardware, etc.). Let's look back to the Bet-Your-Life? Test examples from the previous instalment, and consider the behaviours in light of the two perspectives, where imperfect execution environments are subject to 'Quantum' surprises:


I'm not going to engage in discussion about specifications in this instalment, but must at least provide a definition in order that we can properly engage in further reasoning about correctness. Without further ado, a specification is one (or both, if used in concert) of two things:

Specification: A software entity's specification is the sum of all its passing unit-tests.


Specification: A software entity's specification is the sum of all its unfired active contract enforcements.

Everything else is fluff and air.

(Note: for today, I'm considering only functional aspects of specifications. Other aspects, such as performance - time and/or resource consumption - are outside the scope of this instalment, and will be discussed at another time. I'm also only going to be talking about measuring specifications in terms of unit-tests.)

Final definitions

Given the foregoing discussion, I'm now in a position to offer my definitions of these three important aspects of software quality.

Correctness: The degree to which a software entity's behaviour matches its specification.

Robustness: The adjudged ability of a software entity to behave according to the expectations of its stakeholders.

Reliability: The degree to which a software system behaves robustly over time.


Correctness is exact and measurable. It is the concern of software developers.

When measured (against its specification), the correctness of a software entity 'collapses' from the unknown state to exactly one of two states: correct and defective.

The binary nature of measured correctness is a great thing. For example, consider that we measure the correctness of invert() as shown in Listing 3 (assuming a C# implementation, with NUnit [NUNIT]).

Listing 3

    [Test]
    public void Test_False()
    {
      Assert.IsTrue(BetYourLifeTests.Invert(false));
    }

    [Test]
    public void Test_True()
    {
      Assert.IsFalse(BetYourLifeTests.Invert(true));
    }

That's a complete functional test for BetYourLifeTests.Invert(). Informed by this, we could now implement another function, Nor(), as shown in Listing 4 (sticking with C#).

Listing 4

    public static class LogicalOperations
    {
      public static bool Nor(bool v1, bool v2)
      {
        return BetYourLifeTests.Invert(v1) && BetYourLifeTests.Invert(v2);
      }
    }

Knowing that Invert() is correct, we may choose to assert that Nor() will faithfully give expected behaviour based on visual inspection. And we could go on to completely measure that correctness with ease, involving just four unit-tests.


However, add in just a little complexity and things get sticky very quickly. Consider that we've measured strcmp()'s correctness against a unit-test suite as shown in Listing 5, this time in C, with xTests [XTESTS].

Listing 5

    static void test_equal()
    {
      XTESTS_TEST_INTEGER_EQUAL(0, strcmp("", ""));
      XTESTS_TEST_INTEGER_EQUAL(0, strcmp("a", "a"));
      XTESTS_TEST_INTEGER_EQUAL(0, strcmp("ab", "ab"));
      XTESTS_TEST_INTEGER_EQUAL(0, strcmp("abc", "abc"));
    }

    static void test_less()
    {
      XTESTS_TEST_INTEGER_LESS(0, strcmp("a", "b"));
      XTESTS_TEST_INTEGER_LESS(0, strcmp("ab", "bc"));
      XTESTS_TEST_INTEGER_LESS(0, strcmp("abc", "bcd"));
    }

    static void test_greater()
    {
      XTESTS_TEST_INTEGER_GREATER(0, strcmp("b", "a"));
      XTESTS_TEST_INTEGER_GREATER(0, strcmp("bc", "ab"));
      XTESTS_TEST_INTEGER_GREATER(0, strcmp("bcd", "abc"));
    }

Clearly, this is not a comprehensive test suite. But the permutations of arguments passed to strcmp() in the myriad programs built from it will dwarf that found in any unit-test suite. Consequently, we are all using strcmp() beyond its specification. Specifically, we are using strcmp() in a state of unknown correctness. How do we get away with it? We apply judgement.

A correct software entity has been proven so by mechanical means. A robust software entity has been judged as likely to behave according to expectations. This judgement is based on our knowledge of the software entity's interface, its likely complexity, its author(s), its published test suite, the skills and experience of the judge, and many other factors.

We can define a principle for robustness as:

The Robustness Principle: A robust software entity is composed of:

  • 0 or more correct software entities
  • 0 or more robust software entities
  • 0 defective software entities

We must now concern ourselves with how correctness propagates between software entity dependencies. Consider the function f(), which is implemented in terms of strcmp():

    int f(char const* s)
    {
      return strcmp(s, "fgh");
    }

What can we say about the correctness of f()? Well, until we test it, by definition it has unknown correctness. But howsoever we make use of it - whether in test or in a software application - we are using strcmp() outside the bounds of its specification, because "fgh" is not included in strcmp()'s test suite. By definition, therefore, we will be using strcmp() in a manner in which its correctness is unknown.

Consider that we now write a suite of tests for f(), as in Listing 6.

Listing 6

    static void test_equal()
    {
      XTESTS_TEST_INTEGER_EQUAL(0, f("fgh"));
    }

    static void test_less()
    {
      XTESTS_TEST_INTEGER_LESS(0, f("abc"));
      XTESTS_TEST_INTEGER_LESS(0, f("bcd"));
      XTESTS_TEST_INTEGER_LESS(0, f("cde"));
      XTESTS_TEST_INTEGER_LESS(0, f("def"));
    }

    static void test_greater()
    {
      XTESTS_TEST_INTEGER_GREATER(0, f("ghi"));
      XTESTS_TEST_INTEGER_GREATER(0, f("hij"));
      XTESTS_TEST_INTEGER_GREATER(0, f("ijk"));
      XTESTS_TEST_INTEGER_GREATER(0, f("jkl"));
    }

Since we have a specification for f(), and f() meets that specification, we can state that f() is correct. However, there is something a little strange about having a component that is correct when it is implemented using another component that has unknown correctness.

Taking this notion to extreme, we might wonder whether we can implement a correct software entity in terms of a defective one? Let's imagine that our implementation of strcmp() always returns a value of 0 when passed a string of less than three characters. With this behaviour it would fail four of the tests of its specification and thus be proven defective. But since the test suite for f() always uses strings of length three, it would still pass all cases. f() is proven correct, yet is implemented in terms of a defective component. That is more than a little strange, and violates the robustness principle given above.

The answer to this apparent conundrum lies in the notion of robustness. Confidence is placed in a software entity based on a number of factors, knowing that it will be used outside the exact, but necessarily limited, aspects of its specification. An implementation of f() that uses a correct strcmp() outside the bounds of its specification is, while common, something that should give pause for thought. An implementation of f() that uses a defective implementation of strcmp() violates the robustness principle and, in my opinion, should never be countenanced.

In both cases, the implementation of f() is brittle. And as each layer of abstraction and dependency is added, this brittleness spreads and compounds, and the combinatorial cracks through which extra-correctness behaviour can permeate increase. Thus, an important part of the skill/art of the software developer lies in making judgements about robustness when implementing software entities in terms of others that have unknown correctness (and that must therefore be judged on their robustness).

Robustness is inexact and subjective. It cannot be measured or proven, and it cannot be automated (beyond a few static analysis tricks). It is equally the concern of software developers, who must provide it, and stakeholders, whose experiences of the software system define it.


I am moved to almost completely agree with McConnell's definition of reliability, but I do feel that reliability is a measurable, quantifiable, emergent property of a software system's behaviour. In some senses, it could be thought of as robustness over time, but robustness can't be measured, so maybe it's better thought of as apparent robust action over time.

Reliability is more a concern to stakeholders than it is to developers, reflecting the differing perspectives between these groups. To stakeholders, it is almost entirely irrelevant how many constituent software entities were correct versus those adjudged robust. To stakeholders, the proof of the pudding is in the eating, and that's its reliability.

Conversely, to software developers, the more correctness that can be adduced the better, because it simplifies the construction of dependent software entities. Reliability, on the other hand, is a distant prospect to a developer, and probably viewed in different ways. For example, I can say that I am motivated, by pride, to have 0 failures ever; 1 or 10 failures would be equally galling. Conversely, frequency of failure is of proper relevance to a user, who may well tolerate one failure per month if the software costs him/her significantly less than a version that fails once per year (or never). Many do, and many others simply expect software failure - otherwise how do we explain the popularity of certain operating systems, editors, websites, ...

Naturally, I'm not suggesting that tuning software failure frequencies is a good thing; I believe that we can all write much more robust software without suffering in the process. That's the raison d'être of this column, and as we proceed I intend to pursue the notion that we should all be aiming for maximum quality all the time.


At this point I'd intended to go on to examine some of the interesting conflicts between correctness and robustness, and between them and other software characteristics, as well as discussing practical techniques for ensuring robustness when correctness is not achievable. I've even got an argument in favour of Java's hateful checked exceptions. But I've run out of space (and time), and these will have to wait until another instalment.

For the moment, I'll posit a parting rubric that correctness is the worthy aim wherever possible (which is rare), and robustness is the practical must-have in all other circumstances.

I'm not sure what's coming in the next instalment, but I'm determined that it's going to have a lot more code, and a lot less philosophy than this one. It's too exhausting!

See you next time.

References and asides

[B64] http://synesis.com.au/software/b64/

[CC] Code Complete, 2nd Edition, Steve McConnell, Microsoft Press, 2004

[GRIBBIN] In Search of Schrödinger's Cat, John Gribbin, Corgi, 1984

[IC++] Imperfect C++, Matthew Wilson, Addison-Wesley, 2004

[NUNIT] http://nunit.org/

[OOSC] Object Oriented Software Construction, 2nd Edition, Bertrand Meyer, Prentice-Hall, 1997

[QM-1] Quality Matters, Part 1: Introductions, and Nomenclature, Matthew Wilson, Overload 92, August 2009

[RECLS] http://www.recls.org/

[SOED] The New Shorter Oxford English Dictionary, Thumb Index Edition, ed. Lesley Brown, Clarenden Press, Oxford, 1993.

[XTESTS] http://xtests.org/