Is software with zero errors possible?

Could it ever be possible to create software with zero errors? Software so perfect that it is impossible for it to ever do something unexpected or undefined, or to contain errata of any kind? To answer this question, let’s take one of the simplest programs physically possible: hello world. For the moment, I will be writing in very generic pseudocode, for reasons which will become clear in due course. Here is an example of “hello, world” written in my flavour of pseudocode:

OUTPUT "hello, world"

The computer isn’t doing any real computation in this example, however, so let’s improve it slightly to suit our needs. We will use a program which accepts two inputs from arbitrary sources and outputs their sum through an arbitrary channel. The pseudocode reads as follows:

INPUT a
INPUT b

c <- a + b
OUTPUT c

Now, let’s imagine the scenario: you are running on a computer which is solely dedicated to executing your code. For the sake of simplicity, we will assume that the computer has a constant and uninterrupted power supply. Now, can we guarantee that this code will be error-free? Where in this code could an error occur?

Well, first, let us examine the language the code is written in. On creating the variable “c”, we have implicitly allocated some kind of memory. If we assume that this variable is allocated on the stack, it is entirely possible for a stack overflow or out-of-memory scenario to ruin our plans. Typically, stack memory is allocated by subtracting the frame size from the stack pointer (the stack usually sits in high memory and grows downwards), so future allocations land below the block just reserved. This is fine, but it makes it very difficult to catch the moment our stack outgrows its bounds. For instance, what would happen if we subtracted so much from the stack pointer that we ran into our program code? On x86, you will crash with a protection violation (an attempted write to read/execute-only memory). So, using the stack is out of the question, right? Well, in theory, we could insert a check before the allocation of each stack frame to detect an overflow (compare our memory map against the stack pointer before committing to the new frame - a rough sketch follows this paragraph), but what we do when that check fails is quite a problem. It’s not like we can just erase stack frames, after all. It seems more likely that the heap will do. Now, the question of the language becomes particularly relevant. In garbage-collected languages (such as Go, with its escape analysis, and Python), it is impossible to use the heap without an implicit call to code we do not control. That spells serious trouble for reliability if our goal is to literally never crash. Go’s memory allocator, for instance, will instantly abort program execution on a failed memory allocation. Python raises a MemoryError which, if unhandled, kills the program just the same. So, garbage-collected languages are out of the question - we will have to manage the heap ourselves. That means going back to C.
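
For illustration only, here is a minimal sketch of the kind of per-frame stack check described above. It is not from the original post: the frame size must be known at each call site, where the limit address comes from is entirely platform-specific, and taking the address of a local is only an approximation of the stack pointer.

#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if there appear to be at least frame_size bytes of
 * stack left above stack_limit (the lowest address the stack may reach).
 * Both inputs are the caller's problem; this only does the comparison. */
int stack_has_room(uintptr_t stack_limit, size_t frame_size)
{
    char marker;                           /* lives near the top of the current frame */
    uintptr_t sp = (uintptr_t)&marker;     /* rough stand-in for the stack pointer */

    return sp > stack_limit && (sp - stack_limit) >= frame_size;
}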

The question that may have popped into your mind is “Why not C++, Ethan?”. Leaving aside C++’s obvious flaws in other areas, it is completely unfit for both low-level and high-reliability software. Unhandled exceptions can come from anywhere, and the C++ way of doing things tends to revolve around abstractions at the language level, which is never good if we need to cover all our bases. Think about it: the C++ runtime, according to the language standard, has the right to terminate your program whenever it sees fit. It is much safer to just write in C.

Ok, so our code currently looks something like this:

int callback(int in0, int in1)
{
    int *res = malloc(sizeof(int));
    *res = in0 + in1;
}

Whoops! We just invoked undefined behaviour! In this instance, we have not checked the return value of malloc. Malloc reserves the right to return a NULL pointer. So, in dereferencing it without checking, we have just potentially caused a segmentation violation and a crash. Let’s fix it:

int callback(int in0, int in1)
{
    int *res = malloc(sizeof(int));
    if (!res) {
        // But what now?
    }

    *res = in0 + in1;
}

What now indeed? Well, we have reached a problem similar to the one with the stack, but with a twist: because the heap is a nonlinear data structure, blocks can be allocated and freed in no particular order. So, we could run some kind of cleanup function here which releases any memory we don’t absolutely need, then try again. Let’s program that in.

int callback(int in0, int in1)
{
    /* assumes presence of stdlib.h */
    int *res;

    do {
        res = malloc(sizeof(int));
    } while (!res);

    *res = in0 + in1;
}
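
Wrapped up as a reusable function, and with the cleanup idea above actually wired in, the retry loop might look something like the following sketch. The release_caches() hook is hypothetical and stands in for “free anything we do not absolutely need”:

#include <stdlib.h>

/* Hypothetical cleanup hook: drop caches, scratch buffers, anything
 * we can rebuild later. */
static void release_caches(void)
{
    /* application-specific */
}

/* "safe malloc": retry until the allocation succeeds. Note that this
 * can spin forever if no memory is ever freed. */
void *smalloc(size_t size)
{
    void *p;

    while ((p = malloc(size)) == NULL)
        release_caches();

    return p;
}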

If I actually needed to do this, a wrapper function like the one sketched above (maybe called smalloc, for “safe malloc”) is what I would write; I haven’t called a cleanup function in the callback itself for simplicity, but it would be easy to add. The problem remains, however, that there is no guarantee we will ever actually get any freed heap memory back, so we may be stuck blocking in this loop forever. Not good. Another minor problem is that the pointer we just received may well be stored on the stack again. We have just made the problem a whole lot more complicated and slow. And it could block forever. And it could still abort our program due to heap corruption. On the whole, it may be safer to simply use the CPU’s registers. There is no direct way to do this in C, but there are compiler flags which force it to pay attention to “register” declarations, throwing an error if you ask it to do something impossible. Let’s program it in:

int callback(int in0, int in1)
{
    register int res;
    res = in0 + in1;

    return res;
}

Okay, so we now have a completely safe callback, right? No, of course not. Integers in computing are not as they are in mathematics. They are limited by the number of bits used to store them. In this case, a 32-bit integer gives two to the power of 32 (4294967296) possible values. However, we have not specified that the integers are unsigned, so one bit is spent on the sign and only 31 bits remain for the magnitude, meaning the maximum value is, in fact, two to the power of 31 minus one (2147483647) - which is significantly smaller. How can we guard against this without overflowing our integer? We could rely on the calling code, or we could implement a simple check here. Let’s think about our design requirements:

  • We must never return a value greater than INT_MAX (2^31 - 1)
  • We must never compute an intermediate value larger than that, lest we cause an overflow along the way
  • Inputs may be anywhere from INT_MIN (-2^31) up to INT_MAX

It is clear that we can’t just add the numbers and then see if the result is over the maximum - by then, the overflow has already happened. What can we do? Well, we know the maximum, and we know both inputs. So, logically, if we subtract one of the operands from the maximum, the result must be larger than or equal to our other operand - otherwise, we have detected an overflow. For example, with INT_MAX = 2147483647, in0 = 2000000000 and in1 = 200000000: INT_MAX - in0 = 147483647, which is smaller than in1, so the addition would overflow. However, there is a big problem here: what if the operand we subtract is itself larger than the maximum? Well, that is not possible, as it would simply not fit in the integer which was passed. Let’s implement it. While we’re at it, let’s fix the minor problem that we are not returning any value.

int callback(int in0, int in1)
{
    register int res;

    /* assumes presence of limits.h */
    if (INT_MAX - in0 < in1) {
        /* overflow detected */
        return INT_MAX;
    }

    res = in0 + in1;
    return res;
}

Two small problems still remain. First, what happens if in0 is negative? Big problem, as subtracting a negative simply adds, so INT_MAX - in0 can itself overflow and render the whole check meaningless. Second, the sum can just as easily underflow below INT_MIN, which we are not checking for at all. Let’s handle both cases and amend our algorithm:

int callback(int in0, int in1)
{
    register int res;

    /* assumes presence of limits.h */
    if ((in0 >= 0) && (INT_MAX - in0 < in1)) {
        /* overflow detected */
        return INT_MAX;
    } else if ((in0 < 0) && (in1 < INT_MIN - in0)) {
        /* underflow detected */
        return INT_MIN;
    }

    res = in0 + in1;
    return res;
}

We have thus prevented an overflow in both directions, and we have introduced exactly zero new errors ourselves (unless the compiler sucks and feeds us false information, but that is unlikely).
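
As a quick sanity check, here is a small, hypothetical test harness - not part of the original listing - which assumes the final callback above is compiled in the same file:

#include <limits.h>
#include <stdio.h>

int callback(int in0, int in1);   /* the clamping version defined above */

int main(void)
{
    printf("%d\n", callback(1, 2));          /* 3 */
    printf("%d\n", callback(INT_MAX, 1));    /* clamps to INT_MAX: 2147483647 */
    printf("%d\n", callback(INT_MIN, -1));   /* clamps to INT_MIN: -2147483648 */
    return 0;
}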

So, have we finally managed to add two numbers together without the possibility of error? Kind of. You see, in all of our efforts to guard against overflows and underflows, we have simply replaced one kind of errata with another, more defined version. According to this function, 2147483647 + 1 = 2147483647. That is simply not true, and breaks the entirety of mathematics. But I find it better than the technically undefined behaviour of an overflow: it is better to clamp the value than to obtain a completely nonsensical one. However, a greater problem is the concept of hardware errors. What happens if we have a memory parity error, a bit flips in a register due to cosmic radiation, or some other kind of completely random behaviour occurs? There is nothing we can do about this, right? Well, let’s take the case of cosmic radiation for a moment: we can guard against it by shielding the computer in lead. Memory parity errors are rare, and persistent memory faults can at least be caught by testing the memory at program startup. If we really cared about this, we could test the memory on boot for errors and bad wiring, aborting if we find issues (or building a map of regions we must not use, but that isn’t an issue in this case) - a rough sketch of such a test follows.
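
Purely as an illustration - this is not from the original post - a start-up memory test over a buffer we own might look something like this. It only catches stuck or miswired bits in RAM we can address directly; transient parity errors and ECC are another matter entirely:

#include <stddef.h>
#include <stdint.h>

/* Write and read back a handful of patterns over every byte of mem.
 * Returns 1 if every cell held every pattern, 0 on the first bad cell. */
int memory_test(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    size_t p, i;

    for (p = 0; p < sizeof patterns; p++) {
        for (i = 0; i < len; i++)
            mem[i] = patterns[p];

        for (i = 0; i < len; i++)
            if (mem[i] != patterns[p])
                return 0;   /* bad cell: abort, or mark the region unusable */
    }

    return 1;
}

At boot, you would run something like this over whatever RAM the program intends to use before trusting any of it.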

However, we have established so far that even a computer shielded in lead, fed by its own dedicated power plant and programmed at a low level using only registers still can’t do everything right - and there is a lesson here. Computers are flawed machines. They do not live in the idealised space we hold in our minds, where numbers are unlimited in size, storage can be conjured out of nowhere and things only happen when we explicitly ask for them. There are an infinite number of ways in which things can go wrong. In the days of the original UNIX, Ken Thompson famously ridiculed the developers of SunOS for including so much error handling code, instead remarking:

Yeah, we threw all that out. Instead we have this one function called “panic” that just crashes the machine and gives out a stack trace. Saves a lot of time.

Ken Thompson

Today, most major kernels still use this philosophy of error handling for certain errors. The Linux and BSD kernels followed the UNIX tradition of “panicking” when something bad happens. The Windows developers, ever corporate, decided it should instead be called a “bug check” (the kernel routine is KeBugCheckEx). These routines are all functionally identical: they crash your machine instantly and irreversibly until reboot. You have likely seen one of them in the form of the Blue Screen of Death (BSOD).

The moral here is that it is futile to attempt to eliminate 100% of the errors that could occur in your code. If this is your goal, programming is not going to be kind to you. Instead, aim to eliminate errors which occur overly frequently and which have well-defined and practical solutions. Only using registers may be possible on x86 (where you have an enormous number of them available), but it isn’t real programming. And, in theory, you could detect cosmic radiation ruining your calculations by doing the same calculation on three different CPU cores and repeating until all three agree, but it isn’t really practical (a toy sketch of the idea follows this paragraph). Has anybody ever really written code this way? Well, NASA has some interesting insights about how you ought to use C99 to make it as reliable as possible when running on a spacecraft in the middle of deep space. The original Apollo guidance computer was written in pure assembly and was designed to never do anything unexpected. For context, this was a computer that had to read the surface of a celestial body with radar (and detect when it failed to do so), navigate between a planet and its moon using complex physics, and monitor and issue commands to the spacecraft and pilot - and do all of this with 1960s integer maths. Oh yes, and it had to operate a control system that the astronauts could interface with through a keypad. No standard library. No heap. No calling conventions, code testing or tooling - nothing except mathematical models of how to verify code and a pile of punched cards and paper. If they could make it work to the standard that was needed, we can too. In the modern day, if your code has any glaring errors in it which are not completely out of your control, it must be fixed. Probably the one place where I disagree with the UNIX philosophy is that incorrect results must never be allowed - within reason. I think it is fine to return incorrect results if what the user is asking for is outlandish or would just result in tedious and slow error handling code. Apart from such extreme circumstances, however, incorrect results are simply unacceptable and must be rooted out and fixed.
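
For what it’s worth, a toy sketch of that “compute it three times and insist the answers agree” idea might look like the following. On a single core this proves nothing - the redundancy only means something when each computation runs on separate hardware - so treat it purely as an illustration:

int callback(int in0, int in1);   /* the final version defined earlier */

/* Run the calculation three times and only accept a unanimous answer.
 * A disagreement is treated as a transient fault and the whole thing
 * is retried; this can loop forever on a persistent fault. */
int redundant_callback(int in0, int in1)
{
    for (;;) {
        int a = callback(in0, in1);
        int b = callback(in0, in1);
        int c = callback(in0, in1);

        if (a == b && b == c)
            return a;
    }
}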

So, to answer the question in the title, no, it is probably never possible - but we can get pretty close, and that should be our aim.


This article was sparked by the computer lab Windows workstation I was on crashing three times in a row while I was working the other day. I have used Linux computers for many years now. They have crashed probably around three times in that entire time - and I never lost data.

Ethan Marshall

A programmer who, to preserve his sanity, took refuge in electrical engineering. What an idiot.


First Published 2021-11-27

Categories: [ Old Blog ]