When the standard fails us

I guess I have finally lost my patience with C semantics. The C programming language is kind of infamous for its lack of memory safety and its disastrous type system, but I still believed that the language represented intuition well, and that with an appropriate amount of attention, all mistakes could be avoided. Well, it turns out that is not the case when it comes to undefined behavior.

First, I think part of the problem lies in the education people normally receive: we usually pick C as the first programming language to learn, before we learn anything about security and type safety. At least in my case, I first treated C the same as an interpreted language, just with some additional features and blazing speed. After I began to touch low-level things like memory addresses, data placement, stack layout, etc., I started to view C as an assembly language without registers to worry about. Perhaps this is where the gap comes from. The C Standard actually intended C to be a high-level language usable on any kind of microarchitecture: not only those whose registers are 32/64 bits wide, not only those that represent negative numbers in two’s complement form, not only those whose stacks always grow downward and are contiguous in the address space… Perhaps you can see where all this is going. Lots of things we regarded as natural, or were even taught as such, are simply not guaranteed by the C Standard.

One such example, the most shocking one for me, is that pointer arithmetic is only defined within a single array (or one element past its end). That is to say, the C Standard imposes no restrictions whatsoever on how separate stack variables are placed in memory. Well, this isn’t too outrageous, right? Why would a reasonable programmer worry about how data are placed on the stack anyway… A side effect of this specification is that pointers can never overflow: before a pointer wraps around, it must already have gone beyond the boundary of the array it was supposed to point into, and past that point everything is undefined. I recall that during one of my undergraduate courses I was actually advised to always perform a check before copying something into a buffer, to make sure we don’t overflow it. One such check takes the form of

char *buf_start;
char *buf_end;   // one past the last valid byte of the buffer
size_t len;

if (buf_start + len > buf_end) {
    // would overflow the buffer, don't copy
} else {
    // safe, do the copying
}

It’s quite clear what’s wrong here. If len is large enough that buf_start + len would point past one past the end of the buffer, then computing buf_start + len is itself undefined behavior. The compiler can therefore assume this never happens in a conforming program, conclude that the condition (buf_start + len > buf_end) is always false, and delete the overflow branch entirely, all without breaking compliance with the standard. It is anything but obvious, and things have certainly gone the opposite way from what the programmer wanted.
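
The check can be rewritten so that it never forms an out-of-bounds pointer in the first place: compare lengths instead of pointers. Below is a minimal sketch of that idea; the function name safe_copy is mine, and it assumes buf_end points exactly one element past the end of the same buffer as buf_start.

#include <stddef.h>
#include <string.h>

// Sketch: reject oversized copies using size_t arithmetic, which is
// fully defined, instead of forming a possibly out-of-bounds pointer.
int safe_copy(char *buf_start, char *buf_end, const char *src, size_t len)
{
    size_t capacity = (size_t)(buf_end - buf_start); // defined: same array

    if (len > capacity) {
        return -1;               // would overflow the buffer, don't copy
    }
    memcpy(buf_start, src, len); // stays in bounds, no undefined behavior
    return 0;
}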

I have actually dug through people’s discussions of such undefined behaviors in the C programming language, and more shocking examples showed up. To list a few:

1 << 32
// Undefined (with a 32-bit int, the shift count equals the bit width)

INT_MAX + 1
// Undefined (signed overflow)

-INT_MIN
// Undefined (in two's complement, INT_MIN has no positive counterpart)

UINT_MAX + 1U
// Defined, result is 0U (unsigned arithmetic wraps modulo 2^N)
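
And compilers really do exploit these rules for optimization. As a small illustration (my own sketch, not from any real codebase), consider the following function; mainstream compilers such as gcc and clang at -O2 typically compile it down to “return 1”, because they are allowed to assume the signed addition never wraps.

// Signed overflow is undefined, so the compiler may assume x + 1 > x
// always holds and fold this function to a constant, even though
// x == INT_MAX would wrap around at runtime on most hardware.
int always_greater(int x)
{
    return x + 1 > x;
}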

In summary, signed integer overflow or underflow is undefined, but unsigned integers wrap around with well-defined modulo arithmetic. Okay, fine, so this means we can do the following check on unsigned integers without risking undefined behavior!

unsigned short x;
unsigned short y;

if (x + y < x) {
    // overflows!
} else {
    // we are good!
}

Unfortunately, this doesn’t work either… Well, if we are working with unsigned int or wider integers it actually works, but not for integers narrower than int. The C Standard mandates that operands narrower than int are promoted to int before arithmetic operations. Here x and y are promoted to (signed) int, so x + y is computed as an int that comfortably holds the true sum; the comparison x + y < x is therefore always false, and the wrap-around we wanted to detect only happens later, when the sum is converted back to unsigned short. OK, at this point, I think I’m done with C. Let’s move on to Rust, seriously.
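
For the record, a version of the check that survives integer promotion can be written by casting the sum back to the narrow type before comparing. A minimal sketch, with names of my own choosing and assuming the usual 16-bit unsigned short:

#include <stdio.h>

// x + y is still computed as (signed) int after promotion, but casting
// the sum back to unsigned short reproduces the wrapped 16-bit value,
// so the comparison detects the wrap-around again.
int add_wraps_ushort(unsigned short x, unsigned short y)
{
    return (unsigned short)(x + y) < x;
}

int main(void)
{
    printf("%d\n", add_wraps_ushort(65535, 1)); // 1: the sum wraps to 0
    printf("%d\n", add_wraps_ushort(1, 2));     // 0: no wrap, we are good
    return 0;
}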