Wednesday, March 16, 2011

Overflow Overflow Overflow

For the second time in the past few weeks it turns out I'm my own worst enemy. I work for days to track down a bug in my latest embedded application only to find out that (surprise) the jerk who put the bug there is none other than me.

Some background: the ARM Cortex M3 has this nice interrupt vector called FAULT. When something bad happens (divide by zero, memory protection error, etc) the program packs up the current processor state and vectors directly to the fault interrupt handler. This is a very good thing - you can put a fault handler in that interrupt, resolve the problem and continue on your merry way. Perhaps. The thing about faults is that they tend to be somewhat hard to recover from. For instance, let's look at my two most recent faults I've had to deal with.

The first fault was a stack overflow. I know that some of you have probably heard of a stack, and some of you have heard of a stack overflow but if you're anything like me you had no idea what this actually meant and simply assumed it would never be a problem for you. Well, it most likely will be for you at some point. It's like the fraternity hazing of the embedded world: you can't consider yourself an embedded programmer until you've had your first stack overflow. Let's have some background: A stack is a Last-in-First-Out memory structure that is essentially scratch space for your program. Whenever you call a function the return address is pushed on the stack as well as function arguments. Inside the function the arguments are popped off and used, and when the function is done the return address is popped off and jumped to. You can use the stack directly but it's not typical to do so if you're programming in a high-level language like C - just let the compiler do the dirty work. Now the stack is stored in memory just like everything else (you'll see in a second why this is at the root of the problem) and it has a specific length. So it may start at memory address 0x20001E0 and if it's 16 bytes long then something else starts at location 0x20001F0. So you've only got 16 bytes to play with for all of your function arguments and return addresses and such. In a VERY simple system maybe this would be enough but it's very doubtful. For instance, if you call a function within a function you have to push a lot more stuff onto the stack. If you call a function within that it gets worse.

Now here's the tricky part - even though the stack is special and unique and essential to calling functions and making your program work correctly your processor doesn't care. It can't really tell what memory is stack and what isn't. So if your stack is 16 bytes long and you push 17 bytes on to it your processor will happily obliterate whatever data was residing just after your stack and replace it with what you just pushed onto the stack. This is stack overflow.

What happened to me the first time is that I didn't allocate enough stack space and it wiped out some important information residing just beyond the stack. Now if what is just beyond the stack is data, then overwriting it is bad, but not fatal for the program. After all, data is data. If I'm adding two numbers and one of them is overwritten by the stack then I can still add it, but the result is unimportant and nonsensical. What's worse is when pointers get overwritten. If you don't know pointers... then you probably have lots of company nowadays. But it's simple: pointers are memory locations. You interpret the value in a pointer as pointing to a memory location. So to load data from a pointer is a two step process: read the memory address stored in the pointer, then look at that memory address and grab the data there. However, if your stack has overwritten this pointer then it might point to the wrong memory address or (more likely) not point to memory at all. You see, RAM on the Cortex M3 (the one I'm working with) starts at 0x20000000. So if a pointer tells you to look at 0x1FFFFFFF then it's not just telling you to look at the wrong memory address - it's telling you to look at something that's not even memory. If it tries to do that it triggers a hard fault and you get to figure out why. If it's your first stack overflow then this process lasts days. Enjoy!

So that's one type of overflow that can hurt you. The second is buffer overflow. There's no magic here either - a buffer is basically an array. Arrays are bounded - they have a definite size. But even if you declare your array to have a size of 512 your compiler won't stop you from requesting the 513th element - it picks out the memory just beyond the end of your array and reads it. And you can write to it just as easily. And of course, there's usually something important there that you really shouldn't overwrite.

Arrays can be difficult to work with, so people create circular buffers. Normal arrays would be linear buffers: start at zero and go to the end. Circular buffers wrap around so that if you try to go outside the array you loop back around to the beginning instead of trailing off the end. That is, they loop back around to the beginning if you code them correctly. Tell me what you think the result of this code is:

head_pointer = (head_pointer + increment_value % buffer_size)

If you said the modulus (this thing: % ) operation would be evaluated first - you're right! And you're smarter than me! That code doesn't do what I wanted it to do: increment the value then wrap it around if it was greater than the buffer size. Instead it wrapped increment_value around if it was bigger than buffer_size and then added it to head_pointer. The net effect of this is that the pointer never loops around to the beginning of the buffer - it just keeps growing and growing and growing and gobbling up memory as you write to it. If you let it go on long enough it will overwrite a pointer with gibberish and trigger a hard fault.

Yes, I did this. At least this sort of stupidity on my part gains me valuable experience. Lesson learned: I am very foolish sometimes.