Why check for runtime errors early

Errors due to out-of-bounds array references, stray pointer references, and other forms of references to undefined regions of memory can be very difficult to track down.

This web page presents a program that exhibits undefined behavior, but the example behaviors listed here are far from the only kinds of undefined behaviors that are possible.

When you write a program in a given language, you are entering into a sort of "contract" with the language implementation. If you violate the terms of that contract by referencing an undefined region of memory, the implementation is released from its side of the contract as well. In other words, in most languages that allow potentially unsafe memory references (C and Fortran, for example), the implementation becomes free to do any oddball, seemingly random thing, without doing anything "illegal" from the standpoint of the contract you've violated. To make matters worse, the oddball behavior may not show up until far later in the program, at a seemingly unrelated spot in the code.

The example program presented here, which demonstrates some of the things that might happen, is written in Fortran, but essentially the same thing applies in C.

The example program follows. We declare three arrays (array1, array2, and array3), and then write too much data to the middle one, array2. Note that the compiler gives no guarantee that array2 will actually sit in memory between array1 and array3, though that is what happens in the examples below:

   program main

   character array1(5)
   character array2(5)
   character array3(5)
   integer i
   integer j

   do i=1,25
      do j=1,5
         array1(j) = 'a'
         array2(j) = 'a'
         array3(j) = 'a'
      end do
      call sub(array2,i)
      print *,i,' ',array1,' ',array2,' ',array3
   end do

   end

   subroutine sub(chars,maxind)
   character chars(*)
   integer maxind

   integer i

   do i=1,maxind
      chars(i) = 'b'
   end do

   return

   end

Here are some sample outputs from the program above. Please note how much the output can vary, depending on the CPU and/or compiler used. There are basically four failure modes here, but many kinds of failures are possible. There's a summary below the raw data.

With g77 3.4.3 on 32 bit Fedora Core 3 on a 64 bit AMD CPU, generating a 32 bit executable:

 1 aaaaa baaaa aaaaa
 2 aaaaa bbaaa aaaaa
 3 aaaaa bbbaa aaaaa
 4 aaaaa bbbba aaaaa
 5 aaaaa bbbbb aaaaa
 6 aaaaa bbbbb aaaaa
 7 aaaaa bbbbb aaaaa
 8 aaaaa bbbbb aaaaa
 9 aaaaa bbbbb aaaaa
 10 aaaaa bbbbb aaaaa
 11 aaaaa bbbbb aaaaa
 12 aaaaa bbbbb aaaaa
 13 aaaaa bbbbb aaaaa
 14 aaaaa bbbbb aaaaa
 15 aaaaa bbbbb aaaaa
 16 aaaaa bbbbb aaaaa
 17 baaaa bbbbb aaaaa
 18 bbaaa bbbbb aaaaa
 19 bbbaa bbbbb aaaaa
 20 bbbba bbbbb aaaaa
 21 bbbbb bbbbb aaaaa
 22 bbbbb bbbbb aaaaa
 23 bbbbb bbbbb aaaaa
 24 bbbbb bbbbb aaaaa
 25 bbbbb bbbbb aaaaa

With gcc version 4.0.0 20050129 (experimental) (g95!) Feb 17 2005 on Fedora Core 3 on a 32 bit Intel Xeon, generating a 32 bit executable:

 1  aaaaa baaaa aaaaa
 2  aaaaa bbaaa aaaaa
 3  aaaaa bbbaa aaaaa
 4  aaaaa bbbba aaaaa
 5  aaaaa bbbbb aaaaa
 6  ~B bbbbb aaaaa
 7   bbbbb aaaaa
make: *** [go] Segmentation fault

With g77 -maix64 3.3.6 on AIX 5.1 (a Fortran 77 compiler on a PowerPC CPU, generating a 64 bit executable):

 1 aaaaa baaaa aaaaa
 2 aaaaa bbaaa aaaaa
 3 aaaaa bbbaa aaaaa
 4 aaaaa bbbba aaaaa
 5 aaaaa bbbbb aaaaa
 6 aaaaa bbbbb aaaaa
 7 aaaaa bbbbb aaaaa
 8 aaaaa bbbbb aaaaa
 9 aaaaa bbbbb aaaaa
 10 aaaaa bbbbb aaaaa
 11 aaaaa bbbbb aaaaa
 12 aaaaa bbbbb aaaaa
 13 aaaaa bbbbb aaaaa
 14 aaaaa bbbbb aaaaa
 15 aaaaa bbbbb aaaaa
 16 aaaaa bbbbb aaaaa
 17 aaaaa bbbbb baaaa
 18 aaaaa bbbbb bbaaa
 19 aaaaa bbbbb bbbaa
 20 aaaaa bbbbb bbbba
 21 aaaaa bbbbb bbbbb
 22 aaaaa bbbbb bbbbb
 23 aaaaa bbbbb bbbbb
 24 aaaaa bbbbb bbbbb
 25 aaaaa bbbbb bbbbb

With xlf95 8.1.1.5 on AIX 5.1 (a Fortran 95 compiler generating a 64 bit executable):

 1 aaaaa baaaa aaaaa
 2 aaaaa bbaaa aaaaa
 3 aaaaa bbbaa aaaaa
 4 aaaaa bbbba aaaaa
 5 aaaaa bbbbb aaaaa
 6 aaaaa bbbbb aaaaa
 7 aaaaa bbbbb aaaaa
 8 aaaaa bbbbb aaaaa
 9 aaaaa bbbbb aaaaa
 10 aaaaa bbbbb aaaaa
 11 aaaaa bbbbb aaaaa
 12 aaaaa bbbbb aaaaa
 13 aaaaa bbbbb aaaaa
 14 aaaaa bbbbb aaaaa
 15 aaaaa bbbbb aaaaa
 16 aaaaa bbbbb aaaaa
 17 aaaaa bbbbb baaaa
 18 aaaaa bbbbb bbaaa
 19 aaaaa bbbbb bbbaa
 20 aaaaa bbbbb bbbba
 21 aaaaa bbbbb bbbbb
 22 aaaaa bbbbb bbbbb
 23 aaaaa bbbbb bbbbb
 24 aaaaa bbbbb bbbbb
 25 aaaaa bbbbb bbbbb

With xlf 8.1.1.5 on AIX 5.1 (a Fortran 77 compiler generating a 64 bit executable):

 1  aaaaa baaaa aaaaa
 2  aaaaa bbaaa aaaaa
 3  aaaaa bbbaa aaaaa
 4  aaaaa bbbba aaaaa
 5  aaaaa bbbbb aaaaa
 6  aaaaa bbbbb aaaaa
 7  aaaaa bbbbb aaaaa
 8  aaaaa bbbbb aaaaa
 9  aaaaa bbbbb baaaa
 10  aaaaa bbbbb bbaaa
 11  aaaaa bbbbb bbbaa
 12  aaaaa bbbbb bbbba
 13  aaaaa bbbbb bbbbb
 14  aaaaa bbbbb bbbbb
 15  aaaaa bbbbb bbbbb
 16  aaaaa bbbbb bbbbb
 17  aaaaa bbbbb bbbbb
make: *** [go] Segmentation fault (core dumped)

To summarize, the runs above show four failure modes, and there is a fifth important failure mode that isn't shown but should also be mentioned. Recall that we are writing to array2.

- Sometimes array1 gets overwritten with unintended data.
- Sometimes array3 gets overwritten with unintended data.
- Sometimes the program segfaults.
- Sometimes an array gets filled with completely bizarre values, even when all the arrays have the same type, as in this example (this is also very likely to happen when the types differ).
- The fifth, which unfortunately isn't shown above, is that a subroutine's return address may get overwritten. In that case, everything may seem fine until a subroutine or function attempts to return to its caller. The address pulled off the stack has been overwritten with garbage, so the subroutine returns to the wrong address. This usually causes a segfault, but in a pathological scenario even stranger things could result, like suddenly executing code that shouldn't be getting called at all.

One last comment on the program. If we change the declaration of the chars array in sub from size "*" (meaning any length is legal) to size 5, and compile again with -C (which turns on array bounds checking in xlf), the program can give an immediate error on the illegal array reference to an undefined region of memory. This makes it relatively easy to fire up a debugger like dbx or gdb, feeding it your program and the resulting "core" file, to see where the error is. In this case, it should be far simpler to determine the ultimate source of the error by looking at the surrounding code, because this time we're catching the error immediately: the debugger shows you where the error occurred, not just where things finally got so haywire that the runtime could not continue, in a potentially completely unrelated region of code.

esmf04m-strombrg> xlf -C -g why-early.f -o why-early
** main   === End of Compilation 1 ===
** sub   === End of Compilation 2 ===
1501-510  Compilation successful for file why-early.f.
Fri Jun 10 17:41:55

esmf04m-strombrg> ./why-early   
 1  aaaaa baaaa aaaaa
 2  aaaaa bbaaa aaaaa
 3  aaaaa bbbaa aaaaa
 4  aaaaa bbbba aaaaa
 5  aaaaa bbbbb aaaaa
Trace/BPT trap (core dumped)
Fri Jun 10 17:42:00

esmf04m-strombrg> dbx why-early core
Type 'help' for help.
reading symbolic information ...
[using memory image in core]

Trace/BPT trap in sub at line 29 in file "why-early.f"
   29                   chars(i) = 'b'
(dbx) where
sub(chars = (...), maxind = warning: Unable to access address 0x110000a28 from core
-1, 0x1), line 29 in "why-early.f"
main(), line 16 in "why-early.f"
(dbx) list 1,50
    1           program main
    2   
    3           character array1(5)
    4           character array2(5)
    5           character array3(5)
    6           integer i
    7           integer j
    8   
    9   
   10           do i=1,25
   11                   do j=1,5
   12                           array1(j) = 'a'
   13                           array2(j) = 'a'
   14                           array3(j) = 'a'
   15                   end do
   16                   call sub(array2,i)
   17                   print *,i,' ',array1,' ',array2,' ',array3
   18           end do
   19   
   20           end
   21   
   22           subroutine sub(chars,maxind)
   23           character chars(5)
   24           integer maxind
   25   
   26           integer i
   27   
   28           do i=1,maxind
   29                   chars(i) = 'b'
   30           end do
   31   
   32           return
   33   
   34           end
   35   
(dbx)
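
The fixed size of 5 above only works because every caller passes a 5-element array. A common variation, sketched here under the assumption of a Fortran 90 or later compiler and free-form source, passes the array's size along as an extra argument (n is a name introduced for this sketch), so the dummy argument has known bounds for any caller. With bounds checking enabled (xlf's -C as above, or for example gfortran's -fcheck=bounds), an out-of-range index in sub can then be trapped at the offending line. The explicit test on maxind reports the problem even when checking is turned off.

```fortran
! Sketch only: the callee receives the array's size, so the dummy
! argument has known bounds and bad requests can be rejected up front.
program main
  implicit none
  character :: array2(5)
  integer :: i

  do i = 1, 7
    array2 = 'a'
    ! n (the second argument) is an assumption of this sketch,
    ! not part of the original program.
    call sub(array2, size(array2), i)
    print *, i, ' ', array2
  end do
end program main

subroutine sub(chars, n, maxind)
  implicit none
  integer, intent(in) :: n, maxind
  character, intent(inout) :: chars(n)   ! explicit bounds instead of (*)
  integer :: i

  ! Refuse to write past the end of the array the caller gave us.
  if (maxind > n) then
    print *, 'sub: maxind', maxind, 'exceeds array size', n
    return
  end if

  do i = 1, maxind
    chars(i) = 'b'
  end do
end subroutine sub
```

For i = 6 and 7 the subroutine prints its complaint and leaves array2 untouched, instead of silently writing into whatever happens to follow array2 in memory.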

To expand further on this topic, here's an e-mail (slightly modified) I sent to a professor who had an oddball Fortran problem. It may serve both as an example of how a system call tracer can help here and as a simple taxonomy of the chain of causality behind such an error.

For general information about the method I used to dig this up, please see debugging with system call tracers. I guess what I did is really all in there, but there are some ways of combining things that aren't necessarily immediately obvious.

Anyway, I fired up 20 truss processes, one against each of your pop processes on esmf08m (pop here is not the e-mail protocol, but part of a climatology simulation), saving the output from each truss to a distinct compressed text file. I had to start them in the middle of the run (or so), because starting them at the beginning generated too much output, and I had to remove everything and restart the truss processes.

I then looked near the end of all 20 files, using bunzip2 (bzip2 compresses text harder than gzip) and less -c. 19 of them seemed very similar, all having died on signal 15, but one was different: it died on signal 1.

So I examined that outstanding file further, and after wading back past all the localization-related I/O that's generated on a modern *ix system when reporting errors, found this:

username

This just about has to be the first-level cause of your error. (To coin a phrase: call the first-level cause the one that directly results in the error, the 2nd level cause whatever caused the first, and so on. The nth level cause is then the problem in the source code. That means someone's source code, not necessarily yours!)

More specifically, that value, "0400000402", just about has to be illegal, and is what's causing your EINVAL. I looked at the include file that defines which bits are legal in the second argument to open on AIX 5.1 (/usr/include/sys/mode.h; the values are logically or'd together). The value pop uses to open that file is 9 octal digits long, while the flags in the include file are normally only 6 digits long, and even the obscure ones are 8 digits long. That would seem to account for the EINVAL: there are no bit flags that are 9 digits long, and just and'ing and or'ing those values shouldn't produce a 9 digit octal value (shifts could, but they aren't common for the second argument to open).
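
One way to inspect a flag value like this is to list which bits are set, so each can be compared against the include file by hand. The following is only a sketch: it assumes a Fortran 2003 or later compiler (for the octal constant passed to int()), and the program name decode_flags is made up for illustration. It knows nothing about what any bit means on AIX; that still has to be looked up in the header.

```fortran
! Sketch only: list the bits that are set in an octal flag value.
program decode_flags
  implicit none
  integer :: flags, bit

  ! The value seen in the trace, written without C's leading 0.
  flags = int(o'400000402')

  print '(a,o0)', 'flags (octal): ', flags
  do bit = 0, bit_size(flags) - 1
    if (btest(flags, bit)) then
      ! ishft(1, bit) is the single-bit mask for this position.
      print '(a,i0,a,o0)', 'bit ', bit, ' is set, mask (octal) ', ishft(1, bit)
    end if
  end do
end program decode_flags
```

For 0400000402 this reports bits 1, 8, and 26, that is, octal masks 2, 400, and 400000000. Each mask can then be searched for among the definitions the system headers provide.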

Now as far as finding the 2nd..nth level causes (the nth level one hopefully being the one that appears in the pop source code), that can be quite a chore sometimes. The first thing to do is probably to look at the open statement that's trying to open the file /ptmp/username/rungx3v5.140/ocn/rest/rungx3v5.140.pop.r.1791-01-01-00000, and see if there's anything unusual about it. If there is, you're in luck: just correct that problem, and you know n is 2, which is always nice to discover.

If there isn't, then you may be stuck with trying to guess what's changing this flag, which can be very difficult, for reasons described at my checking early page.

Rather than cranking up warning levels and eliminating every nth level cause the warnings catch, in the hope of hitting the "right" one corresponding to your 1st level cause (the EINVAL), you may get this working faster with tools like purify (from Rational, which IBM recently acquired, I believe), libefence, valgrind, and so forth.

Good luck, and feel free to let me know if you require further assistance with this potentially-messy issue.

Thanks!

More on software to make this easier (mentioned briefly in the letter above):

Some specific software
- totalview - noteworthy that it does AIX and Fortran
- purify
- efence
- valgrind
- sentinel
- memcheck
And here's a paper comparing such software (it's in RTF format, so download it and feed it to a word processor).

Undefined behavior in C

