Converting binary data

This page describes some methods for converting binary data intended for one OS, compiler and/or CPU, to another.

This software is owned by the University of California, Irvine, and is not distributed under any version of the GPL. The GPL is a fine series of licenses, but the owners of the software need it to be distributed under these terms.

First off, if you want to future-proof your code, you should pick one of the following and use it:

htons() and friends (mostly for network protocols)
NetCDF (usually for scientific data you're writing to disk)
If you're working with (Sun) RPC data, then XDR is your friend
Plain old ASCII. If you think it's too wasteful of space and bandwidth, contemplate XML for a while, then revisit ASCII. :)
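
As a rough illustration of the first option, here is a minimal Python sketch. It assumes your data is a list of unsigned 32-bit integers; the "!" prefix in Python's struct module asks for network (big endian) byte order, which is the same job htonl()/ntohl() do in C. The function names are made up for this example.

```python
import struct

def pack_u32_network(values):
    """Pack unsigned 32-bit integers in network (big endian) byte order."""
    return struct.pack("!%dI" % len(values), *values)

def unpack_u32_network(data):
    """Inverse of pack_u32_network(); gives the same answer on any host."""
    count = len(data) // 4
    return list(struct.unpack("!%dI" % count, data[:count * 4]))
```

Because the byte order is spelled out on both the writing and the reading side, the bytes mean the same thing regardless of which CPU produced them.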

Any of these options is better, long term, than what I'm about to describe...

But if you really do want the expedient route, then read on...

What architecture are you moving the data from? Get the operating system name, the compiler name, and the CPU name.
Is your source architecture big endian, little endian, or something else? This matters mostly for integers, but can be relevant for floats/reals too.
What architecture are you moving the data to? Get the operating system name, the compiler name, and the CPU name.
Is your target architecture big endian, little endian, or something else? This matters mostly for integers, but can be relevant for floats/reals too.
What data types are in the file? What structure do the records have?
- Are all of the items in the file of the same type?
  - Is it a file of all integers? Then you might try my byte-swap program. Byte-swap is easy enough to use that you might even try it if you aren't sure about the format of your file, and just see if it works. Integers can often (but not always) just be swapped around a bit to get a good conversion.
  - Is it a file of all reals/floats? Then you might be able to use byte-swap too, but a better bet is my real-ascii-real suite. It'll convert your source data to ASCII, and then convert that ASCII to the other host's real format. Some float formats are simply byte-swappable, but that's even less likely for floats/reals than it is for integers.
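
To make "swapped around a bit" concrete, here is a minimal Python sketch of byte swapping. It is not the byte-swap program mentioned above, just an illustration; the function name and the word_size parameter are assumptions made for the example. It only gives a correct conversion when both machines use the same representation apart from byte order (for example, two's complement integers or IEEE 754 floats on both sides).

```python
def swap_words(data, word_size):
    """Reverse the byte order of every word_size-byte word in data."""
    if len(data) % word_size != 0:
        raise ValueError("data length is not a multiple of the word size")
    out = bytearray(len(data))
    for start in range(0, len(data), word_size):
        # Copy each word back in reversed byte order.
        out[start:start + word_size] = data[start:start + word_size][::-1]
    return bytes(out)
```

For a file of 4-byte integers you would read the whole file in binary mode, call swap_words(contents, 4), and write the result out in binary mode.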
- Does the file have a fixed record length (always writing the same amount of data on each write/fwrite/etc.)?
  - You can sometimes find out if your file has fixed-length records using a combination of approaches:
    - Try a system call tracer to see if you can find anything out about the file's block sizes. If the program is using buffered I/O, then you'll probably just see a bunch of 4K (4096 bytes, or other largish power of two) write()'s, in which case the system call tracer isn't going to help without you first helping the system call tracer. :) You can help your system call tracer in a variety of ways (pick one that works, you don't need others):
      - Turn off buffered I/O in your C code with setvbuf() or similar.
      - Check your compiler options for something that'll force unbuffered writes.
      - Try using g95, a free fortran 95 compiler, and set the environment variable G95_UNBUFFERED_ALL to "TRUE". There may be other variables that work for other compilers.
        
        In sh/ksh/bash: export G95_UNBUFFERED_ALL=TRUE
        In csh/tcsh: setenv G95_UNBUFFERED_ALL TRUE
      - Add fflush() calls to your C code after every fwrite()/fprintf()/etc. (A raw write() bypasses stdio buffering, so it doesn't need one.)
      Once you've got unbuffered writes happening, the system call tracer should be able to give you the length of the writes being performed. This is your record length. If it's consistent after some init and before some shutdown syscalls, then you probably have fixed-length records. Otherwise, you most likely have variable-length records.
  - Another approach for determining your record length, again assuming fixed record lengths, is to study your source code, and possibly use this program as a guide. Use it like: "./factor filelen". It should show all of the integers that divide evenly into your file length. Be aware that some fortran compilers (runtimes) will put an integer on each end of each "record" that describes how long the record is, so on a system with 4 byte integers, you could end up adding 2*4 bytes to your record length estimate.
  - Either way, once you have a guess what the record size is, try dividing that number into the file length. If it divides evenly, then your hypothesis is supported. If it doesn't divide evenly, then back to the drawing board.
  - Once you have the record length, and the data elements that it's made up of, then you can probably concoct a program that'll read your single data structure in a loop, and then convert it to a neutral format, like ASCII. Then transmit that neutral format to the remote host, and have another program that converts your neutral format to the native representation of the target machine.
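
The fixed-length steps above can be sketched in Python roughly as follows. This is not the factor program mentioned above, and the record layout (one 4-byte integer followed by two 8-byte reals, written on a big endian machine) is purely hypothetical; substitute whatever your source code says the record contains.

```python
import struct

SOURCE_FORMAT = ">idd"          # hypothetical layout: big endian int32, float64, float64
TARGET_FORMAT = "=idd"          # same fields in the target host's native byte order
CONVERTERS = (int, float, float)

def divisors(n):
    """Every positive integer that divides n evenly: candidate record lengths."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:
                large.append(n // d)
        d += 1
    return small + large[::-1]

def records_to_ascii(data, fmt=SOURCE_FORMAT):
    """Run on the source side: one text line per fixed-length record."""
    if len(data) % struct.calcsize(fmt) != 0:
        raise ValueError("file length is not a multiple of the record length")
    lines = []
    for fields in struct.iter_unpack(fmt, data):
        # repr() of a Python float round-trips exactly through float().
        lines.append(" ".join(repr(field) for field in fields))
    return "\n".join(lines) + "\n"

def ascii_to_records(text, fmt=TARGET_FORMAT):
    """Run on the target side: rebuild native binary records from the text."""
    out = bytearray()
    for line in text.splitlines():
        fields = [conv(tok) for conv, tok in zip(CONVERTERS, line.split())]
        out += struct.pack(fmt, *fields)
    return bytes(out)
```

The ValueError in records_to_ascii() is the "divides evenly" check from above: if it fires, the record length hypothesis is wrong.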
- No homogeneous type, and no fixed-length records? Then you have variable length records - the hardest kind to convert. In this case you may still get some mileage out of the unbuffered writes+syscall tracer technique described above, however you're almost certainly going to have to study the source code, possibly in great detail, and figure out what's letting the consumer program identify what sort of data to expect, and mirror that logic in your pair of conversion programs. You'll still be converting native data to a neutral format, and then the neutral format to the other kind of native data, but you'll likely have more than one data structure to look at, and you'll likely have some control flow beyond just a single while/for/do loop.
  Here is an example of converting a file with variable length records from linux/g95/x86 to AIX/xlf90_r/pSeries. As it happens, the file was made of real*4's and real*8's, and fortuitously, x86 and pSeries actually use the same kind of reals, just laid out in memory in a different order - hence there was no need to convert to ASCII or other neutral format, nor to get into "mantissa - exponent - radix - fieldwidths" nonsense.
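
When a record mixes 4-byte and 8-byte reals like that, each field has to be reversed on its own rather than swapping the record in uniform words. Here is a minimal Python sketch of the idea. It is not the linked example; the function name and the widths list are assumptions, and in a real conversion the widths would come from studying the program that wrote the file.

```python
def swap_fields(record, widths):
    """Reverse the bytes of each field; widths gives each field's size in bytes."""
    if sum(widths) != len(record):
        raise ValueError("field widths do not add up to the record length")
    out = bytearray()
    pos = 0
    for width in widths:
        out += record[pos:pos + width][::-1]
        pos += width
    return bytes(out)
```

For a record holding one real*4 followed by one real*8, you would call swap_fields(record, [4, 8]).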

"Framed" fortran records

What they are: Some fortran runtimes like to "frame" each record they write out. This generally means that if your record is "r", then when writing the record, instead of just writing bytestream(r), they will instead write out bytestream(int(len(r)))+bytestream(r)+bytestream(int(len(r))).
Stripping them out of a file that has them
- strip-fortran-framing. Run it with no arguments for usage.
- Example usage, and output. In this example, we're using a termination heuristic where if any record is longer than 300000 bytes (-m 300000), the program exits. We're assuming big endian byte order (due to -b; this applies only to the framing integers, not the data between them), and a wordsize of 4 bytes (due to -w 4, again only for the framing integers).
  - ./strip-fortran-framing -m 300000 -b -w 4 < runoff-2d-921River-1deg-mon.bin > runoff-2d-921River-1deg-mon.bin.no-framing
  - Record 0 is length 259200
  - Record 1 is length 259200
  - Record 2 is length 259200
  - Record 3 is length 259200
  - Record 4 is length 259200
  - Record 5 is length 259200
  - Record 6 is length 259200
  - Record 7 is length 259200
  - Record 8 is length 259200
  - Record 9 is length 259200
  - Record 10 is length 259200
  - Record 11 is length 259200
  - Normal termination
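
For the curious, the stripping logic can be sketched in a few lines of Python. This is not the source of strip-fortran-framing, only an illustration of the layout described above; the parameter names are assumptions loosely modeled on the -w, -b and -m options, and here an oversized length raises an error rather than ending the run.

```python
import struct

def strip_framing(data, word_size=4, big_endian=True, max_len=300000):
    """Return the list of record payloads from framed data, without the length words."""
    code = {4: "I", 8: "Q"}[word_size]
    fmt = (">" if big_endian else "<") + code
    payloads = []
    pos = 0
    while pos < len(data):
        if pos + word_size > len(data):
            raise ValueError("truncated leading length word")
        (length,) = struct.unpack_from(fmt, data, pos)
        if length > max_len:
            raise ValueError("implausible record length; wrong byte order or word size?")
        start = pos + word_size
        end = start + length
        if end + word_size > len(data):
            raise ValueError("truncated record")
        (trailer,) = struct.unpack_from(fmt, data, end)
        if trailer != length:
            raise ValueError("leading and trailing lengths disagree")
        payloads.append(data[start:end])
        pos = end + word_size
    return payloads
```

Joining the returned payloads with b"".join() gives the unframed byte stream.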

Adding them into a file that doesn't have them: add-fortran-framing

Usage:

/usr/local/bin/add-fortran-framing
You must use exactly one of -r, -f or -o
Usage: /usr/local/bin/add-fortran-framing [-r recordfile|-f fixedreclen|-o] -w word_size [-h]
-r recordfile means to read a series of record lengths from file recordfile
-f fixedreclen means to assume a fixed record length
-o means to assume the file has a single record
-w wordsize specifies the size of the integers used in the framing.  Common values are 4 and 8
-b says to assume big endian byte ordering of the integers used for framing
-l says to assume little endian byte ordering of the integers used for framing
-h shows help

Example usage: 4 byte big endian words, on a file with only one record

add-fortran-framing -o -w 4 -b < MLLATMSB > foo
mv MLLATMSB MLLATMSB.original
mv foo MLLATMSB
./read_lat_lon 
 75.01390839
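
The reverse operation is similarly short. Again, this is only a Python sketch of the framing layout, not the source of add-fortran-framing; the function and parameter names are assumptions. A single-record file (the -o case) corresponds to passing [len(data)] as the list of record lengths, and a fixed record length n (the -f case) to [n] * (len(data) // n).

```python
import struct

def add_framing(data, record_lengths, word_size=4, big_endian=True):
    """Wrap each record in data with a leading and trailing length word."""
    code = {4: "I", 8: "Q"}[word_size]
    fmt = (">" if big_endian else "<") + code
    if sum(record_lengths) != len(data):
        raise ValueError("record lengths do not add up to the data length")
    out = bytearray()
    pos = 0
    for length in record_lengths:
        marker = struct.pack(fmt, length)
        out += marker + data[pos:pos + length] + marker
        pos += length
    return bytes(out)
```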

Options to change the byte order used by a language's runtime system

Some compilers have the option of using a different byte order from that which the system's CPU(s) use(s) natively. Examples:
- Intel Fortran compiler: the F_UFMTENDIAN environment variable
- g95 compiler: the G95_ENDIAN environment variable
However, note that these sorts of options will usually change both a language runtime's input format as well as its output format, so:
- These options are often not that helpful for converting data from one byte order to another, unless you already have some form of neutral format involved, in which case you probably don't really need these options
- But:
  1. If you're willing to run your code at a performance penalty long-term
  2. And you are moving to a system with such an option
  3. And you do not need your code on the new system to output binary files that are in native byte order
  4. Then you may be able to just always specify one of these kinds of options every time you utilize the non-native, architecture-dependent data
  5. For example:
    - If you're using a little endian system
    - But you need to work with data that was created on a big endian system
    - And your little endian system has some way of turning on big endian runtime behavior
    - then you may be able to just leave your data in big endian format.

Converting specific formats

Converting matlab files

References

gla.ac.uk on floating point representation
epfl.ch with gory details on the internals of floating point representations - converting to a neutral format and then from there on to the target native format should spare you from these details though!
rdrop.com's endian FAQ

Timestamp: 2025-07-22 15:45:47 PDT
