This page describes some methods for converting binary data intended for
one OS, compiler and/or CPU, to another.
This software is owned by The University of California, Irvine,
and is not distributed under any version of the GPL. The GPL is a
fine series of licenses, but the owners of this software need it
distributed under these terms instead.
First off, if you want to future-proof your code, you should pick one of
the following and use it:
htons() and friends (mostly for network protocols)
NetCDF (usually for scientific data you're writing to disk)
If you're working with (Sun) RPC data, then XDR is your friend
Plain old ASCII. If you think it's too wasteful of space and
bandwidth, contemplate XML for a while, then revisit ASCII.
Any of these options is better, long term, than what I'm about to describe.
But if you really do want the expedient route, then read on....
What architecture are you moving the data from? That is, what is the
operating system name, the compiler name, and the CPU name?
What data types are in the file? What structure do the records have?
Are all of the items in the file of the same type?
Is it a file of all integers? Then you might try my byte-swap program. Byte-swap is
easy enough to use that you might even try it if you aren't
sure about the format of your file, and just see if it
works. Integers can often (but not always) just be
swapped around a bit, to get a good conversion.
Is it a file of all reals/floats? Then you might be able to
use byte-swap too, but a better
bet is my real-ascii-real
suite. It'll convert your source data to ASCII, and then
convert that ASCII to the other host's real format. Some
float formats are simply byte-swappable, but that's even
less likely for floats/reals than it is for integers.
Does the file have a fixed record length (always writing the same
amount of data on each write/fwrite/etc.)?
You can sometimes find out if your file has fixed-length
records using a combination of approaches:
Try a system call
tracer to see if you can find anything out about the
file's block sizes. If the program is using buffered I/O,
then you'll probably just see a bunch of 4K (4096 bytes, or
other largish power of two) write()'s, in which case the
system call tracer isn't going to help without you first
helping the system call tracer. :) You can help your
system call tracer in a variety of ways (pick one that
works, you don't need others):
Turn off buffered I/O in your C code with setvbuf()
Check your compiler options for something that'll
force unbuffered writes.
Try using g95, the GNU fortran 95 compiler, and set
the environment variable G95_UNBUFFERED_ALL to "TRUE".
There may be other variables that work for other compilers.
In sh/ksh/bash: export G95_UNBUFFERED_ALL=TRUE
In csh/tcsh: setenv G95_UNBUFFERED_ALL TRUE
Add fflush()'s to your C code after every write.
Once you've got unbuffered writes happening, the system
call tracer should be able to give you the length of the
writes being performed. This is your record length. If
it's consistent after some init and before some
shutdown syscalls, then you probably have fixed-length
records. Otherwise, you most likely have variable-length records.
Another approach for determining your record length, again
assuming fixed record lengths, is to study your source code,
and possibly use this program as a
guide. Use it like: "./factor filelen". It should show all
of the integers that divide evenly into your file length.
Be aware that some fortran compilers (runtimes) will want to
prefix an integer on either end of your "records" that
describe how long the record is, so on a system with 4 byte
integers, you could end up adding 2*4 bytes to your record length.
Either way, once you have a guess what the record size is,
try dividing that number into the file length. If it divides
evenly, then your hypothesis is supported. If it doesn't
divide evenly, then back to the drawing board.
Once you have the record length, and the data elements that
it's made up of, then you can probably concoct a program that'll
read your single data structure in a loop, and then convert it
to a neutral format, like ascii. Then transmit that neutral
format to the remote host, and have another program that
converts your neutral format to the native representation of
the target machine.
No homogeneous type, and no fixed-length records? Then you
have variable length records - the hardest kind to convert. In
this case you may still get some mileage out of the unbuffered
writes+syscall tracer technique described above, however
you're almost certainly going to have to study the source code,
possibly in great detail, and figure out what's letting the
consumer program identify what sort of data to expect, and
mirror that logic in your pair of conversion programs. You'll
still be converting native data to a neutral format, and then
the neutral format to the other kind of native data, but you'll
likely have more than one data structure to look at, and you'll
likely have some control flow beyond just a single while/for/do loop.
Here is an example of converting a
file with variable length records from linux/g95/x86 to
AIX/xlf90_r/pSeries. As it happens, the file was made of
real*4's and real*8's, and fortuitously, x86 and pSeries
actually use the same kind of reals, just laid out in memory
in a different order - hence there was no need to convert to
ASCII or other neutral format, nor to get into
"mantissa - exponent - radix - fieldwidths" nonsense.
"Framed" fortran records
What they are: Some fortran runtimes like to "frame" each record they write out.
This generally means that if your record is "r", then when writing the
record, instead of just writing bytestream(r), they will instead
write out bytestream(int(len(r)))+bytestream(r)+bytestream(int(len(r))).
Example usage, and output. In this example, we're using a
termination heuristic where if any record is longer than 300000
bytes (-m 300000), the program exits. We're assuming big endian
byte order (due to -b; this applies only to the framing integers, not
the data between them), and a wordsize of 4 bytes
(due to -w 4, again only for the framing integers).
You must use exactly one of -r, -f or -o
Usage: /usr/local/bin/add-fortran-framing [-r recordfile|-f fixedreclen|-o] -w word_size [-h]
-r recordfile means to read a series of record lengths from file recordfile
-f fixedreclen means to assume a fixed record length
-o means to assume the file has a single record
-w wordsize specifies the size of the integers used in the framing. Common values are 4 and 8
-b says to assume big endian byte ordering of the integers used for framing
-l says to assume little endian byte ordering of the integers used for framing
-h shows help
Example usage: 4 byte big endian words, on a file with only one record
Options to change the byte order used by a language's runtime system
Some compilers have the option of using a different byte order
from that which the system's CPU(s) use(s) natively. Examples:
Intel compiler?: F_UFMTENDIAN
Gnu g95 compiler?: G95_ENDIAN
However, note that these sorts of options will usually
change both a language runtime's input format and its output format.
These options are often not that helpful for converting data from
one byte order to another, unless you already have some form of
neutral format involved, in which case you probably don't
really need these options.
If you're willing to run your code at a performance penalty
And you are moving to a system with
such an option
And you do not need your code on the new
system to output binary files that are in native byte order
Then you may be able to just always specify one of
these kinds of options every time you utilize the
non-native, architecture-dependent data.
If you're using a little endian system
But you need to work with data that was created on a big endian system
And your little endian system has some way of turning on
big endian runtime behavior
then you may be able to just leave your
data in big endian format.