Raw Binary Data in Julia

Keith Rutkowski

April 9, 2020

Learn how reading or writing raw binary data from Julia is both simple and efficient with the CBinding.jl package.

Binary data in Julia

Building on what was introduced in a previous article, we will now detail how to create simple and efficient binary IO code in Julia. Binary data formats are often encountered in file and network IO. Low-level C libraries usually exist for common data formats, but licensing differences, incompatible versions, or portability issues can make them unreliable to depend on. Working with binary data in C is rather easy because a user-defined type can be declared so that its memory layout matches the data layout.

Unfortunately, Julia offers no built-in support for directly working with packed binary data in an efficient way. In contrast, the Python Standard Library provides the struct module with facilities to pack and unpack binary data. We will demonstrate below how CBinding.jl, which was created to add proper support for C constructs to Julia, is an essential tool when working with binary data in Julia.

Features covered in this article include:

creating an object from an IO stream,
performing both in-memory and direct on-disk data manipulation,
efficient zero-copy IO, and
byte alignment and bit packing capabilities.

A simple example

The WAV audio file format will be used here since it is a simple and ubiquitous file format. The format is binary of course, but the header data doesn’t require any bit packing, byte alignment, or changes to byte order. By using CBinding.jl and the @cstruct macro it provides, we define a Julia type that is analogous to the C type shown in the comment to the right.

julia> using CBinding

julia> @cstruct WAV_header {           # struct WAV_header {
         riff::UInt8[4]                #   uint8_t  riff[4];
         fileSize::UInt32              #   uint32_t fileSize;
         fileHeader::UInt8[4]          #   uint8_t  fileHeader[4];
         fmtMarker::UInt8[4]           #   uint8_t  fmtMarker[4];
         fmtLength::UInt32             #   uint32_t fmtLength;
         fmtType::UInt16               #   uint16_t fmtType;
         dataChannels::UInt16          #   uint16_t dataChannels;
         dataSampleRate::UInt32        #   uint32_t dataSampleRate;
         dataBytesPerSecond::UInt32    #   uint32_t dataBytesPerSecond;
         dataBytesPerSample::UInt16    #   uint16_t dataBytesPerSample;
         dataBitsPerSample::UInt16     #   uint16_t dataBitsPerSample;
         dataHeader::UInt8[4]          #   uint8_t  dataHeader[4];
         dataSize::UInt32              #   uint32_t dataSize;
       }                               # };

Next, we open a sample WAV file and read the header exactly as it was defined.

julia> header = open("sample.wav") do io
         read(io, WAV_header)
       end;

julia> header.fileHeader[] |> String
"WAVE"

julia> header.dataBitsPerSample |> signed
16

julia> header.dataChannels |> signed
1

julia> header.dataSampleRate |> signed
22050

shell> file sample.wav
sample.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz

Comparing the values of the header’s fields with what is reported by the file command indicates a successful parsing of the binary header data.
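For comparison, the following is a minimal sketch of parsing part of the same header with Base Julia alone, one field at a time. It does not use CBinding.jl; the FmtChunk struct and the read_fmt function are hypothetical names introduced only for illustration, and the input bytes are the 20 bytes starting at offset 0x10 in the hexdump of sample.wav shown later in this article.

```julia
# Hypothetical plain-Julia counterpart to part of WAV_header (illustration only)
struct FmtChunk
    fmtLength::UInt32
    fmtType::UInt16
    dataChannels::UInt16
    dataSampleRate::UInt32
    dataBytesPerSecond::UInt32
    dataBytesPerSample::UInt16
    dataBitsPerSample::UInt16
end

# Read every field in declaration order, converting each from little-endian.
# Function arguments are evaluated left to right, so the reads happen in order.
function read_fmt(io::IO)
    return FmtChunk(
        ltoh(read(io, UInt32)),
        ltoh(read(io, UInt16)),
        ltoh(read(io, UInt16)),
        ltoh(read(io, UInt32)),
        ltoh(read(io, UInt32)),
        ltoh(read(io, UInt16)),
        ltoh(read(io, UInt16)),
    )
end

# Bytes 0x10 through 0x23 of sample.wav, copied from the hexdump
bytes = UInt8[0x10, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00,
              0x22, 0x56, 0x00, 0x00, 0x44, 0xac, 0x00, 0x00,
              0x02, 0x00, 0x10, 0x00]

fmt = read_fmt(IOBuffer(bytes))
println(Int(fmt.dataChannels), " channel(s), ",
        Int(fmt.dataSampleRate), " Hz, ",
        Int(fmt.dataBitsPerSample), " bit")
```

This works, but every field has to be listed twice (once in the type and once in the reader), and each call copies a value out of the stream. The @cstruct approach keeps the layout in a single definition.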

Changing byte order

In our example the file format byte order (little-endian) happens to be the same as the host system’s byte order, but that is not always the case. Ensuring the correct byte order results in safer, more portable code, and it is easy to do using the following Julia functions:

ltoh(x): Convert x from little-endian byte order to host’s byte order.
ntoh(x): Convert x from big-endian byte order to host’s byte order.
htol(x): Convert x from host’s byte order to little-endian byte order.
hton(x): Convert x from host’s byte order to big-endian byte order.

So, to read the header’s fields portably, regardless of the host’s byte order, the code would look like this:

julia> header.dataBitsPerSample |> ltoh |> signed
16

julia> header.dataChannels |> ltoh |> signed
1

julia> header.dataSampleRate |> ltoh |> signed
22050
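To see what these conversions do, here is a small sketch using only Base Julia. The byte values are made up for illustration: the same four bytes are interpreted once as big-endian and once as little-endian, and the results hold whichever byte order the host uses.

```julia
# The same four bytes, interpreted under each byte-order convention
bytes = UInt8[0x00, 0x00, 0x56, 0x22]

raw = read(IOBuffer(bytes), UInt32)   # bytes taken in the host's own order

as_big    = ntoh(raw)   # treat the stored bytes as big-endian
as_little = ltoh(raw)   # treat the stored bytes as little-endian

# Converting host -> big-endian -> host returns the original value
roundtrip = ntoh(hton(0x7f000001))

println(as_big, " ", as_little, " ", roundtrip == 0x7f000001)
```

On a little-endian host ltoh and htol return their argument unchanged and ntoh and hton swap the bytes; on a big-endian host the roles are reversed. Calling them unconditionally is what keeps the code portable.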

Wrapping byte arrays

Occasionally the layout of binary data is not static and depends on values found within the data itself. In such cases, the user would read some block of data and then (re-)interpret that byte array as it is inspected. CBinding.jl also provides an unsafe_wrap method to create a user-defined view of a byte array. It does not take ownership of the data, so a reference to the original array must be kept for as long as the wrapper is in use; otherwise the garbage collector may free the memory the wrapper points to.

julia> data = Vector{UInt8}(undef, sizeof(WAV_header));

julia> open("sample.wav") do io
         readbytes!(io, data)
       end;

julia> header = unsafe_wrap(WAV_header, pointer(data));
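As an illustration of a layout that depends on the data itself, the sketch below uses Base Julia's reinterpret on views of a byte array rather than CBinding.jl. It reads the fmtLength field to work out where the next chunk begins, then interprets the bytes found there. The bytes are copied from the hexdump of sample.wav shown later in this article; the variable names are illustrative.

```julia
# First 48 bytes of sample.wav, copied from the hexdump
bytes = UInt8[
    0x52, 0x49, 0x46, 0x46, 0x5e, 0xb9, 0x06, 0x00,
    0x57, 0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20,
    0x10, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00,
    0x22, 0x56, 0x00, 0x00, 0x44, 0xac, 0x00, 0x00,
    0x02, 0x00, 0x10, 0x00, 0x64, 0x61, 0x74, 0x61,
    0x3a, 0xb9, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00,
]

# fmtLength occupies bytes 17:20 (1-based); view it as a UInt32 without copying
fmt_length = ltoh(reinterpret(UInt32, view(bytes, 17:20))[1])

# The fmt chunk body starts after byte 20, so the next chunk follows it
next_chunk = 20 + Int(fmt_length) + 1

# Interpret the next chunk: a 4-byte marker followed by a 4-byte size
marker    = String(bytes[next_chunk:next_chunk+3])
data_size = ltoh(reinterpret(UInt32, view(bytes, next_chunk+4:next_chunk+7))[1])

println(marker, " chunk of ", Int(data_size), " bytes at index ", next_chunk)
```

Unlike a raw pointer, a reinterpret view holds a reference to its parent array, so the garbage collector cannot free the bytes while the view is in use.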

Efficient zero-copy IO

By combining memory mapping with the facilities presented so far, we can achieve optimal IO performance. Memory mapping a file essentially makes the contents of the file accessible as a byte array in memory. The operating system handles the mapping and transparently performs the reads and writes, so you don’t actually need to read the whole file into memory to get high performance in random access use cases.

Julia has the standard library package Mmap that provides the mmap function. We use it below to create a byte array mapped to an on-disk file and then use unsafe_wrap to interpret the byte array as a WAV_header object.

julia> using Mmap

julia> data = open("sample.wav", "r+") do io
         Mmap.mmap(io, Vector{UInt8}, 256)
       end;

julia> header = unsafe_wrap(WAV_header, pointer(data));

julia> header.fileSize |> ltoh |> signed
440670

We can even update the file on-disk simply by changing the header’s fields.

shell> hexdump --canonical --length=48 sample.wav
00000000  52 49 46 46 5e b9 06 00  57 41 56 45 66 6d 74 20  |RIFF^...WAVEfmt |
00000010  10 00 00 00 01 00 01 00  22 56 00 00 44 ac 00 00  |........"V..D...|
00000020  02 00 10 00 64 61 74 61  3a b9 06 00 00 00 00 00  |....data:.......|

julia> header.riff[1] = UInt8('r');

julia> header.fileSize = UInt32(1000) |> htol;

shell> hexdump --canonical --length=48 sample.wav
00000000  72 49 46 46 e8 03 00 00  57 41 56 45 66 6d 74 20  |rIFF....WAVEfmt |
00000010  10 00 00 00 01 00 01 00  22 56 00 00 44 ac 00 00  |........"V..D...|
00000020  02 00 10 00 64 61 74 61  3a b9 06 00 00 00 00 00  |....data:.......|
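The same write-through behavior can be observed without CBinding.jl. The following sketch, which assumes a scratch file created with mktemp rather than a real WAV file, maps eight bytes, modifies them through the array, and re-reads the file with ordinary IO to confirm the change reached the disk.

```julia
using Mmap

# Create a scratch file holding the first 8 bytes of a RIFF header
path, io = mktemp()
write(io, UInt8[0x52, 0x49, 0x46, 0x46, 0x5e, 0xb9, 0x06, 0x00])
close(io)

# Map the file read-write; the array is backed by the file, not a copy of it
mapped = open(path, "r+") do f
    Mmap.mmap(f, Vector{UInt8}, 8)
end

mapped[1] = UInt8('r')                                   # single byte, no byte order involved
mapped[5:8] = reinterpret(UInt8, [htol(UInt32(1000))])   # store 1000 in little-endian order
Mmap.sync!(mapped)                                       # flush the changes to disk

on_disk = read(path)   # re-read the file through ordinary IO
println(on_disk)
```

The resulting bytes match the first eight bytes of the second hexdump above.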

Advanced usage

The basic facilities demonstrated above should already simplify your IO code. More advanced facilities provided by CBinding.jl include bit fields, field byte alignment, and packing strategies, all of which tend to be used more frequently in networking protocols. The definition of an IP header below, though it is rather contrived, illustrates some of these features.

julia> @cstruct IP_header {                   # struct IP_header {
         (vers:4, hdrLen:4, svc:8)::UInt32    #   uint32_t vers:4, hdrLen:4, svc:8;
         (len:16)::UInt32                     #   uint32_t len:16;
         (ident:16)::UInt32                   #   uint32_t ident:16;
         (ctrlFlags:3, fragOff:13)::UInt32    #   uint32_t ctrlFlags:3, fragOff:13;
         (ttl:8, proto:8)::UInt32             #   uint32_t ttl:8, proto:8;
         (hdrChksum:16)::UInt32               #   uint32_t hdrChksum:16;
         srcAddr::UInt32                      #   uint32_t srcAddr;
         dstAddr::UInt32                      #   uint32_t dstAddr;
       } __packed__                           # } __attribute__((packed));

julia> data = zeros(UInt8, sizeof(IP_header));

julia> header = unsafe_wrap(IP_header, pointer(data));

julia> header.ctrlFlags = 0x7;

julia> header.len = 0x1234 |> hton;

julia> header.srcAddr = 0x7f000001 |> hton;

julia> header.dstAddr = 0x7f000001 |> hton;

julia> data'
1×20 LinearAlgebra.Adjoint{UInt8,Array{UInt8,1}}:
 0x00  0x00  0x12  0x34  0x00  0x00  0x07  0x00  0x00  0x00  0x00  0x00  0x7f  0x00  0x00  0x01  0x7f  0x00  0x00  0x01
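For a sense of what bit fields automate, here is a minimal sketch of packing a 3-bit value and a 13-bit value into one 16-bit word by hand with shifts and masks in Base Julia. The helper names pack_flags and unpack_flags are hypothetical, the input values are arbitrary, and the sketch places the 3-bit value in the most significant bits as an assumption of its own; it is not meant to reproduce the byte layout printed above.

```julia
# Hypothetical helpers: a 3-bit flags value in the top bits, a 13-bit offset below it
pack_flags(flags, offset) = (UInt16(flags & 0x7) << 13) | UInt16(offset & 0x1fff)
unpack_flags(word::UInt16) = (flags = word >> 13, offset = word & 0x1fff)

word = pack_flags(0x2, 185)   # flags = 0b010, offset = 185

# Bytes as they would be stored in big-endian (network) order
wire = collect(reinterpret(UInt8, [hton(word)]))

fields = unpack_flags(word)
println(word, " ", wire, " ", fields)
```

Every field added to a format needs another mask and shift like these, which is the bookkeeping that a bit field declaration takes over.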

If you are considering the transition to Julia, but have several C libraries or binary file formats you depend on, we can help! Analytech Solutions offers your team many years of experience working with both Julia and C, and we can streamline your transition process. Please contact us for more information!

Keith Rutkowski is a seasoned visionary, inventor, and computer scientist with a passion to provide companies with innovative research and development, physics-based modeling and simulation, data analysis, and scientific or technical software/computing services. He has over a decade of industry experience in scientific and technical computing, high-performance parallelized computing, and hard real-time computing.