
Binary Mode

The other day I took some Ruby code that I wrote on a Linux system, and started running it on a Windows box. Pretty quickly, I ran into an error:

marshal data too short

This error occurred as the program was reading some saved cache data from an earlier run.

This error is representative of one of the most common problems programmers run into when porting a program from UNIX-like systems to Windows.

While it's not immediately obvious, the source of this error can be traced back to the way different operating systems represent line endings.

As you may recall if you watched episode #468, most computer operating systems adopted the ASCII standard for representing plain text. But ASCII had no standardized, abstract character for representing the separation between two lines. Instead, it had a selection of control codes that were used to signal teleprinters to do various physical actions.

In particular, the principal control codes associated with line endings were:

  • ASCII code 10, known as "Line Feed", or "LF" for short; and
  • ASCII code 13, known as "Carriage Return", or "CR" for short.

ASCII Code  Name                  Meaning
10          Line Feed (LF)        Advance paper roll
13          Carriage Return (CR)  Return print head to start
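
As a quick check of those two codes, here is a minimal sketch that asks Ruby for the numeric values behind its "\n" and "\r" string escapes. The variable names are just for illustration.

```ruby
# Ruby's "\n" and "\r" string escapes stand for the LF and CR control codes.
lf = "\n"
cr = "\r"

puts lf.ord      # prints 10
puts cr.ord      # prints 13

# A CR+LF line ending is simply those two bytes, in that order.
crlf_bytes = "\r\n".bytes
p crlf_bytes     # prints [13, 10]
```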

Even as most operating systems standardized on ASCII character codes, they settled on different combinations of the CR and LF codes for their standard text file line endings.

Some standardized on a CR followed by an LF; some went with just an LF or a CR alone; and a few special cases even used an LF followed by CR.

Characters  OSes
CR+LF       CP/M, MS-DOS, Windows, OS/2, Palm, Symbian, Atari TOS
LF          Multics, UNIX, BSD, Amiga, BeOS, OS X, macOS
CR          Commodore, ZX Spectrum, TRS-80, Apple II, classic Mac OS
LF+CR       Print spooler for Acorn BBC and RISC OS

Imagine what it would be like if we had to deal with this proliferation of text file conventions in our code.

For instance, let's say we had the simple task of writing a series of verses to a file. Here's a program to do that in Ruby.

For Windows-based builds of Ruby, it picks a CR+LF separator, which is represented in Ruby strings by the escape sequences \r and \n.

For UNIX-type builds of Ruby, it goes with just a linefeed.

lines = ["'Twas brillig, and the slithy toves",
         "Did gyre and gymbal in the wabe:",
         "All mimsy were the borogroves,",
         "And the mome raths outgrabe."]

separator = case RUBY_PLATFORM
            when /cygwin|mswin|mingw/
              "\r\n"
            when /darwin|linux|freebsd/
              "\n"
            else fail "I don't know that platform: #{RUBY_PLATFORM}"
            end

output = lines.join(separator)

And actually, this program doesn't even get around to writing the file; it just fills in the correct separators for certain Windows and UNIX-like builds.

Before you go and copy it, know that this case statement is woefully incomplete, and would not be remotely suitable for a widely-distributed program.

Fortunately, we don't have to worry about getting line-endings right on different operating systems. That's because Ruby was originally implemented in the C programming language, and it inherits C's approach to dealing with text file diversity.

Here's what that means. Internally, we use the linefeed character everywhere we want to represent a line separation.

Then we just write the file without any further ado.

The standard linefeed character is translated to the appropriate line separator for the current operating system, with no further intervention on our part.

We can see this if we compare the size of the data we wrote, to the size of the file on disk.

Let's run this program on a Windows machine.

lines = ["'Twas brillig, and the slithy toves",
         "Did gyre and gymbal in the wabe:",
         "All mimsy were the borogroves,",
         "And the mome raths outgrabe."]

output = lines.join("\n")

IO.write("jabberwocky.txt", output)
puts "Output length: #{output.size}"
puts "Size on disk: #{File.size('jabberwocky.txt')}"

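Here in the output we can see that the string we wrote is 128 characters long, but the file on disk is 131 bytes long! The three extra bytes are the CRs that were added in front of each of the three linefeeds.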
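
To see where that difference comes from without needing a Windows machine, here is a small sketch that simulates the translation by hand. The gsub call is only a stand-in for what the text-mode write does on Windows; it is not how Ruby implements it.

```ruby
lines = ["'Twas brillig, and the slithy toves",
         "Did gyre and gymbal in the wabe:",
         "All mimsy were the borogroves,",
         "And the mome raths outgrabe."]

output = lines.join("\n")

# Simulate the Windows text-mode translation: every LF becomes CR+LF.
on_disk = output.gsub("\n", "\r\n")

puts "Output length:     #{output.bytesize}"     # prints 128
puts "Linefeeds:         #{output.count("\n")}"  # prints 3
puts "Simulated on disk: #{on_disk.bytesize}"    # prints 131
```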

But as long as these translations are being handled for us transparently, this seems harmless enough. The problems start to crop up when we start working with non-text data.

For instance, let's write ourselves a little demo program. We'll start with a list of addresses. Notice that these addresses incorporate linebreaks.

Now, when storing data on disk, many databases store an index alongside the actual data entries. This enables fast lookup without loading the entire database into memory at once.

Let's cobble together the most rudimentary index. We'll initialize a starting offset and an empty index array.

Then we'll go through the list of addresses. For each one, we'll stuff the current offset into the index. Then we'll increment the offset by the byte length of the current address.

Next we'll write our little database out to disk. We open a file.

We start out the file with our index, stored as a space-separated list of integer offsets and terminated with a newline.

Then we write every address to the file. We don't bother with any separators between them, since we already know what their offsets will be inside the address section of the file.

Now that we've written our data to disk, let's read it back in.

First we use gets to grab everything up to the first newline, and parse it back out into our index.

Now let's look up the second address in our list. To do that, we look up the second entry in the index.

We calculate the length of the address we'll be reading by subtracting its offset from the next offset.

We use the offset to seek forwards to the right point in the file.

And then we read in the right number of bytes.

Finally, we dump the address we just looked up.

addresses = [
  "123 Main St.\nAnytown, TN",
  "456 Broad St.\nPlace City, PA",
  "789 First St.\nNowheresville, IL"]

offset = 0
index  = []
addresses.each_with_index do |addr,i|
  puts "Address #{i} is found at offset #{offset}"
  index << offset
  offset += addr.bytesize  # byte length, not character count
end

puts "Writing addrs.db"
open("addrs.db", "w") do |f|
  f.write(index.map(&:to_s).join(" ") + "\n")
  f.write(addresses.join(""))
end

puts "Reading in addrs.db"
open("addrs.db") do |f|
  index = f.gets.chomp.split(" ").map(&:to_i)
  puts "Read indexes: #{index.inspect}"
  puts "Looking up address #2"
  offset = index[1]
  length = index[2] - offset
  f.seek(offset, IO::SEEK_CUR)
  address = f.read(length)
  puts "Retrieved address:"
  p address
end

Let's run this on my Windows box, and see what happens.

Everything seems to run OK, except… the found address doesn't look right.

It has a little bit of a previous address in it. Then it has a raw CR+LF sequence. And then it ends too early!

>ruby addrs.rb
Address 0 is found at offset 0
Address 1 is found at offset 24
Address 2 is found at offset 52
Writing addrs.db
Reading in addrs.db
Read indexes: [0, 24, 52]
Looking up address #2
Retrieved address:
"N456 Broad St.\r\nPlace City, "

The problem here is that when we use standard C-based file I/O methods like seek and read that operate in terms of byte offsets and byte counts, they ignore newline translation! So while all of our offsets were calculated in terms of single-byte newlines, our random access file reads are being done on the raw on-disk representation. And on this Windows system, the on-disk representation has newlines as two-byte CR+LF sequences.
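
We can reproduce the mismatch on any platform with a rough simulation. The sketch below assumes that the address section on a Windows disk looks like our in-memory data with every LF expanded to CR+LF, and then applies the same offset and length to both versions.

```ruby
addresses = [
  "123 Main St.\nAnytown, TN",
  "456 Broad St.\nPlace City, PA",
  "789 First St.\nNowheresville, IL"]

in_memory = addresses.join
# Simulated on-disk bytes after a Windows text-mode write.
on_disk = in_memory.gsub("\n", "\r\n")

# Offset and length of the second address, as our index calculated them.
offset = addresses[0].bytesize   # 24
length = addresses[1].bytesize   # 28

intended = in_memory.byteslice(offset, length)
actual   = on_disk.byteslice(offset, length)

p intended  # prints "456 Broad St.\nPlace City, PA"
p actual    # prints "N456 Broad St.\r\nPlace City, "
```

The extra CR in the first address pushes everything after it one byte later, so the read starts one byte early. The extra CR inside the second address then uses up one of the 28 bytes, so the read also stops short.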

I want to point out, here, that we didn't have to move a file from a UNIX system to a Windows system, or vice-versa, to run into trouble. Simply writing out and reading in a file on the same Windows machine was enough to get us into this mess.

Fixing the problem is quite easy. All we have to do is update both our open calls to explicitly put the file into binary mode using the b specifier.

Binary mode tells the I/O system to refrain from any translation whatsoever. The bytes in memory are the bytes that get written to disk. And the bytes on disk are the bytes that get read back into memory.
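
Here is a minimal sketch of that guarantee. It writes a string in binary mode to a scratch file (the sample.db name and the temporary directory are just for illustration) and checks that the size on disk and the bytes read back match the data in memory. This should hold on any operating system.

```ruby
require "tmpdir"

data = "456 Broad St.\nPlace City, PA"

size_on_disk, read_back = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.db")   # hypothetical scratch file
  File.open(path, "wb") { |f| f.write(data) }
  [File.size(path), File.open(path, "rb") { |f| f.read }]
end

puts size_on_disk == data.bytesize   # prints true
puts read_back.bytes == data.bytes   # prints true
```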

addresses = [
  "123 Main St.\nAnytown, TN",
  "456 Broad St.\nPlace City, PA",
  "789 First St.\nNowheresville, IL"]

offset = 0
index  = []
addresses.each_with_index do |addr,i|
  puts "Address #{i} is found at offset #{offset}"
  index << offset
  offset += addr.bytesize  # byte length, not character count
end

puts "Writing addrs.db"
open("addrs.db", "wb") do |f|
  f.write(index.map(&:to_s).join(" ") + "\n")
  f.write(addresses.join(""))
end

puts "Reading in addrs.db"
open("addrs.db", "rb") do |f|
  index = f.gets.chomp.split(" ").map(&:to_i)
  puts "Read indexes: #{index.inspect}"
  puts "Looking up address #2"
  offset = index[1]
  length = index[2] - offset
  f.seek(offset, IO::SEEK_CUR)
  address = f.read(length)
  puts "Retrieved address:"
  p address
end

When we run our modified version, we can see that it successfully looks up the second address, beginning and ending on the correct characters.

>ruby addrs_bin.rb
Address 0 is found at offset 0
Address 1 is found at offset 24
Address 2 is found at offset 52
Writing addrs.db
Reading in addrs.db
Read indexes: [0, 24, 52]
Looking up address #2
Retrieved address:
"456 Broad St.\nPlace City, PA"

A lot of programmers, when they go from Linux or OS X-based programming to a Windows environment, blame these issues on the design of Windows. It's all Windows' fault, for having different, non-standard line-endings!

But there's nothing non-standard about Windows line-ending conventions, because as we saw earlier, there's no such thing as a standard. Different operating systems picked different conventions more or less arbitrarily.

If there's a technology to blame here, it's the C standard library that the Ruby I/O system is based on. C I/O functions make text-mode, translated I/O the default. And yet they treat text-mode files as binary when using functions like seek and read to do random-access I/O.

As a result, it's far too easy to write our file I/O incorrectly. And let's be very clear: the first version of this program was incorrect. It was incorrect on any operating system.

It would have worked, by accident, on a Linux or OS X system. Because as you may recall from episode #468, as a historical coincidence the C-style internal representation of text files is identical to the on-disk representation for UNIX-style systems. So the line-ending "translation" on those systems is literally a no-op.

But just because text mode happens to be identical to binary mode on UNIX-like systems doesn't mean that this I/O should have been done in the default text mode. The fact that this program deals in byte offsets means that it is treating data on disk as opaque binary data, not as text. And whenever we do binary I/O, we should be using the appropriate flags to open the files involved.

What does all this have to do with the error we saw at the very beginning of the episode?

marshal data too short

When Ruby serializes objects with the Marshal class, it writes them out in a binary format. Now that you've seen the kinds of issues that we can run into when binary data is read or written as text, you can probably see how a "marshal data too short" error might be related to text-mode I/O. And indeed, the solution to the problem turned out to be switching some file-I/O to binary mode.
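
The original caching code isn't shown here, so the following is only a sketch of what a binary-mode Marshal round trip might look like. The cache hash and the cache.bin file name are made up for the example.

```ruby
require "tmpdir"

# Hypothetical cache data; note the embedded newline.
cache = { "verses" => ["line one\nline two"], "count" => 2 }

restored = Dir.mktmpdir do |dir|
  path = File.join(dir, "cache.bin")
  # Write and read the marshaled bytes with no newline translation.
  File.open(path, "wb") { |f| Marshal.dump(cache, f) }
  File.open(path, "rb") { |f| Marshal.load(f) }
end

puts restored == cache  # prints true
```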

Personally, I'm terrible about remembering this. But I'm trying to get better. Especially when writing Ruby gems for public consumption, it's important to be mindful that not everyone using our code is using the same operating system we are.

Some OS differences are difficult to work around. But this one is just a matter of specifying the correct file mode when opening a file for reading or writing.

Here are some guidelines on how to decide which file open mode to use:

  • If the file is human-readable text, and it might be read or written by programs other than this one, open the file in the default text mode. This includes configuration or log files which only this program uses, but which the user might open in a text editor or viewer.
  • Otherwise, open the file in binary mode for both read and write.

If there's a case where you're not sure, it's probably better to err on the side of specifying binary mode. Binary mode ensures that the data inside the program is exactly the same as the data outside the program. With this choice, you do run the risk of having the file be interpreted incorrectly by other programs. But it's less likely that your own code will break because it can't read back the data that it wrote.
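
For whole-file reads and writes, Ruby also has File.binwrite and File.binread, which open the file in binary mode for us. Here is a minimal sketch; the payload string and the file name are invented for the example.

```ruby
require "tmpdir"

payload = "index 0 24 52\nraw bytes follow"

round_trip = Dir.mktmpdir do |dir|
  path = File.join(dir, "payload.bin")  # hypothetical scratch file
  File.binwrite(path, payload)          # same effect as opening with "wb"
  File.binread(path)                    # same effect as opening with "rb"
end

puts round_trip.bytes == payload.bytes            # prints true
puts round_trip.encoding == Encoding::BINARY      # prints true
```

The string that comes back is tagged with the binary encoding, which is a handy reminder that we are dealing with raw bytes rather than text.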

In conclusion: line endings and file modes are a frustrating reality of cross-platform development. But using the right file mode is not hard to do, and it will save someone a headache down the road.

Happy hacking!
