Have you ever written some Ruby code, only to have someone complain that it mangles text or raises data corruption errors on their machine?
Often, these kinds of problems result from text vs. binary file encoding incompatibilities on different operating systems. To understand why this happens in a modern, high-level language such as Ruby, you need to dig into the language’s C roots. And to understand file encodings in C, you have to go back just a little bit further…
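For a concrete taste of the problem, here's a minimal Ruby sketch (the string contents are illustrative) of how the same text, saved with different line-ending conventions, trips up naive line splitting:

```ruby
# The same two lines of text, as saved by a Windows editor vs. a UNIX one.
windows_text = "first\r\nsecond\r\n"
unix_text    = "first\nsecond\n"

# Code that assumes "\n" alone separates lines leaves a stray carriage
# return dangling at the end of every Windows-style line:
windows_text.split("\n").first  # => "first\r"
unix_text.split("\n").first     # => "first"
```

That invisible `"\r"` is often the "mangled text" users report: it passes string comparisons by eye but fails them in code.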
Video transcript & code
On a classic manual typewriter, moving to a new line of text required two discrete actions. First, you had to return the carriage to the starting position. And second, you had to advance the paper up by one line.
You don't have to do them in this order. You could just as well advance the paper and then return the carriage. But both have to happen before you can start typing a new line.
In the first half of the 20th century, the first "teleprinters" were created. These were basically electric typewriters connected to telegraph lines. An operator could type text on the keyboard at one end of the line, and hundreds of miles away, the text would be automatically printed out by the receiving teleprinter. A major manufacturer of teleprinters was the "Teletype Corporation", and we now know these machines as "teletypes".
For this remote typewriter transmission to be possible, the letters typed on the keyboard had to be translated into electronic codes on the wires. But more than just the letters had to be represented. In order for the printer on the end of the line to print an accurate representation of what was typed, it also had to be told when to return the carriage, and when to advance the paper. This latter action was called a "line feed".
In 1963, something wonderful happened. The ASCII standard was published, which specified standardized numeric codes for passing English letters and numbers over the wire. Included in this standard were control codes. The number 13 signaled a Carriage Return, and the number 10 signaled a Line Feed.
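Those two codes are still with us; you can check them from Ruby directly:

```ruby
# The ASCII control codes defined in 1963, as seen from Ruby:
"\r".ord  # => 13 (Carriage Return)
"\n".ord  # => 10 (Line Feed)
```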
Newer teleprinters like the Teletype Model 33 adopted this standardized ASCII table of codes. An interesting fact about these machines is that technically it would have been possible to specify a single control code that told the printer to both advance the feed and return the print carriage. Except for one problem: the codes came over the wire at a constant speed, and the Model 33 wasn't physically fast enough to move the print head back to the start of the line in the time it took for the next character code to arrive. So it was necessary to define the line separator as a Carriage Return (CR) followed by a Line Feed (LF).
As early mainframe computers like the DEC PDP series were being developed, teletype machines were a natural way to communicate with them. Operators could type commands into the teletype keyboard, and the computer would respond by sending ASCII codes back to the printer. The "TTY" virtual devices still found on UNIX-like operating systems originally referred to physical teletype machines.
Now, ASCII specified a common set of character and control codes. But these mapped directly to physical actions on a teleprinter. There was no standard for abstract concepts like line separations.
So as operating system developers created their software, one question that faced them was how to represent whitespace, and in particular line separations, in internal memory.
The developers of the CP/M operating system went with a bare-metal approach. The teletypes it communicated with expected a Carriage-Return/Line-Feed, or CR+LF, sequence. Writers of programs for CP/M simply encoded these CR+LF sequences directly in in-memory text, which was then output without any translation.
This convention was later copied by the MS-DOS operating system, and carried forward into the Windows family of operating systems. Along the way, the CR+LF convention was also picked up by Atari TOS, OS/2, Symbian, Palm OS, and various other OSes.
Meanwhile, the developers of the Multics operating system were a little more ambitious. They decided that text would have an abstract internal representation, which would then be automatically translated by a "device driver" into the appropriate control sequences for the physical terminal device. They went with the Line Feed character for this internal representation of line endings.
The choice of the ASCII 10 Line Feed code to represent line endings was a somewhat arbitrary one. They could just as easily have picked the ASCII 13 Carriage Return code. And that's exactly the choice that was made by the creators of the Commodore machines, the Acorn BBC, the ZX Spectrum, the TRS-80, Lisp machines, and the Apple II. This selection of the Carriage Return as the line separator was carried forward into later Macintosh operating systems, until the transition to the BSD-based OS X.
For that matter, the common ordering of Carriage Return followed by Linefeed was also kind of arbitrary. The Acorn BBC computers and the RISC OS standardized on the reverse LF+CR sequence as their convention for text that was spooled out to a printer.
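Given all these competing conventions, code that accepts text from arbitrary sources often normalizes line endings up front. Here's one way that might look in Ruby (the helper name is my own, not a standard API), folding CR+LF, LF+CR, and bare CR down to a single LF:

```ruby
# Normalize CR+LF (DOS/Windows), LF+CR (Acorn/RISC OS), and bare CR
# (Commodore, classic Mac OS, etc.) down to a single LF each.
# Alternation order matters: the two-character sequences must be
# tried before the single characters.
def normalize_newlines(text)
  text.gsub(/\r\n|\n\r|\r/, "\n")
end

normalize_newlines("a\r\nb\rc\n\rd")  # => "a\nb\nc\nd"
```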
But while all this proliferation of internal text representations was going on, something was happening that would forever shape how we think about ASCII text in computer programs.
At Bell Labs, Dennis Ritchie was developing the C programming language. Since he was creating it on and for the UNIX operating system, he naturally chose the UNIX-standard Linefeed convention as the standard internal representation of text line separations for C programs.
Starting from UNIX, the C programming language went on to dominate the entire industry. It became the primary language for DOS and Windows development. Even when you weren't writing code in C itself, chances were you were writing it in a C-derived or C-inspired language like C++, Objective-C, or Java. And most modern scripting languages, like Perl, PHP, Python, and Ruby, had their reference implementations created in C. They all embed many C-originated assumptions in their design.
As the C language was standardized, it became clear that for ease of programming, it would need a way to translate between its internal representation of ASCII text, and the various line-ending conventions existing on different target operating systems. To address this, the C standard library defined two modes that a file could be opened in: binary mode, and text mode.
When a file was opened in binary mode, the C I/O libraries would perform no translation. They would read and write bytes exactly as specified from and to the disk or other I/O device.
But when opened in "text" mode, the I/O libraries would automatically translate the "native" line-endings to and from the C-standard internal Linefeed convention.
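Ruby inherits this machinery, and its string-encoding layer lets you perform the same translation explicitly. A small sketch, assuming Ruby's newline-conversion options to `String#encode`:

```ruby
# What "text mode" does on a CR+LF platform, expressed as explicit
# conversions: crlf_newline mirrors the write path, and
# universal_newline mirrors the read path.
"one\ntwo\n".encode(crlf_newline: true)           # => "one\r\ntwo\r\n"
"one\r\ntwo\r\n".encode(universal_newline: true)  # => "one\ntwo\n"
```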
Of course, on UNIX systems no translation is needed, so the two modes behave identically: regardless of the mode, internal linefeeds become external linefeeds unchanged, and vice-versa. As a result, it was possible to write programs on a UNIX host without ever giving a thought to whether a given file ought to be treated as text or binary content. This fact, combined with the fact that binary mode must be explicitly requested, set the stage for all manner of confusion and porting difficulties for decades to come.
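You can observe this identical behavior from Ruby. Here's a runnable sketch (the temp file name is arbitrary) comparing a default-mode read with an explicit binary read:

```ruby
require "tempfile"

bytes = "a\r\nb\r\n"                   # raw CR+LF-terminated content
file  = Tempfile.new("newline-demo")   # arbitrary scratch file
File.binwrite(file.path, bytes)

text_read   = File.read(file.path)     # default ("text") mode
binary_read = File.binread(file.path)  # explicit binary mode

# On a UNIX host both reads return the bytes untouched, so the two
# strings are equal; on Windows, text mode would translate the CR+LF
# pairs, giving "a\nb\n" for text_read.
file.close!
```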