Unit 1, Lesson 21

Input Record Separator

Ruby has a lot of tools for processing input line-by-line. But when you look a little closer, it turns out that these methods are for more than just lines of text: they generalize to process all sorts of record-oriented input.

Video transcript & code

If you ever look through the documentation for Ruby's IO objects, you're bound to notice that there are a lot of methods for reading data in. I mean, like, a lot.

We've got getbyte and getc and gets. We've got sysread and read and readbyte and readchar and readline and readlines and readpartial and each_line and... well you get the picture.

open("jabberwocky.txt").getbyte  # => 84
open("jabberwocky.txt").getc  # => "T"
open("jabberwocky.txt").gets  # => "Twas brillig, and the slithy toves\r\n"
open("jabberwocky.txt").sysread(64)  # => "Twas brillig, and the slithy toves\r\n      Did gyre and gimble in"
open("jabberwocky.txt").read  # => "Twas brillig, and the slithy toves\r\n      Did gyre and gimble in the wabe:\r\nAll mimsy were the borogoves,\r\n      ...
open("jabberwocky.txt").readbyte  # => 84
open("jabberwocky.txt").readchar  # => "T"
open("jabberwocky.txt").readline  # => "Twas brillig, and the slithy toves\r\n"
open("jabberwocky.txt").readlines  # => ["Twas brillig, and the slithy toves\r\n", "      Did gyre and gimble in the wabe:\r\n", "All mimsy were the borogoves,\r\n", "      ...
open("jabberwocky.txt").readpartial(64)  # => "Twas brillig, and the slithy toves\r\n      Did gyre and gimble in"
open("jabberwocky.txt").each_line do |line|
  puts line
end

# >> Twas brillig, and the slithy toves
# >>       Did gyre and gimble in the wabe:
# >> All mimsy were the borogoves,
# >>       And the mome raths outgrabe.
# >>

Each of these methods has its own semantics and reason for existence.

But several of them have something in common: they work with lines of text.

open("jabberwocky.txt").gets  # => "Twas brillig, and the slithy toves\r\n"
open("jabberwocky.txt").readline  # => "Twas brillig, and the slithy toves\r\n"
open("jabberwocky.txt").readlines  # => ["Twas brillig, and the slithy toves\r\n", "      Did gyre and gimble in the wabe:\r\n", "All mimsy were the borogoves,\r\n", "      ...
open("jabberwocky.txt").each_line do |line|
  puts line
end

# >> Twas brillig, and the slithy toves
# >>       Did gyre and gimble in the wabe:
# >> All mimsy were the borogoves,
# >>       And the mome raths outgrabe.
# >>

Now of course files and other sources of input are just streams of data, so what exactly do we mean by a "line", anyway?

I have another video exploring the history of computer text line delimiters, and I'm not going to rehash it here. The short version is that when we're working with textual data, by "lines of text" we typically mean some number of text characters followed by a terminator character or sequence.

In most modern computer systems, that means an ASCII linefeed, a carriage-return/linefeed pair, or less commonly a single carriage-return character.

"A line on UNIX, Linux, FreeBSD, Multics, etc...\n"
"A line on DOS, Windows, OS/2, Symbian, etc...\r\n"
"A line on Classic MacOS, Apple II, Commodore, etc...\r"

To explore how Ruby deals with these various line terminators, let's use the stringio standard library to create an IO object from a hard-coded string.

By default, Ruby line-oriented input methods recognize line-feed-terminated strings.

require "stringio"
input = StringIO.new("Line the First\nLine the Second")
input.readlines
# => ["Line the First\n", "Line the Second"]

As well as lines terminated with a carriage-return linefeed pair.

require "stringio"
input = StringIO.new("Line the First\r\nLine the Second")
input.readlines
# => ["Line the First\r\n", "Line the Second"]

But not lines delimited by just a carriage-return.

require "stringio"
input = StringIO.new("Line the First\rLine the Second")
input.readlines
# => ["Line the First\rLine the Second"]

Considering how rare this style of text file is these days, that's probably a reasonable default behavior.

But sometimes we don't want default behavior! What if the task at hand involves ingesting some legacy data in an old format?

For eventualities like this, we can supply a separator argument. In this case, we can specify that the separator is a carriage return character. We then see the output separated into two separate strings, one for each line.

require "stringio"
input = StringIO.new("Line the First\rLine the Second")
input.readlines("\r")
# => ["Line the First\r", "Line the Second"]

Every line-oriented method takes this optional argument.

So, for instance, we could do the same thing with gets or readline.

require "stringio"
input = StringIO.new("Line the First\rLine the Second\rLine the third")
input.gets("\r")  # => "Line the First\r"
input.readline("\r")  # => "Line the Second\r"
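
Since each record comes back with its separator still attached, a common follow-up is to strip it. Here is a minimal sketch of one way to do that, assuming we want the bare text of each carriage-return-terminated line; the variable names are only for illustration.

```ruby
require "stringio"

# Legacy-style data using bare carriage returns as line terminators.
input = StringIO.new("Line the First\rLine the Second\rLine the third")

# each_line without a block returns an Enumerator, so we can map over
# the records and chomp the custom separator off of each one.
lines = input.each_line("\r").map { |record| record.chomp("\r") }

p lines
```

Passing the separator to chomp explicitly keeps the stripping in step with the separator used for reading.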

So far I've been calling all these methods "line-oriented". But this is a bit misleading. It's more accurate to say that they are record-oriented.

To understand this better, let's say we have a stream of data records in YAML format. The YAML standard specifies that a single file or stream can contain multiple YAML documents, separated by three dash characters on a line by themselves. These particular records are a sampling of sets from a weight-training workout.

To process these as individual YAML records, we can read them in with a separator of three dashes and a newline.

Let's parse the second superset as a YAML document, and check out the result.

require "stringio"
require "yaml"
input = StringIO.new(<<EOF)
- exercise: squat jump
  reps: 12
- exercise: deep barbell back squat
  weight: 155
  reps: 5
- exercise: swiss-ball roll-out
  reps: 8
---
- exercise: box jump
  reps: 8
- exercise: front squat
  weight: 135
  reps: 5
- exercise: jacknife crunch
  reps: 8
---
- exercise: bulgarian split squat
  weight: 35
  reps: 8
- exercise: glute-ham raise
  reps: 8
- exercise: good morning
  weight: 70
  reps: 5
EOF

supersets = input.readlines("---\n")
YAML.load(supersets[1])
# => [{"exercise"=>"box jump", "reps"=>8},
#     {"exercise"=>"front squat", "weight"=>135, "reps"=>5},
#     {"exercise"=>"jacknife crunch", "reps"=>8}]

There it is: the data from the second superset. We used the readlines method, but by specifying a custom separator, we read in delineated records rather than lines.
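
To go one step further and parse every record rather than just one, something like the following sketch might work. It uses a shortened, made-up version of the workout data, and it strips the trailing separator from each record before handing it to the YAML parser.

```ruby
require "stringio"
require "yaml"

# A shortened, illustrative version of the workout stream.
input = StringIO.new(<<~DOCS)
  - exercise: squat jump
    reps: 12
  ---
  - exercise: box jump
    reps: 8
DOCS

# Read one record per YAML document, drop the trailing "---\n"
# marker, and parse what is left.
supersets = input.each_line("---\n").map do |record|
  YAML.load(record.chomp("---\n"))
end

p supersets.length
p supersets[1]
```

For real multi-document YAML, the library's own YAML.load_stream is probably the more robust choice. The point here is only that the record-oriented input methods can do the splitting.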

There are a few "special" values for this separator argument that Ruby interprets more than literally.

We've actually seen one of these special separators in action already.

If we specify a separator of a single linefeed character, Ruby will split newline-delimited lines as expected.

require "stringio"
input = StringIO.new("Line the First\nLine the Second")
input.gets("\n")
# => "Line the First\n"

...but it will also split on Windows-style carriage-return linefeed terminators.

require "stringio"
input = StringIO.new("Line the First\r\nLine the Second")
input.gets("\n")
# => "Line the First\r\n"

This is an example of Ruby doing the thing which is probably the most useful, but not the most obvious. Essentially it treats a linefeed separator as a kind of "do what I mean" flag for splitting the most common types of text file lines.
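
One practical consequence, sketched below with the same sample strings: because either kind of terminator arrives attached to the line, a plain chomp (which by default removes a trailing linefeed, carriage-return/linefeed pair, or carriage return) yields the same text in both cases.

```ruby
require "stringio"

unix    = StringIO.new("Line the First\nLine the Second")
windows = StringIO.new("Line the First\r\nLine the Second")

# gets returns the line with its terminator; chomp strips either style.
first_unix    = unix.gets.chomp
first_windows = windows.gets.chomp

p first_unix
p first_windows
```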

We saw earlier that this is also the behavior these methods exhibit without any explicit separator argument.

require "stringio"
input = StringIO.new("Line the First\r\nLine the Second")
input.gets
# => "Line the First\r\n"

That's because a newline is the global default for line or record-oriented input methods. But this default isn't hard-coded into Ruby!

In fact, it comes from a global variable called $/.

$/  # => "\n"

This is one of those global variables that has a weird little symbol name because it's most often used in short command-line one-liners. We'll get to some examples of those soon.

If we require the English module we can refer to this variable by its more expressive alias: $INPUT_RECORD_SEPARATOR

Or $RS for short.

require "stringio"
require "English"
input = StringIO.new("Line the First\r\nLine the Second")
$/  # => "\n"
$INPUT_RECORD_SEPARATOR  # => "\n"
$RS  # => "\n"
input.gets
# => "Line the First\r\n"

We can update this variable to globally alter the default input record separator.

For instance, let's go back to our weight-training example.

If we set the input record separator to the YAML stream separator marker...

And then load a single "record" using gets with no arguments

We can see that we get the first YAML record, and nothing more.

require "stringio"
require "yaml"
input = StringIO.new(<<EOF)
- exercise: squat jump
  reps: 12
- exercise: deep barbell back squat
  weight: 155
  reps: 5
- exercise: swiss-ball roll-out
  reps: 8
---
- exercise: box jump
  reps: 8
- exercise: front squat
  weight: 135
  reps: 5
- exercise: jacknife crunch
  reps: 8
---
- exercise: bulgarian split squat
  weight: 35
  reps: 8
- exercise: glute-ham raise
  reps: 8
- exercise: good morning
  weight: 70
  reps: 5
EOF

$/ = "---\n"
puts input.gets

# >> - exercise: squat jump
# >>   reps: 12
# >> - exercise: deep barbell back squat
# >>   weight: 155
# >>   reps: 5
# >> - exercise: swiss-ball roll-out
# >>   reps: 8
# >> ---
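
Because this variable is global, changing it affects every record-oriented read in the program. Here is a rough sketch of one way to limit the blast radius, assuming a hypothetical helper named with_record_separator (not part of Ruby) that restores the previous value when its block finishes.

```ruby
require "stringio"

# Hypothetical helper: temporarily change the global input record
# separator, run the block, then restore the previous value.
def with_record_separator(separator)
  original = $/
  $/ = separator
  yield
ensure
  $/ = original
end

input = StringIO.new("first: 1\n---\nsecond: 2\n")

# Inside the block, gets with no arguments reads up to the next "---\n".
record = with_record_separator("---\n") { input.gets }

p record
p $/
```

For most application code, passing the separator as an argument is likely the safer habit. The global is mainly a convenience for short scripts and one-liners.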

We talked earlier about one "special" value for the input record separator, the single linefeed character.

Let's talk about another one: the empty string.

$/ = ""

What do you think a record separator of "empty string" means to Ruby? Whatever you guess, it's probably wrong. Because if the behavior of a single newline for the separator was surprising, this one is downright arcane.

Here's a file containing the poem "Jabberwocky" by Lewis Carroll.

Let's say we want to read this poem in, not line-by-line, but stanza-by-stanza.

We can do this by setting the input record separator to an empty string. This puts record-oriented methods into "paragraph mode".

Now we can select various stanzas of the poem by index into the array returned by readlines.

input = open("jabberwocky.txt")
stanzas = input.readlines("")

puts stanzas[4]

# >> "And, has thou slain the Jabberwock?
# >>   Come to my arms, my beamish boy!
# >> O frabjous day! Callooh! Callay!'
# >>   He chortled in his joy.

This special mode treats paragraphs separated by a blank line as the records to be read in.
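
Paragraph mode is easy to try without a file on disk. This sketch uses a StringIO holding just two stanzas, separated by a single blank line, as a stand-in for the full poem.

```ruby
require "stringio"

# Two stanzas separated by one blank line.
poem = StringIO.new(<<~POEM)
  Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe:

  Beware the Jabberwock, my son!
    The jaws that bite, the claws that catch!
POEM

# An empty-string separator switches readlines into paragraph mode.
stanzas = poem.readlines("")

puts stanzas.length
puts stanzas[1]
```

Note that each record other than the last still carries the blank line that ended it, so a chomp or strip may be wanted before display.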

We might use this mode in a one-liner to count paragraphs in text files.

We use the -n flag to loop over every line of input (or more accurately, every record);

-e to eval some code;

a BEGIN clause to set the input record separator to the empty string to flag paragraph mode;

and an END clause to output the special line-count variable, which in this case becomes the paragraph count variable.

We supply the text file as an argument, and poof, we have a paragraph count.

$ ruby -ne 'BEGIN{$/=""}; END{puts $.}' jabberwocky.txt 
5
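
The same count can be had from a regular script without touching the global variable. This is a sketch under the assumption that the text lives in a file; here a temporary file with made-up contents stands in for it.

```ruby
require "tempfile"

text = "First paragraph,\nline two.\n\nSecond paragraph.\n\nThird paragraph.\n"

paragraph_count = Tempfile.create("paragraphs") do |file|
  file.write(text)
  file.flush
  # File.foreach without a block returns an Enumerator over records;
  # the "" separator selects paragraph mode for this call only.
  File.foreach(file.path, "").count
end

puts paragraph_count
```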

There's one more special separator flag value I want to show you.

Let's say we have three text files, and we want to concatenate them together into one file... but with separators in between them.

$ ls ?.txt
a.txt  b.txt  c.txt

We'll do this with another Ruby one-liner.

This time we'll use -p to loop over every input line... uh, I mean, record... and then print each one in turn.

In a BEGIN clause we'll set the input record separator to nil this time. We'll talk about what that means in a moment.

Next we set the output record separator to a newline, some stars to make a visual delineator, and another newline. I have another video about the output record separator.

And finally we supply our filenames as arguments.

Before we run this, let's talk about that input record separator. The special value nil tells record-oriented input methods to read in the entire file as one record. This might seem counter-intuitive at first: isn't the whole point of record-oriented methods not to read in a whole file at a time?

But in this case, the -p flag tells Ruby to loop over the special ARGF object (see Episode #058: ARGF, https://www.rubytapas.com/2013/02/11/episode-058-argf/), which is an input stream consisting of the concatenation of all the files specified at the command-line. So setting the input separator to nil effectively instructs Ruby to treat the whole contents of each file as a single record.

Then, our special output record separator string arranges for a starred line to be inserted between records as they are printed out.

The upshot is that we get the contents of all the input files concatenated together, with visual delineation.

$ ruby -pe 'BEGIN{$/=nil; $\="\n******\n"}' a.txt b.txt c.txt
How doth the little crocodile
  Improve his shining tail
And pour the waters of the Nile
  On every golden scale!

How cheerfully he seems to grin
  How neatly spreads his claws,
And welcomes little fishes in
  With gently smiling jaws!
******
Twinkle, twinkle, little bat!
How I wonder what you're at!
Up above the world you fly,
Like a teatray in the sky.
******
The sun was shining on the sea,
Shining with all his might:
He did his very best to make
The billows smooth and bright —
And this was odd, because it was
The middle of the night.
******
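
The nil separator also works as a per-call argument. As a small sketch, using a StringIO in place of a file, gets(nil) hands back everything that remains in the stream as a single record.

```ruby
require "stringio"

input = StringIO.new("How doth the little crocodile\n  Improve his shining tail\n")

# A nil separator means "no separator": read to the end of the stream.
whole = input.gets(nil)

# Nothing is left to read, so the next call signals end of input with nil.
rest = input.gets(nil)

p whole
p rest
```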

And with that, we'll call it a day. Today you've seen that Ruby's various line-oriented input reading methods are really record-oriented. And that we can influence the definition of a record both at an individual method call level, as well as at a global level. We've seen that global redefinition of input separators can be particularly useful in command-line one-liners. And we've seen that Ruby's interpretation of what is a record can be put into some "special" modes for reading text lines, paragraphs, and whole files at a time. Happy hacking!
