Unit 1, Lesson 21

Null-Terminated Record

Before YAML, before JSON, there were null-terminated records. See how this simple, pragmatic technique greases the pipes between UNIX command-line tools—and how you can harness it in your Ruby one-liners!

Video transcript & code

I've made a couple of videos recently talking about the concept of "input record separators" in Ruby. We've talked about how changing the input record separator is particularly useful in console one-liners.

Like this one, that uses the special empty string separator to count paragraphs.

# count paragraphs
$ ruby -ne 'BEGIN{$/=""}; END{puts $.}' jabberwocky.txt 
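For comparison, here is a rough script-form sketch of the same idea. It assumes the text is already in a string, and the method name paragraph_count is made up for illustration. It relies on String#each_line, which treats an empty-string separator as paragraph mode just like $/ does.

```ruby
# Count paragraphs by iterating in "paragraph mode":
# an empty-string separator splits on runs of blank lines.
def paragraph_count(text)
  text.each_line("").count
end

# Hypothetical sample text with three paragraphs.
sample = "one\ntwo\n\nthree\n\nfour\n"
puts paragraph_count(sample)
```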

Something you might reasonably wonder is whether there's a command-line flag for setting the input record separator. And there is! Well... sort of.

The flag for setting this global variable has the unlikely name of -0. Not dash-O! Dash Zero.

ruby -0

And the unusual name is probably the least weird thing about this option.

Because this flag doesn't accept a string argument. Instead, it expects a number... Specifically, an octal number.

So, like, if we wanted to split some input on colons, like, say, this PATH variable.

$ echo $PATH
/root/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin:/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

In theory, we could figure out the octal value of an ASCII colon... OK, looks like it's 72...

$ irb
":".codepoints.first.to_s(8)
=> "72"
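If you want to double-check a lookup like this, a small sketch along the following lines converts in both directions. The helper names octal_code and char_for_octal are invented for this example; they only wrap String#ord, Integer#to_s, String#to_i and Integer#chr.

```ruby
# Octal digits for a single character, as the -0 flag would expect them.
def octal_code(char)
  char.ord.to_s(8)
end

# The reverse: which character does a string of octal digits name?
def char_for_octal(digits)
  digits.to_i(8).chr
end

puts octal_code(":")        # prints 72
puts char_for_octal("72")   # prints :
```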

...and then we could use that to specify our input separator.

$ echo $PATH | ruby -072 -nle 'puts $_'
/root/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin
/usr/local/bundle/bin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
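Outside of a one-liner, roughly the same splitting can be sketched with an explicit separator instead of the global. This assumes the input is already in a string; split_records and the shortened sample path are hypothetical, chosen only for illustration.

```ruby
# Split text into records on an arbitrary separator.
# chomp: true strips the separator from the end of each record,
# much like the -l flag does in the one-liner.
def split_records(text, separator)
  text.each_line(separator, chomp: true).to_a
end

# Hypothetical, shortened stand-in for a real PATH value.
sample_path = "/usr/local/bin:/usr/bin:/bin"
puts split_records(sample_path, ":")
```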

But if you actually use -0 this way, tentacles will emerge from your laptop screen and drag you into the nethermost confusion of the dread god Azathoth, who blasphemes and bubbles at the center of all infinity. Sorry, I don't make the rules.

As far as I'm concerned, there is only one octal value that is acceptable to use with the -0 flag. But before I tell you which one it is, let me set up a scenario.

Let's say we are working in a Git repository, and we want to gather some stats about who has made the most commits.

commit a7ecd32411a4e1d7dade44ed03462fce9c05015d (HEAD -> master, origin/master, origin/HEAD)
Merge: 5947d20 efae4f8
Author: Hiroshi SHIBATA <hsbt@ruby-lang.org>
Date:   Tue Nov 10 19:54:36 2020 +0900

    Merge pull request #366 from bahasalien/patch-1

    Update rdoc; HTTP -> HTTPS

commit efae4f88963229a7c8ee54c3d13af5730993308b
Author: Alam <73675883+bahasalien@users.noreply.github.com>
Date:   Mon Nov 9 23:40:59 2020 +0800

    Fix doubled "http://" in line 102

    sorry about that....

commit 49820401e29089fddb95f0499769a40c433b94ca
Author: Alam <73675883+bahasalien@users.noreply.github.com>
Date:   Mon Nov 9 23:34:09 2020 +0800

    Update rdoc; HTTP -> HTTPS

    except www.a-a-p.org still cannot....

We could get this information by piping the git log into a Ruby one-liner. But in order to do this, we need to split up the input into separate commits. And the commits don't have an obvious delimiter character. We'd have to do some parsing here to break the input into records.

And we have to be extra careful about the parsing, too. Remember, pretty much any string can appear in a git log message!

Wouldn't it be nice if there were some way to tell Git to output a completely unambiguous and unique character between each log record? As it turns out, we can!

If we pass the -z flag to git log, it will insert a null character (meaning ASCII code zero) as a terminator after each log record.

git log -z

Which means we can pipe the git log output into ruby with an input record separator of NUL, that is, octal zero.

$ git log -z | ruby -0

Let's go ahead and fill in the rest of our one-liner. I'm not going to go over this one in detail, but essentially it instantiates a Hash for stats, and as it loops over the records it increments a counter for each author it spots. At the end it dumps the final stats, sorted by number of commits.

$ git log -z | ruby -rpp -0 -nle 'BEGIN{stats=Hash.new(0)}; stats[$1]+=1 if /^Author:\s+(.*)$/; END{pp stats.sort_by(&:last)}'

So this is the one permissible value for the -0 flag, as far as I'm concerned: the number zero. And as a matter of fact, Ruby's implementers understood that null-delimited records were the most likely use of the -0 flag, and so they made it the default! That's why a bare -0, with no digits after it, is all we need.

Just don't spell the zero out. Ruby treats -00 as a special case meaning paragraph mode, the same as setting the separator to an empty string, so it won't split on NUL.
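Here is the counting logic from the one-liner written out as a plain method, as a rough sketch rather than a drop-in replacement. It assumes the NUL-separated log text has already been read into a string; author_counts and the sample commits are invented for illustration.

```ruby
# Tally commits per author from NUL-separated log records.
def author_counts(log_text)
  stats = Hash.new(0)
  log_text.each_line("\0", chomp: true) do |record|
    # Pull the author out of the first "Author:" header line, if any.
    author = record[/^Author:\s+(.*)$/, 1]
    stats[author] += 1 if author
  end
  # Sort by commit count, smallest first, as the one-liner does.
  stats.sort_by(&:last)
end

# Hypothetical sample standing in for `git log -z` output.
sample_log = [
  "commit 1111111\nAuthor: Alice <alice@example.com>\n\n    First change\n",
  "commit 2222222\nAuthor: Bob <bob@example.com>\n\n    Second change\n",
  "commit 3333333\nAuthor: Alice <alice@example.com>\n\n    Third change\n"
].join("\0")

pp author_counts(sample_log)
```

Because each record is a whole commit, a blank line inside a commit message can't be mistaken for a record boundary.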

Gosh, it sure was convenient that git log had that special null-delimited output mode. Wouldn't it be cool if other command-line utilities had that feature too??

Well, guess what: they do! Support for null-terminated records is a feature of many, many different command-line utilities, although the specific command-line flags may differ.

For instance, the grep utility takes a capital Z flag that makes it terminate each file name it prints with a NUL.

$ grep -Z

And Ruby doesn't have to just be on the consumer side of null-delimited records. Ruby one-liners can also produce null-terminated records for other tools to parse.

For instance, let's put the input record separator in paragraph mode with an empty string, and set the output record separator to the NUL character. We do this using the string escape sequence \0. And then we'll read in a poem.

We pipe this into the head utility and tell it to output just the first two lines, but we also put it into null-delimited record mode with a -z flag. Instead of two lines, we get the first two stanzas!

$ ruby -ple 'BEGIN{$/=""; $\="\0"}' jabberwocky.txt | head -n 2 -z
Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.He took his vorpal sword in hand:
  Long time the manxome foe he sought --
So rested he by the Tumtum tree,
  And stood awhile in thought.

We've overridden the record separator in both Ruby's output and head's input, and as a result we're no longer working in terms of linefeed-terminated lines of text.
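To see both halves of that exchange in one place, here is a small sketch that stays entirely inside Ruby. It assumes stanzas are separated by blank lines; to_nul_records and first_records are made-up names, with first_records playing the role that head -z plays in the pipeline.

```ruby
# Producer: turn blank-line-separated stanzas into NUL-terminated records.
def to_nul_records(text)
  text.strip.split(/\n{2,}/).map { |stanza| stanza + "\0" }.join
end

# Consumer: take the first few NUL-terminated records from a stream.
def first_records(stream, count)
  stream.each_line("\0", chomp: true).first(count)
end

# Hypothetical three-stanza sample.
poem = "line one\nline two\n\nline three\nline four\n\nline five\n"
puts first_records(to_nul_records(poem), 2)
```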

The UNIX command-line philosophy is all about connecting small, sharp tools together with pipes. Sometimes the records we want to stream between those tools are richer than simple newline-terminated strings. In those cases, using the ASCII NUL as a record terminator is an important technique to understand for pragmatic data interchange. And now you know how to consume and produce null-terminated records in Ruby. Happy hacking!
