In Progress
Unit 1, Lesson 1
In Progress

Chunk

Video transcript & code

Recently I went through my filing cabinet and archived a lot of old files. Each file that I wanted to archive went into a cardboard file box. In the end, I would up with 5 boxes of archived files.

My files are alphabetical, and sometimes I would fill a box while I was still in the middle of one letter. I could have gone for maximum use of space, and just split some letters across two boxes. But I realized that this would have made looking up old files a real chore. I realized that it wasn't enough that the files were alphabetically sorted. I needed to also chunk the files by letter.

Let's take a look at what this would look like in Ruby.

I have a list of files that are already sorted alphabetically.

FILES = [
  "AcmeCo",
  "Beeblebrox",
  "Breakfast",
  "Cats",
  "Chunky Bacon",
  "Cthulhu",
  "Delicatessens",
  "Ducks",
  # ...
]

Each filename is different. In order to chunk them, I need to extract the part of the filename which is the same for every member of a chunk. In this case that would be the first letter. Just to visualize this point, lets map over the files and pull out just their first letters.

require "./files"

FILES.map{|f| f[0]}
# => ["A", "B", "B", "C", "C", "C", "D", "D"]

To use this block as a chunking criterion, we can simply swap our map out for chunk.

require "./files"

FILES.chunk{|f| f[0]}
# => #<Enumerator: ...>

Well, OK, not quite that simply. Like so many Enumerable methods we've looked at on this show, #chunk produces an Enumerator for maximum flexibility. In order to force the Enumerator to fully evaluate into a result, we can send #to_a to it.

require "./files"

FILES.chunk{|f| f[0]}.to_a
# => [["A", ["AcmeCo"]],
#     ["B", ["Beeblebrox", "Breakfast"]],
#     ["C", ["Cats", "Chunky Bacon", "Cthulhu"]],
#     ["D", ["Delicatessens", "Ducks"]]]

The result, as you can see here, is an array of arrays. Each entry has two parts: first, the key value that was used to split out a chunk. And second, an array of files comprising that chunk.

If you're familiar with the methods that can be found in Enumerable, you might be wondering how this differs from #group_by.

The most obvious difference is that #group_by produces a finished Hash, not an Enumerator that can generate arrays of arrays. This is because #group_by is intended to be run on finished data sets.

require "./files"

FILES.group_by{|f| f[0]}
# => {"A"=>["AcmeCo"],
#     "B"=>["Beeblebrox", "Breakfast"],
#     "C"=>["Cats", "Chunky Bacon", "Cthulhu"],
#     "D"=>["Delicatessens", "Ducks"]}

The differences between these methods become more clear when we introduce a new element to the list of files. This file starts with "X", but it has been mis-filed in between the "C"-s.

#group_by produces five sections, with the "X" falling after the Cs but before the Ds. On the other hand, when we run #chunk, we get six sections. The Cs are split around the interloper "X" file. This is because #chunk always preserves order. It is only concerned with chunking together runs of related elements. It is not trying to comprehensively group all same-keyed elements together no matter where they are found.

FILES = [
  "AcmeCo",
  "Beeblebrox",
  "Breakfast",
  "Cats",
  "Chunky Bacon",
  "X-Windows",
  "Cthulhu",
  "Delicatessens",
  "Ducks",
  # ...
]

FILES.group_by{|f| f[0]}
# => {"A"=>["AcmeCo"],
#     "B"=>["Beeblebrox", "Breakfast"],
#     "C"=>["Cats", "Chunky Bacon", "Cthulhu"],
#     "X"=>["X-Windows"],
#     "D"=>["Delicatessens", "Ducks"]}

FILES.chunk{|f| f[0]}.to_a
# => [["A", ["AcmeCo"]],
#     ["B", ["Beeblebrox", "Breakfast"]],
#     ["C", ["Cats", "Chunky Bacon"]],
#     ["X", ["X-Windows"]],
#     ["C", ["Cthulhu"]],
#     ["D", ["Delicatessens", "Ducks"]]]

This makes #chunk useful for streams of data we may not want to read all the way to the end. It also makes it useful as a tool for lexing: breaking text up into tokens.

As an example, here's a string that contains some source code for a simple, Ruby-like language. This code adds two numbers, assigns the result to a variable, and then prints the contents of the variable on the next line.

We can take the characters of the stream, and chunk them. In the chunk block, we have a case statement that switches on each character. If the character is a newline, it's categorized as an "endline" token. Whitespace is characterized as exactly that. Runs of word characters are categorized as identifiers. Operator characters are grouped as operators. And runs of digits are categorized as numbers.

When we run this code, we get a tokenized breakdown of the string. The total variable is chunked as an identifier. Then there is some whitespace. Then an operator. Then more whitespace. Then the number "123". And so on.

code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /\s/ then :whitespace
  when /[[:alpha:]]/ then :identifier
  when /[\+\-\=\/\*]/ then :operator
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:whitespace, [" "]],
#     [:operator, ["="]],
#     [:whitespace, [" "]],
#     [:number, ["1", "2", "3"]],
#     [:whitespace, [" "]],
#     [:operator, ["+"]],
#     [:whitespace, [" "]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:whitespace, [" "]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]

This is exactly the kind of tokenized stream that Ruby breaks code into before performing higher-level parsing and compilation on it.

#chunk has a few more tricks up its sleeve.

Let's say we decide that whitespace is just not interesting to us from the point of view of parsing the code. We can treat whitespace as a separator rather than a target for chunking by returning the special :_separator symbol from our block.

The resulting output is a lot shorter, with all the whitespace dropped.

code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /\s/ then :_separator
  when /[[:alpha:]]/ then :identifier
  when /[\+\-\=\/\*]/ then :operator
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]

Returning nil is treated the same was as returning the :_separator flag.

code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /\s/ then nil
  when /[[:alpha:]]/ then :identifier
  when /[\+\-\=\/\*]/ then :operator
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]

This means that, since a case statement will return nil if nothing matches, we can tighten up this code by omitting the whitespace case entirely.

code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /[[:alpha:]]/ then :identifier
  when /[\+\-\=\/\*]/ then :operator
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]

OK, I think that's enough to chew on for today. By the way, thank you to viewer Julian Vanier for reminding me that I hadn't done an episode on #chunk yet. Happy hacking!

Responses