In Progress
Unit 1, Lesson 1
In Progress

# Chunk

Video transcript & code

Recently I went through my filing cabinet and archived a lot of old files. Each file that I wanted to archive went into a cardboard file box. In the end, I would up with 5 boxes of archived files.

My files are alphabetical, and sometimes I would fill a box while I was still in the middle of one letter. I could have gone for maximum use of space, and just split some letters across two boxes. But I realized that this would have made looking up old files a real chore. I realized that it wasn't enough that the files were alphabetically sorted. I needed to also chunk the files by letter.

Let's take a look at what this would look like in Ruby.

I have a list of files that are already sorted alphabetically.

``````FILES = [
"AcmeCo",
"Beeblebrox",
"Breakfast",
"Cats",
"Chunky Bacon",
"Cthulhu",
"Delicatessens",
"Ducks",
# ...
]
``````

Each filename is different. In order to chunk them, I need to extract the part of the filename which is the same for every member of a chunk. In this case that would be the first letter. Just to visualize this point, lets map over the files and pull out just their first letters.

``````require "./files"

FILES.map{|f| f[0]}
# => ["A", "B", "B", "C", "C", "C", "D", "D"]
``````

To use this block as a chunking criterion, we can simply swap our `map` out for `chunk`.

``````require "./files"

FILES.chunk{|f| f[0]}
# => #<Enumerator: ...>
``````

Well, OK, not quite that simply. Like so many `Enumerable` methods we've looked at on this show, `#chunk` produces an `Enumerator` for maximum flexibility. In order to force the `Enumerator` to fully evaluate into a result, we can send `#to_a` to it.

``````require "./files"

FILES.chunk{|f| f[0]}.to_a
# => [["A", ["AcmeCo"]],
#     ["B", ["Beeblebrox", "Breakfast"]],
#     ["C", ["Cats", "Chunky Bacon", "Cthulhu"]],
#     ["D", ["Delicatessens", "Ducks"]]]
``````

The result, as you can see here, is an array of arrays. Each entry has two parts: first, the key value that was used to split out a chunk. And second, an array of files comprising that chunk.

If you're familiar with the methods that can be found in `Enumerable`, you might be wondering how this differs from `#group_by`.

The most obvious difference is that `#group_by` produces a finished `Hash`, not an `Enumerator` that can generate arrays of arrays. This is because `#group_by` is intended to be run on finished data sets.

``````require "./files"

FILES.group_by{|f| f[0]}
# => {"A"=>["AcmeCo"],
#     "B"=>["Beeblebrox", "Breakfast"],
#     "C"=>["Cats", "Chunky Bacon", "Cthulhu"],
#     "D"=>["Delicatessens", "Ducks"]}
``````

The differences between these methods become more clear when we introduce a new element to the list of files. This file starts with "X", but it has been mis-filed in between the "C"-s.

`#group_by` produces five sections, with the "X" falling after the Cs but before the Ds. On the other hand, when we run `#chunk`, we get six sections. The Cs are split around the interloper "X" file. This is because `#chunk` always preserves order. It is only concerned with chunking together runs of related elements. It is not trying to comprehensively group all same-keyed elements together no matter where they are found.

``````FILES = [
"AcmeCo",
"Beeblebrox",
"Breakfast",
"Cats",
"Chunky Bacon",
"X-Windows",
"Cthulhu",
"Delicatessens",
"Ducks",
# ...
]

FILES.group_by{|f| f[0]}
# => {"A"=>["AcmeCo"],
#     "B"=>["Beeblebrox", "Breakfast"],
#     "C"=>["Cats", "Chunky Bacon", "Cthulhu"],
#     "X"=>["X-Windows"],
#     "D"=>["Delicatessens", "Ducks"]}

FILES.chunk{|f| f[0]}.to_a
# => [["A", ["AcmeCo"]],
#     ["B", ["Beeblebrox", "Breakfast"]],
#     ["C", ["Cats", "Chunky Bacon"]],
#     ["X", ["X-Windows"]],
#     ["C", ["Cthulhu"]],
#     ["D", ["Delicatessens", "Ducks"]]]
``````

This makes `#chunk` useful for streams of data we may not want to read all the way to the end. It also makes it useful as a tool for lexing: breaking text up into tokens.

As an example, here's a string that contains some source code for a simple, Ruby-like language. This code adds two numbers, assigns the result to a variable, and then prints the contents of the variable on the next line.

We can take the characters of the stream, and chunk them. In the chunk block, we have a case statement that switches on each character. If the character is a newline, it's categorized as an "endline" token. Whitespace is characterized as exactly that. Runs of word characters are categorized as identifiers. Operator characters are grouped as operators. And runs of digits are categorized as numbers.

When we run this code, we get a tokenized breakdown of the string. The `total` variable is chunked as an identifier. Then there is some whitespace. Then an operator. Then more whitespace. Then the number "123". And so on.

``````code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
case c
when /\n/ then :endline
when /\s/ then :whitespace
when /[[:alpha:]]/ then :identifier
when /[\+\-\=\/\*]/ then :operator
when /\d/ then :number
end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:whitespace, [" "]],
#     [:operator, ["="]],
#     [:whitespace, [" "]],
#     [:number, ["1", "2", "3"]],
#     [:whitespace, [" "]],
#     [:operator, ["+"]],
#     [:whitespace, [" "]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:whitespace, [" "]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]
``````

This is exactly the kind of tokenized stream that Ruby breaks code into before performing higher-level parsing and compilation on it.

`#chunk` has a few more tricks up its sleeve.

Let's say we decide that whitespace is just not interesting to us from the point of view of parsing the code. We can treat whitespace as a separator rather than a target for chunking by returning the special `:_separator` symbol from our block.

The resulting output is a lot shorter, with all the whitespace dropped.

``````code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
case c
when /\n/ then :endline
when /\s/ then :_separator
when /[[:alpha:]]/ then :identifier
when /[\+\-\=\/\*]/ then :operator
when /\d/ then :number
end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]
``````

Returning `nil` is treated the same was as returning the `:_separator` flag.

``````code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
case c
when /\n/ then :endline
when /\s/ then nil
when /[[:alpha:]]/ then :identifier
when /[\+\-\=\/\*]/ then :operator
when /\d/ then :number
end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]
``````

This means that, since a `case` statement will return `nil` if nothing matches, we can tighten up this code by omitting the whitespace case entirely.

``````code = <<EOF
total = 123 + 456
print total
EOF

code.chars.chunk{|c|
case c
when /\n/ then :endline
when /[[:alpha:]]/ then :identifier
when /[\+\-\=\/\*]/ then :operator
when /\d/ then :number
end
}.to_a
# => [[:identifier, ["t", "o", "t", "a", "l"]],
#     [:operator, ["="]],
#     [:number, ["1", "2", "3"]],
#     [:operator, ["+"]],
#     [:number, ["4", "5", "6"]],
#     [:endline, ["\n"]],
#     [:identifier, ["p", "r", "i", "n", "t"]],
#     [:identifier, ["t", "o", "t", "a", "l"]],
#     [:endline, ["\n"]]]
``````

OK, I think that's enough to chew on for today. By the way, thank you to viewer Julian Vanier for reminding me that I hadn't done an episode on `#chunk` yet. Happy hacking!