In Progress
Unit 1, Lesson 1
In Progress

Advanced Chunk

Video transcript & code

In the last episode we looked at how we can do simple lexing operations using Ruby's #chunk method. Today I thought we might continue from where we left off, and look at some more advanced features of the #chunk method.

Let's start by making a change to the code want to parse. Our new version version uses a doubled-up bang operator to force a value to be converted to a boolean. We saw how this works in Ruby code in episode #94.

In order to make this work, first we have to add the bang to the list of operators.

code = <<EOF
bool = !!result
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /[[:alpha:]]/ then :identifier
  when /[\!\+\-\=\/\*]/ then :operator
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["b", "o", "o", "l"]],
#     [:operator, ["="]],
#     [:operator, ["!", "!"]],
#     [:identifier, ["r", "e", "s", "u", "l", "t"]],
#     [:endline, ["\n"]]]

The resulting token stream isn't quite what we want, though. If the language we are parsing is anything like Ruby, a doubled-up bang shouldn't be treated as a single operator.

One way to address this is to change from categorizing these characters as "operators" to using a special flag symbol, :_alone. This symbol tells #chunk that it needs to chunk this element individually, even if the previous or following characters match the same pattern.

This has the desired effect. But it also has the down side that instead of being tagged as operators, all operator characters are now chunked with the less meaningful symbol :_alone. Here we bump up against some of the limitations of #chunk for lexing purposes.

code = <<EOF
bool = !!result
EOF

code.chars.chunk{|c|
  case c
  when /\n/ then :endline
  when /[[:alpha:]]/ then :identifier
  when /[\!\+\-\=\/\*]/ then :_alone
  when /\d/ then :number
  end
}.to_a
# => [[:identifier, ["b", "o", "o", "l"]],
#     [:_alone, ["="]],
#     [:_alone, ["!"]],
#     [:_alone, ["!"]],
#     [:identifier, ["r", "e", "s", "u", "l", "t"]],
#     [:endline, ["\n"]]]

Now let's look at a different problem. The HTTP chunked-encoding spec describes a format for HTTP data that is suitable for streaming delivery. Instead sending a whole document all at once, chunked HTTP connections send a chunk at a time, indefinitely. Each chunk is preceded by a line containing the size of the chunk, specified as a hexadecimal number.

One of the complications of chunked encoding is that the server doesn't have to send a whole chunk at once. It can break chunks down into multiple chunk segments, each one separated by a carriage-return linefeed sequence. Only when the client has collected enough characters to fulfill the chunk size header, should it return the completed chunk to application code.

This seems like a perfect application for the #chunk method. But there's an extra consideration we haven't had top think about before. Up til now, we've been chunking on characteristics of the elements themselves. But now, we need to chunk based on the chunk sizes we receive in chunk headers.

Fortunately, the #chunk method has a way to handle this, as we'll see in a moment.

First, we simulate an incoming IO stream using a StringIO object. Then we get a streaming enumerator over the lines of the input, using #each_line. We specify the carriage-return linefeed sequence as the line-breaking pattern.

Then we send the #chunk message. This time, we add an argument. The argument is a hash, containing some state that we want to maintain while iterating over incoming lines. First, we want to keep track of the current chunk number, starting at zero. And second, we want to track the number of characters remaining in the current chunk. We start this off with a special flag value of :unknown, since until we get a chunk header, we won't know how many characters to expect.

Our chunk block will now receive two parameters instead of just one. The first parameter is the current line being considered. The second is our state hash. This same object will be passed by #chunk to every iteration of the block. Which means we can save state inside this hash, and have it be carried forward to future iterations.

We start the block off by chomping off the line separator, since we don't need it anymore.

Then we check to see if the number of characters left is unknown. If so, we update the value, assuming that the current line is a chunk header and interpreting its contents as a hexadecimal integer. We use the strict Integer() conversion function in order to assert that the line must be a valid integer.

Then we return the special :_separator flag that we learned about in the last episode, to let #chunk know that it shouldn't include this header line in its output.

Next up, if we do have an expected number of characters left in the current chunk, and that number is greater than zero, we decrement the expected character count by the size of the current line. Then we return the current chunk number as the chunking key. Remember, so long as this key stays the same, #chunk will keep adding to the current chunk.

Our next option handles the case where the number of characters remaining in the current chunk is zero. This means we are done with the last chunk, and we are most likely looking at a new chunk header. In this case, we bump up the chunk number. We set the characters-left count to the :unknown flag. Then, instead of letting #chunk proceed to the next line, we use redo to jump back up to the top of the block, with the updated state. This will trigger the first branch, parsing a new chunk count and kicking off another round of chunk collection.

We now have an enumerator over the numbered chunks in this HTTP chunk-encoded stream. We can request chunks one by one, and get back a chunk number and an array of lines.

Receiving input chunks as an array of lines may seem a bit awkward as opposed to a single string. But it's actually a common way to work efficiently with lots of streaming incoming data, because it minimizes the number of string allocations between receiving the data and using it. We can always join the arrays into strings later on if we need to.

require 'stringio'
data = "6\r\nHello!\r\nD\r\n How are\r\n you?\r\n0\r\n"
chunked_data = StringIO.new(data)

chunks = chunked_data
         .each_line("\r\n")
         .chunk({chunk_num: 0, chars_left: :unknown}){
  |line, state|
  line.chomp!("\r\n")
  if state[:chars_left] == :unknown
    state[:chars_left] = Integer(line, 16)
    :_separator
  elsif state[:chars_left] > 0
    state[:chars_left] -= line.size
    state[:chunk_num]
  else
    state[:chunk_num] += 1
    state[:chars_left] = :unknown
    redo
  end
}

chunks.next                     # => [0, ["Hello!"]]
chunks.next                     # => [1, [" How are", " you?"]]

I like this final code a lot, because it's at a relatively high level of abstraction. This is a chunking problem, as is obvious from the name of the protocol. Most code that we might encounter that deals with a chunking problem like this one would consist of multiple nested loops. But here, we say: "Hey Ruby! We've got a chunking problem. I know you know about those. How about we just tell you what the criterion for the chunks is, and you do the rest?". In this case the criterion is a little bit involved. But the code we end up with is still simpler than nested loops.

And that's enough for today. Happy hacking!

Responses