In Progress
Unit 1, Lesson 1
In Progress

Chunk While

Video transcript & code

Awhile back, in episodes #290 and #291, we explored the Enumerable#chunk method. In the recently-released 2.3 version, Ruby has added a new cousin to this method called #chunk_while. I thought we might take a look at it today.

Let's say we're analyzing some log data.

In order to make sense of the data, we'd like to group the log entries into runs of related events.

I, [2015-12-28T16:13:36.225741 #12345]  INFO -- : Cat belled
I, [2015-12-28T16:13:36.581504 #12345]  INFO -- : Pipes flushed
I, [2015-12-28T16:13:37.181864 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:13:37.319839 #12345]  INFO -- : Cream whipped
I, [2015-12-28T16:13:51.923681 #12345]  INFO -- : Pipes flushed
I, [2015-12-28T16:13:52.016141 #12345]  INFO -- : Cream whipped
I, [2015-12-28T16:13:52.851744 #12345]  INFO -- : Entropy reversed
I, [2015-12-28T16:13:53.014513 #12345]  INFO -- : Wood chucked
I, [2015-12-28T16:13:53.544517 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:15:52.070590 #12345]  INFO -- : Entropy reversed
I, [2015-12-28T16:15:52.519846 #12345]  INFO -- : Stars lighted
I, [2015-12-28T16:15:53.327443 #12345]  INFO -- : Bacon chunked
I, [2015-12-28T16:15:53.746008 #12345]  INFO -- : Pillows plumped
I, [2015-12-28T16:16:44.241876 #12345]  INFO -- : Lemons juiced
I, [2015-12-28T16:16:44.867397 #12345]  INFO -- : Stars lighted
I, [2015-12-28T16:16:45.629143 #12345]  INFO -- : Entropy reversed
I, [2015-12-28T16:16:46.556990 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:16:47.453171 #12345]  INFO -- : Pillows plumped
I, [2015-12-28T16:17:06.726179 #12345]  INFO -- : Pillows plumped
I, [2015-12-28T16:17:06.816256 #12345]  INFO -- : Wood chucked
I, [2015-12-28T16:17:07.774405 #12345]  INFO -- : Cream whipped
I, [2015-12-28T16:17:47.759721 #12345]  INFO -- : Bacon chunked
I, [2015-12-28T16:17:48.463404 #12345]  INFO -- : Pipes flushed
I, [2015-12-28T16:17:48.702728 #12345]  INFO -- : Wood chucked
I, [2015-12-28T16:17:49.519692 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:17:49.946548 #12345]  INFO -- : Cat belled
I, [2015-12-28T16:18:17.362492 #12345]  INFO -- : Lemons juiced
I, [2015-12-28T16:18:17.891629 #12345]  INFO -- : Entropy reversed
I, [2015-12-28T16:18:18.658450 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:18:50.696795 #12345]  INFO -- : Wood chucked
I, [2015-12-28T16:18:51.322421 #12345]  INFO -- : Steam buckets empty
I, [2015-12-28T16:18:51.704837 #12345]  INFO -- : Bacon chunked
I, [2015-12-28T16:18:52.067990 #12345]  INFO -- : Stars lighted

What constitutes a related event? Well, we're not quite sure. Unfortunately, this log data isn't tagged with any kind of request ID metadata that we could use to tie entries together.

One thing we've noticed, however, is that log entries tend to come in bursts. And a given burst of messages often corresponds to a single system event, like a user request being processed. In between the bursts are long stretches of time without any log messages at all.

It seems like it might be helpful to group log messages together into bursts, and then review them one by one. The only problem is, how do we group them?

This seems a lot like the problems we used the #chunk method on. Except for one big difference: in those cases, we had a property that was the same across runs of related entries. But in this case, there is no such identical property. Instead, runs of entries are related by small differences in time, and delimited by large gaps in time.

Instead of #chunk, let's apply the #chunk_while method.

We start by requiring the time library.

As we learned back in episode #104, this gives us access to timestamp parsing abilities, which we'll need for reading the log entry timestamps.

Then we read in the lines of the log file as an array.

Next we divide the lines into chunks. We send the #chunk_while message to the entries array.

The block we pass to #chunk_while will receive two arguments at a time: a "before" item, and an "after" item.

This block will be called for every consecutive pair of elements in the data set. The idea is that for every pair of elements, our block gets a chance to decide whether there should be a chunk boundary between the two. If the two elements should be kept together, the block should return true. If the elements represent a split point, the block should return false.

In order to make our determination, we need to compare log entry times. To do that, we parse the timestamps out of each line using Time.parse and a string subscript to pull out just the timestamp column. We do this for both the "before" and "after" entry.

Next, we find out how much time elapsed from one entry to the next by subtracting the first timestamp from the second.

When we subtract Ruby Time objects, the result is a possibly fractional number of seconds.

We check to see if the elapsed time between events was less than one second. If it was less, that means these events are probably related. If it was more, we'll consider them unrelated and begin a new chunk.

The result of this operation is an enumerator.

Each chunk in the enumerator is an array of lines, as we can see if we peek at the first one.

Let's format the chunks visually. We'll go through each one, printing the lines and then printing a blank line in between.

And here's the result: chunks of the log, broken up by their relationship in time.

The chunks are not all of equal length. Some of the runs are more the second in total elapsed time. But the length of time between each entry is less than a second. Between these chunks, there are larger gaps: tens of seconds or more.

require "time"
entries = File.readlines("events.log")
chunks = entries.chunk_while{|entry_before, entry_after|
  timestamp_before = Time.parse(entry_before[4..29])
  timestamp_after  = Time.parse(entry_after[4..29])
  timestamp_after - timestamp_before < 1.0
}
# => #<Enumerator: #<Enumerator::Generator:0x00555aaeb05dc0>:each>
chunks.first
# => ["I, [2015-12-28T16:13:36.225741 #12345]  INFO -- : Cat belled\n", "I, [...
chunks.each do |chunk|
  puts *chunk
  puts
end

# >> I, [2015-12-28T16:13:36.225741 #12345]  INFO -- : Cat belled
# >> I, [2015-12-28T16:13:36.581504 #12345]  INFO -- : Pipes flushed
# >> I, [2015-12-28T16:13:37.181864 #12345]  INFO -- : Steam buckets empty
# >> I, [2015-12-28T16:13:37.319839 #12345]  INFO -- : Cream whipped
# >>
# >> I, [2015-12-28T16:13:51.923681 #12345]  INFO -- : Pipes flushed
# >> I, [2015-12-28T16:13:52.016141 #12345]  INFO -- : Cream whipped
# >> I, [2015-12-28T16:13:52.851744 #12345]  INFO -- : Entropy reversed
# >> I, [2015-12-28T16:13:53.014513 #12345]  INFO -- : Wood chucked
# >> I, [2015-12-28T16:13:53.544517 #12345]  INFO -- : Steam buckets empty
# >>
# >> I, [2015-12-28T16:15:52.070590 #12345]  INFO -- : Entropy reversed
# >> I, [2015-12-28T16:15:52.519846 #12345]  INFO -- : Stars lighted
# >> I, [2015-12-28T16:15:53.327443 #12345]  INFO -- : Bacon chunked
# >> I, [2015-12-28T16:15:53.746008 #12345]  INFO -- : Pillows plumped
# >>
# >> I, [2015-12-28T16:16:44.241876 #12345]  INFO -- : Lemons juiced
# >> I, [2015-12-28T16:16:44.867397 #12345]  INFO -- : Stars lighted
# >> I, [2015-12-28T16:16:45.629143 #12345]  INFO -- : Entropy reversed
# >> I, [2015-12-28T16:16:46.556990 #12345]  INFO -- : Steam buckets empty
# >> I, [2015-12-28T16:16:47.453171 #12345]  INFO -- : Pillows plumped
# >>
# >> I, [2015-12-28T16:17:06.726179 #12345]  INFO -- : Pillows plumped
# >> I, [2015-12-28T16:17:06.816256 #12345]  INFO -- : Wood chucked
# >> I, [2015-12-28T16:17:07.774405 #12345]  INFO -- : Cream whipped
# >>
# >> I, [2015-12-28T16:17:47.759721 #12345]  INFO -- : Bacon chunked
# >> I, [2015-12-28T16:17:48.463404 #12345]  INFO -- : Pipes flushed
# >> I, [2015-12-28T16:17:48.702728 #12345]  INFO -- : Wood chucked
# >> I, [2015-12-28T16:17:49.519692 #12345]  INFO -- : Steam buckets empty
# >> I, [2015-12-28T16:17:49.946548 #12345]  INFO -- : Cat belled
# >>
# >> I, [2015-12-28T16:18:17.362492 #12345]  INFO -- : Lemons juiced
# >> I, [2015-12-28T16:18:17.891629 #12345]  INFO -- : Entropy reversed
# >> I, [2015-12-28T16:18:18.658450 #12345]  INFO -- : Steam buckets empty
# >>
# >> I, [2015-12-28T16:18:50.696795 #12345]  INFO -- : Wood chucked
# >> I, [2015-12-28T16:18:51.322421 #12345]  INFO -- : Steam buckets empty
# >> I, [2015-12-28T16:18:51.704837 #12345]  INFO -- : Bacon chunked
# >> I, [2015-12-28T16:18:52.067990 #12345]  INFO -- : Stars lighted
# >>

Of course, we could have done all this with other Enumerable methods, or even with some clever use of the original #chunk method. But this version took significantly less effort, and allowed us to express our intent pretty clearly: we said we wanted runs of log entries separated by less than a second of dead quiet time, and that's exactly what we got. Happy hacking!

Responses