Tail Part 5: Idiom

In the previous four episodes in this miniseries we came up with a working implementation of a very small subset of the UNIX tail(1) command. The code we wound up with is reasonably compact. It's also very reminiscent of C code. If you went and looked at the source for the GNU implementation of tail(1) you'd find code that looked pretty similar. Aspects like the nesting of loops, and the long do...while loop with a complex condition at the very end are characteristic of UNIX C source code.

It's nice that we can write C in Ruby if we want to, but that's not why we use the language. When we write idiomatic Ruby we use things like blocks and small methods with meaningful names to produce code that reads almost like a narrative description of the solution.

Let's see if we can apply a little Ruby sugar to this code to make it read better than it does, at least at the top level.

First, that big do...while loop, implemented as a begin...end block with a while condition on the end, is a major obstacle to comprehending this code. It's nice that we know how to make a do...while loop in Ruby now, but there's a reason we don't see a lot of this kind of loop in idiomatic Ruby code. It's not a very readable construct. We have to get all the way to the end of the loop body before we discover just what condition the loop has been looping on.

newline_count     = 0
chunk_size        = 512
next_chunk_offset = -chunk_size
file = open('/var/log/syslog.1')
begin
  file.seek(next_chunk_offset, IO::SEEK_END)
  chunk_start_offset = file.tell
  chunk              = file.read(chunk_size)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    newline_count += 1
    break if newline_count > 10
  end
  next_chunk_offset -= chunk_size
end while chunk && chunk_start_offset > 0 && newline_count <= 10

print(chunk[(nl_index+1)..-1])
IO.copy_stream(file, $stdout)
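
As a side illustration of why this construct is easy to misread, here is a minimal sketch of how a trailing while behaves differently depending on what it is attached to. The variable names count and runs are made up for this example and are not part of the tail code.

```ruby
# Trailing `while` on a plain statement: the condition is checked FIRST,
# so the statement never runs when the condition starts out false.
count = 10
count += 1 while count < 3

# Trailing `while` on a begin...end block: the body runs ONCE before the
# condition is checked at all. This is Ruby's do...while form.
runs = 0
begin
  runs += 1
end while runs < 0

puts "count=#{count} runs=#{runs}"
```

The same keyword in the same position gives check-first behavior in one case and run-first behavior in the other, which is part of why the begin...end while form slows readers down.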

So we'll start by extracting this loop out into a method of its own. We copy the entire loop, and paste it into a new method named each_chunk. The method takes one argument, file, which is the file object to read from.

We also pull the chunk_size and next_chunk_offset variables into this method.

Inside the loop body we replace the inner loop with a yield statement, yielding the current chunk to the block given to this method.

This loop is strictly focused on iterating over chunks. It has no knowledge of newlines. So we remove the newline_count part of the loop condition.

Finally, we return an empty string from the method. The reason for this will become clear later.

Back in the main code, we replace the outer loop with a call to each_chunk, passing in the opened file. We capture the result of the call in a variable called start_text. We get rid of all the lines having to do with file seeking and reading. The inner loop remains the same.

The reason we capture start_text is that, now that we're processing chunks in a block, both the chunk variable and the nl_index variable are local to that block. So we can no longer use those variables once the loops are finished. Instead, we add a line that breaks out of the loop with an explicit value when sufficient newlines are found. The value is the part of the current text chunk that contains the tenth-to-last line and onward.

This is also why each_chunk returns an empty string by default - so that in the case where enough newlines are never found, this code will still capture a string value.

Now that we are producing the starting text within the loop code, we can simplify the line that prints this text out.

def each_chunk(file)
  chunk_size        = 512
  next_chunk_offset = -chunk_size
  begin
    file.seek(next_chunk_offset, IO::SEEK_END)
    chunk_start_offset = file.tell
    chunk              = file.read(chunk_size)
    yield(chunk)
    next_chunk_offset -= chunk_size
  end while chunk && chunk_start_offset > 0
  ""
end

newline_count     = 0
file = open('/var/log/syslog.1')
start_text = each_chunk(file) do |chunk|
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    newline_count += 1
    break if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
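
The trick of using break with a value to get data out of a block can be seen in isolation in the following sketch. The each_item method and the fruit names are hypothetical, invented only for this illustration; the shape mirrors each_chunk, including the empty-string default.

```ruby
# Hypothetical helper for illustration: yields each item in turn, then
# falls through to a default return value, just as each_chunk does.
def each_item(items)
  items.each { |item| yield(item) }
  "" # returned only when the caller's block never breaks
end

# `break value` ends the each_item call early, and the call
# evaluates to the value given to break.
found = each_item(%w[apple banana cherry]) do |item|
  break item.upcase if item.start_with?("b")
end

# No item matches, so the block never breaks and the method's own
# last expression (the empty string) becomes the result.
missing = each_item(%w[apple banana cherry]) do |item|
  break item.upcase if item.start_with?("z")
end

puts found.inspect
puts missing.inspect
```

Either way the caller ends up holding a string, so code such as print(start_text) needs no nil check.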

Now we turn our attention to the inner loop. Once again, we copy the code and paste it into a new method. This method is named each_reverse_newline_index. It receives a chunk of text as its sole argument.

We keep the loop itself as-is, but we replace the body, as before, with a yield. This time, we yield the found newline index. We also return nl_index from this method once the loop finishes. If the loop runs to completion that value is nil, because the final search found no newline.

In the main code we replace the loop with a method invocation, passing in the current chunk. We keep the old loop body inside the block, with one modification: if and when sufficient newlines are found, we break with an explicit value: the last newline index. Once again this lets us use the return value of the method to get at a block-internal value once the method is finished.

def each_chunk(file)
  chunk_size        = 512
  next_chunk_offset = -chunk_size
  begin
    file.seek(next_chunk_offset, IO::SEEK_END)
    chunk_start_offset = file.tell
    chunk              = file.read(chunk_size)
    yield(chunk)
    next_chunk_offset -= chunk_size
  end while chunk && chunk_start_offset > 0
  ""
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

newline_count = 0
file = open('/var/log/syslog.1')
start_text = each_chunk(file) do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
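
To make the behavior of each_reverse_newline_index concrete, here is a small sketch that runs it against a literal string instead of a file chunk. The method body is copied from the listing above; the sample text and the variable names seen, result, and early are assumptions made for this example.

```ruby
# Copied from the listing above so this sketch runs on its own.
def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

# Sample text: newlines sit at indices 3 and 7.
text = "abc\ndef\nghi"

# Collect every index the method yields, scanning from the end.
seen = []
result = each_reverse_newline_index(text) { |index| seen << index }

# Break on the first index found, as the main code does once it has
# counted enough newlines.
early = each_reverse_newline_index(text) { |index| break index }

puts seen.inspect
puts result.inspect
puts early.inspect
```

When the loop runs to completion the method returns nil, so nl_index in the main code is only meaningful when the block broke early. One edge case to be aware of: if a chunk has a newline at index 0, the next search position becomes -1, which rindex interprets as the end of the string, so that same newline would be found again.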

There are more refactorings we could do to this code. For one thing, we've only papered over that nasty do...while loop by hiding it away in a method. Anyone who modifies the code is still going to have to contend with it. And there's something a little fishy about having two separate but similar break statements, each checking the newline count.
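
As one possible direction for that first refactoring, and not the approach taken in this episode, the chunk iterator could be restructured so that the loop condition comes first. The following is a sketch under assumptions: the method name each_chunk_from_end and the chunk_size parameter are invented here, the file object is assumed to respond to size, and a StringIO stands in for a real file.

```ruby
require "stringio"

# Hypothetical alternative to each_chunk: walks backward from the end of
# the file using an ordinary while loop, with the condition at the top.
def each_chunk_from_end(file, chunk_size = 512)
  offset = file.size # end of the region not yet read
  while offset > 0
    start = [offset - chunk_size, 0].max # never seek before the start
    file.seek(start)
    yield(file.read(offset - start))
    offset = start
  end
  "" # same default as each_chunk when the block never breaks
end

# A 10-byte stand-in for a file, read in 4-byte chunks.
chunks = []
result = each_chunk_from_end(StringIO.new("abcdefghij"), 4) do |chunk|
  chunks << chunk
end

puts chunks.inspect
puts result.inspect
```

Note that this version behaves differently from each_chunk at the start of the file: the final chunk may be shorter than chunk_size, whereas the original always seeks a full chunk_size further back.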

But on the other hand, we've made the main code more intention-revealing by using semantically named methods. It is now clear even without comments that this code has an outer loop that loops over chunks of text from a file, and an inner loop that looks for newline indices in a given chunk. And I think that's enough for today. Happy hacking!