Tail Part 6: Process As Object

Video transcript & code

We've been slowly cloning the UNIX tail(1) utility, and now that our version can read the last ten lines of a file, we've started refactoring the code to be a bit more Rubyish. Our first stab at this resulted in a pair of methods encapsulating the outer and inner loops of the tail(1) implementation. The each_chunk method reads 512-byte chunks of a file, starting from the end, and yields each chunk in turn. The each_reverse_newline_index method inspects individual chunks of text for newlines in order to determine where one line ends and the next begins.

When we pulled the outer loop out into the method called each_chunk, one of our justifications was to hide the complexity of a do...while loop with multiple loop conditions. This loop starts with a begin, seeks backwards in the file to the beginning of the next chunk, reads the chunk, and yields it. It then checks whether the last read yielded any data, and whether there is any file left to read before the current chunk. So long as both of these conditions remain true, the loop continues.

def each_chunk(file)
  chunk_size        = 512
  next_chunk_offset = -chunk_size
  begin
    file.seek(next_chunk_offset, IO::SEEK_END)
    chunk_start_offset = file.tell
    chunk              = file.read(chunk_size)
    yield(chunk)
    next_chunk_offset -= chunk_size
  end while chunk && chunk_start_offset > 0
  ""
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

newline_count = 0
file = open('/var/log/syslog.1')
start_text = each_chunk(file) do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
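Before moving on, it's worth sanity-checking this version. The snippet below is our own harness, not part of the video: it reproduces the two methods verbatim and drives them with an in-memory StringIO instead of /var/log/syslog.1, so it runs standalone. We capture start_text rather than printing it so we can inspect the result.

```ruby
require 'stringio'

# The two extracted methods, reproduced verbatim so this check runs standalone.
def each_chunk(file)
  chunk_size        = 512
  next_chunk_offset = -chunk_size
  begin
    file.seek(next_chunk_offset, IO::SEEK_END)
    chunk_start_offset = file.tell
    chunk              = file.read(chunk_size)
    yield(chunk)
    next_chunk_offset -= chunk_size
  end while chunk && chunk_start_offset > 0
  ""
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

# 100 numbered lines -- comfortably more than one 512-byte chunk.
sample = (1..100).map { |i| "line #{i}\n" }.join
file   = StringIO.new(sample)

newline_count = 0
start_text = each_chunk(file) do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end

start_text  # => "line 91\n" through "line 100\n"
```

Note that this harness, like the tail program itself, relies on the break firing within the first chunk read, which is the normal case: ten newlines almost always fit inside the last 512 bytes of a log file.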

We may have given this loop its own method, but it's still kind of ugly. Ideally, we'd rewrite this in a way that didn't require a do...while loop at all, and which simplified the loop conditions as well.

The biggest reason we have a do...while loop is that we need to do a multi-line seek-and-read operation before we can check whether the loop should continue. If only we could combine the loop-condition test with the seek-and-read in a single method, we could put a call to that method at the top of an ordinary while loop. In other words, it would be nice if we could rewrite this method to look more like this:

def each_chunk
  while chunk = read_chunk
    yield(chunk)
  end
  ""
end

But for that to work, read_chunk would have to keep track of the state of the read—specifically, where it left off and where the next read should begin—as a side effect of the call.

When we start to talk about tracking state as a side effect, that suggests it's time to introduce a new object to the system. So let's do that: we'll write a class for tracking a backwards, chunked, file read. What do we call it? Well, how about BackwardChunkedFileRead?

We've actually chosen this name quite deliberately, particularly its last word. When we come across the need for an object to represent a process, we're often tempted to give it a name ending in -er: OverdueNotifier, AccountReconciler, or BackwardChunkedFileReader. A name like this then leads us to try to write an object which can be told to perform its process as many times as necessary, with new inputs each time. E.g. we might give this class a read method which takes a file object:

class BackwardChunkedFileReader
  def read(file)
    # ...
  end
end

There is nothing inherently wrong with these "verb-y" class names. But remember our initial impetus for extracting a class here: we need an object which can track the state of a single backwards pass through the input file. That includes hanging onto intermediate state, like the next chunk offset, that is only applicable to one traversal of one file.

So instead we call this class BackwardChunkedFileRead, because it represents a single read-through of a file. Next we give it some accessor methods, which correspond to variables in the code we are extracting. We then define an initializer which takes a file object and an optional chunk size. It initializes instance variables for the file, the chunk size, the intended next chunk offset, and the actual starting offset of the current chunk.

We pull in our make-believe #each_chunk method. Then we set about implementing the #read_chunk method it calls.

This method starts out with a guard clause that returns early if the last-read chunk's actual starting offset was zero, meaning it was read from the very beginning of the file. Remember that the initializer sets this variable to nil, not zero, so the first call will always get past the guard and perform a read.

Next we pull in the seek-and-read logic from our old method. We also pull in the next_chunk_offset decrement. We update both the chunk_start_offset and the next_chunk_offset to reference instance variables. These two variables hold the intermediate state information we need in order to make our ideal while loop work. Finally, we have the method return the chunk it just read.

That's the end of our new class. To use it, we instantiate it where we formerly called each_chunk. We pass the file object to the constructor instead, and call #each_chunk on the new backward chunked read object. Everything else stays the same.

class BackwardChunkedFileRead
  attr_reader :file, :chunk_size, :next_chunk_offset

  def initialize(file, chunk_size=512)
    @file               = file
    @chunk_size         = chunk_size
    @next_chunk_offset  = -@chunk_size
    @chunk_start_offset = nil
  end

  def each_chunk
    while chunk = read_chunk
      yield(chunk)
    end
    ""
  end

  def read_chunk
    return nil if @chunk_start_offset == 0
    file.seek(next_chunk_offset, IO::SEEK_END)
    @chunk_start_offset = file.tell
    chunk = file.read(chunk_size)
    @next_chunk_offset -= chunk_size
    chunk
  end
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

newline_count = 0
file = open('/var/log/syslog.1')
start_text = BackwardChunkedFileRead.new(file).each_chunk do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
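As a quick check that the new object behaves, the sketch below (our own harness, not from the video) exercises BackwardChunkedFileRead directly with a StringIO and a deliberately tiny chunk size, so we can watch the chunks come back in reverse order and the guard clause end the loop:

```ruby
require 'stringio'

# The class from above, reproduced so this check runs standalone.
class BackwardChunkedFileRead
  attr_reader :file, :chunk_size, :next_chunk_offset

  def initialize(file, chunk_size=512)
    @file               = file
    @chunk_size         = chunk_size
    @next_chunk_offset  = -@chunk_size
    @chunk_start_offset = nil
  end

  def each_chunk
    while chunk = read_chunk
      yield(chunk)
    end
    ""
  end

  def read_chunk
    return nil if @chunk_start_offset == 0
    file.seek(next_chunk_offset, IO::SEEK_END)
    @chunk_start_offset = file.tell
    chunk = file.read(chunk_size)
    @next_chunk_offset -= chunk_size
    chunk
  end
end

# Eight bytes with a chunk size of four: exactly two chunks.
read   = BackwardChunkedFileRead.new(StringIO.new("AAAABBBB"), 4)
chunks = []
read.each_chunk { |chunk| chunks << chunk }

chunks          # => ["BBBB", "AAAA"] -- the end of the file comes back first
read.read_chunk # => nil -- the guard clause sees the read is finished
```

Notice there is no compound loop condition anywhere in sight: each_chunk is a plain while loop, and termination lives entirely inside read_chunk, exactly as we hoped.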

Our new code is substantially larger, but it is no longer plagued by hard-to-read loops. And we now have an abstraction to represent an important domain concept, one that was formerly just implicit: that of a BackwardChunkedFileRead.

There are definitely other improvements we could make to this code, as well as new features we could add. But this is a good stopping point for today. Happy hacking!