Tail Part 6: Process As Object
Video transcript & code
We've been slowly cloning the UNIX
tail(1) utility, and now that our version has the ability to read the last ten lines of a file, we've started refactoring the code to be a bit more Rubyish. Our first stab at this resulted in a pair of methods encapsulating the outer and inner loops of the
tail(1) implementation. The
each_chunk method reads 512-byte chunks of a file, starting from the end, and yields each chunk in turn. The
each_reverse_newline_index method inspects individual chunks of text for newlines in order to determine where one line ends and the next begins.
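To see that backward scan in isolation, here is a small made-up chunk run through the same String#rindex loop (the sample text is ours, not from the episode):

```ruby
# A quick demonstration of the backward newline scan that
# each_reverse_newline_index performs, using String#rindex.
# The chunk here is a made-up sample.
chunk = "alpha\nbeta\ngamma\n"

indices = []
nl_index = nil
# Each pass searches backwards from just before the previous hit.
while (nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
  indices << nl_index
end

indices  # => [16, 10, 5] — newline offsets, last one first
```

Note how the `(nl_index || chunk.size) - 1` expression seeds the first search from the end of the chunk, then resumes each later search just before the previous match.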
When we pulled out the outer loop into the method called
each_chunk, one of our justifications was to hide the complexity of a
do...while loop with multiple loop conditions. This loop starts with a
begin, seeks backwards in the file to the beginning of the next chunk, reads the chunk, and yields it. It then checks whether the last read yielded any data and whether there is any more of the file left to read. So long as both conditions remain true, the loop continues.
```ruby
def each_chunk(file)
  chunk_size = 512
  next_chunk_offset = -chunk_size
  begin
    file.seek(next_chunk_offset, IO::SEEK_END)
    chunk_start_offset = file.tell
    chunk = file.read(chunk_size)
    yield(chunk)
    next_chunk_offset -= chunk_size
  end while chunk && chunk_start_offset > 0
  ""
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

newline_count = 0
file = open('/var/log/syslog.1')
start_text = each_chunk(file) do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
```
We may have given this loop its own method, but it's still kind of ugly. Ideally, we'd rewrite this in a way that didn't require a
do...while loop at all, and which simplified the loop conditions as well.
The biggest reason we have a
do...while loop is that we need to do a multi-line file seek / read operation before we check to see if the loop should continue. If we could combine the loop-condition test with the seek-and-read in a single method, we could put a call to that method at the top of an ordinary while loop. In other words, it would be nice if we could rewrite this method to look more like this:
```ruby
def each_chunk
  while chunk = read_chunk
    yield(chunk)
  end
  ""
end
```
But for that to work,
read_chunk would have to keep track of the state of the read—specifically, where it left off and where the next read should begin—as a side effect of the call.
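Here is a minimal, made-up miniature of that pattern, unrelated to files: an object whose method both returns a value and advances hidden state, so a plain while loop can drive it:

```ruby
# A made-up miniature of the pattern read_chunk needs: each call
# returns the next value and updates internal state as a side effect,
# returning nil when exhausted.
class Countdown
  def initialize(from)
    @n = from
  end

  def next_value
    return nil if @n.zero?   # guard: nothing left
    value = @n
    @n -= 1                  # side effect: advance the state
    value
  end
end

counter = Countdown.new(3)
values = []
while (value = counter.next_value)
  values << value
end
values  # => [3, 2, 1]
```

Because `next_value` returns nil once the state is exhausted, the assignment in the while condition doubles as the termination test — exactly the shape we want for read_chunk.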
When we start to talk about tracking state as a side effect, that suggests it's time to introduce a new object to the system. So let's do that: we'll write a class for tracking a backwards, chunked file read. What do we call it? Well, how about BackwardChunkedFileRead?
We've actually chosen this name quite deliberately, particularly the last word in it. When we come across the need for an object to represent a process, we're often tempted to give it a name ending in -er, such as BackwardChunkedFileReader. A name like this then leads us to try to write an object which can be told to perform its process as many times as necessary, with new inputs every time. E.g. we might give this class a
read method, which takes a file object:
```ruby
class BackwardChunkedFileReader
  def read(file)
    # ...
  end
end
```
There is nothing inherently wrong with these "verb-y" class names. But remember our initial impetus to extract out a class here: we need an object which can track the state of a given backwards pass through the input file. That includes hanging onto intermediate state like the next chunk offset that is only applicable to one traversal of one file.
So instead we call this class
BackwardChunkedFileRead, because it represents a single read-through of a file. Next we give it some accessor methods, which correspond to variables in the code we are extracting. We then define an initializer which takes a file object and an optional chunk size. It initializes instance variables for the file, the chunk size, the intended next chunk offset, and the actual starting offset of the current chunk.
We pull in our make-believe
#each_chunk method. Then we set about implementing the
#read_chunk method it calls.
This method starts out with a guard clause that returns early if the last-read chunk's actual starting offset was zero, meaning it was read from the very beginning of the file. Remember that this variable is set to
nil, not zero, by the initializer, so the method will always complete at least once.
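The nil/zero distinction is easy to miss, so here it is spelled out (a trivial check, not from the episode):

```ruby
# The guard is `return nil if @chunk_start_offset == 0`.
# On the first call the variable is still nil, and nil == 0 is false,
# so the guard never blocks the first read.
first_call_blocked = (nil == 0)  # => false
# After a chunk has been read from the very start of the file,
# the offset is 0 and the guard fires.
later_call_blocked = (0 == 0)    # => true
```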
Next we pull in the seek-and-read logic from our old method. We also pull in the
next_chunk_offset decrement. We update both the
chunk_start_offset and the
next_chunk_offset to reference instance variables. These two variables hold the intermediate state information we need in order to make our ideal
while loop work. Finally, we have the method return the chunk it just read.
That's the end of our new class. To use it, we instantiate it where we formerly called
each_chunk. We pass the file object to the constructor instead, and call
#each_chunk on the new backward chunked read object. Everything else stays the same.
```ruby
class BackwardChunkedFileRead
  attr_reader :file, :chunk_size, :next_chunk_offset

  def initialize(file, chunk_size=512)
    @file = file
    @chunk_size = chunk_size
    @next_chunk_offset = -@chunk_size
    @chunk_start_offset = nil
  end

  def each_chunk
    while chunk = read_chunk
      yield(chunk)
    end
    ""
  end

  def read_chunk
    return nil if @chunk_start_offset == 0
    file.seek(next_chunk_offset, IO::SEEK_END)
    @chunk_start_offset = file.tell
    chunk = file.read(chunk_size)
    @next_chunk_offset -= chunk_size
    chunk
  end
end

def each_reverse_newline_index(chunk)
  while(nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    yield(nl_index)
  end
  nl_index
end

newline_count = 0
file = open('/var/log/syslog.1')
start_text = BackwardChunkedFileRead.new(file).each_chunk do |chunk|
  nl_index = each_reverse_newline_index(chunk) do |index|
    newline_count += 1
    break index if newline_count > 10
  end
  break chunk[(nl_index+1)..-1] if newline_count > 10
end
print(start_text)
IO.copy_stream(file, $stdout)
```
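As a quick sanity check, we can exercise the class against an in-memory "file" using StringIO. The class definition is repeated here so the snippet runs on its own; the data is made up, and its length is an exact multiple of the chunk size, which the seek logic as written assumes:

```ruby
require 'stringio'

# BackwardChunkedFileRead as defined above, repeated so this
# snippet runs standalone.
class BackwardChunkedFileRead
  attr_reader :file, :chunk_size, :next_chunk_offset

  def initialize(file, chunk_size=512)
    @file = file
    @chunk_size = chunk_size
    @next_chunk_offset = -@chunk_size
    @chunk_start_offset = nil
  end

  def each_chunk
    while chunk = read_chunk
      yield(chunk)
    end
    ""
  end

  def read_chunk
    return nil if @chunk_start_offset == 0
    file.seek(next_chunk_offset, IO::SEEK_END)
    @chunk_start_offset = file.tell
    chunk = file.read(chunk_size)
    @next_chunk_offset -= chunk_size
    chunk
  end
end

# An 8-byte in-memory file read in 4-byte chunks, back to front.
file = StringIO.new("abcdefgh")
chunks = []
BackwardChunkedFileRead.new(file, 4).each_chunk { |chunk| chunks << chunk }
chunks  # => ["efgh", "abcd"]
```

The chunks come out in reverse file order — the last 4 bytes first — which is exactly what the newline-counting loop downstream relies on.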
Our new code is substantially larger, but it is no longer plagued by hard-to-read loops. And we now have an abstraction to represent an important domain concept, one that was formerly just implicit: that of a single backwards, chunked read-through of a file.
There are definitely other improvements we could make to this code, as well as new features we could add. But this is a good stopping point for today. Happy hacking!