Unit 1, Lesson 1

Tail Part 4: copy

Video transcript & code

When we left our re-implementation of the UNIX tail(1) command, we had found a way to backtrack through a file, a chunk at a time, and then locate the beginning of the tenth-to-last line. We located the line by finding the newline characters in each chunk.

newline_count     = 0
chunk_size        = 512
next_chunk_offset = -chunk_size   # measured backwards from the end of the file
file = open('/var/log/syslog.1')
begin
  file.seek(next_chunk_offset, IO::SEEK_END)
  chunk_start_offset = file.tell
  chunk              = file.read(chunk_size)
  # Scan backwards through the chunk, counting newlines as we go
  while (nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    newline_count += 1
    break if newline_count > 10
  end
  next_chunk_offset -= chunk_size
end while chunk && chunk_start_offset > 0 && newline_count <= 10

Now we need to print the last ten lines to the console. When the loop finishes, we have a chunk of text in memory and an offset marking where the tenth-to-last line begins. So an obvious place to start would be to dump that chunk to $stdout, starting at the offset.

newline_count                   # => 11
nl_index                        # => 32
puts chunk[(nl_index+1)..-1]
# >> Feb 17 00:22:16 hazel NetworkManager[1454]: <info> Activation (eth0) Stage 4 of 5 (IPv6 Configure Timeout) complete.
# >> Feb 17 00:22:17 hazel ntpdate[10333]: adjust time server 91.189.94.4 offset -0.045899 sec
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> (wlan0): IP6 addrconf timed out or failed.
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure Timeout) scheduled...
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> Activa

This is great, but we can see where the chunk abruptly cuts off. What about the chunks we already went through and discarded?

One way we might go about it is to redo our loop so that it saves references to the chunks of text somewhere as it reads them. Then we could just go down the list of previously read chunks.

But before we launch into it, consider the downside of this approach. Supposing we accidentally fed a 1 gigabyte binary file into our version of tail(1), and that file happened not to contain any newlines. Because it would keep searching back to the very beginning of the file, and because it would save every chunk it read in memory, the process would quickly grow to more than a gig in RAM. As a general rule, we prefer to avoid solutions with the potential to grow in memory as a direct function of the input size.
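To make the memory problem concrete, here is a rough sketch of that chain-of-chunks approach (the method name, the StringIO stand-in, and the break condition are ours, for illustration only):

```ruby
require 'stringio'

# Sketch of the rejected idea: scan backwards and keep a reference to
# *every* chunk. If the input contains no newlines, `chunks` ends up
# holding the entire file in memory.
def backwards_chunks(io, chunk_size)
  io.seek(0, IO::SEEK_END)
  pos    = io.tell                       # start at the end of the file
  chunks = []
  while pos > 0
    read_size = [chunk_size, pos].min
    pos -= read_size
    io.seek(pos, IO::SEEK_SET)
    chunks.unshift(io.read(read_size))   # memory grows with every chunk
    break if chunks.join.count("\n") > 10
  end
  chunks
end

# A newline-free "binary" input forces a scan of the whole thing:
data = StringIO.new("x" * 2048)
p backwards_chunks(data, 512).size  # => 4 -- every chunk retained
```

With a 1 GB newline-free input, that array would hold all two million chunks before the loop ever terminated.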

Let's take a minute to go over the state of the program at the end of this loop. We've searched backwards, chunk by chunk, through the file. We have one current chunk in memory. Reading that chunk in moved the file object's offset to the end of that chunk in the file. Another way to say that is that the file offset is currently positioned where the current chunk leaves off.
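We can check that invariant with a throwaway file (the Tempfile and its contents here are ours, purely for demonstration):

```ruby
require 'tempfile'

# Sanity check: after #read, the file offset sits exactly at the end
# of the chunk we just read.
Tempfile.create('tail-demo') do |file|
  file.write("line 1\nline 2\nline 3\n")
  file.flush

  chunk_size = 7
  file.seek(-chunk_size, IO::SEEK_END)
  chunk_start_offset = file.tell          # => 14
  chunk = file.read(chunk_size)           # => "line 3\n"

  file.tell == chunk_start_offset + chunk.bytesize  # => true
end
```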

Which means that in order to write out the rest of the lines to $stdout, all we need to do is read forward from the current file offset and print the result!

The most obvious way to do that would be to simply call #read on the file object, reading the rest of the file into one string, and then print out that string. Unfortunately, this has the same problem as our chain-of-chunks idea earlier: it grows memory use linearly with the size of the remaining section of the file.

print(file.read)

If we want to do the dump efficiently, we need to do it by chunks, the same way we originally read the data in. We can do this with a loop that reads and writes chunks of data until the source file is exhausted.

while data = file.read(chunk_size)
  print(data)
end

But there's an easier and more efficient way to do this. Ruby 1.9 introduced the IO.copy_stream method. We invoke it with a source and a destination, each of which can be an IO object or a filename. It copies data from one to the other in an efficient fashion, and it respects the current offset of the source, so the read will begin at the position we want it to.

IO.copy_stream(file, $stdout)
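To see that copy_stream really does start from the source's current offset, here is a small demonstration with a StringIO standing in for our log file:

```ruby
require 'stringio'

# copy_stream begins reading at the source's current offset, so
# seeking first lets us skip the part we don't want.
source = StringIO.new("discarded\nkept line 1\nkept line 2\n")
source.seek(10)                 # skip past "discarded\n" (10 bytes)
dest = StringIO.new
IO.copy_stream(source, dest)
dest.string  # => "kept line 1\nkept line 2\n"
```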

This isn't just a reduction in the amount of code. In my own informal benchmarking I saw close to an order of magnitude improvement in performance when using copy_stream instead of my own handwritten copy loop.
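Your numbers will differ, but a rough benchmark along these lines is easy to set up (this harness is ours, not the episode's; it copies a ~10 MB temp file to the null device both ways):

```ruby
require 'benchmark'
require 'tempfile'

# Informal comparison: handwritten copy loop vs. IO.copy_stream.
# Absolute timings will vary by machine and Ruby version.
Tempfile.create('copy-bench') do |source|
  source.write("x" * 10_000_000)   # ~10 MB of throwaway data
  source.flush

  loop_time = Benchmark.realtime do
    source.rewind
    File.open(File::NULL, 'w') do |dest|
      while data = source.read(512)
        dest.print(data)
      end
    end
  end

  stream_time = Benchmark.realtime do
    source.rewind
    File.open(File::NULL, 'w') do |dest|
      IO.copy_stream(source, dest)
    end
  end

  puts format("loop: %.3fs  copy_stream: %.3fs", loop_time, stream_time)
end
```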

The next time we revisit this code, we'll look into cleaning it up and making it a bit more like idiomatic Ruby, and less like C code. But for now, we can be happy knowing that we've successfully cloned a small piece of the functionality of tail(1). Happy hacking!
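As a parting sketch, here is one way the whole thing might be packaged up as a method (this arrangement is ours — the episode's code is a plain script — and, like the original, it assumes the file ends with a newline):

```ruby
require 'stringio'

# Hypothetical packaging of the episode's technique: scan backwards,
# chunk by chunk, for the newline before the Nth-from-last line, then
# let IO.copy_stream dump the remainder efficiently.
def tail(file, line_count = 10, dest: $stdout, chunk_size: 512)
  file.seek(0, IO::SEEK_END)
  offset        = file.tell   # total file size to start with
  newline_count = 0
  nl_index      = nil

  while offset > 0 && newline_count <= line_count
    read_size = [chunk_size, offset].min
    offset   -= read_size
    file.seek(offset, IO::SEEK_SET)
    chunk = file.read(read_size)
    # Scan backwards through the chunk, counting newlines
    search_end = chunk.size
    while search_end > 0 && (nl_index = chunk.rindex("\n", search_end - 1))
      newline_count += 1
      break if newline_count > line_count
      search_end = nl_index
    end
  end

  # Position just after the newline preceding the wanted lines, or at
  # the start of the file if it has fewer lines than requested.
  start = newline_count > line_count ? offset + nl_index + 1 : 0
  file.seek(start, IO::SEEK_SET)
  IO.copy_stream(file, dest)
end

log = StringIO.new((1..20).map { |i| "line #{i}\n" }.join)
out = StringIO.new
tail(log, 3, dest: out)
out.string  # => "line 18\nline 19\nline 20\n"
```

StringIO is used here only so the sketch is self-contained; a File opened on a real log works the same way.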
