Tail Part 3: #rindex

Video transcript & code

In preceding episodes we wrote some code to search backward from the end of a file, counting lines of text until 10 lines are found. It does this by seeking backwards 512 bytes at a time, reading a chunk of text, and then counting the newlines found in that chunk.

newline_count     = 0
chunk_size        = 512
next_chunk_offset = -chunk_size
file = open('/var/log/syslog')
begin
  # Jump to the start of the next chunk, measured back from the end of the file.
  file.seek(next_chunk_offset, IO::SEEK_END)
  chunk_start_offset = file.tell
  chunk              = file.read(chunk_size)
  # Count the newlines in this chunk (to_s guards against a nil read).
  newline_count     += chunk.to_s.count("\n")
  next_chunk_offset -= chunk_size
end while chunk && chunk_start_offset > 0 && newline_count <= 10
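
One thing to keep in mind with backward seeking: a seek that would land before the start of the file is rejected by the operating system, and Ruby surfaces that as Errno::EINVAL. Here is a rough sketch of one way to walk a file backward in chunks without ever seeking past byte 0. The each_chunk_backward helper and the temp file are made up for illustration; they are not part of the code from the earlier episodes.

```ruby
require "tempfile"

# Hypothetical helper (not from the episode): yields [offset, chunk] pairs,
# walking from the end of the file toward the start. The final read is
# shortened so we never seek to a position before byte 0.
def each_chunk_backward(file, chunk_size = 512)
  position = file.size
  while position > 0
    read_size = [chunk_size, position].min
    position -= read_size
    file.seek(position, IO::SEEK_SET)
    yield position, file.read(read_size)
  end
end

chunks = []
Tempfile.create("tail-demo") do |file|
  file.write("Line 1\nLine 2\nLine 3\n")
  file.flush
  # A tiny chunk size so the 21-byte sample splits into three chunks.
  each_chunk_backward(file, 8) { |offset, chunk| chunks << [offset, chunk] }
end

chunks # => [[13, "\nLine 3\n"], [5, "1\nLine 2"], [0, "Line "]]
```

Notice that the last chunk is only five bytes long, because that is all that was left at the front of the file.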

To be a usable clone of the UNIX tail(1) command, this code needs to do more than just count newlines. It needs to locate the beginning of the first line of text, and dump everything from that point forward to $stdout.

We'll tackle finding the beginning of the starting line first. To do this, we need to be able to search through the string to find the location of a given character. As you would probably guess, Ruby has a set of methods for just this task.

Let's play around with a sample string for a minute. Our example contains three lines, separated by newline characters. If we send the message #index to the string object, providing a single newline as an argument, we get back the index at which that string was first found. To search in reverse, from the end of the string, we can send #rindex instead. Since the string ends with a newline, we get the index of the last character of the string.

s = "Line 1\nLine 2\nLine 3\n"
s.index("\n")                   # => 6
index = s.rindex("\n")          # => 20

To see the part of the string from that newline onward, we can use the subscript operator with a range argument. The start of the range is the newline index, and the end of the range is -1, which denotes the last character of the string. This returns a string containing only a single newline, since the newline was found at the very end of the source string.

In case you're unfamiliar with this syntax, we're taking advantage of the fact that strings can be sliced up using a Range, indicated by double dots, as the argument to the subscript operator. By the way, when I say "subscript", I'm referring to the square brackets ([]).

s[index..-1]                    # => "\n"
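
If it helps, here are a few equivalent ways of slicing the same sample string, side by side. This is just a sketch for comparison; the endless-range form assumes Ruby 2.6 or later.

```ruby
s = "Line 1\nLine 2\nLine 3\n"
index = s.rindex("\n")  # => 20

s[index..-1]            # => "\n"      (inclusive range ending at the last character)
s[index..]              # => "\n"      (endless range, Ruby 2.6 or later)
s[index, 1]             # => "\n"      (start index and length)
s[14...20]              # => "Line 3"  (three dots exclude the end index)
```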

Now let's search backwards further. We call #rindex again, this time with a second argument which is the previously found index minus one. The second argument tells #rindex where to start searching backwards from. Giving it the index preceding the last found newline causes it to search for an earlier newline in the string.

index = s.rindex("\n", index-1) # => 13

This time we get a new index value. Examining the slice of the string from that point forward shows that we've located the last full line in the string.

s[index..-1]                    # => "\nLine 3\n"

If we were going to dump this to the terminal, we probably wouldn't want to include the preceding newline in the dump. To get just the text following the newline, we can slice it starting at index+1.

s[(index+1)..-1]                # => "Line 3\n"

Now, what happens when we can't find any more newlines? In that case, #rindex returns nil.

s = "Line 1\nLine 2\nLine 3\n"
s.rindex("\n", 5)                   # => nil
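
To see the nil return value doing some work, here is a small sketch that walks backward through a string and collects every newline index until #rindex comes up empty. The newline_indexes_backward helper is a made-up name for illustration. It also stops once the start position would drop below zero, because #rindex treats a negative position as counting from the end of the string.

```ruby
# Hypothetical helper: collects the index of every newline in +string+,
# scanning from the end toward the beginning.
def newline_indexes_backward(string)
  indexes = []
  start   = string.size - 1
  # Stop when #rindex returns nil, or when there is nothing left to the
  # left of the previous match (a negative start would wrap around).
  while start >= 0 && (found = string.rindex("\n", start))
    indexes << found
    start = found - 1
  end
  indexes
end

newline_indexes_backward("Line 1\nLine 2\nLine 3\n") # => [20, 13, 6]
newline_indexes_backward("no newlines here")         # => []
```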

While we're talking about #index and #rindex, it's worth pointing out that we can search for more than just strings—they can take regular expression arguments as well. We won't use that capability today, however.

s.index(/Line [2-3]/)           # => 7
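
The same goes for searching in reverse. As a quick sketch using the sample string from before, #rindex with a pattern returns the starting index of the last match, and the optional second argument still limits where a match may begin.

```ruby
s = "Line 1\nLine 2\nLine 3\n"

s.rindex(/Line \d/)      # => 14  (last match starts at "Line 3")
s.rindex(/Line \d/, 13)  # => 7   (a match must start at index 13 or earlier)
s.index(/\d/, 7)         # => 12  (first digit at or after index 7)
```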

Equipped with this information, we can proceed to write an inner loop to search backwards for newlines in a given chunk. We construct a while loop that continues as long as #rindex returns a non-nil value. At each iteration, it assigns the result of the #rindex invocation to a local variable called nl_index. The starting offset for #rindex is either the last newline index minus one (that is, the index of the character preceding the newline) or, if no newlines have been found yet, the index of the last character of the chunk. One subtlety: a negative starting offset counts from the end of the string, so if a newline turns up at index 0 we stop there rather than searching again from -1.

Inside the loop, we increment the newline count. Then, if the new count indicates that we have found all the lines we were looking for, we break out early.

newline_count     = 0
chunk_size        = 512
next_chunk_offset = -chunk_size
file = open('/var/log/syslog.1')
begin
  # Seek backward from the end of the file and read one chunk.
  file.seek(next_chunk_offset, IO::SEEK_END)
  chunk_start_offset = file.tell
  chunk              = file.read(chunk_size)
  # Search the chunk from its end toward its start. Stop at index 0:
  # a start offset of -1 would send #rindex back to the end of the string.
  nl_index = nil
  while nl_index != 0 && (nl_index = chunk.rindex("\n", (nl_index || chunk.size) - 1))
    newline_count += 1
    break if newline_count > 10
  end
  next_chunk_offset -= chunk_size
end while chunk && chunk_start_offset > 0 && newline_count <= 10

newline_count                   # => 11
nl_index                        # => 32
puts chunk[(nl_index+1)..-1]
# >> Feb 17 00:22:16 hazel NetworkManager[1454]: <info> Activation (eth0) Stage 4 of 5 (IPv6 Configure Timeout) complete.
# >> Feb 17 00:22:17 hazel ntpdate[10333]: adjust time server 91.189.94.4 offset -0.045899 sec
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> (wlan0): IP6 addrconf timed out or failed.
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure Timeout) scheduled...
# >> Feb 17 00:22:22 hazel NetworkManager[1454]: <info> Activa

A quick check shows that after execution, newline_count and nl_index have been populated, and that nl_index+1 is aligned with the start of a line of text in the input file.
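
If you want to experiment with the inner loop without touching a real log file, here is one way it might be pulled out into a method and run against an in-memory string. The count_newlines_backward name and its arguments are my own invention for this sketch. It takes the running count so that it could be called once per chunk, and for ten lines of a file that ends in a newline you would pass 11 as the number of newlines wanted.

```ruby
# Hypothetical helper, not part of the episode's code: counts newlines from
# the end of +chunk+ toward its start, adding to the running +count+, and
# stops once +wanted+ newlines have been seen in total.
# Returns [count, index of the last newline found in this chunk, or nil].
def count_newlines_backward(chunk, count, wanted)
  nl_index = nil
  start    = chunk.size - 1
  # A negative start would make #rindex count from the end of the string,
  # so stop once there is nothing left before the previous match.
  while start >= 0 && (found = chunk.rindex("\n", start))
    nl_index = found
    count   += 1
    break if count >= wanted
    start = found - 1
  end
  [count, nl_index]
end

chunk = "Line 1\nLine 2\nLine 3\n"
count, nl_index = count_newlines_backward(chunk, 0, 2)
count                      # => 2
nl_index                   # => 13
chunk[(nl_index + 1)..-1]  # => "Line 3\n"
```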

We've managed to pinpoint the beginning of the tenth-to-last line. Now all we need to do is dump from there to the end of the file to $stdout, as efficiently as possible. We'll tackle that next time around. Until then, happy hacking!
