In Progress
Unit 1, Lesson 21
In Progress

Last Match

Video transcript & code

Let's say we have a log file.

shodan: INFO Mining laser maintenance cycle finished
shodan: INFO Reactor nominal
shodan: WARN Unverified access
shodan: INFO Grove climate control day cycle
shodan: WARN Ethics node offline
shodan: ERROR Notification of station IT failed
shodan: INFO Science level security protocols updated
shodan: ERROR Uncontrolled neural net reconfiguration
shodan: ERROR Safety protocols inconsistent with updated mandates
shodan: FATAL Logging subsystem compromi

We want to filter it and simplify it, so that the output looks like this:

* Notification of station IT failed
* Uncontrolled neural net reconfiguration
* Safety protocols inconsistent with updated mandates
! Logging subsystem compromi

Only error and fatal-level messages are included. And the header fields have been simplified to a one-character sigil.

Here's a first crack at this problem.

First, we set up a table of shorter sigils for logging levels.

Then, we open up the log, and iterate over each line.

We grep through the lines for error or fatal-level log messages, as we learned about in episode #215.

For each mapped line, we then use regular expressions to pull out both the log level, and the message component. Here, we're using regular expressions in string subscripts, which we learned about in episode #99.

Then we output the sigil for the given log level, a space, and the message.

This gives us the output we want.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/ERROR|FATAL/).map{|line|
  level   = line[/ERROR|FATAL/]
  message = line[/(ERROR|FATAL) (.*)/, 2]
  "#{sigils[level]} #{message}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

But it's not really ideal. For one thing, we use roughly the same regular expression three different times. What if we decide we also want to include "warning" level messages in the results? We'd have to change regular expressions on three different lines. That's not very DRY.

Let's do some work on this code, and see if we can make it both more succinct, and less repetitive.

First off, we don't actually need that #map. As we learned in episode #215, #grep itself can optionally call a block for each matching line.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/ERROR|FATAL/) {|line|
  level   = line[/ERROR|FATAL/]
  message = line[/(ERROR|FATAL) (.*)/, 2]
  "#{sigils[level]} #{message}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

Instead of two separate regular expression matches inside the block, we can just do a single match and save the matchdata in a local variable.

Then, since we've grouped the log level and message parts of the regular expression into capture groups, we can just pull out each part using capture group subscripts.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/ERROR|FATAL/) {|line|
  match   = line.match(/(ERROR|FATAL) (.*)/)
  level   = match[1]
  message = match[2]
  "#{sigils[level]} #{message}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

This brings us down from three regular expressions, to two. Now, can we get it down to just one?

Before we do, there are some things we need to understand about regular expression matching in Ruby.

Here's a single line from the logfile.

Let's match that line against a regular expression.

The result when using the match operator is just to return the integer offset of the start of the match.

You might think this means that all other information about the match is lost. But in fact, Ruby keeps regular expression match data around after each regular expression match.

We can see it by sending the last_match message to the Regexp class.

The result is a MatchData object. We can interrogate this object for capture groups one and two using subscripts.

line = "shodan: ERROR Notification of station IT failed"
line =~ /(ERROR|FATAL) (.*)/      # => 8
match = Regexp.last_match         # => #<MatchData "ERROR Notification of station IT failed" 1:"ERROR" 2:"Notification of station IT failed">
match[1]                          # => "ERROR"
match[2]                          # => "Notification of station IT failed"

So this means that in our log filtering code, we can put our fancy grouping regular expression in the arguments to #grep.

Then, inside the block, instead of matching again we can just reference the last_match.

This gives us the output we want.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/(ERROR|FATAL) (.*)/) {|line|
  match   = Regexp.last_match
  level   = match[1]
  message = match[2]
  "#{sigils[level]} #{message}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

But there's still more effort going into this code than is strictly necessary. Specifically, we had to remember to surround parts of the regular expression that we would need later in parenthesized groups.

How can we get away with skipping this step? To answer that question, we need another digression.

Let's update our exploratory code to simply look for log levels, without any capture groups.

It turns out, a MatchData object provides access to more than just capture groups.

If we want, we can get at the text that came before the match, by sending the pre_match.

Naturally, we can get at the matched text itself, by converting the match data to a string.

And we can get the text which came after the match, by sending post_match.

The only difference between these post-match results and the capture group version we used before, is that now we have a leading space included. We can get rid of that by stripping the string.

line = "shodan: ERROR Notification of station IT failed"
line =~ /ERROR|FATAL/           # => 8
match = Regexp.last_match         # => #<MatchData "ERROR">
match.pre_match                   # => "shodan: "
match.to_s                        # => "ERROR"
match.post_match                  # => " Notification of station IT failed"
match.post_match.strip            # => "Notification of station IT failed"

So, how does this simplify our filtering code? Well, we can get rid of the regular expression capture groups.

We can grab the log level from the text of the last match.

And we can grab the log message from the post-match string.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/ERROR|FATAL/) {|line|
  level   = Regexp.last_match.to_s
  message = Regexp.last_match.post_match.strip
  "#{sigils[level]} #{message}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

There's one more way we can trim this code down, although it's a bit controversial. Each of the last-match elements we've used so far has a special shorthand pseudo-global variable.

The whole last-match object is accessible with $~

This echoes the "equals-tilde" operator used to match against regular expressions.

The pre-match string is available via $`.

The matched string can be accessed as $&

And the post-match is exposed as =$'=.

By the way, there is a certain logic to the symbols chosen for the pre- and post-match variables. If you think of the backquote and single-quote as open-quote and close-quote, you can think of these variables "surrounding" the last matched string the way quotes surround text.

line = "shodan: ERROR Notification of station IT failed"
line =~ /ERROR|FATAL/             # => 8
match = Regexp.last_match         # => #<MatchData "ERROR">
$~                                # => #<MatchData "ERROR">
match.pre_match                   # => "shodan: "
$`                                # => "shodan: "
match.to_s                        # => "ERROR"
$&                                # => "ERROR"
match.post_match                  # => " Notification of station IT failed"
$'                                # => " Notification of station IT failed"

These special variables are inherited from the Perl programming languages. Perl is one of Ruby's main parent languages. Or as some would have it, Ruby's weird uncle who lives in the attic. A lot of Ruby programmers prefer to pretend these variables don't exist, and there have been calls to remove them from the language entirely.

If you choose never to use or think about these variables again, I wouldn't blame you. Personally, because I'm an old Perl hacker and it's second nature to me, I still sometimes use these pseudo-globals in quick one-off scripts.

Here's how we might use them in our filtering code.

We get rid of the level and message variables entirely.

Then we just use the special variables for post-match and match directly in the result string.

sigils = { "ERROR" => "*", "FATAL" => "!", "WARN" => "?" }
output = open("shodan.log").each_line.grep(/ERROR|FATAL/) {|line|
  "#{sigils[$&]} #{$'.strip}"
}
puts output


# >> * Notification of station IT failed
# >> * Uncontrolled neural net reconfiguration
# >> * Safety protocols inconsistent with updated mandates
# >> ! Logging subsystem compromi

There's one more potential objection to this code, which I'd like to deal with before we finish. We've been using both class-level methods, and apparently global variables.

You might be wondering if this makes the code not thread-safe. After all, doesn't this mean that matching against a regular expression could re-set the last-match information in other threads?

Let's check to see if this is the case.

After making an initial match, we'll start a new thread.

Inside the thread, we'll take a look at the Regexp.last_match, and its alias, $~.

Then we'll join the thread, in order to wait for it to finish.

When we run this, we can see that the main-thread last-match data does not cross the thread boundary. Both last-match expressions result in nil.

Now let's do a regex match inside the thread, and verify that this sets the last-match information.

Finally, we'll check the last-match data again in the main thread.

Looking at the results, we can see that the regex match inside the thread has not affected the main-thread local state.

line = "shodan: ERROR Notification of station IT failed"
line =~ /ERROR|FATAL/             # => 8
match = Regexp.last_match         # => #<MatchData "ERROR">
$~                                # => #<MatchData "ERROR">

Thread.new do
  Regexp.last_match             # => nil
  $~                            # => nil
  line =~ /shodan/
  Regexp.last_match             # => #<MatchData "shodan">
end.join

Regexp.last_match               # => #<MatchData "ERROR">

What this shows us is that regular expression last-match information, while appearing to be global, is actually managed at a thread-local level by Ruby. That's why I referred to the shorthand variables as "pseudo-global" earlier.

As a result, it is in fact safe to use these values in multi-threaded code. The last-match information will always be from the local thread; it will not "bleed over" from other threads.

So what have we learned today? The one thing I want you to take away from this episode is that it's not always necessary to explicitly encode every bit of info you want to use from a regular expression using parenthesized capture groups. For simple matches, it may be enough to just state the basic criteria for a match, and then grab the rest from the match, pre-match, and post-match info that Ruby naturally captures anyway.

Happy hacking!

Responses