
Grep


If you have any command-line experience at all, you are probably familiar with the "grep" command-line utility. Given a pattern, and a list of files to look in, it will output the lines it finds matching the pattern.

$ grep "Jabberwock" jabberwocky.txt
"Beware the Jabberwock, my son!
The Jabberwock, with eyes of flame,
"And hast thou slain the Jabberwock?

Ruby has a grep as well: the #grep method, which we can find on any Enumerable collection. For instance, the File#each_line method, when called without a block, returns an enumerator over a file's lines. Since enumerators are enumerable, we can send #grep to the result, and thus reproduce the function of our grep command.

open("jabberwocky.txt").each_line.grep(/Jabberwock/)
# => ["\"Beware the Jabberwock, my son!\n",
#     "The Jabberwock, with eyes of flame,\n",
#     "\"And hast thou slain the Jabberwock?\n"]

Note that we pass a regular expression literal to #grep, which makes sense given that we also give a regular expression to the UNIX grep command.
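To try this without a jabberwocky.txt on disk, here is a minimal sketch that swaps the file for a StringIO. The POEM constant and its short excerpt are stand-ins chosen for illustration. Like File#each_line, StringIO#each_line returns an enumerator when called without a block.

```ruby
require "stringio"

# A short excerpt standing in for jabberwocky.txt (illustrative only).
POEM = <<~TEXT
  "Beware the Jabberwock, my son!
  The jaws that bite, the claws that catch!
  The Jabberwock, with eyes of flame,
  Came whiffling through the tulgey wood,
  "And hast thou slain the Jabberwock?
  Come to my arms, my beamish boy!
TEXT

matches = StringIO.new(POEM).each_line.grep(/Jabberwock/)
matches
# => ["\"Beware the Jabberwock, my son!\n",
#     "The Jabberwock, with eyes of flame,\n",
#     "\"And hast thou slain the Jabberwock?\n"]
```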

If you've ever piped the output of a grep command into another UNIX command, you might be wondering if we can similarly execute Ruby code as #grep discovers matches, rather than just acting on the final array of results.

In fact we can, and this is one way in which Ruby's #grep differentiates itself from the #select (aka #find_all) method. Since #grep accepts its criteria for a match as an argument, that leaves its block available for other purposes. If we pass a block to our invocation of #grep, it will be called back once for every matching element, as those elements are discovered.

open("jabberwocky.txt").each_line.grep(/Jabberwock/) do |line|
  puts "MATCH: #{line}"
end
# >> MATCH: "Beware the Jabberwock, my son!
# >> MATCH: The Jabberwock, with eyes of flame,
# >> MATCH: "And hast thou slain the Jabberwock?

Since we are also using an Enumerator generated from #each_line, these callbacks will occur as the file is read in, rather than all at once after the whole file has been processed. This might not matter for small texts like Jabberwocky, but it can be a big deal when accepting piped-in input, or processing very large logfiles.
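One way to see this interleaving for ourselves is to grep over a line source that records when each line is produced. This is only a sketch: the hand-built Enumerator and the events array are hypothetical stand-ins for a real file.

```ruby
events = []

# A hypothetical line source that logs each line as it is handed out.
lines = Enumerator.new do |yielder|
  ["the Jabberwock\n", "the Jubjub bird\n", "slain the Jabberwock\n"].each do |line|
    events << "read:  #{line.chomp}"
    yielder << line
  end
end

lines.grep(/Jabberwock/) do |line|
  events << "match: #{line.chomp}"
end

puts events
# >> read:  the Jabberwock
# >> match: the Jabberwock
# >> read:  the Jubjub bird
# >> read:  slain the Jabberwock
# >> match: slain the Jabberwock
```

Each match is reported before the next line is read, rather than after the whole source has been consumed.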

It also means that we can break out early, without the expense of processing all of the elements. For instance, we can break if the current matched line contains a particular flag word.

open("jabberwocky.txt").each_line.grep(/Jabberwock/) do |line|
  break if line =~ /slain/
  puts "MATCH: #{line}"
end
# >> MATCH: "Beware the Jabberwock, my son!
# >> MATCH: The Jabberwock, with eyes of flame,
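To check that breaking out really does leave the rest of the input untouched, we can try the same thing on a StringIO and then ask it for its next line. This is a sketch with an illustrative three-line excerpt; the seen and result variables are just for inspection.

```ruby
require "stringio"

io = StringIO.new(<<~TEXT)
  The Jabberwock, with eyes of flame,
  "And hast thou slain the Jabberwock?
  Come to my arms, my beamish boy!
TEXT

seen = []
result = io.each_line.grep(/Jabberwock/) do |line|
  break if line =~ /slain/
  seen << line
end

seen     # => ["The Jabberwock, with eyes of flame,\n"]
result   # => nil
io.gets  # => "Come to my arms, my beamish boy!\n"
```

Because we broke out of the block, #grep never returns its usual array (the expression evaluates to nil), and the line after the "slain" line is still waiting to be read.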

For more on this "streaming" style of processing, see episode #42.

If we were to go by the name, we might reasonably assume that text searching is all that Ruby's #grep is good for. But that would be to give it far too little credit. In fact, #grep can be used to match any kind of object in a collection.

This is possible because under the covers #grep makes use of Ruby's "case-equality", or "threequals" operator. As the first name suggests, "case equality" is the operator that powers Ruby's very powerful case statements, which let us match objects based on their value, a matching regular expression, their class, a range of values, and so on.

case object
when 23 then "the number 23"
when /foo/ then "contains foo"
when Float then "a floating-point number"
when 2..10 then "between 2 and 10"
end
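To make the connection concrete, here is a sketch that wraps a case statement like this one in a method (the name describe is made up for the example) and then spells out the threequals comparison that each when clause roughly corresponds to.

```ruby
def describe(object)
  case object
  when 23    then "the number 23"
  when /foo/ then "contains foo"
  when Float then "a floating-point number"
  when 2..10 then "between 2 and 10"
  end
end

describe(23)         # => "the number 23"
describe("seafood")  # => "contains foo"
describe(3.14)       # => "a floating-point number"
describe(7)          # => "between 2 and 10"
describe(nil)        # => nil

# Each `when` is evaluated roughly as `pattern === object`:
23      === 23         # => true
/foo/   === "seafood"  # => true
Float   === 3.14       # => true
(2..10) === 7          # => true
```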

Ruby simply takes the argument to #grep and tests it against the collection's elements using the threequals operator. Here's an example which demonstrates this fact: instead of strings, we can use #grep on a series of numbers, passing a range instead of a regular expression as the pattern to match:

[87, 23, 15, 74, 62, 42, 91].grep(0..50)
# => [23, 15, 42]

This works because the range 0..50 implements the threequals operator as a test for inclusion in the range.

(0..50) === 23                    # => true
(0..50) === 87                    # => false
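The same idea carries over to other built-in patterns. As a small sketch with made-up sample data: classes implement threequals as an "is this an instance of me?" test, so we can grep by type. And when we pass a block, #grep returns the block's results for the matching elements rather than the elements themselves.

```ruby
mixed = [87, "twenty-three", 15.5, :seventy_four, 62, nil]

mixed.grep(Integer)  # => [87, 62]
mixed.grep(Numeric)  # => [87, 15.5, 62]
mixed.grep(String)   # => ["twenty-three"]

# With a block, the return value holds the block's results.
doubled = [87, 23, 15, 74, 62, 42, 91].grep(0..50) { |n| n * 2 }
doubled  # => [46, 30, 84]
```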

Now that we understand how #grep leverages the power of the threequals, we can begin to understand how truly flexible it is. We also realize an interesting possibility: since the patterns given to #grep are simply objects which implement the threequals, we can invent our own pattern objects for arbitrary matching.
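To make that concrete, here is a rough sketch of what #grep does, written in plain Ruby. The simple_grep method is hypothetical and simplified (it is not Ruby's actual implementation), and the even object is a made-up pattern whose only job is to respond to threequals.

```ruby
# A hypothetical, simplified stand-in for Enumerable#grep.
def simple_grep(collection, pattern)
  results = []
  collection.each do |element|
    next unless pattern === element
    # With a block, collect the block's result; otherwise the element itself.
    results << (block_given? ? yield(element) : element)
  end
  results
end

# Any object that responds to === can serve as a pattern.
even = Object.new
def even.===(other)
  other.respond_to?(:even?) && other.even?
end

numbers = [87, 23, 15, 74, 62, 42, 91]

simple_grep(numbers, 0..50)  # => [23, 15, 42]
simple_grep(numbers, even)   # => [74, 62, 42]
numbers.grep(even)           # => [74, 62, 42]
```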

To use an example that is probably still fresh in our minds, let's use the registry of ConversionRatio objects we came up with in episode #214.

class Feet; end
class Meters; end

ConversionRatio = Struct.new(:from, :to, :number) do
  def self.registry
    @registry ||= []
  end

  def self.find(from, to)
    registry.detect{|ratio| ratio.from == from && ratio.to == to}
  end
end

ConversionRatio.registry << 
  ConversionRatio.new(Feet, Meters, 0.3048) <<
  ConversionRatio.new(Meters, Feet, 3.28084)

Let's say we want to find ratios based on their from and to properties. We've already seen how to do this one way in the .find method, where we pass a block to #detect. Now let's explore alternatives using #grep.

We might construct a special RatioPattern object which matches based on passed-in from and to attributes. We can then grep with an instance of this object, parameterized with the "from" and "to" units we're looking for.

require "./ratio"

RatioPattern = Struct.new(:from, :to) do
  def ===(other)
    from == other.from && to == other.to
  end
end

ConversionRatio.registry.grep(RatioPattern.new(Feet, Meters))
# => [#<struct ConversionRatio from=Feet, to=Meters, number=0.3048>]

That's a pretty verbose way to set up a single type of query, though. Here's another approach: we can set up a lambda that looks for exactly what we want. Passing this lambda to #grep gets us the ratio we were looking for.

require "./ratio"

pattern = ->(ratio) { ratio.from == Meters && ratio.to == Feet }
ConversionRatio.registry.grep(pattern)
# => [#<struct ConversionRatio from=Meters, to=Feet, number=3.28084>]

This works because Ruby Proc objects alias threequals to the #call method, so sending threequals to a proc is the same as calling it.

require "./ratio"

pattern = ->(ratio) { ratio.from == Meters && ratio.to == Feet }
pattern === ConversionRatio.new(Meters, Feet, 3.28084)
# => true
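The same trick is not specific to our ratios. As a sketch with made-up lambdas (even, small, and classify are names invented for this example), any lambda can stand in as a pattern, both in a case statement and in #grep.

```ruby
even  = ->(n) { n.even? }
small = ->(n) { n < 10 }

even === 4    # => true
even.call(4)  # => true
even === 7    # => false

# Lambdas work as `when` patterns because `case` uses threequals.
classify = lambda do |n|
  case n
  when even  then "even"
  when small then "small and odd"
  else            "large and odd"
  end
end

classify.call(4)   # => "even"
classify.call(7)   # => "small and odd"
classify.call(21)  # => "large and odd"

[4, 7, 21, 30].grep(even)  # => [4, 30]
```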

Of course, this approach is also suspiciously similar to simply sending #select with a block. But remember, with #grep we pass the pattern as an argument, which frees us to use the block for callbacks whenever matches are found.

require "./ratio"

pattern = ->(ratio) { ratio.from == Meters && ratio.to == Feet }
ConversionRatio.registry.grep(pattern) do |ratio|
  # ...
end

So far we've looked at grepping with custom pattern objects, and with lambdas. But there's a third possibility, which is perhaps the most interesting of all.

When we compare one Struct object to another, the objects are compared by value. That is, a ConversionRatio's from, to, and number fields must all match for two ratios to be considered equal. So if we compare one ratio with another whose number value is nil, the result is false.

require "./ratio"
ratio1 = ConversionRatio.new(Meters, Feet, 3.28084)
ratio2 = ConversionRatio.new(Meters, Feet, nil)
ratio2 == ratio1                # => false

But what if we had a "wildcard" object which could match any value? Let's build one. It will define an equality operator which always returns true. In other words, a Wildcard is equal to any other object.

class Wildcard
  def ==(other)
    true
  end
end

As a convenience, let's assign an instance of this class to a constant. We'll call it "ANY".

ANY = Wildcard.new

Now let's construct a special ConversionRatio where the number field contains a wildcard placeholder. When we compare this "ratio pattern" to a fully filled-in ConversionRatio with matching from and to fields, the result is true! But if we compare it to a ratio with differing from and to fields, the result is false.

require "./ratio"
require "./wildcard"

ratio1 = ConversionRatio.new(Meters, Feet, 3.28084)
pattern = ConversionRatio.new(Meters, Feet, ANY)
pattern == ratio1                # => true

pattern == ConversionRatio.new(Feet, Meters, 0.3048)
# => false

Because Struct does not override case equality, it falls back on the default Object#===, which behaves the same as regular equality. So if we replace the equality operators with threequals, the outcome is the same.

require "./ratio"
require "./wildcard"

ratio1 = ConversionRatio.new(Meters, Feet, 3.28084)
pattern = ConversionRatio.new(Meters, Feet, ANY)
pattern === ratio1                # => true

pattern === ConversionRatio.new(Feet, Meters, 0.3048)
# => false

What does this mean? Now we can grep through our registry using a ConversionRatio in which the fields we don't care about have been filled in with wildcards.

require "./ratio"
require "./wildcard"

ConversionRatio.registry.grep(ConversionRatio.new(Meters, Feet, ANY))
# => [#<struct ConversionRatio from=Meters, to=Feet, number=3.28084>]
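For anyone who wants to run this without the ./ratio and ./wildcard files, here is a self-contained sketch of the same idea. It makes a couple of simplifying assumptions: the registry is a plain local array rather than the class-level registry from the episode, and the from_feet and reversed variables exist only for the demonstration.

```ruby
class Feet; end
class Meters; end

# A wildcard considers itself equal to anything.
class Wildcard
  def ==(other)
    true
  end
end
ANY = Wildcard.new

ConversionRatio = Struct.new(:from, :to, :number)

# Simplified: a local array stands in for ConversionRatio.registry.
registry = [
  ConversionRatio.new(Feet, Meters, 0.3048),
  ConversionRatio.new(Meters, Feet, 3.28084)
]

registry.grep(ConversionRatio.new(Meters, Feet, ANY))
# => [#<struct ConversionRatio from=Meters, to=Feet, number=3.28084>]

# More than one field can be left open.
from_feet = registry.grep(ConversionRatio.new(Feet, ANY, ANY))
from_feet.map(&:number)  # => [0.3048]

# The wildcard has to sit on the pattern (receiver) side. Here the comparison
# Feet == ANY is asked of the class Feet, which knows nothing about wildcards.
reversed = ConversionRatio.new(Feet, Meters, 0.3048) ==
           ConversionRatio.new(ANY, Meters, 0.3048)
reversed  # => false
```

The last comparison is worth keeping in mind: this style of matching depends on the pattern being the receiver of the comparison, which is exactly how #grep uses it.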

Instead of expressing our criteria procedurally, we're now specifying a pattern, and finding out what matches that pattern. In effect, we are searching for objects by example.

I can't claim that this approach has a clear and unequivocal advantage over procedural searching. But at least in some cases, especially when the value objects being searched for have few attributes, I feel like there is a special kind of clarity to searching by example.

Whether you find this style of searching agreeable is a matter of taste. But if nothing else, now you know all about how to use #grep to search for patterns in collections and streams. Happy hacking!
