Match
Video transcript & code
Here's a regular expression that matches local phone numbers in a common US format. We can use it to match against a string that contains a phone number. Then we can output the found phone number.
r = /\d{3}-\d{4}/
match = r.match("my number is 555-1234")
puts "Found #{match}"
# >> Found 555-1234
Of course, it's always possible we might try to match against a string that doesn't contain a recognizable phone number. When a match is made, the #match method returns a MatchData object. If the regular expression isn't found, it returns nil.
r = /\d{3}-\d{4}/
r.match("my number is 555-1234") # => #<MatchData "555-1234">
r.match("no number") # => nil
Since nil is "falsy", we can exploit this fact and insert a conditional. This if statement ensures we'll only try to use the match data if a match was found.
r = /\d{3}-\d{4}/
match = r.match("my number is 555-1234")
if match
  puts "Found #{match}"
end
# >> Found 555-1234
This code illustrates an extremely common idiom in Ruby code: first determine if a match for a pattern was found. Then, if it was, do something with the resulting data. In effect the return value of #match plays double duty: first as a flag indicating success or failure, and then as a holder of data.
This pattern is common enough that I find the code as we've typed it here to be unnecessarily awkward. I much prefer to combine the test for a match and the subsequent use of the match into a single statement. We can accomplish this by inlining the invocation of #match into the if statement's test.
r = /\d{3}-\d{4}/
if match = r.match("my number is 555-1234")
  puts "Found #{match}"
end
# >> Found 555-1234
The only downside to this is that many programmers, and some automated code linting tools, will automatically flag it as a possible bug. The reason is that it's a common typo to use a single equals sign in an if statement when we intended to use a double equals to test for equality.
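To make the objection concrete, here is a minimal sketch of the typo that reviewers and linters are worried about. The variable names and values are hypothetical and exist only for illustration.

```ruby
# Hypothetical values, chosen only to show the difference.
count = 3
expected = 5

# Intended: compare the two values with a double equals.
intended = if count == expected
  "equal"
else
  "different"
end

# Typo: a single equals assigns expected to count. The assigned value (5)
# is truthy, so the first branch is taken no matter what count held before.
typo = if count = expected
  "equal"
else
  "different"
end

intended # => "different"
typo     # => "equal"
count    # => 5
```

The second version runs without error, which is exactly why the mistake is easy to miss.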
To mitigate this objection to the code I sometimes enclose the whole match assignment expression in parentheses. This doesn't alter the behavior of the code at all. It's just a way of visually differentiating code where an assignment is intentionally performed inside an if-test clause. It's a way of saying "yes, I meant to do this".
r = /\d{3}-\d{4}/
if (match = r.match("my number is 555-1234"))
  puts "Found #{match}"
end
# >> Found 555-1234
However, just the other day I discovered an alternative to this idiom that I'm pretty excited about. It turns out that #match accepts an optional block. The block will be executed only if the match is successful. When it is, it will receive the match data as a block argument.
We can use this block to remove the need for an if statement entirely.
r = /\d{3}-\d{4}/
r.match("my number is 555-1234") do |match|
  puts "Found #{match}"
end
# >> Found 555-1234
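The example above only exercises the successful case. As a quick sketch of the other half of the claim, we can hand #match a string with no phone number and confirm that the block never runs. The block_ran flag is a name introduced here purely for illustration.

```ruby
r = /\d{3}-\d{4}/
block_ran = false # hypothetical flag, used only to observe whether the block runs

result = r.match("no number") do |match|
  block_ran = true
  puts "Found #{match}"
end

block_ran # => false
result    # => nil
```

Nothing is printed, and #match returns nil just as it does without a block.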
This isn't a complete replacement for an if statement though. Consider the case where we also want to take a specific action when the match fails. If we get clever, we might think to make use of the or control flow operator. We tack on an or and a clause raising an exception. We expect this to work, based on the fact that the call to #match will return nil on failure. And indeed, at first this appears to work.
r = /\d{3}-\d{4}/
r.match("my number is 1234") do |match|
  puts "Found #{match}"
end or fail "No number found"
# ~> -:4:in `<main>': No number found (RuntimeError)
But when we test the code with a string that contains a valid phone number we see something weird: Both the success and failure actions are triggered.
r = /\d{3}-\d{4}/
r.match("my number is 555-1234") do |match|
  puts "Found #{match}"
end or fail "No number found"
# ~> -:4:in `<main>': No number found (RuntimeError)
# >> Found 555-1234
Why is this? The answer is that when we pass a block to #match and the regular expression match is successful, #match doesn't return the MatchData object. Instead, it returns whatever the return value of the block was.
r = /\d{3}-\d{4}/
r.match("my number is 555-1234") { 42 } # => 42
In the case of our phone-number match block, we called puts. puts always returns nil. The nil became the return value of the block, and was passed through to become the return value of #match. This then triggered the right-hand side of the or operator.
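We can see that chain of events in isolation with a small sketch. The variable names with_puts and with_string are hypothetical. The first block ends in a puts call, the second ends in a plain string.

```ruby
r = /\d{3}-\d{4}/

# puts prints its argument and returns nil, so nil becomes the block's
# value and therefore the return value of #match.
with_puts = r.match("my number is 555-1234") { |match| puts "Found #{match}" }
with_puts # => nil

# A block whose last expression is a string hands that string back instead.
with_string = r.match("my number is 555-1234") { |match| "Found #{match}" }
with_string # => "Found 555-1234"
```

Both matches succeed. Only the value of the block's last expression differs, and that value is what an or clause would end up testing.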
So in the case where we need to take action either for the success or failure branches, we should stick to a traditional if statement.
r = /\d{3}-\d{4}/
if match = r.match("my number is 1234")
  puts "Found #{match}"
else
  fail "No number found"
end
# ~> -:5:in `<main>': No number found (RuntimeError)
This raises the question: why does #match pass through the block return value, anyway?
Here's why. Very often, when we match against a regular expression, the very next thing we do is extract some specific piece of information from the matching text. For instance, consider a situation in which all we're really interested in is the exchange portion of matched telephone numbers. The exchange is represented by the first three digits of the phone number.
To isolate the exchange, we add a capture group to our regular expression. Then we use [1] on the match data to pull out that group.
r = /(\d{3})-\d{4}/
match = r.match("my number is 555-1234")
exchange = match[1]
exchange # => "555"
Of course, if there is no match, we'll get a NoMethodError as we try to subscript a nil value.
r = /(\d{3})-\d{4}/
match = r.match("my number is 1234")
exchange = match[1]
exchange # =>
# ~> -:3:in `<main>': undefined method `[]' for nil:NilClass (NoMethodError)
To avoid this, we add an if statement to only do the subscripting if a match is found. Then we assign the result of the whole if statement to the exchange variable. If there is no match, there will be no exception and the exchange variable will be set to nil.
r = /(\d{3})-\d{4}/
exchange = if match = r.match("my number is 1234")
  match[1]
end
exchange # => nil
We can accomplish the same thing more concisely using a block passed to #match. If there is a match, the block will be executed and the resulting value assigned to the exchange variable. If there is no match, the block will be ignored and nil will be assigned.
r = /(\d{3})-\d{4}/
exchange = r.match("my number is 555-1234") { |match| match[1] }
exchange # => "555"
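For completeness, here is the same one-liner run against the earlier input that lacks a phone number. This is a small sketch of the no-match case described above. The block is skipped, so no NoMethodError is raised.

```ruby
r = /(\d{3})-\d{4}/
exchange = r.match("my number is 1234") { |match| match[1] }
exchange # => nil
```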
And there you have it: the regular expression #match method is just one more example of how blocks are used by Ruby core classes to make common operations easy. Happy hacking!