Named Capture

Video transcript & code

Every now and then I run across a topic in my backlog and think "surely I've done that one already", and then I go and look through the archives and discover that I haven't. Today is one of those times. Today we're going to talk about named regex captures.

You might remember that in episode #260, guest chef Nell Shamrell gave us an intro to regex capture groups. Just to briefly review, let's say we have a regex that matches certain filename patterns.

/\A\d{3}-[\w-]+\.\w{3,4}\z/

This can tell us if a string matches the pattern.

patt = /\A\d{3}-[\w-]+\.\w{3,4}\z/
patt =~ "375-named-capture.mp4" # => 0
patt =~ "vo.wav"                # => nil

But what it can't tell us is which parts of the string matched which parts of the pattern. For that, we need groups.

If we use parentheses to introduce some groups to this regular expression, it enables some finer-grained matching.

Let's make a successful match and then take a look at the value of the match groups, via the magic variables $1, $2, and $3.

patt = /\A(\d{3})-([\w-]+)\.(\w{3,4})\z/
patt =~ "375-named-capture.mp4" # => 0
$1                              # => "375"
$2                              # => "named-capture"
$3                              # => "mp4"

There are about a half dozen alternatives for looking up match groups.

We can use the $~ pseudo-global variable to get the last match data, and look them up from there.

We can use the more readable Regexp.last_match method.

Or if we use the .match method, we can look up match groups on the resulting MatchData object.

We can even get a list of all the captures in order.

patt = /\A(\d{3})-([\w-]+)\.(\w{3,4})\z/
patt =~ "375-named-capture.mp4" # => 0

$~[1]                           # => "375"
Regexp.last_match[2]            # => "named-capture"
md = patt.match("375-named-capture.mp4") # => #<MatchData "375-named-capture....
md[3]                           # => "mp4"
md.captures
# => ["375", "named-capture", "mp4"]

The one drawback all of these approaches share is that we have to use magic numbers to address the capture groups. That's because the capture groups are positional. In order to figure out which group number is associated with which set of parentheses in the original pattern, we have to count opening parentheses from left to right.
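To make the drawback concrete, here is a small sketch of what happens when a pattern changes. The revised pattern and its optional "draft-" prefix group are hypothetical, invented purely for illustration; the point is only that adding a group on the left renumbers every group after it.

```ruby
# Hypothetical scenario: the pattern later gains an optional "draft-" prefix group.
ORIGINAL_PATT = /\A(\d{3})-([\w-]+)\.(\w{3,4})\z/
REVISED_PATT  = /\A(draft-)?(\d{3})-([\w-]+)\.(\w{3,4})\z/

filename = "375-named-capture.mp4"

ORIGINAL_PATT.match(filename)[1] # => "375"
REVISED_PATT.match(filename)[1]  # => nil   (group 1 is now the prefix, which didn't participate)
REVISED_PATT.match(filename)[2]  # => "375" (the episode number moved to group 2)
```

Any code that was still asking for group 1 would quietly start getting nil instead of the episode number.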

Wouldn't it be nice if we could give the capture groups names instead of numbers? Well, as a matter of fact, since Ruby 1.9, we can.

Let's give our capture groups names. We'll call the leftmost one "num", the middle one "name", and the rightmost one "ext".

Notice the syntax we use: right after the opening paren of the group, we put a question mark. We follow that with the name of the group, inside angle brackets.

patt = /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/

Now let's match this pattern against a string, and check out the captures that result.

How do we get at the captures? Well, remember that for positional captures we could look them up by number using the subscript operator. We can do the same thing with named captures, only using symbolic names.

This works anywhere we use a MatchData object.

patt = /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/
filename = "375-named-capture.mp4"
patt =~ filename                # => 0
$~[:num]                        # => "375"
Regexp.last_match[:name]        # => "named-capture"
md = patt.match(filename)       # => #<MatchData "375-named-capture.mp4" num:...
md[:ext]                        # => "mp4"
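As a side note, here is a brief sketch of a few other things a MatchData with named groups can do. The constant name FILENAME_PATT is just a placeholder for this example, and named_captures assumes Ruby 2.4 or later.

```ruby
# FILENAME_PATT is a placeholder name for the pattern used above.
FILENAME_PATT = /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/
md = FILENAME_PATT.match("375-named-capture.mp4")

md["ext"]                  # => "mp4" (string keys work as well as symbols)
md[3]                      # => "mp4" (named groups are still numbered, too)
md.names                   # => ["num", "name", "ext"]
md.named_captures["name"]  # => "named-capture" (Ruby 2.4+; a Hash keyed by group name)
```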

But what if we don't want to deal with a MatchData object? What if we just want to use the match groups directly? For instance, let's say we have a very traditional conditional statement which performs an action only if a regex is matched.

In that case, something especially magical happens: we get to use the group names as local variable names, without any extra effort needed.

filename = "375-named-capture.mp4"

if /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/ =~ filename
  num                           # => "375"
  name                          # => "named-capture"
  ext                           # => "mp4"
end

As nifty as this is, there are some pretty strict limitations on it. This will only work when a literal regular expression, with no interpolation, appears on the left-hand side of the match operator. If, for instance, we extracted the expression out into a variable, the magic locals would stop being available.

filename = "375-named-capture.mp4"
patt = /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/
if patt =~ filename
  num                           # =>
  name                          # =>
  ext                           # =>
end

# ~> NameError
# ~> undefined local variable or method `num' for main:Object
# ~>
# ~> xmptmp-in12048QpQ.rb:4:in `<main>'
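The same restriction shows up in a couple of other forms. As a rough sketch (the method names here are made up for the example, and the pattern is shortened to just the number group), the locals are only assigned when an interpolation-free regex literal sits on the left of the match operator. Each case is wrapped in its own method so that one case can't define a local for another.

```ruby
# Literal on the left of =~: the local variable is assigned.
def literal_on_left(filename)
  if /\A(?<num>\d{3})-/ =~ filename
    num
  end
end

# Literal on the right of =~: the match succeeds, but no local is created.
def literal_on_right(filename)
  if filename =~ /\A(?<num>\d{3})-/
    binding.local_variable_defined?(:num)
  end
end

# Interpolation inside the literal also turns the assignment off.
def interpolated_literal(filename)
  digits = 3
  if /\A(?<num>\d{#{digits}})-/ =~ filename
    binding.local_variable_defined?(:num)
  end
end

literal_on_left("375-named-capture.mp4")      # => "375"
literal_on_right("375-named-capture.mp4")     # => false
interpolated_literal("375-named-capture.mp4") # => false
```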

This is consistent with the idea that these auto-assigned capture group variables are really only intended to be conveniences in quick, one-off scripts. In any larger, longer-lived program, we probably don't want local variables popping into existence unannounced. In those cases, we are better off pulling our capture groups out of MatchData objects.
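For example, one way this might look in a larger program is a small helper that keeps the pattern in a constant and hands back the named parts explicitly. This is only a sketch: the names EPISODE_FILENAME and parse_episode_filename are hypothetical, not something from the episode.

```ruby
# Hypothetical constant holding the pattern from this episode.
EPISODE_FILENAME = /\A(?<num>\d{3})-(?<name>[\w-]+)\.(?<ext>\w{3,4})\z/

# Hypothetical helper: returns a Hash of the named parts,
# or nil when the filename doesn't match the pattern.
def parse_episode_filename(filename)
  md = EPISODE_FILENAME.match(filename)
  return nil unless md

  { num: md[:num], name: md[:name], ext: md[:ext] }
end

parts = parse_episode_filename("375-named-capture.mp4")
parts[:num]                      # => "375"
parts[:name]                     # => "named-capture"
parse_episode_filename("vo.wav") # => nil
```

Because the pattern lives in a constant here, the magic locals wouldn't be available anyway; going through the MatchData object keeps the data flow explicit.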

And that's all for today. Happy hacking!
