Unit 1, Lesson 21

Regex Capture Groups with Nell Shamrell

Video transcript & code

 

When I think about regular expressions in Ruby, I think of Nell Shamrell. Nell has put a lot of study into regular expressions: how to write them, how to optimize them, and how they are implemented under the covers. She's given some great talks on this subject. I've put some links in the show notes.

Today, she has agreed to step into the RubyTapas kitchen and give us an introduction on using regex capture groups. If you've ever looked at some of the more advanced regex tricks on this show and felt a little lost, this episode should fill in some of the blanks.

 


Today I'd like to talk to you about using regular expression capture groups in Ruby. Capture groups are a way to capture certain parts of my regular expression's match so I can use them later in my regular expression or in later code outside of the regex. Let's say I want to make a regex to scan a string looking for basic URLs. Here's an example string.

                    
                        "site www.rubytapas.com"
                    

Next, I'm going to create a basic regular expression to find the URL in that string. I want this regular expression to match www followed by a literal dot. Notice that I had to escape the dot using a backslash. This tells the regular expression engine to treat it as a literal dot, not as the dot metacharacter, which has a different meaning. That is followed by any word character appearing one or more times, followed by another literal dot, followed by any word character appearing one or more times.

                    
                        "site www.rubytapas.com"
                    
                    
                        /www\.\w+\.\w+/
                    

Now, this is a somewhat contrived example and there very likely are more efficient regular expressions to match URLs, but this one illustrates the points I want to make.
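To see why the escaped dot matters, here is a minimal sketch comparing the pattern above with a variant that leaves the first dot unescaped. The variant and the "wwwXrubytapas" test string are made up purely for illustration.

```ruby
# The escaped dot (\.) matches only a literal period.
escaped = /www\.\w+\.\w+/

# Hypothetical variant for comparison: the first dot is NOT escaped,
# so it acts as the "any character" metacharacter.
unescaped = /www.\w+\.\w+/

escaped.match?("site www.rubytapas.com")   # => true
escaped.match?("site wwwXrubytapas.com")   # => false
unescaped.match?("site wwwXrubytapas.com") # => true
```

The unescaped version quietly accepts strings that are not URLs at all, which is why the backslash is needed.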

So let's say I want to capture the domain name for use later outside the regular expression. In this string the domain name would be "rubytapas".

                    
                        # Domain name: rubytapas
                    
                    
                        "site www.rubytapas.com"
                    
                    
                        /www\.\w+\.\w+/
                    

Then I also want to capture the top level domain for use later in the program. In the case of this string, it would be "com".

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
                    
                        "site www.rubytapas.com"
                    
                    
                        /www\.\w+\.\w+/
                    

To capture that domain name, I'm going to enclose the section of the regex meant to match the domain name in parentheses.

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
                    
                        "site www.rubytapas.com"
                    
                    
                        /www\.(\w+)\.\w+/
                    

Next I'll do the same thing for the section that's meant to match the top level domain.

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
                    
                        "site www.rubytapas.com"
                    
                    
                        /www\.(\w+)\.(\w+)/
                    

So let's try running this in Ruby. I'm first going to assign my string to a variable, we'll just call it string.

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
        string = "site www.rubytapas.com"

        /www\.(\w+)\.(\w+)/

Then I'm going to assign my regex to a variable called regex.

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
        string = "site www.rubytapas.com"
        regex = /www\.(\w+)\.(\w+)/

I'm first going to match my regex against this string using the equals sign tilde operator (=~). I'm going to put my regex on one side of it and my string on the other side. This tells Ruby "look in the string for any part of it that matches this regex pattern."

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
        string = "site www.rubytapas.com"
        regex = /www\.(\w+)\.(\w+)/

        regex =~ string

And Ruby's going to return "5." That "5" means the part of the string that matches the regex begins at index 5 of the string, which is the sixth character because indexes start at zero.

                    
                        # Top Level domain: com
                        # Domain name: rubytapas
                    
        string = "site www.rubytapas.com"
        regex = /www\.(\w+)\.(\w+)/

        regex =~ string # => 5
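As a small sketch of what =~ gives back in both cases: it returns the starting index on a match and nil otherwise. The "no url in here" string and the hit and miss variable names are assumptions added for illustration.

```ruby
string = "site www.rubytapas.com"
regex  = /www\.(\w+)\.(\w+)/

hit  = regex =~ string           # => 5, the index where the match starts
miss = regex =~ "no url in here" # => nil, because nothing matched

# Because nil is falsy, =~ can be used directly as a condition.
if regex =~ string
  puts "found a url starting at index #{$~.begin(0)}"
end
```

Checking for nil this way is a common guard before touching any capture groups.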

Now, knowing where my match began is useful, but Ruby offers a few different ways I can get more information about my match. First, let's say I want to see exactly what my match is. One way to do this in Ruby is to type $~. That returns an instance of Ruby's MatchData class for my match. We'll go a little more into MatchData in just a little bit. Notice that it contains the entire part of the string that matched my regex and the results of the two capture groups.

$~ # => #<MatchData "www.rubytapas.com" 1:"rubytapas" 2:"com">

Now I personally find $~ to be cryptic and not very readable. Fortunately, Ruby has another way to see what my last match was, and that is through Regexp (impossible to pronounce but important to know, it's the regular expression class in Ruby). I'm going to call last_match on that class, and I get back that same MatchData object for our last match. You can see it also shows the results from my capture groups, those subexpressions within my larger regular expression.

Regexp.last_match # => #<MatchData "www.rubytapas.com" 1:"rubytapas" 2:"com">

Now what about when I want to look at those capture groups individually and maybe use them later in my code? I can view the first capture group by typing in $1, in that case it returns "rubytapas".

$1 # => "rubytapas"

Likewise, I can view the second capture group by typing in $2, which returns "com".

$2 # => "com"

Notice that my first capture group is referenced by one, not by zero. If I were to type in zero, I would get back the name of the program that ran the match.
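If the numbered globals feel too terse, Regexp.last_match also accepts a group number. Here is a minimal sketch; the domain, tld, and whole variable names are assumptions chosen just for this example.

```ruby
string = "site www.rubytapas.com"
regex  = /www\.(\w+)\.(\w+)/
regex =~ string

domain = Regexp.last_match(1) # same value as $1
tld    = Regexp.last_match(2) # same value as $2
whole  = Regexp.last_match(0) # the entire matched text, same as $~[0]
```

Note that group 0 here is the whole match, unlike $0, which is the program name.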

Along with looking at these capture groups, I can also use them later in the program. Let's try interpolating these two capture groups into a string. In my string I'm going to type in "Domain name: ", then interpolate my first capture group, followed by "Top Level Domain: " then I'll interpolate my second capture group.

"Domain name: #{$1} Top Level Domain: #{$2}"

And this interpolates those two capture groups into my string.

"Domain name: #{$1} Top Level Domain: #{$2}" # => "Domain name: rubytapas Top Level Domain: com"

Now using numbers like this does work, but again it's somewhat cryptic and a little hard to read. A perhaps clearer way to handle capture groups is through Ruby's MatchData class. Working with capture groups is one of the places the MatchData class is most useful.

So let's create a MatchData object using the match method. I'm going to assign it to a variable called "my_match." And I'm going to call match on my regex and pass in my string.

my_match = regex.match(string)

And I get back that instance of the MatchData class with the full matched string and the capture groups.

my_match = regex.match(string) # => #<MatchData "www.rubytapas.com" 1:"rubytapas" 2:"com">

I can then also access the results of my capture groups similar to how I would access the elements of an array. If I type in my_match[1], I'll get back the result of my first capture group.

my_match[1] # => "rubytapas"

Likewise, if I type in my_match[2], I'll get back the result of my second capture group.

my_match[2] # => "com"

Again, note that the first capture group begins at 1, not at 0 like an array. If I were to type in my_match[0], I would get back the entire string that matched the larger regular expression.

my_match[0] # => "www.rubytapas.com"
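Building on the array-like access, here is a short sketch of two related MatchData methods, captures and to_a, plus what match returns when nothing matches. The "no url here" string and the domain, tld, and no_match names are assumptions for illustration.

```ruby
string   = "site www.rubytapas.com"
regex    = /www\.(\w+)\.(\w+)/
my_match = regex.match(string)

my_match.captures # => ["rubytapas", "com"]
my_match.to_a     # => ["www.rubytapas.com", "rubytapas", "com"]

# captures is a plain array, so it can be destructured.
domain, tld = my_match.captures

# match returns nil when nothing matches, so guard before indexing.
no_match = regex.match("no url here")
no_match.nil?     # => true
```

Since captures leaves out the full match at index 0, its first element lines up with my_match[1].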

And that is an intro to using capture groups in your Ruby regular expressions. Happy hacking!
