Unit 1, Lesson 21

Matching Digits

Video transcript & code

Today I want to talk about some advanced regular expression features.

Let's say we have a list of episode titles, and we want to extract the episode number from each one.

titles = [
  "RubyTapas007-Constructors",
  "059-enumerator",
  "stats-150",
  "123",
  "555-8675309-Jenny",
]

I've written a helper method which will allow us to apply regular expressions to these strings and view the resulting matches. We'll gloss over the details of this helper for now.

require "table_print"
require "ostruct"

def show_matches(strings, regexp, captures: false)
  rows = strings.map.with_index{|title, index|
    found  = []
    string = title
    # Match repeatedly, resuming from the text after each match.
    while (match = regexp.match(string))
      found << match
      string = match.post_match
    end
    # One row per title: its position, the original string, and all matches.
    OpenStruct.new({
        num: index + 1,
        string: title,
        matches: found
      })
  }
  if captures
    tp rows, "num", "string", "matches.to_s", "matches.captures"
  else
    tp rows, "num", "string", "matches.to_s"
  end
end

We know that episode numbers always consist of a triplet of digits. So we start by looking for three digits in a row. We use the \d shortcut to match digits from 0-9.

show_matches(titles, /\d\d\d/)

Before we go any further, we decide to make the regex a little more concise by using a single \d and specifying that it should be found exactly three times, using curly braces.

show_matches(titles, /\d{3}/)
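
To see the matches without the helper, here is a minimal sketch that uses String#scan in place of show_matches. That substitution is mine, not part of the episode, and the variable names long_form and short_form are just for illustration. It also checks that the long and short forms of the pattern find the same things.

```ruby
titles = [
  "RubyTapas007-Constructors",
  "059-enumerator",
  "stats-150",
  "123",
  "555-8675309-Jenny",
]

# String#scan returns every non-overlapping match, left to right.
long_form  = titles.map { |title| title.scan(/\d\d\d/) }
short_form = titles.map { |title| title.scan(/\d{3}/) }

same_results = (long_form == short_form) # => true
last_title   = short_form.last           # => ["555", "867", "530"]
```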

We can see from the results that this match is a little too permissive. In the last example string it's matching three times when we really only want it to match the triplet at the start of the title.

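We could try specifying that the triplet must be followed by a word boundary, using \b. This special code matches the zero-width position between a word character, such as a letter, digit, or underscore, and a nonword character, such as a space or punctuation. It also matches at the very start or end of the string when the adjacent character is a word character.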

show_matches(titles, /\d{3}\b/)
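
As a rough sketch of what a trailing \b does, here are a few of the titles run through String#[] and String#scan rather than the helper. The variable names are only for illustration.

```ruby
# \b consumes no characters, so the match is still just the three digits.
at_end = "stats-150"[/\d{3}\b/]                           # => "150"

# A hyphen is a nonword character, so there is a boundary right after "007".
before_hyphen = "RubyTapas007-Constructors"[/\d{3}\b/]    # => "007"

# In the last title, "309" is also followed by a hyphen, so it matches too.
last_title = "555-8675309-Jenny".scan(/\d{3}\b/)          # => ["555", "309"]
```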

Hmm, looks like we're still matching one too many times in that last example. Let's require a word boundary before the triplet as well.

show_matches(titles, /\b\d{3}\b/)
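
Here is a small sketch of the boundary on both sides, again using String#[] and String#scan instead of the helper, with illustrative variable names.

```ruby
# "s" and "0" are both word characters, so no boundary sits between them.
after_letters = "RubyTapas007-Constructors"[/\b\d{3}\b/]  # => nil

# "-" is a nonword character, so there is a boundary just before "150".
after_hyphen = "stats-150"[/\b\d{3}\b/]                   # => "150"

# Only the leading triplet has a boundary on both sides.
last_title = "555-8675309-Jenny".scan(/\b\d{3}\b/)        # => ["555"]
```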

Oops, we just broke the first example. In that example, the triplet follows immediately after a string of letters with no intervening word boundary.

Let's try specifying that the episode number simply has some kind of non-digit before and after it. We use the \D code for this.

show_matches(titles, /\D\d{3}\D/)
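
A quick sketch of why \D on both sides goes wrong, using String#[] rather than the helper. Unlike \b, \D has to consume a real character, so it ends up in the match and it fails when there is nothing there to consume.

```ruby
# The surrounding non-digits become part of the matched text.
padded = "RubyTapas007-Constructors"[/\D\d{3}\D/]  # => "s007-"

# Nothing precedes "059", so the leading \D has no character to match.
at_start = "059-enumerator"[/\D\d{3}\D/]           # => nil

# Nothing follows "150", so the trailing \D has no character to match.
at_end = "stats-150"[/\D\d{3}\D/]                  # => nil
```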

That's a complete failure. We're now matching more than just the triplet, and we're losing all the matches where the number is up against the beginning or end of the string.

Let's specify either a non-digit or the beginning or end of string. In order to sort out the triplet itself from any parts matched before and after it, we'll add parenthesized capture groups.

show_matches(titles, /(\A|\D)(\d{3})(\D|\z)/, captures: true)

This seems to work pretty well, but now in order to get at the actual episode number, we need to ask the regular expression match for a specific capture group index. Generally speaking, we prefer not to have to deal with capture groups unless we want to pull out multiple independent parts of the string at once. Let's see if we can simplify this.

/(\A|\D)(\d{3})(\D|\z)/.match("RubyTapas007-Constructors")[2] # => "007"
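
To make the bookkeeping concrete, here is a short sketch of what the match object holds for one of the titles. The local names pattern and match are mine. It also shows that String#[] accepts a capture index as a second argument, which saves a step but still depends on knowing which group is which.

```ruby
pattern = /(\A|\D)(\d{3})(\D|\z)/
match   = pattern.match("stats-150")

whole    = match[0]         # => "-150"  (includes the leading hyphen)
groups   = match.captures   # => ["-", "150", ""]
episode  = match[2]         # => "150"

# The same lookup in one step, still tied to group number 2.
shortcut = "stats-150"[pattern, 2]  # => "150"
```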

In order to do so, we turn to a regular expression feature called zero-width negative lookahead and lookbehind. First, we strip the regular expression back down to matching a simple triplet of digits. Then we add a group to the beginning of the pattern. A ? at the start of the group signals to Ruby that we will be using a special regex extension. A < tells it that the extension we want to use is the "lookbehind" feature. Next, a ! says to make this a negative lookbehind. We follow it up with our lookbehind pattern, for which we use \d to mean a single digit.

All this tells Ruby that the digit triplet it is trying to match must be preceded by something other than a digit. Whether that is a letter, some punctuation, or nothing at all doesn't matter—it simply mustn't be a digit.
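
Here is a minimal sketch of the lookbehind on its own, before the lookahead is added. It uses String#[] and String#scan, and the name lookbehind and the sample string "8675309" are just for illustration.

```ruby
lookbehind = /(?<!\d)\d{3}/

# Preceded by a letter: fine.
after_letters = "RubyTapas007"[lookbehind]    # => "007"

# Preceded by nothing at all: also fine, since "nothing" is not a digit.
at_start = "059-enumerator"[lookbehind]       # => "059"

# Without the lookbehind, a second triplet is found mid-number.
plain = "8675309".scan(/\d{3}/)               # => ["867", "530"]

# With it, every later triplet is rejected because a digit precedes it.
guarded = "8675309".scan(lookbehind)          # => ["867"]
```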

Next we bookend our regular expression with a similar zero-width negative lookahead. A group containing ?!\d tells Ruby that the triplet must not be followed by a digit.

show_matches(titles, /(?<!\d)\d{3}(?!\d)/, captures: true)

Trying this out, we see that this pattern is perfect: it matches episode number triplets, and nothing else. The "zero-width" part of the assertions we used means that they do not add to the final matched string at all, which is why we don't see extra characters before or after the matches. We've also done away with our unwanted capture groups.
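
As a sketch of the end result outside the helper, here is the pattern applied to every title with String#[]. The second half pokes at a single match to show that the lookarounds contribute nothing to the matched text. The variable names are mine.

```ruby
titles = [
  "RubyTapas007-Constructors",
  "059-enumerator",
  "stats-150",
  "123",
  "555-8675309-Jenny",
]
pattern = /(?<!\d)\d{3}(?!\d)/

# String#[] with a regexp returns the first match, or nil if there is none.
episode_numbers = titles.map { |title| title[pattern] }
# => ["007", "059", "150", "123", "555"]

match   = pattern.match("RubyTapas007-Constructors")
matched = match.to_s        # => "007"
before  = match.pre_match   # => "RubyTapas"
after   = match.post_match  # => "-Constructors"
groups  = match.captures    # => []  (lookarounds are not capture groups)
```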

Let's do one more thing before we finish. Regular expressions are a very dense language, and they can be very hard to decipher. In order to assist future readers of our code, let's make some changes to make this regex more self-documenting.

First, we'll assign it to a meaningful constant.

#<<show_matches>>
#<<titles>>
EPISODE_NUMBER_PATTERN = /(?<!\d)\d{3}(?!\d)/
show_matches(titles, EPISODE_NUMBER_PATTERN, captures: true)

Next, we'll put it into extended mode by appending an x after the closing slash. This frees us up to include whitespace and comments in the pattern. We'll take advantage of this freedom to break the expression down into parts, and include line-by-line commentary for each segment.

#<<show_matches>>
#<<titles>>

EPISODE_NUMBER_PATTERN = /
      (?<!\d)                   # not preceded by a digit
      \d{3}                     # exactly three digits in a row
      (?!\d)                    # not followed by a digit
/x
show_matches(titles, EPISODE_NUMBER_PATTERN, captures: true)
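
Here is a quick sketch to confirm that the extended form behaves the same as the compact one. It uses lowercase local names (compact, extended) of my own rather than redefining the constant. The last two lines show one thing to watch for in extended mode: whitespace in the pattern is ignored, so a literal space has to be escaped.

```ruby
titles = [
  "RubyTapas007-Constructors",
  "059-enumerator",
  "stats-150",
  "123",
  "555-8675309-Jenny",
]

compact  = /(?<!\d)\d{3}(?!\d)/
extended = /
  (?<!\d)   # not preceded by a digit
  \d{3}     # exactly three digits in a row
  (?!\d)    # not followed by a digit
/x

# Both forms pick out the same text from every title.
same_results = titles.all? { |title| title[compact] == title[extended] }  # => true

# In extended mode an unescaped space is dropped from the pattern...
space_ignored = /a b/x.match?("ab")     # => true
# ...so a literal space needs a backslash.
space_escaped = /a\ b/x.match?("a b")   # => true
```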

Ruby regular expressions are extraordinarily powerful. They can accomplish nearly any text-matching task we can dream of, usually more succinctly and efficiently than hand-coded string-processing code. But it's not always obvious how to write a regular expression to do what we want. It can take a lot of trial and error before we find an optimal solution.

I think it behooves any intermediate to advanced Ruby programmer to build a solid understanding of Ruby's regex capabilities. There are a few resources I'd recommend that can help you get a handle on this topic.

First, spend some time with the Regular Expressions section in the "Pickaxe" book, Programming Ruby 1.9 and 2.0.

Second, check out the site rubular.com. This site lets you play with Ruby regular expressions in a live, visual way. I've found it to be an indispensable tool when building regexes—including the ones used in this episode.

Finally, check out some of Nell Shamrell's excellent talks on Ruby regular expressions. I'll put some links to them in the show notes.

That's it for now. Happy hacking!
