In Progress
Unit 1, Lesson 21
In Progress

Backreference

Video transcript & code

Today I thought we'd talk about backreferences. Backreferences are a feature of regular expression processing which you may well already be familiar with. However, I want to talk about a particular usage of backreferences which is a bit less common.

But first, a quick review. Chances are, if you've used a backreference, you've used it in the context of a search and replace operation. For instance, let's write the beginnings of a rudimentary Pig Latin generator. We'll use gsub to find words and isolate the first letter separately from the rest of the word. Then, in the replacement string, we mangle the text. We do this using backreferences, specified as a backslash followed by a number. The gsub method recognizes these as referring to match groups from original regular expression. \1 refers to the first group, which is the first letter of the word. \2 refers to the second group, which contains the rest of each word.

"pig latin".gsub(/(\w)(\w+)/, "\\2\\1ay") # => "igpay atinlay"

Note that because this is a double-quoted string, we have to double up our backslashes. Ordinarily a double-quoted string interprets numbers preceded by backslashes as special escapes for Unicode codepoints. So in order for the backslashes to make it through all the way to the gsub method they have to be escaped with an extra slash. This is only true for double-quoted strings. If we had single-quoted this string we wouldn't have had to add any extra escaping.

OK, so this is how we use regular expression back-references in replacement text. But we can also use backreferences in the regular expression to be matched! And that's what I really want to talk about.

Let's say we are tasked with implementing an extremely simple compression algorithm. It deals only with alphabetic characters. The algorithm compresses repeated characters. So wherever the input string has a run of repeated characters, the output should contain a single character followed by a repetition count. For simplicity, even single characters should have a repetition count of 1 following them.

So given the input string "aabcccccaaa", the output should be "a2b1c5a3".

It seems like we could just use a regular expression to recognize runs of alphabetic characters. So we send gsub. We give it a regex which will recognize one or more alphabetic characters in a row, making use of the special alpha character class. Then we supply a block, which will catch the matched string, use its first character, and then follow that up with the length of the entire match.

"aabcccccaaa".gsub(/[[:alpha:]]+/) { |m|
  "#{m[0]}#{m.length}"
}
# => "a11"

But of course this doesn't do what we need. It simply recognizes the entire string as a single run of alphabetic characters. What we end up with is "a" followed by the length of the whole input string.

What we really need to do is to tell Ruby to look for an alphabetic character followed by the same character, zero or more times.

And this is where backreferences come to the rescue. We modify our regex to put a group around the first character matched. And then we immediately reference the group we just defined, using \1. We follow that up with a *, meaning zero-or-more.

What we've just written is a recipe for finding some alphabetic character, followed by the same character, whatever it happened to be, zero or more times. When we run the code, we can see that we now get the result we were looking for.

"aabcccccaaa".gsub(/([[:alpha:]])\1+/) { |m|
  "#{m[0]}#{m.length}"
}
# => "a2bc5a3"

Another way we can put backreferences to use is for recognizing quoted sections in a string. Let's say we want to find all the sequences of text that have either single quotes, double quotes, or backticks around them.

In order to do this, we'll use the scan method on String. Our regex to scan for will start with a master group to contain the entire match. We'll dig into why this is necessary in a minute.

Then we'll start an inner group. This group will match any one of a double quote, a single quote, or a backtick. After the opening quote character, we are looking for the contents of the quotation. So we use a negated character class to say that we want any number of characters that are not the quote character - whichever quote character was matched by the preceding group.

We use \2 here to reference the second capture group. If you're ever confused about which number corresponds to which capture group, there is a very easy rule: just count opening parenthesies from left to right. The group we want to reference corresponds to the second open paren in the regex, so we specify \2.

Once we have matched the contents of the quotation, we need to match the quote character that terminates the quotation. For this, we again backreference the capture group that bound to the open quote.

When we execute this code, we see that the result is an array of arrays. When scan finds capture groups in the given regex, it delivers its results in the form of an array of arrays. Each nested array corresponds to a single match, and each element in it represents one of the capture groups. Now we can see why we needed a "master" capture group around the whole expression - otherwise, scan would not have returned the entire quoted string as one of its results for each match.

As it is, for each found quotation we now have an element for the whole match, and an element containing just the quote character which was used in that case.

And that, in a nutshell, is how we can put backreferences to use inside regular expressions. Happy hacking!

%{This string has "various" 'kinds of' `quotation`}.
  scan(/((["'`])[^\2]*\2)/)
# => [["\"various\"", "\""], ["'kinds of'", "'"], ["`quotation`", "`"]]

Responses