Composed Regex

Regular expressions can quickly grow to be large, dense, and impenetrable. In today’s episode you’ll learn how Ruby helps you compose them out of smaller, manageable pieces!

Recently we explored how regular expressions in Ruby can be clarified with the extended regular expression syntax.

shot_number_pattern = /
  (?<=\<!--\s)   # Must be preceded by '<!--'
  shot\(\d*\)    # e.g. 'shot(23)' or 'shot()'
  (?=\s.*-->)    # must be followed by '-->'
/x

ARGF.each_line.each_with_object({shot_number: 0}) do |line, state|
  puts line.sub(shot_number_pattern) {
    "shot(#{state[:shot_number] += 1})"
  }
end

This syntax enables us to split patterns across multiple lines and annotate them with inline comments. Today we’re going to explore an alternative way to clarify regular expressions… a technique to break them into discrete parts, and make those parts self-documenting.

You’re probably familiar with string interpolation in Ruby. As it turns out, interpolation isn’t limited to just strings. We can also use it in regular expression literals!

Here’s a pattern to match one of two different food ingredients. We can use it in a larger regular expression by interpolating it in!

ingredient_pattern = /bacon|tofu/
food_pattern = /chunky #{ingredient_pattern}/
# => /chunky (?-mix:bacon|tofu)/

You might be wondering what’s up with the odd extra syntax around the interpolated-in part of the regex.

Here’s the thing… Ruby regular expressions have certain flags attached to them, which change their semantics.

For instance, you might know that adding the i flag makes a regular expression case-insensitive. If we add this flag to the ingredient pattern, it can match strings with capital letters as well.

ingredient_pattern = /bacon|tofu/i
ingredient_pattern.match("BACON!!!")
# => #<MatchData "BACON">

But how can we mix a regular expression with certain flags enabled into a larger expression that might not have the same flags turned on?

Check out how ingredient_pattern is interpolated into the final food_pattern.

food_pattern = /chunky #{ingredient_pattern}/
# => /chunky (?i-mx:bacon|tofu)/

It’s in a parenthesized group. The group is prefixed with a question mark followed by an i, then a dash and the letters mx, and then a colon.

This indicates that for the interpolated-in part of the regular expression, the case-insensitive flag is turned on, while the “multiline” and “extended” flags are turned off. That’s how Ruby mixes regular expressions together without losing their individual flags: it converts them into a special sub-group in the final expression that has the flags explicitly set or not set.
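
To see the effect of those preserved flags, here is a minimal sketch that reuses the ingredient and food patterns from above. The sample strings are made up for illustration. It checks that only the interpolated part ignores case, while the literal text around it stays case-sensitive.

```ruby
ingredient_pattern = /bacon|tofu/i
food_pattern = /chunky #{ingredient_pattern}/

# Regexp#to_s yields the flag-annotated group that interpolation embeds.
ingredient_pattern.to_s
# => "(?i-mx:bacon|tofu)"

# The interpolated part ignores case...
food_pattern.match?("chunky BACON")
# => true

# ...but the surrounding literal text is still case-sensitive.
food_pattern.match?("CHUNKY bacon")
# => false
```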

When we apply the expression to a string, we can see that it successfully matches.

food_pattern.match("I like chunky tofu")
# => #<MatchData "chunky tofu">

Now let’s go back to our original regular expression.

We can use the interpolation technique to extract out pieces of a regular expression pattern into smaller regexes… like the HTML comment open delimiter.

And the comment-close delimiter.

And the pattern that identifies the syntax for a particular video shot.

comment_open_pattern    = /\<!--\s/
shot_descriptor_pattern = /shot\(\d*\)/
comment_close_pattern   = /\s.*-->/

shot_number_pattern = /
  (?<=#{comment_open_pattern})
  #{shot_descriptor_pattern}
  (?=#{comment_close_pattern})
/x

ARGF.each_line.each_with_object({shot_number: 0}) do |line, state|
  puts line.sub(shot_number_pattern) {
    "shot(#{state[:shot_number] += 1})"
  }
end

The result is a “composed regular expression”: a regex that is made up of smaller regexes. In the process of creating it, we’ve preserved the information that used to be expressed with inline comments by giving the sub-parts of the expression meaningful variable names.
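
To try the composed pattern without piping a file through ARGF, here is a small sketch that runs it over an in-memory array instead. The `lines` array and the `renumbered` variable are hypothetical stand-ins, not part of the episode's script.

```ruby
comment_open_pattern    = /\<!--\s/
shot_descriptor_pattern = /shot\(\d*\)/
comment_close_pattern   = /\s.*-->/

shot_number_pattern = /
  (?<=#{comment_open_pattern})
  #{shot_descriptor_pattern}
  (?=#{comment_close_pattern})
/x

# Hypothetical sample input standing in for the lines ARGF would supply.
lines = [
  "<!-- shot() intro -->",
  "Some narration text",
  "<!-- shot(7) closing -->"
]

# Renumber each shot descriptor sequentially; other lines pass through as-is.
state = { shot_number: 0 }
renumbered = lines.map do |line|
  line.sub(shot_number_pattern) { "shot(#{state[:shot_number] += 1})" }
end

puts renumbered
# <!-- shot(1) intro -->
# Some narration text
# <!-- shot(2) closing -->
```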

You might notice that I haven’t extracted the lookahead and lookbehind assertion operators in this composed regular expression.

That’s because these assertions need to occur in a particular position in the expression. They don’t represent easily relocatable elements of the overall pattern. Because of this, it doesn’t make as much sense to extract them.
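
One way to see why the assertions are tied to their position: they are zero-width, so they constrain where the shot descriptor may match without becoming part of the match. The sketch below compares the pattern with and without the assertions. The sample line and both variable names are made up for illustration.

```ruby
# Hypothetical sample line.
line = "<!-- shot(23) wide angle -->"

with_lookarounds    = /(?<=\<!--\s)shot\(\d*\)(?=\s.*-->)/
without_lookarounds = /\<!--\sshot\(\d*\)\s.*-->/

# With assertions, only the shot descriptor is matched (and so replaced by sub).
line[with_lookarounds]
# => "shot(23)"

# Without them, the comment delimiters are consumed as part of the match.
line[without_lookarounds]
# => "<!-- shot(23) wide angle -->"
```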

Regular expressions have a dense syntax that can be incomprehensible above a certain size. But Ruby provides tools to make them more approachable: both extended regexes and interpolation. With care, Ruby regexes can be both powerful and self-explanatory. Happy hacking!
