Unit 1, Lesson 21

Extended Regex Syntax

When I’m writing scripts for these videos, I annotate them with shot numbers. They look something like this.

A Markdown file snippet, with video production comments

Each shot is denoted with a Markdown comment and a special shot numbering convention, possibly followed by some guidance about how to record the shot.

But sometimes I mess up the numbering....

...and sometimes I leave it out entirely.

# Some RubyTapas Script

<!-- shot(1) open on a blank editor -->

blah blah blah blah...

<!-- shot(3) -->

yadda yadda yadda 

<!-- shot() -->

````
this is code
````

In conclusion, pie is good.

Fortunately, it’s very easy to renumber these shots using a short Ruby script.

This is the sort of thing Ruby is great at.

As you might expect, this script uses a regular expression to match the shot markers.

shot_number_pattern = /(?<=\<!--\s)shot\(\d*\)(?=\s.*-->)/

ARGF.each_line.each_with_object({shot_number: 0}) do 
  |line, state|
  puts line.sub(shot_number_pattern) {
    "shot(#{state[:shot_number] += 1})"
  }
end

This pattern uses lookahead and lookbehind specifiers to find shot numbering comments. These ensure that the regex will only match shot markers within single-line HTML-style comments. But the portion of the string actually returned by the match will be just the shot marker itself, such as shot(3), without the comment delimiters.
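To see what the lookarounds buy us, here is a minimal sketch that applies the same pattern to two made-up lines. The constant name SHOT_PATTERN and both sample strings are assumptions for illustration only.

```ruby
# The same regex as in the script above, stored in a constant for this sketch.
SHOT_PATTERN = /(?<=\<!--\s)shot\(\d*\)(?=\s.*-->)/

# Hypothetical sample lines, not taken from a real script.
commented = "<!-- shot(3) pan left -->"
bare      = "remember to redo shot(3) later"

p commented[SHOT_PATTERN]  # => "shot(3)" (delimiters are checked, not captured)
p bare[SHOT_PATTERN]       # => nil (no enclosing comment, so no match)
```

Because the comment delimiters never become part of the match, the call to sub in the script only has to supply a new shot marker.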

This works great. But it makes for a regular expression that is rather hard to read.

There is a way in Ruby to make this regular expression a little easier to digest.

If we put the ‘x’ modifier at the end of the expression, Ruby will parse it as an extended regular expression.

What does this give us?

Well, we can now split the regex across multiple lines!

With the addition of the extended flag, Ruby now ignores any whitespace we add to the pattern, like newlines and spaces. That means this version of the regex is exactly equivalent to the old one!

shot_number_pattern = /
(?<=\<!--\s)
shot\(\d*\)
(?=\s.*-->)
/x
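If you want to convince yourself of that, a quick spot check might look like the following. It is a sketch rather than a proof: the constant names and the sample lines are made up for this example.

```ruby
ONE_LINE = /(?<=\<!--\s)shot\(\d*\)(?=\s.*-->)/
EXTENDED = /
(?<=\<!--\s)
shot\(\d*\)
(?=\s.*-->)
/x

# Hypothetical sample lines: two valid markers and two that should not match.
SAMPLES = [
  "<!-- shot(1) open on a blank editor -->",
  "<!-- shot() -->",
  "shot(2) outside of any comment",
  "<!--shot(4) -->",
]

# Print what each version of the pattern matches, side by side.
SAMPLES.each do |line|
  puts "#{line.inspect}: #{line[ONE_LINE].inspect} / #{line[EXTENDED].inspect}"
end
```

Both columns should agree on every line, with nil for the two lines that are not well-formed shot comments.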

By itself this doesn’t do much to clarify the pattern. But now let’s add a second benefit of extended regular expressions: comments!

We can add ordinary Ruby comments at the end of each line, clarifying each part of the expression.

shot_number_pattern = /
(?<=\<!--\s) # Must be preceded by '<!--' and whitespace
shot\(\d*\)  # e.g. 'shot(23)' or 'shot()'
(?=\s.*-->)  # Must be followed by whitespace and a closing '-->'
/x

Now our complex pattern comes with inline documentation! Of course, it’s up to us to keep this documentation in sync with the pattern as we modify it over time.
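The documented pattern drops straight into the renumbering logic. Here is a sketch that wraps that logic in a method so it can be tried on a string instead of ARGF. The method name renumber_shots, the constant name, and the sample text are assumptions made for this example, not part of the original script.

```ruby
COMMENTED_PATTERN = /
(?<=\<!--\s) # Must be preceded by '<!--' and whitespace
shot\(\d*\)  # e.g. 'shot(23)' or 'shot()'
(?=\s.*-->)  # Must be followed by whitespace and a closing '-->'
/x

# Hypothetical helper: renumbers shot markers in a string, line by line.
def renumber_shots(text)
  shot_number = 0
  text.each_line.map { |line|
    # Lines without a shot marker pass through sub unchanged.
    line.sub(COMMENTED_PATTERN) { "shot(#{shot_number += 1})" }
  }.join
end

script = <<~MARKDOWN
  <!-- shot(1) open on a blank editor -->
  blah blah blah blah...
  <!-- shot(3) -->
  yadda yadda yadda
  <!-- shot() -->
MARKDOWN

puts renumber_shots(script)
```

With this input, the misnumbered shot(3) and the empty shot() should come out as shot(2) and shot(3).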

Something you might wonder about extended regular expressions is: how do we explicitly match against whitespace characters, since Ruby is now ignoring them in the pattern?

One way to do that is with the usual regular expression character class escapes. For instance, this pattern uses the \s escape to mean “any whitespace character”.

It’s also possible to escape actual whitespace characters to make them “literal”.

For instance, here’s how we might change the pattern to require a single space character between the opening comment delimiter and the shot specifier.

shot_number_pattern = /
(?<=\<!--\ ) # Must be preceded by '<!--' and a single space
shot\(\d*\)  # e.g. 'shot(23)' or 'shot()'
(?=\s.*-->)  # Must be followed by whitespace and a closing '-->'
/x
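To make the difference between these three kinds of whitespace concrete, here is a small sketch using cut-down patterns. The constant names are invented for this example.

```ruby
IGNORED_SPACE = /<!-- shot/x    # bare space is dropped: same as /<!--shot/
ESCAPED_SPACE = /<!--\ shot/x   # backslash-space matches one literal space
ANY_SPACE     = /<!--\sshot/x   # \s matches a space, a tab, a newline, etc.

p IGNORED_SPACE.match?("<!-- shot")   # => false
p IGNORED_SPACE.match?("<!--shot")    # => true
p ESCAPED_SPACE.match?("<!-- shot")   # => true
p ESCAPED_SPACE.match?("<!--\tshot")  # => false
p ANY_SPACE.match?("<!--\tshot")      # => true
```

The first pair is the one that tends to surprise people: in extended mode, a space typed directly into the pattern matches nothing at all.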

As you can see, extended regular expressions don’t do anything to clarify the regular expression mini-language. But they do enable us to break out complex patterns onto multiple lines with inline documentation. And that’s a win for readability. Happy hacking!
