
Regexp Union

In Episode #190, we talked a little about how to create a regular expression that would match any of a list of words. Alert viewer Myron Marston pointed out that I had missed a chance in that episode to demonstrate a useful feature of the Ruby Regexp class. Today I want to remedy that oversight.

Let's say we're searching for certain patterns in Ruby files. We have a list of these patterns in the form of strings.

patterns = [
  ".each do |",
  ".each {|",
]

I chose Ruby code patterns as the example because they present an interesting problem when composing regular expressions. Many of the characters used in these patterns have special meaning in a regular expression. If we were to write them as regex literals, we would have to do a lot of escaping: periods, pipe characters, and curly braces all need to be escaped. We end up with what Perl programmers used to call "leaning toothpick syndrome".

patterns = [
  /\.each do \|/,
  /\.each \{\|/,
]

In Episode #190 we briefly talked about Regexp.escape, which can automatically add the appropriate escape sequences to a string to make it safe for use as a regex.

Regexp.escape(".each do |")
# => "\\.each\\ do\\ \\|"
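To see why the escaping matters, here is a minimal sketch that compares an escaped pattern with an unescaped one. The literal_regex helper and the sample lines are made up for illustration; they are not from the episode.

```ruby
# Hypothetical helper: builds a regex that matches +text+ literally.
def literal_regex(text)
  Regexp.new(Regexp.escape(text))
end

# Made-up sample lines for illustration.
puts literal_regex(".each do |").match?("items.each do |item|")  # prints true
puts literal_regex(".each do |").match?("items_each do item")    # prints false

# Without escaping, "." matches any character and the trailing "|"
# adds an empty alternative, so the regex matches almost anything.
puts Regexp.new(".each do |").match?("items_each do item")       # prints true
```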

But when we want to combine many strings into a single regular expression, we can use Regexp.union instead. When we pass our list of patterns to it, it constructs a single regex object which will match any of the patterns in the list.

patterns = [
  ".each do |",
  ".each {|",
]
Regexp.union(patterns)
# => /\.each\ do\ \||\.each\ \{\|/
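As a rough sketch of how the combined regex might be put to work, the snippet below picks out the lines of a source string that match any of the patterns. The matching_lines helper and the sample source are assumptions made for this example.

```ruby
# Hypothetical helper: returns the lines of +source+ that match any pattern.
def matching_lines(source, patterns)
  matcher = Regexp.union(patterns)
  source.each_line.map(&:chomp).grep(matcher)
end

# Made-up Ruby source to search through.
sample = <<~SOURCE
  items.each do |item|
    puts item
  end
  names.each {|name| puts name }
  total = items.size
SOURCE

puts matching_lines(sample, [".each do |", ".each {|"])
# prints:
# items.each do |item|
# names.each {|name| puts name }
```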

Regexp.union turns out to be pretty flexible. We can also pass it a list of pattern strings as separate arguments.

Regexp.union(".each do |", ".each {|")
# => /\.each\ do\ \||\.each\ \{\|/

Even cooler, we can mix and match strings and regular expression objects, and it will join them all into a single regex.

Regexp.union(".each do |", ".each {|", /for .* in .*/)
# => /\.each\ do\ \||\.each\ \{\||(?-mix:for .* in .*)/
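One way to check the difference in treatment is a small predicate like the one below. It is only a sketch: loop_construct? is a hypothetical name, and the sample lines are invented. The strings are matched literally, while the regex argument keeps its wildcards.

```ruby
# Hypothetical predicate: does +line+ contain one of the looping constructs?
def loop_construct?(line)
  matcher = Regexp.union(".each do |", ".each {|", /for .* in .*/)
  line.match?(matcher)
end

puts loop_construct?("for item in items")   # prints true: ".*" is still a wildcard
puts loop_construct?("items.each {|i| i }") # prints true: literal string match
puts loop_construct?("items_each {|i| i }") # prints false: the "." from the string is literal
```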

You might notice that the regular expression literal we passed into this invocation of Regexp.union got some special treatment: it was wrapped in a group that spells out its flags. This is because regular expression objects always have flags associated with them, and Regexp.union preserves those flags. So, for instance, here's a union of a pattern that searches for my name in a case-insensitive way, and another one which searches for the string "RubyTapas" in the default case-sensitive way. Regexp.union constructs a regular expression which combines those two patterns while using inline option groups to preserve the differing options of the two regexen.

Regexp.union(/Avdi/i, /RubyTapas/)
# => /(?i-mx:Avdi)|(?-mix:RubyTapas)/
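A quick way to confirm that each branch keeps its own flags is to try a few inputs. This is a minimal sketch; the mentions? helper is a made-up name used only here.

```ruby
# Hypothetical predicate built on the union shown above.
def mentions?(text)
  text.match?(Regexp.union(/Avdi/i, /RubyTapas/))
end

puts mentions?("avdi")       # prints true: the first branch ignores case
puts mentions?("RubyTapas")  # prints true: exact case matches the second branch
puts mentions?("rubytapas")  # prints false: the second branch is case-sensitive
```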

So next time you have a list of different keywords to search for, let Ruby make your job easier. Use Regexp.union to combine them all into one master regex which will efficiently match any of them. Happy hacking!
