Unit 1, Lesson 1

Reconsidering Regexen

Video transcript & code

Pop quiz: how do you verify that an IP address is valid?

If you're anything like me, the first answer that pops into your head is "use a regex!".

So, OK, let's write a regex. Let's see, assuming we are only dealing with IPv4, we need four groups of digits, separated by periods.

Apart from a mild case of leaning toothpicks, this isn't so bad. And it matches an IP address just fine.

Unfortunately, it also matches a bad IP address where one of the quads is way too big.

r = /\d+\.\d+\.\d+\.\d+/
r =~ "128.0.0.1"                # => 0
r =~ "128.1000.0.1"             # => 0

OK, so let's limit the number of digits in each quad to between one and three. This rejects our second example. But it still matches a bad address with too many digits in the first position, because the match simply starts one character in.

r = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
r =~ "128.0.0.1"              # => 0
r =~ "128.1000.0.1"           # => nil
r =~ "1280.0.0.1"             # => 1

That's easy enough to fix. We just need to anchor the regex at the beginning and end so that nothing is allowed outside of the pattern.

This handles our first three examples with aplomb. But it also accepts a bad address where the last number is 256, which is out of range.

r = /\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\z/
r =~ "128.0.0.1"              # => 0
r =~ "128.1000.0.1"           # => nil
r =~ "1280.0.0.1"             # => nil
r =~ "128.0.0.256"            # => 0
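
A side note on the anchors: in Ruby, ^ and $ match at the start and end of any line, not of the whole string, which is why the pattern above uses \A and \z instead. Here is a quick sketch of the difference. The variable names and the sample input are invented for this illustration.

```ruby
# Same pattern, two anchoring styles (names are hypothetical).
line_anchored   = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
string_anchored = /\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\z/

# A two-line string whose second line happens to look like an address.
input = "not an address\n128.0.0.1"

line_anchored =~ input     # => 15
string_anchored =~ input   # => nil
```

With line anchors, any input that contains an address-shaped line would pass validation, junk and all.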

How do we tighten our regex up even further? This is a deceptively difficult task. It's fine if there's a 6, 7, 8, or 9 in the third digit of a quad, so long as the first two digits are low enough. But it's not OK if the first digit is a 2 and the second digit is a 5.

"128.0.0.129"            # OK
"123.0.0.259"            # not OK

OK, this isn't fun anymore. Maybe regexen aren't the right tool for this job.
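
For the record, it can be done. Here is a rough sketch of what a range-checking pattern might look like, with each quad spelled out as a set of alternatives. The octet and ipv4 names are made up for this example, and this is one possible formulation rather than a recommended one.

```ruby
# One quad, 0 through 255, built from four alternatives:
#   25[0-5]    250-255
#   2[0-4]\d   200-249
#   1\d\d      100-199
#   [1-9]?\d   0-99 (no leading zero)
octet = /25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d/

# Four quads separated by periods, anchored to the whole string.
# Interpolating a Regexp wraps it in its own group, so the
# alternation inside octet stays contained.
ipv4 = /\A#{octet}(?:\.#{octet}){3}\z/

ipv4 =~ "128.0.0.129"   # => 0
ipv4 =~ "123.0.0.259"   # => nil
ipv4 =~ "128.0.0.256"   # => nil
```

That is a lot of pattern to read and maintain for what amounts to a simple numeric range check.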

It turns out, there's a much easier way to do this. We can require the ipaddr standard library, and use it to validate IP addresses. We'll write a little helper method to make it easier. It tries to instantiate an IPAddr object, and returns true if it succeeds. If the instantiation raises an exception, it returns false.

require "ipaddr"

def valid_ipaddr?(addr)
  IPAddr.new(addr)
  true
rescue IPAddr::InvalidAddressError
  false
end

Let's try this validator method on our example addresses. As we step through each example, we can see that the first address is valid, but the rest fail the test.

require "./validators"

valid_ipaddr?("128.0.0.1")      # => true
valid_ipaddr?("128.1000.0.1")   # => false
valid_ipaddr?("1280.0.0.1")     # => false
valid_ipaddr?("128.0.0.256")    # => false
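
One caveat worth knowing: IPAddr parses IPv6 addresses too, so valid_ipaddr? returns true for something like "::1". If, as assumed at the start, only IPv4 should pass, a variation along these lines might do. The valid_ipv4? name is hypothetical; ipv4? is a predicate that IPAddr itself provides.

```ruby
require "ipaddr"

# Hypothetical helper: true only when the string parses as an
# IPv4 address. Unparseable input still falls through to the rescue.
def valid_ipv4?(addr)
  IPAddr.new(addr).ipv4?
rescue IPAddr::InvalidAddressError
  false
end

valid_ipv4?("128.0.0.1")     # => true
valid_ipv4?("::1")           # => false
valid_ipv4?("128.0.0.256")   # => false
```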

I got this idea to use Ruby's built-in IPAddr class from a reply Ryan Davis posted to the ruby-talk mailing list. And it got me thinking about other text formats I often find myself wanting to validate.

An obvious one is a URL. We can use Ruby's built-in URI class for this. We have to be a bit careful, though: URI is pretty forgiving, because URIs come in all shapes and sizes. The word "bacon" is a perfectly valid relative URI.

On the other hand, a URI with backslashes instead of forward slashes is not OK.

require "uri"
URI("bacon")                    # => #<URI::Generic:0x000000017b5b00 URL:bacon>
URI("http:\\\\rubytapas.com")   # => 
# ~> /home/avdi/.rvm/rubies/ruby-2.1.0/lib/ruby/2.1.0/uri/common.rb:176:in `split': bad URI(is not URI?): http:\\rubytapas.com (URI::InvalidURIError)
# ~>    from /home/avdi/.rvm/rubies/ruby-2.1.0/lib/ruby/2.1.0/uri/common.rb:211:in `parse'
# ~>    from /home/avdi/.rvm/rubies/ruby-2.1.0/lib/ruby/2.1.0/uri/common.rb:747:in `parse'
# ~>    from /home/avdi/.rvm/rubies/ruby-2.1.0/lib/ruby/2.1.0/uri/common.rb:1232:in `URI'
# ~>    from -:3:in `<main>'

If we want to validate fully-qualified HTTP URLs, we can check that no exception is raised, and that the scheme is either http or https.

require "uri"

URI("https://rubytapas.com").scheme # => "https"
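
Putting those two checks together, a validator in the same style as the IP address helper might look like the sketch below. The valid_http_url? name is an assumption made for this example; URI.parse and URI::InvalidURIError come from the standard library.

```ruby
require "uri"

# Hypothetical helper: true when the string parses as a URI and
# its scheme is http or https.
def valid_http_url?(str)
  uri = URI.parse(str)
  %w[http https].include?(uri.scheme)
rescue URI::InvalidURIError
  false
end

valid_http_url?("https://rubytapas.com")       # => true
valid_http_url?("bacon")                       # => false
valid_http_url?("ftp://rubytapas.com")         # => false
valid_http_url?("http:\\\\rubytapas.com")      # => false
```

Note that this only checks the scheme. A stricter version could also require that the parsed URI has a host.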

Let's look at one last example. Another common target for validation is email addresses. This is another surprisingly difficult problem to solve with just a regex. Here's the regex that my copy of the Regular Expressions Cookbook recommends for the most robust validation of email addresses:

/\A[\w!#$%&'*+\/=?`{|}~^-]+(?:\.[\w!#$%&'*+\/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z/i

That book also has this to say about the problem:

If you thought something as conceptually simple as validating an email address would have a simple one-size-fits-all regex solution, you’re quite wrong.

Instead of using a regex, let's again turn to a library. This time, we'll use the Mail gem, by Mikel Lindsaar.

We can use the Mail::Address class to verify that an email address appears valid. When I feed it a simple, valid address it returns an object. But when I leave off the local part, it raises an exception.

require "mail"

Mail::Address.new("avdi@avdi.org")
# => #<Mail::Address:23091740 Address: |avdi@avdi.org| >

Mail::Address.new("@avdi.org")
# ~> /home/avdi/.rvm/gems/ruby-2.1.0/gems/mail-2.6.1/lib/mail/parsers/address_lists_parser.rb:14:in `parse': Mail::AddressList can not parse |@avdi.org| (Mail::Field::ParseError)
# ~> Reason was: Only able to parse up to @avdi.org
# ~>    from /home/avdi/.rvm/gems/ruby-2.1.0/gems/mail-2.6.1/lib/mail/elements/address.rb:186:in `parse'
# ~>    from /home/avdi/.rvm/gems/ruby-2.1.0/gems/mail-2.6.1/lib/mail/elements/address.rb:30:in `initialize'
# ~>    from -:4:in `new'
# ~>    from -:4:in `<main>'

The nice thing about using the Mail gem is that it handles advanced email address syntax such as display names and comments.

require "mail"

Mail::Address.new("Avdi Grimm <avdi@avdi.org>")
# => #<Mail::Address:26802620 Address: |Avdi Grimm <avdi@avdi.org>| >

Mail::Address.new("Avdi Grimm (Personal) <avdi@avdi.org>")
# => #<Mail::Address:26800020 Address: |Avdi Grimm <avdi@avdi.org> (Personal)| >

The moral of today's story is that while regexen are wonderfully powerful, they aren't the only tool in the box when it comes to validating common forms of structured strings. Sometimes it's easier to use a purpose-built class to check a string's validity than it is to come up with a robust regex for the job.

Happy hacking!
