String Cleaning with Avdi
Not to be confused with Spring Cleaning, string cleaning is all about taking strings of text and removing unwanted characters from them. This is something I find myself needing to do on a regular basis, and I thought I’d show you some of the tools Ruby offers to make it easier.
Video transcript & code
Let's start with a messy string that needs to be cleaned up.
s = " Paleta De Mango - Kölsch \n(Rahr & Sons) "
This string identifies a craft-brewed beer. Now, let's say we wanted to make a version of this string that is suitable for use in a URL. This type of simplified string is often referred to as a "slug".
There are a number of issues with this string from the point of view of using it for an identifying slug.
1. It has extra whitespace at the beginning and end.
2. It contains punctuation characters, such as parentheses and an ampersand.
3. It contains a non-ASCII character, namely an "o" with an umlaut in the word "Kölsch".
Let's create some robust and reliable code for transforming strings into URL-compatible slugs.
One step I usually like to take at the very beginning is to downcase the string.
s = s.downcase # => " paleta de mango - kölsch \n(rahr & sons) "
This ensures that in any later transformations, we don't need to worry about whether the operation is case-sensitive or not.
Another quick and easy win is to strip off all leading and trailing whitespace.
s = s.strip # => "paleta de mango - kölsch \n(rahr & sons)"
If we want a slug that contains only valid ASCII characters, we have a tricky task ahead of us. Fortunately, if we're willing to delegate to a RubyGem, it's a task that has already been tackled for us.
With a little bit of preparatory setup, the i18n gem lets us easily replace non-ASCII characters, like the accented "o" in the word "kölsch", with their closest ASCII equivalents.
require "i18n"
I18n.config.available_locales = :en
s = I18n.transliterate(s) # => "paleta de mango - kolsch \n(rahr & sons)"
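If pulling in a gem isn't an option, core Ruby's Unicode normalization can approximate the same step. The following is only a sketch and not how the i18n gem works internally; `strip_diacritics` is a hypothetical helper name introduced here for illustration.

```ruby
# Hypothetical helper (not from the episode): approximate transliteration
# using only core Ruby, as a stand-in for I18n.transliterate.
def strip_diacritics(text)
  # NFD splits "ö" into a plain "o" followed by a combining diaeresis (U+0308).
  decomposed = text.unicode_normalize(:nfd)
  # \p{Mn} matches nonspacing combining marks, which we then drop.
  decomposed.gsub(/\p{Mn}/, "")
end

strip_diacritics("kölsch") # => "kolsch"
```

Characters that have no decomposition, such as "ø", pass through this helper unchanged, so it covers less ground than a real transliteration table.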
Now it's time to get rid of extra whitespace inside of the string.
We might try using the String#delete method to get rid of extra newlines and spaces.
s.delete("\n ") # => "paletademango-kolsch(rahr&sons)"
But this results in words being smashed together. Instead, we could use String#tr to translate each newline or space into a dash.
s.tr("\n ", "-") # => "paleta-de-mango---kolsch--(rahr-&-sons)"
This seems better, but it results in long runs of dashes when one would be sufficient.
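To see why both the newline and the space turned into dashes, it may help to look at how `tr` pairs up its two arguments. This is a small illustrative sketch; the sample strings are made up for the example.

```ruby
# Characters map position by position: "e" -> "i", "l" -> "p".
"hello".tr("el", "ip") # => "hippo"

# When the replacement set is shorter than the source set, its last
# character is reused for the remaining source characters.
words = "paleta de\nmango"
words.tr("\n ", "-")  # => "paleta-de-mango"

# With two replacement characters, newline and space map separately.
words.tr("\n ", "_-") # => "paleta-de_mango"
```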
A handy shortcut for the string deletion and translation methods in Ruby is that we can specify a range of characters to target by giving just the starting and ending characters of the set, with a dash in the middle.
s.delete("a-z0-9") # => "   -  \n( & )"
In this case we've told the string to delete all characters between "a" and "z", and all digits.
By itself this doesn't really help us. But we can also tell these methods to negate a range by starting with the caret character.
s.delete("^a-z0-9") # => "paletademangokolschrahrsons"
This deletes everything other than the characters specified in the range.
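The same range and caret syntax is shared by a few other String methods, including `count`. Here is a short sketch of the rules, using a made-up sample string:

```ruby
label = "rahr & sons 2024"

label.count("a-z")      # => 8  (lowercase letters only)
label.count("^a-z0-9")  # => 4  (everything except letters and digits)
label.delete("^a-z0-9") # => "rahrsons2024"

# A dash at the start or end of the set is a literal dash, not a range.
"a-z".delete("-")          # => "az"
# A caret anywhere but the start of the set is a literal caret.
"2^8 ok".delete("0-9^")    # => " ok"
```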
Let's supply this negated range to the tr method, and tell it to replace all the matches with a dash.
s.tr("^a-z0-9", "-") # => "paleta-de-mango---kolsch---rahr---sons-"
Now we're getting somewhere! But the runs of more than one dash in a row seem excessive. It would be nice if we could shrink them down to just a single dash. That is exactly what the String#tr_s method does.
s.tr_s("^a-z0-9", "-") # => "paleta-de-mango-kolsch-rahr-sons-"
In this case, the "s" stands for "squash". As in, squash all runs of multiple dashes down to just one.
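For comparison, here is a sketch of how `tr_s` relates to two alternatives you might reach for: a regular expression with `gsub`, and `tr` followed by `squeeze`. For this particular input the `gsub` version gives the same result, while the `squeeze` version differs in a subtle way on strings that already contain repeated dashes.

```ruby
s = "paleta de mango - kolsch \n(rahr & sons)"

s.tr_s("^a-z0-9", "-")    # => "paleta-de-mango-kolsch-rahr-sons-"
s.gsub(/[^a-z0-9]+/, "-") # => "paleta-de-mango-kolsch-rahr-sons-"

# tr_s only squeezes characters it translated; untouched runs survive.
"a--b  c".tr_s(" ", "-")            # => "a--b-c"
# squeeze collapses every run of dashes, including pre-existing ones.
"a--b  c".tr(" ", "-").squeeze("-") # => "a-b-c"
```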
A single remaining irritant is that this leaves a trailing dash at the end.
One way to tackle this issue is to move the tr_s invocation earlier in the process, replacing any unwanted characters with a single space.
Then, with all whitespace and other non-alphanumeric characters collapsed to a single space, strip takes care of leading or trailing whitespace.
Finally, we replace the spaces with dashes.
s = " Paleta De Mango - Kölsch \n(Rahr & Sons) "
s = s.downcase # => " paleta de mango - kölsch \n(rahr & sons) "
require "i18n"
I18n.config.available_locales = :en
s = I18n.transliterate(s) # => " paleta de mango - kolsch \n(rahr & sons) "
s = s.tr_s("^a-z0-9", " ") # => " paleta de mango kolsch rahr sons "
s = s.strip # => "paleta de mango kolsch rahr sons"
s = s.tr(" ", "-") # => "paleta-de-mango-kolsch-rahr-sons"
The result is a nice clean slug, ready for use as an identifier or in URLs.
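The steps above could be wrapped up in a single method, roughly like the sketch below. The names `slugify` and `ASCII_FALLBACK`, and the `transliterate:` keyword, are hypothetical and introduced only for this example. To keep the sketch free of gem dependencies, the default transliteration step uses core Ruby's Unicode normalization as a rough stand-in for the i18n gem.

```ruby
# Assumption of this sketch: a gem-free fallback that decomposes accented
# characters (NFD) and drops the combining marks. It is not the i18n gem.
ASCII_FALLBACK = ->(text) { text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "") }

# Hypothetical helper: downcase, transliterate, collapse everything that
# isn't a-z or 0-9 into single spaces, strip the ends, then dash the spaces.
def slugify(text, transliterate: ASCII_FALLBACK)
  transliterate.call(text.downcase)
    .tr_s("^a-z0-9", " ")
    .strip
    .tr(" ", "-")
end

slugify(" Paleta De Mango - Kölsch \n(Rahr & Sons) ")
# => "paleta-de-mango-kolsch-rahr-sons"
```

Accepting any callable for the transliteration step means the i18n gem can be plugged back in, for instance by passing a lambda that calls I18n.transliterate. Note that input with no letters or digits at all produces an empty string, which a caller would need to handle.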
These are some of the tools I use whenever I need to clean up and simplify strings. However, I want to stress that the code here isn't robust enough by itself to stand up to arbitrary user input from untrusted sources. In an upcoming episode, we'll look at some resources for testing and hardening our string cleaning algorithms. Until then,