Video transcript & code
I had another topic planned for this episode, but then one of my websites broke unexpectedly. In the process of fixing the problem, I learned something new, and I thought I'd share it. If the beginning of this episode is old hat to you, stick around, because there's a gotcha that might just catch you by surprise.
Say we have a string. It's a string we scraped off of a website, so it has some extra whitespace around it.
str = " Hello world\n "
We don't want the whitespace; we just want the words.
There's no big secret about the easy way to do this. If you've been using Ruby for a while, you probably already know what it is. The strip method removes leading and trailing whitespace.
str = " Hello world\n "
str.strip # => "Hello world"
There are also lstrip and rstrip variants for the cases when we only want to strip whitespace from one side of the string.
str = " Hello world\n "
str.lstrip # => "Hello world\n "
str.rstrip # => " Hello world"
These all return a new, modified copy of the string. If we want to modify the original string (which we sometimes want to do when trying to eke out a little more efficiency), there are bang versions of all of these methods. They modify the string in place.
str = " Hello world\n "
str.strip!
str # => "Hello world"
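One thing to watch for with the bang versions: they return nil when there was nothing to strip, so it's safer not to chain other calls off of them. Here's a small sketch of that behavior. The variable names are just for illustration, and the dup calls are only there so the example also runs with frozen string literals turned on.

```ruby
# strip! returns nil when it had nothing to remove.
clean = "Hello world".dup   # dup gives us a mutable copy
result = clean.strip!       # no leading or trailing whitespace here
result                      # => nil
clean                       # => "Hello world"

# When it does remove something, it returns the modified string itself.
padded = "  Hello world\n".dup
padded.strip!               # => "Hello world"
padded                      # => "Hello world"

# So a chain like clean.strip!.upcase would raise NoMethodError
# whenever the string happened to be clean already.
```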
So far this is all pretty basic. Let's mix things up a little bit. Here's a new string.
str = "   Chunky bacon\n   \u00A0"
Let's strip the whitespace from it.
str = "   Chunky bacon\n   \u00A0"
str.strip # => "Chunky bacon\n   \u00A0"
Wait a sec… why is there still whitespace on the end? Why is #strip suddenly broken?
Let's take a closer look at this string, by examining the individual Unicode codepoints.
str = "   Chunky bacon\n   \u00A0"
str.codepoints # => [32, 32, 32, 67, 104, 117, 110, 107, 121, 32, 98, 97, 99, 111, 110, 10, ...
In particular, let's focus on the last few codepoints.
str = "   Chunky bacon\n   \u00A0"
str.codepoints.last(5) # => [10, 32, 32, 32, 160]
We see a 10, which is the newline. Then we see three 32s in a row. These are space characters. But then, at the very end, we see 160.
What is 160? Well, as it turns out, it corresponds to the Unicode code point U+00A0, "NO-BREAK SPACE". In other words, it's a space character which is not supposed to be used as a place to break a line of text into two. We don't have an easy way to type this character, but we can insert it into a string ourselves by using the Unicode escape sequence \u00A0.
"\u00A0".codepoints # => [160]
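Since Unicode charts list characters in hexadecimal, it can help to print codepoints in the U+XXXX form before looking them up. Here's a quick sketch; the variable names nbsp, tail, and labels are just made up for this example.

```ruby
nbsp = "\u00A0"
nbsp.ord                    # => 160
nbsp.ord.to_s(16)           # => "a0"
format("U+%04X", nbsp.ord)  # => "U+00A0"

# The same formatting applied to the tail end of our string.
tail = "\n   \u00A0"
labels = tail.codepoints.map { |cp| format("U+%04X", cp) }
# => ["U+000A", "U+0020", "U+0020", "U+0020", "U+00A0"]
```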
Dealing with whitespace in Unicode strings is not as simple as it was in ASCII text. In fact, Wikipedia lists 25 different Unicode characters which are classified as whitespace. Unfortunately for us, Ruby's strip family of methods does not appear to be hip to this definition of whitespace; at least, not yet. And since Ruby doesn't see the nonbreaking space as whitespace, that last character is preventing strip from removing the other whitespace characters in front of it.
How can we deal with this? Well, one way to do it is to simply remove all the nonbreaking-space characters with gsub, and then strip the string.
str = "   Chunky bacon\n   \u00A0"
str.gsub("\u00A0", "") # => "   Chunky bacon\n   "
  .strip # => "Chunky bacon"
Or, we could use a fancy regular expression to supersede strip altogether. We start with \A, which anchors the match at the start of the string. Then we specify the [:space:] character class, wrapped in a second pair of square brackets. This special regex character class has a much wider definition of whitespace than strip does. We put a + on the end to match one or more whitespace characters.
Then we use a pipe to specify an alternative match. We reference the [:space:] character class once again, followed by another + for one-or-more. Finally, we anchor this second alternative to only match at the end of the string using \z.
str = "   Chunky bacon\n   \u00A0"
str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "") # => "Chunky bacon"
This works well, at the cost of being long and hard to read.
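One way to tame the readability cost is to give the regex a name and tuck it behind a small method. This is only a sketch: unicode_strip and UNICODE_EDGE_SPACE are hypothetical names invented for this example, not anything Ruby provides.

```ruby
# Hypothetical helper, not part of Ruby's String API.
# Matches a run of Unicode whitespace at the very start or very end.
UNICODE_EDGE_SPACE = /\A[[:space:]]+|[[:space:]]+\z/

def unicode_strip(string)
  # gsub returns a new string; the original is left alone, like strip.
  string.gsub(UNICODE_EDGE_SPACE, "")
end

str = "   Chunky bacon\n   \u00A0"
unicode_strip(str) # => "Chunky bacon"
```

Whitespace in the middle of the string is left alone, because neither alternative can match without touching the start or the end.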
By the way, you might be tempted to use the shorter \s regex shorthand, which also means "whitespace". But surprisingly, it turns out that this shorthand does not refer to the same set of characters as the [:space:] character class: in Ruby, \s only covers the ASCII whitespace characters. We can see the difference in the results.
str = "   Chunky bacon\n   \u00A0"
str.gsub(/\A\s+|\s+\z/, "") # => "Chunky bacon\n   \u00A0"
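To make the difference concrete, here's a small comparison of the two against a handful of space characters. The sample list is just a selection picked for illustration, not a complete inventory of Unicode whitespace.

```ruby
samples = {
  "SPACE"             => "\u0020",
  "NO-BREAK SPACE"    => "\u00A0",
  "EM SPACE"          => "\u2003",
  "IDEOGRAPHIC SPACE" => "\u3000"
}

# For each character: [matched by \s?, matched by [[:space:]]?]
results = samples.transform_values do |char|
  [char.match?(/\s/), char.match?(/[[:space:]]/)]
end

results["SPACE"]             # => [true, true]
results["NO-BREAK SPACE"]    # => [false, true]
results["EM SPACE"]          # => [false, true]
results["IDEOGRAPHIC SPACE"] # => [false, true]
```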
So what's the moral of this story? Well, one takeaway is that arguably Ruby still has some catching-up to do when it comes to being fully Unicode-aware.
But the practical lesson here is simple: Unicode strings are not always what they seem. If you are taking in strings from outside of your program, and your string-munging code isn't working the way you expect, don't hesitate to break the strings into their codepoints and find out exactly what they are made of.
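If you find yourself doing this kind of detective work often, a tiny diagnostic can point straight at the troublemakers. Here's a rough sketch, assuming the input is a valid UTF-8 string. The name unusual_whitespace is made up for this example; it reports the position and codepoint of every character that [:space:] counts as whitespace but \s does not.

```ruby
# Hypothetical helper: returns [index, "U+XXXX"] pairs for whitespace
# characters that fall outside the ASCII-only \s shorthand.
def unusual_whitespace(string)
  string.each_char.with_index
        .select { |char, _index| char.match?(/[[:space:]]/) && !char.match?(/\s/) }
        .map { |char, index| [index, format("U+%04X", char.ord)] }
end

unusual_whitespace("   Chunky bacon\n   \u00A0") # => [[19, "U+00A0"]]
unusual_whitespace(" Hello world\n ")             # => []
```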