Drop While

Say we have an email, and we'd like to extract just the body. Emails follow RFC 822 conventions, where each line ends with a carriage return followed by a linefeed. The carriage returns show up in ^M notation in this Emacs buffer. The body of the message is separated from the headers by an empty line, that is, two carriage-return/linefeed sequences in a row.

Ruby provides a shortcut for reading a file into memory and dividing it into lines all at once. It's called File.readlines. We give it the filename and an optional delimiter string that indicates where lines break. In our case this is a carriage-return/linefeed sequence.

The result is an array of the lines in the file. We can see that the break in between the headers and the body shows up as a string containing just a carriage return and a linefeed, and nothing else. We'll use this string to identify where the body begins.

lines = File.readlines("mail.txt", "\r\n")
lines
# => ["MIME-Version: 1.0\r\n",
#     "Received: by 10.112.223.164 with HTTP; Tue, 2 Sep 2014 15:17:21 -0700 (PDT)\r\n",
#     "Date: Tue, 2 Sep 2014 18:17:21 -0400\r\n",
#     "Delivered-To: gibbons@example.com\r\n",
#     "Message-ID: <CA+XG7-iiZQ0SAzgg+2ci==fcbCwvEQP413W8mEBdfu_t0rDAWg@mail.gmail.com>\r\n",
#     "Subject: TPS Reports\r\n",
#     "From: Bill Lumberg <lumberg@example.com>\r\n",
#     "To: Peter Gibbons <gibbons@example.com>\r\n",
#     "Content-Type: text/plain; charset=UTF-8\r\n",
#     "\r\n",
#     "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"]
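For anyone who wants to try this without a mail.txt on hand, here is a minimal sketch that writes a tiny message to a temporary file and reads it back the same way. The helper name read_lines_with and the shortened sample message are assumptions made for illustration; only File.readlines and its separator argument come from the episode.

```ruby
require "tempfile"

# Hypothetical helper (not from the episode): writes +text+ to a temporary
# file, then reads it back with File.readlines using +separator+.
def read_lines_with(text, separator)
  Tempfile.create("mail") do |file|
    file.binmode            # write the bytes exactly as given
    file.write(text)
    file.flush              # make sure the content is on disk before reading
    File.readlines(file.path, separator)
  end
end

# A shortened stand-in for the email shown above.
sample = "Subject: TPS Reports\r\n\r\nYeah...\r\n"

read_lines_with(sample, "\r\n")
# => ["Subject: TPS Reports\r\n", "\r\n", "Yeah...\r\n"]
```

Each element keeps its trailing separator, which is why the divider shows up as a string holding nothing but "\r\n".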

To get rid of the lines we don't care about, we can shift them off of the array until one of them consists of just a carriage return and a linefeed. Then we can join the remaining lines together.

lines = File.readlines("mail.txt", "\r\n")
line = lines.shift until line == "\r\n"
lines.join
# => "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"

And there we go… clean and concise, right?

No! It's hideous. Three whole lines of code? I can't even look at it anymore. Let's try a different way.

We'll start with File.readlines, as before. Then we'll chain a send to #drop_while. #drop_while takes a block. The block receives one line at a time, and functions as a predicate. The method drops items (hence the name) for as long as the predicate holds, and stops at the first item that fails it. In our case, the predicate is true for every line that is not the divider string, so it first fails on the divider itself. The return value of this message send is an array of the remaining lines, including the divider string.

We don't want the divider string in our final product, so we use #drop with an argument of 1 to get rid of it. This method is like #drop_while except that it simply drops however many elements are requested. The result is an array containing just the body lines.

We can join this array and once again we have the email body text. This time, though, we achieved it in a single line of chained message sends.

File.readlines("mail.txt", "\r\n").drop_while{|l| l != "\r\n"}
# => ["\r\n",
#     "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"]

File.readlines("mail.txt", "\r\n").drop_while{|l| l != "\r\n"}.drop(1)
# => ["Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"]


File.readlines("mail.txt", "\r\n").drop_while{|l| l != "\r\n"}.drop(1).join
# => "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"
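Two edge cases are worth poking at here, and the following sketch does so with in-memory arrays instead of a file. The method name body_of and the sample lines are assumptions for illustration. The first case is a body that contains its own empty line; the second is a message with no divider at all.

```ruby
# Hypothetical helper wrapping the chain from the episode.
def body_of(lines)
  lines.drop_while { |l| l != "\r\n" }.drop(1).join
end

# #drop_while stops testing at the first line that fails the predicate,
# so a later empty line inside the body is kept.
two_paragraphs = [
  "Subject: Hi\r\n",
  "\r\n",
  "First paragraph\r\n",
  "\r\n",
  "Second paragraph\r\n"
]
body_of(two_paragraphs)
# => "First paragraph\r\n\r\nSecond paragraph\r\n"

# With no divider, every line is dropped, #drop(1) on an empty array
# returns an empty array, and the join is an empty string.
body_of(["Subject: no body\r\n"])
# => ""
```

The second case is where the two versions differ most: the until loop shown earlier would keep shifting nil off an empty array and never terminate on a message without a divider.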

The difference between these two approaches goes beyond cosmetic, however. The first example works by modifying the original list of lines. Any later code that wants to dig into the full email contents without re-reading the file is out of luck.
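To see that mutation directly, here is a small sketch using an in-memory array as a stand-in for the lines of mail.txt. The method name body_by_shifting and the sample data are assumptions for illustration; the loop itself is the one from the first example.

```ruby
# Hypothetical wrapper around the shift-based approach from the episode.
def body_by_shifting(lines)
  # Removes elements from the front of +lines+ until the divider has
  # been removed too. This changes the caller's array.
  line = lines.shift until line == "\r\n"
  lines.join
end

lines = ["Subject: TPS Reports\r\n", "\r\n", "Yeah...\r\n"]

body_by_shifting(lines)
# => "Yeah...\r\n"

# The headers and the divider are gone from the original array.
lines
# => ["Yeah...\r\n"]
```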

By contrast, the version using #drop_while leaves the list of lines untouched, as we can see if we inject an intermediate variable and then inspect its value after the fact. Despite their names, these methods don't drop elements of the original array on the floor; they only drop elements from the new array to be returned.

lines = File.readlines("mail.txt", "\r\n")
lines.drop_while{|l| l != "\r\n"}.drop(1).join
# => "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"

lines
# => ["MIME-Version: 1.0\r\n",
#     "Received: by 10.112.223.164 with HTTP; Tue, 2 Sep 2014 15:17:21 -0700 (PDT)\r\n",
#     "Date: Tue, 2 Sep 2014 18:17:21 -0400\r\n",
#     "Delivered-To: gibbons@example.com\r\n",
#     "Message-ID: <CA+XG7-iiZQ0SAzgg+2ci==fcbCwvEQP413W8mEBdfu_t0rDAWg@mail.gmail.com>\r\n",
#     "Subject: TPS Reports\r\n",
#     "From: Bill Lumberg <lumberg@example.com>\r\n",
#     "To: Peter Gibbons <gibbons@example.com>\r\n",
#     "Content-Type: text/plain; charset=UTF-8\r\n",
#     "\r\n",
#     "Yeah... if you could just fill out your TPS reports... that'd be great...\r\n"]

In technical terms, the #drop... methods are referentially transparent, which means they are pure functions of their inputs, and have no side effects (so long as the block we pass in has none of its own). Not surprisingly, these methods originate from the functional programming side of Ruby's diverse heritage.
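One way to convince ourselves of this is to freeze the array first. The sketch below assumes Ruby 2.5 or later, where modifying a frozen object raises FrozenError; the constant name FROZEN_LINES and the sample data are made up for illustration.

```ruby
# In-memory stand-in for the lines of mail.txt, frozen so that any
# attempt to modify the array raises an error.
FROZEN_LINES = ["Subject: TPS Reports\r\n", "\r\n", "Yeah...\r\n"].freeze

# The #drop... methods only read from the receiver and build a new
# array, so they work fine on a frozen one.
FROZEN_LINES.drop_while { |l| l != "\r\n" }.drop(1)
# => ["Yeah...\r\n"]

# #shift has to modify the receiver, so it is refused.
begin
  FROZEN_LINES.shift
rescue FrozenError => e
  e.class
  # => FrozenError
end

FROZEN_LINES.size
# => 3
```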

Today we've gone from an already clean, expressive three-line program to an even more elegant and concise one-liner. And not only that, the second version has some useful new semantics. Ruby sure is great. Happy hacking!
