Grep, Sort, Uniq

As we continue exploring Ruby one-liner alternatives to UNIX command-line operations, today we tackle a common case: a combination of grepping through multiple files, and collating the results. We’ll iterate through a few different possibilities before arriving at a solution that is both straightforward and terse. Along the way, we’ll talk about the problem of hammers…

Video transcript & code

<!-- shot(01) -->

In another video, we talked through using a one-liner to discover all of the inline image references in a directory full of Markdown files. It looked like this:

$ ruby -ne 'puts $1 if /!\[.*\]\((.*)\)/' *.md
chapter18.assets/odin.jpg
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg

<!-- shot(02) -->

In that video we talked about how the Ruby -n flag puts any code inside a ghost loop that loops over the special ARGF object, which behaves like the concatenation of all the input files given on the command line. (By the way, if no files are specified, this object pulls from the standard input stream instead.)

while $_ = ARGF.gets
  # --- code from -e flags goes here ---
  puts $1 if /!\[.*\]\((.*)\)/
  # --- end code from -e flags ---
end
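
To watch the ghost loop run outside of a one-liner, here is a minimal sketch that writes the loop out by hand. The two sample files and their contents are made up for illustration, and ARGV.replace is only a stand-in for passing file names on the command line.

```ruby
require "tmpdir"

pattern = /!\[.*\]\((.*)\)/
found = []

Dir.mktmpdir do |dir|
  # Hypothetical sample files, created only to keep the sketch self-contained.
  first  = File.join(dir, "first.md")
  second = File.join(dir, "second.md")
  File.write(first,  "Some text\n![Odin](first.assets/odin.jpg)\n")
  File.write(second, "![Chart](second.assets/chart.png)\nMore text\n")

  # Stand-in for `ruby -n ... first.md second.md`: ARGF reads the files named in ARGV.
  ARGV.replace([first, second])

  # The loop that -n would wrap around our code.
  while $_ = ARGF.gets
    # In the one-liner, the bare regex in the condition is matched against $_ implicitly;
    # here the match is spelled out.
    found << $1 if $_ =~ pattern
  end
end

puts found
```

Both files are read through the single ARGF object, one line at a time, as if they were one long file.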

In that video and others, we explored how using -n and its sibling flag -p is the basis for a lot of command-line one-liners in Ruby. We can use them to tackle tasks that we’d otherwise use commands like sed or awk for.

<!-- shot(03) -->

Back to the command line, and listing image references:

$ ruby -ne 'puts $1 if /!\[.*\]\((.*)\)/' *.md
chapter18.assets/odin.jpg
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg

Today we have a challenge that builds on this task. See how one of these output lines is repeated? Well, now we need not only to discover these references, but also to sort and unique the results, so that any given filename appears only once.

<!-- shot(04) -->

One obvious way to approach this is to pipe the results into the UNIX sort and uniq tools:

$ ruby -ne 'puts $1 if /!\[.*\]\((.*)\)/' *.md | sort | uniq
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg

<!-- shot(05) -->

Or we could omit the uniq command and use the -u flag to sort.

$ ruby -ne 'puts $1 if /!\[.*\]\((.*)\)/' *.md | sort -u 
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg

<!-- shot(06) -->

But let’s say, just for the sake of argument, that we want to keep this completely within a single Ruby command.

$ ruby -ne 'puts $1 if /!\[.*\]\((.*)\)/' *.md

How do we tackle this?

Hm, well, we know that our code is going to be executed once for every line of input. That’s great for grepping for a pattern, but it’s not so great for sorting and unique-ing. For those tasks we really need to have the entire unsorted output at hand in a single array.

But we do know that we can use special BEGIN and END blocks in our one-liners to escape code out of the ghost loop.

<!-- shot(07) -->

So maybe we could do something like this: we could instantiate a collector array in a BEGIN block.

$ ruby -ne 'BEGIN{refs=[]}; puts $1 if /!\[.*\]\((.*)\)/' *.md

<!-- shot(08) -->

Then instead of outputting matches immediately, we could add them to this collector.

$ ruby -ne 'BEGIN{refs=[]}; refs << $1 if /!\[.*\]\((.*)\)/' *.md

<!-- shot(09) -->

We could call sort and uniq on the final list

<!-- shot(10) -->

and print out the result.

$ ruby -ne 'BEGIN{refs=[]}; refs << $1 if /!\[.*\]\((.*)\)/; END{puts refs.sort.uniq}' *.md
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg
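
Unrolled into a plain script, that one-liner amounts to roughly the following. This is only a sketch: the lines array is a hypothetical stand-in for the contents of the Markdown files, and an ordinary each loop plays the part of the -n ghost loop.

```ruby
# Hypothetical input lines standing in for the contents of *.md.
lines = [
  "![Odin the Cat](chapter18.assets/odin.jpg)\n",
  "Some prose between images.\n",
  "![image-20200513095255753](chapter04.assets/image-20200513095255753.png)\n",
  "![Odin the Cat](chapter18.assets/odin.jpg)\n",
]

# BEGIN{refs=[]} : runs once, before any input is read.
refs = []

# The ghost loop body: runs once per line, collecting instead of printing.
lines.each do |line|
  refs << $1 if line =~ /!\[.*\]\((.*)\)/
end

# END{puts refs.sort.uniq} : runs once, after the last line.
result = refs.sort.uniq
puts result
```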

This works. But it is nowhere near as terse or as satisfying as just piping the results into sort -u. Is there some other approach that we’re missing?

Well, here’s a hint: sometimes when we have a hammer, everything looks like a nail. In this case, we’ve been using -n and -p a lot lately to come up with tight, expressive little one-liners. But is the -n ghost loop really helping us here?

In this task, we really want to work on the entire input at once. But -n is an optimization for line-by-line, streaming text processing.

<!-- shot(11) -->

Let’s start over, with nothing but the parts we know we’ll need: a -e to tell Ruby to evaluate some code, the regex for matching Markdown inline image references, and the glob pattern for input files.

$ ruby -e '/!\[.*\]\((.*)\)/' *.md

Now, rather than work line-by-line, we want to operate on the entire input corpus at once. Is there some object that could help us do this?

<!-- shot(12) -->

Well, yeah, we saw it a minute ago: it’s the ARGF object, which acts like an open file combining all the input files into one.

while $_ = ARGF.gets
  # --- code from -e flags goes here ---
  puts $1 if /!\[.*\]\((.*)\)/
  # --- end code from -e flags ---
end

There’s a global variable that normally points to this object. Its name is even shorter than ARGF, and arguably more mnemonic:

<!-- shot(13) -->

It’s the $< special global variable.

while $_ = $<.gets
  # --- code from -e flags goes here ---
  puts $1 if /!\[.*\]\((.*)\)/
  # --- end code from -e flags ---
end

I find this variable name easy to remember, because it is named after the shell input redirection operator.

$ wc < chapter04.md
  44  297 1927

<!-- shot(14) -->

By the way, there’s also a longer name for this variable, if we require the English module.

<!-- shot(15) -->

It’s the $DEFAULT_INPUT global.

require "English"
while $_ = $DEFAULT_INPUT.gets
  # --- code from -e flags goes here ---
  puts $1 if /!\[.*\]\((.*)\)/
  # --- end code from -e flags ---
end

We’re not going to use the long name for one-liners, but we’ll call this object the default input object from here on out.
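
As a quick sanity check, here is a small sketch confirming that the three names refer to one and the same object. The variable names are just illustrative.

```ruby
require "English"

# ARGF, $< and $DEFAULT_INPUT should all be the very same object.
short_name_is_argf = $<.equal?(ARGF)
long_name_is_argf  = $DEFAULT_INPUT.equal?(ARGF)

puts short_name_is_argf
puts long_name_is_argf
```

Because they are the same object, anything we can call on ARGF we can also call on $<, which is what makes the short name so handy in one-liners.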

<!-- shot(16) -->

So, back on the command line:

$ ruby -e '/!\[.*\]\((.*)\)/' *.md

<!-- shot(17) -->

We’ll start with the default input object.

<!-- shot(18) -->

And we’ll use the grep method to search for the inline image pattern.

<!-- shot(19) -->

To check the output, we’ll add a puts at the beginning.

<!-- shot(20) -->

When we run this, we see that we’re successfully finding Markdown image links.

$ ruby -e 'puts $<.grep(/!\[.*\]\((.*)\)/)' *.md
![Odin the Cat](chapter18.assets/odin.jpg)
![image-20200513095255753](chapter04.assets/image-20200513095255753.png)
![Odin the Cat](chapter18.assets/odin.jpg)

But we don’t want the whole link; we just want the file reference part. That’s why we’ve included a capture group in our regular expression.

<!-- shot(21) switch to capture_group.rb demo -->

When Ruby matches against a regular expression,

<!-- shot(22) -->

it puts captured groups into some numbered pseudo-global variables.

<!-- shot(23) -->

There are more formal ways of referring to these captures, but today we’re interested in the concise version.

pattern = /!\[.*\]\((.*)\)/
text = <<EOF
First, a gratuitous cat picture!

![Odin the Cat](chapter18.assets/odin.jpg)
EOF

pattern =~ text  # => 34
$1   # => "chapter18.assets/odin.jpg"
Regexp.last_match[1]  # => "chapter18.assets/odin.jpg"

<!-- shot(24) -->

If we pass a block to grep, it will return the results of that block rather than the entire matching line.

<!-- shot(25) -->

We return the value of the first capture group from the block.

<!-- shot(26) -->

And we run what we have.

$ ruby -e 'puts $<.grep(/!\[.*\]\((.*)\)/){$1}' *.md
chapter18.assets/odin.jpg
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg

Now we’re back to our original functionality, but producing our output list all at once instead of line-by-line.
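
The grep method comes from Enumerable, which the default input object mixes in, so the same behavior can be tried on a plain array. Here is a small sketch contrasting grep with and without a block; the lines array is a hypothetical stand-in for what the default input object would yield.

```ruby
pattern = /!\[.*\]\((.*)\)/

# Hypothetical lines standing in for the input files.
lines = [
  "First, a gratuitous cat picture!\n",
  "![Odin the Cat](chapter18.assets/odin.jpg)\n",
  "![image-20200513095255753](chapter04.assets/image-20200513095255753.png)\n",
]

# Without a block: the matching elements come back unchanged.
whole_lines = lines.grep(pattern)

# With a block: each match is replaced by the block's return value,
# and $1 inside the block refers to the match that was just made.
captures = lines.grep(pattern) { $1 }

p whole_lines
p captures
```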

<!-- shot(27) -->

At this point, all we need to do is tack on a .sort and a .uniq.

<!-- shot(28) -->

Running this again, we get our sorted, unique-ed final list.

$ ruby -e 'puts $<.grep(/!\[.*\]\((.*)\)/){$1}.sort.uniq' *.md
chapter04.assets/image-20200513095255753.png
chapter18.assets/odin.jpg
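
For comparison with the shell pipeline, here is a short sketch of the last two calls on their own, using the three references from the earlier output as sample data.

```ruby
# The unsorted, duplicated list that grep hands back.
refs = [
  "chapter18.assets/odin.jpg",
  "chapter04.assets/image-20200513095255753.png",
  "chapter18.assets/odin.jpg",
]

sorted_unique = refs.sort.uniq   # mirrors `sort | uniq`
unique_sorted = refs.uniq.sort   # same result, with fewer elements to sort

puts sorted_unique
```

Unlike the UNIX uniq command, which only collapses adjacent duplicate lines, Array#uniq removes duplicates wherever they appear, so for a list of strings like this the two orderings give the same answer.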

This was a long way around to a short command. But it reminds us to be careful of our hammers. Ruby’s ghost-loop-producing command-line flags are useful for quite a few filtering and munging tasks. But there’s also quite a lot we can accomplish at the command line using nothing but -e and the special variables Ruby provides.

Happy hacking!
