Unit 1, Lesson 21

Ruby as a Filter

Building on recent episodes about Ruby’s special features for command-line scripting, today we tackle a text filtering problem. Learn how Ruby’s -p flag along with a bunch of handy shortcuts can take a script’s worth of munging logic and shrink it into a one-liner!

Video transcript & code

Dash P

When I’m writing a script for a video, I start with a markdown file.

Let's take a first crack at this. 

We'll use Ruby `-n` to automatically loop over every line of input.

Then I annotate it with shot markers inside of comments. I have a conventional syntax I’ve settled on, which looks like this.

Let's take a first crack at this. 

<!-- shot() --> 

We'll use Ruby `-n` to automatically loop over every line of input.

<!-- shot() -->

Later, once the document is in a final form, I either manually add numbers, or use an automated script to do it.

Let's take a first crack at this. 

<!-- shot(1) --> 

We'll use Ruby `-n` to automatically loop over every line of input.

<!-- shot(2) -->

In the past I’ve used full-fledged Ruby scripts to automate this shot numbering.

But let’s see if we can do this with a Ruby one-liner!

We’ll build up to our one-liner step-by-step. Let’s begin by having Ruby spit out every line of the input file unchanged.

We do this with the -n flag, followed by an e flag to evaluate some code. Our code prints the current value of the last-read-line variable, $_.

ruby -ne 'puts $_'

As you might recall if you’ve watched some of my other videos on Ruby one-liners,

the -n flag puts an implicit “ghostly loop” around any code being evaluated. The loop iterates over lines of input, stuffing each line into the $_ global variable.

while $_ = ARGF.gets
  # BEGIN -e code
  puts $_
  # END -e code
end

When we run this with a file as input, we can see it behaves like the UNIX cat command, echoing each line from the input file to standard output.
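To make the ghost loop a little more concrete, here is a minimal sketch that imitates it on an in-memory string instead of a real file. The `cat_like` helper is a made-up name for illustration, and a StringIO stands in for ARGF.

```ruby
require "stringio"

# Hypothetical helper that imitates `ruby -ne 'puts $_'` on an in-memory string.
def cat_like(text)
  input = StringIO.new(text)   # stand-in for ARGF
  output = StringIO.new        # stand-in for standard output
  # Same shape as the implied -n loop: each line lands in $_.
  while ($_ = input.gets)
    output.puts $_
  end
  output.string
end

puts cat_like("first line\nsecond line\n")
```

As long as the input ends with a newline, what comes out is what went in.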

Something else we’ve seen about one-liners is that we can replace the puts $_ with a bare call to print.

ruby -ne 'print'

This functions exactly the same as before.

That’s because unlike puts, print with no arguments assumes you want to print the last-read-line.
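Here is a small sketch of that difference outside of a one-liner. The `bare_output` helper is hypothetical; it temporarily swaps `$stdout` for a StringIO so we can inspect what each bare call writes.

```ruby
require "stringio"

# Hypothetical helper: capture what a bare `print` or a bare `puts` writes
# when $_ holds the given line.
def bare_output(method_name, last_line)
  original_stdout = $stdout
  $stdout = StringIO.new
  $_ = last_line
  if method_name == :print
    print   # no arguments: writes the contents of $_
  else
    puts    # no arguments: writes only a newline
  end
  $stdout.string
ensure
  $stdout = original_stdout
end

p bare_output(:print, "hello\n")
p bare_output(:puts, "hello\n")
```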

But there’s an even shorter version of this script.

We can replace the n flag with a p flag. Then erase the call to print!

ruby -pe ''

When we run this, we still see the full contents of the input file!

So, how did this work?

Well, the -p flag is like the -n flag, except it adds one thing to the ghost loop:

# -p implied loop
while $_ = ARGF.gets
  # BEGIN -e code
  # ...
  # END -e code
  print $_
end

It adds an implied print of every line of input.

That means that in order to update shot numbers, we need to modify the last-read-line. Ruby will take care of outputting it.
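A rough sketch of that idea in plain Ruby, assuming a made-up `filter_like_p` helper: the block plays the part of the -e code, and the helper does the printing. The line is handed to the block explicitly here, because `$_` is local to each method and would not be shared with the caller's block.

```ruby
require "stringio"

# Hypothetical stand-in for `ruby -pe CODE`: run a block per line, then print the line.
def filter_like_p(text)
  input = StringIO.new(text)
  output = StringIO.new
  while (line = input.gets)
    yield line          # the -e code; it may mutate the line in place
    output.print line   # the print that -p adds to the loop
  end
  output.string
end

puts filter_like_p("one\ntwo\n") { |line| line.upcase! }
```

The block never prints anything. It only changes the line, and the loop takes care of the output.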

OK, so how do we want to go about munging our lines of input? Well, before we get into it, let’s clarify our requirements.

As input, we might start with a video script that looks like this, with blank shot markers.

Let's take a first crack at this. 

<!-- shot() --> 

We'll use Ruby `-n` to automatically loop over every line of input.

<!-- shot() -->

But sometimes we have an already-numbered script where the numbering has gotten out of date.

<!-- shot(2) --> 

Let's take a first crack at this. 

<!-- shot() --> 

We'll use Ruby `-n` to automatically loop over every line of input.

<!-- shot(3) -->

In this case, we want the script to re-number the shots. In other words, we want our one-liner to be idempotent and give the same output with either empty or numbered shot markers.

With that in mind, let’s take a first crack at this.

Anywhere we find a shot marker, we want to rewrite it. Or to put it another way, we want to make a text substitution.

We can do this by sending the sub! message to the last-read-line object. The bang version tells the object to modify itself, instead of returning a modified copy.

As the first argument, we supply a regular expression that will match a line beginning with a shot marker inside Markdown comment delimiters. It includes a .* wildcard that will match either a blank or an already-numbered marker.

As a second argument, we’ll supply a placeholder string for now, rather than worrying about incrementing numbers.

When we run this, we can see that every shot marker, including the comment delimiters, is replaced with the placeholder.

ruby -pe '$_.sub! /^<!-- shot(.*) -->/, "NNN"'
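To see what sub! does to a single line, here is a minimal sketch. The `placeholder` helper is hypothetical; it works on a copy so we can look at both the mutated string and the return value.

```ruby
# Hypothetical helper: apply the placeholder substitution to a copy of one line.
def placeholder(line)
  copy = line.dup
  result = copy.sub!(/^<!-- shot(.*) -->/, "NNN")  # mutates copy in place
  [copy, result]
end

p placeholder("<!-- shot() -->\n")
p placeholder("Just some prose.\n")
```

When nothing matches, sub! leaves the string alone and returns nil, which is harmless in a -p one-liner because the line still gets printed.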

Taking a closer look at the regex we used, we realize that part of it is only working by accident.

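Bare parentheses in a regex are used for grouping. They don’t match literal parentheses. That means our pattern also matches any comment that merely starts with the word “shot”, like this one.

echo '<!-- shots after this are repeats -->' | ruby -pe '$_.sub! /^<!-- shot(.*) -->/, "NNN"'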

We should backslash-escape them to make the semantics of the pattern match our intent.

ruby -pe '$_.sub! /^<!-- shot\(.*\) -->/, "NNN"'
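A quick side-by-side sketch of the two patterns. The constant names LOOSE and STRICT are just labels for this illustration.

```ruby
LOOSE  = /^<!-- shot(.*) -->/     # parentheses form a capture group
STRICT = /^<!-- shot\(.*\) -->/   # parentheses are literal characters

stray_comment = "<!-- shots after this are repeats -->"
real_marker   = "<!-- shot(2) -->"

p LOOSE.match?(stray_comment)   # the unescaped version is fooled
p STRICT.match?(stray_comment)
p STRICT.match?(real_marker)
```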

While we’re at it, let’s tighten up our pattern by only matching zero or more numeric digits between the parens.

ruby -pe '$_.sub! /^<!-- shot\(\d*\) -->/, "NNN"'

Now that we’ve got our basic regex dialed in, let’s make our substitution a little more surgical.

We can use unescaped parentheses around the parts of the pattern before and after where the shot number should go. These are regex “capture groups”.

ruby -pe '$_.sub! /^(<!-- shot\()\d*(\) -->)/, "NNN"'

By itself this doesn’t change the output.

But if we then use backslashed numbers to reference these capture groups… we now see our placeholder sandwiched inside the rest of the shot marker!

ruby -pe '$_.sub! /^(<!-- shot\()\d*(\) -->)/, "\\1NNN\\2"'

Notice that because we’re using a double-quoted replacement string, we had to double our backslashes to escape them for the replacement group references.
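Here is a small sketch of why the doubling matters. The `with_placeholder` helper is a made-up name; the point is that the doubled double-quoted string and a single-quoted string hand sub exactly the same characters.

```ruby
# Hypothetical helper: run the capture-group substitution with a given replacement.
def with_placeholder(line, replacement)
  line.sub(/^(<!-- shot\()\d*(\) -->)/, replacement)
end

doubled       = "\\1NNN\\2"  # double quotes: backslashes must be doubled
single_quoted = '\1NNN\2'    # single quotes: \1 and \2 are left alone
undoubled     = "\1NNN\2"    # double quotes, not doubled: \1 becomes a control character

p doubled == single_quoted
p with_placeholder("<!-- shot(7) -->\n", doubled)
p with_placeholder("<!-- shot(7) -->\n", undoubled)
```

Inside a shell one-liner that is already wrapped in single quotes, the doubled double-quoted form is the easier one to type.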

OK, now that we know how to narrow our rewrite to just the part we want to update, let’s try to put some actual numbers there.

For a first attempt, we can use a BEGIN block to initialize a shot number variable to zero.

Then inside the substitution replacement string, we can increment and interpolate in the current number.

ruby -pe 'BEGIN{sn=0}; $_.sub! /^(<!-- shot\()\d*(\) -->)/, "\\1#{sn += 1}\\2"'

When we run this, we can see that the numbers are going up… but they seem a bit off!

What’s going on here? Well, the substitution only happens when the regular expression matches. But the replacement string is an ordinary method argument, so it gets built, and the counter gets incremented, for every line of input, no matter what!
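We can reproduce the drift away from the command line with a short sketch. The `number_eagerly` helper and the sample lines are made up for illustration.

```ruby
# Hypothetical helper mirroring the string-argument version of the one-liner.
def number_eagerly(lines)
  sn = 0
  lines.map do |line|
    # The replacement string is built before sub runs, match or no match,
    # so sn goes up once per line rather than once per marker.
    line.sub(/^(<!-- shot\()\d*(\) -->)/, "\\1#{sn += 1}\\2")
  end
end

sample = ["Intro text.\n", "<!-- shot() -->\n", "More text.\n", "<!-- shot() -->\n"]
puts number_eagerly(sample)
```

With two lines of prose in the mix, the two markers come out numbered 2 and 4.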

OK, we need that replacement string to only be built when the substitution actually takes place.

We can do that by switching to the block form of sub!.

ruby -pe 'BEGIN{sn=0}; $_.sub!(/^(<!-- shot\()\d*(\) -->)/) {"\\1#{sn +=1}\\2"}'

But for the block form, we can’t use these backreferences anymore. We have to switch to the special numbered pseudoglobals that Ruby sets with each regular expression match.

ruby -pe 'BEGIN{sn=0}; $_.sub!(/^(<!-- shot\()\d*(\) -->)/) { "#{$1}#{sn+=1}#{$2}" }'

Now that is starting to look right!
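The same experiment with the block form, again as a sketch with a made-up `number_lazily` helper: the block only runs when the pattern matches, so the counter only moves for actual markers.

```ruby
# Hypothetical helper mirroring the block-form version of the one-liner.
def number_lazily(lines)
  sn = 0
  lines.map do |line|
    # The block is called only on a match; $1 and $2 hold the captured groups.
    line.sub(/^(<!-- shot\()\d*(\) -->)/) { "#{$1}#{sn += 1}#{$2}" }
  end
end

sample = ["Intro text.\n", "<!-- shot() -->\n", "More text.\n", "<!-- shot(9) -->\n"]
puts number_lazily(sample)
```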

We can tighten up the replacement string a bit by using a little-known shorthand.

When all we want to interpolate into a Ruby string is the value of a sigil-prefixed variable like $1 or $2, we can skip the curly braces and just use # followed by the variable name.

ruby -pe 'BEGIN{sn=0}; $_.sub!(/^(<!-- shot\()\d*(\) -->)/) { "#$1#{sn+=1}#$2" }'
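A tiny sketch to confirm the two spellings are equivalent. The `shorthand_demo` helper is hypothetical.

```ruby
# Hypothetical helper: interpolate $1 with and without curly braces.
def shorthand_demo(marker)
  marker =~ /shot\((\d+)\)/   # sets $1 to the digits inside the parens
  ["#$1", "#{$1}"]
end

p shorthand_demo("<!-- shot(7) -->")
```

The shorthand only covers plain variables. An expression like sn+=1 still needs the full curly-brace form.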

Another option is to use string concatenation operators instead of string interpolation. In this version we have to explicitly convert the shot number to a string, because String#+ won’t implicitly convert an Integer for us.

ruby -pe 'BEGIN{sn=0}; $_.sub!(/^(<!-- shot\()\d*(\) -->)/) { $1 + (sn+=1).to_s + $2 }'
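Here is a sketch of what happens if we forget the conversion. The `concat_marker` helper is made up, and it rescues the error only so we can look at it.

```ruby
# Hypothetical helper: build a marker with +, reporting a TypeError instead of crashing.
def concat_marker(number)
  "<!-- shot(" + number + ") -->"
rescue TypeError => error
  "TypeError: #{error.message}"
end

puts concat_marker(3.to_s)   # converted first, so concatenation works
puts concat_marker(3)        # an Integer cannot be concatenated directly
```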

Personally, I prefer the string interpolation version, because it makes it more visually obvious that a string is being constructed.

ruby -pe 'BEGIN{sn=0}; $_.sub!(/^(<!-- shot\()\d*(\) -->)/) { "#$1#{sn+=1}#$2" }'

There’s another shortcut we can apply. -p one-liners are so often used to do some selective substitution on the input text that Ruby supplies a global Kernel method called sub, which implicitly updates the contents of the last-read-line variable. This method is only available when Ruby runs with the -n or -p flag.

ruby -pe 'BEGIN{sn=0}; sub(/^(<!-- shot\()\d*(\) -->)/) { "#$1#{sn+=1}#$2" }'
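Because the Kernel version of sub only exists under -n or -p, one way to try it from a regular script is to launch a child Ruby process. This is a sketch under a couple of assumptions: `run_one_liner` is a made-up helper, and `RbConfig.ruby` is used to find the Ruby binary that is currently running.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: feed text to `ruby -pe CODE` on standard input
# and return whatever the child process prints.
def run_one_liner(code, text)
  output, _status = Open3.capture2(RbConfig.ruby, "-pe", code, stdin_data: text)
  output
end

# Single quotes keep the backslashes and #$1 intact for the child process.
ONE_LINER = 'BEGIN{sn=0}; sub(/^(<!-- shot\()\d*(\) -->)/) { "#$1#{sn+=1}#$2" }'

puts run_one_liner(ONE_LINER, "Intro.\n<!-- shot() -->\nMore.\n<!-- shot(9) -->\n")
```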

There’s also another way we can tackle this regex substitution.

Instead of capturing the before and after parts of the line into numbered groups, we can segment those parts of the line into lookbehind and lookahead expressions.

These expressions must still be found for the overall regex to match. But they are no longer considered part of the matched string to be replaced.

Which means that we can shrink our replacement string to just the number!

ruby -pe 'BEGIN{sn=0}; sub(/(?<=^<!-- shot\()\d*(?=\) -->)/) { sn+=1 }'

The lookahead and lookbehind expressions narrow the scope of the replacement so we no longer have to rebuild the rest of the line.
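This is also a good moment to check the idempotence requirement from earlier. The sketch below uses the same lookaround pattern in a plain Ruby method; `renumber_shots` and `SHOT_NUMBER` are names invented for this illustration.

```ruby
# Only the digits between "shot(" and ") -->" are part of the match.
SHOT_NUMBER = /(?<=^<!-- shot\()\d*(?=\) -->)/

# Hypothetical helper: renumber every shot marker in a script, top to bottom.
def renumber_shots(text)
  shot_number = 0
  # sub converts the block's return value to a string for us.
  text.each_line.map { |line| line.sub(SHOT_NUMBER) { shot_number += 1 } }.join
end

blank = "Intro.\n<!-- shot() -->\nMore.\n<!-- shot() -->\n"
stale = "Intro.\n<!-- shot(2) -->\nMore.\n<!-- shot(7) -->\n"

puts renumber_shots(blank)
p renumber_shots(blank) == renumber_shots(stale)
```

Blank and out-of-date markers end up with the same numbering, and running the method on its own output changes nothing.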

One last thing. When I renumber the shots in the script, I don’t actually want to output the result to standard out or create a new file. I usually want to update the file in place.

Ruby gives us an easy way to do that as well!

If we supply the -i flag and name the file as an argument, Ruby updates the file in-place instead of spitting to standard out.

ruby -i -pe 'BEGIN{sn=0}; sub(/(?<=^<!-- shot\()\d*(?=\) -->)/) { sn+=1 }'

Checking the contents of the file, we can see it has been updated.

If we’re worried about accidentally messing up our file, we can also supply a backup file extension to -i.

ruby -i.bak -pe 'BEGIN{sn=0}; sub(/(?<=^<!-- shot\()\d*(?=\) -->)/) { sn+=1 }'

After running our command, the directory now contains a new backup file with the original unaltered contents.

ls example-script.*
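To try the backup behavior without risking a real script, here is a sketch that runs the same command against a scratch file in a temporary directory. The `renumber_in_place` helper and the `example-script.md` file name are assumptions made for this illustration.

```ruby
require "rbconfig"
require "tmpdir"

# Hypothetical helper: renumber a scratch copy in place, keeping a .bak backup,
# and return the contents of both files.
def renumber_in_place(original_text)
  code = 'BEGIN{sn=0}; sub(/(?<=^<!-- shot\()\d*(?=\) -->)/) { sn+=1 }'
  Dir.mktmpdir do |dir|
    path = File.join(dir, "example-script.md")  # assumed file name
    File.write(path, original_text)
    # Equivalent to: ruby -i.bak -pe CODE example-script.md
    system(RbConfig.ruby, "-i.bak", "-pe", code, path) or raise "ruby -i failed"
    { updated: File.read(path), backup: File.read("#{path}.bak") }
  end
end

result = renumber_in_place("<!-- shot(5) -->\nSome narration.\n<!-- shot() -->\n")
puts result[:updated]
puts result[:backup]
```

The file itself holds the renumbered markers, and the .bak copy next to it still has the original numbering.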

And that is how we can use Ruby from the command-line to accomplish trivial, or even slightly less-than-trivial text rewrites. Happy hacking!