File Find

Video transcript & code

The other day I was looking through my files and discovered a complete RubyTapas episode that I had apparently produced and then forgotten about. Ironically, this episode is about finding files. Here, at last, is the "lost episode" on Ruby's find library. Enjoy!


Lately I've been wanting to dig into some stats about the RubyTapas videos I've written so far. For instance, I'm curious how much total video I've produced.

I have a command-line tool called avprobe which can pull out metadata about video files. Here's an example of the output:

avprobe ../090-class-self/090-class-self.mp4 2>&1
avprobe version 0.8.5-6:0.8.5-0ubuntu0.12.10.1, Copyright (c) 2007-2012 the Libav developers
  built on Jan 24 2013 14:49:20 with gcc 4.7.2
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '../090-class-self/090-class-self.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 19529854
    compatible_brands: mp42isom
    creation_time   : 2013-03-21 08:43:08
  Duration: 00:02:52.43, start: 0.000000, bitrate: 812 kb/s
    Stream #0.0(eng): Audio: aac, 48000 Hz, stereo, s16, 192 kb/s
    Metadata:
      creation_time   : 2013-03-21 08:43:08
    Stream #0.1(eng): Video: h264 (Main), yuv420p, 960x540 [PAR 1:1 DAR 16:9], 617 kb/s, 14.99 fps, 14.99 tbr, 30k tbn, 29.97 tbc
    Metadata:
      creation_time   : 2013-03-21 08:43:08

Now I could easily run this tool on all of my finished video files using the UNIX find(1) utility, but this is a Ruby show. And anyway, if I do my digging in Ruby code, it'll be easier to extend it with more elaborate analyses later on.

Ruby actually comes with a Find utility of its own. I'm going to use the Pathname variant of Ruby's find. I start by constructing a Pathname for the root of the directory tree I want to search for video files. Then I send the #find message to this path. #find will recursively search the provided directory tree, yielding every single directory and file it finds to the given block. I can see this if I just put a simple p statement inside the #find block.

require 'pathname'

Pathname("~/Dropbox/rubytapas/090-class-self").expand_path.find do |path|
  p path
end
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.html>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.media>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.mp4>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.org>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.org~>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.rb>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.veg>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.veg.bak>

Something else this demonstration reveals is the fact that Pathname's #find method yields Pathname objects instead of simple strings. This will come in handy in a moment.
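
As a rough illustration of why Pathname objects are convenient here, the sketch below calls a few Pathname methods on one of the paths from the listing above. Nothing is read from disk, so it runs even if the file does not exist; the variable names are only for illustration.

```ruby
require 'pathname'

# A path taken from the listing above; it does not need to exist on disk.
path = Pathname("/home/avdi/Dropbox/rubytapas/090-class-self/090-class-self.mp4")

base   = path.basename   # Pathname for the last path component
ext    = path.extname    # file extension, returned as a String
parent = path.dirname    # Pathname for the containing directory

puts base     # 090-class-self.mp4
puts ext      # .mp4
puts parent   # /home/avdi/Dropbox/rubytapas/090-class-self
```

With plain strings, the same questions would go through File.basename, File.extname and File.dirname instead.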

I'm only interested in completed video files, so I need to do some kind of filtering. I start an if statement, and look for paths which are files, not directories. Then I check them against some regular expressions which should match RubyTapas episodes. In each regular expression I put a parenthesis group around the episode number, for later reference.

Executing this code shows that I've selected a subset of the video files.

require 'pathname'

Pathname("~/Dropbox/rubytapas").expand_path.find do |path|
  if path.file? && (
      path.basename.to_s =~ /^RubyTapas(\d{3})\b.*\.mp4$/ ||
      path.basename.to_s =~ /^(\d{3})\b.*\.mp4$/)

    p path
  end
end
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/001-binary-literals/RubyTapas001-sample.mp4>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/001-binary-literals/RubyTapas001.mp4>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/002-large-integer-literals/RubyTapas002.mp4>
# >> #<Pathname:/home/avdi/Dropbox/rubytapas/003-char-literals/RubyTapas003.mp4>

This looks promising. Now I proceed to collect the data I'm actually interested in for each file. First, I grab the episode number by referencing the global back-reference variable for the first match group from the last-matched regex.

I add a little bit of extra filtering after this line, skipping to the next path if the pathname contains the word "sample". I do this after the assignment of the number because otherwise this new regex match would overwrite the back-reference variables.
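
The ordering concern can be seen in isolation with a small sketch. It uses one of the sample filenames from the earlier output; the variable names are made up for illustration.

```ruby
name = "RubyTapas001-sample.mp4"

# First match: the parenthesized group captures the episode number.
name =~ /^RubyTapas(\d{3})\b.*\.mp4$/
number = $1

# Second match: this regex succeeds but has no groups,
# so the back-reference variables are reset.
name =~ /sample/
after_second_match = $1

p number              # "001"
p after_second_match  # nil
```

Saving `$1` into a local variable before running another match is what keeps the episode number available.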

Next I use the backquotes to grab the output of the avprobe command and stash it in a variable. I have to redirect stderr to stdout in the command because avprobe writes to stderr.
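
The effect of the redirect can be sketched without avprobe installed. The stand-in command below is an assumption for illustration only: it writes one line to stderr and nothing to stdout, and it expects a UNIX-like shell.

```ruby
# Stand-in for avprobe (hypothetical): prints only to stderr.
command = %q(sh -c 'echo "written to stderr" >&2')

# Backquotes capture stdout only. Here stderr is discarded,
# so nothing is captured at all.
stderr_discarded = `#{command} 2>/dev/null`

# With 2>&1, stderr is merged into stdout and ends up in the string.
with_redirect = `#{command} 2>&1`

p stderr_discarded  # ""
p with_redirect     # "written to stderr\n"
```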

I use yet another regular expression to extract the video duration for the stats output.
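
As a sketch of how that extraction behaves, the snippet below applies String#[] with a regex and a group index to the Duration line from the avprobe output shown earlier. The conversion to seconds at the end is not part of the episode's script; it is an assumed extra step, included because totals are easier to add up as integers.

```ruby
# One line copied from the sample avprobe output above.
stats = "  Duration: 00:02:52.43, start: 0.000000, bitrate: 812 kb/s\n"

# String#[] with a regex and a group index returns that capture,
# or nil when the regex does not match.
duration = stats[/Duration: (\d{2}:\d{2}:\d{2})/, 1]
missing  = "no duration here"[/Duration: (\d{2}:\d{2}:\d{2})/, 1]

# Assumed follow-up step: convert HH:MM:SS to a number of seconds.
hours, minutes, seconds = duration.split(":").map(&:to_i)
total_seconds = hours * 3600 + minutes * 60 + seconds

p duration       # "00:02:52"
p missing        # nil
p total_seconds  # 172
```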

Now I'm ready to dump the output. I write the episode number, duration, and the filename, separated by spaces, to $stdout.

I decide to add one more optimization before I finish. I often have unfinished episodes in directories that start with three 'x'-s instead of an episode number. While searching these directories won't hurt, it's a waste of time because I'm only interested in finished videos. So I'd like to somehow tell the #find method to ignore these directories and anything inside them.

To do this, I use Find.prune if the current path is for an in-progress episode. You might be wondering where the Find constant came from here. The truth is, the Pathname#find method I've been using is actually just a thin wrapper over Ruby's Find library. And Find.prune does the same thing that the -prune option to UNIX find(1) does: it signals the in-progress finding process NOT to recurse into the current directory. In effect, it "prunes" the current branch of the directory tree off of the search.

require 'pathname'

Pathname("~/Dropbox/rubytapas").expand_path.find do |path|
  Find.prune if path.directory? && path.basename.to_s =~ /^xxx/
  if path.file? && (
      path.basename.to_s =~ /^RubyTapas(\d{3})\b.*\.mp4$/ ||
      path.basename.to_s =~ /^(\d{3})\b.*\.mp4$/)
    number   = $1
    next if path.basename.to_s =~ /sample/
    stats    = `avprobe #{path} 2>&1`
    duration = stats[/Duration: (\d{2}:\d{2}:\d{2})/, 1]
    puts "#{number} #{duration} #{path.basename}"
  end
end
# >> 001 00:01:47 RubyTapas001.mp4
# >> 002 00:00:45 RubyTapas002.mp4
# >> 003 00:01:11 RubyTapas003.mp4
# >> 005 00:03:36 RubyTapas005.mp4
# >> 006 00:04:56 RubyTapas006-Forwardable.mp4

The output doesn't change after this modification, but at least in theory it runs a little faster.
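
To see the pruning in isolation, here is a small sketch that builds a throwaway directory tree and records which paths get visited. The directory and file names are made up for illustration, and the tree is deleted when the block exits.

```ruby
require 'find'
require 'pathname'
require 'tmpdir'
require 'fileutils'

visited = []

Dir.mktmpdir do |root|
  # Hypothetical layout: one finished episode and one in-progress episode.
  FileUtils.mkdir_p(File.join(root, "001-binary-literals"))
  FileUtils.mkdir_p(File.join(root, "xxx-unfinished"))
  FileUtils.touch(File.join(root, "001-binary-literals", "RubyTapas001.mp4"))
  FileUtils.touch(File.join(root, "xxx-unfinished", "draft.mp4"))

  Pathname(root).find do |path|
    # Skip in-progress directories and everything beneath them.
    Find.prune if path.directory? && path.basename.to_s =~ /^xxx/
    visited << path.relative_path_from(Pathname(root)).to_s
  end
end

puts visited.sort
# .
# 001-binary-literals
# 001-binary-literals/RubyTapas001.mp4
```

Neither the xxx-unfinished directory nor the file inside it appears in the list, because Find.prune abandons the current block iteration as well as the descent into that directory.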

I can see in my output that there are still a few apparent duplicates and files that don't belong, but at this point it's probably easier for me to manually clean up my RubyTapas directory than to encode more special processing into this script. Anyway, now you know how to use Ruby to perform simple file searches. Happy hacking!
