In Progress
Unit 1, Lesson 21
In Progress

Schwartzian Transform

Video transcript & code

Whole books have been devoted to the topic of sorting and searching data in computer programs. Fortunately for us, Ruby gives us the tools to benefit from all that research without necessarily being algorithms experts ourselves. Today's show demonstrates one particularly useful sorting optimization that Ruby can help us out with, if we know how to ask for it.

Let's say we have a list of video files, and we want to sort them by length. Not by file size, but by video runtime.

videos = Dir["../*/[0-9][0-9][0-9]*.mp4"]

We have a method called #duration which uses the avprobe command-line utility to detect and return a file's duration.

def duration(filename)
  stats = `avprobe #{filename} 2>&1`
  /(?<=Duration: )\d\d:\d\d:\d\d\.\d\d/.match(stats).to_s
end

When we test this method on one of the video files, we can see it returns a string consisting of hours, minutes, seconds, and milliseconds. While this isn't a numeric type, it's formatted in such a way that a string comparison will accurately reflect which video is shorter and which is longer.

duration(videos.last)          # => "00:04:44.48"

In order to sort our list of files, we could send the #sort message along with a block. The block receives two video filenames. In it we compare the duration of the first file with the duration of the second file using the spaceship operator.

sorted_videos = videos.sort{|x, y| duration(x) <=> duration(y)}

This works, but it turns out to be painfully slow. Given 144 videos, when we time it we can see that it takes close to a minute to run.

videos.count                    # => 144
start_time = Time.now
sorted_videos = videos.sort{|x, y| duration(x) <=> duration(y)}
Time.now - start_time           # => 53.290444705

Now, we know that getting a video duration is a relatively expensive operation, since we have to shell out to an external program and then parse the results. But this still seems a bit excessive. Let's dig into why it takes so long.

To do so, we create a global $call_count variable. inside duration, we increment the call count. We execute the sort again, and this time we take a look at the $call_count at the end.

def duration(filename)
  $call_count += 1
  stats = `avprobe #{filename} 2>&1`
  /(?<=Duration: )\d\d:\d\d:\d\d\.\d\d/.match(stats).to_s
end

sorted_videos = videos.sort{|x, y| duration(x) <=> duration(y)}
$call_count                     # => 1824

Remember, we're sorting 144 video files. But the #duration method is called a whopping 1824 times! This is because in order to sort a list, Ruby may have to compare a given entry to many other entries before finding its final place in the sort order. Computer scientists have devoted a lot of thought and time to reducing the total number of comparisons required to sort a given number of elements, but barring a miraculous new algorithm, the number of comparisons needed will generally be considerably greater than the number of elements.

OK, so we can't reduce the number of comparisons. But can we reduce the amount of work done in each comparison? Well, the duration of a given file doesn't actually change between each comparison. So once we know the duration of a given video file, we could hang on to that information rather than calculating it anew for every comparison.

Let's rewrite our sort. Before the sort, we map our list of videos into a list of two-element arrays. Each array has the video file name as its first element, and the video duration as the second element. Then we take this list of pairs and sort it. This time the sort block will receive two different two-element arrays. We use some splat groups to destructure each pair into separate video and duration parameters. Then we compare only the duration parts.

Now we have a list of sorted two-element arrays. We no longer need the durations now that we are done sorting. So we map the list back into a list of just video filenames by extracting the first element of each array.

videos = Dir["../*/[0-9][0-9][0-9]*.mp4"]

$call_count = 0

def duration(filename)
  $call_count += 1
  stats = `avprobe #{filename} 2>&1`
  /(?<=Duration: )\d\d:\d\d:\d\d\.\d\d/.match(stats).to_s
end

$call_count = 0
start_time = Time.now
sorted_videos = videos.map{|v| [v, duration(v)]} 
  .sort{|(xv, xd), (yv, yd)| xd <=> yd} 
  .map(&:first)
Time.now - start_time           # => 4.059575059
$call_count                     # => 146

Let's run this code. The execution time is massively reduced—down to around 4 seconds. And the call count for #duration is now 144—or just once for each video file.

This idiom, of pre-calculating comparison keys for a set of elements before sorting them, is sometimes known as a "Schwartzian Transform", after Randall Schwartz, who demonstrated the technique in the Perl language. It's so useful that Ruby gives us an easier way to employ it.

Instead of #sort, we can send the #sort_by message. Like #sort, #sort_by also takes a block. But where the #sort block is intended to perform element comparison, the block given to #sort_by should just extract a key from the given element. Ruby will then perform the Schwartzian Transform we just saw behind the scenes, using the provided block to pre-calculate all the element keys before using the keys to sort.

videos = Dir["../*/[0-9][0-9][0-9]*.mp4"]

$call_count = 0

def duration(filename)
  $call_count += 1
  stats = `avprobe #{filename} 2>&1`
  /(?<=Duration: )\d\d:\d\d:\d\d\.\d\d/.match(stats).to_s
end

$call_count = 0
start_time = Time.now
sorted_videos = videos.sort_by{|v| duration(v)}
Time.now - start_time           # => 4.165870343
$call_count                     # => 144

When we execute this code, we can see that it exhibits a similarly quick runtime as before, and the #duration method is only called once for each element of the collection. We've reproduced our earlier optimization with a single method call.

When sorting collections on some expensive-to-calculate property of the elements, a Schwartzian Transform can drastically reduce the amount of work the code has to do. Ruby makes invoking this technique trivially easy with the #sort_by method. If you've ever skimmed over the Enumerable documentation and wondered why there is both a #sort and a #sort_by method, now you know! Happy hacking!

Responses