Benchmark

Video transcript & code

Sometimes code runs too slowly, and we have to find a way to make it go faster. But optimization is a tricky business. As the computer scientist Donald Knuth famously said: "premature optimization is the root of all evil". I can't count the number of times I've seen a programmer complicate code in the name of optimization, only to have it fail to make a difference—or even run more slowly than the original.

Because our intuition is rarely correct when it comes to optimization strategies, it is vital that we know how to quickly and effectively benchmark different approaches. That way we can back our optimizations with hard data instead of gut feelings. And so today, I thought we'd talk about benchmarking in Ruby.

What exactly does it mean to benchmark some code? Fundamentally, it means to run a bit of code a bunch of times and see how long it took to finish. We can roll our own quick and dirty benchmarks pretty easily. We save the current time. Then we run some code. It's usually necessary to run it many times over, so that the results aren't lost in statistical noise. Then we subtract the start time from the end time, and look at the result.

start_time = Time.now
1_000_000.times do
  2 + 2
end
result = Time.now - start_time
result                          # => 0.041977537

This technique only shows us so-called "wall-clock time". We can see how long the operation took, in terms of the real-world passage of time. But we don't know anything about what took up that time.

For more detail, we turn to the Ruby Benchmark standard library. In its simplest form, we can call Benchmark.measure and see how long an operation took.

require "benchmark"

result = Benchmark.measure {
  1_000_000.times do
    2 + 1
  end
}
result
# => #<Benchmark::Tms:0x007fe36b90b400
#     @cstime=0.0,
#     @cutime=0.0,
#     @label="",
#     @real=0.044379808,
#     @stime=0.0,
#     @total=0.04000000000000001,
#     @utime=0.04000000000000001>

The return value is a Benchmark::Tms object, which has a number of fields. Let's dig into what they represent.

require "benchmark"

result = Benchmark.measure {
  1_000_000.times do
    2 + 1
  end
}
result.real                     # => 0.043051687
result.stime                    # => 0.0
result.utime                    # => 0.05
result.total                    # => 0.05

The real attribute is our wall-clock time: the actual amount of time that passed from the start of the test to the end.

The stime attribute tells us how long the method spent in system calls. For instance, doing file or network I/O. In this case, since all we're doing is a calculation, the answer is zero.

The utime attribute reveals how long the code spent executing in user land; in this case, all of it. In other words, all of its time was spent in the Ruby virtual machine, not in the operating system kernel.

Then there's total. This value reflects all the time spent in either user-land or operating system code.
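
If you want to convince yourself of how total relates to the other fields, here's a quick sanity check (a minimal sketch; it just re-runs the measurement above and compares the CPU-time fields):

require "benchmark"

result = Benchmark.measure {
  1_000_000.times do
    2 + 1
  end
}

# total is the sum of the CPU-time fields: our own user and system time,
# plus the user and system time of any child processes.
result.total == result.utime + result.stime + result.cutime + result.cstime
# => true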

But wait a minute: how does total differ from real? Well, this is where using Benchmark is a big improvement over our hand-rolled version. Because lots of things can affect an operation's wall-clock time. On a modern operating system, there is always more than one program running. Maybe a file sync program on our PC chose just the moment we ran the benchmark to sync some files, and distracted the CPU in the process.

By contrast, the stime and utime numbers use operating system facilities to gauge only the CPU time used by the code being benchmarked. For this reason, the combined system and user time can often be a more stable and accurate measure of a piece of code's efficiency than wall-clock time is.
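
To see how the two can diverge, consider code that mostly waits rather than computes. A sleeping process consumes essentially no CPU, so the CPU-time figures stay near zero even though a full second of wall-clock time goes by. (This is a minimal illustration; the exact numbers will vary from run to run.)

require "benchmark"

result = Benchmark.measure { sleep 1 }
result.real                     # roughly 1.0: a second of wall-clock time passed
result.total                    # roughly 0.0: almost none of it was spent on the CPU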

So far we haven't seen any system time used. Let's look at the kind of operation that uses both system time and user time. Here we set up a string with the word "supercalifragilisticexpialidocious". Then we reverse the string, and write it to a file.

require "benchmark"

result = Benchmark.measure do
  100_000.times do
    str = "Supercalifragilisticexpialidocious"
    str.reverse!
    File.write("word.txt", str)
  end
end
result.utime                    # => 1.14
result.stime                    # => 4.37

When we look at the user time and system time stats, we can see that this code spends the majority of its time in operating system calls. In other words, it's spending most of its time writing to the file. If we came in here without benchmarks, thinking that the string reversal was the slowest part, we'd be mistaken. And we'd be disappointed by the modest gains from optimizing that part of the code. In this case, any optimization effort should probably target the file writes first, not the string munging.
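
If we wanted to double-check that hunch, we could time the string munging on its own, with the file write removed (a quick sketch along the lines of the measurement above; the exact figures will depend on your machine):

require "benchmark"

# Measure only the string work, leaving out the file I/O.
result = Benchmark.measure do
  100_000.times do
    str = "Supercalifragilisticexpialidocious"
    str.reverse!
  end
end
result.utime                    # small compared to the combined run above
result.stime                    # effectively zero; no system calls are involved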

So far we've just been benchmarking one thing at a time. But benchmarks are most useful when we can compare different implementations side by side. And the Benchmark library provides some handy tools for that.

First, let's come up with something interesting to compare. Let's say we've been working on some code that maps events to handler methods. We've been using #public_send to dynamically dispatch to a handler method based on the name of an event.

class Handler
  def handle_stepintime
  end
end

handler = Handler.new

event = :stepintime
handler.public_send("handle_#{event}")

In our last code review, someone flagged this code and said that it might cause performance problems. Our task is to compare this dynamic method dispatch to a case statement, to see if one is significantly slower than the other.

In order to do this, we first define a repetition count. Then we use the Benchmark.bm method, with a block. The block yields a reporter object. We send #report to this object, passing it the name of the scenario that we're presently reporting on. We also pass it a block. Inside the block we put the code we want to benchmark. We wrap the code to be tested in an n.times block, so we can easily vary the repetition count. Then we set up a simple one-branch case statement that switches on the event name and sends the appropriate handler message.

We go through the whole process again for our dynamic dispatch test. We'll repeat this test the same number of times. The test code itself is just one line in this case.

require "benchmark"

class Handler
  def handle_stepintime
  end
end

handler = Handler.new
n       = 1_000_000

Benchmark.bm do |r|
  event = :stepintime
  r.report("case dispatch") do
    n.times do
      case event
      when :stepintime then handler.handle_stepintime
      end
    end
  end
  r.report("dynamic dispatch") do
    n.times do
      handler.public_send("handle_#{event}")
    end
  end
end

When we execute this code we see that it automatically prints a report to standard output.

$ ruby bm1.rb
       user     system      total        real
case dispatch  0.120000   0.000000   0.120000 (  0.123501)
dynamic dispatch  0.360000   0.000

The report is a little hard to read though, because the columns don't line up. This is because the .bm method isn't clever enough to auto-calculate the appropriate column width for the labels on the left-hand side. Fortunately, we can provide an argument which forces a column width.

require "benchmark"

class Handler
  def handle_stepintime
  end
end

handler = Handler.new
n       = 1_000_000

Benchmark.bm(16) do |r|
  event = :stepintime
  r.report("case dispatch") do
    n.times do
      case event
      when :stepintime then handler.handle_stepintime
      end
    end
  end
  r.report("dynamic dispatch") do
    n.times do
      handler.public_send("handle_#{event}")
    end
  end
end

$ ruby bm2.rb
                       user     system      total        real
case dispatch      0.120000   0.000000   0.120000 (  0.120638)
dynamic dispatch   0.360000   0.000000   0.360000 (  0.359631)

Now that the report is neatly aligned, we can easily see that, indeed, dynamic dispatch using #public_send and string interpolation is roughly three times slower than case-based dispatch. Whether this is enough of a difference to have a significant effect on our application's speed depends on where and how this dispatch is used.
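
One way to put those numbers in perspective is to divide them back out by the iteration count (simple arithmetic on the totals reported above):

n = 1_000_000
(0.36 - 0.12) / n               # about 2.4e-07, i.e. roughly 240 extra nanoseconds per dispatch

A quarter of a microsecond per event is unlikely to be noticeable unless the dispatch sits in a very hot loop.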

There's one subtlety of benchmarking we haven't yet addressed. When we run a benchmarking script like this, we're testing the code from a "cold start". And that's a problem, because the performance of a piece of code can vary wildly from when it is first run, versus when the virtual machine has been "warmed up" with a few iterations. This "warm-up" effect happens for a lot of reasons: the hardware and operating system have caches that need to be warmed up. The virtual machine may perform optimizations over time, such as call-site caching of methods, or code inlining. For these and many other reasons, the first few iterations might be wildly different in performance from the later repetitions.

In addition, one test running before another may bias the results against the first test, since the first one may have served to warm up the system for the second.

In order to combat this "warm-up" effect, Benchmark has one more tool to offer us. If we replace bm with bmbm, Benchmark runs the tests twice over. The first run is a "rehearsal". It serves to warm up the system for the next run. Then the second run gives us steady-state statistics.

require "benchmark"

class Handler
  def handle_stepintime
  end
end

handler = Handler.new
n       = 1_000_000

Benchmark.bmbm(16) do |r|
  event = :stepintime
  r.report("case dispatch") do
    n.times do
      case event
      when :stepintime then handler.handle_stepintime
      end
    end
  end
  r.report("dynamic dispatch") do
    n.times do
      handler.public_send("handle_#{event}")
    end
  end
end

We can see in the results that the case dispatch runtime is significantly faster in the warmed-up second round than it was from a cold start.

$ ruby bmbm.rb
Rehearsal ----------------------------------------------------
case dispatch      0.120000   0.000000   0.120000 (  0.120997)
dynamic dispatch   0.360000   0.000000   0.360000 (  0.361532)
------------------------------------------- total: 0.480000sec

                       user     system      total        real
case dispatch      0.080000   0.000000   0.080000 (  0.080419)
dynamic dispatch   0.350000   0.000000   0.350000 (  0.356050)

That's not the only thing bmbm does. It also forces a garbage collection run before running the scenarios, in the hope of preventing the garbage collector from kicking in during the scenarios and skewing the results.
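
We can borrow the same trick in a hand-rolled benchmark. Here's a minimal sketch using the standard GC module (how much this helps depends on how allocation-heavy the measured code is):

require "benchmark"

# Collect any garbage left over from setup first, so a GC pause triggered by
# earlier allocations is less likely to land inside the measurement. This is
# the same idea bmbm applies before its timed run.
GC.start

result = Benchmark.measure do
  100_000.times { "Supercalifragilisticexpialidocious".reverse }
end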

There are many more performance measurement tools we could talk about, and we will, in future episodes. But if you are only familiar with one benchmarking tool, this should be the one. It's always available in the standard library. And it's often sufficient to give us the insight we need into how one code solution stacks up against another. If you're not already benchmarking all of your performance questions, I hope that this episode has given you the kick-start you need.

Happy hacking!
