Unit 1, Lesson 21

Log Thread

Video transcript & code

When we are running software in development, we are often perfectly happy to trade a little performance for increased information. We disable optimizations, turn off caches, and turn logging verbosity up. But in production, things are different. There, speed is often at a premium, and we may sacrifice visibility into our code for the sake of performance.

Take this code, for instance. What we have here is a little web app written in Sinatra. When it receives a request, it simply serves up the full text of Alice's Adventures in Wonderland.

require "sinatra"

get "/" do
  send_file "alice.html", disposition: "inline"
end

When I start this app on my machine, using the Thin web server, and load test it with the Apache Bench tool, I see that it processes about 1000 requests per second.

$ ab -n 10000 http://localhost:4567/ | grep "Requests per second"
Requests per second:    1009.00 [#/sec] (mean)

But let's say that we start encountering a problem with this application, and in order to track it in production we introduce some error logging. So we set up a special error logger, and we add some error logging to each request.

require "sinatra"
require "logger"

ERRLOG = Logger.new("error.log")

get "/" do
  ERRLOG.error "That thing happened again"
  send_file "alice.html", disposition: "inline"
end

But when we benchmark our server again, we discover that we have a problem. Adding one log message per request has reduced our throughput by more than 30%!

$ ab -n 10000 http://localhost:4567/ | grep "Requests per second"
Requests per second:    686.69 [#/sec] (mean)

We need this information, but we can't drastically cut performance in order to track it. What do we do?

There is more than one answer to this question, and which one makes the most sense depends on what our reporting needs are. In this case, we don't need error reports instantaneously. We just need them to be logged so we can examine the logs later.

With this in mind, we decide to separate the noting of errors from the logging of errors. We establish a queue for error messages. You may recall from episode #126 that a Ruby Queue is a thread-safe way to pass messages from one thread to another.
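As a quick refresher on that point (this snippet is mine, not from episode #126), a consumer thread that calls pop on an empty Queue simply blocks until some other thread pushes a value:

```ruby
q = Queue.new
received = nil

consumer = Thread.new do
  received = q.pop   # blocks here until another thread pushes a value
end

q.push :hello        # hands the value off and wakes the waiting consumer
consumer.join

puts received.inspect  # → :hello
```

No locking code is needed on our part; the Queue handles the synchronization between producer and consumer.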

Then we create a brand new thread, whose sole job will be to take error reports off of the queue and log them to a file.

Finally, we change the HTTP request handler to push a message onto the error queue, instead of logging it directly.

require "sinatra"
require "logger"

ERRLOG = Logger.new("error.log")
ERRQ   = Queue.new

Thread.new do
  while message = ERRQ.pop
    ERRLOG.error message
  end
end

get "/" do
  ERRQ.push "That thing happened again"
  send_file "alice.html", disposition: "inline"
end

Optimistic that this separation of responsibilities will enable the main thread to process requests more quickly, we run our benchmark again. But to our dismay, we get similar numbers to those we saw before.

$ ab -n 10000 http://localhost:4567/ | grep "Requests per second"
Requests per second:    690.39 [#/sec] (mean)

Well, maybe the problem is that Ruby is giving all threads the same priority, and we really want our logging thread to have low priority. We try lowering the thread priority of the logging thread below the default priority of 0.

require "sinatra"
require "logger"

ERRLOG = Logger.new("error.log")
ERRQ   = Queue.new

t = Thread.new do
  while message = ERRQ.pop
    ERRLOG.error message
  end
end
t.priority = -1

get "/" do
  ERRQ.push "That thing happened again"
  send_file "alice.html", disposition: "inline"
end

We run our benchmark again and… again, we see no improvement.

$ ab -n 10000 http://localhost:4567/ | grep "Requests per second"
Requests per second:    676.87 [#/sec] (mean)

The fundamental problem that we haven't escaped is that Ruby queues are natural points for a transfer of control. Pushing a new value onto a queue signals to any waiting threads that a new value is ready, and tells the Ruby scheduler that now may be a good time to let one of those threads run. So we are effectively swapping control back and forth between the main thread and the logging thread every time an error is reported.
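We can see this per-message handoff in isolation with a small sketch (mine, not from the episode): an eager consumer blocks on pop, so every single push is an opportunity for the scheduler to switch control to it.

```ruby
require "benchmark"

q = Queue.new
processed = 0

consumer = Thread.new do
  while (message = q.pop)  # nil acts as our shutdown signal
    processed += 1         # stand-in for the real logging work
  end
end

# Each push can hand control to the blocked consumer, so the cost
# of the thread handoff is potentially paid once per message.
elapsed = Benchmark.realtime { 100_000.times { q.push "event" } }

q.push nil
consumer.join
puts format("%d messages handed off in %.3f seconds", processed, elapsed)
```

The absolute numbers will vary from machine to machine, but the shape of the problem is the same as in our web app: the producer never gets to run very long before the consumer is woken up again.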

Let's try a different approach to the logging thread. We'll add an outer loop which checks the number of waiting messages. Then the inner loop will only try to pull that many messages off of the queue.

After the messages are logged, we add one more, all-important line of code. We tell the thread to sleep for a tenth of a second. This means that instead of being woken up every time there is a new message on the queue, this thread will only wake up ten times a second. Each time it will process as many messages as were already in the queue when it woke up, and then it will go back to sleep.

require "sinatra"
require "logger"

ERRLOG = Logger.new("error.log")
ERRQ   = Queue.new

t = Thread.new do
  loop do
    message_count = ERRQ.size
    message_count.times do
      message = ERRQ.pop
      ERRLOG.error message
    end
    sleep 0.1
  end
end
t.priority = -1

get "/" do
  ERRQ.push "That thing happened again"
  send_file "alice.html", disposition: "inline"
end

Once more we run our benchmark. This time, we see numbers that are much closer to our original, un-instrumented code.

$ ab -n 10000 http://localhost:4567/ | grep "Requests per second"
Requests per second:    961.95 [#/sec] (mean)

By batching up log writes into ten flushes per second, instead of switching between serving requests and logging on every single message, we've drastically reduced the impact of our logging on overall throughput.

Now, as with any optimization, this one comes with a huge caveat: it works in this particular scenario, with this particular kind of workload. Your mileage may vary, and in fact it almost certainly will. Make sure to run your own benchmarks before complicating your systems with threads and queues. And be aware that batching up work like this can lead to larger variances in request latency, because a fraction of the requests will get hung up waiting for the thread to finish flushing the queue.

But if you ever run into a situation where you have low-priority reporting which you don't want to interfere with the usual business of the system, servicing those reports using a sleepy thread like this one may give you the boost you need. Happy hacking!
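If you do adopt the pattern, it can be tidied up into a small reusable class. Here's one possible sketch (the BatchedLogger name and its interface are my own, not part of the episode; the shutdown handling in close is deliberately simplistic and could drop an in-flight message):

```ruby
require "logger"

# A hypothetical wrapper around the "sleepy thread" pattern above.
class BatchedLogger
  def initialize(path, interval: 0.1)
    @logger = Logger.new(path)
    @queue  = Queue.new
    @thread = Thread.new do
      loop do
        # Drain only the messages that were queued when we woke up.
        @queue.size.times { @logger.error(@queue.pop) }
        sleep interval
      end
    end
    @thread.priority = -1
  end

  # Cheap for callers: just a queue push, no file I/O on the hot path.
  def error(message)
    @queue.push(message)
  end

  # Stop the background thread and flush anything still queued.
  # (Killing a thread mid-drain could lose one message; fine for a sketch.)
  def close
    @thread.kill
    @logger.error(@queue.pop) until @queue.empty?
    @logger.close
  end
end
```

Usage then looks just like a plain Logger from the caller's point of view: `log = BatchedLogger.new("error.log")` and `log.error "That thing happened again"` inside the request handler.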
