Dead Thread

I've had a lot of fun digging into Rake over the past couple of weeks, but it's time now to return to the topic of threads and queues. Today I want to talk about what happens when threads unexpectedly die.

Let's start with the most basic possible producer-consumer example. We'll require the "thread" library, then create a Queue to enable two threads to communicate. We'll start a producer thread. This thread loops forever, manufacturing widgets and pushing them onto the queue. Then it sleeps in order to simulate the work of making another widget.

The consumer thread simply pops widgets off of the queue and reports the widget name. Then it waits for a second, simulating whatever work it needs to do on the widget.

We end this example by waiting for both the producer and the consumer to stop before exiting. Otherwise this program would exit immediately after starting.

require "thread"

q = Queue.new

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.dequeue
  puts "popping #{widget}"
  sleep 1
end

producer.join
consumer.join

Let's run this program. We see the producer manufacturing widgets… and nothing else. What's the problem here?

If you remember a few weeks ago when we talked about the Queue class, you might be saying "I see the problem: you misspelled the q.dequeue method!"

But I'm afraid that's not the answer I'm looking for. You've spotted the defect all right. But that's not the real problem here. The problem is that there was a defect in one of our threads and we received no notification about it whatsoever. This is the kind of silent failure that gives threaded programming its bad reputation.

Normally, when an unhandled exception is raised in a thread, it causes the thread to terminate. The thread then sits there, in a dead state. But this has no effect on the main thread of the program. The only time we'll see the exception is if and when the main thread calls #join on that thread, and as you can see, we're calling #join on the producer thread before we call it on the consumer. Since the producer loops forever, we never get to the second join, and we never see the exception. (Ruby 2.5 and later also print a report to standard error when a thread dies this way, because Thread.report_on_exception defaults to true there; on older versions the failure is completely silent.)
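To see this behavior in isolation, here is a minimal sketch. The method name observe_dead_thread and the error message are made up for illustration. It lets a thread die, checks its status, and only then calls #join.

```ruby
# Returns the worker's status before join, plus whatever join re-raised.
def observe_dead_thread
  worker = Thread.new do
    # Ruby 2.5+ prints a report when a thread dies; silence it here so the
    # thread fails as quietly as it would on older Rubies.
    Thread.current.report_on_exception = false
    raise "widget machine jammed"
  end

  sleep 0.01 while worker.alive?   # nobody has called join yet
  status = worker.status           # nil means "died from an exception"

  error = begin
    worker.join                    # the stored exception is re-raised here
    nil
  rescue RuntimeError => e
    e
  end

  [status, error]
end

status, error = observe_dead_thread
puts "status before join: #{status.inspect}"
puts "join re-raised: #{error.message}"
```

Until the call to #join, the only trace of the failure is the thread's nil status.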

Let's modify the code to set a global flag called Thread.abort_on_exception to true. This tells Ruby that when an exception is raised in a thread and causes the thread to die, the exception should immediately be re-raised in the main thread.

Then we'll run it again.

require "thread"

Thread.abort_on_exception = true

q = Queue.new

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.dequeue
  puts "popping #{widget}"
  sleep 1
end

producer.join
consumer.join

This time we see a NoMethodError pointing out our mistake. So clearly setting the Thread.abort_on_exception flag is a helpful debugging tool when using threads. But we don't always want all threads to handle exceptions this way, so this is a setting we only want to include when there is debugging to be done.
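If we only want some threads to behave this way, the same flag also exists per thread, as Thread#abort_on_exception=. The following is a rough sketch rather than a recommendation. The method name main_thread_sees_crash is hypothetical, and it assumes it is called from the main thread.

```ruby
# Call this from the main thread: the crash in the worker is re-raised here
# even though we never call join on it.
def main_thread_sees_crash
  Thread.new do
    Thread.current.report_on_exception = false  # skip the extra report on Ruby 2.5+
    Thread.current.abort_on_exception = true    # only this thread aborts the program
    raise "consumer crashed"
  end

  sleep 1            # the main thread is busy elsewhere and never joins
  :nothing_raised
rescue RuntimeError => e
  e.message
end

puts main_thread_sees_crash
```

Other threads in the program keep the default behavior, so the setting can stay on the threads we care most about.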

There is another way to set this flag. If we set the $DEBUG global variable, Ruby switches on Thread.abort_on_exception along with some other debugging features. When we run the code again we can see the exception being reported everywhere it is raised or re-raised.

require "thread"

$DEBUG = true

q = Queue.new

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.dequeue
  puts "popping #{widget}"
  sleep 1
end

producer.join
consumer.join

We're still having to modify the code to enable this setting though. Instead of explicitly setting $DEBUG in the code, we can also set it by passing the -d flag to ruby. When we run our program with -d, we once again see the error bubble up.

OK, let's go ahead and fix this defect by replacing the misspelled #dequeue with the correct method name, #deq.

require "thread"

q = Queue.new

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.deq
  puts "popping #{widget}"
  sleep 1
end

producer.join
consumer.join

Now let's run it again. It starts out alright, but only one widget is popped off by the consumer. Something is still wrong with our code, but what is it?

The defect this time is that the consumer thread has no loop, so it just pops one widget off of the queue and then dies. But again, that's just the defect; it's not the real problem. The problem, this time, is that we can keep filling up our queue with items until the end of time, with no errors. Once again, this hides the fact that the consumer thread has halted. If we left this program running it would also leak memory gradually, as the producer kept pumping more elements into the queue.
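A tiny sketch makes the point about unbounded growth concrete. The method name backlog_after is invented for this illustration. It pushes widgets with no consumer at all and reports how many are waiting.

```ruby
# Push a number of widgets onto a plain Queue that nobody is reading from.
def backlog_after(pushes)
  q = Queue.new
  pushes.times { |i| q.enq "widget#{i + 1}" }
  q.size   # every push succeeded; nothing ever complained
end

puts "backlog with no consumer: #{backlog_after(10_000)}"
```

A plain Queue never pushes back, so the backlog is limited only by available memory.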

Let's make the queue a little less accommodating. To do that we'll switch it to a SizedQueue, of size 3. This queue will only allow three items to be in it at once.

require "thread"

q = SizedQueue.new(3)

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.deq
  puts "popping #{widget}"
  sleep 1
end

producer.join
consumer.join

We get an interesting error message this time. It says that we've triggered a deadlock scenario. The reason is that the consumer thread is dead, the producer thread is waiting for space in the queue that will never appear, and the main thread is waiting for the producer to finish. Since all the threads are either dead or waiting, Ruby is able to infer that this program will wait forever, and it raises an error.
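The limit that the producer runs into can be observed without any threads at all. This sketch (accepted_before_full is a hypothetical name) uses the optional non_block argument to SizedQueue#push, which raises ThreadError instead of waiting when the queue is full.

```ruby
# Count how many widgets a SizedQueue accepts before it refuses another.
def accepted_before_full(capacity)
  q = SizedQueue.new(capacity)
  accepted = 0
  loop do
    q.push("widget#{accepted + 1}", true)  # non_block = true: raise rather than wait
    accepted += 1
  end
rescue ThreadError
  accepted   # the queue was full, so the last push was refused
end

puts "a SizedQueue of 3 accepted #{accepted_before_full(3)} widgets"
```

In the blocking form used in the episode, that refused push is where the producer goes to sleep waiting for space.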

This is a nice big clue that something is wrong. Unfortunately, we can't rely on always getting a "deadlock" error whenever we have a mistake in our threaded code. Ruby will only report the error if all the live threads are waiting. So if there are other, unrelated threads with no problems that are still executing, we'll never see a deadlock error for our problem threads.

To simulate this, let's create a new thread called idler which simply loops and waits over and over.

require "thread"

q = SizedQueue.new(3)

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    q.enq widget
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.deq
  puts "popping #{widget}"
  sleep 1
end

idler = Thread.new do
  loop do
    sleep 1
  end
end

producer.join
consumer.join

This time when we run the code, there is no error. We do see that the producer stops after producing four widgets, though. While it's not an error message, this is a clear indication that something is wrong. But still, an error message would be nice.

What we really need is a sanity check in the code that would raise an exception if pushing a widget onto the queue takes an inordinately long amount of time.

To make this happen, we include the timeout library. We then surround the enqueue operation with a call to Timeout.timeout. We tell it to wait three seconds before timing out. The consumer should only take a second or so to process a widget, so this should be plenty of time.

require "thread"
require "timeout"

q = SizedQueue.new(3)

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    Timeout.timeout(3) do
      q.enq widget
    end
    sleep 1
  end
end

consumer = Thread.new do  
  widget = q.deq
  puts "popping #{widget}"
  sleep 1
end

idler = Thread.new do
  loop do
    sleep 1
  end
end

producer.join
consumer.join

We run our program again, and this time we see it raise an exception after the allotted time has passed.

I want to stop here to give you a very important warning: the timeout library can cause threading bugs in your program and you generally shouldn't use it. We can get into the how and the why in another show. If you're using Ruby 2.0, the code I'm showing you today is safe, but in 1.9 and earlier versions of Ruby it was not thread-safe to use timeout with queues.

With that caveat covered, let's run the code with the timeout in place. We watch a few widgets being produced, and then bang, it fails with a timeout error.
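For anyone who would rather avoid the timeout library, one possible alternative is to poll with a non-blocking push until a deadline passes. This is only a sketch under that assumption. The names push_with_deadline and QueueStalled are invented here and are not part of Ruby or of the episode's code.

```ruby
# Hypothetical error class for a producer that has waited too long.
class QueueStalled < StandardError; end

# Try a non-blocking push repeatedly; give up loudly once the deadline passes.
def push_with_deadline(queue, item, seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  begin
    queue.push(item, true)               # raises ThreadError if the queue is full
  rescue ThreadError
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise QueueStalled, "no room for #{item} after #{seconds}s"
    end
    sleep 0.01                           # brief pause before trying again
    retry
  end
end

q = SizedQueue.new(1)
push_with_deadline(q, "widget1", 0.2)    # fits straight away
begin
  push_with_deadline(q, "widget2", 0.2)  # nobody is consuming, so this gives up
rescue QueueStalled => e
  puts "producer gave up: #{e.message}"
end
```

The trade-off is a small polling delay in exchange for never interrupting a thread from the outside.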

Now that we have set things up so that our threads don't fail silently, we fix the consumer thread to use a loop. Now when we run the program we can see the producer producing widgets, and the consumer consuming them.

require "thread"
require "timeout"

q = SizedQueue.new(3)

producer = Thread.new do
  i = 0
  loop do
    widget = "widget#{i+=1}"
    puts "pushing #{widget}"
    Timeout.timeout(3) do
      q.enq widget
    end
    sleep 1
  end
end

consumer = Thread.new do  
  loop do
    widget = q.deq
    puts "popping #{widget}"
    sleep 1
  end
end

idler = Thread.new do
  loop do
    sleep 1
  end
end

producer.join
consumer.join

The point of today's episode is that one of the biggest challenges in multithreaded programming is the likelihood of silent failures. If you want to write concurrent programs, it's a good idea to always be looking for ways to make the code fail quickly and loudly if something unanticipated happens.
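One more way to fail loudly is to have something check periodically whether the worker threads are still alive. Here is a minimal sketch of that idea. The helper dead_workers and the hash of named threads are assumptions made for the example.

```ruby
# Given a hash of name => thread, return the names of threads that have stopped.
def dead_workers(workers)
  workers.reject { |_name, thread| thread.alive? }.keys
end

workers = {
  "producer" => Thread.new { sleep },  # sleeps until killed, standing in for a healthy loop
  "consumer" => Thread.new { :done }   # returns at once, like the loop-less consumer
}
sleep 0.01 while workers["consumer"].alive?

dead = dead_workers(workers)
puts "dead threads: #{dead.join(', ')}" unless dead.empty?

workers.each_value(&:kill)             # tidy up the demo threads
```

A check like this could run in the main thread or in a small supervisor thread, raising or logging as soon as a worker disappears.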

OK, that's enough for today. Happy hacking!