Unit 1, Lesson 21

Circuit Breaker

Video transcript & code

When we think about handling errors and failures in software, it's easy to want to consider them individually. If error X happens, what will we do?

But in the real world, failures rarely come alone. And handling them well involves thinking, not just in terms of individual failure modes, but in terms of failure rates.

Let's say we've signed up to the hippest social network on the planet: a social network just for cats. And we've built our own command-line client for the network. So far none of the other members has outed us as a non-feline, and we've built up a respectable following. We can list our followers using the followers command. It turns out cats aren't very creative about coming up with nicknames.

> followers
kitty59
kitty809
kitty367
kitty244
kitty561
kitty564
kitty781
kitty927
kitty586
...

When we want to see what our furry friends are up to, we can type the "timeline" command.

> timeline
kitty59: thpppt fsst
kitty809: purrp HURRRRK mrowr
kitty367: mrowr spsss
kitty244: mew spsss rrrrrrr purrrr
kitty561: fsst ack meep
...

Let's take a look at the code for that. The Nip API is still in a pretty primitive state. In order to see a timeline full of statuses, we have to first fetch a list of our followers. Then, we have to loop through the list and make another request for each cat's latest status.

client = Nip.new
loop do
  print "\n> "
  input = gets.chomp.downcase
  case input
  when /followers/
    ids = client.follower_ids
    ids.each do |id|
      puts id
    end
  when /timeline/
    ids = client.follower_ids
    ids.each do |id|
      status = client.status_for_id(id)
      puts "#{id}: #{status}"
    end
  when /quit/
    exit(0)
  else
    puts "I don't know that command"
  end
end

Unfortunately, the Nip network has proven so popular that they are experiencing some scaling issues. So a lot of the time, when we enter the timeline command, we get an exception and our client crashes.

> timeline
nip.rb:7:in `status_for_id': We are over-capacity, chill out! (RuntimeError)
    from nip.rb:26:in `block (2 levels) in <main>'
    from nip.rb:25:in `each'
    from nip.rb:25:in `block in <main>'
    from nip.rb:14:in `loop'
    from nip.rb:14:in `<main>'

This has been happening so much that we went ahead and updated our client to use a "safe proxy" object to handle these exceptions. We learned about safe proxies in episodes 300-302. Each of the service request methods is wrapped in a method which catches errors, logs them, and returns a benign value instead of propagating the error. The idea here is that the occasional connection hiccup or service outage won't completely kill our timeline command. Instead, the exception will be suppressed and logged, and we'll go on to the next status.

require "delegate"
class SafeNip < DelegateClass(Nip)
  def follower_ids
    super
  rescue => error
    warn "ERROR getting follower IDs: #{error}"
    []
  end

  def status_for_id(id)
    super
  rescue => error
    warn "ERROR getting status for id #{id}: #{error}"
    "<Status Unavailable>"
  end
end

client = SafeNip.new(Nip.new)
# ...

But sometimes service outages go on for a while. And when we check our timeline, it looks like this. This isn't much of an improvement.

> timeline
ERROR getting status for id kitty574: We are over-capacity, chill out!
kitty574: <Status Unavailable>
ERROR getting status for id kitty709: We are over-capacity, chill out!
kitty709: <Status Unavailable>
ERROR getting status for id kitty385: We are over-capacity, chill out!
kitty385: <Status Unavailable>
ERROR getting status for id kitty161: We are over-capacity, chill out!
kitty161: <Status Unavailable>
ERROR getting status for id kitty254: We are over-capacity, chill out!
kitty254: <Status Unavailable>
...

For one thing, it's cluttering up our logs with all the error messages. And for another, the last thing a web service that is over capacity wants is to receive hundreds of hits in a row. We are in danger of having our IP blocked by the site admins if we keep this up.

So what can we do? We want to be able to deal gracefully with a few exceptions, but we need to switch to a different strategy if we are getting lots of exceptions in a row.

To address this problem, we've built a circuit breaker. A circuit breaker is a software component that is modeled on real-world electrical circuit breakers that you might find in your home or workplace. The idea is simple: normally, the circuit breaker is closed and electricity flows through it. But if the circuit experiences an unsafe load, the breaker trips. Once it has tripped, it requires manual intervention to reset.

We started by creating a CircuitBreaker class. In it we made a specialized error class for the case when the breaker has been tripped.

The CircuitBreaker starts out with a few different instance variables. First, there is a @state variable, which starts out :closed. Next there is a threshold, which is the number of errors in a row that will trip the breaker. Finally, there is an error count, which starts at 0.

We have a monitor method in this class, whose job is to monitor some operation for errors. We start out by raising an exception if monitoring is attempted when the breaker is already in the tripped state. Assuming that is not the case, we yield to the given block, and capture the resulting value. If we are able to get to the next line, that means the monitored operation succeeded without raising an exception, so we call a success handler method, then return the result.

If there was an error, but it was just the OpenError we raised at the top of the method, we allow it to continue up the call stack.

Otherwise, if some error cropped up during the yield, we call an error handler method. Then we re-raise the exception.

Pay special attention to this line. It is not a circuit breaker's job to suppress errors, the way the safe proxy does. A circuit breaker's role is different: its job is to monitor errors, and prevent the actions which cause them from being repeated indefinitely.

We also provide a predicate method to ask the circuit breaker if it is in the open state.

Let's now move on to the success and failure handlers. The success handler method is very small. This is a simple circuit breaker design which only trips when we get a series of failures all in a row. So if we get a success, that resets the error count back down to zero.

The error handler is more complex. First it increments the error count. Then it logs that fact. Then it checks to see if the error count has gone over the threshold yet. If so, the circuit trips. It sets the @state to open, and logs the event.

class CircuitBreaker
  class OpenError < StandardError
  end

  def initialize(threshold: 5)
    @state       = :closed
    @threshold   = threshold
    @error_count = 0
  end

  def monitor
    raise OpenError, "Circuit breaker is open" if @state == :open
    result = yield
    handle_success
    result
  rescue OpenError
    raise
  rescue => error
    handle_error(error)
    raise
  end

  def open?
    @state == :open
  end

  private

  def handle_success
    @error_count = 0
  end

  def handle_error(error)
    @error_count += 1
    warn "Failure count is now #{@error_count}"
    if @error_count >= @threshold
      @state = :open
      warn "Circuit breaker has tripped!"
    end
  end
end
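Before wiring the breaker into the client, we can sanity-check the tripping behavior in isolation. This is just a sketch, not part of the episode's client code: it monitors a block that always fails, and confirms that once the threshold is reached, #monitor raises OpenError without running the block at all. The CircuitBreaker class is repeated here so the snippet runs on its own.

```ruby
class CircuitBreaker
  class OpenError < StandardError
  end

  def initialize(threshold: 5)
    @state       = :closed
    @threshold   = threshold
    @error_count = 0
  end

  def monitor
    raise OpenError, "Circuit breaker is open" if @state == :open
    result = yield
    handle_success
    result
  rescue OpenError
    raise
  rescue => error
    handle_error(error)
    raise
  end

  def open?
    @state == :open
  end

  private

  def handle_success
    @error_count = 0
  end

  def handle_error(error)
    @error_count += 1
    warn "Failure count is now #{@error_count}"
    if @error_count >= @threshold
      @state = :open
      warn "Circuit breaker has tripped!"
    end
  end
end

breaker = CircuitBreaker.new(threshold: 3)
calls   = 0

5.times do
  begin
    breaker.monitor { calls += 1; raise "service unavailable" }
  rescue CircuitBreaker::OpenError
    # Breaker has tripped: monitor refuses to run the block at all.
  rescue RuntimeError
    # Underlying failure, re-raised by monitor as usual.
  end
end

# The block ran exactly 3 times; attempts 4 and 5 were short-circuited.
```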

In order to use the circuit breaker with the Nip client class, we introduce a new proxy object. This proxy holds a reference to a circuit breaker. It implements both of the methods on the Nip class, but in each one it surrounds the call with a circuit breaker #monitor block.

class NipCircuitBreaker < DelegateClass(Nip)
  def initialize(target, breaker: CircuitBreaker.new)
    @circuit_breaker = breaker
    super(target)
  end

  def follower_ids
    @circuit_breaker.monitor do
      super
    end
  end

  def status_for_id(id)
    @circuit_breaker.monitor do
      super
    end
  end
end

To tie everything together, we add the NipCircuitBreaker class between the Nip client object and the SafeNip wrapper.

client = SafeNip.new(NipCircuitBreaker.new(Nip.new))

Now when we try to read our timeline, we see something a little different. At the beginning, we see failures because of request errors. But then the circuit breaker trips, and after that we just see circuit breaker failures.

> timeline
Failure count is now 1
ERROR getting status for id kitty225: We are over capacity, chill out!
kitty225: <Status Unavailable>
Failure count is now 2
ERROR getting status for id kitty401: We are over capacity, chill out!
kitty401: <Status Unavailable>
Failure count is now 3
ERROR getting status for id kitty406: We are over capacity, chill out!
kitty406: <Status Unavailable>
Failure count is now 4
ERROR getting status for id kitty152: We are over capacity, chill out!
kitty152: <Status Unavailable>
Failure count is now 5
Circuit breaker has tripped!
ERROR getting status for id kitty921: We are over capacity, chill out!
kitty921: <Status Unavailable>
ERROR getting status for id kitty283: Circuit breaker is open
kitty283: <Status Unavailable>
ERROR getting status for id kitty739: Circuit breaker is open
kitty739: <Status Unavailable>
...

We still have a bunch of log junk, but at least now we're only spamming our own logs. We're not spamming a remote service with tons of requests after it has repeatedly reported failure.

If we wanted, we could also add some special handling for a tripped circuit breaker. If we set up our circuit breaker separately, we can then check its status inside our application code. This gives us a way to short-circuit in the case where the breaker trips.

ids = client.follower_ids
ids.each do |id|
  break if breaker.open?
  status = client.status_for_id(id)
  puts "#{id}: #{status}"
end
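The loop above assumes a `breaker` variable that both the proxy and our application code can see. Here is one way that wiring might look, as a standalone sketch with the assumptions made explicit: StubNip is a hypothetical stand-in for the real Nip client (every status request fails, as if the service were overloaded), and condensed copies of CircuitBreaker, NipCircuitBreaker, and SafeNip are included so the snippet runs on its own.

```ruby
require "delegate"

# Hypothetical stand-in for the real Nip client: statuses always fail.
class StubNip
  attr_reader :status_requests

  def initialize
    @status_requests = 0
  end

  def follower_ids
    (1..10).map { |n| "kitty#{n}" }
  end

  def status_for_id(_id)
    @status_requests += 1
    raise "We are over-capacity, chill out!"
  end
end

# Condensed from the episode: logging omitted, same tripping logic.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5)
    @state       = :closed
    @threshold   = threshold
    @error_count = 0
  end

  def monitor
    raise OpenError, "Circuit breaker is open" if @state == :open
    result = yield
    @error_count = 0 # success resets the consecutive-error count
    result
  rescue OpenError
    raise
  rescue
    @error_count += 1
    @state = :open if @error_count >= @threshold
    raise
  end

  def open?
    @state == :open
  end
end

class NipCircuitBreaker < DelegateClass(StubNip)
  def initialize(target, breaker: CircuitBreaker.new)
    @circuit_breaker = breaker
    super(target)
  end

  def follower_ids
    @circuit_breaker.monitor { super }
  end

  def status_for_id(id)
    @circuit_breaker.monitor { super }
  end
end

# Condensed SafeNip: just the status handler we need for this demo.
class SafeNip < DelegateClass(StubNip)
  def status_for_id(id)
    super
  rescue
    "<Status Unavailable>"
  end
end

# Create the breaker ourselves, so application code can query it.
nip     = StubNip.new
breaker = CircuitBreaker.new(threshold: 5)
client  = SafeNip.new(NipCircuitBreaker.new(nip, breaker: breaker))

shown = []
client.follower_ids.each do |id|
  break if breaker.open?
  shown << "#{id}: #{client.status_for_id(id)}"
end
```

After five consecutive failures the breaker opens, the loop bails out, and the remaining followers never generate a request against the struggling service.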

Now when we ask for our timeline, the loop quits once the breaker trips. And if we ask again, it doesn't bother trying at all.

> timeline
Failure count is now 1
ERROR getting status for id kitty720: We are over capacity, chill out!
kitty720: <Status Unavailable>
Failure count is now 2
ERROR getting status for id kitty243: We are over capacity, chill out!
kitty243: <Status Unavailable>
Failure count is now 3
ERROR getting status for id kitty757: We are over capacity, chill out!
kitty757: <Status Unavailable>
Failure count is now 4
ERROR getting status for id kitty932: We are over capacity, chill out!
kitty932: <Status Unavailable>
Failure count is now 5
Circuit breaker has tripped!
ERROR getting status for id kitty131: We are over capacity, chill out!
kitty131: <Status Unavailable>

> timeline
ERROR getting follower IDs: Circuit breaker is open

What we've looked at today is the most basic, primitive form of circuit breaker. One thing we haven't talked about at all is how we reset the breaker. Right now, the only way to do it is to restart the whole program. In the future, we might take a look at some more advanced variations on the circuit breaker theme.
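As a taste of what a reset might involve, the simplest possible variation mirrors flipping a real breaker back on at the panel. This is a hypothetical addition, not part of the class shown earlier; it reopens the class and restores the initial state.

```ruby
class CircuitBreaker
  # Hypothetical manual reset: close the breaker and clear the error
  # count, like flipping a tripped breaker back on at the panel.
  def reset
    @state       = :closed
    @error_count = 0
  end
end
```

More sophisticated designs trip to a "half-open" state after a timeout and let a single trial request through, but that's a topic for another day.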

But what we've seen so far should be enough to get you started exploring the possibility of circuit breakers in your own applications.

Oh, and by the way, if you want to learn more about circuit breakers, check out Michael Nygard's excellent book, Release It!

Happy hacking!
