Video transcript & code
The other day the public RubyTapas website went down. Which meant I had to go and fix it, because unlike the third party service that powers subscriptions, the public site is all mine.
Naturally, the first place I turned was the site logs. I have all the site logs sent to a service that specializes in collecting and analyzing logs.
The stack traces from crashes were obvious enough. I could see from the traces that something was going wrong with the process that grabs video thumbnails from an external service.
As I looked at these logs, I kept seeing something else: a single older episode that kept having its video information reloaded over and over again. So I spent a long time fiddling around in a server console, trying to figure out what was wrong with that episode.
I added a bunch of extra debug logging, and updated the site. I was able to see a lot more about what was happening, but I still couldn't work out why that episode was causing trouble.
Finally, I decided to look at the logs from beginning to end. And then I finally discovered the problem: the very latest episode had an abnormal thumbnail URL which was causing an exception. When I went to look at the video hosting site, sure enough: something had gone wrong in the uploading or processing of that video, and there was no video and no thumbnail.
And the episode that kept loading over and over? That was an unrelated, relatively benign bug, that I'd either failed to notice earlier, or forgotten about.
After I get to the root of a failure, I always like to take a step back and think about what I could have done differently to find the problem faster. In this case, I realized that I'd failed to apply one of the fundamental rules of debugging: always start by asking "what was the last thing to change?"
Often, when we think about the last thing to change, we think only of code changes. But changes can come in many forms: not only can they come in the form of source code updates, they can come from the database, from outside data feeds, or from updated dependencies. In this case, it was an update to the data coming in from an external site: a new episode that had malformed attributes.
The lesson I took from this experience is that while logs are indispensable, I'm too easily sidetracked by anomalies in that data. Just because something looks wrong doesn't mean it's related to the bug I'm currently tracking down. I need to remember that nine times out of ten, the last thing to change is the source of the failure.
And that's it for today. Happy hacking!