I’m afraid that this post is a little longwinded, and I take a long time to arrive at my point. I hope you will stick with me, as I think it is a valuable insight. It just takes some time to get there.
There are two ways to deal with bugs. The first way is to write fewer bugs. The second way is to get very good at debugging. Both strategies are important.
Let me tell you a story about Tim. Tim is the best and brightest developer at BigCo. He’s the evangelist for unit testing, integration testing, and distributed source control for his team. Every change that is checked in is automatically tested, and every bug that slips through creates a new test to keep the software bug-free. Tim is methodical about his work, and usually attaches a proof of correctness in the commit message. Tim is writing a book on the O-notations (“Theta-notations”, Tim helpfully interjects) of combinatorial algorithms.
Meanwhile, Fred is another developer in a different department at BigCo. Fred passes the FizzBuzz test, but he’s nowhere near as good as Tim. Fred just started using CVS instead of the old zip-file version control system. Overall his team produces decent code, but they do experience a 10% regression rate for bugs.
As part of some sinister BigCo plot, half the BigCo developers are let go, and Tim and Fred end up assigned to the same team in the reorganization, this time doing twice the work that they did formerly. Because budgets are tight, management doesn’t have money for such luxuries as “bug trackers”, as it would put an undue burden on the golf budget. Every bug is a free-for-all, and Tim and Fred often end up working on the same bug by mistake. Oops.
An e-mail report comes in, and both Tim and Fred are on the case. Tim checks out the CVS repository (yuck) and realizes that the code is a mess. There are no standards, no documentation, no unit or integration tests, and nothing that would indicate where the problem lies. He throws up his hands. “The problem could be anywhere!” Tim clears out his calendar for the next two weeks to write unit tests and set up a build server.
Meanwhile, Fred has reproduced the problem in five minutes, has traced it down to a module in ten minutes, and has traced it down to a class in another six minutes. After another ten minutes, he’s got it down to the line that is causing the crash, and after another ten minutes he’s built an effective mental model of the problem and is developing a fix. He quickly scrolls through blame to see if his fix will cause any known regression. Total time: under an hour.
The point of this story is not to say “Fred is the better developer.” He clearly isn’t. But he’s definitely the better debugger. Tim is a terrible debugger simply because his entire strategy is focused around not writing bugs, instead of what to do when bugs happen. Fred’s a fantastic debugger because he writes so many bugs that he’s got the process down to a science. You need both types: the guy who does prevention for the long-term health of the codebase, and the guy who puts out fires when the system’s down at 3am. Whatever issue Fred just introduced will be caught by Tim’s unit test, but when it’s caught, it will be Fred, not Tim, who will actually figure out how to make the test pass. It takes both kinds.
The other moral of the story is that the best and brightest developers are typically bad debuggers. They have less experience fixing bugs because they have less experience writing bugs.
This is why I think it is important, not only to emphasize good long-term strategies like unit testing and integration testing, but also to emphasize good short-term strategies like triangulating and fixing bugs.
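To make “triangulating” a little more concrete, here is a minimal sketch of one such short-term strategy: bisecting an ordered history to find where a bug first appears. It is only an illustration. The names `first_bad` and `crashes`, and the scenario of 100 revisions with the bug introduced at revision 37, are made up for the example, and the approach assumes that once the bug appears it stays in every later revision.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(versions: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad version.

    Assumes versions are ordered oldest to newest, that every version
    after the first bad one is also bad, and that the last version is bad.
    """
    if not versions:
        raise ValueError("need at least one version to bisect")
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid        # the bug is at mid or earlier
        else:
            lo = mid + 1    # the bug was introduced after mid
    return lo


# Hypothetical example: 100 revisions, bug introduced in revision 37.
checks = []


def crashes(revision: int) -> bool:
    checks.append(revision)  # record each revision we had to try
    return revision >= 37


culprit = first_bad(range(100), crashes)
```

Each check halves the remaining search space, so 100 revisions need at most 7 reproductions instead of up to 100. The same halving idea applies to narrowing a crash down to a module, then a class, then a line.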
When I was in Digital Electronics back in school, I was bored to tears. I was actually really bad at digital. I blew up expensive equipment in labs. I had a poor mental model of electricity in general, and still do. The EE majors were drawing circles around me with timing diagrams and problem sets. The way I got good at programming was by building lots of software. So by extension, to get good at digital, I decided to build a hardware implementation of Pong. Here it is:
This project was completely absurd. It’s a plastic bin with roughly 1 cubic meter of capacity, with breadboards strapped to all six sides. Here’s a closer pic of one of those boards:
By any measure, this is a completely ridiculous design. I had no idea what I was doing, I was just making it up as I went along. There was not so much as a lick of documentation.
But working on the world’s worst Pong implementation taught me to get really, really good at hardware debugging. Every time I got out the box, one of the thousands of wires had come loose, and I would have to work out what was wrong within minutes to keep from drowning in bugs. As I added more and more boards, I had to get exponentially faster at fixing problems. Soon I had hundreds of spec sheets memorized, because referring to pin layouts was too slow. I was inventing error indicators on boards; a serial debugging interface, so I could study the state of the system in software; an interactive clock, so I could manually step through “instructions” as they were passed around to different phases of the pipeline; hardware error codes; my own stress-tolerant connectors; and much more.
The final test came when I moved the device across campus to show the professor. Of course, wires came loose in the move. In under 10 minutes and completely unassisted by even as much as a multimeter, I was able to debug the issues and get Pong running again under pressure. The professor was shocked, not only that I could get something like this working in the first place, but that I was able to fix it.
One last bug remained. Although the system was rated for 5 volts, in reality it only functioned between about 4.85 and 4.95 volts. I worked and worked, but I was unable to diagnose the problem beyond a particular board. I enlisted the EE majors, but they were befuddled. As were the lab assistants. As were the electrical engineering professors. Finally, the bug escalated all the way to one of the chip manufacturers, where it had gone undiscovered for several years. Where it was fixed. At the transistor level.
The moral of this story is that I went from the bottom of that class to the top. Not because I was good at digital. Because I was terrible at it. I was the only one stupid enough to build a totally ridiculous system full of bugs. You should never, ever, have me design a circuit for you. But if you have a problem with an existing circuit? I’m your guy. By being really bad at digital, I got insanely good at debugging.
Let me tell you a story about spoiled food.
Last week I had some cheese grow mold. “Must have been some bad cheese,” I said to myself, and added it to my grocery list.
Last night I had some hotdogs grow mold. I almost didn’t connect the two events, but at some point my debugging instinct kicked in. “Does food really spoil this quickly?” I wondered. So I started googling. Turns out, no. No it doesn’t. Bug filed: food spoils. Reproduction steps: put in fridge.
Hmm, the fridge. Now that you mention it, I noticed a few weeks ago that some places on the outside of the fridge get warm. Very warm. Melt chocolate into a puddle warm. I figured it was just radiating the excess heat. That heat has to go somewhere, right? But now that I think about it, a fridge shouldn’t get super warm on the outside. Mental model: heat buildup somewhere in the fridge.
However, an interesting piece of information: the freezer seemed to be working OK, as everything was frozen. A useful piece of data that we will return to.
At this point, most people would escalate to maintenance. However, it’s a Sunday afternoon, and I have food in the fridge. Therefore, fixing the fridge on my own is the difference between saving and losing $150 in groceries.
Now I’m sure that fridges can have excess heat for any number of reasons. I don’t even know the first thing about fridges, or about how they work. But I do remember once my girlfriend’s cat got stuck behind her fridge, where there was a warm, dusty set of coils, running all along the back of the fridge.
This tells me, first, that these coils are part of the heat dissipation system for the fridge, and second, that they can become dusty, presumably limiting their radiation power. Theory: my fridge coils are dusty.
So I pull out the fridge. Unfortunately, the back of my fridge is completely smooth and doesn’t have any coils. This is a setback. Maybe some fridges don’t have coils? Or perhaps they’ve placed the coils inside the refrigerator somehow?
Thinking it through logically, the heat from the fridge has to dissipate, whether it’s via coils or via some other mechanism. Since I can visually inspect the front, back, top, and sides of the fridge, the heat has to dissipate through the bottom. Hmm…
Well, whatever’s under my fridge, there is a LOT of dust. I say this as someone who has pulled decades of pet hair out of computer cases in middle school as a computer repair guy. As I bring the vacuum hose around to the back of the fridge bottom, I notice that I am pulling dust out of rows of metal plates, spaced approximately an inch apart. Although not coils, it’s possible that this is a radiation mechanism, a theory which is confirmed when the plates are very warm to the touch. But the exterior-facing surface area of these metal plates is less than ten square inches, which is surely not comparable to the large area of the coils on the back of my girlfriend’s fridge. So what’s going on here?
As I am pulling the dust off, I uncover a small fan. With a little vacuuming, it whirs to life, blowing air through the spiraled plates. “Aha!”, I say, “So the dust has been blocking this fan, which is preventing air from cooling the plates, causing a heat buildup.”
This seemed like a reasonable solution to the problem. But I was troubled by the fact that my freezer seemed unaffected by the inoperability of this fan. I searched for a second, freezer-dedicated heat radiation mechanism, but I didn’t find one. It puzzled me that a heat buildup would affect the refrigerated portion only.
So I started to think about how the fridge and freezer were actually cooled. I had developed a naive mental model: if you turn the “refrigerator” dial down to cold, the refrigerator gets cold, and if you turn the “freezer” dial down to cold, the freezer gets cold.
It’s entirely reasonable, when you look at two identical thermostats, to assume that they control two separate, identical systems. However, now that I had studied the bottom of the fridge, I knew that the fridge and freezer shared, at minimum, the same heat radiation mechanism. Is it possible that my “two thermostats” theory could be a leaky abstraction?
I started playing with the dials. I noticed that if the refrigerator thermostat was set to “colder”, the fan would kick on, and if it was set to “warmer”, the fan would kick off. But no matter what I did to the freezer’s thermostat setting, it did nothing to the fan. So what did it do?
After playing with it for a few minutes, it seemed to affect a tiny stream of air coming out of a hole inside the refrigerator. Set the freezer to “warmer”, we get cold air blowing into the refrigerator, set the freezer to “colder”, we get no air blowing into the refrigerator. So the “refrigerator” thermostat appears to control the cooling power of the fridge overall, and the “freezer” knob controls how that shared cooling power is divided between the two compartments. Under the old mental model, if I wanted things cold, I set both thermostats to cold. Under the new mental model, I was in effect telling my fridge to dedicate all of its capacity to the freezer compartment and very little to the refrigerator compartment. If the cooling element was already very overtaxed due to the lack of a fan, and I had allocated 90% of the cooling capacity to the freezer, it’s very possible to keep a nice freeze while the fresh foods get uncomfortably warm. Now I had a complete picture for how the refrigerator failed, and so I set the freezer control on the warm side to dedicate more of the cooling capacity to the refrigerated compartment for a rapid cooldown.
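For the programmers in the audience, the revised mental model can be written down as a toy function. This is purely a sketch of my theory, not of how any real fridge is built: the name `compartment_cooling`, the idea of a single `freezer_share` fraction, and every number below are invented for illustration.

```python
def compartment_cooling(total_capacity: float, freezer_share: float) -> tuple[float, float]:
    """Split one shared cooling capacity between freezer and fridge.

    total_capacity: what the "refrigerator" dial and the fan allow overall
                    (arbitrary units, assumed for illustration).
    freezer_share:  fraction sent to the freezer by the "freezer" knob.
    Returns (freezer_cooling, fridge_cooling).
    """
    if not 0.0 <= freezer_share <= 1.0:
        raise ValueError("freezer_share must be between 0 and 1")
    freezer = total_capacity * freezer_share
    fridge = total_capacity - freezer
    return freezer, fridge


# Illustrative numbers only: a healthy fridge vs. one with a blocked fan.
healthy = compartment_cooling(total_capacity=100.0, freezer_share=0.9)
blocked = compartment_cooling(total_capacity=40.0, freezer_share=0.9)
rebalanced = compartment_cooling(total_capacity=40.0, freezer_share=0.5)
```

With the fan blocked and both dials on “cold”, the fridge compartment is left with a small slice of an already reduced total, which matches a solid freezer sitting above spoiled cheese. Turning the freezer knob toward “warmer” hands a larger slice back to the fridge compartment.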
It has taken me a lot longer to write down my thought process than it did to actually debug the fridge, which was under 25 minutes. In the time it has taken me to write this blog post, the fridge has cooled off considerably, so the groceries are saved, and we have confirmation that I solved the problem.
The moral of the story is that if you’re a good debugger, you’re a superhero! You too can go from zero to refrigerator repair man, in 25 minutes. Your skills are directly transferable to appliances, cars, mechanical systems, and more. If I were better at circuit design, or if I consistently wrote better code, I would have fewer bugs, which is (according to all the usual authoritative sources) a great outcome. But I would be much worse at debugging, and I would be waiting angrily for the repair guy to show up to fix the fridge, eating out for a few days, and running all over town for groceries, a decidedly not-so-great outcome.
I am not at all opposed to new and awesome ways to write better code, or focusing on new code quality strategies. In fact, I am knee-deep in design for an automated unit testing framework as we speak. It’s a problem of emphasis–as a community we talk all day long about integration testing, and never about debugging. Maybe somewhere between the Agile, TDD, CI, and source control conferences, we could have a tiny little workshop about debugging? Maybe we even carve out a class in undergrad talking about debugging strategies, problem bisection, and so on? That’s all I’m saying. Your groceries will thank you.