This is a write-up based on a talk on debugging that I recently gave at DevPulseCon.
In this post I will go over the basics of debugging. In Part II of this post, we will see how we applied this process to debugging a docker-machine issue we recently encountered.
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
— Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.
Debugging may seem like a black art. However, most people and teams that are great at solving problems employ similar strategies. Today we will discuss some of these strategies and look at a framework for debugging hard problems.
Define the Problem
The first and most important step before embarking on a problem-solving session is to define the problem as precisely as possible. Answer who, what, where, when, how and why for the situation to pin down the exact problem. Keep asking “What exactly is going wrong, and why do you think it is a problem?” until you reach a question you can’t answer. Sometimes this still leaves a very large problem area to tackle. The next steps help you find out more about the system, after which you can come back to this step and repeat the process.
Understand the System
To define abnormal behavior, you need to understand how the system behaves under normal conditions. To understand the data flow in a complex system, you need to understand its deployment topology. You also need to know how to collect additional data and how to use tools that probe the system, such as debuggers, log collectors and hardware analyzers. The harder the problem is to locate, the wider the net you need to cast, and often “you” does not have to be, and cannot be, a single person. It can be a team of people who know their subsystems, tools and interfaces extremely well and who can work together to paint a complete picture.
Dmitry Vostokov takes this to an extreme and provides you dumps of machine code from a working Windows system that you can compare with a problematic dump to see what is wrong. Whether it is machine code, source code, documentation, configuration files or tools, understanding the system and how the system behaves normally is critical to finding what went wrong and why.
“One thing is sure. We have to do something. We have to do the best we know how at the moment . . . ; If it doesn’t turn out right, we can modify it as we go along.”
— Franklin D. Roosevelt
You have defined the problem to the best of the available information, but it is not clear what is going wrong. This is the point to start looking deeper and broader into the system. Collect additional data around the problem. If you see anything deviating from what is expected, start collecting more information.
Catch the Failure
Reproducing the problem, or catching it when it happens, is critical not only to the problem-solving process but also to confirming that you have actually resolved the problem. If you can’t catch a problem, there is no way to know whether you have solved it.
You can start with the problem definition and the data collected to identify which software/hardware combinations and which conditions are best for recreating the problem. The faster and more exactly you can recreate a problem, the faster you can debug it and find and test a solution. One gotcha is that, while trying to recreate a failure, you may end up creating a totally different failure that sidetracks the debugging process. It is therefore important to change as little as possible in any given experiment and to meticulously track everything you change.
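As a minimal sketch of what "catching the failure" can look like in practice, the snippet below reruns a command several times and counts how often it fails. The helper name `run_repro`, the `ReproResult` type and the stand-in command are all hypothetical, and it assumes that a non-zero exit code or a timeout is a reasonable definition of "failure" for your problem.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ReproResult:
    attempts: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def run_repro(command: list[str], attempts: int = 20, timeout_s: float = 30.0) -> ReproResult:
    """Run `command` repeatedly and count how often it fails.

    Assumption: a non-zero exit code or a timeout counts as a failure.
    """
    failures = 0
    for _ in range(attempts):
        try:
            completed = subprocess.run(command, capture_output=True, timeout=timeout_s)
            if completed.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            failures += 1
    return ReproResult(attempts=attempts, failures=failures)


if __name__ == "__main__":
    # Hypothetical stand-in for a failing step: a tiny script that always exits with 1.
    failing_step = [sys.executable, "-c", "import sys; sys.exit(1)"]
    result = run_repro(failing_step, attempts=3)
    print(f"{result.failures}/{result.attempts} attempts failed ({result.failure_rate:.0%})")
```

Recording a failure rate, not just a single pass or fail, gives you a baseline to compare against once you believe you have a fix, which matters most for intermittent problems.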
Identify Potential Causes
The Kepner-Tregoe problem-solving methodology makes this a surgical science. The most important question to ask here is what is different (from a working case or a comparable working system) or what changed (from when the system was working). In production systems, hundreds of variables may have changed before you start noticing a problem, so it is important to look beyond just the last change. Once you know the exact differences, it is time to brainstorm. This step is best done collaboratively, bringing together expert knowledge from multiple perspectives with the problem and data in hand. This is the moment when Sherlock Holmes talks to Watson, or Hercule Poirot talks to Hastings, and gets the spark that reveals the potential cause.
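The "what is different?" question can often be partly automated. The sketch below is an illustration only, not part of the Kepner-Tregoe method itself: it assumes you have already captured the settings of a working and a failing system as flat dictionaries, and the function name `diff_settings` and the sample keys and values are hypothetical.

```python
def diff_settings(working: dict, failing: dict) -> dict:
    """Return {key: (working_value, failing_value)} for every key that differs.

    A key present on only one side is reported with None on the other side.
    """
    differences = {}
    for key in sorted(set(working) | set(failing)):
        working_value = working.get(key)
        failing_value = failing.get(key)
        if working_value != failing_value:
            differences[key] = (working_value, failing_value)
    return differences


if __name__ == "__main__":
    # Hypothetical snapshots of two otherwise comparable hosts.
    working_host = {"app_version": "1.4.2", "pool_size": 10, "tls": True}
    failing_host = {"app_version": "1.4.2", "pool_size": 50, "proxy": "enabled", "tls": True}
    for key, (good, bad) in diff_settings(working_host, failing_host).items():
        print(f"{key}: working={good!r} failing={bad!r}")
```

The output is a list of candidate differences, not a diagnosis. It is the raw material for the brainstorming step.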
Divide and Conquer
If the previous step gives you a single potential cause, lucky you! If not, now is the time to group the potential causes and subsystems to narrow down the probable causes. Replacing a sub-component with a known working alternative, testing with a different version of software or hardware, and checking a previous configuration all help narrow the scope and the list of potential causes. Always track every experiment and its effects. It is very easy to lose track of what was tried and what the results were, and unless you leave a trail of breadcrumbs you will find yourself running around in circles.
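When the candidates form an ordered sequence, such as successive software versions or configuration revisions, divide and conquer becomes a binary search. This is the same idea behind `git bisect`. The sketch below makes two loud assumptions: the last entry in the list is known to fail, and once a change is bad every later one is bad too. The function name `find_first_bad` and the version labels are hypothetical. It also returns a log of every experiment as the trail of breadcrumbs.

```python
from typing import Callable, Sequence


def find_first_bad(
    changes: Sequence[str], is_bad: Callable[[str], bool]
) -> tuple[str, list[tuple[str, bool]]]:
    """Binary-search an ordered list of changes for the first failing one.

    Assumptions: the last change is known to be bad, and badness is
    monotonic (every change after the first bad one is also bad).
    Returns the first bad change and a log of (change, was_bad) experiments.
    """
    if not changes:
        raise ValueError("need at least one change to search")
    experiment_log: list[tuple[str, bool]] = []
    low, high = 0, len(changes) - 1  # invariant: first bad index is in [low, high]
    while low < high:
        mid = (low + high) // 2
        bad = is_bad(changes[mid])
        experiment_log.append((changes[mid], bad))
        if bad:
            high = mid  # first bad change is at mid or earlier
        else:
            low = mid + 1  # first bad change is after mid
    return changes[low], experiment_log


if __name__ == "__main__":
    # Hypothetical history: versions v1..v8, where the failure first appears in v6.
    versions = [f"v{n}" for n in range(1, 9)]
    first_bad, log = find_first_bad(versions, lambda v: int(v[1:]) >= 6)
    print("first bad:", first_bad)
    for version, bad in log:
        print(f"  tested {version}: {'FAIL' if bad else 'ok'}")
```

In a real session `is_bad` would deploy or check out the given change and run your reproduction. Eight candidates need only three experiments here, which is why it pays to order the candidates before testing them.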
Fix, Test and Test
You now understand the system and how it works, you know how to catch the failure, and you have found the one true cause. Now it is time to create a fix (or workaround) and test that it really solves the problem. Make sure you didn’t inadvertently change something else that merely hides the failure. Keep the test system as close as possible to the one in which you recreated the failure.
Following the above steps can make your problem-solving sessions seem slow. When production is down, you want to do something fast!
“Don’t Panic.”
— The Hitchhiker’s Guide to the Galaxy
The tendency to panic and do “something” in a hurry can:
- Cause a new problem
- Erase evidence required to solve the original problem
- Hide the problem until much later when it has more severe consequences
- Cause mistrust of the software system and of the competence of the people and processes
Don’t panic! Follow a systematic debugging process. Start by defining the problem precisely and you will be well on your way to finding a good solution.