OpenVOS Blog

For an Always-On World

Fix the algorithm before you fix the code

7.29.2013AvailabilityBy: When you are fixing bugs in existing code, it is easy to fall into the habit of finding the “mistake” in the code and adjusting the code to avoid the mistake. But if the root cause of the problem was a design error, or a problem with the algorithm, then just adding another flag or goto statement in the code is the wrong approach. Not only have you left the the root cause intact, but now you have made the code harder to understand for the next software engineer who has to figure it out.

Train yourself to ask “how would I have solved this problem if I was the original coder?” Then look at the difference between the “clean sheet of paper” approach and the one that is actually there in front of you. But be pragmatic. I’m certainly not suggesting that you should reimplement a function because you and the author have a different approach. Take this step as a thought experiment. You have a test case that fails. The author missed this case. What led her to miss this case? If the author had known about this case, how would her approach have been different? These types of questions are invaluable when digging into code that basically works, but still has some bugs.

In our case, instead of adding another parameter to an internal function, and adding another automatic variable that had to be correctly initialized at 3 different entrypoints to the main procedure, and adding code to test this new variable at a critical point, I realized we could repair the algorithm by changing the order in which the sub-steps were performed. Once we did that, the need for additional variables and special cases disappeared. The final code is actually much easier to understand than the original code, to say nothing of the first attempt at a repair. This gives me confidence that we are on the right track. The test case that reproduces the problem proved that both the original (overly complex) fix, and the revised (simplified) fix solve the problem. Now we just need to be certain that we didn’t break anything else.

Since this fix is headed to a field release (i.e., one that is already installed at customer locations), we will also be running three types of intensive regression tests. Two of them will check for continued adherence to the POSIX standard (the bug fix is in code that implements a POSIX function), and the third type of test is a general system-level regression tests.

Fix the algorithm first. Once the algorithm is correct, deal with the code.

Leave a Comment

Leave a Reply

Share