You Can (Somewhat) Reliably Measure Change Failure Rate
In the State of the DORA DevOps Metrics in 2022 article, Logan claims that change failure rate cannot be measured consistently across an organisation. I disagree. (The article is otherwise excellent, and I recommend reading it. In particular, the discovery around the uselessness of mean time to recovery was important to me, in that sense of “oh man, this is obvious and I should really have realised it sooner, on my own, instead of blindly trusting an authority on it.”)
Logan mentions three reasons:
- Changes that resulted in failures need to be manually entered into the change failure rate reporting, which is an opportunity for mistakes.
- If a service is only slightly degraded, the judgment of whether a failure has occurred is subjective.
- Not all failures are strictly associated with a change, and any change–failure association will miss these failures.
Perhaps surprisingly, I agree with all three of these objections. There is a trick, however, that I’ve used successfully in the past to avoid these problems: instead of directly measuring the fraction of changes that cause failures, measure the fraction of changes that fix failures (there’s a small sketch of this after the list below). Here’s why that works:
- Most people agree on which changes are made to fix something that didn’t work the way it was supposed to. In some ticket tracking systems, there’s even a label specifically for bug fixes and similar, meaning this information can be extracted automatically.
- Fixes are generally made only for failures that are severe enough to warrant a fix. In other words, “when is a problem bad enough to count as a failure” is operationally defined exactly at the point where it should be: when you have to go back and fix it.
- This measurement obviously also captures fixes made for failures that were not directly associated with another change.
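As a rough sketch of what this can look like in practice, here is one way to compute the change fix rate from a list of changes exported from a ticket tracker or version control system. The `Change` type, its field names, and the `bugfix` label are assumptions made for illustration; the real extraction depends entirely on how your tracker marks fixes.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One shipped change, as exported from a hypothetical tracker."""
    id: str
    labels: set[str] = field(default_factory=set)


def change_fix_rate(changes: list[Change], fix_label: str = "bugfix") -> float:
    """Fraction of changes whose purpose was to fix a failure.

    This stands in for the change failure rate, under the assumption
    that failures worth counting are the ones that eventually get fixed.
    """
    if not changes:
        return 0.0
    fixes = sum(1 for change in changes if fix_label in change.labels)
    return fixes / len(changes)


# Example: 2 fixes out of 5 changes gives a change fix rate of 0.4.
changes = [
    Change("CH-1", {"feature"}),
    Change("CH-2", {"bugfix"}),
    Change("CH-3", {"feature"}),
    Change("CH-4", {"bugfix", "hotfix"}),
    Change("CH-5", {"refactoring"}),
]
print(change_fix_rate(changes))  # 0.4
```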
Note that a lot of the benefit of measuring the change failure rate through the change fix rate hinges on the second point in the list above: it assumes that only failures that are eventually fixed are failures worth counting. In order for the change fix rate to be meaningful, there needs to be a strong correlation between failures that need fixing and failures that are fixed. If this is the case, then the change failure rate and the change fix rate will be fairly close to each other. In many (more or less dysfunctional) teams, that correlation might be low. It’s easy to accidentally fix things that don’t need fixing, and let things that need fixing go unfixed. In the teams where I have used the change fix rate, I have had reason to believe there is a good process for determining what needs to be fixed and fixing it.
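To make the approximation concrete, here is a minimal sketch with made-up numbers. It shows why the two rates land close together when each counted failure is met by roughly one fixing change, and how they drift apart when that correlation breaks, as described above.

```python
# Made-up numbers, purely for illustration.
total_changes = 200     # all changes shipped in the period
failing_changes = 20    # changes that caused a failure worth counting
fixing_changes = 20     # roughly one fixing change per counted failure

change_failure_rate = failing_changes / total_changes  # 0.10
change_fix_rate = fixing_changes / total_changes       # 0.10

# If the correlation breaks -- say 8 counted failures never get fixed,
# and 5 "fixes" address things that were never really failures -- the
# proxy drifts away from the rate it is meant to approximate:
drifted_fix_rate = (fixing_changes - 8 + 5) / total_changes  # 0.085
print(change_failure_rate, change_fix_rate, drifted_fix_rate)
```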
Additionally, I would argue that unless the team has a strong process for figuring out which failures need fixing, and then actually fixing them, the team has bigger problems than tracking the change failure rate. In other words, I’m making the somewhat circular argument that measuring the change failure rate through the change fix rate is not universally possible, but it works precisely for the teams that are sophisticated enough to benefit from tracking the change failure rate in the first place.
How neatly things sometimes work out!