Hazardous States and Accidents
I have long wanted to write about how root cause analysis is a crap technique for learning from failure. In order to do that, we need to know some fundamentals first. These are good to know for anyone designing anything they want to be reliable.
A hazard is an accident waiting to happen
In safety-critical systems, we distinguish between accidents (actual loss, e.g. lives, equipment, etc.) and hazardous states (sometimes simply called “hazards”). If we say that \(H\) stands for hazardous state, \(E\) for environmental conditions, and \(A\) for accident, then the relation is
\[H \land E \Leftrightarrow A\]
This says that an accident occurs exactly when the system is in a hazardous state and the environmental conditions are unfavourable. As a consequence,
- If a system sits in a hazardous state, it can be driven into an accident by bad environmental conditions.
- But conversely, the system can sit in a hazardous state for a long time without accident if the environmental conditions are good enough.
Since we can only control the system and not its environment, we achieve safety by avoiding hazardous states. If we try to prevent accidents while not paying attention to hazardous states, we are effectively placing our trust in the environment being on our side. Many people do this, and it can be successful for quite some time, but it always fails at some point.
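To make the relation concrete, here is a minimal sketch in Python (the booleans and the 1-in-200 figure are made up purely for illustration). It encodes the accident relation above as a function, and then shows what trusting the environment looks like: a system parked in a hazardous state gets away with it for as long as conditions happen to be good, and no longer.

```python
import random

def accident(hazardous_state: bool, bad_environment: bool) -> bool:
    # An accident requires both: the system is in a hazardous state
    # AND the environmental conditions are unfavourable.
    return hazardous_state and bad_environment

# Toy simulation: a system that simply sits in a hazardous state and
# trusts the environment. Bad conditions are rare (here 1 in 200 days,
# an arbitrary illustrative number) but they do eventually show up.
random.seed(1)
days_without_accident = 0
while not accident(hazardous_state=True, bad_environment=random.random() < 1 / 200):
    days_without_accident += 1

print(f"Got away with it for {days_without_accident} days, then had an accident.")
```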
Example from aviation
There was recently a commercial flight that made the news because it landed with less than 30 minutes of fuel in its tanks. Many people wondered why this was a big deal, because it sounds like the system was working as intended: there was a reserve, it was needed, and it was used. End of story?
The thing to realise is that landing with less than 30 minutes of fuel is a hazardous state for commercial jets. If a jet lands with less than 30 minutes of fuel, then it would only have taken bad environmental conditions to make it crash rather than land. Thus we design commercial aviation so that jets always have at least 30 minutes of fuel remaining when landing. If they don’t, that’s a big deal. They’ve entered a hazardous state, and we never want to see that.
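As a toy sketch of what it means to monitor the constraint rather than the outcome, the check below flags a landing as hazardous whenever the reserve was eaten into, regardless of whether anything bad actually happened on that flight. The 30-minute threshold is from the example above; the function name and sample values are made up.

```python
MIN_RESERVE_MINUTES = 30  # safety constraint: land with at least this much fuel left

def landed_in_hazardous_state(fuel_remaining_minutes: float) -> bool:
    # Judge the landing against the constraint, not against the outcome.
    # A flight that lands with 12 minutes of fuel is a safety problem even
    # though nothing bad happened on that particular day.
    return fuel_remaining_minutes < MIN_RESERVE_MINUTES

print(landed_in_hazardous_state(45.0))  # False: constraint maintained
print(landed_in_hazardous_state(12.0))  # True: hazardous state, worth investigating
```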
Example from child’s play
One of my children loves playing around cliffs and rocks. Initially he was very keen on promising me that he wouldn’t fall down. I explained the difference between accidents and hazardous states to him (in children’s terms), and he slowly realised that he cannot control whether or not he has an accident, so it’s a bad idea to promise me that he won’t have one.
What he can control is whether or not bad environmental conditions lead to an accident, and he does that by keeping out of hazardous states. In this case, the hazardous state would be standing less than a child-height from a ledge when there is nobody below ready to catch him. He can promise me to avoid that, and that satisfies me a lot more than a promise not to fall.
Maintaining constraints is a dynamic control problem
Hazardous states, as we have seen, are defined by constraints. To stay out of hazardous states, we have the system maintain such safety constraints. The environment, though, keeps trying to tip the system into breaking these constraints, and it often does so in unpredictable ways. This means we cannot declare in advance a sequence of steps the system should follow that will always maintain the constraints.
Instead, maintaining constraints is a dynamic control problem. There are multiple controllers interacting with the system to try to keep it out of hazardous states. They observe feedback, i.e. information on where the system is now; they execute mental models, i.e. run simulations of where the system is going in the future; and then they issue control actions, i.e. try to adjust the system to maintain constraints based on their predictions.
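As a rough sketch of such a control loop, consider the fuel example again. This is not how any real flight management system works; it is only meant to show the three ingredients side by side: feedback (what the controller observes now), a mental model (a projection of where the system is heading), and a control action (an adjustment chosen from that projection). All names and numbers are invented.

```python
MIN_RESERVE_MINUTES = 30  # the safety constraint to maintain

def predicted_fuel_at_landing(fuel_now_min: float, minutes_to_airport: float) -> float:
    # Mental model: project the current state into the future.
    # (Trivially so, since fuel here is measured in minutes of flying time.)
    return fuel_now_min - minutes_to_airport

def control_action(fuel_now_min: float, to_destination: float, to_alternate: float) -> str:
    # Feedback: the fuel quantity and the remaining flying times are what
    # this controller observes about the system right now.
    if predicted_fuel_at_landing(fuel_now_min, to_destination) >= MIN_RESERVE_MINUTES:
        return "continue to destination"
    # Control action: adjust the plan while the constraint can still be met.
    if predicted_fuel_at_landing(fuel_now_min, to_alternate) >= MIN_RESERVE_MINUTES:
        return "divert to alternate"
    return "declare minimum fuel"

print(control_action(fuel_now_min=90, to_destination=50, to_alternate=20))
```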
Whenever a system enters a hazardous state, it is because there were problems with the control structure, specifically with one of the three components listed below:
- Feedback to controllers can be insufficient, which means the controllers do not understand what is going on with the system at some specific moment.
- Mental models can be insufficient, which means the controllers understand what’s going on with the system, but they are unable to predict something that will happen in the future.
- Control actions can be insufficient, which means the controllers know what they need to do to the system to maintain constraints, but their actions do not have an effect of the desired strength. (This could be because the effect is too weak – or too strong!)
We can also see combinations of these problems. When all three of them are problematic, we might actually be looking at an entire controller missing that should be present.
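To make the first two failure modes concrete, here is how they could show up in a toy fuel projection like the one above (numbers invented; the third mode is only described in a comment, since it is about acting too late or too weakly rather than about the arithmetic):

```python
MIN_RESERVE_MINUTES = 30
actual_fuel_min = 70      # what is really in the tanks, in minutes of flying time
planned_minutes = 45      # planned time remaining to the destination
headwind_factor = 1.2     # reality: the leg takes 20 % longer than planned

# 1. Insufficient feedback: the gauge over-reads, so the controller starts
#    its projection from the wrong current state.
observed_fuel_min = actual_fuel_min + 15

# 2. Insufficient mental model: the projection ignores the headwind, so it
#    is optimistic even where the feedback is accurate.
predicted_at_landing = observed_fuel_min - planned_minutes
actual_at_landing = actual_fuel_min - planned_minutes * headwind_factor

print(predicted_at_landing >= MIN_RESERVE_MINUTES)  # True: the controller believes all is well
print(actual_at_landing >= MIN_RESERVE_MINUTES)     # False: the constraint is actually violated

# 3. Insufficient control action would be deciding to divert, but so late (or
#    with so little effect) that no reachable alternate satisfies the constraint.
```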
Controllers exist on all levels. For aircraft maintaining fuel constraints, controllers include the FADEC inside the jet engines, the flight management computer, pilots, ground crew, dispatchers at the airline, training programmes for pilots, air traffic controllers, as well as national and international regulatory boards. (For my child among the rocks, controllers include his balance, his strength, his extremely limited sense of self-preservation, my instruction, my supervision, the places I decide to take us, etc.)
Low-level controllers are often automated, in hardware or software. High-level controllers are often social, cultural, and legal in nature.
Predicting hazardous states is easier than accidents
Accidents in safety-critical systems can look like one-off freak occurrences that would be impossible to predict. (What are the chances that a flight encounters a delay en route, then has to make multiple landing attempts at the intended destination including delays there, diverts, is unable to land at the alternate, and has quite far to go to a tertiary airport?) This is because in order for an accident to occur, not only do we need bad environmental conditions, but multiple controllers must also have been unable to maintain safety constraints. The combination seems unlikely. However, by thinking in terms of hazardous states instead of accidents, we get the benefit that hazardous states are easier to predict.
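To put rough, purely illustrative numbers on this: suppose a given hazardous state occurs on one flight in a million, and conditions bad enough to turn it into an accident are present on one flight in twenty. Treating the two as independent for simplicity,

\[P(A) = P(H \land E) \approx 10^{-6} \times 0.05 = 5 \times 10^{-8},\]

so we would expect to see on the order of twenty hazardous states for every accident. Each of those is a warning we can study and design against without waiting for the accident itself.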
Think of any common technology, like the car. We can probably rattle off several constraints we’d like it to maintain, some fairly mundane. Our car must not start an uncommanded turn, for example. One of the controllers maintaining this constraint is positive stability in the turning axis: if we let go of the steering wheel on flat ground, it will return to the centre position over time. This ensures small bumps only put us slightly off course, at which point another controller kicks in: the driver makes a small adjustment to change the course back to what it was. (In some cars, another automated layer takes over before the driver: software lane keeping assistance can perform that correction.)
We don’t have to actually witness a car crash caused by an uncommanded turn to realise it would be a bad thing if a car started an uncommanded turn. Now we can continue to work on our controllers – why does the turning axis have positive stability? Can that fail? Sure it can, if tyre pressures are unequal. That’s another constraint we can design control structures around, and so on.
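Here is a small sketch of those two layers of control with completely made-up gains. It is not a vehicle dynamics model; it only shows the shape of the argument: a bump puts the car off course, positive stability removes part of the error on its own, and the driver (or lane keeping assistance) cleans up part of what remains, so the error shrinks instead of growing into an uncommanded turn.

```python
def next_heading_error(error_deg: float,
                       self_centering_gain: float = 0.5,
                       driver_gain: float = 0.3) -> float:
    # First layer: positive stability. The steering geometry pulls the wheels
    # back towards centre, removing part of the error by itself.
    error_deg *= 1 - self_centering_gain
    # Second layer: the driver (or lane keeping software) notices the
    # remaining drift and steers part of it away.
    error_deg *= 1 - driver_gain
    return error_deg

error = 4.0  # a bump puts us 4 degrees off course (an arbitrary number)
for step in range(5):
    error = next_heading_error(error)
    print(f"after correction {step + 1}: {error:.2f} degrees off course")
```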
Analysing hazards as accidents
A further benefit of thinking about hazardous states rather than accidents is that we don’t have to wait for an accident to occur before we improve the safety of our system. Being unable to maintain constraints is already a safety problem and should be analysed whether or not environmental conditions were on our side that day, i.e. whether it turned into an accident or not.
This might seem obvious. If we had designed a car that started a sudden uncommanded turn, we wouldn’t wait for it to injure someone before we addressed the problem. But I often see people – especially in the software industry – paper over near misses as long as nobody got hurt. The aviation industry is not like that. You bet safety boards will issue reports on the flight landing with less than 30 minutes of fuel.
More on safety and systems theory
The ideas covered in this article mainly come from a systems theory perspective of safety. One of the central figures in promoting that perspective is Nancy Leveson. I’m a huge fan of her work; among others, the books Engineering a Safer World, the CAST Handbook, and the STPA Handbook. The issue with these is that they’re (a) not well known, and (b) quite dense and filled with decades of Leveson’s experience.
I would like to present a simple, easily digestible view of system-theoretic safety, but I find it hard to structure well. This article is hopefully one in a series of several that go through the important points. This being a broad topic, we have just skimmed the surface for now. Some things I want to bring up eventually are
- In addition to avoiding accidents, reducing their consequence is an important part of safety. When comparing reliability between similar systems, I have almost universally found that the more reliable system actually fails more often, but with less severe consequences.
- We can design systems to make controllers more efficacious. Improving the quality of feedback is often a low-hanging fruit, partly because it improves understanding of the system, but especially because it allows operators to train themselves better mental models.
- System designers often have inaccurate mental models. This means following procedures (designed by system designers) can not only prevent accidents, but also cause them.
- Root cause analysis, and many similar techniques, are based on an oversimplified theory of causation.
- Human error is not the end of accident analysis, but a good starting point. Any decision in the past was more complicated than it seems now.
- We can learn more from accidents than we do right now. That takes more effort per accident, but the tradeoff is worth it because we learn more generally, and this improves reliability more than if we perform shallower analysis over more accidents.
- How to actually perform analysis of a system from this perspective, in a series of prescribed steps.
I don’t know when we’ll see the rest of these, but stay tuned. If you’re a fan of RSS, you know what to do. Otherwise, you can subscribe by email.