Entropic Thoughts

Questions for Cloudflare

Cloudflare just had a large outage which brought down significant portions of the internet. They have written up a useful summary of the design errors that led to the outage. When something similar happened recently to AWS, I wrote a detailed analysis of what went wrong, with some pointers to what else might be going wrong in that process.

Today, I’m not going to model in such detail, but a system-theoretic model of the system raises some questions which I did not find the answers to in the accident summary Cloudflare published, and which I would want answered if I were to put Cloudflare between me and my users.

In summary, the blog post and the fixes suggested by Cloudflare mention a lot of control paths, but very few feedback paths. This is confusing to me, because it seems like the main problems in this accident were not due to lacking control.

The initial protocol mismatch in the features file is a feedback problem (getting an overview of internal protocol conformance), and during the accident they had the necessary control action to fix the issue: copy an older features file. The reason they couldn’t do so right away was that they had no idea what was going on.
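To make the distinction concrete, here is a minimal Python sketch of what a feedback path around a features file could look like. Everything in it is an assumption made for illustration: the `FeatureConsumer` and `ConformanceReport` names, the `MAX_FEATURES` limit and the two conformance rules are hypothetical, and none of it is taken from Cloudflare's systems. The point is only that the consumer reports back whether a new file conforms, in addition to keeping the older file around.

```python
from dataclasses import dataclass, field

# Hypothetical limit, chosen for illustration only.
MAX_FEATURES = 200


@dataclass
class ConformanceReport:
    """Feedback sent back to whoever generated the file."""
    version: str
    accepted: bool
    reason: str = ""


@dataclass
class FeatureConsumer:
    """A hypothetical consumer that keeps its last known-good file."""
    active_version: str
    active_features: list[str]
    reports: list[ConformanceReport] = field(default_factory=list)

    def check(self, features: list[str]) -> str:
        """Return an empty string if the file conforms, else the reason."""
        if len(features) > MAX_FEATURES:
            return f"{len(features)} features exceeds limit {MAX_FEATURES}"
        if len(set(features)) != len(features):
            return "duplicate feature names"
        return ""

    def offer(self, version: str, features: list[str]) -> ConformanceReport:
        """Validate a new file; keep the old one if it does not conform."""
        reason = self.check(features)
        report = ConformanceReport(version, accepted=(reason == ""), reason=reason)
        if report.accepted:
            self.active_version = version
            self.active_features = features
        # Recording the report is the feedback path: the generator can see
        # which versions were rejected and why.
        self.reports.append(report)
        return report
```

Offering a non-conforming file to such a consumer leaves the older file active (the control action) and also produces a report saying which version was rejected and why (the feedback).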

Thus, the critical two questions are

The Cloudflare blog post suggests no.


There are more questions for those interested in details. First off, this is a simplified control model as best as I can piece it together in a few minutes. We’ll focus on the highlighted control actions because they were most proximate to the accident in question.

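[Figure: cloudflare-outage-01.png. Simplified control model, with the control actions most proximate to the accident highlighted.]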

Storming through the STPA process very sloppily, we’ll come up with several questions which are not brought up by the report (and which are typically missed by common accident analysis approaches based on chain-of-events or Swiss cheese models, such as Five Whys).

Maybe some of these questions are obviously answered in a Cloudflare control panel or help document. I’m not in the market right now so I won’t do that research. But if any of my readers are thinking about adopting Cloudflare, these are things they might want to consider!

I don’t know. I wish technical organisations would be more thorough in investigating accidents. Things like STPA are right there and they work – see the previous article on the AWS outage!