Questions for Cloudflare

Note: This article has earned a lot of criticism for being shallow, for reacting too strongly to a fast preliminary analysis, for being dismissive of Cloudflare’s excellent robustness track record, and for being biased by hindsight. I’ll respond to all four points in turn:

That said, let’s get on to the apparently controversial stuff.


Cloudflare just had a large outage which brought down significant portions of the internet. They have written up a useful summary of the design errors that led to the outage. When something similar happened recently to aws, I wrote a detailed analysis of what went wrong with some pointers to what else might be going wrong in that process.

Today, I’m not going to model things in such detail, but a system-theoretic model of the system raises some questions whose answers I did not find in the accident summary Cloudflare published, and which I would like to know if I were to put Cloudflare between me and my users.

In summary, the blog post and the fixes suggested by Cloudflare mention a lot of control paths, but very few feedback paths. This confuses me, because the main problems in this accident do not seem to have been caused by a lack of control.

The initial protocol mismatch in the features file is a feedback problem (getting an overview of internal protocol conformance), and during the accident they had the necessary control actions to fix the issue: copy in an older features file. The reason they couldn’t do so right away was that they had no idea what was going on.

Thus, the two critical questions are:

1. Could Cloudflare have found out faster what was going on?
2. Do the suggested fixes add the feedback paths needed to find out faster next time?

The Cloudflare blog post suggests no.
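
The fallback control action itself is mechanically simple. As a minimal sketch – hypothetical, and assuming nothing about Cloudflare’s actual code, limits, or file format – a consumer could validate a new features file against its own protocol expectations and keep the last known good one on failure:

```rust
/// Hypothetical limit, standing in for whatever preallocated capacity
/// the oversized features file exceeded in the real outage.
const MAX_FEATURES: usize = 200;

/// Check a candidate features file against the protocol the consumer
/// expects before accepting it.
fn validate(candidate: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = candidate.lines().map(str::to_owned).collect();
    if features.len() > MAX_FEATURES {
        Err(format!(
            "{} features exceeds the limit of {}",
            features.len(),
            MAX_FEATURES
        ))
    } else {
        Ok(features)
    }
}

/// Accept the new file only if it validates; otherwise keep the last
/// known good one and surface the rejection to operators.
fn next_features(new_file: &str, last_good: Vec<String>) -> Vec<String> {
    match validate(new_file) {
        Ok(features) => features,
        Err(reason) => {
            // The feedback path: the rejection is visible immediately,
            // rather than surfacing later as crashing proxies.
            eprintln!("rejected new features file: {reason}");
            last_good
        }
    }
}

fn main() {
    let old = vec!["feature-a".to_string(), "feature-b".to_string()];
    let oversized = "feature\n".repeat(500);
    assert_eq!(next_features(&oversized, old.clone()), old);
}
```

The point is less the fallback than the eprintln!: rejecting the bad file loudly is a feedback path that tells operators what is going on.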


There are more questions for those interested in the details. First off, here is a simplified control model, as best I can piece it together in a few minutes. We’ll focus on the highlighted control actions because they were the most proximate to the accident in question.

[Figure: simplified control model of the system involved in the accident (cloudflare-outage-01.png)]

Storming through the stpa process very sloppily, we come up with several questions that are not brought up by the report (and that are typically missed by common accident-analysis approaches based on chains of events or Swiss cheese models, such as root cause analysis and Five Whys).
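
For a flavour of what those prompts look like: for each control action in the model, stpa asks four standard questions about when it becomes unsafe. Here is a sketch of how they might apply to the control action “push a new features file” – the concrete examples are my guesses, not Cloudflare’s analysis:

```rust
/// The four standard stpa prompts for when a control action becomes
/// unsafe, with guessed applications to "push a new features file".
fn unsafe_control_actions() -> [(&'static str, &'static str); 4] {
    [
        ("provided causes hazard",
         "pushing a file that violates the protocol its consumers expect"),
        ("not provided causes hazard",
         "bot-detection features go stale because updates stop"),
        ("too early, too late, or out of order",
         "pushing the file before the query change shaping it is verified"),
        ("stopped too soon or applied too long",
         "continuing to propagate a file consumers have started rejecting"),
    ]
}

fn main() {
    for (prompt, example) in unsafe_control_actions() {
        println!("{prompt}: {example}");
    }
}
```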

I don’t know the answers. I wish technical organisations would be more thorough in analysing their systems for these kinds of problems before they happen. Especially when they have apparently-blocking calls to exception-throwing processes in the middle of 10 % of web traffic.
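
To make that pattern concrete, here is a minimal sketch – again hypothetical, assuming nothing about Cloudflare’s actual code – of a fallible load in the request path, first with the error escalating to a panic and then with it handled:

```rust
/// Hypothetical stand-in for loading per-request configuration that
/// can fail, e.g. on a malformed or oversized features file.
fn load_config(raw: &str) -> Result<u32, std::num::ParseIntError> {
    raw.trim().parse()
}

/// The anti-pattern: the error case escalates to a panic, so one bad
/// input file crashes a process serving live traffic.
fn handle_request_risky(raw: &str) -> u32 {
    load_config(raw).unwrap()
}

/// The same call with the error handled: log it (a feedback path) and
/// degrade to a safe default instead of dying.
fn handle_request_robust(raw: &str) -> u32 {
    load_config(raw).unwrap_or_else(|e| {
        eprintln!("config unavailable ({e}); serving defaults");
        0 // hypothetical safe default
    })
}

fn main() {
    assert_eq!(handle_request_robust("not a number"), 0);
    // handle_request_risky("not a number") would panic here.
}
```

Whether the real code path looked anything like this, only Cloudflare knows; the point is that the error-handling decision determines whether a bad input becomes an alert or an outage.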

I don’t expect a Cloudflare engineer to have a shower thought about this specific problem before it happens; I expect Cloudflare as an organisation to adopt processes that let them systematically find these weaknesses before they are problems. Things like stpa are right there and they work – why not use them?