System Observability: Metrics, Sampling, and Tracing
Few things excite me as much as system observability. The best software engineers I know are very creative when it comes to figuring out what happens in a system.
Today I came across an excellent overview by Tristan Hume of ways to do process tracing in software. This is remarkable because I didn’t think it was possible!1 To be more nuanced, I had prematurely discarded it due to its overhead. But clearly, it can be done, and very well at that. It’s certainly something I need to try out at the next opportunity.
Observability at two levels
In this article, we will use the word “process” to mean a single operation as viewed from outside the system. Some examples of processes:
- A backup run on a database cluster
- A payment attempt at an online store
- A comment being posted on a social media platform
- A cabinet being made by a carpenter
- A surprise party being arranged by a friend
When these go wrong, we would like to find out what happens inside the system to figure out why. The key realisation is that a single operation from outside the system can consist of many operations inside the system. We want to find out something about what those operations are.
There are two levels at which we can do this: whole-system and per-process. For single-threaded, non-async systems, these are the same. Those systems can only run one process at a time, so if we take whole-system measurements when the process is running, we get measurements for that specific process.
More commonly though, our systems are multi-tasking. If we measure how much our friend sleeps while in the process of planning a surprise party, we would be surprised to find out that almost 30 % of the effort of planning a surprise party consists of sleeping! This is the problem with whole-system observations: they give us a clue as to how our systems spend their time, but not what happens during specific processes.
In other words, whole-system observations are good for capacity planning2 If friends sleep 30 % of their time, how many are needed to plan a surprise party? and fault detection3 Is our friend sleeping 10 % of the time over a few days? That’s a problem regardless of what specific processes they are running at the moment., but they are not very useful for troubleshooting specific processes.
Fix one problem at a time: the most profitable
A system is most efficiently optimised by optimising the slowest processes first.4 As a rough guide, anyway. This is an application of Amdahl’s law. The highest possible speedup we can get from eliminating an operation is the time spent on that operation. It sounds obvious, but people often forget about it. For each process that can be optimised, there’s a return-on-investment calculation we can make on whether the effort will pay off. To optimise efficiently, one starts with the highest roi optimisation and then goes down the list until other uses of the budget have higher roi.5 Again, at the risk of stating the obvious: we can only fix one problem at a time, so why not start with the most profitable one?
This is why per-process measurements are really critical when we can get them: they let us evaluate the roi of optimising a specific process.
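To make that concrete, here’s a minimal sketch of the roi calculation – the process names, numbers, and effort estimates are all made up for illustration. Given per-process measurements, we can rank candidate optimisations by time saved per unit of effort, with Amdahl’s law capping the saving at the time the operation actually takes.

```python
# Hypothetical per-process measurements: seconds spent per day, the
# fraction of that time we believe we can eliminate, and a rough
# estimate of the engineering effort (in days) to do so.
candidates = [
    # (process, seconds_per_day, fraction_removable, effort_days)
    ("database backup",  3600, 0.50, 5),
    ("payment attempt",   800, 0.25, 2),
    ("comment posting",   120, 0.80, 1),
]

def roi(seconds_per_day, fraction_removable, effort_days):
    # Amdahl-style bound: we can never save more than the time the
    # operation actually takes.  Roi here = seconds saved per day of effort.
    return seconds_per_day * fraction_removable / effort_days

for name, secs, frac, effort in sorted(
        candidates, key=lambda c: roi(*c[1:]), reverse=True):
    print(f"{name}: {roi(secs, frac, effort):.0f} s/day saved per day of effort")
```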
One way to fake per-process observations is to try to make sure the system runs only one process.6 E.g. by running a load generator that crowds out all other activity. This sorta-kinda turns whole-system observations into per-process observations, but when we have access to real per-process observations, that’s even better.
Process tracing is king
There are three kinds of observability I know about, listed here in order of increasing detail:
- Metrics (counters and gauges7 We can probably shoehorn in logs as a type of highly specific counter that increments once every time that specific log message is emitted.)
- System sampling
- Process tracing
The catch is that this list is also in order of increasing cost.
Metrics
Metrics are simple, extremely cheap8 With built-in cpu instructions for compare-and-swap, it can even be done performantly at high parallelism., but only give a whole-system overview. Due to their incredible simplicity, I include metrics in almost everything I do. Every time an event of interest happens, increment a counter. Every few minutes, write a line to a log file with the current values of the counters.
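As a minimal sketch of that pattern (the counter names and flush interval here are made up, and a real implementation would want proper synchronisation and a real log file rather than stdout):

```python
import collections
import json
import threading
import time

counters = collections.Counter()

def flush_counters(interval_seconds=300):
    # Every few minutes, write one line with the current counter values.
    # Here it is just printed; a real system would append to a log file.
    while True:
        time.sleep(interval_seconds)
        print(json.dumps({"ts": time.time(), **counters}))

threading.Thread(target=flush_counters, daemon=True).start()

# In the hot path: every time an event of interest happens, increment.
counters["payments_attempted"] += 1
counters["payments_failed"] += 1
```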
From that counter data, we can do all sorts of things, like compute how fast things are happening, or project how many things will have happened later, or discover patterns in when things are happening, or trends in how fast they are happening, or how evenly things are distributed through time, and so on.
This is the sort of observability that, from what I understand, is very common in other industries – in particular, I read at some point about observability in the oil and gas industry being heavy on counters and gauges, but I fail to remember which article that was. I think we can still learn a lot from how other industries use metrics.
System sampling
Many people think of system sampling when they think of profiling. System sampling is when you poke the system at random (or regular) intervals and see what it is doing right at that moment. Depending on how the system is constructed, this can be supported from within the system, or by the environment the system runs in9 Linux, for example, supports sampling userspace stacks with e.g. perf and eBPF. Many runtimes (Java, .NET) support sampling of their vm..
Going into the details of system sampling would be a separate article, but the main thrust for the purpose of this article is that this still only results in whole-system observations. We will learn exactly how much time the system spends on various things, but not in which order, or to which process they belong.
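As an illustration of the within-the-system flavour, here is a minimal sketch of a sampler, assuming a Python process. Note how it tells us what the system as a whole is doing, but not on behalf of which process.

```python
import collections
import sys
import threading
import time

samples = collections.Counter()

def sample_stacks(interval_seconds=0.01):
    # Wake up at a fixed interval and note which function each thread is
    # executing right now; over many samples this approximates where
    # wall-clock time goes -- but only for the system as a whole.
    while True:
        for tid, frame in sys._current_frames().items():
            if tid == threading.get_ident():
                continue  # skip the sampler thread itself
            samples[frame.f_code.co_name] += 1
        time.sleep(interval_seconds)

threading.Thread(target=sample_stacks, daemon=True).start()
```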
Process tracing
Process tracing, on the other hand, means recording every event that occurs and in which context it occurs, which means after the fact we can construct an exact timeline of a specific process within a system.
This is what we are really after – when we have this, we can reconstruct everything else from the trace data: metrics and system samples are implied by the trace. But critically, we can know the impact of optimising specific parts of any process.
Recording every event to construct a complete timeline sounds very expensive, and compared to the above techniques, it is. But I’ve always assumed it is prohibitively expensive. Is it, though? Ultimately, the code I write is part of interactive systems with 10–500 ms response times, or batch systems that need to run for a few hours every day. Compare that order of magnitude to printing a trace event to a log file for, I don’t know, the 50 most important top-level operations? Could easily be worth it, depending on specifics.
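Concretely, printing a trace event to a log file could look something like this minimal sketch, assuming a Python system – the field names and log destination are my own. Each top-level operation writes one record tagged with the process it belongs to, so the per-process timeline can be reassembled from the log afterwards.

```python
import contextlib
import json
import time

trace_log = open("trace.log", "a")  # hypothetical destination

@contextlib.contextmanager
def traced(process_id, operation):
    # Record one event per operation: which process (in this article's
    # sense) it belongs to, what it was, when it started, and how long
    # it took.  Enough to reassemble a per-process timeline afterwards.
    start = time.monotonic()
    try:
        yield
    finally:
        trace_log.write(json.dumps({
            "process": process_id,
            "operation": operation,
            "start": start,
            "duration_ms": (time.monotonic() - start) * 1000,
        }) + "\n")
        trace_log.flush()

# E.g. around the interesting top-level operations of a payment attempt:
with traced("payment-1234", "charge_card"):
    pass  # the actual work goes here
```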
And that’s just using the dumbest possible form of tracing – Tristan Hume’s article lists many more options, most of which are highly performant.
I don’t have much to write about this but I hope I will in the future! In the meantime, read that article – it’s good.
Job smarts in observability
In the literature10 Working Minds: A Practitioner’s Guide to CTA; Crandall, Klein, and Hoffman; Bradford Books; 2006., job smarts is a recurring marker of expertise. This refers to how experts choose to sequence their work to optimise for their goals, and ways they adapt their process to get more feedback faster.
I like singling out instances of job smarts when I see them, because I think it’s useful to learn by examples. Here’s what Tristan Hume does:
I wanted to correlate packets with userspace events from a Python program, so I used a fun trick: Find a syscall which has an early-exit error path and bindings in most languages, and then trace calls to that which have specific arguments which produce an error.
This is brilliant. I would never in a thousand years have thought of doing it this way, but now that I know about it, of course that’s what one does. Piggyback on a no-op syscall to get access to the wealth of system tracing tools from within your application.
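To illustrate the shape of the trick (the syscall and path here are my own guesses for the sake of example, not necessarily what the article uses): make a cheap syscall that takes an early-exit error path, and let the argument double as the event name so a kernel-side tracer can pick it out.

```python
import os

def emit_trace_marker(event: str) -> None:
    # access() on a path under a directory that never exists fails fast
    # in the kernel, so the call is cheap, but the syscall (and its path
    # argument) is still visible to kernel tracing tools, which can
    # filter on the made-up prefix below.
    os.access(f"/nonexistent-trace-markers/{event}", os.F_OK)

emit_trace_marker("packet-handler-start")
```

On the kernel side, one would then attach a probe to the corresponding syscall tracepoint (on Linux, the access/faccessat family) and filter for that path prefix to correlate these userspace markers with everything else the tracer sees.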