Practices of Reliable Software Design

I was nerd-sniped. Out of the blue, a friend asked me,

If you were to build an in-memory cache, how would you do it?

It should have good performance and be able to hold many entries. Reads are more common than writes. I know how I would do it already, but I’m curious about your approach.

I couldn’t not take the bait.

In the process of answering the question and writing the code, I noticed a lot happening in my thought process that has come with experience. These are things that make software engineering easier, but that I know I wouldn’t have considered when I was less experienced.

I started writing a long and sprawling article on it, but it didn’t quite hit the right notes, so this is a much abbreviated version. Time allowing, I might expand on some of these practices in separate articles that can be more focused and concise.

The Practices

Here are the eight practices, adopted with experience, that I found useful during the exercise of writing a fast, small, in-memory cache.

1. Use Off-The-Shelf

Our first response to the question should be something like, can we use Redis?

As long as we’re not dealing with a very expensive or complicated component, or a part of the software that generates most of its value, we should default to off-the-shelf solutions. I have written before about why we want to build expensive and complicated stuff.

2. Cost And Reliability Over Features

If we find out we can’t use an off-the-shelf solution, we need to build something cheap and reliable. That usually means not having all the bells and whistles, but that’s a trade-off worth making. One of my favourite ways of phrasing this is

It is much easier to add features to reliable software than it is to add reliability to featureful software.

Besides, it’s very easy to accidentally think you need features that you don’t actually need.

I’m not a big proponent of a design phase. Sometimes you need it, but I think if you need to design your software up front, it will cost you more to develop and it will take longer to finish. On the other hand, sometimes a little up-front analysis allows you to eliminate large portions of the design space before writing a single line of code.

In the case of this cache, we may ask questions about item durability requirements, request rates, sizes, eviction requirements, and so on. When we do, we find out that we can get by with a single thread accessing a single data structure, and no active eviction process. That’s a huge win! It simplifies the design a lot.
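To make that kind of analysis concrete, here is the sort of back-of-the-envelope arithmetic involved, with hypothetical numbers (the real figures would come from the requirements). By Little’s law, the steady-state number of live entries is the write rate multiplied by the TTL:

import math

write_rate = 1_000       # writes per second (hypothetical)
ttl_seconds = 960        # lifetime of each entry (hypothetical)

# Little's law: items in the system = arrival rate x time each item stays.
live_entries = write_rate * ttl_seconds     # 960,000 entries in steady state
address_bits = math.log2(live_entries)      # ~19.9 bits to address them all

A couple of multiplications is all it takes to rule out, say, a design that tries to hold on to every entry ever written.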

3. Idea To Production Quickly

Part of the reasoning behind the previous practice was that it’s easy to think you need features you don’t. How, then, do you know which features you need? For the most part, the cheapest and most reliable way to find out is in production.

If we deploy the bare minimum feature set, we will very quickly find out which additional features are most commonly requested, and they are not going to be the ones we think. And even if they are, there are usually other requirements on them that we wouldn’t have caught ahead of time.

This practice ties in strongly with the previous one: use analysis to pare the requirements down to the bare minimum, write the bare minimum of code to support them, and get it into production to find out more about the problem we’re trying to solve.

4. Simple Data Structures

Complicated data structures are often tempting, especially when there are libraries that handle them for us. The problem with complicated data structures is that it’s easy to misuse them due to a lack of understanding, which leads to performance problems and bugs.

In the case of this specific cache, the number of entries that needs to be stored (we can use queueing theory to calculate this from the TTL and the rate of writes) would comfortably be addressed by 19.9 bits of information – in other words, we can just use a plain array to store each item, and hash the key to address it.

We found out earlier that we didn’t need to actively evict expired items. If we had needed to, we could have kept a separate index of the item nearest to expiry using a min-heap. This is cheaper and simpler than something fully sorted, which it might be tempting to reach for otherwise.

Instead of a separate eviction process, we can rely on the high request rate to trigger eviction of items, thus trading a little latency to avoid designing for concurrency up front. Another alternative is to provide a separate API that triggers eviction of the oldest expired entry, letting the caller decide how much time to spend on eviction. This means the caller can vary that time to make dynamic latency–resource tradeoffs.
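As a minimal sketch of the plain-array layout – in Python, with a slot count and hash function chosen only for illustration:

import hashlib

N_SLOTS = 2 ** 20            # ~1M slots, comfortably covering ~19.9 bits of entries
slots = [None] * N_SLOTS     # plain preallocated array of (key, value) pairs
                             # (expiry bookkeeping omitted from this sketch)

def slot_index(key: str) -> int:
    # Hash the key and map it onto the array.
    digest = hashlib.blake2b(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SLOTS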

5. Reserve Resources Early

Another potentially controversial decision in the cache I designed was to allocate the full index array up front. This sounds wasteful, but back-of-the-napkin analysis tells us it would need most of this space anyway during continuous operation, so nothing would be saved by postponing the allocation.

What early allocation buys us is that the software can crash early if the required resources are not available. This is a desirable trait – it saves the operator the frustration of finding out about the problem through a PagerDuty call after they’ve gone to bed. It also makes it easier to do capacity planning and reason about performance.
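In the sketch above this amounts to doing the allocation first thing at startup and refusing to run without it – the details here are hypothetical, but the shape is:

import sys

try:
    # Reserve the whole index before serving any traffic.
    slots = [None] * N_SLOTS
except MemoryError:
    # Fail loudly now, rather than while the operator is asleep.
    sys.exit("cache: could not reserve %d index slots" % N_SLOTS)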

6. Set Maximums

To handle hash collisions when addressing items, I chose a sloppy method based on open addressing with linear probing. It is simple and has very good performance in the typical case (again, only very rudimentary arithmetic is needed to find this out for any specific use case), but it can have very bad performance in the worst case: it can end up iterating through the entire array of items on every cache miss.

Since perfect recall turned out not to be necessary (this is again why asking questions about requirements is so important!), it was trivial to set a maximum iteration limit of e.g. 500 steps. This makes sure the worst case is not too far from the average case, at the cost of a few more cache misses. Does 500 steps optimise this tradeoff? No. This is one thing I don’t know how to compute ahead of time, so production experience will have to inform future changes to the maximum number of steps.
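Concretely, building on the earlier sketch, the bounded probe might look like this (hitting the limit is simply reported as a miss):

MAX_PROBE = 500                  # the maximum discussed above

def lookup(key):
    # Open addressing with linear probing, but give up after MAX_PROBE slots.
    start = slot_index(key)
    for step in range(MAX_PROBE):
        entry = slots[(start + step) % N_SLOTS]
        if entry is None:
            return None          # empty slot: a genuine miss
        if entry[0] == key:
            return entry[1]      # found the key
    return None                  # probe limit reached: treat as a miss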

The important thing is usually not what the maximum is, just that there is some maximum, so we are never left waiting a surprisingly long time for things to complete.

Given the static allocation requirements, we might also want to set a limit on how much data can be stored in the cache at any given time, so we don’t accidentally blow memory limits in the middle of the night. In general, limit all the things! If someone is unhappy with the limit, revise the limit – don’t remove it.
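The write path can enforce such a cap alongside the probe limit. A sketch, with MAX_BYTES made up for illustration:

MAX_BYTES = 512 * 1024 * 1024    # hard cap on payload bytes held at once (hypothetical)
stored_bytes = 0

def write(key, value):
    global stored_bytes
    size = len(value.encode())
    if stored_bytes + size > MAX_BYTES:
        return False             # refuse the write rather than blow the memory limit
    start = slot_index(key)
    for step in range(MAX_PROBE):
        i = (start + step) % N_SLOTS
        entry = slots[i]
        if entry is None or entry[0] == key:
            if entry is not None:
                stored_bytes -= len(entry[1].encode())   # replacing: release old size
            slots[i] = (key, value)
            stored_bytes += size
            return True
    return False                 # the probe limit bounds write cost, too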

7. Make Testing Easy

To avoid regressions and ensure expected behaviour, I made the cache accept commands on standard input. This means we can start the cache and type write hello world in the terminal window to store "world" for the key "hello".

But it also means we can write up a long string of commands in a file, and pipe that into the cache. If we then also give the cache a command that asserts the last value read is some specific value, we can write up a whole test protocol in a plain text file, and pipe it to the program to verify its functionality. (The cache also had a few more commands related to testing, such as a sleep command that forces it to believe time has passed, for testing expiration, and a dump command that prints out everything it is currently storing, along with whether each entry is deleted, expired, or active.) The input file might look like

# Cache empty, should not find anything.
read abc
assert undef

# Write and then immediately read the same thing back.
write abc 123
read abc
assert 123

# Different key should get nothing.
read xyz
assert undef

# Write different value, then read, should match.
write abc 567
read abc
assert 567

# Write one more key, abc should still be around.
write xyz 444
read abc
assert 567

In isolation, each of these things is very simple (accept commands at the CLI, allow asserting on the previous read), but it’s the combination of them, together with tools already at hand (shell input redirection), that enables faster cycle times and thus more reliable software.
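For illustration, the command loop that enables all of this can be very small. A Python sketch handling only the commands shown above (the real cache had more, as noted):

import sys

last_read = None

def handle(line):
    global last_read
    parts = line.split(maxsplit=2)
    if parts[0] == "write":
        write(parts[1], parts[2])
    elif parts[0] == "read":
        last_read = lookup(parts[1])
        print(last_read if last_read is not None else "undef")
    elif parts[0] == "assert":
        expected = None if parts[1] == "undef" else parts[1]
        assert last_read == expected, "expected %s, got %s" % (parts[1], last_read)

for line in sys.stdin:
    line = line.strip()
    if line and not line.startswith("#"):    # skip blanks and comments
        handle(line)

Running a test protocol is then plain shell redirection, something like cache < smoke-test.txt.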

8. Embed Performance Counters

Finally, we will want to know how our program spends its time. We can do that with profiling, or deduce it from logs, but the easiest way is to embed performance counters. These are variables that accumulate how much of something has happened. It can be things like

  • How much time is spent reading keys?
  • How much time is spent writing key–value pairs?
  • How much time is spent on I/O?
  • How many cache misses have there been?
  • How many keys have we had to linearly search over?
  • How many times has the iteration limit of 500 been reached?

Note that all of these are cumulative, so we don’t ask “how much storage space is currently allocated?”, but instead ask the pair of questions

  • How much storage space have we allocated in total from startup until now?
  • How much storage space have we freed in total from startup until now?

By breaking the measurement apart into two time-dependent totals we can learn significantly more about how our system behaves.
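A sketch of what this can look like – plain cumulative variables bumped inline, with names invented for this example:

import time

counters = {
    "read_seconds": 0.0,        # time spent reading keys
    "cache_misses": 0,          # cumulative, never decremented
    "bytes_allocated": 0,       # total allocated since startup
    "bytes_freed": 0,           # total freed since startup
    "probe_limit_hits": 0,      # times the iteration limit was reached
}

def timed_read(key):
    start = time.monotonic()
    value = lookup(key)
    counters["read_seconds"] += time.monotonic() - start
    if value is None:
        counters["cache_misses"] += 1
    return value

# Current usage is derived rather than stored:
#   bytes_in_use = counters["bytes_allocated"] - counters["bytes_freed"]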

How these numbers evolve over time is also a useful indicator. For example, the value of “how many keys have we had to linearly search over?” might increase steadily (indicating a relatively even distribution of hash collisions) or jump up now and then (indicating sudden bursts of collisions). These are useful things to be able to tell apart. In combination with “how many times has the iteration limit been reached?”, it tells us a lot about the pressures put on the system.

Other practices

There are surely many more of these insights gained from years of software engineering, but these are the ones I thought of during this exercise. I’d be happy to hear from you about others you can think of! If nothing else, I hope they can teach me to be a better software engineer.