Response Time Is the System Talking
A colleague at my new job wanted to do some HTTP scraping against an endpoint owned by a different team, and asked what request rate would be appropriate to avoid overloading the service. The team that owned the service didn’t really know: it was a non-critical endpoint for which high utilisation had never been anticipated, so no real load tests had been performed.
Aim for utilisation targets
We don’t actually have to know an appropriate request rate ahead of time. We can determine that dynamically based on how hard the system is forced to work. This is known as utilisation, i.e. how much of its time the system is spending servicing requests.
The obvious answer is incorrect here – 100 % utilisation is the wrong target. I strongly recommend running production systems under, say, 40 % utilisation at all times. If we expect high variation in demand, we should go even lower – perhaps 10 % or less.1 The problem with high utilisation is that requests quickly start to time out waiting for free resources, even when utilisation is not quite 100 %: with so little margin left, natural variation in demand is enough to push the system into saturation.
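To make the blow-up concrete, here is a minimal Python sketch using the single-server relationship derived in Appendix A, where \(W_{\mathrm{loaded}} = 1/(\mu - \lambda)\) rearranges to \(W_{\mathrm{loaded}} = W_{\mathrm{baseline}}/(1 - R)\). The 100 ms baseline is a made-up number purely for illustration.

```python
# Illustrative only: how expected response time grows with utilisation
# in a single-server queue, per W_loaded = W_baseline / (1 - R).
w_baseline = 0.100  # seconds; assumed near-idle response time

for r in (0.10, 0.40, 0.73, 0.90, 0.99):
    w_loaded = w_baseline / (1 - r)
    print(f"utilisation {r:4.0%}: expected response time {w_loaded * 1000:6.0f} ms")
```

At 40 % utilisation the response time is only about 1.7× the baseline, but at 90 % it is 10× and at 99 % it is 100×, which is why a little extra demand near saturation turns into timeouts.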
The question then becomes “How hard can I drive this endpoint while still keeping the system utilisation under 40 %?”
Response time reveals utilisation
Here’s a handy tip from queueing theory: response time is a function of utilisation. In other words, by listening to what the response time is, you can tell, approximately, how overloaded the system is.
Utilisation, which I’ll call \(R\), can be deduced from the slowdown in response time like so:
\[R = 1 - \frac{W_{\mathrm{baseline}}}{W_{\mathrm{loaded}}}\]
The baseline response time \(W_{\mathrm{baseline}}\) should be measured when the system is as near completely idle as possible. The response time under load, \(W_{\mathrm{loaded}}\), should be measured continuously as you apply requests to the system.
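As a small illustration, here is what that calculation might look like in Python. The function name and the guard against a loaded measurement that is faster than the baseline are my own choices, not part of any standard library.

```python
def estimated_utilisation(w_baseline: float, w_loaded: float) -> float:
    """Estimate utilisation R from the near-idle response time and the
    response time measured under load, via R = 1 - w_baseline / w_loaded.
    Both arguments are average response times in the same unit."""
    if w_loaded <= w_baseline:
        # Under load we responded no slower than at idle, so the system
        # is effectively idle as far as this estimate is concerned.
        return 0.0
    return 1 - w_baseline / w_loaded
```

With the numbers from the example below, `estimated_utilisation(0.089, 0.327)` comes out to roughly 0.73.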
In a concrete example, say that we managed to measure a baseline response time during the middle of the night, and our system responded on average in 89 ms. Then we run our job and discover that the average response time is now 327 ms. This means we are probably utilising the system at
\[1 - \frac{89}{327} \approx 73\,\%\]
This is not an exact science, but it gets us in the right ballpark. Critically, it tells us that whatever load we applied was too much to stay safe in production. Since utilisation is proportional to the request rate, scaling the rate down by half would take us from roughly 73 % to roughly 37 %, back under the 40 % target.
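If we want to automate that decision, a feedback loop can use the same estimate to pick the next request rate. The sketch below assumes we control the request rate directly and that utilisation scales proportionally with it (as in Appendix A); `next_request_rate` is a hypothetical helper of my own, not an existing API.

```python
TARGET_UTILISATION = 0.40
W_BASELINE = 0.089  # seconds, measured while the system was near idle

def next_request_rate(current_rate: float, w_loaded: float) -> float:
    """Pick a request rate expected to keep estimated utilisation near the
    target, using the proportionality R = lambda / mu from Appendix A."""
    r_now = max(1e-6, 1 - W_BASELINE / w_loaded)  # current utilisation estimate
    # Cap growth at 2x per adjustment so a near-idle estimate does not
    # cause a wild jump in request rate.
    return min(2 * current_rate, current_rate * TARGET_UTILISATION / r_now)

# With the article's numbers: at ~73 % estimated utilisation the rate gets
# scaled by 0.40 / 0.73, i.e. cut to a bit more than half.
print(next_request_rate(10.0, 0.327))  # ~5.5 requests/second
```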
Appendix A: Derivation
The relationship between response time and utilisation is based on three basic equations of queueing theory. First, the utilisation is given by the current request rate (\(\lambda\)) and the maximum request rate the system is capable of (\(\mu\)):
\[R = \frac{\lambda}{\mu}.\]
The response time at idle is the inverse of the maximum request rate the system is capable of:
\[\mu = \frac{1}{W_{\mathrm{baseline}}}.\]
And finally, the response time under load is related to the capacity of the server and the current request rate:
\[W_{\mathrm{loaded}} = \frac{1}{\mu - \lambda}.\]
The last equation can be rearranged to
\[(\mu - \lambda) W_{\mathrm{loaded}} = 1\]
and then we rewrite \(\lambda\) as \(R \mu\) (from the first equation), giving
\[(1 - R) \mu W_{\mathrm{loaded}} = 1\]
and shuffle things around again
\[R = 1 - \frac{1}{\mu W_{\mathrm{loaded}}}\]
and finally substitute \(1/W_{\mathrm{baseline}}\) for \(\mu\), giving
\[R = 1 - \frac{W_{\mathrm{baseline}}}{W_{\mathrm{loaded}}}.\]
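As a sanity check on the algebra, we can plug the example numbers from earlier back through the three starting equations and confirm they reproduce the measured response time under load. This is just a numerical illustration, not production code.

```python
w_baseline, w_loaded = 0.089, 0.327  # the example measurements from above

mu = 1 / w_baseline             # maximum request rate, ~11.2 requests/s
r = 1 - w_baseline / w_loaded   # estimated utilisation, ~0.73
lam = r * mu                    # implied current request rate, ~8.2 requests/s

# W_loaded = 1 / (mu - lambda) should give back the measured 327 ms.
print(1 / (mu - lam))  # ~0.327 seconds
```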
Appendix B: Caveats
The usual queueing theory caveats apply here. The equations above are based on single-server FIFO systems with independent, thin-tailed service times, whereas most real-world services are multi-server, concurrent, timeslicing systems with heavy-tailed service times. Still, the equations work as surprisingly good approximations in most systems I’ve come across.2 In particular, timeslicing happens to somewhat negate the effects of heavy-tailed service times, and multi-server systems look like single-server systems when viewed from an appropriate distance.
Let me know if they don’t work for you, and why you think that might be!