Sampling For Managers
Here’s an example of how you can make something annoying slightly easier.
Every year I’m asked to produce a very peculiar number: how many gigabytes of storage my department has reserved at cloud providers. This is apparently useful input into some methodology for estimating the environmental impact of renting cloud resources.
Not all cloud providers make this number easy to get. Hetzner Cloud, for example, gives you a paginated list of virtual machines and you have to click into each one to see the storage space allocated to it. This is one of the better providers for finding this out, but it’s different for all of them.1 I suspect not more than two hours from when I publish this article, someone will point out a really obvious way to do what I want in Hetzner easily – but that’s missing the point! This article is about a general technique for dealing with a general problem, not how to specifically get total allocated storage space from Hetzner. To get an exact sum of allocated storage space, you have to click every vm on every page in that list – either manually or with a script.
If you don’t want to click on every vm, here’s what you can do instead.
Estimation
The first step is realising that an exact number is superfluous. An exact number is almost always superfluous. We just need something in the right ballpark, where the size of the ballpark varies with the problem. Whatever the size of the ballpark, we can use statistics to spend less time and still achieve our goals.
Selecting A Sample
In the Hetzner account I’m going to use as an example for this, there are 115 virtual machines. I know this because Hetzner paginates the list with 10 machines per page, and the last page is number 12, which has five machines on it.
Let’s say we think clicking on one vm per page would be fairly low effort for this. Since there are 10 machines per page, we start with a sampling ratio of 10 % and see where that takes us.
We want a random sample for any of the below to work. Systematic sampling is often a convenient way to perform random sampling by hand. It requires that the list we’re drawing from does not have significant order or cyclic patterns that overlap with our sampling scheme. In this case, the list on Hetzner is ordered alphabetically by name, which is usually a good approximation for random order.
To perform systematic sampling, we need to generate one random number between 1 and 10 to find our starting point. Let’s say it was 4. Then we click on the vm on the fourth row on the first page, and then we click on every tenth vm after that (to get a sampling ratio of 10 %), which in practice will mean clicking on the fourth row on every page.
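If you would rather let a script pick the rows than roll a die, here is a minimal Python sketch of the same systematic sampling scheme. The 115 machines and the page size of 10 come from this example; the script itself is hypothetical and only tells you which rows to click.

```python
import random

N = 115  # total number of virtual machines
k = 10   # sampling interval: one vm per page of 10

# Pick a random starting row between 1 and k, then take every k-th vm.
start = random.randint(1, k)
selected = range(start, N + 1, k)

for index in selected:
    page = (index - 1) // k + 1
    row = (index - 1) % k + 1
    print(f"click vm #{index}: page {page}, row {row}")
```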
Estimating Total
For the account used as an example, the above systematic sampling process yielded the following disk sizes.
160 | 240 | 80 | 240 | 360 | 360 | 360 | 390 | 80 | 120 | 80 | 80 |
We can drop these into spreadsheet software and ask it to compute the sample mean for us.2 In both Google Sheets and Excel this would be the average function. We’ll put the numbers in the A column, rows 1 to 12; we will note how many machines we have in cell B1, how many we have sampled in B2, and the average size in B3.
I strongly recommend that you open up a spreadsheet of your own and follow along – it will be much easier to understand what happens that way.
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
We learn that the average vm in our sample has a disk size of 213 gb.
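For the record, the average function is doing nothing more exotic than summing the twelve sampled sizes and dividing by twelve:

\[\overline{y} = \frac{160 + 240 + 80 + \dots + 80}{12} = \frac{2550}{12} = 212.5\ \text{gb}.\]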
If we selected the sample randomly (which we did), then this number is also the best guess for the average disk size of all 115 virtual machines. That means the total disk allocated to all those machines is probably close to whatever ends up in B4.
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
B4 | =B3 * B1 | 24437.5
This was the question we wanted to answer in the first place. We have learned the total allocated disk is probably close to 24.4 tb. And we got there with only about 10 % of the effort.
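As a formula, the estimated total \(\hat{Y}\) is just the number of machines \(N\) times the sample mean \(\overline{y}\):

\[\hat{Y} = N \overline{y} = 115 \times 212.5\ \text{gb} = 24437.5\ \text{gb} \approx 24.4\ \text{tb}.\]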
We could end there, but…
Sampling Error
There’s one thing that might irk you. We said the total disk size is close to 24.4 tb. How close is close?
Since we drew our sample randomly, we can actually quantify how close to our estimate the true value is likely to be.
Sample Variance
In order to know how close a 10 % sample gets us to the true total, we need to know how much variation there is in the disk sizes of all 115 virtual machines. We don’t know that, but we can estimate it from the variation in the sample we already have – our spreadsheet will help us out here too.3 In Google Sheets this is the var function, in Excel it’s the var.s function.
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
B4 | =B3 * B1 | 24437.5
B5 | =VAR(A1:A12) | 16347.73
In the case of the 12 disk sizes above, we have sample variance s² = 16348 gb². This tells us something about the population variance too: it’s likely to be around the same number.4 Strictly speaking, the population variance σ² of 115 virtual machines will be approximately 99 % of the sample variance s². That is close enough that we can ignore the difference. This means we can use the sample variance to draw conclusions about the likely range of the total disk allocated to all virtual machines.5 If you were surprised about the variance being measured in units of square gigabytes, it was not a typo. The variance, in contrast to the standard deviation, is measured as the square of the unit of interest. This sounds annoying, but it’s often easier to deal with variance directly. Besides, the standard deviation is frequently misunderstood and people have a tendency to draw the wrong conclusions from it.
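Written out, what the var function computes for our 12 samples is

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(y_i - \overline{y}\right)^2 = \frac{179825}{11} \approx 16348\ \text{gb}^2.\]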
Variance of Estimated Mean
We are going for the total allocated size, but it’s easier to see how we get there if we break it down and look at the variance of the estimated mean first.
Based on our sample, we think the mean disk size is close to 213 gb. How close? The formula is short:
\[\mathrm{Var}(\overline{y}) = \frac{\mathrm{Var}(y)}{n}.\]
As mentioned, we can use the variance of the sample as an estimation of the variance of individual disk sizes, so for our disk data above, this means
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
B4 | =B3 * B1 | 24437.5
B5 | =VAR(A1:A12) | 16347.73
B6 | =B5/B2 | 1362.31
Thus, the variance of our estimated mean is 1362 gb². We’ll use this number later!
Maths With Variances
If you’re happy to just accept the formula above without understanding, you can skip this subsection. If you need help remembering the formula above, there are two fundamental rules that will help you reconstruct many of these results:
\[\mathrm{Var}(a + b) = \mathrm{Var}(a) + \mathrm{Var}(b)\]
and
\[\mathrm{Var}(ka) = k^2 \mathrm{Var}(a).\]
Stated in English:
- The variance of a sum of two independent uncertain values is the sum of the variances of the two uncertain values.
- The variance of a multiple of an uncertain value is the multiple squared of the variance of the uncertain value.
We use the above to compute the variance of the mean. As you know, the mean is defined to be
\[\overline{y} = \frac{1}{n} \left(y_1 + y_2 + \dots + y_n\right).\]
Thus,
\[\mathrm{Var}(\overline{y}) = \frac{1}{n^2} \left(\mathrm{Var}(y_1) + \mathrm{Var}(y_2) + \dots + \mathrm{Var}(y_n)\right).\]
Since all \(y_i\) have the same variance (they are part of the same population, as far as we care), the above simplifies to
\[\mathrm{Var}(\overline{y}) = \frac{n \mathrm{Var}(y)}{n^2} = \frac{\mathrm{Var}(y)}{n}.\]
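If you would rather check this numerically than algebraically, here is a small Python simulation sketch. The population of disk sizes below is made up purely for illustration; nothing about it comes from the Hetzner account.

```python
import random
import statistics

random.seed(42)

# A made-up population of "disk sizes", purely for illustration.
population = [random.choice([40, 80, 160, 240, 360]) for _ in range(10_000)]
n = 12

# Draw many samples of size n and record each sample mean.
means = [statistics.mean(random.sample(population, n)) for _ in range(5_000)]

var_y = statistics.pvariance(population)  # Var(y) of the population
var_mean = statistics.variance(means)     # observed variance of the sample means

print(f"Var(y)/n           = {var_y / n:.1f}")
print(f"observed Var(mean) = {var_mean:.1f}")  # should be close to Var(y)/n
```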
Variance of Estimated Total
With the above, we can finally get around to computing the variance of the estimation of the total size Y of all N virtual machines. We have6 See subsection Maths With Variances if you are unsure of why.
\[\mathrm{Var}(\hat{Y}) = \mathrm{Var}(N \overline{y}) = N^2 \mathrm{Var}(\overline{y}).\]
We can put this into the spreadsheet as well:
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
B4 | =B3 * B1 | 24437.5
B5 | =VAR(A1:A12) | 16347.73
B6 | =B5/B2 | 1362.31
B7 | =B1^2 * B6 | 18016549.75
Note that variances of totals often end up looking like big numbers, in part because the totals themselves are big, and in part because the variance is measured in the square of the unit of interest.
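Part of the intimidation is the unit. Expressed in square terabytes, the same number is

\[\mathrm{Var}(\hat{Y}) \approx 1.8 \times 10^7\ \text{gb}^2 = 18\ \text{tb}^2,\]

since one square terabyte is \(1000^2\) square gigabytes.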
Confidence Interval
With roughly 90 % probability, the true total allocated disk Y will end up within7 A 90 % confidence interval implies there’s a 5 % probability of being wrong on the high end, and a 5 % probability of being wrong on the low end. The 5 % mark on the normal distribution is 1.645 standard deviations away. If you are dealing with a life-and-death situation, you don’t want to build your calculations on a normality assumption, but instead derive confidence intervals using the bootstrap or other resampling techniques that respect the underlying distribution more.
\[\hat{Y} \pm 1.645 \sqrt{\mathrm{Var}(\hat{Y})}.\]
We compute this error margin in the spreadsheet:
Cell | Formula | Value
B1 | =115 | 115
B2 | =12 | 12
B3 | =AVERAGE(A1:A12) | 212.5
B4 | =B3 * B1 | 24437.5
B5 | =VAR(A1:A12) | 16347.73
B6 | =B5/B2 | 1362.31
B7 | =B1^2 * B6 | 18016549.75
B8 | =1.645 * SQRT(B7) | 6982.35
In other words, our 90 % confidence interval is roughly \(24.4 \pm 7\) tb, or 17–31 tb.
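If you prefer code to spreadsheets, the whole calculation so far fits in a few lines of Python. This is a sketch of the same arithmetic; the function name estimate_total is made up for this article, not an established recipe.

```python
import math
import statistics

def estimate_total(sample, population_size, z=1.645):
    """Estimate a population total from a random sample.

    Returns the point estimate and the half-width of the
    (roughly 90 % for z=1.645) confidence interval.
    """
    n = len(sample)
    mean = statistics.mean(sample)               # B3: sample mean
    total = population_size * mean               # B4: estimated total
    var_sample = statistics.variance(sample)     # B5: sample variance
    var_mean = var_sample / n                    # B6: variance of the mean
    var_total = population_size**2 * var_mean    # B7: variance of the total
    margin = z * math.sqrt(var_total)            # B8: error margin
    return total, margin

# The twelve disk sizes (in gb) from the first round of sampling.
sizes = [160, 240, 80, 240, 360, 360, 360, 390, 80, 120, 80, 80]
total, margin = estimate_total(sizes, 115)
print(f"{total:.0f} gb ± {margin:.0f} gb")  # 24438 gb ± 6982 gb
```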
Refinement
You may not be happy with that range of 17–31 tb. It’s wide. The reason it’s so wide is that the variation of the underlying data is high. The solution to that problem is simple: sample some more data to get a more accurate picture of the true total.8 It might not be obvious how sampling more data helps. The reason is that the sample size affects the accuracy of the estimation of the mean, and we use the estimation of the mean to estimate the total.
Let’s say we would like an uncertainty of ± 15 %, which, given our preliminary study above, ought to be approximately 4 tb. How much would we need to sample to do this? Now that we know the variance of disk sizes, we can compute this ahead of time! We need to pick a sample size \(n\) that solves
\[1.645 \sqrt{\frac{115^2}{n} \times 16348} = 4000\]
(working in gigabytes, so the 4 tb target becomes 4000 gb),
and since it’s particularly convenient for us to sample 10 % at a time (because that means clicking on the same row on every page in the ui, remember), we can try \(n\) that are multiples of 12. It turns out at \(n=36\) we think we will get the desired accuracy.
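Solving the equation for \(n\) shows where 36 comes from:

\[n = \frac{1.645^2 \times 115^2 \times 16348}{4000^2} \approx 36.6,\]

and \(n = 36\) – three full passes of 12 – lands within about 1 % of the target margin.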
So we click on some more virtual machines, following the same protocol as before but picking two new random offsets. One of the new offsets happens to be greater than five, so it finds nothing on the five-machine last page, leaving us with 35 samples instead of the planned 36. The full data set becomes
160 | 240 | 80 | 240 | 360 | 360 | 360 | 390 | 80 | 120 | 80 | 80 |
260 | 240 | 80 | 80 | 360 | 260 | 360 | 160 | 80 | 240 | 40 | 160 |
360 | 80 | 160 | 80 | 260 | 360 | 80 | 80 | 80 | 40 | 80 |
We update our spreadsheet with the new samples (and remember to update the sample size in cell B2!)
Cell | Formula | Value
B1 | =115 | 115
B2 | =35 | 35
B3 | =AVERAGE(A1:A35) | 186.57
B4 | =B3 * B1 | 21455.71
B5 | =VAR(A1:A35) | 14040.84
B6 | =B5/B2 | 401.17
B7 | =B1^2 * B6 | 5305432
B8 | =1.645 * SQRT(B7) | 3789.02
The updated mean size based on all these samples is approximately 187 gb. The point estimate of the total is then about 21.5 tb. The 90 % confidence interval around this is ± 3.8 tb.
Our updated confidence interval is, then, roughly 18–25 tb. That is about half the width of the previous interval, and within the ± 4 tb we were aiming for. We had to do a little more work than we hoped, but still only a third of what it would have been to click on all virtual machines.
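Running the hypothetical estimate_total sketch from earlier over all 35 samples reproduces the refined numbers:

```python
sizes = [
    160, 240, 80, 240, 360, 360, 360, 390, 80, 120, 80, 80,
    260, 240, 80, 80, 360, 260, 360, 160, 80, 240, 40, 160,
    360, 80, 160, 80, 260, 360, 80, 80, 80, 40, 80,
]
total, margin = estimate_total(sizes, 115)
print(f"{total:.0f} gb ± {margin:.0f} gb")  # 21456 gb ± 3789 gb
```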
Next Level
There are a couple of points we haven’t considered that a professional statistician would.
Effort–Accuracy Tradeoff
Notice what happened during the refinement – we traded effort for accuracy. We could have computed the value of this tradeoff up front! By spending 3× the effort, we were able to double the accuracy. Was this worth it?
We can evaluate the economic gain of additional accuracy in the estimation, and then compare this to the economic cost of the additional effort. In this case, I think we would have found the gain of additional accuracy to be very low, and it may not even have been worth it – if it wasn’t for this article.
This is how you recognise a good statistician:9 As opposed to a convincing pretender, like myself. They know not only that an uncertain answer is usually sufficient. They also know the exact uncertainty of their answer. In fact, they try to find out what the uncertainty of their answer will be before spending the effort answering. And, critically, they know the economic value of uncertainty reduction, and how much it makes sense to pay to achieve any given uncertainty level.
Finite Population Correction
I have lied to you somewhat, but exactly how might be difficult to explain.
See, we had a population of 115 virtual machines. These are a finite population of specific virtual machines, of which we sampled 35. Here’s a question we might want to answer:
If someone points at one of these specific 115 virtual machines, what’s our best guess for its disk size?
The answer to that question is the mean disk size of our sample of 35, i.e. 187 gb.
Here’s a very similar but subtly different question we might also want to answer:
If someone points at some virtual machine, guaranteed to be statistically similar to the 115, but not necessarily a part of that specific population, what’s our best guess for its disk size?
It turns out this question has the same answer: 187 gb.
However, since the first question is asking about a finite population of 115 specific virtual machines, and we have already peeked at 35 of these specific virtual machines, there’s actually less uncertainty in our first answer than in the second.
When is the second question relevant? For example, we might wonder what the disk size of the next virtual machine to be created will be. That machine is by definition not part of the specific 115 machines we have now, but will be a completely new one, still statistically similar to the ones we have. Or these virtual machines might be ephemeral and be destroyed and created constantly depending on workloads – then next month we will have a completely new population, yet maybe statistically similar to the current one.
The variance we computed for the mean disk size in this article actually implied we were answering the second question. In other words, 18–25 tb is the confidence interval we would give for the total disk size of some set of 115 virtual machines, statistically similar to the 115 we have, but not necessarily the exact same ones.
But the question we wanted to ask was about the total disk space allocated to this specific population of 115 virtual machines. So the uncertainty in our answer should actually be lower than we computed. Specifically, the variance should be multiplied by a finite population correction factor. This factor is given by
\[\mathrm{fpc} = \frac{N-n}{N-1}.\]
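Plugging in our numbers:

\[\mathrm{fpc}_{12} = \frac{115-12}{115-1} \approx 0.90, \qquad \mathrm{fpc}_{35} = \frac{115-35}{115-1} \approx 0.70.\]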
For our first round of 12 samples, the actual variance is about 90 % of what we computed before. For the 35 samples we ended up with, it is about 70 % of our previously computed variance – and since the error margin is proportional to the square root of the variance, the margin shrinks by the square root of the fpc. That means our real estimation of the total allocated disk size ought to be
Cell | Formula | Value
B1 | =115 | 115
B2 | =35 | 35
B3 | =AVERAGE(A1:A35) | 186.57
B4 | =B3 * B1 | 21455.71
B5 | =VAR(A1:A35) | 14040.84
B6 | =B5/B2 | 401.17
B7 | =B1^2 * B6 | 5305432
B8 | =1.645 * SQRT(B7) | 3789.02
B9 | =SQRT((B1-B2)/(B1-1)) * B8 | 3174.09
Stated in text, we have a 90 % interval of roughly 18.3–24.6 tb – a little tighter than the uncorrected interval. In fact, if we had accounted for the fpc from the start, we could have gotten away with about 28 samples and still hit our accuracy target of 4 tb.
Corrigenda
This article originally stated – incorrectly – that the denominator of the fpc was N. This is now fixed with the correct denominator, N-1. Fortunately, mistaking N-1 for N when the population is large enough that you’re interested in sampling in the first place is not the worst of mistakes.