Entropic Thoughts

Wilks’ Tolerance Intervals

Wilks’ Tolerance Intervals

wilks-tolerance-intervals.jpg

Imagine we want to figure out what round-trip times we can expect between Sweden and New Zealand. We ping a server belonging to the University of Waikato from Stockholm, and record the following round-trip times in milliseconds.

290 388 299 290 462 292 291
293 293 308 292 292 290 294
292 333 348 292 292 293 293
292 460 408 290 350 475 290

We want to tell our friend about our experience, but we don’t want to send over this entire table. A decent way to summarise a distribution is by a tolerance interval, which means the central portion in which some fraction of the values end up. For our case, we might pick the fraction 90 %, meaning only 5 % of the data will be smaller, and 5 % will be greater than the interval.

What many people then do is compute a mean and a standard deviation and give a range that is 1.645 standard deviations on either side of the mean – this corresponds to a 90 % tolerance interval on the normal distribution when parameters are fully known. With mean and standard deviation computed from the above, we get 226–421 ms.1 324 ± 1.645 × 59 ms.

If we’re careless, we might tell our friend that 90 % of the time, we get round-trip times of 226–421 ms.


But that’s ridiculous. I’m fairly sure 226 ms never happens, since 290 ms is consistently the floor in the sample above. And well over 5 % of values in our sample are greater than 406 ms; almost 11 % of them are.

A perhaps more intuitive solution would be to use the 5 % and 95 % quantiles in the sample data. For that, we first need to sort the data by size.

290 290 290 290 290 291 292
292 292 292 292 292 292 293
293 293 293 294 299 308 333
348 350 388 408 460 462 475

Since this is 28 sample values, we compute the fifth percentile to be the 28×0.05 = 1.4th value from the start, and the 95th percentile to be the 1.4th value from the end. This means we will interpolate the lower bound between the first two numbers (290 and 290, yielding, well, 290 ms), and the upper bound between the two last values (462 and 475, yielding 470 ms).

Maybe, then, we can say the 90 % tolerance interval is 290–470 ms? This is much better.


But there’s something that feels wrong about both of the above methods: they do not account for the size of our sample. We have collected 28 values, but what if there were only 10? Would that be enough to determine a tolerance interval we could rely on? Intuition says no – we’re running the risk of having only collected central values, and underestimating the tolerance interval significantly.

Wilks’ method is a way to compute how confident we can be in a tolerance interval derived this way.2 I will not give an intuition for the theory this is based on today, but it is very neat and I wish I could take time to do it. It’s one of those things where you could have invented it. Above, we chose the 1.4th and 26.6th value from 28 samples to construct a 90 % tolerance interval. The confidence level \(\alpha\) of that interval is given by the incomplete beta function \(I_p\), evaluated with three arguments3 Wilks’ Formula Applied to Computational Tools: A Practical Discussion; Porter; Annals of Nuclear Energy; 2019.:

Mathematically, we write this as

\[\alpha = I_{0.9}\left(25.2,\; 2.8 + 1\right)\]

To illustrate graphically what happens, we represent the 28 samples as sticks and show which of them we have captured in the range we use for estimating the tolerance interval.

||||||||||||||||||||||||||||
|`----------v-------------´|
|    25.2 values inside    |
v                          v
1.4          +           1.4
    = 2.8 values outside

R implements the incomplete beta function as pbeta, so we can evaluate the confidence level of our proposed 90 % tolerance interval using R.

$ R -s -e 'pbeta(0.9, 25.2, 2.8+1)'
[1] 0.6521653

This is not good enough by any standard. A common confidence level to use is \(\alpha\) = 0.05, and 0.65 is far from that.

The skeptic reader may have suspected the problem already: is 28 samples enough to determine a 90 % tolerance interval? If we use all values gathered to build the 90 % tolerance interval, i.e. say that the interval is the minimum to the maximum (290–475 ms), what’s the confidence of that?

$ R -s -e 'pbeta(0.9, 28, 1)'
[1] 0.05233476

Hah! Just outside the traditional \(\alpha\) = 0.05 confidence level. We might be happy with that still – call it close enough – but if we wanted to exceed that \(\alpha\) level, we would have to collect more samples. So we run the ping for a little longer, and collect more data, and sort it from smallest to largest again.

290 290 290 290 290 290 290
290 290 291 291 291 292 292
292 292 292 292 292 292 292
292 292 292 293 293 293 293
293 294 294 295 295 296 299
301 303 306 308 310 320 329
333 337 338 342 344 347 348
349 350 368 372 378 381 382
388 408 439 439 460 462 475

There are 63 measurements here, so the 5 % and 95 % quantiles would be the 3.15th and the 59.85th value, i.e. a tolerance interval of 290–439 ms. To compute the confidence level of that, we compute the span inside the range, 56.7 values, and outside the range, 6.3 values. Then we plug in:

$ R -s -e 'pbeta(0.9, 56.7, 6.3+1)'
[1] 0.6039901

Slightly better than before, but almost as bad. What we can do is use more values from the sample. If we use all 63 values, i.e. say that the tolerance interval is 290–475, we achieve a confidence level of

$ R -s -e 'pbeta(0.9, 63, 1)'
[1] 0.001310021

which is well beyond the conventional level of significance. However, by choosing such a wide interval, we’re trading precision for lowered risk of error. What if we want the error rate to be in that \(\alpha\) = 0.05 spot? We need to use more than 59 values, but fewer than 63. We can ask R to figure out for us how many.

$ R -s -e 'which(pbeta(0.9, 1:63, 63:1) < 0.05)[1]'
[1] 61

In other words, for a confidence level of \(\alpha\) = 0.05, we need to use 61 out of the 63 sample values. That means we exclude the greatest and the smallest, and use the rest. Our tolerance interval, now at the desired confidence level of \(\alpha\) = 0.05, spans 290–462 ms.


That, finally, is how to construct a tolerance interval using Wilks’ method.

  1. Compute the incomplete beta function \(I_p\) for various combinations of sample values inside the range and outside the range.
  2. Pick the smallest number of values inside the range that satisfies the desired confidence level.
  3. Done!

Given that R has the incomplete beta function as pbeta this becomes easy with R. If we have \(N\) samples, and we want a 90 % tolerance interval at an \(\alpha\) = 0.05 confidence level, we run

which(pbeta(0.9, 1:N, N:1) < 0.05)[1]

and this tells us how many sample values we need to have inside the interval to get the desired confidence level.

If we plug in \(N=28\), this would have returned NA, indicating it is not possible to construct the desired tolerance interval using so few samples. On the other hand, if we plug in a ridiculous sample size like \(N\) = 10,000, this will return 9050, i.e. it will indicate that the naïve guess of 5 % and 95 % quantiles fairly represent the underlying distribution. This is what we’d expect.

Since this process never involves any of the actual sample values, we can also pre-compute tables for the number of sample values to include inside the range. The following table indicates how many sample values to keep outside the range of sample values that make up a 90 % tolerance interval at a confidence level of \(\alpha\) = 0.05.

Sample size Samples outside Comment
0–29 Impossible
30–46 0 Use full range
47–60 1 Throw away maximum
61–75 2 Throw away min and max
76–88 3 Keep roughly central 96.6 %
89–102 4 ʺ     96.1 %
150 8 ʺ     94.7 %
250 16 ʺ     93.6 %
500 38 ʺ     92.4 %
1000 84 ʺ     91.8 %
5000 464 ʺ     90.1 %

This goes on, of course, but by this point we’re rather close to just using the 5 % and 95 % quantiles of the sample, so we’ll stop here. But it might surprise you (it sure did me) that we need a whopping 5000 samples or more before the naïve limits start being reliable!


But, hey, the above table hints at an optimisation. If we want a 90 % tolerance interval at \(\alpha\) = 0.05, we didn’t need the full 63 sample values. We could have opted to collect fewer and still get it. In fact, we could have collected as few as 30 and then taken the smallest and largest as our tolerance interval bounds. That was in fact the problem Wilks’ was actually interested in solving: what’s the fewest number of samples we can collect and still produce the tolerance interval we’re interested in at a confidence level we feel good about?

Since this is independent of the actual sample values, it is easy to construct a constant-memory streaming algorithm that produces a tolerance interval for an incoming distribution in the shortest time possible: we read values and keep track of the largest and the smallest4 Aside from largest and smallest, we throw away all other sample values – that’s what makes the algorithm constant-memory., then we stop as soon as the pre-computed number of samples are consumed.

In the case of a 90 % tolerance interval at 95 % confidence we run this process for 30 samples. Here’s an awk snippet5 If you haven’t yet, read through the posix specification of awk. It’s tiny and teaches you the entire language. that does that to any stream of numbers in the terminal:

awk '
  {
    min = $0 < min || min == "" ? $0 : min;
    max = $0 > max || max == "" ? $0 : max;
    printf "%.0f %%\n", 100*NR/30;
  }
  NR == 30 { print (min "-" max); exit; };'

If we want our 90 % tolerance interval at greater or lower confidence, we’ll have to compute the minimum number of samples to read to produce it for any given level of confidence:

Confidence level (\(\alpha\)) Sample size
0.17 18
0.10 22
0.05 30
0.025 36
0.01 44
0.005 51
0.001 66

It is striking how reasonable these sample size requirements are! We can get a 90 % tolerance interval with a confidence level of 99.9 % using only 66 samples. Once we’ve seen 66 samples, we note the smallest and the biggest and the interval between the two is a 90 % tolerance interval at very high confidence.