Estimating Standard Deviation From Timeseries
Here’s an example of some data I’ve seen recently:
We might think these are practically random numbers, but we may still be interested in guessing what the next number will be. (The label on the Y axis indicates that someone is likely to start asking the question at some point.)
The next number is likely to lie somewhere around the mean of the numbers we’ve seen. We can find out what that is by eyeballing the chart and punching the values into a desktop calculator (in my case, an RPN calculator):
1 ENT 1 + 3 + 1.5 + 9 + 1 + 6 + 7 /
We get back about 3.2, which at $10,000 per unit on the Y axis is $32,000 – great! But this just lets us forecast that there’s a 50 % chance next year’s profit will be less than $32,000 – hardly a sophisticated forecast. Anyone could have told us something like that after spending 10 seconds with the plot.
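For anyone without an RPN calculator at hand, the same arithmetic can be sketched in Python. The list below is just the numbers from the keystrokes above; reading them as units of $10,000 is an assumption, inferred from the $32,000 result.

```python
# Values eyeballed from the plot, taken from the calculator keystrokes.
# Assumption: each unit corresponds to $10,000 on the Y axis.
profits = [1, 1, 3, 1.5, 9, 1, 6]

# The mean is the sum of the values divided by how many there are.
mean = sum(profits) / len(profits)

print(f"mean: {mean:.2f} units, roughly ${mean * 10_000:,.0f}")
```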
We want to forecast an interval of likely values for the profit. It would help us to a first approximation if we knew the standard deviation of the sequence, but even the quick, desktop calculator-friendly computation is somewhat tedious.
The trick is consecutive differences
Here’s a trick, borrowed from statistical process control, that I figured out while competing in forecasting tournaments (where I want to get a quick statistical grasp of some data, but I also don’t want to spend too long analysing it).
If we take the average absolute difference between consecutive data points, we get another measure of dispersion (called the average moving range in SPC literature). In this case, that would require punching into the calculator
0 ENT 2 + 1.5 + 7.5 + 8 + 5 + 6 /
and we get back $40,000. For stable, thin-tailed distributions, the average moving range is about 1.128 times the standard deviation (see the article on statistical process control for some evidence of this). Dividing by 1.128 gives us an estimated standard deviation of $35,000 for the numbers in the plot above.
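The moving-range estimate can be sketched in a few lines of Python. The data are the same eyeballed values as before (assumed to be in units of $10,000), and the names `moving_ranges`, `avg_moving_range` and `D2` are only illustrative. The constant 1.128 is the one quoted above; in SPC tables it appears as d2 for subgroups of size two.

```python
# Same eyeballed values as in the keystrokes; units of $10,000 (assumed).
profits = [1, 1, 3, 1.5, 9, 1, 6]

# Absolute difference between each data point and the one before it.
moving_ranges = [abs(b - a) for a, b in zip(profits, profits[1:])]

# Average moving range: the mean of those consecutive differences.
avg_moving_range = sum(moving_ranges) / len(moving_ranges)

# Conversion constant quoted in the text.
D2 = 1.128
sd_estimate = avg_moving_range / D2

print(f"average moving range: {avg_moving_range:.2f}")
print(f"estimated standard deviation: {sd_estimate:.2f}")
```

Note that there is one fewer moving range than there are data points, which is why the keystrokes divide by 6 rather than 7.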
What’s great about this is that if the plot has grid lines like the one above, we get the consecutive differences just by counting the number of grid lines between each data point! We can do that on the fly as we punch numbers into the calculator. This is much easier than squaring numbers, as we would have to do to compute the variance properly.
How close did we get? The actual standard deviation for the numbers above is $32,000. Definitely close enough for a rough forecast interval of, say, -$26,000 to $90,000 for next year’s profit with 90 % probability.
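The text does not spell out how the interval was computed. One reading that reproduces it is a normal approximation: the mean plus or minus about 1.645 estimated standard deviations. The sketch below works under that assumption, using the eyeballed values in units of $10,000. The sample standard deviation of the eyeballed values comes out slightly below the quoted figure, presumably because the quoted one was computed from the exact data.

```python
import statistics

# Eyeballed values from the plot; units of $10,000 (assumed).
profits = [1, 1, 3, 1.5, 9, 1, 6]

mean = statistics.mean(profits)
sample_sd = statistics.stdev(profits)  # conventional sample standard deviation

# Moving-range estimate of the standard deviation, as described above.
moving_ranges = [abs(b - a) for a, b in zip(profits, profits[1:])]
sd_estimate = sum(moving_ranges) / len(moving_ranges) / 1.128

# Assumption: a normal approximation. A central 90 % interval leaves 5 %
# in each tail, so we need the 95th percentile of the standard normal.
z90 = statistics.NormalDist().inv_cdf(0.95)
low = mean - z90 * sd_estimate
high = mean + z90 * sd_estimate

print(f"sample sd: {sample_sd:.2f}, moving-range estimate: {sd_estimate:.2f}")
print(f"90 % interval: {low:.2f} to {high:.2f}")
```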
Why did we get a higher number when estimating the standard deviation from the moving ranges? Because the consecutive differences are not a global measurement of spread, but really account for spread-within-the-sequence-as-observed. For forecasting, that is almost always the appropriate measure of spread. In other words, if the numbers are drawn from a nice distribution when looked at atemporally, but they happen to fluctuate wildly from year to year, this method will over-estimate the standard deviation – just like we want when we produce a prediction interval.
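To see how the ordering matters, here is a small sketch with made-up numbers (not from the plot). Both sequences contain exactly the same values, so the ordinary standard deviation is identical, but the moving-range estimate differs a lot depending on how the values are arranged in time. The helper `sd_from_moving_range` is a hypothetical name for the procedure described above.

```python
import statistics

def sd_from_moving_range(xs, d2=1.128):
    """Estimate the standard deviation from the average moving range."""
    ranges = [abs(b - a) for a, b in zip(xs, xs[1:])]
    return sum(ranges) / len(ranges) / d2

# Hypothetical data: the same six values in two different orders.
alternating = [1, 9, 1, 9, 1, 9]     # fluctuates wildly from step to step
sorted_same = sorted(alternating)    # one level shift, otherwise flat

# The atemporal standard deviation ignores ordering entirely.
global_sd = statistics.stdev(alternating)
assert global_sd == statistics.stdev(sorted_same)

est_alternating = sd_from_moving_range(alternating)
est_sorted = sd_from_moving_range(sorted_same)

print(f"global sd:            {global_sd:.2f}")
print(f"estimate, alternating: {est_alternating:.2f}")
print(f"estimate, sorted:      {est_sorted:.2f}")
```

In this toy case the alternating sequence gives an estimate well above the global standard deviation, while the sorted one gives an estimate well below it, since almost all of its consecutive differences are zero.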