Forecasting Accuracy
Reader Dan Turner pointed me toward a highly interesting paper co-authored by Tetlock, about the BIN model of forecasting.1 Bias, Information, Noise: The BIN Model of Forecasting; Satopää, Salikhov, Tetlock, Mellers; Management Science; 2021. Preprint available on Google Scholar.
Summary
The main takeaways from this paper are:
- Most of the improvements we know how to make to forecasting accuracy come from reducing noise.
- Training has a small noise-reducing effect across all timelines.
- Teaming has a large noise-reducing effect across all timelines.
- Teaming also improves information in the short term, and reduces bias in the long term.
- Superforecasters show the same sorts of improvements as teaming, except even greater.
Brier scores for the various groups, where each row’s Information, Bias, and Noise columns show how much of that group’s improvement over the previous row comes from each component:
Group | Brier score | Information | Bias | Noise |
---|---|---|---|---|
Baseline | 0.21 | – | – | – |
Training | 0.19 | -0.00 | -0.00 | -0.02 |
Teaming | 0.14 | -0.01 | -0.01 | -0.03 |
Superforecasters | 0.08 | -0.02 | -0.01 | -0.03 |
This makes it clear that e.g. teaming improves the Brier score by 0.05 over trained individuals, and this comes mainly from noise reduction (0.03), but also to some extent from bias reduction and information increase (0.01 each).
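As a sanity check on that reading of the table (mine, not a calculation from the paper), each row’s Brier score can be reconstructed from the previous row’s score minus the listed component improvements:

```python
# Reconstruct each group's Brier score from the previous group's score
# minus the component improvements listed in the table above.
baseline = 0.21
steps = [
    # (group, reported score, information, bias, noise)
    ("Training",         0.19, 0.00, 0.00, 0.02),
    ("Teaming",          0.14, 0.01, 0.01, 0.03),
    ("Superforecasters", 0.08, 0.02, 0.01, 0.03),
]

score = baseline
for group, reported, info, bias, noise in steps:
    score = round(score - (info + bias + noise), 2)
    print(f"{group}: reconstructed {score:.2f}, reported {reported:.2f}")
```

The reconstructed and reported scores agree for every row, which is what makes me read the columns as per-step improvements.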
Brier Score Benchmark
Forecasts are often evaluated using the Brier score. You can find the full details elsewhere on the web, but it’s useful to know that the Brier score effectively runs from 0 (perfect predictions every time) to 0.25 (the score you get by always guessing 50 %, i.e. being clueless about everything). As may be clear already, a lower Brier score is better. Nobody gets a 0, and presumably people are a little better than 0.25, but how well do people perform?
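To make the scale concrete, here is a minimal sketch (my own illustration in Python, not anything from the paper) of the single-outcome Brier score the numbers in this post are on:

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and the actual 0/1 outcome."""
    return (forecast - outcome) ** 2

# A perfect forecaster scores 0; always guessing 50 % scores 0.25.
print(brier(1.0, 1))   # 0.0  -- confident and right
print(brier(0.5, 1))   # 0.25 -- clueless
print(brier(0.5, 0))   # 0.25 -- clueless
print(brier(0.9, 0))   # 0.81 -- confident and wrong is worse than clueless
```

Scores reported over many questions are the average of these per-question scores, which is why being consistently and confidently wrong can in principle push you above 0.25.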
According to this paper,
- The baseline Brier score on a 30-day forecast is 0.21 (in other words, the study subjects are somewhat better than random guessing).2 This is also the average Tetlock reports from the Good Judgment Project in Superforecasting, lending it additional credibility.
- If subjects get some training in forecasting their Brier score improves to 0.19.
- If subjects are put into teams instructed to question and discuss each other’s forecasts, their Brier score improves to 0.14.
- Teams of superforecasters (the best 2 % of the subjects) have a Brier score as low as 0.08.3 In the first year of the Good Judgment Project, the best subjects had a Brier score of 0.13 – and it’s unsurprising that teaming superforecasters together would improve their Brier score significantly.
What the BIN paper works out, which I find really fascinating, is some of the mechanics behind these improvements. What exactly is it about a person’s forecast that changes when they are put in a team? Let’s find out!
The Bias–Information–Noise Model
The BIN paper breaks forecasting accuracy down into three components (a toy simulation after the list below shows how each one hurts accuracy):
- Information: The less information we have about something, the less accurate our forecasts will be.
- Bias: Mental and external biases cause us to consistently forecast events as more or less likely than they really are in a given situation, which reduces accuracy.
- Noise: Interpreting irrelevant information as a useful signal causes us to spuriously adjust our forecasts away from what would be accurate.4 In sports betting, the general public overcorrects on big news. A star player spraining their ankle may affect the probability that team Alpha wins, but not as much as you would think. I have also seen this effect in relation to geopolitics on Metaculus, where isolated events cause large movements in community probabilities.
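Here is that toy Monte Carlo sketch (again my own illustration; the paper’s actual model is more sophisticated and works in a Gaussian latent space). Each stylised forecaster starts from the true event probabilities and then either loses information, adds a constant bias, or adds random noise; all three push the Brier score up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True probability of each event, and whether it actually happened.
p = rng.uniform(0.05, 0.95, n)
outcome = (rng.random(n) < p).astype(float)

def brier(forecast):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return np.mean((forecast - outcome) ** 2)

forecasters = {
    "ideal (full information)": p,                                         # knows each true probability
    "low information":          np.full(n, p.mean()),                      # only knows the base rate
    "biased":                   np.clip(p + 0.10, 0, 1),                   # consistently overestimates
    "noisy":                    np.clip(p + rng.normal(0, 0.15, n), 0, 1), # over-reacts to irrelevant signals
}

for name, forecast in forecasters.items():
    print(f"{name:25s} Brier = {brier(forecast):.3f}")
```

Even the ideal forecaster does not score 0 here, because the events themselves are genuinely uncertain; the other three each score worse, for exactly the reasons the list above describes.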
It is easy to jump to the conclusion that information is the key to strong forecasting. It sounds so obvious: to predict the future, we have to know a lot of things, and then we apply knowledge to derive what the future must look like. This is not really true in any practical sense; the future is uncertain enough that additional information quickly reaches a point of diminishing returns.
Training reduces noise
Researchers have long understood that more information is not necessarily better. Instead, they would argue that biases are the problem behind weak forecasts: when we make forecasts, we fall for mental and external biases, such as narrative bias or management pressure. Training in prediction focuses on reducing vulnerability to biases, and, as we saw above, that training works; it improves the Brier score by about 10 %! All evidence points to bias being the problem.
Then Kahneman wrote an entire book5 Noise: A Flaw in Human Judgment; Kahneman, Sibony, Sunstein; William Collins; 2021. on why bias is not as big a deal as we make it – in practice, noise is often large enough that its effects overwhelm those of any bias present6 Fans of statistical process control will recognise this tune: we focus on reducing noise first, because noisy measurements hide any improvements to the trend we want to make.. But noise is also difficult to measure, so both it and its effects are understudied.
In the BIN paper, the authors found that the training aimed at reducing bias actually appears to work by reducing noise. For all practical purposes, the entire effect of training is noise reduction. Oops. But it is also good news, because now we know why training works, and that noise has a significant effect.
The authors also find that the majority of the improvement from teaming is reduced noise. The members of the team can temper each other and prevent over-reactions to insignificant data. Superforecasters again have reduced noise compared to regular teams. Most of the improvement as we go up the ladder of better forecasting comes from reduced noise.
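Continuing the toy simulation from above (still my own sketch, not the paper’s model; real teams deliberate rather than mechanically average), pooling several forecasters whose noise is independent cancels much of that noise, while a shared bias survives the averaging:

```python
import numpy as np

rng = np.random.default_rng(1)
n, team_size = 100_000, 5

p = rng.uniform(0.05, 0.95, n)
outcome = (rng.random(n) < p).astype(float)

def brier(forecast):
    return np.mean((forecast - outcome) ** 2)

# Each hypothetical team member shares the same small bias but has independent noise.
members = np.clip(p + 0.05 + rng.normal(0, 0.15, (team_size, n)), 0, 1)

print(f"lone forecaster Brier = {brier(members[0]):.3f}")            # full bias and full noise
print(f"team average    Brier = {brier(members.mean(axis=0)):.3f}")  # bias remains, noise largely cancels
```

This matches the pattern in the table above, where teaming’s improvement comes mostly from the noise column.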
Information helps short term
However, when comparing individuals to teams there are also two other effects in play. First up is information. Aside from the main effect of teaming, which is reduced noise, teaming also improves accuracy on short-term forecasts (a month or less) due to improved information. This makes sense: when multiple people get together and second-guess each other’s forecasts, clearly some people will sit on information other people have missed. No mystery there.
What’s interesting to me7 This conclusion is my own, not the authors’. Since the authors do not point this out, either it is already established in the field, or I’m missing some flaw in my argument that invalidates it. Let me know if you know! is that this improvement due to added information peters out at about 30 days. This means we can classify forecasts as short term, where injecting additional information improves accuracy, or long term, where additional information no longer helps, with the threshold sitting at around 30 days.
Teams also reduce bias
Teaming also reduces bias somewhat. The paper speculates that some of this might be because team members collaborate by catching each other’s mental biases, which are notoriously hard to self-monitor for.
But here’s what’s odd: the debiasing effect is very small for short-term forecasts, and becomes significant only in long-term forecasts. The authors do not explain this. I can only speculate, from personal experience, that biases compound as the time horizon extends: the debiasing effect of teaming exists in short-term forecasting too, but biases themselves simply have a comparatively small effect on short-term forecasts.
Superforecasters are teams-of-one
Superforecasters exhibit basically the same improvements as teaming8 Up to and including the contribution differences in short-term/long-term forecasts., except even stronger. In Superforecasting9 Superforecasting: The Art and Science of Prediction; Tetlock & Gardner; Cornerstone Digital; 2015., Tetlock describes superforecasters as performing the job of teams except in one brain, so it’s not too surprising that they have similar improvement patterns.