Brier Score
The accuracy of a binary prediction (i.e. will an event happen or not?) can be evaluated through a Brier score. Here are some typical Brier scores, gathered from various sources: the BIN model paper (Bias, Information, Noise: The BIN Model of Forecasting; Satopää, Salikhov, Tetlock, Mellers; Management Science; 2021; preprint available on Google Scholar), a Good Judgment report (Superforecasters: A Decade of Stochastic Dominance; Karvetski; Good Judgment; 2021), and personal experience plus studying the past performance of other people in forecasting competitions. The point of this table is to give a non-forecaster, or beginning forecaster, a sense of how good a score is. In particular, don't take these particular values as ultimate truths; they are in the right ballpark only.
Since scores are sensitive both to forecaster skill and to the time frame covered by the question (longer-term questions are more difficult), this table lists typical scores across both axes.
Forecaster | > 12 months | 1–12 months | < 1 month |
---|---|---|---|
People off the street | 0.70 | 0.50 | 0.30 |
Enthusiastic amateurs | 0.45 | 0.23 | 0.18 |
Teams of enthusiasts | 0.35 | 0.19 | 0.14 |
The best forecasters | 0.30 | 0.15 | 0.10 |
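For concreteness, the Brier score behind these numbers is simply the mean squared difference between each forecast probability and the outcome coded as 0 or 1. A minimal sketch in Python, with made-up forecasts and outcomes:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up forecasts (probability the event happens) and what actually happened.
forecasts = [0.80, 0.30, 0.95, 0.50]
outcomes  = [1,    0,    1,    0]

print(brier_score(forecasts, outcomes))  # ~0.096; 0 is perfect, 0.25 is always guessing 50 %
```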
Other factors that affect the score
There are many factors that determine how achievable a particular Brier score is, aside from time frame and forecasting skill. These include, but are not limited to:
- Question difficulty
- Sampling error
- Blind vs. unblind forecasts
Explanations of these are below.
Question difficulty
Something like "Will there be a million humans spending more than a year on Mars before 2035?" is a long-term question, but it's still a fairly clear no due to physics. This is an easy question, much like "Will it rain in Arizona?" is easy: the answer is almost always no.
The numbers in the table are intended to represent the sort of question difficulty one encounters in a typical forecasting competition, i.e. in the Goldilocks zone where it is possible to beat random guessing, but not so easy that anyone could get it right.
Sample precision of Brier scores
Also note that Brier scores are essentially a mean squared error, and like any mean squared error, they are not estimated very precisely from small samples. A relatively large forecasting competition has 50-ish questions, and this easily gives well over 10 % sampling error in the estimation of the Brier score.
This means someone who on average gets a Brier score of 0.21 can, when participating in a 30-question competition, get a Brier score of 0.16 out of sheer luck.
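To get a feel for how large this luck factor is, here is a small simulation sketch in Python. The setup is an assumption of mine, not from the sources above: the forecaster always reports the true probability of each event, and those probabilities are drawn uniformly between 0.15 and 0.85, which gives a long-run Brier score of roughly 0.21. The simulation then scores many 30-question competitions and counts how often sheer outcome luck produces a score of 0.16 or better.

```python
import random

def simulate_competition(n_questions=30):
    """One competition: the forecaster reports the true probability of each
    event exactly, so all scatter in the score comes from outcome luck."""
    total = 0.0
    for _ in range(n_questions):
        p = random.uniform(0.15, 0.85)      # assumed spread of true probabilities
        outcome = 1 if random.random() < p else 0
        total += (p - outcome) ** 2
    return total / n_questions

random.seed(0)
scores = [simulate_competition() for _ in range(10_000)]
mean = sum(scores) / len(scores)
lucky = sum(s <= 0.16 for s in scores) / len(scores)
print(f"long-run Brier score: {mean:.2f}")                      # about 0.21 by construction
print(f"30-question competitions scoring 0.16 or better: {lucky:.0%}")
```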
Blind vs. unblind forecasting
Another mistake people often make is to compare blind to unblind forecasts. When I list Brier scores for myself or for comparison, I am generally speaking about blind forecasts: forecasts made without looking at what other people have forecast on the same question.
The forecasting community is a very valuable information source, and one can get great Brier scores simply by forecasting the community average on every question. This says nothing about the skill of the forecaster.
Community-weighted expected Brier score
Sometimes I want to evaluate forecasting performance on questions that have not yet closed. In those cases, I compute the Brier score under both outcomes (it happens or it does not), and take a weighted average of these values using the current community forecast as the weights. This effectively assumes the community forecast is an accurate representation of the probability of an event happening, which seems reasonable.
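As a sketch of that computation (the function name and the numbers are illustrative, not taken from any particular platform): score the forecast against both possible outcomes, then average those two scores weighted by the current community probability.

```python
def expected_brier(my_forecast, community_forecast):
    """Expected Brier score of a forecast on an open question, treating the
    community forecast as the true probability that the event happens."""
    score_if_yes = (my_forecast - 1) ** 2
    score_if_no  = (my_forecast - 0) ** 2
    return community_forecast * score_if_yes + (1 - community_forecast) * score_if_no

# Illustrative numbers: I forecast 70 %, the community currently says 60 %.
print(expected_brier(0.70, 0.60))  # 0.6 * 0.09 + 0.4 * 0.49 = 0.25
```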