Brier Score
The accuracy of a binary prediction (i.e. will an event happen or not) can be evaluated through the Brier score. For all the faults of the Brier score (it does not account for question difficulty, does not penalise overconfidence as strongly as e.g. a log score, and cannot account for mutually exclusive forecasts, which implies it also cannot handle forecasting full distributions), it’s the most standard score we have to compare the accuracy of predictions and predictors.
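Concretely, the binary Brier score is just the mean squared difference between the forecast probabilities and the 0/1 outcomes. A minimal sketch in Python (the function name is mine, not from any particular library):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (1 = happened, 0 = did not)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
```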
Here are some typical Brier scores, gathered from various sources: Bias, Information, Noise: The BIN Model of Forecasting (Satopää, Salikhov, Tetlock, Mellers; Management Science; 2021; preprint available on Google Scholar); Superforecasters: A Decade of Stochastic Dominance (Karvetski; Good Judgment; 2021); and personal experience plus studying past performance of other people in forecasting competitions. The point of this table is to give a non-forecaster, or beginning forecaster, a sense of how good a score is – in particular, don’t look at these particular values as ultimate truths; they are in the right ballpark only. Since scores are sensitive both to forecaster skill and the time frame covered by the question (longer-term questions are more difficult), the table lists typical scores across both axes.
| Forecaster | > 12 months | 1–12 months | < 1 month |
|---|---|---|---|
| People off the street | 0.50 | 0.45 | 0.30 |
| Enthusiastic amateurs | 0.45 | 0.23 | 0.18 |
| Teams of enthusiasts | 0.35 | 0.19 | 0.14 |
| The best forecasters | 0.30 | 0.15 | 0.10 |
Although it might be obvious from the above, it should be said that a Brier score is a measure of forecasting error, which means a lower score is better. If one knows the outcome of every question, one can achieve a Brier score of zero – no error – by forecasting 100 % probability to all the events that are true and 0 % to those that are false. In practice, the worst Brier score one can get is 0.5, which indicates that one has predicted with full confidence but there’s no correlation between forecast and outcome. (A Brier score worse than 0.5 usually indicates sampling error, but a good forecaster could hypothetically grief an evaluator by getting a Brier score close to 1, which indicates that there is a strong correlation between forecast and outcome – only that correlation is negative!)
Note that forecasting 50 % on every question gives a Brier score of 0.25, so people off the street do worse than coin flips across all time frames. This is due to overconfidence: since the Brier score is in effect a quadratic error, getting one thing very wrong costs more than getting many things a little right saves.
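To make those reference points concrete, here is the brier_score sketch from above applied to a few made-up forecast sets:

```python
# Perfect foresight: full confidence on the correct side of every question.
brier_score([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])   # 0.0

# Maximal hedging: 50 % on everything.
brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])   # 0.25

# Overconfidence: one confident miss costs more than nine good calls combined.
brier_score([0.9] * 9 + [0.95], [1] * 9 + [0])    # ~0.10
brier_score([0.9] * 9 + [0.50], [1] * 9 + [0])    # ~0.03 had that last miss been hedged
```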
Other factors that affect the score
There are many factors that determine how achievable a particular Brier score is, aside from time frame and forecasting skill. These include, but are not limited to:
- Question difficulty
- Sampling error
- Blind vs. unblind forecasts
Explanations of these are below.
Question difficulty
Something like “Will there be a million humans spending more than a year on Mars before 2035?” is a long-term question, but it’s still a fairly clear no due to physics. This is an easy question, much like “Will it rain in Arizona?” is easy: the answer is almost always no.
The numbers in the table are intended to represent the sort of question difficulty one encounters in a typical forecasting competition, i.e. the goldilocks zone where it is possible to beat random guessing, but not so easy that anyone could get the answer right.
Sample precision of Brier scores
Also note that the Brier score is essentially a mean squared error, and like any mean squared error, estimates of it are not very precise from small samples. A relatively large forecasting competition has 50-ish questions, and this easily gives well over 10 % sampling error in the estimate of the Brier score.
This means someone who on average gets a Brier score of 0.21 can, when participating in a 30-question competition, get a Brier score of 0.16 out of sheer luck.
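One way to see this is with a small Monte Carlo sketch. Assume, purely for illustration, a perfectly calibrated forecaster who puts 70 % on the correct side of every question – which gives a long-run Brier score of 0.21 – and look at how much their score fluctuates across 30-question competitions:

```python
import random

def one_competition(n_questions=30, p=0.7):
    """Score of a calibrated forecaster: they forecast p, and the event resolves 'yes' with probability p."""
    total = 0.0
    for _ in range(n_questions):
        outcome = 1 if random.random() < p else 0
        total += (p - outcome) ** 2
    return total / n_questions

scores = sorted(one_competition() for _ in range(10_000))
print(scores[500], scores[5_000], scores[9_500])  # roughly the 5th, 50th and 95th percentiles
```

The lucky end of that distribution reaches roughly 0.16, even though the forecaster’s true long-run score is 0.21.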
Blind vs. unblind forecasting
Another mistake people often make is comparing blind to unblind forecasts. When I list Brier scores for myself or for comparison, I am generally speaking about blind forecasts. These are forecasts made without looking at what forecasts other people have made on the same question. Blindness is different from research effort: a forecast can be blind but well-researched.
The reason this distinction matters is that it’s very easy to get a great Brier score if one makes unblind forecasts: one can simply forecast the same as the average of what other people have forecast, because the forecasting community is a very valuable information source. This, clearly, says nothing about the skill of the individual forecaster.
Community-weighted expected Brier score
Sometimes I want to evaluate forecasting performance on questions that have not yet closed. In those cases, I compute the Brier score under both outcomes (it happens or it does not), and take a weighted average of these values using the current community forecast as the weights. This is based on the assumption that the community forecast is an accurate representation of the probability of an event happening – for high-quality forecasting communities like Metaculus, that is usually the case.
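A minimal sketch of that computation, assuming a binary question where community is the community’s current probability of the event happening (the names here are mine, for illustration):

```python
def expected_brier(own_forecast, community):
    """Brier score under each possible resolution, weighted by the community's probability of 'yes'."""
    score_if_yes = (own_forecast - 1) ** 2
    score_if_no = own_forecast ** 2
    return community * score_if_yes + (1 - community) * score_if_no

# E.g. forecasting 60 % on a question where the community sits at 80 %:
expected_brier(0.60, 0.80)  # 0.8 * 0.16 + 0.2 * 0.36 = 0.2
```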