Brier Score
The accuracy of a binary prediction (i.e. will an event happen or not) can be evaluated through the Brier score. For all the faults of the Brier score (it does not account for question difficulty, does not penalise overconfidence as strongly as e.g. a log score, and cannot account for mutually exclusive forecasts, which implies it also cannot handle forecasting full distributions), it’s the most standard score we have to compare the accuracy of predictions and predictors.
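Concretely, the binary Brier score is just the mean squared difference between the forecast probabilities and the 0/1 outcomes. A minimal sketch in Python (the function name is mine, not from any particular library):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (1 = happened, 0 = did not)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
```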
Here are some typical Brier scores, gathered from various sources: Bias, Information, Noise: The BIN Model of Forecasting (Satopää, Salikhov, Tetlock, Mellers; Management Science; 2021; preprint available on Google Scholar); Superforecasters: A Decade of Stochastic Dominance (Karvetski; Good Judgment; 2021); and personal experience plus studying past performance of other people in forecasting competitions. The point of this table is to give a non-forecaster, or beginning forecaster, a sense of how good a score is – in particular, don’t look at these particular values as ultimate truths; they are in the right ballpark only. Since scores are sensitive both to forecaster skill and the time frame covered by the question (longer-term questions are more difficult), the table lists typical scores across both axes.
| Forecaster | > 12 months | 1–12 months | < 1 month |
|---|---|---|---|
| People off the street | 0.50 | 0.45 | 0.30 |
| Enthusiastic amateurs | 0.45 | 0.23 | 0.18 |
| Teams of enthusiasts | 0.35 | 0.19 | 0.14 |
| The best forecasters | 0.30 | 0.15 | 0.10 |
Although it might be obvious from the above, it should be said that a Brier score is a measure of forecasting error, which means a lower score is better. If one knows the outcome of every question, one can achieve a Brier score of zero – no error – by forecasting 100 % probability to all the events that are true and 0 % to those that are false. In practice, the worst Brier score one can get is 0.5, which indicates that one has predicted with full confidence but there’s no correlation between forecast and outcome. (A Brier score worse than 0.5 usually indicates sampling error, but a good forecaster could hypothetically grief an evaluator by getting a Brier score close to 1, which indicates that there is a strong correlation between forecast and outcome – only that correlation is negative!)
Note that forecasting 50 % on every question gives a Brier score of 0.25, so people off the street do worse than coin flips across all time frames. This is due to overconfidence: since the Brier score is in effect a quadratic error, getting one thing very wrong costs more than getting many things a little right saves.
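To make those reference points concrete, here is the brier_score sketch from above applied to a few made-up forecast sets:

```python
# Perfect foresight: full confidence on the correct side of every question.
brier_score([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])   # 0.0

# Maximal hedging: 50 % on everything.
brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])   # 0.25

# Overconfidence: one confident miss costs more than nine good calls combined.
brier_score([0.9] * 9 + [0.95], [1] * 9 + [0])    # ~0.10
brier_score([0.9] * 9 + [0.50], [1] * 9 + [0])    # ~0.03 had that last miss been hedged
```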
Other factors that affect the score
There are many factors that determine how achievable a particular Brier score is, aside from time frame and forecasting skill. These include, but are not limited to:
- Question difficulty
- Sampling error
- Blind vs. unblind forecasts
Explanations of these are below.
Question difficulty
Something like “Will there be a million humans spending more than a year on Mars before 2035?” is a long-term question, but it’s still a fairly clear no due to physics. This is an easy question, much like “Will it rain in Arizona?” is easy: the answer is almost always no.
The numbers in the table are intended to represent the sort of question difficulty one encounters in a typical forecasting competition, i.e. the goldilocks zone where it is possible to beat random guessing, but not so easy that anyone could get the answer right.
Sample precision of Brier scores
Also note that the Brier score is essentially a mean squared error, and like any mean squared error, estimates of it are not very precise from small samples. A relatively large forecasting competition has 50-ish questions, and this easily gives well over 10 % sampling error in the estimate of the Brier score.
This means someone who on average gets a Brier score of 0.21 can, when participating in a 30-question competition, get a Brier score of 0.16 out of sheer luck.
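One way to see this is with a small Monte Carlo sketch. Assume, purely for illustration, a perfectly calibrated forecaster who puts 70 % on the correct side of every question – which gives a long-run Brier score of 0.21 – and look at how much their score fluctuates across 30-question competitions:

```python
import random

def one_competition(n_questions=30, p=0.7):
    """Score of a calibrated forecaster: they forecast p, and the event resolves 'yes' with probability p."""
    total = 0.0
    for _ in range(n_questions):
        outcome = 1 if random.random() < p else 0
        total += (p - outcome) ** 2
    return total / n_questions

scores = sorted(one_competition() for _ in range(10_000))
print(scores[500], scores[5_000], scores[9_500])  # roughly the 5th, 50th and 95th percentiles
```

The lucky end of that distribution reaches roughly 0.16, even though the forecaster’s true long-run score is 0.21.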
Blind vs. unblind forecasting
Another mistake people often make is comparing blind to unblind forecasts. When I list Brier scores for myself or for comparison, I am generally speaking about blind forecasts. These are forecasts made without looking at what forecasts other people have made on the same question. Blindness is different from research effort: a forecast can be blind but well-researched.
The reason this distinction matters is that it’s very easy to get a great Brier score if one makes unblind forecasts: one can simply forecast the same as the average of what other people have forecast, because the forecasting community is a very valuable information source. This, clearly, says nothing about the skill of the individual forecaster.
Community-weighted expected Brier score
Sometimes I want to evaluate forecasting performance on questions that have not yet closed. In those cases, I compute the Brier score under both outcomes (it happens or it does not), and take a weighted average of these values using the current community forecast as the weights. This is based on the assumption that the community forecast is an accurate representation of the probability of an event happening – for high-quality forecasting communities like Metaculus, that is usually the case.
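A minimal sketch of that computation, assuming a binary question where community is the community’s current probability of the event happening (the names here are mine, for illustration):

```python
def expected_brier(own_forecast, community):
    """Brier score under each possible resolution, weighted by the community's probability of 'yes'."""
    score_if_yes = (own_forecast - 1) ** 2
    score_if_no = own_forecast ** 2
    return community * score_if_yes + (1 - community) * score_if_no

# E.g. forecasting 60 % on a question where the community sits at 80 %:
expected_brier(0.60, 0.80)  # 0.8 * 0.16 + 0.2 * 0.36 = 0.2
```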