Quarterly Cup 2024 Q3 Retrospective

I forgot to write a retrospective for the 2024 Q2 Quarterly Cup, because my approach to forecasting has become kind of boring: I only do it by gut feeling. No research, no analytics, no looking at the community prediction. This is not because my gut feeling is so great – it is because I have too many other things going on to take the time to forecast more accurately.

I still want to share what my accuracy is, though, in the interest of accountability. The headline measure for me is the average blind Brier score. Here’s how it has changed across the quarterly cups I have participated in, along with the total tournament placement for each.

Tournament    Brier score    Placement
2023 Q3       0.18           #31 / 842
2023 Q4       0.13           #2 / 838
2024 Q1       0.25           #552 / 709
2024 Q2       0.18           #17 / 1002
2024 Q3       0.11           #16 / 760
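For reference, the Brier score is just the mean squared difference between the forecast probability and the outcome (0 for no, 1 for yes), averaged over questions. A minimal Python sketch of that calculation – the forecasts in it are made up, not my actual predictions:

```python
# Average Brier score over a set of binary forecasts: the mean squared
# difference between forecast probability and outcome. These example
# forecasts are made up, not actual predictions.
forecasts = [
    (0.63, 1),  # forecast 63 %, question resolved yes
    (0.25, 1),  # forecast 25 %, question resolved yes
    (0.45, 0),  # forecast 45 %, question resolved no
]

brier = sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)
print(f"Average Brier score: {brier:.2f}")
# Always forecasting 50 % gives 0.25, so anything well below that means
# the forecasts carry real information.
```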

The first quarter of 2024 was a disaster: it was the first time I tried forecasting without accounting for the community prediction, and I was a little overconfident from my strong performance in the quarter before that. It is clear I have managed to work on both of those problems, with the Brier score improving since then.

As mentioned before, the placement is a misleading statistic. Most of those participants do not try very hard. Some are not even aware they are in the competition, because they are auto-enrolled when they forecast on any of the questions that are part of it.

Unfortunately, while Metaculus used to display results in such a way that one could figure out which participants did give it an honest shot, several improvements to simplify and streamline the experience on the platform have removed these numbers from public display.1 In fact, with the latest open source rewrite, one cannot even find out one’s own Brier score over a limited time frame – the feature I used to extract the Brier scores above. Some of this data may be available through the API, and a proper retrospective would probably be even better when performed with API support, rather than cobbled together from a spreadsheet of values pasted from the web interface, as I did previously.

But! As mentioned in the introduction, I devote much less time to forecasting these days. I don’t do any research, I only update my forecast a couple of times a week (to account for time decay and new information in the question comments), and I have not taken the time to explore the performance data available over the API.
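If I ever do take the time, the shape of the script would probably look something like the sketch below. The endpoint path, query parameters, and field names are guesses on my part – I have not checked them against the current API, so treat this as illustration rather than working code.

```python
import requests

# Hypothetical sketch only: the endpoint path, query parameters, and response
# fields below are assumptions, not the documented Metaculus API.
API_URL = "https://www.metaculus.com/api2/questions/"  # assumed endpoint

def my_brier_scores(username: str) -> list[float]:
    """Fetch resolved binary questions and compute one Brier score per question."""
    response = requests.get(
        API_URL,
        params={"forecaster": username, "status": "resolved"},  # assumed parameters
        timeout=30,
    )
    response.raise_for_status()
    scores = []
    for question in response.json().get("results", []):
        prediction = question.get("my_prediction")  # assumed field name
        resolution = question.get("resolution")     # assumed field name, 0 or 1
        if prediction is None or resolution not in (0, 1):
            continue
        scores.append((prediction - resolution) ** 2)
    return scores

# scores = my_brier_scores("some-username")
# print(sum(scores) / len(scores))
```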

Techniques and methods

I used to use the community prediction as an information source – not by following it, but by taking it into account when adjusting my forecast. This is a very powerful strategy because the community prediction contains a lot of information. I don’t want to become reliant on it, though, so I have stopped looking at it almost entirely. The exception: if there is a very large difference between my forecast and the community prediction, it is usually because I have misunderstood the question, so I re-read it more carefully. I try to limit my use of the community prediction to that case only: figuring out whether I have misunderstood something.

Along with my lack of methodology, this means there’s very little to say in a retrospective. I can share some questions where I did unusually well, or unusually badly, and highlight the part of my gut feeling which I think contributed to that.

Specific questions

Apple did not announce an iPhone with 40W charging

I had this question at 40–50 % when the community was at 60–75 %. People seemed to be banking on Apple following suit with other large technology companies, but I had a vague hunch that if there is one technology company that does not follow in the footsteps of others, it is Apple.

Kamala Harris said “AI” in her 2024 DNC speech

I stuck to 63 % for the duration of the question, as the community wavered from a high of 60 % down to 30–40 %. My reasoning was mainly that (a) it is very easy for this to resolve yes, since it only takes one mention, and (b) it might be in her interest to mention AI for its signaling value, to appeal to younger voters.

Joe Biden ended his candidacy for re-election

I stuck to 25 % long after the community went up to 50 % and even 80 %. I just had trouble believing a candidate would quit so late in the race – even when it would be in the party’s best interest. I have to admit I don’t understand the U.S. political system. I mean, sure, it would make sense for Biden to quit. But look at the other candidate! He is also spouting nonsense, but additionally faces potential jail time for quite serious charges and still continues to run.

Deadpool & Wolverine outsold Deadpool and The Wolverine combined

I went with 50–55 % while the community hovered around 30 %.

I could see many reasons for this to go either way: inflation, appetite for going to the movies after covid-19, etc. But more importantly, even without knowing anything about the movies in question, their titles suggest a strong correlation: if either of the two previous movies was popular, the same crowd is likely to want to see the new one too. So the new movie is not independently competing against two others; the question is whether the fans of either of the previous two will watch it. Seemed like it could happen.

France won the 2024 Warhammer 40,000 World Team Championship

I entered a rather flat distribution across countries, and it paid off. I had France at 13 % while the community had them at 6 %.

The community seemed to know a lot about this, giving very strong probabilities to some nations and weaker ones to others. I – ignorant as always – believed that the variance of this competition was higher than the community thought. The variance of competitions is almost always higher than people think.
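Mechanically, the flattening I did amounts to something like mixing a confident categorical distribution with a uniform one. A sketch with made-up numbers – the countries other than France and their probabilities are invented for illustration:

```python
# Blend a confident categorical forecast with a uniform distribution to widen
# it. The community numbers below (and the countries other than France) are
# invented for illustration.
community = {"Germany": 0.40, "USA": 0.25, "Spain": 0.15, "France": 0.06, "other": 0.14}

flat_weight = 0.5          # how much weight to hand over to "anything can happen"
uniform = 1 / len(community)

flattened = {
    country: (1 - flat_weight) * p + flat_weight * uniform
    for country, p in community.items()
}
print({country: round(p, 2) for country, p in flattened.items()})
# With these inputs, France goes from 6 % to about 13 %.
```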

China’s youth unemployment rate in August was 18.8 %

The community had assigned a probability of 7 % that it would be as extreme as it was, while I clocked in at about 5 %.

Here I miscalculated in my head. Youth unemployment tends to go up when people graduate from university, but for technical reasons this sometimes shows up in the statistics only when the next semester starts. At the time this question was asked, I thought the latest unemployment number already reflected this student-elevated situation, which meant my forecast mainly had weight below the most recent number, as I pictured those former students finding jobs. But no, August is the student-elevated month. I was off by one month.

The first round of voting in the Iranian presidential election had no winner

The community put the “other” outcome alternative at about 30 %, whereas I had it at about 50 % throughout the question’s lifetime. I always put more weight on “other”-type alternatives, because they tend to hide many alternative resolutions that we don’t think of as possibilities up front.

And indeed, this was what happened in this question. Neither I nor the community had realised that it takes a majority of votes to win the Iranian presidential election; we worked from the assumption that it only took a plurality. So when no candidate secured a majority, the election went into a run-off vote, satisfying the criterion for resolving as “other”.

MIT became #1 in the QS World University Rankings 2025

MIT was close to a given winner, which the community had at about 80 %. I, lacking a lot of information, gave them the same chance as Cambridge, i.e. about 40 %. Clearly I would have done better with more information!

The 2024 Indian election had a disproportionality measure of 9.3

The community had the actual outcome at a 2 % probability, and I had it at about 9 %. I bring this question up because it highlights a typical problem with these continuous questions: the community tends to forecast a fairly narrow distribution, with tails that taper off to near-zero very quickly. A trick I’ve used for a long time is to add another component that has low weight but a huge variance. This gives the total forecast distribution fat tails, and that seems to often be the right thing to do.

I think the interpretation of why fat-tailing the forecast works is that it accounts for incorrect assumptions. In other words, if all our assumptions are correct, we might predict a fairly narrow hump on the expectation. But there’s always that 5–15 % chance that we are missing some detail, and then the answer can be just about anywhere in the broad neighbourhood of the expectation. An additional component that says “wuh-huh I don’t know it could be anything” at about a 10 % weight admits that in the forecast.
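Here is a small sketch of what that extra component does to the tail probability, with completely made-up distributions: a narrow component for the expected case, and a wide one at 10 % weight.

```python
from statistics import NormalDist

# A narrow component for "my assumptions are right" plus a wide component at
# 10 % weight for "I could be missing something". All numbers are made up.
narrow = NormalDist(mu=5.0, sigma=1.0)
wide = NormalDist(mu=5.0, sigma=8.0)
w = 0.10  # weight of the wide component

def mixture_cdf(x: float) -> float:
    return (1 - w) * narrow.cdf(x) + w * wide.cdf(x)

outcome = 9.3  # an outcome far out in the narrow component's tail
print(f"narrow only:   {1 - narrow.cdf(outcome):.4f}")   # essentially zero
print(f"with fat tail: {1 - mixture_cdf(outcome):.4f}")  # a few percent
```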