The Making of a Forecasting Bot

In the summer of 2024 Metaculus reached out and invited me to participate in an ai Forecasting Benchmark Tournament. (This is not about forecasting ai progress, to be clear, but about programming an ai to make forecasts.) At the time, I didn’t build anything, because I had too much else going on. I did get around to it now, though, and I call the result the Basean Loom.

Performance of the Basean Loom

It is difficult to backtest these kinds of ai, because we only want to give them the information they would have had if they had made the forecast on a particular date, and it is virtually impossible to do this reliably and not accidentally leak information from the future. So I don’t know yet how the Basean Loom really performs. I’m going to have it make daily forecasts in the Q4 ai Forecasting Benchmark Tournament (you should join me!) and see how it pans out.

In my preliminary tests, the Basean Loom has achieved a community-weighted expected Brier score of 0.21–0.24, which is good for a computer, but no better than me, so quite some way from superhuman performance. Taking the testing history together, I will make the following forecast of its performance over the next three months:

Percentile     5 %    25 %   50 %   75 %   95 %
Brier score    0.21   0.23   0.26   0.29   0.46

If it performs worse than 0.25 – which corresponds to the best score one can get without even reading the questions – I will be a little disappointed. So yes, according to my forecast, the most likely outcome is that I will be a little disappointed. If it performs better than 0.22, I will be immensely happy.

Example reasoning

We’ll illustrate how the Basean Loom works with a dumb question:

Will a person in Sweden die from fireworks on New Year’s Eve 2024?

To set the scene, here’s how I’d approach it:

  • During the 20 years between, say, 1995 and 2015, there were on average 0.35 deaths per year due to fireworks. All of these happened during New Year’s Eve – the major fireworks holiday in Sweden. Assuming the events are Poisson, this gives us a base rate of 30 % also for the next New Year’s. (The arithmetic is spelled out just after this list.)
  • However, brief research also revealed that the laws around fireworks became stricter in 2019. The number of serious fireworks-related accidents has since steadily declined, to maybe a third or a quarter of the rate experienced between 1995 and 2015, so we should reason that the probability ought to be closer to 8 %.
  • To account for yearly variation and avoid overconfidence, I would go somewhere in between these two probabilities, with a slight preference for the latter. This puts us at a prediction near 15 %.
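
For the Poisson step in the first bullet: with an average rate of \(\lambda = 0.35\) deaths per year, the probability of at least one death in a given year is

\[1 - e^{-\lambda} = 1 - e^{-0.35} \approx 0.30\]

and scaling \(\lambda\) down to a third or a quarter of that gives \(1 - e^{-\lambda} \approx 8\text{–}11\,\%\).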

The Basean Loom can’t do reasoning this refined, and some of the things it brings up are absolutely delusional. (One of my favourite sentences it actually spat out in relation to this question is “Assuming no firework-related death occurs in 2024 slightly decreases the likelihood of a major injury reduction in 2023.” Yes, that’s how causality works according to our machine-overlords-to-be: the ghosts of the future deceased reach out into the past to offer a whisper of caution.) Somehow, it still works out reasonably well in the end. It prints out a summary of its reasoning, which will not make sense to you yet, but we’ll go through it as we learn how the Basean Loom works. Here it is:

  • -3.15 - weighted average base rate log odds on <question>Will a person in Sweden die from fireworks at New Year’s Eve 2024?</question>
    • -4.60 - estimated base rate log odds
    • -1.71 - implied base rate log odds
  • 0.18 - mean log-odds difference from subproblems
    • 0.59 - log-odds difference for event <event>The number of firework-related injuries in Sweden on New Year’s Eve 2023 decreases by more than 25% compared to the previous year.</event>
      • 30 % - base rate of event
    • 2.01 - log-odds difference for event <event>Sweden experiences unusually dry and windy weather conditions on New Year’s Eve 2024, increasing fire risk.</event>
      • 5 % - base rate of event
    • 1.69 - log-odds difference for event <event>The Swedish government launches an aggressive public awareness campaign about firework safety in December 2024, reaching over 80% of the population.</event>
      • 15 % - base rate of event
  • -1.52 - de-extremised sum
  • 18 % - final prediction

Note how close its final prediction of 18 % is to my suggestion of 15 % – but critically, it errs on the side of avoiding overconfidence.

Summary of technique

At a very high level, the technique used by the Basean Loom consists of these steps:

  • Explicitly estimate a base rate for the question.
  • List significant events that can affect the forecast.
  • Merge the explicit base rate estimation with a base rate implied by the events.
  • Adjust the base rate by a small amount indicated by the significant events.
  • De-extremise and clamp.

This was discovered by trial and error, and there are many details that have still not been fully trial-and-errored, because evaluating these sorts of things using flagship models (which seem to be needed to do anything that even looks like good forecasting) is really expensive in both time and money: forecasting on one question takes about a minute and costs 8¢. Many questions are needed to establish a trustworthy Brier score that is not just the result of noise, and the bill quickly racks up.

To be transparent, though, Metaculus did hook me up with an Anthropic api key on their dime. It still suffers from some reasonable rate limiting, so the evals (as they are known in the lingo) are now cheaper in money for me, but still expensive in time – another resource I don’t have a lot of.

Details of each step of the forecast

I am not in a position where I can publish the entire source code yet, though the code for the central coordination (less debugging logic) is in Appendix A. I want to avoid publishing the full code for three reasons:

  • Mostly because it’s not very well written and I don’t want to be associated with that. I’d have to clean it up first.
  • A little because I want to see how it compares against other approaches, not against improved versions of itself.
  • A tiny bit because it still has api keys checked into source control.

I think most of what the Basean Loom does is run-of-the-mill, but there is one thing that really sets it apart, and one smaller thing that might be unique to it. The first is computing the implied base rate, and the other relates to how the adjustments to the base rate are computed.

Compute an implied base rate

The Basean Loom does explicitly estimate a base rate for the question, and this is what gets reported as

-4.60 - estimated base rate log odds

in the summary. However, it also computes an implied base rate, based on the conditional probabilities of some precursor events. This is what is reported as

-1.71 - implied base rate log odds

It does this by making a list of events that could have a large effect on the forecast, and computing the following probabilities:

  • \(P(E_i)\): the probability of precursor event \(E_i\) happening.
  • \(P(E_i \mid Q)\): the probability of precursor event \(E_i\) having happened assuming that the main question \(Q\) turned out to happen.
  • \(P(E_i \mid \neg Q)\): the probability of precursor event \(E_i\) having happened assuming that the main question \(Q\) turned out to not happen.

Along with the base rate \(P(Q)\), these are related through a coherence law, namely

\[P(E_i) = P(E_i \mid Q)P(Q) + P(E_i \mid \neg Q)(1 - P(Q))\]

If we drive this backwards, we can solve for an implied estimation \(P_i(Q)\) of the base rate of the main question:

\[P_i(Q) = \frac{P(E_i) - P(E_i \mid \neg Q)}{P(E_i \mid Q) - P(E_i \mid \neg Q)}\]

By averaging together these implied estimations into a single aggregate number \(\hat{P}(Q)\), we get a completely independent estimation of the base rate.
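
In code, driving the coherence law backwards is a small function. This is a sketch, with made-up numbers in the usage line; the guard against an uninformative event mirrors what the appendix code does:

def implied_base_rate(p_e, p_e_given_q, p_e_given_not_q):
    # Solve P(E) = P(E|Q)P(Q) + P(E|~Q)(1 - P(Q)) for P(Q).
    denominator = p_e_given_q - p_e_given_not_q
    if denominator == 0:
        return None  # The event is uninformative about Q.
    # Clamp so incoherent estimates don't yield impossible probabilities.
    return min(0.99, max(0.01, (p_e - p_e_given_not_q) / denominator))

implied_base_rate(0.30, 0.60, 0.25)  # => roughly 0.14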

The Basean Loom then averages together these separate estimations, and uses the result as the final base rate. That’s what is reported at the top level under the rubric (the summary says “weighted average”, but in the current implementation the weights are 0.5 and 0.5, i.e. no weighting is performed at all):

-3.15 - weighted average base rate log odds on <question>Will a person in Sweden die from fireworks at New Year’s Eve 2024?</question>

Recall that the base rate it explicitly estimated was -4.6 log-odds, which translates to roughly 1 %. This was based entirely on feels, and not any data at all: its directed chain-of-thought includes token streams such as “Fireworks-related deaths are extremely rare events,” which is true, of course, but there are very many opportunities even for rare events when the entire country goes out at the same time to half-drunkenly fire off explosives. In the end, the explicit estimation of 1 % is much too low.

The implied base rate of 15 % is more reasonable in this case. Sometimes it’s the other way around: the explicit base rate is reasonable, and the implied one is nuts. But their average tends to be more reasonable than either on its own.
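
Note that the averaging happens in log-odds space, not probability space. With the example’s numbers, the top-level figure checks out:

\[\frac{1}{2} \times (-4.60) + \frac{1}{2} \times (-1.71) = -3.155 \approx -3.15\]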

Adjust base rate by average log-odds difference from events

As we saw before, we estimate three quantities for each precursor event:

  • \(P(E_i)\): the probability of precursor event \(E_i\) happening.
  • \(P(E_i \mid Q)\): the probability of precursor event \(E_i\) having happened assuming that the main question \(Q\) turned out to happen.
  • \(P(E_i \mid \neg Q)\): the probability of precursor event \(E_i\) having happened assuming that the main question \(Q\) turned out to not happen.

If we divide the two conditional probabilities by each other, we get the likelihood ratio of the event, which tells us how the event affects the odds of the main question happening. If we take the logarithm of the likelihood ratio, we get a log-odds difference, which is a very natural, linear, direct measure of the strength of the evidence contained in the event.
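
This is just Bayes’ theorem in log-odds form: conditioning on an event adds its log-odds difference to the prior log-odds.

\[\log \frac{P(Q \mid E_i)}{P(\neg Q \mid E_i)} = \log \frac{P(Q)}{P(\neg Q)} + \log \frac{P(E_i \mid Q)}{P(E_i \mid \neg Q)}\]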

The Basean Loom does not do a great job of estimating the log-odds differences. In fact, it’s mostly terrible at it. For example, the first precursor event in the summary above is

The number of firework-related injuries in Sweden on New Year’s Eve 2023 decreases by more than 25 % compared to the previous year.

Clearly, this should count as evidence against a death in 2024. A reduction in injuries means people are exposing themselves to less risk, which also means they have a lower risk of dying. But the Loom accidentally arrives at the opposite conclusion: it assigns a log-odds difference of positive 0.59 to this event, meaning a reduction in injuries would be associated with an increased probability of death next year. (I think this happens because it gets its causality mixed up sometimes – see the delusional sentence quoted earlier.)

Despite this abysmal performance (which is consistently bad), the log-odds differences it does estimate seem to somehow offset other errors it makes, so in the end, they improve the forecast anyway.

Once it has estimated these log-odds differences (\(\delta L\)) for each event, it computes an adjustment to apply to the base rate as

\[\frac{\sum \delta L (E_i) P(E_i)}{n}\]

where \(n\) is the number of events under consideration. In other words, this is an average of a weirdly weighted sum. I don’t think it has any sensible interpretation in terms of probability theory, but again, it seems to work out reasonably in the end.
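
Plugging in the three events from the example summary reproduces the reported adjustment:

\[\frac{0.59 \times 0.30 + 2.01 \times 0.05 + 1.69 \times 0.15}{3} = \frac{0.531}{3} \approx 0.18\]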

It adds this adjustment to the final base rate; in the example, that means adding up the two top-level figures reported as

  • -3.15 - weighted average base rate log odds on <question>Will a person in Sweden die from fireworks at New Year’s Eve 2024?</question>
  • 0.18 - mean log-odds difference from subproblems

This becomes the preliminary forecast of -2.97 log-odds, corresponding to a probability of 4.9 %. Still too low, but there is one more step remaining, although this one is fairly uninteresting, and I think everyone who builds these things does something similar here.

De-extremise and clamp

I don’t trust an ai with probabilities more extreme than, say, 10 % and 90 %. It took a long time before I trusted the Basean Loom with probabilities more extreme than 25 % and 75 %! One really has to be quite sure to enter forecasts in that range. It also seems that even when probabilities are clamped to the range of 10 % to 90 %, the Loom is somewhat overconfident within that region.

The final adjustment it makes to its preliminary forecast is a combined de-extremisation-and-clamping step: it passes the log-odds through a sigmoid function that squishes them into the desired probability range, pushing more extreme log-odds towards less extreme probabilities.

This is how it goes from a preliminary forecast of -2.97-log odds to

  • -1.52 - de-extremised sum

When it comes to extreme probabilities like this, it nearly halves its confidence! On less extreme preliminary forecasts, it makes smaller adjustments. On very extreme forecasts, it practically clamps them to -2.0 or +2.0 log-odds.

The final forecast is then this log-odds value converted to a probability, which we have already seen is 18 %.
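
The clamping_sigmoid function used for this is not defined in the appendix, but a scaled tanh has exactly the right shape. The sketch below, with an assumed internal slope of 3 (my guess, not the actual constant), comes close to the worked example, squashing -2.97 log-odds to about -1.51 and saturating at ±2.0:

import math

def clamping_sigmoid(log_odds, limit):
    # Squash log-odds smoothly into (-limit, +limit). The slope constant
    # decides how early the squashing kicks in; 3 is an assumed value
    # chosen to roughly match the worked example.
    return limit * math.tanh(log_odds / 3)

clamping_sigmoid(-2.97, 2)  # => about -1.51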

Ignore most of the background information and resolution criteria

The Basean Loom is primarily designed to forecast on Metaculus questions, and these questions sometimes come with detailed background information and highly specific resolution criteria. At first I included these at each step of the way, in an attempt to clarify for the llm what it was supposed to be talking about. I found, however, that these details confused it just as often as they helped. Background information, for example, got interpreted as up-to-date research. Resolution criteria would mention some specific detail that it got hung up on, making it forget the rest of the question.

Thus, there is a separate step that happens when the Basean Loom processes a Metaculus question: it summarises it to one sentence that captures the essentials. Then it rolls with that as if it were any other question.
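
Sketched out, that step is just one more prompt. The wording and the complete helper below are illustrative, not the actual implementation:

def summarise_question(question_details):
    # Boil the question, background, and resolution criteria down to one
    # self-contained sentence, so later steps cannot get hung up on stale
    # background or over-specific resolution details.
    prompt = (
        'Summarise the following forecasting question as a single '
        'sentence capturing what must happen for it to resolve yes. '
        'Reply with only that sentence.\n\n'
        + question_details
    )
    return complete(prompt)  # stand-in for the llm api call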

Other considerations

There are many more things I could say about this project:

  • How noise kills and is the primary enemy when it comes to ai forecasting.
  • How I made a stupid mistake and for a long time used Perplexity’s “chat” model which does no lookup, i.e. earlier attempts of mine didn’t actually have access to any up-to-date information, even though I thought they did.
  • Prompting advice, including namedropping terminology that I hope will put it in the right region of vector space, like narrative bias, base rate neglect, scope sensitivity, and System 2.
  • The evolution of Brier scores from the earlier attempts. (0.26, 0.24, 0.22, 0.21, 0.22, 0.23, where the Basean Loom gets 0.18 – though this method of evaluation is only good for comparing these models. It is not representative of real world performance.)
  • It is really freakin’ difficult to get an llm to paint a neutral picture of the evidence in favour of or against an event. They always want to present a balanced picture instead. (This is what the tv show The Newsroom named bias toward fairness: “There aren’t two sides to every story. […] Bias toward fairness means that if the entire Congressional Republican caucus were to walk into the House and propose a resolution stating the Earth was flat, the Times would lead with Democrats and Republicans Cannot Agree on Shape of Earth.”)

To avoid bogging this article down with details, we’ll close off with a little code in the appendix. Hopefully that clarifies any confusion arising from the descriptions above.

Appendix A: Pseudocode of central logic

I have paid very little attention to code quality, so please don’t take this as an example of what maintainable code I write looks like. I am a parent of small children so I needed to get something going in the space of a few evenings, and that’s what I did.

Most of the functions here that sound like they are doing actual work don’t do much beyond setting up llm api calls by filling in data in prompt templates, and then making the request.

The research bit is handled by Perplexity. I don’t know if they are any good, but the template code supplied by Metaculus used that api, so I copied most of that implementation to avoid having to think about it. I don’t think superior information gathering is where an ai will improve its forecast anyway. But that might just be my general anti-information bias.

import math

def clamp(lo, hi, x):
    # Restrict x to the closed interval [lo, hi].
    return max(lo, min(hi, x))

def forecast_on(question_details, allow_research=True):
    question = summarise_question(question_details)
    research = perform_research(question) if allow_research else ''

    baserate = baserate_forecast(question, research)

    events = further_events(question, research)

    event_probabilities, log_odds_diffs, implied_baserates = [], [], []
    for event in events:
        # Make event-specific research and prepend the research already done.
        event_research = research + (
            perform_research(event) if allow_research else ''
        )

        # Compute probability of event
        e_p = baserate_forecast(event, event_research)
        event_probabilities.append(e_p)

        # Estimate conditional probabilities P(E|Q) and P(E|~Q)
        cond_event = [
            conditional_forecast(question, event, event_research, b)
            for b in (True, False)
        ]
        # Compute log-odds difference from conditional probabilities.
        log_odds_diffs.append(math.log(cond_event[0]/cond_event[1]))

        # Compute implied base rate from sub-problems.
        # The relevant equation is P(E) = P(E|Q)P(Q) + P(E|~Q)(1- P(Q))
        # which solves for
        #
        #          P(E) - P(E|~Q)
        # P(Q) = ------------------
        #         P(E|Q) - P(E|~Q)
        #
        # If P(E|Q) == P(E|~Q), the event says nothing about Q; fall back
        # to P(E|Q) itself rather than dividing by zero.
        i_p = cond_event[0] if cond_event[0] == cond_event[1] \
            else clamp(0.01, 0.99, (e_p - cond_event[1]) / (cond_event[0] - cond_event[1]))
        implied_baserates.append(math.log(i_p/(1 - i_p)))

    # Take a weighted mean of implied and estimated baserates.
    baserate = math.log(baserate/(1 - baserate))
    imp_base = sum(implied_baserates)/len(implied_baserates)
    adj_base = 0.5*baserate + 0.5*imp_base

    # This is a bit of a hack to combine these factors. There are more
    # world-accurate ways to do it but this seems to produce decently sized
    # adjustments.
    mean_lodiff = sum(
        lodiff*p for p, lodiff in zip(event_probabilities, log_odds_diffs)
    )/len(log_odds_diffs)

    # De-extremise a fair bit and don't trust an LLM with extreme log-odds.
    forecast = clamping_sigmoid(adj_base + mean_lodiff, 2)
    prediction = round(100 * math.exp(forecast)/(1 + math.exp(forecast)))
    return prediction