Haiku 4.5 Playing Text Adventures
The announcement of Anthropic’s new smallish Claude model, Haiku 4.5, had people running it through their favourite benchmarks. That reminded me of my favourite benchmark: how well it plays text adventures! On its first run, Haiku 4.5 blew the previous models out of the water. That seemed suspicious, so I ran it again; it still performed well, but at a much more plausible level.
The following table is divided into “better than Gemini 2.5 Flash”, “roughly equivalent to Gemini 2.5 Flash”, and “worse than Gemini 2.5 Flash”, considering both price and regression coefficient.
The parenthesised number next to the model name indicates the number of samples that model has run per game (💸). The regression coefficient indicates how many more achievements the model earns on average than Gemini 2.5 Flash, correcting for varying difficulty across games (see the sketch below the table).
Model | Coef. | Cost ($/M tokens) |
---|---|---|
Claude Sonnet 4.5 (3) | +0.12 | 15.0 |
GPT-5 (2) | +0.10 | 10.0 |
Gemini 2.5 Flash (7) | ±0.00 | 2.5 |
Claude Haiku 4.5 (3) | −0.01 | 5.0 |
Grok 4 (1) | −0.02 | 15.0 |
Gemini 2.5 Pro (3) | −0.06 | 10.0 |
GLM 4.6 (2) | −0.10 | 1.8 |
Grok Code Fast 1 (2) | −0.12 | 1.5 |
gpt-oss-120b (2) | −0.17 | 0.4 |
Qwen 3 Coder (1) | −0.24 | 0.3 |
gpt-oss-20b (1) | −0.28 | 0.2 |
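In case you want to reproduce the Coef. column: a coefficient like this can come out of an ordinary least-squares fit with model and game as categorical predictors and Gemini 2.5 Flash as the reference level. A minimal sketch, where the file layout and column names are illustrative and the outcome is assumed to be the fraction of each game’s achievements earned:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per run: which model played which game and what fraction of that
# game's achievements the run earned. File and column names are illustrative.
runs = pd.read_csv("runs.csv")  # columns: model, game, achievements

# OLS with model and game as categorical predictors. Making Gemini 2.5 Flash
# the reference level means each model coefficient reads as "achievements
# above or below Gemini 2.5 Flash, after correcting for game difficulty".
fit = smf.ols(
    "achievements ~ C(model, Treatment(reference='Gemini 2.5 Flash'))"
    " + C(game)",
    data=runs,
).fit()

print(fit.params.filter(like="model"))
```

In a fit like this, the C(game) terms play the role of the per-game difficulty coefficients listed further down.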
As for the headline result: Haiku 4.5 performs roughly at the level of Gemini 2.5 Flash, but is twice as expensive and a little slower, so I would not use it for this.
Further observations:
- Claude Sonnet, in its latest iteration, continues to be very good – but also very expensive. Is it worth an additional $12.5 per million tokens? Not sure. See below for a methodology fix we could use to find out.
- Perhaps surprisingly, Grok 4 and Gemini 2.5 Pro are worse than Gemini 2.5 Flash despite being more expensive. I think this is because they explore too systematically, so they need more turns to make the same forward progress.
- GLM 4.6 is more than twice as expensive as GLM 4.5 was last time, yet it still doesn’t perform much better than its predecessor.
- The cheap open-weight OpenAI models are not very good.
The stark difference between Gemini 2.5 Flash and Sonnet 4.5 made me realise a turn budget is the wrong way to think about this benchmark. What we really should do is give models a cash budget. Without adding too much complexity, we could approximate that by counting the number of words in their output (I know words are not tokens, but I want to keep this simple) and cutting them off when they reach a predefined limit that depends on their cost. This would give Sonnet 4.5 a sixth of the time Gemini 2.5 Flash has to earn achievements – would it still outperform it then? Possibly not.
I don’t think I can take the time to do that now, but it would be interesting! Particularly since a reader contacted me about automatically generating text adventure transcripts, among other things for archival purposes. That would be super cool, but I’m still hesitant about whether model performance and cost make it worthwhile.
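If I ever do get to it, the cost-scaled word budget could be as simple as this sketch; all the numbers in it are illustrative, not anything I have actually run:

```python
# Every model gets the same notional cash budget, so cheaper models get
# proportionally more words of output before being cut off.
REFERENCE_BUDGET_WORDS = 100_000   # what a model at the reference price gets
REFERENCE_COST_PER_MT = 2.5        # Gemini 2.5 Flash's price in $/M tokens

def word_budget(cost_per_mt: float) -> int:
    """Words a model may emit before being cut off, at equal notional spend."""
    return int(REFERENCE_BUDGET_WORDS * REFERENCE_COST_PER_MT / cost_per_mt)

def over_budget(outputs: list[str], cost_per_mt: float) -> bool:
    """True once the transcript so far has used up the model's word budget."""
    words_used = sum(len(o.split()) for o in outputs)  # words, not tokens
    return words_used >= word_budget(cost_per_mt)

print(word_budget(2.5), word_budget(15.0))  # Flash gets six times Sonnet's words
```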
More nerdy notes on methodology: here are the games the models played, and the relative difficulty of earning achievements in each of them:
Game | Coef. | Variation |
---|---|---|
9:05 | ±0.00 | 0.15 |
Dreamhold | −0.21 | 0.09 |
Lost Pig | −0.25 | 0.09 |
Lockout | −0.35 | 0.10 |
So Far | −0.39 | 0.24 |
For a Change | −0.45 | 0.11 |
Plundered Hearts | −0.45 | 0.09 |
High variation indicates that even the same model sometimes earns many achievements and sometimes very few – high-variation games contribute more noise to the result. This mirrors our conclusions from last time: So Far is a bad test because of its high variation, while For a Change and Plundered Hearts have low variation not because they are consistent estimators of progress, but because models consistently fail to make any progress in them at all!
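If you want a variation number like this from your own run log, one simple way is a groupby and a standard deviation (same illustrative layout as the earlier sketch):

```python
import pandas as pd

runs = pd.read_csv("runs.csv")  # illustrative: one row per run, as above

# Spread of achievement fractions across every run of a game, all models
# pooled: high values mean the game gives a noisy signal about model quality.
variation = runs.groupby("game")["achievements"].std().sort_values()
print(variation)
```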
I tried to measure the uncertainty in model performance, did something highly unmathematical, and ended up with these numbers:
Model | Uncertainty |
---|---|
GLM 4.6 | 0.13 |
GPT-5 | 0.12 |
gpt-oss-20b | 0.10 |
gpt-oss-120b | 0.10 |
Qwen 3 Coder | 0.08 |
Sonnet 4.5 | 0.08 |
Haiku 4.5 | 0.08 |
Gemini 2.5 Pro | 0.07 |
Grok Code Fast 1 | 0.07 |
Gemini 2.5 Flash | 0.06 |
The reason we don’t run more samples for GLM 4.6, the gpt-oss models, and Qwen 3 Coder, despite their higher uncertainty, is that they suck badly enough that we can be fairly sure they’re bad anyway.
It would be nice to run more samples for GPT-5, but it’s expensive, and we can live with knowing its performance is “around Sonnet 4.5” – perhaps a little better, given its lower cost.
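If I ever replace the highly unmathematical part with something principled, the obvious choice would be a bootstrap: resample the runs with replacement, refit the regression, and look at how much each model’s coefficient moves. A sketch of that, building on the earlier one and emphatically not what produced the table above:

```python
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.read_csv("runs.csv")  # same illustrative layout as above
FORMULA = ("achievements ~ C(model, Treatment(reference='Gemini 2.5 Flash'))"
           " + C(game)")

# Resample the runs with replacement, refit the regression each time, and
# record every model's coefficient; the spread across refits is the
# uncertainty estimate.
coefs = []
for seed in range(200):
    sample = runs.sample(frac=1.0, replace=True, random_state=seed)
    params = smf.ols(FORMULA, data=sample).fit().params
    coefs.append(params.filter(like="model"))

uncertainty = pd.DataFrame(coefs).std().sort_values(ascending=False)
print(uncertainty)
```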