Updated LLM Benchmark (Gemini 3 Flash)

I evaluate llms by how well they play text adventures. The last update I made was when Haiku 4.5 was released. Now that Google has released a preview of Gemini 3 Flash, I had to run the benchmark again. Having seen how well Gemini 2.5 Flash performed in earlier benchmarks, I would have expected Gemini 3 Flash to blow the other models out of the water, which it did … almost.

The following plot illustrates the distribution of the relative scores achieved by each tested model.

[Figure: llm-evaluation-202512-01.svg – distribution of relative scores per model]

In contrast to previous benchmarks, this time the models have had a fixed budget of $0.15 on which they'll have to prove how far they can get in nine different text adventures. (Well, actually, they've had a word limit scaled by their output token cost, but that's been close enough. The average cost has worked out to roughly $0.20 per run, thanks to some technical difficulties and mistakes made along the way.) For each game, the models are graded on a curve based on how far they get compared to all other models – this means 0.0 is the worst score (it got nowhere) and 1.0 is the best score (it got furthest of all models).
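To make that grading concrete, here is a minimal sketch of the idea, assuming the relative score is simply each model's achievement count divided by the best count for that game. The actual scoring script is not published, so take the details as a guess rather than a description.

    # Minimal sketch of per-game curve grading: each model's achievement count
    # divided by the best count for that game. A guess at the idea, not the
    # benchmark's actual scoring code.
    def relative_scores(achievements: dict[str, int]) -> dict[str, float]:
        best = max(achievements.values())
        if best == 0:
            return {model: 0.0 for model in achievements}  # nobody got anywhere
        return {model: count / best for model, count in achievements.items()}

    # Hypothetical achievement counts for one game:
    print(relative_scores({"gemini-3-flash": 7, "gpt-5": 3, "sonnet-4.5": 0}))
    # => {'gemini-3-flash': 1.0, 'gpt-5': 0.43, 'sonnet-4.5': 0.0} (roughly)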

Overall, there’s a group of expensive models (Sonnet to gpt 5) with average and uncertain performance, then a group of cheaper models (Qwen 3 to gpt 5 Mini) which aren’t great but can achieve more than the expensive models within their budget. Then there are two models that really shine, but for different reasons: Gemini 3 Flash and Grok 4.1 Fast. More on both in the observations below.

The budget of $0.15 per evaluation run might sound cheap, but it seriously restricts how many models I can evaluate, how long I can run them for, how many games I can have them play, and how many times I can run each to smooth out random variation. Were it not for the generous contributions of readers who buy me coffee or subscribe to the premium newsletter, I would not be able to run even these evaluations.

Observations

I included Grok 4.1 Fast in the analysis because I wanted to compare it to Grok Code Fast 1. After I had seen Grok 4.1 Fast fumble about cluelessly for a bit, I was very surprised to see it come out at the top of the performance chart. I think the reason it’s good is not that it reasons particularly well; instead, it’s because it is cheap and speaks compactly. (Dear $deity, I hate the voice it writes with, but it seems effective at prompting itself and keeping track of what’s going on.) Part of it may be that it inadvertently cheats the benchmark a little: it uses more long words than other models, which means its tokens are under-counted by my script.
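To illustrate the under-counting: a word-based limit charges a six-word command the same whether its words are short or long, whereas a token count does not. The four-characters-per-token heuristic and the example sentences below are illustrative assumptions, not the benchmark's actual accounting.

    # Why a word limit under-counts tokens for long-worded output.
    def word_count(text: str) -> int:
        return len(text.split())

    def rough_token_estimate(text: str) -> int:
        return max(1, len(text) // 4)  # crude heuristic: ~4 characters per token

    terse = "go north and take the lamp"
    florid = "proceed northwards, appropriating the luminescent lantern"

    for label, text in [("terse", terse), ("florid", florid)]:
        print(label, word_count(text), rough_token_estimate(text))
    # Both sentences are six words, but the florid one comes out at more than
    # twice as many tokens by this estimate, so a word-based budget charges
    # them the same.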

The budget of $0.15 per run was calibrated so that Gemini 2.5 Flash should get roughly 40 turns on all games, to keep some continuity with earlier benchmark evaluations. Here’s how many turns these models achieve on that budget.
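As a back-of-the-envelope check on that calibration: the number of turns a model gets is roughly the budget divided by its cost per turn. The price and tokens-per-turn figures in this sketch are placeholders chosen to land near 40 turns, not measured values from the benchmark.

    # Turns afforded by a fixed dollar budget, given the model's output price
    # and a typical output length per turn. Illustrative numbers only.
    def turns_on_budget(budget_usd: float, price_per_mtok_usd: float,
                        tokens_per_turn: int) -> float:
        cost_per_turn = price_per_mtok_usd / 1_000_000 * tokens_per_turn
        return budget_usd / cost_per_turn

    # e.g. $2.50 per million output tokens and about 1,500 output tokens per turn:
    print(round(turns_on_budget(0.15, 2.50, 1_500)))  # => 40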

[Figure: llm-evaluation-202512-02.svg – turns achieved by each model on the $0.15 budget]

A model like Claude 4.5 Sonnet performs well when money is unrestricted, but it is expensive and verbose enough that it doesn’t stand a chance of accomplishing anything on a $0.15 budget. Although we should be careful about comparing turn counts against performance (that comparison invites an “achievements per turn” metric, which I don’t think is meaningful when some models barely even have time to get out of the starting room), if we plot the raw median achievement count for each model against the median turn count, a clear pattern emerges.

[Figure: llm-evaluation-202512-03.svg – raw median achievement count against median turn count for each model]

We can tell that squeezing more turns into the budget is better, but we can also see that Grok 4.1 Fast falls below the line – it needs a lot more turns to reach its level of performance. This is noticeable when running these evaluations: Gemini 3 Flash finishes its run relatively quickly, whereas the Grok 4.1 Fast runs drag on and on. Thus, in my book, Gemini 3 Flash remains the champion of text adventures.

There are some other interesting observations we can make.

  • gpt 5.2 is 30 % more expensive than gpt 5, and thus gets fewer tokens to earn achievements with. It still manages to earn almost as many as gpt 5 (gpt 5.2 mean score 0.231, gpt 5 mean score 0.237), meaning its increased intelligence almost exactly offsets its increased cost.
  • Qwen 3 Coder Plus is the proprietary version of the open-weight Qwen 3 VL 235B A22B Instruct model. At more than 4× the cost, it would have to accomplish much more per token to come out ahead in this benchmark. It does not.
  • gpt 5 Mini is on the level of Gemini 2.5 Flash, which is not bad at all. It does need 20 % more tokens, but it also costs 20 % less, so it works out in the end.

Diving into turn count and performance

To get a better sense of how turn count affects performance, we can run the same Gemini 3 Flash model with increasing budgets. Here, purple represents the baseline budget, and then we have tried halving it twice and doubling it twice.

[Figure: llm-evaluation-202512-04.svg – Gemini 3 Flash performance at halved and doubled budgets]

This looks like performance is a linear function of turn count, but since the abscissa uses a log scale, it’s actually a logarithmic function of turn count. If we fit a line to this and do the backwards math, we can plop this logarithmic function down on the graph of raw achievement count against turn count for all tested models.
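Concretely, that fit is just a straight line against the logarithm of the turn count. Here is a sketch with made-up data points; the real medians come from the benchmark runs.

    # Fit achievements ≈ a·ln(turns) + b from runs at different budgets, then
    # evaluate that curve at any turn count. The data points are invented.
    import numpy as np

    turns = np.array([10, 20, 40, 80, 160])             # halved/doubled budgets
    achievements = np.array([1.0, 2.1, 3.0, 4.2, 5.1])  # hypothetical medians

    a, b = np.polyfit(np.log(turns), achievements, 1)   # straight line in log-x

    def predicted_achievements(turn_count: float) -> float:
        return a * np.log(turn_count) + b

    print(predicted_achievements(60))  # what the fitted curve implies at 60 turns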

[Figure: llm-evaluation-202512-05.svg – achievement count against turn count with the fitted Gemini 3 Flash curve]

The light grey curve represents what Gemini 3 Flash would accomplish with that number of turns. My hypothesis was that all models would fall on this curve, i.e. that once we account for how many turns they could afford, they would all perform the same. But it turns out not to be so! Gemini 3 Flash is truly ahead of the curve implied by the other models.

It looks like Grok 4.1 Fast sits on the same curve, but we don’t actually know if that’s the case. It would be interesting to find out, but I’m all out of time and money for now.

Methodology notes

In contrast to earlier benchmarks, models now have a fixed budget rather than a turn limit. This is a good thing, because some models are much more expensive than others, and we would like to know which model gets us furthest per dollar.

I have also switched out three games that were not very predictive of performance (So Far, For a Change, and Plundered Hearts) for some others I happen to have played recently and thus was able to construct achievements for (my own Crimson Witness, but also three more games from the same competition: The Organ-Grinder’s Monkey, All That Shimmers…, Adrift, and Kill Wizard).

Each turn, the llm is prompted with a message consisting of the following (a sketch of how such a prompt might be assembled is shown after the list):

  1. a reminder that it is playing a text adventure, and some basic instructions for how to do that;
  2. its internal thinking from the previous turn;
  3. the output of the game, i.e. the result of the last command it issued; and
  4. some leading questions about what its goals are, what the unsolved puzzles are, etc. to help it keep track of the objective.
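Since the exact prompt stays private (as noted below), the following is only a hypothetical sketch of how those four parts could be stitched together each turn; every name and phrase in it is invented for illustration.

    # Hypothetical assembly of the per-turn prompt from the four parts above.
    INSTRUCTIONS = (
        "You are playing a text adventure. Issue one short command per turn, "
        "such as LOOK, TAKE LAMP or GO NORTH."
    )

    LEADING_QUESTIONS = (
        "What is your current goal? Which puzzles remain unsolved? "
        "What should you try next?"
    )

    def build_turn_prompt(previous_thinking: str, game_output: str) -> str:
        return "\n\n".join([
            INSTRUCTIONS,                                        # 1. what it is doing
            f"Your notes from last turn:\n{previous_thinking}",  # 2. prior thinking
            f"The game responded:\n{game_output}",               # 3. game output
            LEADING_QUESTIONS,                                   # 4. keep it on track
        ])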

This prompt is necessary for most models, which otherwise quickly spiral into an obsession over some insignificant part of the game and then never make progress. (The specific details of the prompt I keep private. I think it’s a good idea for people to keep their personal evaluation tools secret but publish the results; it’s a good counter-weight against all the amazing public benchmark results models can spit out.) Maybe the more expensive models would outperform the cheaper models without this prompting, but I’m not interested in using llms that way. I think of llms as tools that need to be wielded correctly, not magical fairies that solve problems without our intervention.

In the background – not revealed to the llm – there are achievements defined for each game. The script that orchestrates the whole thing listens for key phrases printed by the game which indicate that some progress has been made, and awards a point to the llm whenever that happens. These are called achievements in the code. The llms are then graded on a curve based on how many achievements they and the other models earned for each game.
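The real key phrases and orchestration script are not published, but the mechanism described above could look roughly like this sketch, with invented phrases standing in for the real ones.

    # Watch a game transcript for key phrases that indicate progress and count
    # each one once. The game name and phrases here are made up for illustration.
    ACHIEVEMENT_PHRASES = {
        "crimson-witness": [
            "You pocket the key.",
            "The ballroom doors swing open.",
        ],
    }

    def score_transcript(game: str, transcript: str) -> int:
        phrases = ACHIEVEMENT_PHRASES.get(game, [])
        return sum(1 for phrase in phrases if phrase in transcript)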