Updated LLM Benchmark (Gemini 3 Flash)

I evaluate llms by how well they play text adventures. The last update I made was when Haiku 4.5 was released. Now that Google has released a preview of Gemini 3 Flash, I had to run the benchmark again. Having seen how well Gemini 2.5 Flash performed in earlier benchmarks, I would have expected Gemini 3 Flash to blow the other models out of the water, which it did … almost.

The following plot illustrates the distribution of the relative scores achieved by each tested model.

[Figure: llm-evaluation-202512-01.svg – distribution of relative scores per model]

In contrast to previous benchmarks, this time the models have had a fixed budget of $0.15 on which they'll have to prove how far they can get in nine different text adventures. (Well, actually, they've had a word limit scaled by their output token cost, but that's been close enough. The average cost has worked out to roughly $0.20 per run, thanks to some technical difficulties and mistakes made along the way.) For each game, the models are graded on a curve based on how far they get compared to all other models – this means 0.0 is the worst score (it got nowhere) and 1.0 is the best score (it got furthest of all models).
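To make that grading concrete, here is a minimal sketch of the idea, assuming the relative score is simply each model's achievement count divided by the best count for that game. The actual scoring script is not published, so take the details as a guess rather than a description.

    # Minimal sketch of per-game curve grading: each model's achievement count
    # divided by the best count for that game. A guess at the idea, not the
    # benchmark's actual scoring code.
    def relative_scores(achievements: dict[str, int]) -> dict[str, float]:
        best = max(achievements.values())
        if best == 0:
            return {model: 0.0 for model in achievements}  # nobody got anywhere
        return {model: count / best for model, count in achievements.items()}

    # Hypothetical achievement counts for one game:
    print(relative_scores({"gemini-3-flash": 7, "gpt-5": 3, "sonnet-4.5": 0}))
    # => {'gemini-3-flash': 1.0, 'gpt-5': 0.43, 'sonnet-4.5': 0.0} (roughly)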

Overall, there’s a group of expensive models (Sonnet to gpt 5) with average and uncertain performance, then a group of cheaper models (Qwen 3 to gpt 5 Mini) which aren’t great but can achieve more than the expensive models within their budget. Then there are two models that really shine, but for different reasons: Gemini 3 Flash and Grok 4.1 Fast. More on both in the observations below.

The budget of $0.15 per evaluation run might sound cheap, but it seriously restricts how many models I can evaluate, how long I can run them for, how many games I can have them play, and how many times I can run each to smooth out random variation. Were it not for the generous contributions of readers who buy me coffee or subscribe to the premium newsletter, I would not be able to run even these evaluations.

Observations

I included Grok 4.1 Fast in the analysis because I wanted to compare it to Grok Code Fast 1. After I had seen Grok 4.1 Fast fumble about cluelessly for a bit, I was very surprised to see it come out at the top of the performance chart. I think the reason it’s good is not that it reasons particularly well; instead, it’s because it is cheap and speaks compactly. (Dear $deity, I hate the voice it writes with, but it seems effective at prompting itself and keeping track of what’s going on.) Part of it may be that it inadvertently cheats the benchmark a little: it uses more long words than other models, which means its tokens are under-counted by my script.
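To illustrate the under-counting: a word-based limit charges a six-word command the same whether its words are short or long, whereas a token count does not. The four-characters-per-token heuristic and the example sentences below are illustrative assumptions, not the benchmark's actual accounting.

    # Why a word limit under-counts tokens for long-worded output.
    def word_count(text: str) -> int:
        return len(text.split())

    def rough_token_estimate(text: str) -> int:
        return max(1, len(text) // 4)  # crude heuristic: ~4 characters per token

    terse = "go north and take the lamp"
    florid = "proceed northwards, appropriating the luminescent lantern"

    for label, text in [("terse", terse), ("florid", florid)]:
        print(label, word_count(text), rough_token_estimate(text))
    # Both sentences are six words, but the florid one comes out at more than
    # twice as many tokens by this estimate, so a word-based budget charges
    # them the same.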

The budget of $0.15 per run was calibrated so that Gemini 2.5 Flash should get roughly 40 turns on all games, to keep some continuity with earlier benchmark evaluations. Here’s how many turns these models achieve on that budget.
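As a back-of-the-envelope check on that calibration: the number of turns a model gets is roughly the budget divided by its cost per turn. The price and tokens-per-turn figures in this sketch are placeholders chosen to land near 40 turns, not measured values from the benchmark.

    # Turns afforded by a fixed dollar budget, given the model's output price
    # and a typical output length per turn. Illustrative numbers only.
    def turns_on_budget(budget_usd: float, price_per_mtok_usd: float,
                        tokens_per_turn: int) -> float:
        cost_per_turn = price_per_mtok_usd / 1_000_000 * tokens_per_turn
        return budget_usd / cost_per_turn

    # e.g. $2.50 per million output tokens and about 1,500 output tokens per turn:
    print(round(turns_on_budget(0.15, 2.50, 1_500)))  # => 40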

[Figure: llm-evaluation-202512-02.svg – turns achieved by each model on the $0.15 budget]

A model like Claude 4.5 Sonnet performs well when money is unrestricted, but it is expensive and verbose enough that it doesn’t stand a chance of accomplishing anything on a $0.15 budget. Although we should be careful about comparing turn counts against performance (that comparison invites an “achievements per turn” metric, which I don’t think is meaningful when some models barely even have time to get out of the starting room), if we plot the raw median achievement count for each model against the median turn count, a clear pattern emerges.

[Figure: llm-evaluation-202512-03.svg – raw median achievement count against median turn count for each model]

We can tell that squeezing more turns into the budget is better, but we can also see that Grok 4.1 Fast falls below the line – it needs a lot more turns to reach its level of performance. This is noticeable when running these evaluations: Gemini 3 Flash finishes its run relatively quickly, whereas the Grok 4.1 Fast runs drag on and on. Thus, in my book, Gemini 3 Flash remains the champion of text adventures.

There are some other interesting observations we can make.

  • gpt 5.2 is 30 % more expensive than gpt 5, and thus gets fewer tokens to earn achievements with. It still manages to earn almost as many as gpt 5 (gpt 5.2 mean score 0.231, gpt 5 mean score 0.237), meaning its increased intelligence almost exactly offsets its increased cost.
  • Qwen 3 Coder Plus is the proprietary version of the open-weight Qwen 3 VL 235B A22B Instruct model. At more than 4× the cost, it would have to accomplish much more per token to come out ahead in this benchmark. It does not.
  • gpt 5 Mini is on the level of Gemini 2.5 Flash, which is not bad at all. It does need 20 % more tokens, but it also costs 20 % less, so it works out in the end.

Diving into turn count and performance

To get a better sense of how turn count affects performance, we can run the same Gemini 3 Flash model with increasing budgets. Here, purple represents the baseline budget, and then we have tried halving it twice and doubling it twice.

[Figure: llm-evaluation-202512-04.svg – Gemini 3 Flash performance at halved and doubled budgets]

This looks like performance is a linear function of turn count, but since the abscissa uses a log scale, it’s actually a logarithmic function of turn count. If we fit a line to this and do the backwards math, we can plop this logarithmic function down on the graph of raw achievement count against turn count for all tested models.
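Concretely, that fit is just a straight line against the logarithm of the turn count. Here is a sketch with made-up data points; the real medians come from the benchmark runs.

    # Fit achievements ≈ a·ln(turns) + b from runs at different budgets, then
    # evaluate that curve at any turn count. The data points are invented.
    import numpy as np

    turns = np.array([10, 20, 40, 80, 160])             # halved/doubled budgets
    achievements = np.array([1.0, 2.1, 3.0, 4.2, 5.1])  # hypothetical medians

    a, b = np.polyfit(np.log(turns), achievements, 1)   # straight line in log-x

    def predicted_achievements(turn_count: float) -> float:
        return a * np.log(turn_count) + b

    print(predicted_achievements(60))  # what the fitted curve implies at 60 turns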

[Figure: llm-evaluation-202512-05.svg – achievement count against turn count with the fitted Gemini 3 Flash curve]

The light grey curve represents what Gemini 3 Flash would accomplish with that number of turns. My hypothesis was that all models would fall on this curve, i.e. that once we account for how many turns they could afford, they would all perform the same. But it turns out not to be so! Gemini 3 Flash is truly ahead of the curve implied by the other models.

It looks like Grok 4.1 Fast sits on the same curve, but we don’t actually know if that’s the case. It would be interesting to find out, but I’m all out of time and money for now.

Methodology notes

In contrast to earlier benchmarks, models now have a fixed budget rather than a turn limit. This is a good thing, because some models are much more expensive than others, and we would like to know which model gets us furthest per dollar.

I have also switched out three games that were not very predictive of performance (So Far, For a Change, and Plundered Hearts) for some others I happen to have played recently and thus was able to construct achievements for (my own Crimson Witness, but also three more games from the same competition: The Organ-Grinder’s Monkey, All That Shimmers…, Adrift, and Kill Wizard).

Each turn, the llm is prompted with a message consisting of the following (a sketch of how such a prompt might be assembled is shown after the list):

  1. a reminder that it is playing a text adventure, and some basic instructions for how to do that;
  2. its internal thinking from the previous turn;
  3. the output of the game, i.e. the result of the last command it issued; and
  4. some leading questions about what its goals are, what the unsolved puzzles are, etc. to help it keep track of the objective.
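Since the exact prompt stays private (as noted below), the following is only a hypothetical sketch of how those four parts could be stitched together each turn; every name and phrase in it is invented for illustration.

    # Hypothetical assembly of the per-turn prompt from the four parts above.
    INSTRUCTIONS = (
        "You are playing a text adventure. Issue one short command per turn, "
        "such as LOOK, TAKE LAMP or GO NORTH."
    )

    LEADING_QUESTIONS = (
        "What is your current goal? Which puzzles remain unsolved? "
        "What should you try next?"
    )

    def build_turn_prompt(previous_thinking: str, game_output: str) -> str:
        return "\n\n".join([
            INSTRUCTIONS,                                        # 1. what it is doing
            f"Your notes from last turn:\n{previous_thinking}",  # 2. prior thinking
            f"The game responded:\n{game_output}",               # 3. game output
            LEADING_QUESTIONS,                                   # 4. keep it on track
        ])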

This prompt is necessary for most models, which otherwise quickly spiral into an obsession over some insignificant part of the game and then never make progress. (The specific details of the prompt I keep private. I think it’s a good idea for people to keep their personal evaluation tools secret but publish the results; it’s a good counter-weight against all the amazing public benchmark results models can spit out.) Maybe the more expensive models would outperform the cheaper models without this prompting, but I’m not interested in using llms that way. I think of llms as tools that need to be wielded correctly, not magical fairies that solve problems without our intervention.

In the background – not revealed to the llm – there are achievements defined for each game. The script that orchestrates the whole thing listens for key phrases printed by the game which indicate that some progress has been made, and awards a point to the llm whenever that happens. These are called achievements in the code. The llms are then graded on a curve based on how many achievements they and the other models earned for each game.
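The real key phrases and orchestration script are not published, but the mechanism described above could look roughly like this sketch, with invented phrases standing in for the real ones.

    # Watch a game transcript for key phrases that indicate progress and count
    # each one once. The game name and phrases here are made up for illustration.
    ACHIEVEMENT_PHRASES = {
        "crimson-witness": [
            "You pocket the key.",
            "The ballroom doors swing open.",
        ],
    }

    def score_transcript(game: str, transcript: str) -> int:
        phrases = ACHIEVEMENT_PHRASES.get(game, [])
        return sum(1 for phrase in phrases if phrase in transcript)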