Evaluating LLMs Playing Text Adventures
When we first set up the llm so it could play text adventures, we noted that none of the models we tried were any good at it. We dreamed of a way to compare them, but all I could think of was setting a goal far into the game and seeing how long it took them to get there. I just realised there’s a better way to do it.
Evaluation against achievements
What we’ll do is set a low-ish turn limit and see how much they manage to accomplish in that time.1 Another alternative for more linear games is running them multiple times with a turn limit and seeing how often they get past a particular point within that turn limit.
Given how much freedom is offered to players of text adventures, this is a difficult test. It’s normal even for a skilled human player to immerse themselves in their surroundings rather than make constant progress. I wouldn’t be surprised if I got a score of zero if someone plopped me down in front of this test. But still, maybe it’s the best we can do with limited resources.2 Another idea is to give them a far-off goal and then somehow have them request hints when they are stuck, and count how many hints they need to get there. However, given how little use they made of the hints they were given in the previous article, I doubt this would work very well either.
What we’ll do is define a set of achievements for a game. These achievements will be clustered around the first few turns of the game, because we’ll only give the llm a few turns to earn them. Here’s an example for 9:05.
    TURN_LIMIT 40
    ANSWER_PHONE Click.
    EXIT_BED You get out of bed.
    OPEN_DRESSER revealing some clean
    ENTER_BATHROOM far from luxurious
    REMOVE_SOILED You take off the soiled
    REMOVE_WATCH You take off the watch
    ENTER_SHOWER dawdle
    WEAR_CLEAN You put on the clean
    OPEN_FRONT You open the front
    UNLOCK_CAR Unlocked.
    ENTER_CAR Las Mesas
    OPEN_WALLET open the wallet
    CARD_SLOT green LED lights
It should be fairly clear how this works: the TURN_LIMIT
specifies how many
turns the llm has to collect achievements. Every line other than that
specifies an achievement: the name is on the left, and it counts as earned when
the game prints the text on the right. The llm knows nothing of these
achievements. It tries to get through the game and in the background we use the
achievements to count how far it gets.
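The actual harness is a Perl script, but to make the mechanics concrete, here is a minimal illustrative sketch in Python of how such an achievement file could be parsed and matched against game output. The file format is the one shown above; the function names are my own illustration, not the real implementation.

```python
def parse_achievements(path):
    """Read an achievement file: a TURN_LIMIT line plus NAME/trigger-text pairs."""
    turn_limit, triggers = None, {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            name, _, text = line.partition(" ")
            if name == "TURN_LIMIT":
                turn_limit = int(text)
            else:
                triggers[name] = text
    return turn_limit, triggers


def check_achievements(game_output, triggers, earned):
    """Mark any achievement whose trigger text appears in this turn's output."""
    for name, text in triggers.items():
        if text in game_output:
            earned.add(name)
```

Once the turn limit is reached, the reported score is simply `len(earned) / len(triggers)`, which is where the percentages in the tables below come from.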
It might seem like the turn limit must be calibrated such that a score of 100 % is possible, but that’s not the case. Many of the games we are going to test with have branching already at the start, such that the achievements need to cover multiple branches, and it’s impossible to go through all branches within the turn limit. What we do need to be careful about is making sure the number of achievements in each branch is roughly the same, otherwise models that are lucky and go down an achievement-rich path will get a higher score. Thanks to this, the score we get out of this test is a relative comparison between models, not an absolute measure of how well the llms play text adventures. We have already established that they don’t do it very well, and we can’t be more nuanced than that without paying for a lot of eval tokens.
We might consider making some moves not count toward the turn limit, for example erroneous commands, or examining things – the latter because more powerful models are more methodical and examine more things, and it seems odd to penalise them for this. However, in the end, examining things is probably part of what allows the more powerful models to make further progress (and typing valid commands is part of being good at text adventures), so we won’t give away any moves for free.
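To make the no-free-moves policy concrete, here is a continuation of the sketch above, reusing `check_achievements`, in which every command the llm submits costs one turn from the budget, whether or not the parser accepts it. The `game` and `llm` objects are hypothetical stand-ins, not the interface of the actual Perl script.

```python
def run_session(game, llm, turn_limit, triggers):
    """Play one session under a hard turn budget."""
    earned = set()
    transcript = game.opening_text()            # hypothetical interpreter wrapper
    for _ in range(turn_limit):
        command = llm.next_command(transcript)  # hypothetical llm wrapper
        output = game.send_command(command)     # charged even if the parser rejects it
        transcript += f"\n> {command}\n{output}"
        check_achievements(output, triggers, earned)
    return len(earned) / len(triggers)
```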
Evaluating many popular models
We register for OpenRouter to get convenient access to more models and then let them whirr away with the Perl script, which is updated to cut the llm off at the turn limit. At that point it reports to us how many achievements were earned. We get the following results, ordered roughly by decreasing performance. (The result tables in this article are wide; on narrow viewports you may have to scroll sideways.)
Model | 9:05 | Lockout | Dreamhold | Lost Pig |
---|---|---|---|---|
Grok 4 | 86 % | 15 % | 46 % | 33 % |
Claude 4 Sonnet | 80 % | 30 % | 53 % | 46 % |
Gemini 2.5 Flash | 80 % | 30 % | 33 % | 46 % |
Gemini 2.5 Pro | 80 % | 30 % | 40 % | 40 % |
DeepSeek R1 0528 | 80 % | 23 % | 33 % | 33 % |
Claude 4 Opus | 73 % | 30 % | 60 % | 46 % |
gpt-5 Chat | 73 % | 15 % | 53 % | 33 % |
DeepSeek V3 | 66 % | 23 % | 20 % | 33 % |
gpt-4o | 53 % | 23 % | 40 % | 40 % |
Qwen3 Coder | 53 % | 23 % | 40 % | 33 % |
Kimi K2 | 53 % | 30 % | 46 % | 40 % |
glm 4.5 | 53 % | 23 % | 33 % | 53 % |
Claude 3.5 Haiku | 38 % | 15 % | 26 % | 26 % |
Llama 4 Maverick | 33 % | 30 % | 40 % | 33 % |
o3-mini | 20 % | 15 % | 26 % | 26 % |
Mistral Small 3 | 20 % | 15 % | 0 % | 20 % |
gpt-4o-mini | 13 % | 23 % | 20 % | 40 % |
Ideally, these should be run multiple times to account for random variation in performance3 (in 9:05, for example, Opus thought it was not carrying the wallet when it was, so it jumped into the car again to go back for it; clever, but it wasted enough turns to lose to Sonnet thanks to a silly mistake!), but given that the Opus sessions cost around $4, I’m not going to do that. I was close to not even running Opus for all four games!
Adjusting model ranking for game difficulty
Some models appear to perform better in some games than others, so it’s hard to rank the models. We could take the average of their scores, but that’s unfair because some of the games are harder than others: a 40 % in Lockout should be considered more impressive than a 40 % in Dreamhold. What we will do, which may or may not be valid, is run a linear regression using models and games as predictors. This gives us coefficients for the games (telling us how difficult the games are), but also coefficients for the models, and the latter are the ones we want, because they are adjusted for game difficulty.
This regression is performed with the baseline being 9:05 played by gpt-5 Chat. Most of the model coefficients are not statistically significant (four games is not enough to establish statistical significance unless the model is truly terrible), but they might serve as a first-order estimate for ranking models.
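As a hedged sketch of what such a fit could look like (this is not the author’s actual tooling), the regression can be expressed with pandas and statsmodels, using treatment coding so that gpt-5 Chat and 9:05 are the reference levels. The `scores` data frame, with one row per model/game run and a `score` column in [0, 1], is an assumed layout.

```python
import pandas as pd
import statsmodels.formula.api as smf


def fit_model_coefficients(scores: pd.DataFrame):
    """Fit score ~ model + game with gpt-5 Chat and 9:05 as baselines.

    Each model coefficient is then that model's score relative to gpt-5 Chat,
    adjusted for which games it happened to be measured on.
    """
    formula = (
        'score ~ C(model, Treatment(reference="gpt-5 Chat"))'
        ' + C(game, Treatment(reference="9:05"))'
    )
    return smf.ols(formula, data=scores).fit()


# Example usage, with scores assembled from the result table above:
# fit = fit_model_coefficients(scores)
# print(fit.params)  # game coefficients measure difficulty, model coefficients rank models
```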
In this table, cost is per million output tokens.4 The design of the script ensures that output and input are similar in size – O(1) to be specific – so output is what is going to drive the cost. The table is divided into three categories: performance better than gpt-5 Chat, cheaper models with performance that is nearly there, and models that suck.
Model | Coefficient | Cost ($/Mt) |
---|---|---|
Claude 4 Opus | +0.09 | 75 |
Claude 4 Sonnet | +0.09 | 15 |
Gemini 2.5 Pro | +0.04 | 10 |
Gemini 2.5 Flash | +0.04 | 0.7 |
Grok 4 | +0.02 | 15 |
gpt-5 Chat (baseline) | 0.00 | 10 |
Kimi K2 | -0.01 | 2.5 |
DeepSeek R1 0528 | -0.01 | 0.7 |
glm 4.5 | -0.03 | 0.8 |
gpt-4o | -0.05 | 0.1 |
Qwen3 Coder | -0.06 | 0.8 |
DeepSeek V3 | -0.08 | 0.7 |
Llama 4 Maverick | -0.10 | 0.6 |
Claude 3.5 Haiku | -0.17 | 4 |
gpt-4o-mini | -0.20 | 0.6 |
o3-mini | -0.22 | 4.4 |
Mistral Small 3 | -0.30 | 0.1 |
Some comments:
- I find it interesting that the top-tier models (Claude Opus, Gemini Pro) don’t seem to significantly outperform their cheaper siblings (Claude Sonnet, Gemini Flash) in these tests.5 This might be because we are hand-holding the models so much in the prompt. More powerful models may be better at directing themselves.
- I’m very impressed by Gemini 2.5 Flash. At that cost, it is performing admirably. It is hard to argue for using models like DeepSeek’s R1 when we get better performance at the same cost from the Google model.
- The small models really aren’t good general problem solvers. I think Haiku costs so much because it is good at language, not reasoning.
It would be super interesting to toss these at more games to work out the finer differences (e.g. is there really a difference between Gemini Pro and Flash, or was that just down to sampling error in the small sample of games I had them play?), but such a comparison gets expensive, partly due to the cost of eval tokens (the above table cost something like $34), but mainly because it would require me to sit down and create sets of achievements for these games. I have only played so many z-code games, so I cannot do this for very many games. If someone wants to support me, please reach out!
Testing the top models on more games
I have played three more games, though, so let’s continue the evaluation with the five top models on these games also. Their performances on the three new games are
Model | For a Change | Plundered Hearts | So Far |
---|---|---|---|
Claude 4 Sonnet | 11 % | 19 % | 28 % |
Gemini 2.5 Pro | 16 % | 28 % | 28 % |
gpt-5 Chat | 44 % | 33 % | 0 % |
Grok 4 | 22 % | 28 % | 28 % |
Gemini 2.5 Flash | 28 % | 33 % | 14 % |
Using the same methodology as before (combining the data from both sets of trial runs), we arrive at new coefficients for the evaluated models.6 I also investigated how Gemini 2.0 Flash compared against Gemini 2.5 Flash, because the former is significantly cheaper and the latter was surprisingly good. Unfortunately, Gemini 2.0 Flash was not very good: its performance relative to its younger sibling was -15 %pt.7 I was also tempted to compare o3-mini against o3-mini-high to see the effect of the reasoning_effort parameter, but since o3-mini was such a crappy model anyway, it was hard to justify the effort.
Model | Coefficient | Cost ($/Mt) |
---|---|---|
Claude 4 Sonnet | +0.02 | 15 |
Gemini 2.5 Pro | +0.02 | 10 |
Gemini 2.5 Flash | +0.02 | 0.7 |
gpt-5 Chat (baseline) | 0.00 | 10 |
Grok 4 | -0.01 | 15 |
On the one hand, it’s a little odd that the performance of Claude 4 Sonnet dropped. On the other hand, I calibrated the prompt using Claude 4 Sonnet against 9:05, so by adding more games we are effectively diluting the training set within the test set; we probably should expect a performance drop at that point.
Noting the cost column, Gemini 2.5 Flash is a clear winner for running text adventures. It’s also fast compared to the others.
Evaluating score variation
Given that I’ve already sunk some money into this article series, and a few additional sessions with Gemini 2.5 Flash cannot hurt that much, let’s splurge and do that thing we wanted to do in the first place: run the same model against the same game a few times to figure out the size of the sampling error. All of the scores in the table below come from Gemini 2.5 Flash. The St. dev. column is the standard deviation of the six runs to its right.
Game | St. dev. | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 |
---|---|---|---|---|---|---|---|
9:05 | 14 %pt | 73 % | 86 % | 86 % | 80 % | 53 % | 60 % |
Lockout | 11 %pt | 30 % | 46 % | 46 % | 38 % | 23 % | 23 % |
Dreamhold | 10 %pt | 53 % | 40 % | 46 % | 46 % | 53 % | 26 % |
Lost Pig | 3 %pt | 46 % | 40 % | 40 % | 40 % | 46 % | 40 % |
For a Change | 6 %pt | 16 % | 11 % | 16 % | 5 % | 0 % | 11 % |
Plundered Hearts | 4 %pt | 19 % | 19 % | 19 % | 23 % | 28 % | 28 % |
So Far | 32 %pt | 14 % | 57 % | 71 % | 71 % | 71 % | 0 % |
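As a quick check (assuming the standard deviation column is the sample standard deviation over the six runs, which matches the numbers shown), the Lost Pig row can be reproduced like this:

```python
from statistics import stdev

lost_pig_runs = [46, 40, 40, 40, 46, 40]  # Gemini 2.5 Flash scores, in %
print(round(stdev(lost_pig_runs)))        # 3, matching the 3 %pt in the table
```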
In case it is not obvious, this is not so much an evaluation of Gemini 2.5 Flash as it is a judgment of the quality of the testing protocol. It is clear, for example, that using So Far to evaluate llms is a mistake: the same model has large variation between runs, and the difference between runs of different models is not so large. It would be more informative to replace the run of So Far with another run of one of the other games – maybe Plundered Hearts or Lost Pig, which start out more linearly.8 For a Change might look like a good game for evaluation, but I think that’s a mistake. It’s not that the model makes consistent progress, but that it fails to make almost any progress at all, thanks to how open the game is right out of the gate.
Conclusions
I’m not sure what conclusions to draw from this article series.
- We can drive z-code text adventures through Perl, which lets us connect them to an llm in a controlled way. It turned out to be more complicated than one would think, but it’s definitely doable.
- llms are still not great at playing text adventures. Giving them leading questions to keep them on track helps a lot. Giving them hints helps them surprisingly little.
- The variation in how much they accomplish can be large for some games with lots of distracting details, such as Lockout and So Far. The games that are easiest to evaluate with are those with a relatively linear beginning, such as Lost Pig and Plundered Hearts.
- There is one cheap model that is about as good as llm models get at playing text adventures: Gemini 2.5 Flash. Many of the other cheap models might have performance worse than gpt-5 Chat, and probably also worse than Gemini 2.5 Flash. Claude 4 Sonnet might seem like the best model, costs be damned, but that is probably because the prompt was calibrated against Claude 4 Sonnet.
- Running llms in agentic-type applications really burns through api credits like nothing else. I’d really like to complement this analysis with the “how many turns does the model need to get to point X” test, but I cannot justify spending the money on it.