Getting an LLM to Play Text Adventures

I read about a research project where some people investigated how well llms might play text adventures.1 Can Large Language Models Play Text Games Well?; Tsai, Zhou, Liu, Li, Yu, Mei; 2023. They didn’t seem to get very far, because they were using Chatgpt 3.5, which is just not that capable. Someone else has done something similar by wiring gpt-4o-mini directly into the text adventure interpreter, with somewhat disappointing results too.2 There’s also the related research project where they try to determine whether an llm can compute a state transition in a text adventure, i.e. reliably simulate the world. Spoiler: it cannot. It gets it wrong 40 % of the time, which adds up after a few turns.

I figured an llm should be able to do better, but perhaps only if it gets more hand-holding.3 As is often the case with llms, and why I’m not so worried about them taking over everything just yet.

tl;dr: It does not do very well. An llm can get past the most obvious of obstacles, but it doesn’t handle any complicated puzzles, and it easily gets stuck burning api credits on red herrings. Maybe someone else can do this better than I did, but I didn’t find a way to make any of the popular, inexpensive models reliably solve even easy games.4 I draw the line at Claude 3.7 Sonnet in terms of expense.

Prompting an LLM for performance

My intuition says that blindly dumping everything into the llm context would lead to inadvertent context poisoning, which is where the llm assigns too much importance to irrelevant facts, and misses the important ones. It also doesn’t scale since at some point we will run out of context window, even with big context models.5 Humans also don’t play text adventures by reading the entire transcript from top to bottom before issuing each command, but analogising to human processes seems like a classic mistake. Obviously, the llm needs access to its previous command and the result of that command. What if we don’t give it much more than that? Instead, we’ll give it its own short-term memory – a scratchpad – where it gets to decide what to store.

I liked the idea, but some testing later, it was apparent that if the llm itself decides what to save in the scratchpad, it will not record what is important. It will forget most of what matters and pay undue attention to what does not. Instead, maybe we could ask a series of questions to keep the llm focused on what matters?

Here is what the prompt template might look like.

You are playing a parser game, a text adventure. You can interact with the world using mainly verb-noun commands, such as EXAMINE DESK, TAKE SCREWDRIVER, or SEARCH SHED. A single verb might give you more clues, like LISTEN, SMELL, or INVENTORY. To repeat a room description, use LOOK. Sometimes a preposition is needed, as in SIT ON CHAIR or LOOK BEHIND POSTER. Very rarely you need an additional noun phrase, like HIT DOOR WITH AXE. Adverbs are never used; you will never be required to do anything CAREFULLY or THOROUGHLY, for example.

When in doubt, prefer to examine your surroundings to gather clues, rather than trying inventive/creative solutions you happen to think of. If the game asks to disambiguate (“do you mean the X or the Y”) you can copy the exact phrasing used by the game to tell it what it is you want to interact with.

You have issued a command, and what you will be shown is

  1. The scratchpad data you have decided to save. This will include the command that led to the following result.
  2. The result of your previous command.

Using this, you need to determine what the next command should be.

  1. Your saved scratchpad data is:

    SUB_SCRATCHPAD

  2. This is the result of the previous command:

    SUB_RESULT

In your response, answer these questions:

  • Give a log of the past 20 things you have done. Create this by copying the previous log and appending the action and result above, in abbreviated form. If the previous log was 20 items long, drop the first one.
  • Which are some items you have found but not been able to pick up or make use of, but which seem significant?
  • Which are some observations you have made which you have not yet made use of, but which seem significant?
  • Which are some puzzles you have so far been unable to solve?
  • What is your current overarching goal in this game? Unless you have VERY good reason to change this, do not do that. You tend to get lost on wild goose chases when you spuriously update the main goal.
  • What are the sub-goals that might get you there?
  • What have you tried to achieve the current sub-goal?
  • What are your considerations before executing the next command?

Then at the end, your response must contain one line of the form

command: <your command here>

and this will be used as your command.

To test this, we would need to replace SUB_SCRATCHPAD with the previous llm output, and SUB_RESULT with the output of the previous command. We can set up a Perl loop that shells out to Simon Willison’s llm cli tool to automate this. We would then open 9:05 by Adam Cadre in our web browser – this will be our example game throughout this article – and copy–paste between it and the Perl loop. With a decent model, this makes it past the first few hurdles elegantly, because 9:05 is a very easy game.
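
Schematically, one iteration of that loop might look like the sketch below – simplified, and assuming only that the llm cli is installed. The full version, which also talks to the game, is in the appendix.

# Fill in the template, pipe it to the llm cli, and fish the chosen
# command out of the response.
my $prompt = $template;
$prompt =~ s/SUB_SCRATCHPAD/$scratchpad/;
$prompt =~ s/SUB_RESULT/$result/;

open(my $llm_out, '-|', 'llm', '-m', 'claude-3.5-haiku', $prompt) or die $!;
my $response = do { local $/; <$llm_out> };
close($llm_out);

# The whole response becomes the next scratchpad; the command sits on the
# line the prompt asks for.
$scratchpad = $response;
my ($command) = $response =~ /^command: (.*)$/mi;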

Copy–pasting between game and script is boring, but as luck would have it, we just learned how to interact with text adventures through Perl. We need some changes to that script, but they are relatively small and given in the appendix for the curious.

Watching the LLM go

With this in place, we start up and watch it go. The llm reasons its way through the start of 9:05 quite naturally,

  • Current Situation: I’m in a bedroom, just woke up after oversleeping. The phone is ringing. The bedroom has dirty laundry on the floor, and cleaner clothes in the dresser. There’s an end table with a telephone, wallet, and keys. Exits: south to bathroom, east to living room.
  • Goals
    • Answer the phone (seems urgent)
    • Figure out why this is “very bad” (probably late for something important)

The phone is ringing which seems urgent, so I should answer it to find out what’s happening.

command: ANSWER PHONE

From this first interaction, it revises its set of goals.

  • Current Situation: I’m in a bedroom, just woke up after oversleeping. I answered the phone - it was an angry colleague saying I’m late for a 9:00 presentation. I need to get to work immediately for this presentation. Apparently I’m in trouble for “the thing with the printers” too. My boss Bowman will be furious. The bedroom has dirty laundry on the floor, and cleaner clothes in the dresser. There’s an end table with a telephone, wallet, and keys. Exits: south to bathroom, east to living room.
  • Goals
    • ✓ Answer the phone
    • Get dressed quickly
    • Get my wallet and keys
    • Get to work ASAP for the presentation

I need to get moving quickly. First, I should get dressed since I just woke up and am probably in sleepwear. The dresser has clean clothes.

command: GET DRESSED

This is met with its first parser error because GET DRESSED is not a valid command. It correctly understands this error, and finds the right way to continue:

I received a noun error for “GET DRESSED” - this is probably because “dressed” isn’t a specific item I can interact with. I need to be more specific about what clothes I want. The dresser has clean clothes, so I should examine it first to see what specific clothing items are available.

command: EXAMINE DRESSER

It bumbles about through the morning routine this way.

Mistakes the LLM makes

If only text adventures were as easy as the first few turns of 9:05. There are plenty of problems; this is a brief listing. Some of these can be fixed with better prompting, and some appear inherent to llms.

Language models hate to put gold watches on the floor

The prompt shown above asks a lot of leading questions to give the llm assistance on what to pay attention to. Without those questions, llms can get stuck at the enter-the-shower puzzle of 9:05. This “puzzle” has the solution REMOVE CLOTHES. REMOVE WATCH. DROP ALL. – because one should be naked when entering the shower.

However, this implicitly puts the player character’s belongings on the floor. Particularly weaker llms have trouble accepting the floor as a good place to store a gold watch – although even stronger ones fall into this trap every once in a while. Quoting Claude 3.5 Haiku,

I need to find a safe place to put my gold watch before I shower. Let me examine the bathroom fixtures in more detail to see if any of them offer a good spot.

The three main elements in the bathroom are:

  1. Sink
  2. Toilet
  3. Shower

The sink is a natural place to put small items, possibly on its edge or a counter around it. Let me examine it first.

command: EXAMINE SINK

The game responds blandly, “There’s nothing extraordinary about the sink,” whereupon the llm reasons

I need to find a safe place to put my gold watch before I shower. I examined the sink, but there’s nothing extraordinary about it. Let me examine the other bathroom fixtures.

and then it goes around like that, looking at the toilet, then the shower, then back to the sink. Since this earlier version of the prompt left it completely up to the llm to decide what to keep in context, the model (a) did not realise it was stuck in a loop, and (b) started gradually adopting “find a good place for the gold watch” as the main objective of the game, and wouldn’t break out of the loop.

The language model that did not have perfect memory

When the llm got to decide on its own what to remember for the next turn, it seemed to pay undue attention to whatever it had chosen to retain. The game can give an error indicating that the player character does not carry an essential item – but if the scratchpad says that they do, the llm refuses to question its scratchpad information. Instead, it enters a futile loop where it tries the same two actions over and over. The current version of the prompt (which asks the llm to maintain a log of actions) prevents many of these silly short loops.

The opposite problem also occurs: the llm can get stuck because it needs an essential item which is already carried by the player character, but the llm does not realise this. It looks all over the place for that item, except where it is: in the player inventory. Then, one time, it suddenly reasoned

Reflecting on our current situation, I notice we’re experiencing some challenges. […]

Key observation: Our previous commands show a potential “stuck” pattern, with multiple repetitive attempts to enter the door or search generically.

Proposed strategy shift: Instead of continuing to search randomly, we should first confirm our current state and available resources.

command: INVENTORY

And would you believe it! There it was, all along. It did get unstuck on its own, but it took a long time to do so.6 And a long time means many tokens which means many monies. I spent nearly $20 on this article alone, and that’s not counting the next article which will be about evaluating various models. What am I doing with my life?

The example command that was taken too seriously

For a long time, the models seemed obsessed with how they moved between rooms, coming up with complicated ways to refer to doors. Instead of just GO SOUTH, they wanted to ENTER THE SOUTHERN DOOR TO THE BATHROOM and the like. I didn’t understand where this came from, until one of the reasoning sequences included the phrase “since the prompt asks me to enter a door, I will …”.

The first version of the prompt had used ENTER DOOR as an example of how the llm should indicate its selected command. Removing that concrete example made the door obsession stop.

Obsessing over the wrong things

The door obsession stopped, but not obsession more generally. One of the neat properties of the architecture adopted for the Perl script (at least with prompt autosaving – see appendix) is that if we want to give the llm hints, we can stop the process, surgically insert a thought into its scratchpad data as if it had thought of it on its own, restart it, and see how the new thought changes its approach.
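
For illustration, here is a hypothetical way to splice such a thought in programmatically rather than with a text editor, using the autosaved prompt file described in the appendix (the file name is assumed):

# Load the autosaved prompt, tuck a fabricated observation in right after
# the scratchpad header, and write it back before restarting the script.
my $file = '905.z5.prompt';
open(my $fd, '<', $file) or die $!;
my $prompt = do { local $/; <$fd> };
close($fd);

my $hint = "  - Observation: the gold watch is perfectly safe on the floor.\n";
$prompt =~ s/(scratchpad data is:\n)/$1$hint/;

open($fd, '>', $file) or die $!;
print $fd $prompt;
close($fd);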

Unfortunately – though highly interestingly – it generally ignores such externally inserted thoughts. I don’t know why, since it has no way to tell them apart from those it has had on its own, but for some reason adding hints that way changes its approach very little. It might give the new approach one or two half-hearted attempts, but if they don’t work out, then it goes back to obsessing over whatever red herring it was stuck on before.

Trying to be clever with commands

The llm – particularly weaker models – has a tendency to concoct unnecessarily elaborate commands like SEARCH CUBICLE MARKED "HADLEY" CAREFULLY. It claims to do this in an attempt to be specific, and I think that is true, except it is specific in the general English sense, not in the text adventure sense it was never trained for. One of the worst cases was when the model tried (in a different game) to READ INSCRIPTION ON DIAGRAM TEXT. The game replied “What do you want to read in?” because the READ command has a specific syntax in that game. The llm tried to be more specific, and spent several turns on variations:

  • READ INSCRIPTION ON DIAGRAM FROM BOOK
  • READ INSCRIPTION ON DIAGRAM FROM THE IMMENSE BOOK PAGE
  • READ INSCRIPTION ON DIAGRAM FROM THE IMMENSE BOOK PAGE FEATURING THE CIRCULAR DIAGRAM
  • READ INSCRIPTION ON DIAGRAM FROM THE IMMENSE BOOK PAGE THAT DISPLAYS THE CIRCULAR DIAGRAM
  • READ THE FULL, EXACT INSCRIPTION FROM THE IMMENSE BOOK PAGE THAT DISPLAYS THE CIRCULAR DIAGRAM
  • READ THE EXACT WORDS OF THE INSCRIPTION AS PRINTED ON THE IMMENSE BOOK PAGE DISPLAYING THE CIRCULAR DIAGRAM

Needless to say, none of these worked.

The llm – at least the weaker models – frequently fails to refer to things as they are known in the game, instead coming up with its own names for them. For example, when the game mentions a “dresser”, the llm tries to OPEN CLOSET.

Trying to be clever with puzzles

The llm often seems to come up with very creative puzzle solutions instead of solving the puzzles as they are hinted in-game.7 Admittedly, some of these should have worked, but it would be unreasonable to expect a text adventure author to cover all of them! For example, failing to find a screwdriver in Lockout, the llm started trying to figure out how to use the broken torch as a screwdriver, by issuing 50 different commands to examine the torch in various ways.8 This is perhaps not one of those creative solutions that ought to have worked, but maybe levering open the maintenance door with the rod? Or using the throttle lever as a rod somehow? Seems reasonable, but I did not think to implement that.

A related case is when it messes up the syntax of the straightforward solution, e.g. tries OPEN DOOR WITH KEY where the key is not needed to open the door, and then goes on an adventure trying to PUSH DOOR, KICK DOOR, BASH DOOR, SMASH DOOR, RAM DOOR, PULL DOOR, TURN DOOR HANDLE, TWIST DOOR HANDLE, JIGGLE DOOR HANDLE. It continues down that rabbit hole of ever more intricate door interactions (like EXAMINE INSIDE OF DOOR KEYHOLE), never realising it could simply have executed OPEN DOOR.

It also frequently re-solves puzzles because it has trouble remembering the details around them. (Although the latest, guided-by-questions prompt shows this issue less than previous versions.)

In the end, this does not really work

Sure, a decent model (like Claude 4 Sonnet) can get to the obvious ending of 9:05. But it costs something like $1 to get there, and that is a very easy text adventure, one that basically holds the player’s hand all the way to the end. It’s also a well-known text adventure, so it is possible the model has at some point had a walkthrough of it in its training data. Although the llm does recognise at the conclusion that there must be an alternative ending, it does not lend this information great significance when it chooses to restart, so it ends up replaying the same steps again.

I put the same model on Lockout, my first text adventure, and it gets stuck very quickly toying with the more complexly modeled parts of the game, such as the keyboard. It runs all sorts of commands hoping the in-game computer will solve its problems for it. Any progress made seems more like a happy accident than the result of methodical and/or logical thinking. Eventually, I had to stop it because it kept burning through my api credits without getting anywhere meaningful.

Even with a prompt that guides the model to focus on what matters, even when giving it hints, it seems to … just not get it. It has infinite patience, but not much skill. Giving it more command history, or a transcript of more than the last action, does not appear to improve things either.

I have tested this mainly with variants of Anthropic’s Claude, because I’m trained to prompt that model. It would be interesting to see how other models fare. It would excite me a lot to set up a benchmark where different models are scored on how well they get through a standard set of text adventures – but I cannot afford to do that.

Yet! I have come up with a smaller-scale benchmark. A future article will compare some llms against each other on their text adventure prowess (or lack thereof).

Appendix A: Technical issues

I stuffed these subsections here in no particular order to keep the main content from becoming too technical.

The first subsection here is on shelling out to the llm from the Perl script that interacts with text adventures. Then comes what I found to be a funny quirk in how the llm behaved when a bug in the implementation removed all feedback in a specific situation. Then we see what it takes in Perl code to autosave the session for resuming.

Automating the prompting business

The changes we need to make to plug an llm into the script from the previous article are minimal. We might want to make model choice configurable, so we add a command-line parameter for that in the cli argument parsing bit.

GetOptions("reset" => \my $reset, "model:s" => \my $model, "help" => \my $help);
$model //= 'claude-3.5-haiku';
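
Starting a session against a specific model then looks something like this (play.pl is a stand-in name – the actual script and its story-file argument come from the previous article):

perl play.pl --model claude-3.7-sonnet 905.z5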

To keep editing simple, we’ll have a file called prompt.txt which contains the prompt template. We read it in when the script starts up, using a very Perly trick to slurp an entire file: by localising $/ to undef in a do block, we remove any record separator and cause the readline operator <> to read the entire file.9 Then again, we might want to do it every iteration – that would allow us to update the prompt during a running session. But meh.

open(my $template_fd, '<', 'prompt.txt') or die $!;
my $template = do {
    local $/;
    <$template_fd>;
};
close($template_fd);

We add a variable to hold the scratchpad, and complicate the get_command subroutine with a subprocess call to llm.

use IPC::Open3;    # provides open3 (the base script may already import this)

# Variable that holds the output of the previous command. Is likely to be used
# to derive what the next command should be.
my $result = '';
my $scratchpad = '';

# Consume $result and $scratchpad to produce the next command.
sub get_command {
    my $next_prompt = $template;
    $next_prompt =~ s/SUB_RESULT/$result/;
    $next_prompt =~ s/SUB_SCRATCHPAD/$scratchpad/;
    ($result, $scratchpad) = ('', '');

    my $command;

    my $llm = open3(my $lin, my $lout, undef, "llm", "-m", $model);
    print $lin $next_prompt;
    close $lin;
    while (<$lout>) {
        print "| $_";
        $scratchpad .= $_;
        # The prompt asks for a lowercase "command:" line; match it
        # case-insensitively to be safe.
        if (/^command: (.*)$/i) {
            $command = $1;
        }
    }
    close $lout;
    waitpid($llm, 0);    # reap the llm subprocess to avoid zombies

    return $command;
}

And off it goes!

The pipe error that made the language model mad

At one point, 9:05 asks the player if they want to get onto the freeway. The llm answered something like ENTER FREEWAY, which is a reasonable command. However, this is one of those locations where the game is expecting a yes or no answer. In those instances, the game prints a parser error and a prompt at the same time, saying something like

[Answer yes or no.]>

There was a bug in the piping code that rendered this prompt invisible to the llm, meaning it appeared to the llm as though any command it typed yielded no response from the game at all.10 Unless it would have successfully guessed that YES was the command to type, but the hints for this had long scrolled out of its single-prompt context window by the time it might have thought to try that. At this point, it started getting increasingly confused and desperate. The repeated lack of response made it lose all context from its scratchpad, eventually ending up with something like

It seems there might be an issue with the game initialization or the “look” command I attempted to troubleshoot. There’s no description in the result section, which means either:

  1. The game hasn’t properly started yet
  2. There was an error processing my command
  3. The game expects a different initial command

Since this is a text adventure, I should try the most basic commands to get started. Let me try to examine myself to gain some context about who I am in this game world. This is a standard approach when beginning a text adventure.

  • Location: Unknown (need to establish)
  • Inventory: Unknown (need to establish)
  • Goals: To be determined based on game context

It gets increasingly frustrated from there, trying commands like HELP, START, BEGIN, RESTART, etc., but never getting out of the situation. Fixing the bug in the Perl script let it escape right away.

Restoring previous sessions

As it stands, the implementation can autosave the game state (through Bocfel configuration) but not the llm prompt. This is annoying, because it is convenient to be able to resume sessions after finding and fixing bugs. Given the stateless nature of our llm interactions, this is fairly easy. If an earlier session has left a saved prompt behind, we read that in when the script starts up.

# Read in previous prompt if one exists.
my $previous_prompt = undef;
if (-f "$storyfile.prompt") {
    open(my $prompt_fd, '<', "$storyfile.prompt") or die $!;
    $previous_prompt = do {
        local $/;
        <$prompt_fd>;
    };
    close($prompt_fd);
}

Then where we handle the $reset flag, we must also make sure to clear any previous prompt.

if ($reset) {
    # ----------->8---
    $previous_prompt = undef;
    # ----------->8---
}
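
We could also delete the autosaved prompt file itself at this point – a small addition of my own, strictly redundant since the file is overwritten on the next turn, but it keeps an aborted session from leaving a stale prompt behind:

# Also remove the saved prompt from disk on reset.
unlink("$storyfile.prompt") if -f "$storyfile.prompt";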

and finally, when we construct the next prompt, we use any previous prompt if one exists, and save the newly constructed prompt every turn.

sub get_command {
    my $next_prompt = $template;
    if ($previous_prompt) {
        $next_prompt = $previous_prompt;
        $previous_prompt = undef;
    } else {
        $next_prompt =~ s/SUB_RESULT/$result/;
        $next_prompt =~ s/SUB_SCRATCHPAD/$scratchpad/;
    }
    ($result, $scratchpad) = ('', '');

    # Save this prompt to be able to resume later.
    open(my $prompt_fd, '>', "$storyfile.prompt") or die $!;
    print $prompt_fd $next_prompt;
    close($prompt_fd);

    # ----------->8---

Tadaa!