VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang; Karthik R. Narasimhan; Ofir Press; Thomas L. Griffiths

arxiv: 2505.18134 · v3 · pith:BA3E4RBKnew · submitted 2025-05-23 · 💻 cs.AI · cs.CL · cs.CV

Pith reviewed 2026-05-19 12:42 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV

keywords vision-language models · video game benchmark · perception · spatial navigation · memory management · real-time interaction · generalization

The pith

Frontier vision-language models complete only 0.48 percent of VideoGameBench and 1.6 percent of its paused Lite version.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoGameBench as a test of whether current vision-language models can handle tasks that feel natural to humans, such as perceiving game states, navigating spaces, and remembering prior events. It does this by letting models play ten popular 1990s video games using nothing but raw screen images and a short description of the goals and buttons. Three of the games stay hidden from the developers so that any solution must work on truly new environments rather than memorized tricks. Experiments with the strongest available models show they almost never move past the first few screens, and the main practical obstacle is that the models take too long to decide on each action.

Core claim

VideoGameBench requires models to finish entire games from raw pixel input and high-level objective descriptions alone, without game-specific code or extra sensors. The best frontier models reach only 0.48 percent completion in the live real-time setting and 1.6 percent when the game pauses for each decision. The benchmark therefore shows that current vision-language models do not yet exhibit the perception, navigation, and memory skills that the games were designed to reward in human players.

What carries the argument

VideoGameBench, a collection of ten 1990s games where models receive only screen frames plus a high-level goal and control summary and must output actions in real time.

If this is right

Current models cannot yet match human performance on intuitive real-time tasks that rely on spatial awareness and short-term recall.
Real-time latency remains a practical barrier that must be solved before models can interact fluidly with dynamic visual environments.
Generalization to completely unseen games stays out of reach for today's frontier systems.
Benchmarks built around familiar human games can expose gaps that coding or math tests miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the gap persists, future models may need explicit memory buffers or faster decision loops to close it.
The same setup could be reused to test whether adding human-like inductive biases improves results on other interactive visual tasks.
Low scores on hidden games suggest that progress will require methods that transfer across environments rather than tuning to specific titles.

Load-bearing premise

Raw screen images plus a short goal description are enough to measure perception, navigation, and memory rather than being blocked mainly by unfamiliar button mappings or game-specific quirks.

What would settle it

A model that finishes more than a few percent of the games under identical raw-visual and high-level-instruction conditions would show the claimed limitation does not hold.

Figures

Figures reproduced from arXiv: 2505.18134 by Alex L. Zhang, Karthik R. Narasimhan, Ofir Press, Thomas L. Griffiths.

**Figure 2.** To track progress on VideoGameBench, we scrape deterministic checkpoints from online …

**Figure 3.** To determine when a run ends in VideoGameBench, we provide a bound of …

**Figure 4.** VideoGameBench features a set of 20 video games from the MS-DOS and Game Boy …

**Figure 5.** VideoGameBench checkpoint lengths. We show the length of each game walkthrough and the position of each checkpoint as a black divider. Checkpoints in VideoGameBench are mapped to the timestamp it was scraped from in an online walkthrough video to determine the percentage of the game that was completed.

**Figure 6.** An example screen of the Location Clicking Game. A VG-Agent using a VLM is tasked with clicking 10 green circles, one at a time, in under 250 actions. The most basic action in any DOS game is to click a position on the screen. The Location Clicking Game is a simple task where an agent must click a green circle with radius 40px that randomly generates inside a 640px by 400px region on the browser (this repl…

**Figure 7.** An example screen of the Mouse Dragging Game. A VG-Agent using a VLM is tasked with dragging a red circle to a target green circle while staying on the black line.

**Figure 8.** An example level in the 2D Navigation Game. The task is to use the arrow keys (mouse + keyboard interface on VideoGameBench) to move a red square to the green square in a small grid-world maze. … agents, for VLMs it is not obvious that this task is easily solved. We generate 10 pre-defined mazes where the agent must move a red square to the green square in a small maze-like environment using the arrow keys. …
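
The Figure 5 caption describes progress as the walkthrough timestamp of the last checkpoint reached, taken as a share of the full walkthrough length. A minimal sketch of that arithmetic follows. The function name, argument names, and example timestamps are illustrative assumptions, not values or code from the paper.

```python
def completion_percent(checkpoint_times_s, reached, walkthrough_length_s):
    """Percent of a walkthrough covered by the last checkpoint reached.

    checkpoint_times_s: walkthrough timestamps (seconds) of each checkpoint,
        in the order they occur (hypothetical input format).
    reached: how many checkpoints the agent reached, in order (0 = none).
    walkthrough_length_s: total walkthrough length in seconds.
    """
    if walkthrough_length_s <= 0:
        raise ValueError("walkthrough length must be positive")
    if not 0 <= reached <= len(checkpoint_times_s):
        raise ValueError("reached must be between 0 and the checkpoint count")
    if reached == 0:
        return 0.0
    # The last checkpoint reached marks how far into the walkthrough the agent got.
    return 100.0 * checkpoint_times_s[reached - 1] / walkthrough_length_s


# Made-up example: three checkpoints in a 20-minute walkthrough.
example_checkpoints = [60, 300, 900]
print(completion_percent(example_checkpoints, 1, 1200))
```

Under this reading, an agent that only reaches an early checkpoint in a long game scores a small percentage even though it made visible progress, which is consistent with the very low aggregate numbers reported.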

Original abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
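
The difference between the real-time setting and VideoGameBench Lite can be illustrated with a small calculation: in real time the game keeps advancing while the model decides, whereas in Lite it waits. The sketch below only counts elapsed frames. The function name and the frame-rate and latency figures are assumptions chosen for illustration, not measurements from the paper.

```python
import math


def frames_elapsed_during_decision(latency_s, fps, paused):
    """Game frames that pass while the model is choosing its next action.

    latency_s: seconds the model takes to respond (assumed value).
    fps: game frame rate (assumed value).
    paused: True models the Lite setting, where the game waits for the action.
    """
    if latency_s < 0 or fps <= 0:
        raise ValueError("latency must be non-negative and fps positive")
    if paused:
        return 0
    return math.floor(latency_s * fps)


# Illustrative numbers only: a 3-second response at 30 frames per second.
print(frames_elapsed_during_decision(3.0, 30.0, paused=False))
print(frames_elapsed_during_decision(3.0, 30.0, paused=True))
```

In the unpaused case the action is applied to a game state that is already many frames old, which is one plausible way to read the abstract's statement that inference latency is a major limitation.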

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoGameBench, a benchmark of 10 popular 1990s video games in which VLMs must complete entire games using only raw visual inputs and high-level descriptions of objectives and controls, without game-specific scaffolding. Frontier models complete only 0.48% of the full real-time benchmark and 1.6% of the paused VideoGameBench Lite variant, with inference latency identified as a key limitation; three games are kept secret to test generalization.

Significance. If the empirical results hold after addressing methodological gaps, the benchmark offers a useful new testbed for human-like capabilities such as perception, spatial navigation, and memory in real-time settings. The Lite variant and secret-game design are constructive additions that could motivate progress on generalization and latency-aware agents.

major comments (2)

[Abstract and Section 3] Abstract and benchmark description paragraph: the claim that raw visual input plus high-level objective/control descriptions suffice to isolate perception, spatial navigation, and memory management (rather than being dominated by control mapping or game-specific mechanics) is load-bearing for the headline 0.48%/1.6% completion rates. Section 3's interaction protocol defines free-form text outputs mapped to discrete inputs, but without explicit grounding verification, timing analysis, or per-game error breakdown, failures could stem from action translation difficulties instead of the targeted skills.
[Experiments] Experiments and results sections: the concrete completion percentages are reported without full methods details, variance across runs, or error analysis (e.g., breakdown by game or failure mode). This weakens assessment of whether the results robustly support the central claim about VLM limitations.
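
The first major comment turns on whether failures come from action translation or from the targeted skills. One way to separate the two is to log every model reply that cannot be mapped to a valid input. The sketch below assumes a JSON reply with an `action_input` field and a Game Boy style button set; the field name, button list, and error labels are assumptions for illustration, not the paper's documented interface.

```python
import json

# Assumed button vocabulary for a Game Boy style game.
VALID_BUTTONS = {"A", "B", "START", "SELECT", "UP", "DOWN", "LEFT", "RIGHT"}


def ground_action(reply):
    """Map a model reply to a list of buttons, or report why it failed.

    Returns (buttons, error); error is None when the reply maps cleanly.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return [], "unparseable"
    if not isinstance(data, dict) or not isinstance(data.get("action_input"), str):
        return [], "unparseable"
    buttons = [b.strip().upper() for b in data["action_input"].split(",") if b.strip()]
    if not buttons:
        return [], "empty"
    if any(b not in VALID_BUTTONS for b in buttons):
        return [], "unknown_button"
    return buttons, None


print(ground_action('{"thought": "climb the ledge", "action_input": "up, a"}'))
print(ground_action("jump over the gap"))
```

Counting the non-`None` error labels per game would give the per-game translation failure rate that the referee asks for, independent of any perception or navigation errors.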

minor comments (2)

[Benchmark description] Clarify the exact list of 10 games and how the three secret games are selected and used to evaluate generalization.
[Results] Add a table or figure summarizing per-game completion rates and average progress metrics to make the aggregate percentages more interpretable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point-by-point below. Where the feedback identifies areas for improvement, we have revised the manuscript to strengthen the presentation of our methods and results.

Referee: [Abstract and Section 3] Abstract and benchmark description paragraph: the claim that raw visual input plus high-level objective/control descriptions suffice to isolate perception, spatial navigation, and memory management (rather than being dominated by control mapping or game-specific mechanics) is load-bearing for the headline 0.48%/1.6% completion rates. Section 3's interaction protocol defines free-form text outputs mapped to discrete inputs, but without explicit grounding verification, timing analysis, or per-game error breakdown, failures could stem from action translation difficulties instead of the targeted skills.

Authors: We appreciate the referee's emphasis on ensuring the benchmark isolates the intended capabilities. Section 3 describes a deliberately simple mapping from free-form model outputs to discrete game inputs, supported by the high-level control descriptions provided to each model. To address the concern directly, the revised manuscript adds an explicit subsection on the action grounding process, including verification steps and concrete examples of text-to-action mappings across games. We have also added timing analysis showing that inference latency dominates any mapping overhead by orders of magnitude, and per-game error breakdowns demonstrating that the large majority of failures occur in perception, navigation, and memory rather than action translation. These changes support the claim that the reported completion rates reflect limitations in the targeted skills.

revision: yes

Referee: [Experiments] Experiments and results sections: the concrete completion percentages are reported without full methods details, variance across runs, or error analysis (e.g., breakdown by game or failure mode). This weakens assessment of whether the results robustly support the central claim about VLM limitations.

Authors: We agree that greater methodological transparency and statistical detail would strengthen the results. The revised experiments section now includes expanded methods descriptions covering the full evaluation protocol, reports completion rates with variance across multiple independent runs (including standard deviations), and provides a detailed error analysis broken down both by individual game and by failure mode (perception, spatial reasoning, memory, and action execution). This analysis indicates that the observed limitations are consistent with the human-like capabilities the benchmark targets.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent experimental results

The paper presents VideoGameBench as a new empirical benchmark for VLMs, reporting direct experimental outcomes (0.48% completion on the full benchmark and 1.6% on Lite) from model interactions with raw pixels plus high-level objective/control descriptions. No derivation chain, equations, parameter fitting, or first-principles claims exist that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation load-bearing steps. The central results are falsifiable experimental measurements rather than constructed equivalences, and any self-citations (if present) are not invoked to justify uniqueness theorems or ansatzes that carry the main claims. This is a standard self-contained benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that models receive only raw visuals and high-level descriptions; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Video games from the 1990s leverage innate human inductive biases for perception, navigation, and memory.
Invoked to justify the choice of testbed in the abstract.

pith-pipeline@v0.9.0 · 5820 in / 1218 out tokens · 52171 ms · 2026-05-19T12:42:17.856342+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls... The best performing models... complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

We introduce a novel method for detecting an agent’s game progress: we scrape YouTube walkthroughs... and use perceptual hashing... to detect how much of the game it completed.
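
The quoted passage names perceptual hashing but not a specific algorithm. As a stand-in, the sketch below uses a simple average hash over an already-downsampled grayscale frame and compares hashes by Hamming distance. The function names, the distance threshold, and the tiny 2×2 frames are assumptions for illustration and should not be read as the paper's implementation.

```python
def average_hash(gray):
    """Average hash of a small grayscale frame given as a 2D list of intensities.

    Each bit is 1 when the pixel is brighter than the frame's mean intensity.
    A real pipeline would first downsample the screenshot (assumed done here).
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches_checkpoint(frame, checkpoint, max_distance=1):
    """True when the frame's hash is within max_distance bits of the checkpoint's."""
    return hamming(average_hash(frame), average_hash(checkpoint)) <= max_distance


# Made-up 2x2 frames: a checkpoint image, a near-identical frame, and an inverted one.
checkpoint = [[10, 200], [220, 30]]
similar = [[12, 198], [221, 29]]
inverted = [[200, 10], [30, 220]]
print(matches_checkpoint(similar, checkpoint))
print(matches_checkpoint(inverted, checkpoint))
```

Because small pixel-level differences leave the hash unchanged, this kind of comparison tolerates minor rendering noise while still distinguishing clearly different screens, which is the property a checkpoint detector needs.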

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
cs.AI 2026-05 unverdicted novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

[1] [1]

Competition-level code generation with

URLhttps://arxiv.org/abs/2503.04094. Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Ja´skowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning, 2016. URLhttps://arxiv. org/abs/1605.02097. Heinrich Küttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, an...

work page doi:10.1126/science.abq1158 2016

[2] [2]

game over

URLhttps://openreview.net/forum?id=pmcFzuUxsP. Together AI. Together ai partners with meta to offer llama 4: Sota multimodal moe models. https: //www.together.ai/blog/llama-4, April 2025. Chen Feng Tsai, Xiaochen Zhou, Sierra S. Liu, Jing Li, Mo Yu, and Hongyuan Mei. Can large language models play text games well? current state-of-the-art and open questio...

work page arXiv 2025

[3] [3]

Think: Analyze the current state and decide what to do next

work page

[4] [4]

shift+ctrl

Action: Choose one of the following actions: - click [options as action_input]: Click the mouse at the current mouse position. Options include: * right: Right click instead of left click (default is left click) * shift: Hold shift while clicking * ctrl: Hold ctrl while clicking * alt: Hold alt while clicking Multiple modifiers can be combined with +, e.g....

work page

[5] [5]

KeyA", "KeyB

Observation: You will receive the result of your action You will interact with the game via the keyboard and mouse actions. To help you with mouse actions, we provide a thin red grid overlay that intersects the screen at 100x100 pixel intervals (labelled with coordinates divided by 100). I also added 4 blue dots 25 pixels away in each direction with their...

work page

[6] [6]

My opponent is trying to block my path, I should be wary.\n

work page

[7] [7]

Farms make my units stronger

work page

[8] [8]

} Another example of right clicking: {

The M button is to move units." } Another example of right clicking: { "thought": "I need to right click on the search box", "action": "click", "action_input": "right", "memory": "" } Or for keyboard actions: { "thought": "I need to move the character left in the game", "action": "press_key", "action_input": "ArrowLeft", "memory": "The character moves fas...

Ground troops cannot walk through water (the blue regions), mountains, or other obstacles

End your turn when you’re finished with what you want to do

Each unit moves 1 tile. So if you want to move another unit, move the selected unit first

The Need for Speed

In the beginning, a good strategy is to just explore and have your units move around and explore unseen areas.
General Controls
Mouse: Click to select units, cities, and menu options. Right-click may provide additional info.
Keyboard Shortcuts:
Movement: Arrow keys (or Numpad) to move selected unit.
End Turn: Enter key.
Access City Menu: Click on a city o...

Examine the puzzle goal carefully
Study the available objects
Consider how objects will interact
Test your solution in parts
Make small adjustments for timing
Watch for unintended interactions

Pokemon Crystal

Use gravity to your advantage

Remember:
- There are often multiple solutions to each puzzle
- Timing is crucial for many puzzles
- Some objects may not be needed
- Pay attention to object orientation
- Chain reactions should flow naturally
- Save working parts while experimenting with others

If stuck:
- Reset and try a different approach
- Watch how objec...

What action would be most appropriate?

What buttons need to be pressed to take that action?
Available buttons: A, B, START, SELECT, UP, DOWN, LEFT, RIGHT
Tips:

To get past any menu or typing screen, press START or START, A when you are done. No matter where your arrow is on the screen, it’ll go to the end

When trainers see you, they will want to battle

In a Pokemon battle, you attack your enemies and you lose if your Pokemon all reach 0 HP

When typing a name, just press A twice to exit when your name is full. Don’t go right then A

Wild Pokemon appear randomly when walking in tall grass, caves, or while surfing

During battles (using the movement keys to move icons):
- Choose "FIGHT" to use your Pokemon’s moves
- Choose "BAG" to use items like Potions or Pokeballs
- Choose "POKEMON" to switch to a different Pokemon
- Choose "RUN" to attempt escaping from wild Pokemon battles

Type advantages are crucial: Water beats Fire, Fire beats Grass, Grass beats Water

Use Pokemon Centers (buildings with red roofs) to heal your Pokemon for free

Buy supplies like Pokeballs and Potions at PokeMarts (buildings with blue roofs)

Hurt me plenty

Read dialogue and continue by pressing ’A’. Each movement key (e.g. UP, DOWN, LEFT, RIGHT) will move your character (with the hat) one tile in that direction. Keep that in mind, and calculate where to go based on what you want to do. You can interact with people (you should to get information and also proceed in the game) using the A button by standing ne...

Look for doors, which will be in the corridors and have some kind of writing on the door (e.g. UAC is a door). You can open them! Try aligning yourself so the door is centered on your screen, then walk up to it. When you’re pressed against the door, press space to open it

Doors usually have blue triangles (they themselves are not doors) near them on the sides, and it will be obvious you can open it

You need to be directly in front of the door and press ’space’ to open it, you cannot be far away to open it. Don’t just go backwards because you’re not sure

If you get stuck on a wall or moving against a wall, try taking a few steps back and re-adjusting your thoughts

Kirby’s Dream Land

If there are a lot of enemies or you are being shot at, try strafing around and moving a lot side to side to avoid getting fired at while also aiming and shooting. Remember exactly what direction you were turning so you don’t make redundant movements. Use the repeated key presses to turn. You can also move your character to adjust your aim. If YOU SHOT AN...

Assess the current screen:
- What enemies or obstacles are present?
- Is Kirby on the ground or in the air?
- Are there any platforms or doorways?
- Is there a boss battle happening?

Consider your options:
- Do you need to avoid enemies?
- Should you inhale enemies to use as projectiles?
- Is flying a better option than walking?
- Are there items or power-ups to collect?

Plan your next action and execute using the available controls:
MOVEMENT CONTROLS:
- LEFT/RIGHT on Control Pad: Move Kirby left/right
- UP on Control Pad: Enter doorways or fly upward
- DOWN on Control Pad: Crouch and swallow inhaled enemies
ACTION BUTTONS:
- A Button: Jump
- B Button: Inhale enemies/objects or spit them out as projectiles
- START Button:...

For things that say "IN" or black doors / light doors, Kirby has to go into it to go into a room. Don’t just hover above it

Shining stars (called warp stars) are the end of the level, and transition you further into the game. Kirby has to go into it or step on it

The Legend of Zelda: Link’s Awakening (DX)

Do not hit enemies directly, or Kirby will take damage. Spit out enemies (not bosses) or items like bombs back to damage your enemies! Kirby is a classic platformer, so you generally should continue to the right to progress in the game. Respond with a clear sequence of actions, explaining your reasoning for each decision. Available buttons: A, B, START...

What is happening in the current screen?
Are there enemies, NPCs, or interactive objects?
What action would help progress in the game?

Enter","Ctrl,C

What buttons need to be pressed to take that action? You cannot move if dialogue is on the screen until you finish it, so keep pressing A until it is over. Available buttons: A, B, START, SELECT, UP, DOWN, LEFT, RIGHT

C VideoGameBench Details

Figure 4: VideoGameBench features a set of 20 video games from the MS-DOS and Game Boy platforms that VLMs are tas...

The agent quits the emulator or puts the game in a “locked” state. In DOS or GBA games, if the agent quits the entire game (which requires multiple steps), it is unable to restart the game.

The game provides the agent multiple lives, and the agent reaches a “Game Over” screen by losing them all (e.g. Kirby’s Dream Land and Super Mario Land). Otherwise, if the agent loses in the same location more than three times, we also end the run.

The agent is “stuck” (i.e. the screen is exactly the same) for more than 100 steps. Because the context window covers only 20 steps, the probability of getting unstuck conditioned on this context is low.
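The stuck criterion can be sketched as a counter over consecutive identical frames. This is only an illustration under stated assumptions: the class name is hypothetical, frames are assumed to be available as raw bytes, "exact same screen" is approximated by comparing SHA-256 digests, and the first appearance of a screen is assumed not to count toward the 100 steps.

```python
import hashlib


class StuckDetector:
    """Flags a run once the screen has been unchanged for more than `limit` steps."""

    def __init__(self, limit=100):
        self.limit = limit
        self._last_digest = None
        self.unchanged_steps = 0

    def update(self, frame: bytes) -> bool:
        """Record one frame; return True when the run should be ended."""
        digest = hashlib.sha256(frame).digest()
        if digest == self._last_digest:
            # Same screen as the previous step.
            self.unchanged_steps += 1
        else:
            # Screen changed, so restart the count from this frame.
            self._last_digest = digest
            self.unchanged_steps = 0
        return self.unchanged_steps > self.limit
```

Calling `update` once per agent step with the current screenshot bytes is enough; any change in the screen resets the counter.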

The agent loses without damaging any enemies, indicating little progress can be made in multiple repeated trials

the enemy is still alive. I need to adjust aim again and fire

The agent uses $30 (∼2000 steps) without reaching a new checkpoint. This was mainly to avoid long loops like in Doom II where the agent repeatedly revisits the same locations.

D.2 Main VideoGameBench Experiments Cost

We report the cost per experiment of Table 2. Discrepancies in costs do not necessarily reflect differences in model costs – rather, some ru...
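As a rough illustration of the spending criterion stated earlier ($30, about 2000 steps, without a new checkpoint), the sketch below tracks spend since the last checkpoint and derives the implied average cost per step. The class and function names are hypothetical, per-step costs are assumed to be known in dollars, and ending the run when spend reaches the cap (rather than strictly exceeds it) is an assumption.

```python
class BudgetGuard:
    """Tracks dollars spent since the last checkpoint against a cap."""

    def __init__(self, cap_usd=30.0):
        self.cap_usd = cap_usd
        self.spent_since_checkpoint = 0.0

    def record_step(self, cost_usd: float) -> bool:
        """Add one step's cost; return True when the run should be ended."""
        self.spent_since_checkpoint += cost_usd
        return self.spent_since_checkpoint >= self.cap_usd

    def reached_checkpoint(self) -> None:
        """Reset the counter when the agent reaches a new checkpoint."""
        self.spent_since_checkpoint = 0.0


def implied_cost_per_step(cap_usd=30.0, approx_steps=2000):
    """Average dollars per step implied by the stated cap and step count."""
    return cap_usd / approx_steps
```

With the stated figures, the implied average is about $0.015 per step, although actual per-step cost will vary with the model and the length of its responses.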

**Press "A" repeatedly** to complete the dialogue

**Ensure the shield is obtained and visible** in Link’s inventory

A", # Initiate or continue dialogue

Once the shield is confirmed, plan to exit the house.

Reflection (after image)
- Must confirm shield acquisition before leaving the house.
- Pressing "A" multiple times ensures the dialogue is complete.
- Next objective is to explore outside once the shield is obtained.
- Focus on completing the dialogue and shield acquisition.
- Ensure shield acquisition...

**Current Situation**: Still trying to exit the lab

**Objective**: Leave the lab to head towards Mr. Pokémon’s house

**Analysis**: Ensure proper alignment with the exit carpet and attempt to move down.

Reflection (before image)
- Received Cyndaquil as my starter Pokémon.
- Skipped the nickname option to proceed quickly.
- Objective: Visit Mr. Pokémon as directed by PROF. ELM.
- Need to exit the lab and start the journey towards Mr. Pokémon’s location.
- Route to Mr. Pok...

**Current Situation**: Stuck at the lab exit

**Objective**: Exit the lab and start the journey to Mr. Pokémon’s house

**Analysis**: Confirm alignment with the exit and check potential obstacles.

Reflection (after image)
- Need to ensure proper alignment and navigation to exit buildings.
- Current focus is on successfully leaving the lab to progress the storyline.
- Stuck at the exit requires paying attention to visual cues for proper movement.

Action (after image)
UP,...
