VideoGameBench: Can Vision-Language Models complete popular video games?
Pith reviewed 2026-05-19 12:42 UTC · model grok-4.3
The pith
Frontier vision-language models complete only 0.48 percent of VideoGameBench and 1.6 percent of its paused Lite version.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoGameBench requires models to finish entire games from raw visual input and a high-level description of objectives and controls, without game-specific scaffolding or auxiliary information. The best frontier models complete only 0.48 percent of the benchmark in the real-time setting and 1.6 percent in the Lite setting, where the game pauses while the model decides on its next action. The benchmark therefore indicates that current vision-language models do not yet exhibit the perception, spatial navigation, and memory skills that these games reward in human players.
What carries the argument
VideoGameBench, a collection of ten 1990s games where models receive only screen frames plus a high-level goal and control summary and must output actions in real time.
If this is right
- Current models cannot yet match human performance on intuitive real-time tasks that rely on spatial awareness and short-term recall.
- Real-time latency remains a practical barrier that must be solved before models can interact fluidly with dynamic visual environments.
- Generalization to completely unseen games stays out of reach for today's frontier systems.
- Benchmarks built around familiar human games can expose gaps that coding or math tests miss.
Where Pith is reading between the lines
- If the gap persists, future models may need explicit memory buffers or faster decision loops to close it.
- The same setup could be reused to test whether adding human-like inductive biases improves results on other interactive visual tasks.
- Low scores on hidden games suggest that progress will require methods that transfer across environments rather than tuning to specific titles.
Load-bearing premise
Raw screen images plus a short goal description are enough to measure perception, navigation, and memory rather than being blocked mainly by unfamiliar button mappings or game-specific quirks.
What would settle it
A model that finishes more than a few percent of the games under identical raw-visual and high-level-instruction conditions would show the claimed limitation does not hold.
Original abstract
Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoGameBench, a benchmark of 10 popular 1990s video games in which VLMs must complete entire games using only raw visual inputs and high-level descriptions of objectives and controls, without game-specific scaffolding. Frontier models complete only 0.48% of the full real-time benchmark and 1.6% of the paused VideoGameBench Lite variant, with inference latency identified as a key limitation; three games are kept secret to test generalization.
Significance. If the empirical results hold after addressing methodological gaps, the benchmark offers a useful new testbed for human-like capabilities such as perception, spatial navigation, and memory in real-time settings. The Lite variant and secret-game design are constructive additions that could motivate progress on generalization and latency-aware agents.
major comments (2)
- [Abstract and Section 3] Abstract and benchmark description paragraph: the claim that raw visual input plus high-level objective/control descriptions suffice to isolate perception, spatial navigation, and memory management (rather than being dominated by control mapping or game-specific mechanics) is load-bearing for the headline 0.48%/1.6% completion rates. Section 3's interaction protocol defines free-form text outputs mapped to discrete inputs, but without explicit grounding verification, timing analysis, or per-game error breakdown, failures could stem from action translation difficulties instead of the targeted skills.
- [Experiments] Experiments and results sections: the concrete completion percentages are reported without full methods details, variance across runs, or error analysis (e.g., breakdown by game or failure mode). This weakens assessment of whether the results robustly support the central claim about VLM limitations.
minor comments (2)
- [Benchmark description] Clarify the exact list of 10 games and how the three secret games are selected and used to evaluate generalization.
- [Results] Add a table or figure summarizing per-game completion rates and average progress metrics to make the aggregate percentages more interpretable.
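The per-game summary requested in the comments above could be tabulated along the following lines. This is a minimal sketch, not the authors' evaluation code: the game names, the run values, and the assumption that the headline score is an unweighted mean of per-game completion fractions are all hypothetical placeholders introduced for illustration.

```python
from statistics import mean, stdev


def summarize(runs_by_game):
    """Return per-game (mean, std) and the unweighted mean across games.

    runs_by_game maps a game name to a list of completion fractions in [0, 1],
    one per independent run. The std is the sample standard deviation, or 0.0
    when only a single run is available.
    """
    per_game = {}
    for game, runs in runs_by_game.items():
        if not runs:
            raise ValueError(f"no runs recorded for {game!r}")
        spread = stdev(runs) if len(runs) > 1 else 0.0
        per_game[game] = (mean(runs), spread)
    overall = mean([m for m, _ in per_game.values()])
    return per_game, overall


def format_table(per_game, overall):
    """Render the summary as a fixed-width plain-text table in percent."""
    lines = [f"{'game':<12}{'mean %':>10}{'std %':>10}"]
    for game, (m, s) in sorted(per_game.items()):
        lines.append(f"{game:<12}{100 * m:>10.2f}{100 * s:>10.2f}")
    lines.append(f"{'overall':<12}{100 * overall:>10.2f}")
    return "\n".join(lines)


# Hypothetical placeholder data, NOT results from the paper.
example = {"game_a": [0.0, 0.02, 0.01], "game_b": [0.0, 0.0, 0.0]}
per_game, overall = summarize(example)
print(format_table(per_game, overall))
```

Reporting the spread next to each mean would make it easier to tell whether a sub-1% aggregate reflects uniformly low progress or a few runs that advanced in a single game.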
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point-by-point below. Where the feedback identifies areas for improvement, we have revised the manuscript to strengthen the presentation of our methods and results.
Point-by-point responses
-
Referee: [Abstract and Section 3] Abstract and benchmark description paragraph: the claim that raw visual input plus high-level objective/control descriptions suffice to isolate perception, spatial navigation, and memory management (rather than being dominated by control mapping or game-specific mechanics) is load-bearing for the headline 0.48%/1.6% completion rates. Section 3's interaction protocol defines free-form text outputs mapped to discrete inputs, but without explicit grounding verification, timing analysis, or per-game error breakdown, failures could stem from action translation difficulties instead of the targeted skills.
Authors: We appreciate the referee's emphasis on ensuring the benchmark isolates the intended capabilities. Section 3 describes a deliberately simple mapping from free-form model outputs to discrete game inputs, supported by the high-level control descriptions provided to each model. To address the concern directly, the revised manuscript adds an explicit subsection on the action grounding process, including verification steps and concrete examples of text-to-action mappings across games. We have also added timing analysis showing that inference latency dominates any mapping overhead by orders of magnitude, and per-game error breakdowns demonstrating that the large majority of failures occur in perception, navigation, and memory rather than action translation. These changes support the claim that the reported completion rates reflect limitations in the targeted skills. revision: yes
-
Referee: [Experiments] Experiments and results sections: the concrete completion percentages are reported without full methods details, variance across runs, or error analysis (e.g., breakdown by game or failure mode). This weakens assessment of whether the results robustly support the central claim about VLM limitations.
Authors: We agree that greater methodological transparency and statistical detail would strengthen the results. The revised experiments section now includes expanded methods descriptions covering the full evaluation protocol, reports completion rates with variance across multiple independent runs (including standard deviations), and provides a detailed error analysis broken down both by individual game and by failure mode (perception, spatial reasoning, memory, and action execution). This analysis indicates that the observed limitations are consistent with the human-like capabilities the benchmark targets. revision: yes
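The action-grounding check discussed in the first response above could be sketched as follows. This is an illustrative assumption, not the benchmark's actual interface: the button set is a Game Boy-style guess, and the function names `ground_actions` and `translation_failure_rate` are hypothetical. The idea is to count how often a model's free-form output contains tokens that cannot be mapped to any valid input, which would separate action-translation failures from perception or navigation failures.

```python
import re

# Assumed Game Boy-style button set; the real action space may differ per game.
VALID_BUTTONS = {"A", "B", "START", "SELECT", "UP", "DOWN", "LEFT", "RIGHT"}


def ground_actions(model_output):
    """Split free-form text on commas/whitespace and classify each token.

    Returns (accepted, rejected): upper-cased tokens that match a known
    button, and upper-cased tokens that do not.
    """
    tokens = [t for t in re.split(r"[,\s]+", model_output.strip()) if t]
    accepted, rejected = [], []
    for token in tokens:
        name = token.upper()
        if name in VALID_BUTTONS:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


def translation_failure_rate(outputs):
    """Fraction of outputs with an ungroundable token or no valid action."""
    if not outputs:
        return 0.0
    failures = 0
    for out in outputs:
        accepted, rejected = ground_actions(out)
        if rejected or not accepted:
            failures += 1
    return failures / len(outputs)
```

A low failure rate under a check like this would support the claim that poor completion is not mainly an action-translation problem. A high rate would point the other way.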
Circularity Check
No circularity: empirical benchmark evaluation with independent experimental results
Full rationale
The paper presents VideoGameBench as a new empirical benchmark for VLMs, reporting direct experimental outcomes (0.48% completion on the full benchmark and 1.6% on Lite) from model interactions with raw pixels plus high-level objective/control descriptions. No derivation chain, equations, parameter fitting, or first-principles claims exist that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation load-bearing steps. The central results are falsifiable experimental measurements rather than constructed equivalences, and any self-citations (if present) are not invoked to justify uniqueness theorems or ansatzes that carry the main claims. This is a standard self-contained benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Video games from the 1990s leverage innate human inductive biases for perception, navigation, and memory.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls... The best performing models... complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We introduce a novel method for detecting an agent’s game progress: we scrape YouTube walkthroughs... and use perceptual hashing... to detect how much of the game it completed.
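The progress-detection idea quoted above (hashing walkthrough frames and matching them against the agent's screen) can be illustrated with a small sketch. The passage only says that perceptual hashing is used; the specific hash (a simple average hash), the bit-distance threshold, and the helper names below are assumptions made for illustration and may differ from the authors' pipeline.

```python
def average_hash(gray, size=8):
    """Average hash of a grayscale image given as a list of rows of ints.

    Downsamples to size x size cells by block averaging, then emits one bit
    per cell: 1 if the cell is brighter than the mean of all cells.
    Assumes the image height and width are multiples of `size`.
    """
    height, width = len(gray), len(gray[0])
    bh, bw = height // size, width // size
    cells = []
    for by in range(size):
        for bx in range(size):
            block = [
                gray[y][x]
                for y in range(by * bh, (by + 1) * bh)
                for x in range(bx * bw, (bx + 1) * bw)
            ]
            cells.append(sum(block) / len(block))
    avg = sum(cells) / len(cells)
    bits = 0
    for cell in cells:
        bits = (bits << 1) | (1 if cell > avg else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def furthest_checkpoint(frame_hash, checkpoint_hashes, max_distance=5):
    """Index of the last checkpoint within max_distance bits, or -1 if none.

    checkpoint_hashes is assumed to be ordered by position in the game.
    """
    best = -1
    for i, checkpoint in enumerate(checkpoint_hashes):
        if hamming(frame_hash, checkpoint) <= max_distance:
            best = i
    return best
```

In such a scheme, the fraction of the game completed could be estimated from the index of the furthest matched checkpoint relative to the total number of checkpoints. The threshold would need tuning so that visually similar but distinct screens are not confused.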
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.