VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
Civrealm: A learning and reasoning odyssey in civilization for decision-making agents
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
citing papers explorer
-
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
CA2: Code-Aware Agent for Automated Game Testing
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.