V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Alex Jinpeng Wang; Lijuan Wang; Linjie Li; Ping Yu; Rui Yan; Xiangxi Zheng; Yuan Yao; Zhengyuan Yang

arxiv: 2504.06148 · v3 · submitted 2025-04-08 · 💻 cs.CV

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng , Linjie Li , Zhengyuan Yang , Ping Yu , Alex Jinpeng Wang , Rui Yan , Yuan Yao , Lijuan Wang This is my paper

classification 💻 cs.CV

keywords v-magemllmsmodelsdynamicevaluationinteractivereasoningcapabilities

0 comments

read the original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic ELO-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform vision-grounded, interactive frame-by-frame control in simulated continuous-time environments. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
cs.CV 2026-05 unverdicted novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
cs.CV 2026-04 unverdicted novelty 6.0

GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.