Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

[Accessed 09-02- · 2025 · cs.CV · arXiv 2505.24876

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

open full Pith review browse 4 citing papers arXiv PDF

abstract

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent- X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

cs.AI · 2026-06-11 · unverdicted · novelty 7.0 · 2 refs

TerraBench is a new benchmark with 403 tasks across Earth-science domains that evaluates LLM agents on coordinating heterogeneous data using executable ReAct-style workflows and process-level metrics.

MedCTA: A Benchmark for Clinical Tool Agents

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

MedCTA is a new benchmark with 107 real-world tasks and process-aware metrics that shows frontier multimodal models remain brittle at autonomous tool selection, execution, and trajectory completion in clinical settings.

Benchmark Everything Everywhere All at Once

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

cs.AI · 2026-04-30 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

citing papers explorer

Showing 1 of 1 citing paper after filters.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 5 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer