
arxiv: 2603.09731 · v3 · pith:LRL5ZS4Lnew · submitted 2026-03-10 · 💻 cs.CV · cs.AI · cs.CL

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

classification 💻 cs.CV · cs.AI · cs.CL
keywords egocentric · reasoning · long-horizon · action · explore-bench · scene · actions · embodied

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
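The task setup described in the abstract can be illustrated with a minimal sketch of what one benchmark instance and a fine-grained comparison might look like. Everything here is an assumption made for illustration: the class and field names (`SceneAnnotation`, `Instance`, `initial_image_path`), the example file path, and the set-based F1 score are hypothetical and are not taken from EXPLORE-Bench, whose actual schema and metrics are not given in the abstract.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class SceneAnnotation:
    """Hypothetical structured description of a final scene."""
    objects: FrozenSet[str]                    # object categories
    attributes: FrozenSet[Tuple[str, str]]     # (object, visual attribute)
    relations: FrozenSet[Tuple[str, str, str]] # (subject, relation, object)


@dataclass(frozen=True)
class Instance:
    """Hypothetical benchmark instance: initial image, actions, final scene."""
    initial_image_path: str
    actions: List[str]
    final_scene: SceneAnnotation


def set_f1(predicted: frozenset, reference: frozenset) -> float:
    """F1 overlap between two sets (an assumed metric, not the benchmark's)."""
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def score(pred: SceneAnnotation, ref: SceneAnnotation) -> dict:
    """Compare a predicted final scene with the annotation, per facet."""
    return {
        "objects": set_f1(pred.objects, ref.objects),
        "attributes": set_f1(pred.attributes, ref.attributes),
        "relations": set_f1(pred.relations, ref.relations),
    }


reference = SceneAnnotation(
    objects=frozenset({"cup", "table", "kettle"}),
    attributes=frozenset({("cup", "red")}),
    relations=frozenset({("cup", "on", "table")}),
)
example = Instance(
    initial_image_path="frames/initial.jpg",  # hypothetical path
    actions=["pick up the kettle", "pour water into the cup", "put the kettle down"],
    final_scene=reference,
)
prediction = SceneAnnotation(
    objects=frozenset({"cup", "table"}),
    attributes=frozenset({("cup", "blue")}),
    relations=frozenset({("cup", "on", "table")}),
)
result = score(prediction, example.final_scene)
```

Scoring objects, attributes, and relations separately mirrors the abstract's point that structured annotations allow fine-grained assessment: a model can name the right objects while still getting their attributes or spatial relations wrong.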



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs

    cs.CV 2026-07 unverdicted novelty 5.0

    SG-Ego dataset and GLEN model enable structured reasoning over spatio-temporal scene graphs for ego-centric activity understanding, introducing the A-GEF forecasting task.