pith. sign in

arxiv: 2605.15342 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.LG

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Pith reviewed 2026-05-19 15:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords egocentric videovideo reasoningbenchmarkspatiotemporal hintsobject masksmultimodal questionsembodied agents
0
0 comments X p. Extension

The pith

Providing models with hints on where and when to look improves performance on complex egocentric video reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Minerva-Ego, a benchmark that tests AI models on multi-step reasoning using videos recorded from a first-person viewpoint. It adds detailed human annotations that map out the exact reasoning steps along with the specific objects and moments required to reach each answer. Top models still trail human accuracy by a wide margin on these tasks. The authors find that giving models explicit spatial and temporal guidance drawn from those annotations produces clear gains. This setup matters because better egocentric video understanding supports agents that need to interpret and act within real environments.

Core claim

Minerva-Ego extends egocentric video sources with challenging multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces, plus object mask annotations for each trace. Benchmarking shows state-of-the-art models maintain a large performance gap relative to humans. Prompting the same models with hints that indicate the relevant locations and times in the video yields substantial improvements in accuracy.

What carries the argument

Spatiotemporal hints drawn from human-annotated reasoning traces and object masks that direct models to the key locations and moments needed for each question.

If this is right

  • Frontier models gain measurable accuracy on egocentric video questions when supplied with explicit spatial and temporal guidance.
  • Dense object mask annotations allow direct evaluation of which visual elements matter for each reasoning step.
  • Benchmarks that score intermediate traces reveal gaps invisible to answer-only evaluation.
  • Embodied agents can benefit from inference-time guidance that mimics human focus patterns in first-person video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hinting method could be tested on non-egocentric video domains to check if spatial-temporal guidance generalizes.
  • Models trained to generate their own reasoning traces and masks might reduce reliance on human annotations over time.
  • Real-time embodied systems could incorporate online prediction of where and when to attend for ongoing tasks.

Load-bearing premise

The human-annotated reasoning traces and object masks accurately identify the intermediate steps and visual elements required to solve the questions.

What would settle it

Running the same models on the benchmark questions while supplying random or mismatched spatiotemporal hints and observing no performance change would indicate the hints are not the driver of gains.

Figures

Figures reproduced from arXiv: 2605.15342 by Arsha Nagrani, Bo Hu, Cordelia Schmid, David A Ross, Jasper Uijilings, Ramin Mehran, Shyamal Buch, Sudheendra Vijayanarasimhan, Tobias Weyand.

Figure 1
Figure 1. Figure 1: Spatio-temporal hinting in Minerva-Ego: We introduce Minerva-Ego, a complex video question-answering dataset for egocentric videos. Unlike prior work [10, 27, 34, 35], the answer to each question is accompanied by a spatiotemporally-grounded reasoning trace, which outlines the steps required to come to the answer which are grounded in time (localization with timestamps) and space (associations with segment… view at source ↗
Figure 2
Figure 2. Figure 2: Statistics of Reasoning Traces in Minerva-Ego. We show lengths of answers and reasoning traces (left), the distribution of timestamps mentioned in the reasoning traces (middle) and the distribution of objects referenced in the reasoning traces (right). Almost all reasoning traces contain at least one timestamp and refer to at least one object, for which a spatio-temporal mask is available. Reasoning traces… view at source ↗
Figure 3
Figure 3. Figure 3: Analysis of model produced reasoning traces: (Left) We report MiRA scores on reasoning traces for four axes, namely, Perceptual correctness, Temporal Grounding, Logical reasoning and Completeness. (Middle) We report the recall of relevant objects extracted from the ground truth reasoning trace in the reasoning traces of the models. Object recall matches the entire object’s name while Noun recall matches on… view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the reasoning traces without (Left) and with (Right) spatio-temporal hints for the question “A [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visual annotations. From left to right, visualization using masks, boxes, and circles + classes (See Tab. 3). The target question is, “How many types of cheese did I add to the puff pastry roll?” [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example errors made by the models in their video understanding and reasoning. The errors include failing to identify actions due to the egocentric viewpoint, missing the relevant temporal spans, and failing to capture actions due to fast motion. show results in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Video Lengths in Minerva-Ego. Action Recognition Audio Reasoning Cause and Effect Counterfactual Counting Egocentric & Spatial Perception Event Occurence Goal Reasoning Numerical Reasoning Object Recognition Reading Situational Awareness Spatial Perception State Changes Temporal Reasoning 0 50 100 150 Word Count Average Reasoning Length Action Recognition Audio Reasoning Cause and Effect Counterfactual Cou… view at source ↗
Figure 8
Figure 8. Figure 8: Statistics of reasoning trace lengths and object references [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The MiRA rubric prompt, as provided to Gemini (or GPT), used to score the reasoning traces produced by the VLMs being [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: MiRA scores of the model reasoning traces using GPT-5 [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Accuracy of SotA models by skill [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
read the original abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Minerva-Ego, a benchmark for complex egocentric visual reasoning. It augments existing egocentric video sources with multi-step multimodal questions, spatiotemporally-dense human-annotated reasoning traces, and object mask annotations. Benchmarking shows state-of-the-art models lag human performance, and experiments indicate that providing frontier models with 'where' and 'when' hints yields substantial gains.

Significance. If the results hold after addressing annotation validation, Minerva-Ego would be a useful resource for egocentric video understanding research, enabling analysis of intermediate reasoning steps rather than final answers alone. The empirical demonstration that spatiotemporal hints improve performance could guide model architectures for embodied agents.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.
  2. [Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy on the questions. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.
minor comments (2)
  1. [Abstract] The GitHub link is provided, but the paper should explicitly state the data splits, question counts per split, and statistical significance tests used for the reported gains to support reproducibility.
  2. [Dataset description] Clarify how the 'spatiotemporally-dense' object masks are aligned with the reasoning traces (e.g., frame-level vs. clip-level) to avoid ambiguity in how hints are constructed from them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of Minerva-Ego for analyzing intermediate reasoning steps in egocentric video understanding. We address each major comment below and will incorporate revisions to strengthen the claims regarding spatiotemporal hints.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.

    Authors: We agree that explicit controls are needed to isolate the contribution of spatiotemporal information and rule out leakage. In the revised manuscript we will add two new ablation experiments: (1) replacing human masks with machine-generated masks from an off-the-shelf detector, and (2) stripping semantic labels from the traces while preserving only temporal and spatial coordinates. These controls will quantify how much of the observed gain is attributable to faithful 'where/when' guidance versus direct answer content. revision: yes

  2. Referee: [Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.

    Authors: We will report inter-annotator agreement statistics (IoU for masks and Cohen’s kappa for trace steps) in the revised version. For validation, we will include additional results showing model accuracy when given only the annotated masks and traces; we will also discuss the remaining gap to human performance, which we attribute to limitations in current models’ ability to integrate the provided spatiotemporal cues rather than to annotation infidelity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

full rationale

The paper introduces Minerva-Ego as a new benchmark consisting of egocentric videos, multi-step questions, human-annotated reasoning traces, and spatiotemporal object masks. Its central empirical claim—that prompting frontier models with 'where' and 'when' hints yields performance gains—is obtained directly from evaluations on this newly created dataset. There are no equations, first-principles derivations, fitted parameters, or predictions that reduce to prior quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for any derivation, as none exist. The work is a self-contained empirical contribution whose findings rest on external model evaluations rather than internal redefinitions or self-referential fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; the abstract describes no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5741 in / 1055 out tokens · 46702 ms · 2026-05-19T15:57:52.079820+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    https://openai.com/index/learning- to-reason-with-llms , 2025

    Open AI. https://openai.com/index/learning- to-reason-with-llms , 2025. [Accessed 19-02-2025]. 1

  3. [3]

    System Card: Claude Opus 4 & Claude Sonnet 4

    Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. https://www- cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285. pdf. [Accessed 30-04-2025]. 4

  4. [4]

    Claude 3.5 sonnet v2

    Anthropic. Claude 3.5 sonnet v2. Anthropic API, 2023. A language model from Anthropic, featuring improved capabili- ties over the original Claude 3.5 Sonnet, including enhanced computer action generation. 1

  5. [5]

    In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024

    Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024. 2

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1

  7. [7]

    Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022

    Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Glober- son, and Alexei Efros. Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022. 3

  8. [8]

    Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

    Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, and Roland Memisevic. Look, Remember and Reason: Grounded Reasoning in Videos with Language Models. InThe Twelfth International Conference on Learning Representations, 2024. 3

  9. [9]

    TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2

  10. [10]

    HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Li Fei-Fei. HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024. 2

  11. [11]

    Egothink: Evaluating first- person perspective thinking capability of vision-language models

    Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first- person perspective thinking capability of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14291–14302, 2024. 3

  12. [12]

    TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G M Snoek, and Yuki M Asano. TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

  13. [13]

    Egovqa: An egocentric video question an- swering benchmark dataset

    Yifei Fan, Bo Zhang, Yun Zhou, Qingshan He, Zhenyi Liu, and Liang Wang. Egovqa: An egocentric video question an- swering benchmark dataset. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (IC- CVW), pages 2501–2504, 2019. 2

  14. [14]

    Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025

    Tongtong Feng, Xin Wang, Yu-Gang Jiang, and Wenwu Zhu. Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025. 1

  15. [15]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 1, 2

  16. [16]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...

  17. [17]

    Ego4d: Around the world in 3,000 hours of egocentric video, 2022

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na- garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, et al. Ego4d: Around the world in 3,000 hours of egocentric video, 2022. 2

  18. [18]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024

    Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024. 3

  19. [19]

    Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

  20. [20]

    Egotaskqa: Understanding human tasks in egocentric videos

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos. InThe 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks,

  21. [21]

    Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021

    Andy L Jones. Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021. 1

  22. [22]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

  23. [23]

    Adversarial filters of dataset biases

    Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. Adversarial filters of dataset biases. InInternational conference on machine learning, pages 1078–1088. Pmlr,

  24. [24]

    Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024

    Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024. 2

  25. [25]

    Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023

    Xiatoian Liu, Hector Palacios, and Christian Muise. Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023. 1

  26. [26]

    Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. E.T. bench: Towards open-ended event-level video-language understanding.arXiv preprint arXiv:2409.18111, 2024. 2

  27. [27]

    Egoschema: A diagnostic benchmark for very long- form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InThirty-seventh Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 2

  28. [28]

    Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023

    Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023. 2, 7

  29. [29]

    Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024

    Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hor- nung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, et al. Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024. 1, 2

  30. [30]

    Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

    Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl V ondrick, Mikhail Sirotenko, et al. Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

  31. [31]

    Introducing GPT-4.1 in the API

    OpenAI. Introducing GPT-4.1 in the API. https:// openai.com/index/gpt-4-1/ , 2025. [Accessed 30- 04-2025]. 4

  32. [32]

    Introducing GPT-5

    OpenAI. Introducing GPT-5. https://openai.com/ index/introducing-gpt-5/, 2025. [Accessed 16-09- 2025]. 1, 4

  33. [33]

    Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024. 2

  34. [34]

    Hd-epic: A highly-detailed egocentric video dataset

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Sid- dhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rho- dri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the...

  35. [35]

    Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos

    Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Ace Kul- shrestha, and Federico Tombari. Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 7, 8

  36. [36]

    Sam 2: Segment anything in images and videos.ICLR, 2025

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.ICLR, 2025. 5

  37. [37]

    CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024

    Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024. 2

  38. [38]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1

  39. [39]

    Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022. 3

  40. [40]

    Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 3

  41. [41]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 1

  42. [42]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next gen- eration agentic capabilities.arXiv preprint arXiv:2507.06261,

  43. [43]

    LVBench: An Extreme Long Video Understanding Benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 1, 2, 4

  44. [44]

    Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

    Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

  45. [45]

    Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024

    Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A Rossi, Ruiyi Zhang, et al. Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024. 3

  46. [46]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  47. [47]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3, 5

  48. [48]

    MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

    Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, and Yinfei Yang. MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA. In International Conference on Learning Representations, pages 71705–71723, 2025. 3

  49. [49]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 1 Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding Supplementary Material 0 10 2...

  50. [50]

    corrected

    Minerva-Ego 6.1. Additional Statistics The shortest videos is 10 seconds while the longest video is 75 minutes. The mean video length is 20 minutes. The full distribution can be found in Fig. 7. Figure 8 shows statistics of reasoning lengths and object occurrences by question type. We note that the distribution is flat, other thanAudio Reasoning, which ha...

  51. [51]

    All raters are native English speakers with graduate degrees

    Rater Guidelines All textual data in Minerva-Ego was manually annotated by human annotators (raters). All raters are native English speakers with graduate degrees. 7.1. Annotation Guidelines The raters were given the following guidelines before being asked to propose question, answers, decoys and reasoning traces: You will be given a video. For each video...

  52. [52]

    Egocentric and Spatial Perception

  53. [53]

    Numerical Reasoning (all math operations other than counting)

  54. [54]

    Counterfactual (“what if”) Important things to keep in mind:

  55. [55]

    Questions should be multi-step

  56. [56]

    They should be difficult to solve

  57. [57]

    Ideally they should involve looking at multiple different time segments of the video

  58. [58]

    Each question should be cover multiple skills and have multiple reasoning steps

  59. [59]

    The questions should be phrased with the word “I”, as if the cameraperson is asking the question

  60. [60]

    I” to refer to you (the rater, as you are doing the reasoning, and use the word “user

    The reasoning traces should use the word “I” to refer to you (the rater, as you are doing the reasoning, and use the word “user” to refer to the cameraperson). Note in the question, I refers to the camera person

  61. [61]

    objects of inter- est

    The reasoning traces should map to the “objects of inter- est”

  62. [62]

    Every time that object is referenced (even if it is in a different time step of the video), please use the same annotation

  63. [63]

    Please number the “steps” in each reasoning trace, as shown in the examples

  64. [64]

    All reasoning traces must have at least one time stamp, and at least one object of interest reference. 7.2. Human Study We used a disjoint pool of 12 raters for the human study, by asking the rater leads to ensure that the same rater who proposed the question (or even saw the video before) is not the same as the one who performs the human study. Hence no ...

  65. [65]

    Model Implementation Details Hyperparameters for all our models are provided in Table 7

  66. [66]

    density cubes,

    Results 9.1. MiRA Reasoning Evaluation MiRA is a model-based metric that uses Gemini 2.5 Pro for scoring. The exact prompt used to compute the MiRA score appears in Figure 9. Since using the same LLM for hinting and as a judge couldpotentiallybias the results, we repeat the MiRA anal- ysis from Fig. 3 (left) using GPT-5 as a judge, and report the scores i...