Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

arxiv: 2605.15342 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.LG

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Arsha Nagrani , Jasper Uijilings , Shyamal Buch , Tobias Weyand , Sudheendra Vijayanarasimhan , Bo Hu , Ramin Mehran , David A Ross

show 1 more author

Cordelia Schmid

This is my paper

Pith reviewed 2026-05-19 15:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords egocentric videovideo reasoningbenchmarkspatiotemporal hintsobject masksmultimodal questionsembodied agents

0 comments p. Extension

The pith

Providing models with hints on where and when to look improves performance on complex egocentric video reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Minerva-Ego, a benchmark that tests AI models on multi-step reasoning using videos recorded from a first-person viewpoint. It adds detailed human annotations that map out the exact reasoning steps along with the specific objects and moments required to reach each answer. Top models still trail human accuracy by a wide margin on these tasks. The authors find that giving models explicit spatial and temporal guidance drawn from those annotations produces clear gains. This setup matters because better egocentric video understanding supports agents that need to interpret and act within real environments.

Core claim

Minerva-Ego extends egocentric video sources with challenging multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces, plus object mask annotations for each trace. Benchmarking shows state-of-the-art models maintain a large performance gap relative to humans. Prompting the same models with hints that indicate the relevant locations and times in the video yields substantial improvements in accuracy.

What carries the argument

Spatiotemporal hints drawn from human-annotated reasoning traces and object masks that direct models to the key locations and moments needed for each question.

If this is right

Frontier models gain measurable accuracy on egocentric video questions when supplied with explicit spatial and temporal guidance.
Dense object mask annotations allow direct evaluation of which visual elements matter for each reasoning step.
Benchmarks that score intermediate traces reveal gaps invisible to answer-only evaluation.
Embodied agents can benefit from inference-time guidance that mimics human focus patterns in first-person video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hinting method could be tested on non-egocentric video domains to check if spatial-temporal guidance generalizes.
Models trained to generate their own reasoning traces and masks might reduce reliance on human annotations over time.
Real-time embodied systems could incorporate online prediction of where and when to attend for ongoing tasks.

Load-bearing premise

The human-annotated reasoning traces and object masks accurately identify the intermediate steps and visual elements required to solve the questions.

What would settle it

Running the same models on the benchmark questions while supplying random or mismatched spatiotemporal hints and observing no performance change would indicate the hints are not the driver of gains.

Figures

Figures reproduced from arXiv: 2605.15342 by Arsha Nagrani, Bo Hu, Cordelia Schmid, David A Ross, Jasper Uijilings, Ramin Mehran, Shyamal Buch, Sudheendra Vijayanarasimhan, Tobias Weyand.

**Figure 1.** Figure 1: Spatio-temporal hinting in Minerva-Ego: We introduce Minerva-Ego, a complex video question-answering dataset for egocentric videos. Unlike prior work [10, 27, 34, 35], the answer to each question is accompanied by a spatiotemporally-grounded reasoning trace, which outlines the steps required to come to the answer which are grounded in time (localization with timestamps) and space (associations with segment… view at source ↗

**Figure 2.** Figure 2: Statistics of Reasoning Traces in Minerva-Ego. We show lengths of answers and reasoning traces (left), the distribution of timestamps mentioned in the reasoning traces (middle) and the distribution of objects referenced in the reasoning traces (right). Almost all reasoning traces contain at least one timestamp and refer to at least one object, for which a spatio-temporal mask is available. Reasoning traces… view at source ↗

**Figure 3.** Figure 3: Analysis of model produced reasoning traces: (Left) We report MiRA scores on reasoning traces for four axes, namely, Perceptual correctness, Temporal Grounding, Logical reasoning and Completeness. (Middle) We report the recall of relevant objects extracted from the ground truth reasoning trace in the reasoning traces of the models. Object recall matches the entire object’s name while Noun recall matches on… view at source ↗

**Figure 4.** Figure 4: Visualization of the reasoning traces without (Left) and with (Right) spatio-temporal hints for the question “A [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visual annotations. From left to right, visualization using masks, boxes, and circles + classes (See Tab. 3). The target question is, “How many types of cheese did I add to the puff pastry roll?” [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Example errors made by the models in their video understanding and reasoning. The errors include failing to identify actions due to the egocentric viewpoint, missing the relevant temporal spans, and failing to capture actions due to fast motion. show results in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Video Lengths in Minerva-Ego. Action Recognition Audio Reasoning Cause and Effect Counterfactual Counting Egocentric & Spatial Perception Event Occurence Goal Reasoning Numerical Reasoning Object Recognition Reading Situational Awareness Spatial Perception State Changes Temporal Reasoning 0 50 100 150 Word Count Average Reasoning Length Action Recognition Audio Reasoning Cause and Effect Counterfactual Cou… view at source ↗

**Figure 8.** Figure 8: Statistics of reasoning trace lengths and object references [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: The MiRA rubric prompt, as provided to Gemini (or GPT), used to score the reasoning traces produced by the VLMs being [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: MiRA scores of the model reasoning traces using GPT-5 [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 12.** Figure 12: Accuracy of SotA models by skill [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

read the original abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Minerva-Ego adds dense human reasoning traces and object masks to egocentric video questions and reports clear gains from spatiotemporal prompting hints, though those gains may partly depend on how independent the annotations really are from semantic leakage.

read the letter

The one or two things to know about this paper are that it presents Minerva-Ego as a benchmark for complex egocentric visual reasoning using multi-step questions paired with human-annotated reasoning traces and object masks, and that it finds substantial performance lifts when frontier models are prompted with hints on where and when to look in the video. What the paper does well is to move the evaluation beyond final answers alone. By extending egocentric video sources with these dense annotations, it creates a resource that can highlight specific failures in intermediate reasoning steps. The reported benchmarking shows clear room for improvement over current state-of-the-art models, and the hint experiments give a straightforward method to test the impact of providing spatiotemporal guidance. This approach aligns with needs in embodied AI where agents must handle real-time visual understanding. A softer spot is the assumption that the human traces accurately reflect the minimal visual elements required without additional knowledge. The stress here is that if the annotations incorporate post-hoc reasoning based on the full question, the hints might introduce answer leakage instead of pure guidance on location and timing. The abstract lacks mention of inter-annotator agreement metrics or comparisons to machine-generated alternatives, which leaves some uncertainty about how much the gains depend on careful human input versus the prompting technique itself. If the full paper includes those checks, that would strengthen the central claim. This work targets researchers in computer vision and robotics who are building systems for egocentric or embodied settings. It provides a more structured way to measure and improve reasoning capabilities in video models. I would send this to peer review. The benchmark adds useful structure, and referees can help ensure the evaluation details hold up under closer inspection.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Minerva-Ego, a benchmark for complex egocentric visual reasoning. It augments existing egocentric video sources with multi-step multimodal questions, spatiotemporally-dense human-annotated reasoning traces, and object mask annotations. Benchmarking shows state-of-the-art models lag human performance, and experiments indicate that providing frontier models with 'where' and 'when' hints yields substantial gains.

Significance. If the results hold after addressing annotation validation, Minerva-Ego would be a useful resource for egocentric video understanding research, enabling analysis of intermediate reasoning steps rather than final answers alone. The empirical demonstration that spatiotemporal hints improve performance could guide model architectures for embodied agents.

major comments (2)

[Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.
[Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy on the questions. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.

minor comments (2)

[Abstract] The GitHub link is provided, but the paper should explicitly state the data splits, question counts per split, and statistical significance tests used for the reported gains to support reproducibility.
[Dataset description] Clarify how the 'spatiotemporally-dense' object masks are aligned with the reasoning traces (e.g., frame-level vs. clip-level) to avoid ambiguity in how hints are constructed from them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of Minerva-Ego for analyzing intermediate reasoning steps in egocentric video understanding. We address each major comment below and will incorporate revisions to strengthen the claims regarding spatiotemporal hints.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.

Authors: We agree that explicit controls are needed to isolate the contribution of spatiotemporal information and rule out leakage. In the revised manuscript we will add two new ablation experiments: (1) replacing human masks with machine-generated masks from an off-the-shelf detector, and (2) stripping semantic labels from the traces while preserving only temporal and spatial coordinates. These controls will quantify how much of the observed gain is attributable to faithful 'where/when' guidance versus direct answer content. revision: yes
Referee: [Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.

Authors: We will report inter-annotator agreement statistics (IoU for masks and Cohen’s kappa for trace steps) in the revised version. For validation, we will include additional results showing model accuracy when given only the annotated masks and traces; we will also discuss the remaining gap to human performance, which we attribute to limitations in current models’ ability to integrate the provided spatiotemporal cues rather than to annotation infidelity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

full rationale

The paper introduces Minerva-Ego as a new benchmark consisting of egocentric videos, multi-step questions, human-annotated reasoning traces, and spatiotemporal object masks. Its central empirical claim—that prompting frontier models with 'where' and 'when' hints yields performance gains—is obtained directly from evaluations on this newly created dataset. There are no equations, first-principles derivations, fitted parameters, or predictions that reduce to prior quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for any derivation, as none exist. The work is a self-contained empirical contribution whose findings rest on external model evaluations rather than internal redefinitions or self-referential fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; the abstract describes no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5741 in / 1055 out tokens · 46702 ms · 2026-05-19T15:57:52.079820+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 10 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

https://openai.com/index/learning- to-reason-with-llms , 2025

Open AI. https://openai.com/index/learning- to-reason-with-llms , 2025. [Accessed 19-02-2025]. 1

work page 2025
[3]

System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. https://www- cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285. pdf. [Accessed 30-04-2025]. 4

work page 2025
[4]

Claude 3.5 sonnet v2

Anthropic. Claude 3.5 sonnet v2. Anthropic API, 2023. A language model from Anthropic, featuring improved capabili- ties over the original Claude 3.5 Sonnet, including enhanced computer action generation. 1

work page 2023
[5]

In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024

Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024. 2

work page arXiv 2024
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022

Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Glober- son, and Alexei Efros. Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022. 3

work page 2022
[8]

Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, and Roland Memisevic. Look, Remember and Reason: Grounded Reasoning in Videos with Language Models. InThe Twelfth International Conference on Learning Representations, 2024. 3

work page 2024
[9]

TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2

work page arXiv 2024
[10]

HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, CristÃ³bal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Li Fei-Fei. HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024. 2

work page arXiv 2024
[11]

Egothink: Evaluating first- person perspective thinking capability of vision-language models

Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first- person perspective thinking capability of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14291–14302, 2024. 3

work page 2024
[12]

TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G M Snoek, and Yuki M Asano. TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

work page arXiv
[13]

Egovqa: An egocentric video question an- swering benchmark dataset

Yifei Fan, Bo Zhang, Yun Zhou, Qingshan He, Zhenyi Liu, and Liang Wang. Egovqa: An egocentric video question an- swering benchmark dataset. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (IC- CVW), pages 2501–2504, 2019. 2

work page 2019
[14]

Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025

Tongtong Feng, Xin Wang, Yu-Gang Jiang, and Wenwu Zhu. Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025. 1

work page arXiv 2025
[15]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...

work page 2017
[17]

Ego4d: Around the world in 3,000 hours of egocentric video, 2022

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na- garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, et al. Ego4d: Around the world in 3,000 hours of egocentric video, 2022. 2

work page 2022
[18]

Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024. 3

work page arXiv 2024
[19]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

work page
[20]

Egotaskqa: Understanding human tasks in egocentric videos

Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos. InThe 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks,

work page 2022
[21]

Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021

Andy L Jones. Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021. 1

work page arXiv 2021
[22]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Adversarial filters of dataset biases

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. Adversarial filters of dataset biases. InInternational conference on machine learning, pages 1078–1088. Pmlr,

work page
[24]

Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024. 2

work page arXiv 2024
[25]

Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023

Xiatoian Liu, Hector Palacios, and Christian Muise. Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023. 1

work page 2023
[26]

Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. E.T. bench: Towards open-ended event-level video-language understanding.arXiv preprint arXiv:2409.18111, 2024. 2

work page arXiv 2024
[27]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InThirty-seventh Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 2

work page 2023
[28]

Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023. 2, 7

work page 2023
[29]

Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024

Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hor- nung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, et al. Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024. 1, 2

work page arXiv 2024
[30]

Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl V ondrick, Mikhail Sirotenko, et al. Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

work page arXiv
[31]

Introducing GPT-4.1 in the API

OpenAI. Introducing GPT-4.1 in the API. https:// openai.com/index/gpt-4-1/ , 2025. [Accessed 30- 04-2025]. 4

work page 2025
[32]

Introducing GPT-5

OpenAI. Introducing GPT-5. https://openai.com/ index/introducing-gpt-5/, 2025. [Accessed 16-09- 2025]. 1, 4

work page 2025
[33]

Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024. 2

work page 2024
[34]

Hd-epic: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Sid- dhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rho- dri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the...

work page 2025
[35]

Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Ace Kul- shrestha, and Federico Tombari. Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 7, 8

work page 2025
[36]

Sam 2: Segment anything in images and videos.ICLR, 2025

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.ICLR, 2025. 5

work page 2025
[37]

CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024. 2

work page arXiv 2024
[38]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022. 3

work page 2022
[40]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 3

work page 2024
[41]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next gen- eration agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[43]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

work page arXiv 2024
[45]

Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024

Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A Rossi, Ruiyi Zhang, et al. Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024. 3

work page arXiv 2024
[46]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, and Yinfei Yang. MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA. In International Conference on Learning Representations, pages 71705–71723, 2025. 3

work page 2025
[49]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 1 Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding Supplementary Material 0 10 2...

work page 2019
[50]

corrected

Minerva-Ego 6.1. Additional Statistics The shortest videos is 10 seconds while the longest video is 75 minutes. The mean video length is 20 minutes. The full distribution can be found in Fig. 7. Figure 8 shows statistics of reasoning lengths and object occurrences by question type. We note that the distribution is flat, other thanAudio Reasoning, which ha...

work page
[51]

All raters are native English speakers with graduate degrees

Rater Guidelines All textual data in Minerva-Ego was manually annotated by human annotators (raters). All raters are native English speakers with graduate degrees. 7.1. Annotation Guidelines The raters were given the following guidelines before being asked to propose question, answers, decoys and reasoning traces: You will be given a video. For each video...

work page
[52]

Egocentric and Spatial Perception

work page
[53]

Numerical Reasoning (all math operations other than counting)

work page
[54]

Counterfactual (“what if”) Important things to keep in mind:

work page
[55]

Questions should be multi-step

work page
[56]

They should be difficult to solve

work page
[57]

Ideally they should involve looking at multiple different time segments of the video

work page
[58]

Each question should be cover multiple skills and have multiple reasoning steps

work page
[59]

The questions should be phrased with the word “I”, as if the cameraperson is asking the question

work page
[60]

I” to refer to you (the rater, as you are doing the reasoning, and use the word “user

The reasoning traces should use the word “I” to refer to you (the rater, as you are doing the reasoning, and use the word “user” to refer to the cameraperson). Note in the question, I refers to the camera person

work page
[61]

objects of inter- est

The reasoning traces should map to the “objects of inter- est”

work page
[62]

Every time that object is referenced (even if it is in a different time step of the video), please use the same annotation

work page
[63]

Please number the “steps” in each reasoning trace, as shown in the examples

work page
[64]

All reasoning traces must have at least one time stamp, and at least one object of interest reference. 7.2. Human Study We used a disjoint pool of 12 raters for the human study, by asking the rater leads to ensure that the same rater who proposed the question (or even saw the video before) is not the same as the one who performs the human study. Hence no ...

work page
[65]

Model Implementation Details Hyperparameters for all our models are provided in Table 7

work page
[66]

density cubes,

Results 9.1. MiRA Reasoning Evaluation MiRA is a model-based metric that uses Gemini 2.5 Pro for scoring. The exact prompt used to compute the MiRA score appears in Figure 9. Since using the same LLM for hinting and as a judge couldpotentiallybias the results, we repeat the MiRA anal- ysis from Fig. 3 (left) using GPT-5 as a judge, and report the scores i...

work page 2025

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

https://openai.com/index/learning- to-reason-with-llms , 2025

Open AI. https://openai.com/index/learning- to-reason-with-llms , 2025. [Accessed 19-02-2025]. 1

work page 2025

[3] [3]

System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. https://www- cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285. pdf. [Accessed 30-04-2025]. 4

work page 2025

[4] [4]

Claude 3.5 sonnet v2

Anthropic. Claude 3.5 sonnet v2. Anthropic API, 2023. A language model from Anthropic, featuring improved capabili- ties over the original Claude 3.5 Sonnet, including enhanced computer action generation. 1

work page 2023

[5] [5]

In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024

Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024. 2

work page arXiv 2024

[6] [6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022

Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Glober- son, and Alexei Efros. Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022. 3

work page 2022

[8] [8]

Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, and Roland Memisevic. Look, Remember and Reason: Grounded Reasoning in Videos with Language Models. InThe Twelfth International Conference on Learning Representations, 2024. 3

work page 2024

[9] [9]

TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2

work page arXiv 2024

[10] [10]

HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, CristÃ³bal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Li Fei-Fei. HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024. 2

work page arXiv 2024

[11] [11]

Egothink: Evaluating first- person perspective thinking capability of vision-language models

Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first- person perspective thinking capability of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14291–14302, 2024. 3

work page 2024

[12] [12]

TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G M Snoek, and Yuki M Asano. TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,

work page arXiv

[13] [13]

Egovqa: An egocentric video question an- swering benchmark dataset

Yifei Fan, Bo Zhang, Yun Zhou, Qingshan He, Zhenyi Liu, and Liang Wang. Egovqa: An egocentric video question an- swering benchmark dataset. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (IC- CVW), pages 2501–2504, 2019. 2

work page 2019

[14] [14]

Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025

Tongtong Feng, Xin Wang, Yu-Gang Jiang, and Wenwu Zhu. Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025. 1

work page arXiv 2025

[15] [15]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...

work page 2017

[17] [17]

Ego4d: Around the world in 3,000 hours of egocentric video, 2022

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na- garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, et al. Ego4d: Around the world in 3,000 hours of egocentric video, 2022. 2

work page 2022

[18] [18]

Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024. 3

work page arXiv 2024

[19] [19]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,

work page

[20] [20]

Egotaskqa: Understanding human tasks in egocentric videos

Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos. InThe 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks,

work page 2022

[21] [21]

Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021

Andy L Jones. Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021. 1

work page arXiv 2021

[22] [22]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Adversarial filters of dataset biases

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. Adversarial filters of dataset biases. InInternational conference on machine learning, pages 1078–1088. Pmlr,

work page

[24] [24]

Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024. 2

work page arXiv 2024

[25] [25]

Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023

Xiatoian Liu, Hector Palacios, and Christian Muise. Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023. 1

work page 2023

[26] [26]

Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. E.T. bench: Towards open-ended event-level video-language understanding.arXiv preprint arXiv:2409.18111, 2024. 2

work page arXiv 2024

[27] [27]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InThirty-seventh Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 2

work page 2023

[28] [28]

Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023. 2, 7

work page 2023

[29] [29]

Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024

Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hor- nung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, et al. Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024. 1, 2

work page arXiv 2024

[30] [30]

Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl V ondrick, Mikhail Sirotenko, et al. Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,

work page arXiv

[31] [31]

Introducing GPT-4.1 in the API

OpenAI. Introducing GPT-4.1 in the API. https:// openai.com/index/gpt-4-1/ , 2025. [Accessed 30- 04-2025]. 4

work page 2025

[32] [32]

Introducing GPT-5

OpenAI. Introducing GPT-5. https://openai.com/ index/introducing-gpt-5/, 2025. [Accessed 16-09- 2025]. 1, 4

work page 2025

[33] [33]

Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024. 2

work page 2024

[34] [34]

Hd-epic: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Sid- dhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rho- dri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the...

work page 2025

[35] [35]

Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Ace Kul- shrestha, and Federico Tombari. Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 7, 8

work page 2025

[36] [36]

Sam 2: Segment anything in images and videos.ICLR, 2025

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.ICLR, 2025. 5

work page 2025

[37] [37]

CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024. 2

work page arXiv 2024

[38] [38]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022

Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022. 3

work page 2022

[40] [40]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 3

work page 2024

[41] [41]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next gen- eration agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

work page arXiv 2024

[45] [45]

Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024

Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A Rossi, Ruiyi Zhang, et al. Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024. 3

work page arXiv 2024

[46] [46]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, and Yinfei Yang. MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA. In International Conference on Learning Representations, pages 71705–71723, 2025. 3

work page 2025

[49] [49]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 1 Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding Supplementary Material 0 10 2...

work page 2019

[50] [50]

corrected

Minerva-Ego 6.1. Additional Statistics The shortest videos is 10 seconds while the longest video is 75 minutes. The mean video length is 20 minutes. The full distribution can be found in Fig. 7. Figure 8 shows statistics of reasoning lengths and object occurrences by question type. We note that the distribution is flat, other thanAudio Reasoning, which ha...

work page

[51] [51]

All raters are native English speakers with graduate degrees

Rater Guidelines All textual data in Minerva-Ego was manually annotated by human annotators (raters). All raters are native English speakers with graduate degrees. 7.1. Annotation Guidelines The raters were given the following guidelines before being asked to propose question, answers, decoys and reasoning traces: You will be given a video. For each video...

work page

[52] [52]

Egocentric and Spatial Perception

work page

[53] [53]

Numerical Reasoning (all math operations other than counting)

work page

[54] [54]

Counterfactual (“what if”) Important things to keep in mind:

work page

[55] [55]

Questions should be multi-step

work page

[56] [56]

They should be difficult to solve

work page

[57] [57]

Ideally they should involve looking at multiple different time segments of the video

work page

[58] [58]

Each question should be cover multiple skills and have multiple reasoning steps

work page

[59] [59]

The questions should be phrased with the word “I”, as if the cameraperson is asking the question

work page

[60] [60]

I” to refer to you (the rater, as you are doing the reasoning, and use the word “user

The reasoning traces should use the word “I” to refer to you (the rater, as you are doing the reasoning, and use the word “user” to refer to the cameraperson). Note in the question, I refers to the camera person

work page

[61] [61]

objects of inter- est

The reasoning traces should map to the “objects of inter- est”

work page

[62] [62]

Every time that object is referenced (even if it is in a different time step of the video), please use the same annotation

work page

[63] [63]

Please number the “steps” in each reasoning trace, as shown in the examples

work page

[64] [64]

All reasoning traces must have at least one time stamp, and at least one object of interest reference. 7.2. Human Study We used a disjoint pool of 12 raters for the human study, by asking the rater leads to ensure that the same rater who proposed the question (or even saw the video before) is not the same as the one who performs the human study. Hence no ...

work page

[65] [65]

Model Implementation Details Hyperparameters for all our models are provided in Table 7

work page

[66] [66]

density cubes,

Results 9.1. MiRA Reasoning Evaluation MiRA is a model-based metric that uses Gemini 2.5 Pro for scoring. The exact prompt used to compute the MiRA score appears in Figure 9. Since using the same LLM for hinting and as a judge couldpotentiallybias the results, we repeat the MiRA anal- ysis from Fig. 3 (left) using GPT-5 as a judge, and report the scores i...

work page 2025