pith. machine review for the scientific record.

arxiv: 2604.09535 · v1 · submitted 2026-04-10 · 💻 cs.CV

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Pith reviewed 2026-05-10 17:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric vision · think-aloud chains · long-horizon planning · vision-language models · chain-of-thought · spatial grounding · household tasks · embodied AI

The pith

Finetuning foundation models on human think-aloud chains with metric labels from EgoTL improves long-horizon planning and spatial reasoning in egocentric tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoTL as a capture pipeline for egocentric household data that records spoken step-by-step reasoning before each action. It aligns this reasoning with word-level timestamps, accurate physical measurements, and scene context to create cleaner training signals than noisy automatic labels. The authors benchmark existing models on minute-long sequences across more than 100 daily household tasks and find persistent shortfalls in planning and grounding. They then show that fine-tuning on the new aligned data produces gains in planning, step-wise reasoning, instruction following, and spatial accuracy. A sympathetic reader would see this as a practical route to making vision-language models more reliable for extended real-world sequences.

Core claim

EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, foundation models and world models are benchmarked on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. The models still fall short as egocentric assistants or open-world simulators. Finetuning foundation models with human CoT aligned with metric labels on the EgoTL training split improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
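To make the alignment step concrete: under a say-before-act protocol, the words spoken between the end of one action clip and the start of the next can be treated as the reasoning chain for that next action. Below is a minimal sketch of that join, assuming word-level timestamps of the kind WhisperX-style transcription provides; the data structures and the `align_reasoning` helper are illustrative, not the authors' pipeline.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # word onset in seconds (word-level timestamp)
    end: float

@dataclass
class Clip:
    action: str   # clip-level navigation/manipulation tag
    start: float  # clip boundaries in seconds
    end: float

def align_reasoning(words: list[Word], clips: list[Clip]) -> list[dict]:
    """Attach each spoken reasoning span to the action clip it precedes.

    Say-before-act assumption: words uttered after the previous clip ends
    and before the current clip starts are the reasoning for the current clip.
    """
    chains = []
    for i, clip in enumerate(clips):
        prev_end = clips[i - 1].end if i > 0 else 0.0
        reasoning = [w.text for w in words if prev_end <= w.start < clip.start]
        chains.append({
            "action": clip.action,
            "reasoning": " ".join(reasoning),
            "span": (clip.start, clip.end),
        })
    return chains
```

Metric spatial labels and memory-bank scene context would presumably be attached to each chain entry in the same pass; they are omitted here for brevity.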

What carries the argument

The EgoTL think-aloud capture pipeline that records spoken reasoning before action and aligns it with metric spatial calibration and memory-bank scene context.

If this is right

  • Fine-tuned models generate better long-horizon plans over minute-long sequences.
  • Step-wise reasoning becomes more reliable across household tasks.
  • Instruction following improves for navigation and manipulation actions.
  • Spatial grounding better respects real-world physical attributes and avoids hallucinations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same capture approach could be adapted to collect reasoning data for non-household embodied settings such as outdoor navigation.
  • EgoTL-style chains might help world models maintain consistent scene memory over longer simulated episodes.
  • Testing the fine-tuned models on physical robots would show whether the gains transfer beyond video evaluation.

Load-bearing premise

The say-before-act protocol combined with metric-scale spatial estimators and memory-bank walkthroughs accurately captures unbiased human reasoning chains and physical properties without introducing significant errors or biases.

What would settle it

An experiment showing that models fine-tuned on EgoTL-aligned labels perform no better than models fine-tuned on standard VLM-generated labels when tested on held-out minute-long egocentric household sequences would falsify the improvement claim.
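If per-sequence scores from the two finetuned models were available, a paired permutation test on their score differences is one standard way to adjudicate that comparison. A minimal sketch under that assumption (illustrative, not the paper's protocol):

```python
import random

def paired_permutation_test(scores_egotl, scores_vlm, n_perm=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-sequence scores.

    scores_egotl / scores_vlm: same-length lists, one score per held-out
    sequence, from the EgoTL-label and VLM-label finetuned models.
    Returns (mean gain, p-value).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_egotl, scores_vlm)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm
```

A mean gain near zero with a large p-value on the held-out split would support the falsification; a clearly positive gain with a small p-value would not.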

Figures

Figures reproduced from arXiv: 2604.09535 by Dayou Li, Hezhen Hu, Hitesh Vijay, Lulin Liu, Manling Li, Sicong Jiang, Srinivas Shakkottai, Xuhai Xu, Yiqing Liang, Zhiwen Fan, Zirui Liu.

Figure 1
Figure 1: What is the right data for teaching current vision-language foundation models human-like spatial perception and long-horizon …
Figure 2
Figure 2: A comparison of video annotations for an everyday task. The top filmstrip shows keyframes of the task "put a biscuit box in the …"
Figure 3
Figure 3: Benchmark statistics. Distribution of EgoTL benchmark tasks across three main categories.
Figure 4
Figure 4: Task overview in EgoTL-Bench. EgoTL-Bench decomposes egocentric spatial understanding into six tasks across three layers. Memory-conditioned planning asks the model to generate an action plan from a memory-bank walkthrough and a high-level goal. Scene-aware action reasoning tests whether it selects the correct action in cluttered scenes, such as moving an obstacle before opening a door. Next action predict…
Original abstract

Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces EgoTL, a dataset and capture pipeline for egocentric think-aloud chains on long-horizon household tasks. The pipeline records step-by-step goals and reasoning via a say-before-act protocol with word-level timestamps, calibrates physical properties using metric-scale spatial estimators and a memory-bank walkthrough for scene context, and adds clip-level tags for navigation and manipulation actions. It benchmarks VLMs and world models on six task dimensions across three layers plus long-horizon generation over minute-long sequences on >100 daily tasks, concludes that current foundation models fall short, and reports that finetuning on human CoT aligned with metric labels from the training split improves long-horizon planning/reasoning, step-wise reasoning, instruction following, and spatial grounding.

Significance. If the reported finetuning gains are robust and the dataset is released with full annotations, EgoTL could provide a valuable resource for training embodied agents on realistic, minute-scale household planning where current VLMs struggle with hallucination and physical grounding. The multi-layer benchmarking and emphasis on metric spatial alignment are constructive contributions. The work explicitly credits the creation of human-aligned CoT data as the key enabler for the observed improvements.

major comments (3)
  1. §3 (Data Collection Pipeline): The central claim that finetuning on EgoTL CoT yields genuine capability gains rests on the assumption that the say-before-act protocol plus memory-bank walkthrough produces faithful, unbiased reasoning traces. The manuscript provides no validation (e.g., comparison against post-action think-aloud, concurrent eye-tracking, or inter-annotator consistency on decision order) to demonstrate that verbalization-before-action does not systematically elicit post-hoc rationalizations or alter natural planning sequences, which directly undermines the training-signal quality.
  2. §5 (Experiments and Results): The abstract and conclusion assert measurable improvements in long-horizon planning, step-wise reasoning, instruction following, and spatial grounding after finetuning, yet the manuscript summary supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis on the held-out test split. Without these load-bearing numbers and controls, the magnitude, statistical significance, and generalization of the claimed gains cannot be evaluated.
  3. §4 (Benchmarking Setup): The claim that existing VLMs and world models 'fall short' on six task dimensions and long-horizon generation is presented without tabulated scores, failure-mode breakdowns, or comparison against human performance ceilings. This leaves the motivation for EgoTL and the finetuning target underspecified.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., absolute gains on the primary metric) to allow readers to gauge the scale of improvement without reading the full experiments section.
  2. Notation for the six task dimensions and three layers is introduced without an explicit table or diagram in the early sections, making cross-references harder to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Where appropriate, we indicate the revisions we will make to address the concerns while maintaining the core contributions of EgoTL.

Point-by-point responses
  1. Referee: §3 (Data Collection Pipeline): The central claim that finetuning on EgoTL CoT yields genuine capability gains rests on the assumption that the say-before-act protocol plus memory-bank walkthrough produces faithful, unbiased reasoning traces. The manuscript provides no validation (e.g., comparison against post-action think-aloud, concurrent eye-tracking, or inter-annotator consistency on decision order) to demonstrate that verbalization-before-action does not systematically elicit post-hoc rationalizations or alter natural planning sequences, which directly undermines the training-signal quality.

    Authors: We appreciate this methodological concern regarding the fidelity of the think-aloud traces. The say-before-act protocol was deliberately selected, following established practices in cognitive science and HCI, to capture reasoning prior to action and thereby reduce post-hoc rationalization. In the revised manuscript, we will expand §3 with a new subsection on protocol rationale, supported by citations to prior think-aloud validation studies. We will also report inter-annotator agreement on decision ordering for a sampled subset of traces (see the agreement sketch after these responses) and include qualitative comparisons of sample chains. We acknowledge that concurrent eye-tracking or systematic post-action comparisons would require additional participant studies and hardware; we will explicitly note this as a limitation while arguing that the current protocol still provides high-quality, human-aligned CoT for the reported finetuning gains. revision: partial

  2. Referee: §5 (Experiments and Results): The abstract and conclusion assert measurable improvements in long-horizon planning, step-wise reasoning, instruction following, and spatial grounding after finetuning, yet the manuscript summary supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis on the held-out test split. Without these load-bearing numbers and controls, the magnitude, statistical significance, and generalization of the claimed gains cannot be evaluated.

    Authors: We apologize for any lack of prominence in the summary presentation. The full §5 of the manuscript already contains quantitative results on the held-out test split, including before/after finetuning metrics for planning success, step-wise reasoning accuracy, instruction-following rates, and spatial grounding errors, along with baseline comparisons to multiple VLMs and ablations isolating the CoT and metric-label components. Error analysis categorizes failures by type (e.g., hallucination, sequencing, physical grounding). We will revise the abstract and conclusion to explicitly quote key numbers, add statistical significance reporting (see the bootstrap sketch after these responses), and ensure all tables/figures are cross-referenced clearly so that the magnitude and robustness of the gains are immediately evaluable. revision: yes

  3. Referee: §4 (Benchmarking Setup): The claim that existing VLMs and world models 'fall short' on six task dimensions and long-horizon generation is presented without tabulated scores, failure-mode breakdowns, or comparison against human performance ceilings. This leaves the motivation for EgoTL and the finetuning target underspecified.

    Authors: We agree that clearer quantitative presentation will strengthen the motivation. In the revised §4, we will introduce a consolidated results table reporting scores for all evaluated VLMs and world models across the six task dimensions plus long-horizon generation. We will add failure-mode breakdowns (e.g., percentages attributable to object hallucination, step omission, or spatial misalignment) and, on a sampled subset of tasks, human performance ceilings for direct comparison. These additions will make the performance gaps explicit and better justify both the dataset and the finetuning experiments. revision: yes
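Response 1 promises inter-annotator agreement on decision ordering. One concrete reading of that promise is Kendall's tau between two annotators' orderings of the same decision steps; the function below is an illustrative sketch, not the authors' analysis code.

```python
def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two annotators' step orderings.

    order_a / order_b: the same set of step IDs, each in the order one
    annotator believes the decisions were made (no ties). Returns a value
    in [-1, 1]; 1 means identical order, -1 means fully reversed.
    """
    rank_b = {step: i for i, step in enumerate(order_b)}
    seq = [rank_b[step] for step in order_a]
    n = len(seq)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i] < seq[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For example, with hypothetical step IDs, kendall_tau(["open_fridge", "grab_milk", "close_fridge"], ["open_fridge", "close_fridge", "grab_milk"]) returns 1/3, flagging a partial disagreement about decision order.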
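Response 2 promises statistical significance reporting for the before/after finetuning gains. A paired bootstrap over per-sequence score differences is one common choice; the sketch below assumes per-sequence scores for the base and finetuned model on the held-out split, and is not the authors' evaluation code.

```python
import random

def bootstrap_gain_ci(before, after, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap confidence interval for the mean finetuning gain.

    before / after: per-sequence scores for the base and finetuned model
    on the held-out test split, in the same order.
    Returns (mean gain, (ci_low, ci_high)).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)
```

An interval that excludes zero would make the claimed gains immediately evaluable, which is what the referee asked for.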

Circularity Check

0 steps flagged

No circularity: empirical dataset contribution with independent validation

full rationale

The paper's core contribution is the construction of the EgoTL dataset via a say-before-act think-aloud protocol plus metric calibration and memory-bank walkthroughs, followed by empirical benchmarking and finetuning experiments on VLMs. No mathematical derivation chain, fitted-parameter predictions, or self-referential equations are present. The reported improvements are measured outcomes on held-out tasks rather than quantities forced by construction from the input data or prior self-citations. The central claim rests on external evaluation metrics and does not reduce to re-labeling or re-fitting its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions about VLM trainability and introduces a data collection method without specifying fitted parameters or new physical entities in the abstract.

axioms (2)
  • domain assumption Human think-aloud protocols accurately reflect internal reasoning processes for household tasks
    Invoked in the say-before-act protocol for recording step-by-step goals and spoken reasoning.
  • domain assumption Metric-scale spatial estimators and memory-bank walkthroughs can reliably calibrate physical properties from egocentric views
    Central to the calibration of physical properties and scene context.

pith-pipeline@v0.9.0 · 5603 in / 1371 out tokens · 106391 ms · 2026-05-10T17:05:50.094669+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 28 canonical work pages · 18 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 7, 8, 3

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  4. [4]

    Whisperx: Time-accurate speech transcription of long-form audio. INTERSPEECH, 2023

    Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. Whisperx: Time-accurate speech transcription of long-form audio. INTERSPEECH, 2023. 5

  5. [5]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7061–7071, 2025. 3, 5

  6. [6]

    Large linguistic models: Analyzing theoretical linguistic abilities of LLMs

    Gašper Beguš, Maksymilian Dąbkowski, and Ryan Rhodes. Large linguistic models: Analyzing theoretical linguistic abilities of llms. arXiv preprint arXiv:2305.00948, 2023. 2

  7. [7]

    Language models are few-shot learners. NeurIPS, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, ...

  8. [8]

    Spatialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025. 4

  9. [9]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 3, 6

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 6, 7, 8, 1, 3

  11. [11]

    The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020. 5

  12. [12]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023. 3, 4

  13. [13]

    Arctic: A dataset for dexterous bimanual hand-object manipulation

    Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12943–12954, 2023. 5

  14. [14]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025. 4, 6

  15. [15]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 3, 7

  16. [16]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  17. [17]

    Kristen Grauman et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 2, 3, 5

  18. [18]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018. 8

  19. [19]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. In International Conference on Learning Representations (ICLR), 2024. arXiv:2301.04104. 8

  20. [20]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 7

  21. [21]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021. 8

  22. [22]

    Lora: Low-rank adaptation of large language models. ICLR, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022.

  23. [23]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 8

  24. [24]

    Gpt-4o system card. arXiv, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv, 2024.

  25. [25]

    MMToM-QA: Multimodal theory of mind question answering

    Chuanyang Jin et al. MMToM-QA: Multimodal theory of mind question answering. In ACL, 2024. 2

  26. [26]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction, 20...

  27. [27]

    Learning instruction-guided manipulation affordance via large models for embodied robotic tasks

    Dayou Li, Chenkun Zhao, Shuo Yang, Lin Ma, Yibin Li, and Wei Zhang. Learning instruction-guided manipulation affordance via large models for embodied robotic tasks. In 2024 International Conference on Advanced Robotics and Mechatronics (ICARM), pages 662–667. IEEE, 2024. 2

  28. [28]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, pages 26689–26699, 2024. 3

  29. [29]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013–21022, 2022. 3, 5

  30. [30]

    Aria Everyday Activities Dataset

    Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset. arXiv preprint arXiv:2402.13349, 2024. 3, 5

  31. [31]

    Nymeria: A massive collection of multimodal egocentric daily motion in the wild

    Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. In European Conference on Computer Vision, pages 445–465. Springer, 2024. 3, 5

  32. [32]

    A comprehensive overview of large language models

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.

  33. [33]

    Egothinker: Unveiling egocentric reasoning with spatio-temporal cot. arXiv preprint arXiv:2510.23569, 2025

    Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, and Jiangmiao Pang. Egothinker: Unveiling egocentric reasoning with spatio-temporal cot. arXiv preprint arXiv:2510.23569, 2025. 2

  34. [34]

    Toby Perrett et al. Hd-epic: A highly-detailed egocentric video dataset. In CVPR, 2025. 2, 3, 5, 8

  35. [35]

    Egome: A new dataset and challenge for following me via egocentric view in real world

    Heqian Qiu, Zhaofeng Shi, Lanxiao Wang, Huiyu Xiong, Xiang Li, and Hongliang Li. Egome: A new dataset and challenge for following me via egocentric view in real world. arXiv preprint arXiv:2501.19061, 2025. 5

  36. [36]

    Improving language understanding by generative pre-training. OpenAI Blog, 2018

    Alec Radford. Improving language understanding by generative pre-training. OpenAI Blog, 2018. 2

  37. [37]

    Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 2

  38. [38]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096–21106, 2022. 3

  39. [39]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020. 3

  40. [40]

    Efm3d: A benchmark for measuring progress towards 3d egocentric foundation models

    Julian Straub, Daniel DeTone, Tianwei Shen, Nan Yang, Chris Sweeney, and Richard Newcombe. Efm3d: A benchmark for measuring progress towards 3d egocentric foundation models

  41. [41]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 6, 7, 8, 3

  42. [42]

    Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025

    HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025. 4

  43. [43]

    Qwen2.5: A party of foundation models, 2024

    Qwen Team. Qwen2.5: A party of foundation models, 2024. 3, 6, 8

  44. [44]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  46. [46]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  47. [47]

    Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction. arXiv preprint arXiv:2406.06843, 2024

    Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, and Yu Xiang. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction. arXiv preprint arXiv:2406.06843, 2024. 5

  48. [48]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 2

  49. [49]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023. 5

  50. [50]

    Egovid-5m: A large-scale video-action dataset for egocentric video generation, 2024

    Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, and Xingang Wang. Egovid-5m: A large-scale video-action dataset for egocentric video generation. arXiv preprint arXiv:2411.08380, 2024. 5

  51. [51]

    Emergent abilities of large language models. TMLR, 2022

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. TMLR, 2022. 2

  52. [52]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  53. [53]

    Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024. 4, 3

  54. [54]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 6, 7

  55. [55]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV'25, 2025. 4

  56. [56]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 7

  57. [57]

    World-in-world: World models in a closed-loop world

    Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025. 4

  58. [58]

    Unveiling linguistic regions in large language models

    Zhihao Zhang, Jun Zhao, Qi Zhang, Tao Gui, and Xuanjing Huang. Unveiling linguistic regions in large language models. In ACL, 2024. 2

  59. [59]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 6
