EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
Pith reviewed 2026-05-10 17:05 UTC · model grok-4.3
The pith
Finetuning foundation models on human think-aloud chains with metric labels from EgoTL improves long-horizon planning and spatial reasoning in egocentric tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, foundation models and world models are benchmarked on six task dimensions spanning three layers, plus long-horizon generation over minute-long sequences, across more than 100 daily household tasks. The models still fall short as egocentric assistants or open-world simulators. Finetuning foundation models with human CoT aligned with metric labels on EgoTL's training split improves long-horizon planning, step-wise reasoning, instruction following, and spatial grounding.
What carries the argument
The EgoTL think-aloud capture pipeline that records spoken reasoning before action and aligns it with metric spatial calibration and memory-bank scene context.
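A minimal sketch of what a single EgoTL-style annotation record could look like, assuming the say-before-act transcript carries word-level timestamps and each step is paired with metric spatial labels and clip-level tags. The dataclass and field names below are hypothetical, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimedWord:
    word: str          # one spoken token from the think-aloud transcript
    start_s: float     # word onset, seconds from clip start
    end_s: float       # word offset, seconds from clip start

@dataclass
class MetricSpatialLabel:
    object_name: str   # e.g. "mug" (hypothetical label vocabulary)
    distance_m: float  # metric-scale distance from the camera, in meters
    width_m: float     # calibrated physical width, in meters

@dataclass
class ThinkAloudStep:
    goal: str                           # stated sub-goal, verbalized before acting
    reasoning: List[TimedWord]          # spoken reasoning with word-level timing
    spatial_labels: List[MetricSpatialLabel] = field(default_factory=list)
    navigation_tag: str = ""            # clip-level navigation instruction, if any
    manipulation_tag: str = ""          # clip-level manipulation action, if any

@dataclass
class EgoTLEpisode:
    task: str                           # household task, e.g. "make coffee"
    scene_memory: List[str]             # memory-bank walkthrough notes on the scene
    steps: List[ThinkAloudStep] = field(default_factory=list)
```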
If this is right
- Fine-tuned models generate better long-horizon plans over minute-long sequences.
- Step-wise reasoning becomes more reliable across household tasks.
- Instruction following improves for navigation and manipulation actions.
- Spatial grounding better respects real-world physical attributes and avoids hallucinations.
Where Pith is reading between the lines
- The same capture approach could be adapted to collect reasoning data for non-household embodied settings such as outdoor navigation.
- EgoTL-style chains might help world models maintain consistent scene memory over longer simulated episodes.
- Testing the fine-tuned models on physical robots would show whether the gains transfer beyond video evaluation.
Load-bearing premise
The say-before-act protocol combined with metric-scale spatial estimators and memory-bank walkthroughs accurately captures unbiased human reasoning chains and physical properties without introducing significant errors or biases.
What would settle it
An experiment showing that models fine-tuned on EgoTL-aligned labels perform no better than models fine-tuned on standard VLM-generated labels when tested on held-out minute-long egocentric household sequences would falsify the improvement claim.
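A hedged sketch of how that comparison could be run, assuming per-sequence scores on a shared held-out split exist for both the EgoTL-finetuned model and a baseline finetuned on VLM-generated labels. The file names and the single task-success metric are placeholders, not artifacts from the paper.

```python
# Paired comparison of two finetuned models on the same held-out sequences.
# Assumes each JSON file maps sequence_id -> task-success score in [0, 1];
# the filenames and metric are illustrative, not from the paper.
import json
from scipy.stats import wilcoxon

with open("egotl_finetuned_scores.json") as f:
    egotl = json.load(f)
with open("vlm_label_finetuned_scores.json") as f:
    baseline = json.load(f)

shared = sorted(set(egotl) & set(baseline))
a = [egotl[k] for k in shared]
b = [baseline[k] for k in shared]

# One-sided test: does EgoTL finetuning beat the VLM-labeled baseline?
stat, p = wilcoxon(a, b, alternative="greater")
mean_gain = sum(x - y for x, y in zip(a, b)) / len(shared)
print(f"n={len(shared)}  mean gain={mean_gain:+.3f}  Wilcoxon p={p:.4f}")
# A near-zero gain with a large p-value would undercut the improvement claim.
```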
Original abstract
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EgoTL, a dataset and capture pipeline for egocentric think-aloud chains on long-horizon household tasks. The pipeline records step-by-step goals and reasoning via a say-before-act protocol with word-level timestamps, calibrates physical properties using metric-scale spatial estimators and a memory-bank walkthrough for scene context, and adds clip-level tags for navigation and manipulation actions. It benchmarks VLMs and world models on six task dimensions across three layers plus long-horizon generation over minute-long sequences on >100 daily tasks, concludes that current foundation models fall short, and reports that finetuning on human CoT aligned with metric labels from the training split improves long-horizon planning/reasoning, step-wise reasoning, instruction following, and spatial grounding.
Significance. If the reported finetuning gains are robust and the dataset is released with full annotations, EgoTL could provide a valuable resource for training embodied agents on realistic, minute-scale household planning where current VLMs struggle with hallucination and physical grounding. The multi-layer benchmarking and emphasis on metric spatial alignment are constructive contributions. The work explicitly credits the creation of human-aligned CoT data as the key enabler for the observed improvements.
Major comments (3)
- §3 (Data Collection Pipeline): The central claim that finetuning on EgoTL CoT yields genuine capability gains rests on the assumption that the say-before-act protocol plus memory-bank walkthrough produces faithful, unbiased reasoning traces. The manuscript provides no validation (e.g., comparison against post-action think-aloud, concurrent eye-tracking, or inter-annotator consistency on decision order) to demonstrate that verbalization-before-action does not systematically elicit post-hoc rationalizations or alter natural planning sequences, which directly undermines the training-signal quality.
- §5 (Experiments and Results): The abstract and conclusion assert measurable improvements in long-horizon planning, step-wise reasoning, instruction following, and spatial grounding after finetuning, yet the manuscript summary supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis on the held-out test split. Without these load-bearing numbers and controls, the magnitude, statistical significance, and generalization of the claimed gains cannot be evaluated.
- §4 (Benchmarking Setup): The claim that existing VLMs and world models 'fall short' on six task dimensions and long-horizon generation is presented without tabulated scores, failure-mode breakdowns, or comparison against human performance ceilings. This leaves the motivation for EgoTL and the finetuning target underspecified.
Minor comments (2)
- The abstract would be strengthened by including one or two key quantitative results (e.g., absolute gains on the primary metric) to allow readers to gauge the scale of improvement without reading the full experiments section.
- Notation for the six task dimensions and three layers is introduced without an explicit table or diagram in the early sections, making cross-references harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Where appropriate, we indicate the revisions we will make to address the concerns while maintaining the core contributions of EgoTL.
Point-by-point responses
- Referee: §3 (Data Collection Pipeline): The central claim that finetuning on EgoTL CoT yields genuine capability gains rests on the assumption that the say-before-act protocol plus memory-bank walkthrough produces faithful, unbiased reasoning traces. The manuscript provides no validation (e.g., comparison against post-action think-aloud, concurrent eye-tracking, or inter-annotator consistency on decision order) to demonstrate that verbalization-before-action does not systematically elicit post-hoc rationalizations or alter natural planning sequences, which directly undermines the training-signal quality.
Authors: We appreciate this methodological concern regarding the fidelity of the think-aloud traces. The say-before-act protocol was deliberately selected, following established practices in cognitive science and HCI, to capture reasoning prior to action and thereby reduce post-hoc rationalization. In the revised manuscript, we will expand §3 with a new subsection on protocol rationale, supported by citations to prior think-aloud validation studies. We will also report inter-annotator agreement on decision ordering for a sampled subset of traces and include qualitative comparisons of sample chains. We acknowledge that concurrent eye-tracking or systematic post-action comparisons would require additional participant studies and hardware; we will explicitly note this as a limitation while arguing that the current protocol still provides high-quality, human-aligned CoT for the reported finetuning gains. revision: partial
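One way the promised inter-annotator agreement on decision ordering could be computed is rank correlation over the step order two annotators assign to the same episode. This is an illustrative sketch; the step IDs and sample data are hypothetical, and the authors do not specify this exact procedure.

```python
# Kendall's tau between two annotators' orderings of the same decision steps.
# The step identifiers below are illustrative placeholders.
from scipy.stats import kendalltau

annotator_a = ["open_fridge", "grab_milk", "close_fridge", "pour_milk"]
annotator_b = ["open_fridge", "grab_milk", "pour_milk", "close_fridge"]

# Convert each annotator's sequence into ranks over the shared set of steps.
shared_steps = sorted(set(annotator_a) & set(annotator_b))
ranks_a = [annotator_a.index(s) for s in shared_steps]
ranks_b = [annotator_b.index(s) for s in shared_steps]

tau, p = kendalltau(ranks_a, ranks_b)
print(f"decision-order agreement: tau={tau:.2f} (p={p:.3f})")
```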
- Referee: §5 (Experiments and Results): The abstract and conclusion assert measurable improvements in long-horizon planning, step-wise reasoning, instruction following, and spatial grounding after finetuning, yet the manuscript summary supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis on the held-out test split. Without these load-bearing numbers and controls, the magnitude, statistical significance, and generalization of the claimed gains cannot be evaluated.
Authors: We apologize for any lack of prominence in the summary presentation. The full §5 of the manuscript already contains quantitative results on the held-out test split, including before/after finetuning metrics for planning success, step-wise reasoning accuracy, instruction-following rates, and spatial grounding errors, along with baseline comparisons to multiple VLMs and ablations isolating the CoT and metric-label components. Error analysis categorizes failures by type (e.g., hallucination, sequencing, physical grounding). We will revise the abstract and conclusion to explicitly quote key numbers, add statistical significance reporting, and ensure all tables/figures are cross-referenced clearly so that the magnitude and robustness of the gains are immediately evaluable. revision: yes
- Referee: §4 (Benchmarking Setup): The claim that existing VLMs and world models 'fall short' on six task dimensions and long-horizon generation is presented without tabulated scores, failure-mode breakdowns, or comparison against human performance ceilings. This leaves the motivation for EgoTL and the finetuning target underspecified.
Authors: We agree that clearer quantitative presentation will strengthen the motivation. In the revised §4, we will introduce a consolidated results table reporting scores for all evaluated VLMs and world models across the six task dimensions plus long-horizon generation. We will add failure-mode breakdowns (e.g., percentages attributable to object hallucination, step omission, or spatial misalignment) and, on a sampled subset of tasks, human performance ceilings for direct comparison. These additions will make the performance gaps explicit and better justify both the dataset and the finetuning experiments. revision: yes
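A minimal sketch of the kind of failure-mode breakdown described in the response, assuming each evaluated episode has been tagged with zero or more failure categories. The category names follow the rebuttal's examples; the input format and sample data are hypothetical.

```python
# Tally failure modes across evaluated episodes and report percentages.
# `episode_failures` is a hypothetical input: one list of tags per episode.
from collections import Counter

episode_failures = [
    ["object_hallucination"],
    ["step_omission", "spatial_misalignment"],
    [],                         # episode with no recorded failure
    ["spatial_misalignment"],
]

counts = Counter(tag for tags in episode_failures for tag in tags)
n = len(episode_failures)
for tag, c in counts.most_common():
    print(f"{tag:<24s} {c}/{n} episodes ({100 * c / n:.0f}%)")
```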
Circularity Check
No circularity: empirical dataset contribution with independent validation
full rationale
The paper's core contribution is the construction of the EgoTL dataset via a say-before-act think-aloud protocol plus metric calibration and memory-bank walkthroughs, followed by empirical benchmarking and finetuning experiments on VLMs. No mathematical derivation chain, fitted-parameter predictions, or self-referential equations are present. The reported improvements are measured outcomes on held-out tasks rather than quantities forced by construction from the input data or prior self-citations. The central claim rests on external evaluation metrics and does not reduce to re-labeling or re-fitting its own inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Human think-aloud protocols accurately reflect internal reasoning processes for household tasks.
- Domain assumption: Metric-scale spatial estimators and memory-bank walkthroughs can reliably calibrate physical properties from egocentric views.