EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
Pith reviewed 2026-05-20 11:14 UTC · model grok-4.3
The pith
Synchronized egocentric and exocentric videos supply complementary memory cues that current multimodal models have not yet fully exploited.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that egocentric and exocentric views provide complementary cues for spatial-temporal memory reasoning. It supports the claim with the EgoExoMem benchmark of 2.6K high-quality MCQs, on which the best existing MLLM reaches 55.3 percent and the proposed E²-Select method reaches 58.2 percent, ahead of frame-selection and RAG baselines.
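To put the 2.9-point gap in context, the following is a rough sketch of the sampling uncertainty on the two reported accuracies. It assumes that "2.6K" means about 2,600 questions and treats the two scores as independent binomial proportions. Both are assumptions made here, not statements from the paper. If the two systems are scored on the same questions, a paired analysis would be more appropriate, and this estimate likely overstates the uncertainty.

```python
import math


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy `acc` measured on `n` questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


# Assumption: "2.6K" is read as roughly 2,600 questions.
n_questions = 2600
best_mllm, e2_select = 0.553, 0.582  # accuracies reported in the abstract

gap = e2_select - best_mllm
# Standard error of the difference, treating the two scores as independent.
se_gap = math.sqrt(
    accuracy_standard_error(best_mllm, n_questions) ** 2
    + accuracy_standard_error(e2_select, n_questions) ** 2
)
print(f"gap = {gap:.3f}, approximate standard error of the gap = {se_gap:.3f}")
```

Under these assumptions the gap is roughly two standard errors wide, so per-question paired results would help judge how robust the improvement is.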
What carries the argument
E²-Select, a training-free frame selection method that combines relevance-based budget allocation with per-view k-DPP sampling to manage view asymmetry and cross-view temporal consistency in synchronized ego-exo videos.
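The two ingredients named above can be illustrated with a small sketch. This is not the authors' implementation. The function names, the proportional allocation rule, and the kernel construction are assumptions made for illustration. True k-DPP sampling is replaced by a greedy log-determinant maximization, which is a common deterministic stand-in. Cross-view temporal consistency is not modeled here.

```python
import numpy as np


def allocate_budget(rel_ego, rel_exo, total):
    """Split `total` frames (>= 2) between views in proportion to mean relevance.

    Assumption: scores are non-negative; the paper's actual rule may differ.
    """
    w_ego, w_exo = float(np.mean(rel_ego)), float(np.mean(rel_exo))
    if w_ego + w_exo <= 0:
        k_ego = total // 2  # no usable signal: fall back to an even split
    else:
        k_ego = int(round(total * w_ego / (w_ego + w_exo)))
    k_ego = min(max(k_ego, 1), total - 1)  # keep at least one frame per view
    return k_ego, total - k_ego


def greedy_kdpp(features, relevance, k):
    """Greedy log-det selection of k frames (stand-in for k-DPP sampling).

    Kernel: L = diag(r) * S * diag(r), with S the cosine similarity of frames.
    """
    relevance = np.asarray(relevance, dtype=float)
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    L = np.outer(relevance, relevance) * (feats @ feats.T)
    L += 1e-6 * np.eye(len(L))  # jitter keeps sub-determinants positive
    selected = []
    for _ in range(min(k, len(L))):
        best, best_val = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            if best is None or val > best_val:
                best, best_val = i, val
        selected.append(best)
    return sorted(selected)


# Hypothetical inputs: random vectors stand in for per-view frame features.
rng = np.random.default_rng(0)
ego_feats, exo_feats = rng.normal(size=(30, 16)), rng.normal(size=(30, 16))
ego_rel, exo_rel = rng.uniform(0.1, 1.0, 30), rng.uniform(0.1, 1.0, 30)

k_ego, k_exo = allocate_budget(ego_rel, exo_rel, total=8)
ego_idx = greedy_kdpp(ego_feats, ego_rel, k_ego)
exo_idx = greedy_kdpp(exo_feats, exo_rel, k_exo)
print("ego frames:", ego_idx, "exo frames:", exo_idx)
```

The log-determinant objective rewards frames that are both relevant and mutually dissimilar, which is why near-duplicate frames are rarely chosen together.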
If this is right
- Ego and exo views supply complementary memory cues that improve reasoning when both are available.
- Existing multimodal large language models remain far from solving cross-view memory tasks.
- The training-free E²-Select method outperforms standard frame-selection and RAG-based memory baselines on this benchmark.
- Question framing and answer grounding exhibit systematic view-preference conflicts.
Where Pith is reading between the lines
- Models that explicitly learn to resolve view-preference conflicts could close more of the performance gap than selection alone.
- The benchmark structure could be reused to test memory reasoning in longer, unscripted video streams from wearable and overhead cameras.
- Integration of dual-view selection into embodied agents might reduce errors in spatial tasks such as object relocation or route planning.
Load-bearing premise
The 2.6K multiple-choice questions are high-quality and representative of real cross-view memory reasoning demands.
What would settle it
A direct comparison of model accuracy on EgoExoMem against accuracy on a new set of cross-view questions derived from real robotic navigation logs using the same synchronized video sources.
Original abstract
Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos, containing 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. It proposes E²-Select, a training-free frame selection method combining relevance-based budget allocation with per-view k-DPP sampling to address view asymmetry and cross-view temporal consistency. Experiments show that existing MLLMs reach at most 55.3% accuracy while E²-Select achieves 58.2% over frame-selection and RAG baselines, with further analysis of view-preference conflicts.
Significance. If the benchmark questions genuinely require cross-view integration, this work would provide a valuable new resource for evaluating and improving multimodal models on complementary ego-exo memory cues, an area relevant to embodied AI. The training-free design of E²-Select and the reproducible performance numbers are strengths that support broader adoption.
major comments (2)
- [Benchmark construction (§3/§4)] The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, so the complementarity conclusion and the 55.3%/58.2% gap rest on an unverified assumption.
- [Method (§4)] While E²-Select is presented as training-free, the relevance-based budget allocation step requires an explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark. If these scores implicitly depend on benchmark-specific heuristics, the 'training-free' characterization needs clarification to avoid circularity with the evaluation.
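The first major comment asks for inter-annotator agreement on view necessity. One common way to report it is Cohen's kappa over per-question labels. The sketch below assumes two annotators who each label a question as needing the ego view, the exo view, or both. The label set and the example data are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical view-necessity labels for six questions.
annotator_1 = ["both", "both", "ego", "exo", "both", "ego"]
annotator_2 = ["both", "ego", "ego", "exo", "both", "both"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would indicate that annotators largely agree on which questions truly need both views, while a value near 0 would suggest the "cross-view" label is not much better than chance.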
minor comments (2)
- [Abstract] Abstract: Use consistent mathematical formatting (e.g., 2.6K vs $2.6K$) and ensure all acronyms (MLLM, k-DPP) are defined on first use.
- [Figures/Tables] Figure captions and tables: Add explicit legends distinguishing ego-only, exo-only, and dual-view conditions to improve readability of the complementarity analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.
- Referee: [Benchmark construction (§3/§4)] The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, so the complementarity conclusion and the 55.3%/58.2% gap rest on an unverified assumption.
  Authors: We agree that explicit verification of cross-view complementarity strengthens the central claims. The benchmark was designed with dedicated cross-view QA categories and questions that target integration of complementary cues (e.g., egocentric action details paired with exocentric spatial layout), supported by qualitative examples in the paper. However, we did not report single-view human accuracy or explicit filtering statistics. In the revision we will add single-view human evaluation on a representative subset of questions, together with inter-annotator agreement on view necessity, thereby providing direct empirical support for the complementarity assumption.
  Revision: yes
- Referee: [Method (§4)] While E²-Select is presented as training-free, the relevance-based budget allocation step requires an explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark. If these scores implicitly depend on benchmark-specific heuristics, the 'training-free' characterization needs clarification to avoid circularity with the evaluation.
  Authors: The relevance scores are obtained by embedding the query with a fixed, off-the-shelf vision-language model (CLIP) and computing cosine similarity against frame embeddings from each view independently. No training, fine-tuning, or benchmark-specific heuristics are involved; the same general-purpose model is used for all queries. We will revise the method section to state this procedure explicitly and to clarify that E²-Select remains training-free with no parameters tuned on EgoExoMem, thereby removing any potential ambiguity regarding circularity.
  Revision: yes
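The relevance scoring described in the response above amounts to a cosine similarity between one query embedding and the frame embeddings of each view. A minimal sketch of that computation follows, assuming the embeddings have already been produced by a fixed encoder. Random vectors stand in for CLIP outputs here, and the name `cosine_relevance` is illustrative rather than taken from the paper.

```python
import numpy as np


def cosine_relevance(query_emb, frame_embs):
    """Cosine similarity between one query vector and each row of `frame_embs`."""
    q = query_emb / np.linalg.norm(query_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return f @ q


# Placeholder embeddings; in practice these would come from a frozen encoder.
rng = np.random.default_rng(0)
query = rng.normal(size=64)
ego_frames = rng.normal(size=(120, 64))
exo_frames = rng.normal(size=(120, 64))

# Each view is scored independently against the same query.
ego_scores = cosine_relevance(query, ego_frames)
exo_scores = cosine_relevance(query, exo_frames)
print("top ego frame:", int(np.argmax(ego_scores)))
print("top exo frame:", int(np.argmax(exo_scores)))
```

Because nothing in this step is fitted to benchmark answers, it is consistent with the training-free description, provided the encoder itself was not tuned on EgoExoMem.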
Circularity Check
No significant circularity; derivation is self-contained
The paper introduces EgoExoMem benchmark and E²-Select method as training-free, relying on explicit algorithmic steps (relevance-based budget allocation plus per-view k-DPP sampling) that operate on input video features without fitting parameters to the target MCQ answers or reducing any claimed result to a self-definition. No equations or sections equate a prediction to its own fitted input, invoke load-bearing self-citations for uniqueness, or rename prior empirical patterns as new derivations. Experimental claims inherit benchmark validity risks but do not exhibit circular reduction in the derivation chain itself.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  EgoExoMem contains 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. ... E²-Select ... relevance-based budget allocation with per-view k-DPP sampling
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Experiments show that ego and exo views provide complementary memory cues
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.