
arxiv: 2605.18734 · v1 · pith:IPG6XWDGnew · submitted 2026-05-18 · 💻 cs.CV

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Pith reviewed 2026-05-20 11:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric video, exocentric video, memory reasoning, multimodal large language models, frame selection, cross-view reasoning, benchmark

The pith

Synchronized egocentric and exocentric videos supply complementary memory cues that current multimodal models have not yet fully exploited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoExoMem, which it presents as the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. The benchmark contains 2.6K multiple-choice questions (MCQs) across eight temporal, spatial, and cross-view types. The best existing multimodal large language model (MLLM) reaches only 55.3 percent accuracy, while E²-Select, a new training-free frame selection method, reaches 58.2 percent. E²-Select allocates a frame budget to each view according to relevance and then samples frames per view with a k-DPP (a fixed-size determinantal point process) to handle view asymmetry and cross-view temporal consistency. A sympathetic reader would care because embodied intelligence often relies on memory that single-view egocentric footage cannot fully support; the results show both the value of dual perspectives and the gap that remains in model capabilities. The work further identifies systematic conflicts between the view a question's framing favors and the view in which its answer is grounded.

Core claim

The paper claims that egocentric and exocentric views provide complementary cues for spatial-temporal memory reasoning. It supports the claim with the EgoExoMem benchmark of 2.6K high-quality MCQs, on which the best existing MLLM reaches 55.3 percent and the proposed E²-Select method reaches 58.2 percent, ahead of frame-selection and RAG baselines.
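
To put the two headline accuracies in perspective, the sketch below computes the absolute and relative gain and a rough binomial standard error. It assumes that "2.6K" means exactly 2,600 questions and that questions are independent; both are simplifying assumptions made here, not figures from the paper.

```python
import math

# Accuracies reported in the paper, as fractions.
BEST_MLLM_ACC = 0.553
E2_SELECT_ACC = 0.582
# Assumption: "2.6K" is taken as exactly 2600 questions.
NUM_QUESTIONS = 2600


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


absolute_gain = E2_SELECT_ACC - BEST_MLLM_ACC
relative_gain = absolute_gain / BEST_MLLM_ACC
extra_correct = absolute_gain * NUM_QUESTIONS
std_err = binomial_standard_error(BEST_MLLM_ACC, NUM_QUESTIONS)

print(f"absolute gain: {100 * absolute_gain:.1f} points")
print(f"relative gain: {100 * relative_gain:.1f}%")
print(f"extra questions answered correctly: about {extra_correct:.0f}")
print(f"standard error of the baseline accuracy: {100 * std_err:.2f} points")
```

Under these assumptions the gain is about 2.9 points (roughly 75 questions), against a standard error of just under one point for a single accuracy estimate. A proper significance test would need the per-question predictions, which are not available here.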

What carries the argument

E²-Select, a training-free frame selection method that combines relevance-based budget allocation with per-view k-DPP sampling to manage view asymmetry and cross-view temporal consistency in synchronized ego-exo videos.
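
The per-view selection step can be illustrated with a small sketch. This is not the paper's implementation: it swaps exact k-DPP sampling for a greedy determinant-maximizing selection, and the kernel (relevance times cosine similarity times relevance) is an assumed, commonly used construction. The names `build_kernel` and `greedy_diverse_select` and the toy features are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def build_kernel(features: Sequence[Sequence[float]],
                 relevance: Sequence[float]) -> List[List[float]]:
    """Assumed kernel: L[i][j] = relevance_i * cos(f_i, f_j) * relevance_j."""
    n = len(features)
    return [[relevance[i] * cosine(features[i], features[j]) * relevance[j]
             for j in range(n)] for i in range(n)]


def greedy_diverse_select(kernel: List[List[float]], k: int) -> List[int]:
    """Greedily add the frame that most increases det of the selected submatrix."""
    selected: List[int] = []
    candidates = set(range(len(kernel)))
    while len(selected) < k and candidates:
        best, best_det = None, 0.0
        for i in sorted(candidates):
            subset = selected + [i]
            det = determinant([[kernel[a][b] for b in subset] for a in subset])
            if det > best_det:
                best, best_det = i, det
        if best is None:  # every remaining frame is redundant
            break
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # return indices in temporal order


# Toy view: frames 0 and 1 are near-duplicates, frame 3 has low relevance.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
relevance = [0.9, 0.85, 0.8, 0.3]
print(greedy_diverse_select(build_kernel(features, relevance), 2))  # [0, 2]
```

The determinant rewards frames that are both relevant and dissimilar from those already chosen, which is why the near-duplicate frame 1 loses to frame 2 despite its higher relevance.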

If this is right

  • Ego and exo views supply complementary memory cues that improve reasoning when both are available.
  • Existing multimodal large language models remain far from solving cross-view memory tasks.
  • Training-free selection methods outperform standard frame-selection and RAG-based memory approaches.
  • Question framing and answer grounding exhibit systematic view-preference conflicts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that explicitly learn to resolve view-preference conflicts could close more of the performance gap than selection alone.
  • The benchmark structure could be reused to test memory reasoning in longer, unscripted video streams from wearable and overhead cameras.
  • Integration of dual-view selection into embodied agents might reduce errors in spatial tasks such as object relocation or route planning.

Load-bearing premise

The 2.6K multiple-choice questions are high-quality and representative of real cross-view memory reasoning demands.

What would settle it

A direct comparison of model accuracy on EgoExoMem against accuracy on a new set of cross-view questions derived from real robotic navigation logs using the same synchronized video sources.

Figures

Figures reproduced from arXiv: 2605.18734 by Chengzhi Wu, Di Wen, Jiaming Zhang, Junwei Zheng, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen, Ruiping Liu, Shaofang Quan, Yufan Chen.

Figure 1: EgoExoMem requires reasoning over synchronized ego-exo memory streams to answer ...
Figure 2: Illustrative examples of the eight QA types (Q1–Q8) in EgoExoMem, covering object ...
Figure 3: The benchmark construction pipeline: MCQs are first generated, then human-edited and filtered for accuracy, and finally subjected to a text-only check to ensure vision dependency.
Figure 4: Dataset statistics of EgoExoMem. (a) Video length distribution for LEMMA and EgoExo4D ...
Figure 5: Failure case analysis. (a) Question-aware view dependency measured by CLIP similarity ...
Figure 6: Verification tool for human annotator editing and filtering.
Figure 7: Caption generation used for retrieval in RAG-based methods.
Figure 8: Evaluation template.

Original abstract

Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos, containing 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. It proposes E²-Select, a training-free frame selection method combining relevance-based budget allocation with per-view k-DPP sampling to address view asymmetry and cross-view temporal consistency. Experiments show that existing MLLMs reach at most 55.3% accuracy while E²-Select achieves 58.2% over frame-selection and RAG baselines, with further analysis of view-preference conflicts.

Significance. If the benchmark questions genuinely require cross-view integration, this work would provide a valuable new resource for evaluating and improving multimodal models on complementary ego-exo memory cues, an area relevant to embodied AI. The training-free design of E²-Select and the reproducible performance numbers are strengths that support broader adoption.

major comments (2)
  1. [Benchmark construction (§3/§4)] Benchmark construction section (likely §3 or §4): The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, leaving the complementarity conclusion and the 55.3%/58.2% gap on an unverified assumption.
  2. [Method (§4)] E²-Select description (likely §4): While the method is presented as training-free, the relevance-based budget allocation step requires explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark; if these scores implicitly depend on benchmark-specific heuristics, the 'parameter-free' characterization needs clarification to avoid circularity with the evaluation.
minor comments (2)
  1. [Abstract] Abstract: Use consistent mathematical formatting (e.g., 2.6K vs $2.6K$) and ensure all acronyms (MLLM, k-DPP) are defined on first use.
  2. [Figures/Tables] Figure captions and tables: Add explicit legends distinguishing ego-only, exo-only, and dual-view conditions to improve readability of the complementarity analysis.
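
Major comment 1 asks for inter-annotator agreement on view necessity. One standard statistic for two annotators is Cohen's kappa; the sketch below is an illustration only, with a hypothetical label set ("ego", "exo", "both") and made-up toy labels rather than data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels: which view(s) each annotator judged necessary per question.
annotator_1 = ["both", "both", "ego", "exo", "both", "ego"]
annotator_2 = ["both", "ego", "ego", "exo", "both", "both"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")  # kappa = 0.455
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example "both") dominates.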

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Benchmark construction (§3/§4)] Benchmark construction section (likely §3 or §4): The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, leaving the complementarity conclusion and the 55.3%/58.2% gap on an unverified assumption.

    Authors: We agree that explicit verification of cross-view complementarity strengthens the central claims. The benchmark was designed with dedicated cross-view QA categories and questions that target integration of complementary cues (e.g., egocentric action details paired with exocentric spatial layout), supported by qualitative examples in the paper. However, we did not report single-view human accuracy or explicit filtering statistics. In the revision we will add single-view human evaluation on a representative subset of questions, together with inter-annotator agreement on view necessity, to provide direct empirical support for the complementarity assumption. revision: yes

  2. Referee: [Method (§4)] E²-Select description (likely §4): While the method is presented as training-free, the relevance-based budget allocation step requires explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark; if these scores implicitly depend on benchmark-specific heuristics, the 'parameter-free' characterization needs clarification to avoid circularity with the evaluation.

    Authors: The relevance scores are obtained by embedding the query with a fixed, off-the-shelf vision-language model (CLIP) and computing cosine similarity against frame embeddings from each view independently. No training, fine-tuning, or benchmark-specific heuristics are involved; the same general-purpose model is used for all queries. We will revise the method section to state this procedure explicitly and to clarify that E²-Select remains training-free with no parameters tuned on EgoExoMem, thereby removing any potential ambiguity regarding circularity. revision: yes
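
The procedure described in this response can be sketched as follows. Plain Python lists stand in for CLIP embeddings, and the allocation rule (budget share proportional to each view's mean relevance, with a floor of one frame per view) is an assumption made for illustration; the response does not state the exact rule. The names `frame_relevance` and `allocate_budget` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def frame_relevance(query: Sequence[float],
                    frames: Sequence[Sequence[float]]) -> List[float]:
    """Score every frame of one view against the query embedding."""
    return [cosine(query, frame) for frame in frames]


def allocate_budget(ego_scores: Sequence[float], exo_scores: Sequence[float],
                    total_budget: int, min_per_view: int = 1) -> Tuple[int, int]:
    """Split a frame budget across views in proportion to mean relevance.

    Assumed rule, not taken from the paper.
    """
    if not ego_scores or not exo_scores:
        raise ValueError("both views need at least one scored frame")
    if total_budget < 2 * min_per_view:
        raise ValueError("budget too small for the per-view minimum")
    ego_mass = max(sum(ego_scores) / len(ego_scores), 0.0)
    exo_mass = max(sum(exo_scores) / len(exo_scores), 0.0)
    total = ego_mass + exo_mass
    ego_share = 0.5 if total == 0 else ego_mass / total
    ego_budget = round(ego_share * total_budget)
    # Keep at least min_per_view frames for each view.
    ego_budget = min(max(ego_budget, min_per_view), total_budget - min_per_view)
    return ego_budget, total_budget - ego_budget


# Toy embeddings: the query aligns with the ego frames more than the exo frames.
query = [1.0, 0.0]
ego_frames = [[1.0, 0.0], [0.8, 0.6]]
exo_frames = [[0.0, 1.0], [0.6, 0.8]]
ego_scores = frame_relevance(query, ego_frames)
exo_scores = frame_relevance(query, exo_frames)
print(allocate_budget(ego_scores, exo_scores, total_budget=8))  # (6, 2)
```

Because the scores come only from the query and the frames, nothing in this sketch is fit to benchmark answers, which is the property the response emphasizes.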

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper introduces the EgoExoMem benchmark and the training-free E²-Select method. The method relies on explicit algorithmic steps (relevance-based budget allocation plus per-view k-DPP sampling) that operate on input video features without fitting parameters to the target MCQ answers or reducing any claimed result to a self-definition. No equations or sections equate a prediction to its own fitted input, invoke load-bearing self-citations for uniqueness, or rename prior empirical patterns as new derivations. Experimental claims inherit benchmark validity risks but do not exhibit circular reduction in the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the quality and coverage of the newly created MCQ set and on the effectiveness of the described frame-selection procedure; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5765 in / 1123 out tokens · 41237 ms · 2026-05-20T11:14:25.824576+00:00 · methodology



Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 6 internal anchors

  1. [1]

    Ring home security systems.https://ring.com, 2024

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Glance and focus: Memory prompting for multi-event video question answering

    Ziyi Bai, Ruiping Wang, and Xilin Chen. Glance and focus: Memory prompting for multi-event video question answering. InNeurIPS, 2023

  5. [5]

    Where did I leave my keys?—Episodic-memory-based question answering on egocentric videos

    Leonard Bärmann and Alex Waibel. Where did I leave my keys?—Episodic-memory-based question answering on egocentric videos. InCVPRW, 2022

  6. [6]

    EPFL-Smart- Kitchen: An ego-exo multi-modal dataset for challenging action and motion understanding in video-language models

    Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, and Alexander Mathis. EPFL-Smart- Kitchen: An ego-exo multi-modal dataset for challenging action and motion understanding in video-language models. InNeurIPS, 2025

  7. [7]

    Spatial memory: how egocentric and allocentric combine.Trends in Cognitive Sciences, 2006

    Neil Burgess. Spatial memory: how egocentric and allocentric combine.Trends in Cognitive Sciences, 2006

  8. [8]

    SA VVY: Spatial awareness via audio-visual LLMs through seeing and hearing

    Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, and Eli Shlizerman. SA VVY: Spatial awareness via audio-visual LLMs through seeing and hearing. InNeurIPS, 2025

  9. [9]

    MuRAL: A multi- resident ambient sensor dataset annotated with natural language for activities of daily living

    Xi Chen, Julien Cumin, Fano Ramparany, and Dominique Vaufreydaz. MuRAL: A multi- resident ambient sensor dataset annotated with natural language for activities of daily living. InICIE, 2026

  10. [10]

    EgoPlan-Bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 2026

    Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. EgoPlan-Bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 2026

  11. [11]

    (2.5+ 1) D spatio-temporal scene graphs for video question answering

    Anoop Cherian, Chiori Hori, Tim K Marks, and Jonathan Le Roux. (2.5+ 1) D spatio-temporal scene graphs for video question answering. InAAAI, 2022

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  13. [13]

    ECBench: Can multi-modal foundation models understand the egocentric world? A holistic embodied cognition benchmark

    Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, and Lidong Bing. ECBench: Can multi-modal foundation models understand the egocentric world? A holistic embodied cognition benchmark. InCVPR, 2025

  14. [14]

    Episodic memory question answering

    Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, and Devi Parikh. Episodic memory question answering. InCVPR, 2022

  15. [15]

    Look and tell: A dataset for multimodal grounding across egocentric and exocentric views

    Anna Deichler and Jonas Beskow. Look and tell: A dataset for multimodal grounding across egocentric and exocentric views. InNeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, 2025

  16. [16]

    Exact sampling of determinantal point processes with sublinear time preprocessing

    Michal Derezinski, Daniele Calandriello, and Michal Valko. Exact sampling of determinantal point processes with sublinear time preprocessing. InNeurIPS, 2019

  17. [17]

    Mica R. Endsley. Toward a theory of situation awareness in dynamic systems.Human Factors: The Journal of the Human Factors and Ergonomics Society, 1995. 11

  18. [18]

    PRVQL: Progressive knowledge-guided refinement for robust egocentric visual query localization

    Bing Fan, Yunhe Feng, Yapeng Tian, James Chenhao Liang, Yuewei Lin, Yan Huang, and Heng Fan. PRVQL: Progressive knowledge-guided refinement for robust egocentric visual query localization. InICCV, 2025

  19. [19]

    Embodied VideoAgent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding

    Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, and Qing Li. Embodied VideoAgent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding. InICCV, 2025

  20. [20]

    Object-shot enhanced grounding network for egocentric video

    Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, and Liqiang Nie. Object-shot enhanced grounding network for egocentric video. InCVPR, 2025

  21. [21]

    Video-mme: The first-ever comprehen- sive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehen- sive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025

  22. [22]

    ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives

    Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, and Luc Van Gool. ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives. InICCV, 2025

  23. [23]

    Continuous patient monitoring with AI: Real-time analysis of video in hospital care settings

    Paolo Gabriel, Peter Rehani, Tyler Troy, Tiffany Wyatt, Michael Choma, and Narinder Singh. Continuous patient monitoring with AI: Real-time analysis of video in hospital care settings. Frontiers in Imaging, 2025

  24. [24]

    Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel J

    Muhammad Fadhil Ginting, Dong-Ki Kim, Xiangyun Meng, Andrzej Reinke, Bandi Jai Krishna, Navid Kayhani, Oriana Peltzer, David D. Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel J. Kochenderfer, Ali-akbar Agha-mohammadi, and Shayegan Omidshafiei. Enter the mind palace: Reasoning and planning for long-term active embodied question answering. In CoRL, 2026

  25. [25]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

  26. [26]

    Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024

  27. [27]

    Bridging perspectives: A survey on cross-view collaborative intelligence with egocentric- exocentric vision.International Journal on Computer Vision, 2026

    Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, and Yoichi Sato. Bridging perspectives: A survey on cross-view collaborative intelligence with egocentric- exocentric vision.International Journal on Computer Vision, 2026

  28. [28]

    EgoExoBench: A benchmark for first-and third-person view video understanding in MLLMs

    Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. EgoExoBench: A benchmark for first-and third-person view video understanding in MLLMs. InNeurIPS, 2025

  29. [29]

    Cascaded dynamic memory refinement and semantic alignment for exo-to-ego cross-view video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Weipeng Hu, Jiun Tian Hoe, Jianhui Li, Haifeng Hu, Xudong Jiang, and Yap-Peng Tan. Cascaded dynamic memory refinement and semantic alignment for exo-to-ego cross-view video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  30. [30]

    Robust ego-exo correspondence with long-term memory

    Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, and Libo Zhang. Robust ego-exo correspondence with long-term memory. InNeurIPS, 2025

  31. [31]

    Sound bridge: Associating egocentric and exocentric videos via audio cues

    Sihong Huang, Jiaxin Wu, Xiaoyong Wei, Yi Cai, Dongmei Jiang, and Yaowei Wang. Sound bridge: Associating egocentric and exocentric videos via audio cues. InCVPR, 2025

  32. [32]

    EgoExoLearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world

    Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, and Qiao Yu. EgoExoLearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. InCVPR, 2024

  33. [33]

    VideoRAG: Retrieval- augmented generation over video corpus

    Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval- augmented generation over video corpus. InACL (Findings), 2025. 12

  34. [34]

    LEMMA: A multi-view dataset for learning multi-agent multi-task activities

    Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. LEMMA: A multi-view dataset for learning multi-agent multi-task activities. InECCV, 2020

  35. [35]

    EgoTaskQA: Understanding human tasks in egocentric videos

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. EgoTaskQA: Understanding human tasks in egocentric videos. InNeurIPS, 2022

  36. [36]

    Rehg, Vamsi Krishna Ithapu, and Ruohan Gao

    Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, and Ruohan Gao. The audio-visual conversational graph: From an egocentric- exocentric perspective. InCVPR, 2024

  37. [37]

    Single-stage visual query localization in egocentric videos

    Hanwen Jiang, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Single-stage visual query localization in egocentric videos. InNeurIPS, 2023

  38. [38]

    Is ‘right’ right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning

    Ji Hyeok Jung, Eun Tae Kim, Seoyeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is ‘right’ right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. InCVPR, 2025

  39. [39]

    EgoExo-Con: Exploring view-invariant video temporal understanding.arXiv preprint arXiv:2510.26113, 2025

    Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, and Angela Yao. EgoExo-Con: Exploring view-invariant video temporal understanding.arXiv preprint arXiv:2510.26113, 2025

  40. [40]

    Video captioning based on both egocentric and exocentric views of robot vision for human-robot interaction.International Journal of Social Robotics, 2023

    Soo-Han Kang and Ji-Hyeong Han. Video captioning based on both egocentric and exocentric views of robot vision for human-robot interaction.International Journal of Social Robotics, 2023

  41. [41]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

  42. [42]

    MA-EgoQA: Question answering over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

    Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, and Sung Ju Hwang. MA-EgoQA: Question answering over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

  43. [43]

    k-DPPs: Fixed-size determinantal point processes

    Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. InICML, 2011

  44. [44]

    EgoVITA: Learning to plan and verify for egocentric video reasoning.arXiv preprint arXiv:2511.18242, 2025

    Yogesh Kulkarni and Pooyan Fazli. EgoVITA: Learning to plan and verify for egocentric video reasoning.arXiv preprint arXiv:2511.18242, 2025

  45. [45]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaV A-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  46. [46]

    Do MLLMs understand pointing? Benchmarking and enhancing referential reasoning in egocentric vision

    Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng, and Jie Zhou. Do MLLMs understand pointing? Benchmarking and enhancing referential reasoning in egocentric vision. InACL, 2026

  47. [47]

    Learning situated awareness in the real world

    Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, and Xin Eric Wang. Learning situated awareness in the real world. InICML, 2026

  48. [48]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaV A-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

  49. [49]

    Col- laborated with hallucination: Enhancing egocentric grounded question answering via error demonstrations.IEEE Transactions on Image Processing, 2026

    Shenshen Li, Xing Xu, Fumin Shen, Zhe Sun, Andrzej Cichocki, and Heng Tao Shen. Col- laborated with hallucination: Enhancing egocentric grounded question answering via error demonstrations.IEEE Transactions on Image Processing, 2026

  50. [50]

    SA V A-X: Ego-to-exo imitation error detection via scene-adaptive view alignment and bidirectional cross view fusion

    Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, and Hongliang Li. SA V A-X: Ego-to-exo imitation error detection via scene-adaptive view alignment and bidirectional cross view fusion. InCVPR, 2026

  51. [51]

    EgoCross: Benchmarking multimodal large language models for cross- domain egocentric video question answering

    Yanjun Li, Yuqian Fu, Tianwen Qian, Qi’Ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, and Xiaoling Wang. EgoCross: Benchmarking multimodal large language models for cross- domain egocentric video question answering. InAAAI, 2026. 13

  52. [52]

    Fine-grained spatiotem- poral grounding on egocentric videos

    Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, and Liwei Wang. Fine-grained spatiotem- poral grounding on egocentric videos. InICCV, 2025

  53. [53]

    Objectfinder: An open-vocabulary assistive system for interactive object search by blind people

    Ruiping Liu, Jiaming Zhang, Angela Schön, Karin Müller, Junwei Zheng, Kailun Yang, Anhong Guo, Kathrin Gerling, and Rainer Stiefelhagen. ObjectFinder: An open-vocabulary assistive system for interactive object search by blind people.arXiv preprint arXiv:2412.03118, 2024

  54. [54]

    BOLT: Boost large vision- language model without training for long-form video understanding

    Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. BOLT: Boost large vision- language model without training for long-form video understanding. InCVPR, 2025

  55. [55]

    Aligning cyber space with physical world: A comprehensive survey on embodied AI

    Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied AI. IEEE/ASME Transactions on Mechatronics, 2025

  56. [56]

    From screens to scenes: A survey of embodied ai in healthcare.Information Fusion, 119:103033, 2025

    Yihao Liu, Xu Cao, Tingting Chen, Yankai Jiang, Junjie You, Minghua Wu, Xiaosong Wang, Mengling Feng, Yaochu Jin, and Jintai Chen. From screens to scenes: A survey of embodied ai in healthcare.Information Fusion, 119:103033, 2025

  57. [57]

    Tao Lu, Qian Zhu, Tiffany Ma, Wong Kam-Kwai, Anlan Xie, Alex Endert, and Yalong Yang. Ego vs. exo and active vs. passive: Investigating the individual and combined effects of viewpoint and navigation on spatial immersion and understanding in immersive storytelling. InCHI, 2025

  58. [58]

    OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data

    Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, and Zongqing Lu. OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data. InNeurIPS, 2025

  59. [59]

    Grounded affordance from exocentric view.International Journal of Computer Vision, 2024

    Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Grounded affordance from exocentric view.International Journal of Computer Vision, 2024

  60. [60]

    Put myself in your shoes: Lifting the egocentric perspective from exocentric videos

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. InECCV, 2024

  61. [61]

    Video-RAG: Visually-aligned retrieval-augmented long video comprehension

    Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. InNeurIPS, 2025

  62. [62]

    Exo2EgoSyn: Unlocking foundation video generation models for exocentric-to-egocentric video synthesis

    Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, and Luc Van Gool. Exo2EgoSyn: Unlocking foundation video generation models for exocentric-to-egocentric video synthesis. arXiv preprint arXiv:2511.20186, 2025

  63. [63]

    OpenEQA: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul McVay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent-Pierre Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Alexander Sax, and...

  64. [64]

    EgoSchema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023

  65. [65]

    O-MaMa: Learning object mask matching between egocentric and exocentric views

    Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, and Jose J. Guerrero. O-MaMa: Learning object mask matching between egocentric and exocentric views. In ICCV, 2025

  66. [66]

    Point of view in personal memories

    Georgia Nigro and Ulric Neisser. Point of view in personal memories. Cognitive Psychology, 1983

  67. [67]

    Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos

    Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, and Yoichi Sato. Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025

  68. [68]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o, May 2024. Accessed: 2026-05-05

  69. [69]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4, March 2026. Accessed: 2026-05-05

  70. [70]

    V2-SAM: Marrying SAM2 with multi-prompt experts for cross-view object correspondence

    Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, and Yuqian Fu. V2-SAM: Marrying SAM2 with multi-prompt experts for cross-view object correspondence. In CVPR, 2026

  71. [71]

    Bootstrap your own views: Masked ego-exo modeling for fine-grained view-invariant video representations

    Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Bootstrap your own views: Masked ego-exo modeling for fine-grained view-invariant video representations. In CVPR, 2025

  72. [72]

    EgoWorld: Translating exocentric view to egocentric view using rich exocentric observations

    Junho Park, Andrew Sangwoo Ye, and Taein Kwon. EgoWorld: Translating exocentric view to egocentric view using rich exocentric observations. In ICLR, 2026

  73. [73]

    EgoThinker: Unveiling egocentric reasoning with spatio-temporal CoT

    Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Jiangmiao Pang, and Yu Qiao. EgoThinker: Unveiling egocentric reasoning with spatio-temporal CoT. In NeurIPS, 2025

  74. [74]

    In the eye of MLLM: Benchmarking egocentric video intent understanding with gaze-guided prompting

    Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of MLLM: Benchmarking egocentric video intent understanding with gaze-guided prompting. In NeurIPS, 2025

  75. [75]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  76. [76]

    Ego-EXTRA: video-language egocentric dataset for expert-trainee assistance

    Francesco Ragusa, Michele Mazzamuto, Rosario Forte, Irene D’Ambra, James Fort, Jakob Engel, Antonino Furnari, and Giovanni Maria Farinella. Ego-EXTRA: video-language egocentric dataset for expert-trainee assistance. In WACV, 2026

  77. [77]

    Out of sight, not out of context? Egocentric spatial reasoning in VLMs across disjoint frames

    Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D. Wilson, and Balasaravanan Thoravi Kumaravel. Out of sight, not out of context? Egocentric spatial reasoning in VLMs across disjoint frames. In EMNLP, 2025

  78. [78]

    From my view to yours: Ego-to-exo transfer in VLMs for understanding activities of daily living

    Dominick Reilly, Manish Kumar Govind, Le Xue, and Srijan Das. From my view to yours: Ego-to-exo transfer in VLMs for understanding activities of daily living. arXiv preprint arXiv:2501.05711, 2025

  79. [79]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Information Retrieval, 2009

  80. [80]

    EASG-Bench: Video Q&A benchmark with egocentric action scene graphs

    Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, and Giovanni Maria Farinella. EASG-Bench: Video Q&A benchmark with egocentric action scene graphs. In ICCVW, 2025

Showing first 80 references.