EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Chengzhi Wu; Di Wen; Jiaming Zhang; Junwei Zheng; Kailun Yang; Kunyu Peng; Rainer Stiefelhagen; Ruiping Liu; Shaofang Quan; Yufan Chen

arxiv: 2605.18734 · v1 · pith:IPG6XWDGnew · submitted 2026-05-18 · 💻 cs.CV


Pith reviewed 2026-05-20 11:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords: egocentric video, exocentric video, memory reasoning, multimodal large language models, frame selection, cross-view reasoning, benchmark


The pith

Synchronized egocentric and exocentric videos supply complementary memory cues that current multimodal models have not yet fully exploited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoExoMem as the first benchmark for cross-view memory reasoning using synchronized egocentric and exocentric videos, containing 2.6K multiple-choice questions (MCQs) across eight temporal, spatial, and cross-view types. It demonstrates that existing multimodal large language models (MLLMs) reach only 55.3 percent accuracy at best, while a new training-free frame selection approach called E²-Select improves this to 58.2 percent. E²-Select allocates frame budgets by relevance and samples each view with a k-DPP (fixed-size determinantal point process) to respect view asymmetry and cross-view temporal consistency. A sympathetic reader would care because embodied intelligence often relies on memory that single-view egocentric footage cannot fully support, and the results show both the value of dual perspectives and the remaining gap in model capabilities. The work further identifies systematic conflicts between the view a question's framing favors and the view in which its answer is grounded.

Core claim

The paper claims that egocentric and exocentric views provide complementary cues for spatial-temporal memory reasoning. It supports this with the EgoExoMem benchmark of 2.6K high-quality MCQs, on which the best existing MLLM reaches 55.3 percent, and with the proposed E²-Select method, which reaches 58.2 percent and outperforms frame-selection and RAG-based memory baselines.
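To put the 55.3 versus 58.2 percent gap in perspective, the sketch below estimates its sampling uncertainty. It is a rough illustration only: it assumes "2.6K" means exactly 2600 questions and treats the two accuracies as independent binomial estimates, neither of which the source states. Because both systems are presumably scored on the same questions, a paired test on per-question outcomes would be the more appropriate check if those outcomes were available.

```python
import math

def accuracy_standard_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate over n questions."""
    return math.sqrt(acc * (1.0 - acc) / n)

N_QUESTIONS = 2600   # assumption: "2.6K" taken as exactly 2600
best_mllm = 0.553    # best existing MLLM, as reported in the source
e2_select = 0.582    # E2-Select, as reported in the source

gap = e2_select - best_mllm
se_best = accuracy_standard_error(best_mllm, N_QUESTIONS)
se_e2 = accuracy_standard_error(e2_select, N_QUESTIONS)
# Assumption: independent samples. This ignores the pairing of the two
# evaluations on the same questions, so it is only a coarse yardstick.
se_gap = math.sqrt(se_best ** 2 + se_e2 ** 2)

print(f"gap: {gap * 100:.1f} points (about {round(gap * N_QUESTIONS)} questions)")
print(f"standard error of the gap: {se_gap * 100:.2f} points")
```

Under these assumptions the gap corresponds to a few dozen questions and is on the order of two standard errors, which is why the per-question-type breakdown matters when judging the improvement.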

What carries the argument

E²-Select, a training-free frame selection method that combines relevance-based budget allocation with per-view k-DPP sampling to manage view asymmetry and cross-view temporal consistency in synchronized ego-exo videos.
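The two ingredients named above can be illustrated with a minimal sketch. This is not the authors' implementation: the proportional budget rule, the kernel `L = diag(q) S diag(q)`, and the function names `allocate_budget` and `greedy_dpp` are all assumptions made for illustration, and a greedy log-determinant maximization is swapped in for true k-DPP sampling. Relevance scores are assumed to be positive, and the sketch does not model cross-view temporal consistency.

```python
import numpy as np

def allocate_budget(rel_ego, rel_exo, total):
    """Split a frame budget between views in proportion to mean relevance.

    Hypothetical rule; assumes total >= 2 and positive relevance scores.
    """
    m_ego, m_exo = float(np.mean(rel_ego)), float(np.mean(rel_exo))
    k_ego = int(round(total * m_ego / (m_ego + m_exo)))
    k_ego = min(max(k_ego, 1), total - 1)  # keep at least one frame per view
    return k_ego, total - k_ego

def greedy_dpp(features, relevance, k):
    """Greedy MAP stand-in for k-DPP sampling within one view.

    Picks k frames that are both relevant and mutually diverse by greedily
    maximizing the log-determinant of the selected kernel submatrix.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                               # cosine similarity (PSD)
    kernel = relevance[:, None] * sim * relevance[None, :]
    kernel += 1e-6 * np.eye(len(kernel))                # jitter for stability
    selected = []
    for _ in range(min(k, len(kernel))):
        best, best_logdet = None, -np.inf
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return sorted(selected)                             # restore temporal order

# Placeholder data: random vectors stand in for real frame features.
rng = np.random.default_rng(0)
ego_feats, exo_feats = rng.normal(size=(40, 16)), rng.normal(size=(40, 16))
ego_rel, exo_rel = rng.uniform(0.1, 1.0, 40), rng.uniform(0.1, 1.0, 40)

k_ego, k_exo = allocate_budget(ego_rel, exo_rel, total=8)
ego_idx = greedy_dpp(ego_feats, ego_rel, k_ego)
exo_idx = greedy_dpp(exo_feats, exo_rel, k_exo)
print(k_ego, k_exo, ego_idx, exo_idx)
```

The design point the sketch tries to convey is the separation of concerns: the budget split decides how much each view contributes, while the determinantal selection within each view trades relevance against redundancy.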

If this is right

Ego and exo views supply complementary memory cues that improve reasoning when both are available.
Existing multimodal large language models remain far from solving cross-view memory tasks.
The training-free E²-Select outperforms the frame-selection and RAG-based memory baselines it is compared against.
Question framing and answer grounding exhibit systematic view-preference conflicts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that explicitly learn to resolve view-preference conflicts could close more of the performance gap than selection alone.
The benchmark structure could be reused to test memory reasoning in longer, unscripted video streams from wearable and overhead cameras.
Integration of dual-view selection into embodied agents might reduce errors in spatial tasks such as object relocation or route planning.

Load-bearing premise

The 2.6K multiple-choice questions are high-quality and representative of real cross-view memory reasoning demands.

What would settle it

A direct comparison of model accuracy on EgoExoMem against accuracy on a new set of cross-view questions derived from real robotic navigation logs using the same synchronized video sources.

Figures

Figures reproduced from arXiv: 2605.18734 by Chengzhi Wu, Di Wen, Jiaming Zhang, Junwei Zheng, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen, Ruiping Liu, Shaofang Quan, Yufan Chen.

**Figure 2.** Illustrative examples of the eight QA types (Q1–Q8) in EgoExoMem, covering object …

**Figure 3.** Benchmark construction pipeline: MCQs are first generated, then human-edited and filtered for accuracy, and finally subjected to a text-only check to ensure vision dependency.

**Figure 4.** Dataset statistics of EgoExoMem. (a) Video length distribution for LEMMA and EgoExo4D …

**Figure 5.** Failure case analysis. (a) Question-aware view dependency measured by CLIP similarity …

**Figure 6.** Verification tool for human annotator editing and filtering.

**Figure 7.** Caption generation used for retrieval in RAG-based methods.

**Figure 8.** Evaluation template.

Original abstract

Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos, containing 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. It proposes E²-Select, a training-free frame selection method combining relevance-based budget allocation with per-view k-DPP sampling to address view asymmetry and cross-view temporal consistency. Experiments show that existing MLLMs reach at most 55.3% accuracy while E²-Select achieves 58.2% over frame-selection and RAG baselines, with further analysis of view-preference conflicts.

Significance. If the benchmark questions genuinely require cross-view integration, this work would provide a valuable new resource for evaluating and improving multimodal models on complementary ego-exo memory cues, an area relevant to embodied AI. The training-free design of E²-Select and the reproducible performance numbers are strengths that support broader adoption.

major comments (2)

[Benchmark construction (likely §3 or §4)] The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, so the complementarity conclusion and the 55.3%/58.2% gap rest on an unverified assumption.
[Method (likely §4)] While the method is presented as training-free, the relevance-based budget allocation step requires an explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark. If these scores implicitly depend on benchmark-specific heuristics, the 'training-free' characterization needs clarification to avoid circularity with the evaluation.
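The inter-annotator agreement requested in the first major comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below is illustrative only: the label set (`ego`, `exo`, `both`) and the two annotators' judgments are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical "which view is needed to answer?" judgments for 8 questions.
annotator_1 = ["both", "both", "ego", "exo", "both", "ego", "both", "exo"]
annotator_2 = ["both", "ego", "ego", "exo", "both", "ego", "both", "both"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.60"
```

Reporting kappa alongside the share of questions labelled `both` would speak directly to whether the benchmark requires cross-view integration.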

minor comments (2)

[Abstract] Use consistent mathematical formatting (e.g., 2.6K vs $2.6K$) and ensure all acronyms (MLLM, k-DPP) are defined on first use.
[Figures/Tables] Add explicit legends distinguishing ego-only, exo-only, and dual-view conditions to improve readability of the complementarity analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate.


Referee: [Benchmark construction (likely §3 or §4)] The central claims that 'ego and exo views provide complementary memory cues' and that 'existing MLLMs remain far from solving the benchmark' presuppose that the 2.6K MCQs cannot be solved from a single view. No single-view human accuracy, inter-annotator agreement on view necessity, or filtering steps that discard single-view-solvable questions are reported, so the complementarity conclusion and the 55.3%/58.2% gap rest on an unverified assumption.

Authors: We agree that explicit verification of cross-view complementarity strengthens the central claims. The benchmark was designed with dedicated cross-view QA categories and questions that target integration of complementary cues (e.g., egocentric action details paired with exocentric spatial layout), supported by qualitative examples in the paper. However, we did not report single-view human accuracy or explicit filtering statistics. In the revision we will add single-view human evaluation on a representative subset of questions together with inter-annotator agreement on view necessity, thereby providing direct empirical support for the complementarity assumption. revision: yes

Referee: [Method (likely §4)] While the method is presented as training-free, the relevance-based budget allocation step requires an explicit definition of how per-view relevance scores are obtained from the query without reference to the target benchmark. If these scores implicitly depend on benchmark-specific heuristics, the 'training-free' characterization needs clarification to avoid circularity with the evaluation.

Authors: The relevance scores are obtained by embedding the query with a fixed, off-the-shelf vision-language model (CLIP) and computing cosine similarity against frame embeddings from each view independently. No training, fine-tuning, or benchmark-specific heuristics are involved; the same general-purpose model is used for all queries. We will revise the method section to state this procedure explicitly and to clarify that E²-Select remains training-free with no parameters tuned on EgoExoMem, thereby removing any potential ambiguity regarding circularity. revision: yes
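The scoring step described in this simulated response amounts to cosine similarity between one query embedding and the frame embeddings of each view. The sketch below shows only that arithmetic. It assumes the embeddings have already been computed by some fixed encoder; the random vectors, the 512-dimensional size, and the function name `cosine_relevance` are placeholders rather than details from the paper.

```python
import numpy as np

def cosine_relevance(query_emb, frame_embs):
    """Cosine similarity between one query vector and each frame vector."""
    q = query_emb / np.linalg.norm(query_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return f @ q  # shape: (num_frames,)

# Placeholder embeddings standing in for precomputed text/image features.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
ego_frames = rng.normal(size=(30, 512))
exo_frames = rng.normal(size=(30, 512))

# Each view is scored independently against the same query embedding.
ego_scores = cosine_relevance(query, ego_frames)
exo_scores = cosine_relevance(query, exo_frames)

print("most relevant ego frame:", int(np.argmax(ego_scores)))
print("most relevant exo frame:", int(np.argmax(exo_scores)))
```

Nothing in this step is fitted to benchmark answers, which is the property the response relies on when it calls the method training-free.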

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper introduces the EgoExoMem benchmark and the E²-Select method. E²-Select is training-free and relies on explicit algorithmic steps (relevance-based budget allocation plus per-view k-DPP sampling) that operate on input video features without fitting parameters to the target MCQ answers or reducing any claimed result to a self-definition. No equations or sections equate a prediction to its own fitted input, invoke load-bearing self-citations for uniqueness, or rename prior empirical patterns as new derivations. The experimental claims inherit benchmark validity risks but do not exhibit circular reduction in the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the quality and coverage of the newly created MCQ set and on the effectiveness of the described frame-selection procedure; no explicit free parameters, axioms, or invented entities are stated in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: EgoExoMem contains 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. ... E²-Select ... relevance-based budget allocation with per-view k-DPP sampling

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Experiments show that ego and exo views provide complementary memory cues

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 6 internal anchors

[1]

Ring home security systems.https://ring.com, 2024

work page 2024
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Glance and focus: Memory prompting for multi-event video question answering

Ziyi Bai, Ruiping Wang, and Xilin Chen. Glance and focus: Memory prompting for multi-event video question answering. InNeurIPS, 2023

work page 2023
[5]

Where did I leave my keys?—Episodic-memory-based question answering on egocentric videos

Leonard Bärmann and Alex Waibel. Where did I leave my keys?—Episodic-memory-based question answering on egocentric videos. InCVPRW, 2022

work page 2022
[6]

EPFL-Smart- Kitchen: An ego-exo multi-modal dataset for challenging action and motion understanding in video-language models

Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, and Alexander Mathis. EPFL-Smart- Kitchen: An ego-exo multi-modal dataset for challenging action and motion understanding in video-language models. InNeurIPS, 2025

work page 2025
[7]

Spatial memory: how egocentric and allocentric combine.Trends in Cognitive Sciences, 2006

Neil Burgess. Spatial memory: how egocentric and allocentric combine.Trends in Cognitive Sciences, 2006

work page 2006
[8]

SA VVY: Spatial awareness via audio-visual LLMs through seeing and hearing

Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, and Eli Shlizerman. SA VVY: Spatial awareness via audio-visual LLMs through seeing and hearing. InNeurIPS, 2025

work page 2025
[9]

MuRAL: A multi- resident ambient sensor dataset annotated with natural language for activities of daily living

Xi Chen, Julien Cumin, Fano Ramparany, and Dominique Vaufreydaz. MuRAL: A multi- resident ambient sensor dataset annotated with natural language for activities of daily living. InICIE, 2026

work page 2026
[10]

EgoPlan-Bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 2026

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. EgoPlan-Bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 2026

work page 2026
[11]

(2.5+ 1) D spatio-temporal scene graphs for video question answering

Anoop Cherian, Chiori Hori, Tim K Marks, and Jonathan Le Roux. (2.5+ 1) D spatio-temporal scene graphs for video question answering. InAAAI, 2022

work page 2022
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

ECBench: Can multi-modal foundation models understand the egocentric world? A holistic embodied cognition benchmark

Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, and Lidong Bing. ECBench: Can multi-modal foundation models understand the egocentric world? A holistic embodied cognition benchmark. InCVPR, 2025

work page 2025
[14]

Episodic memory question answering

Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, and Devi Parikh. Episodic memory question answering. InCVPR, 2022

work page 2022
[15]

Look and tell: A dataset for multimodal grounding across egocentric and exocentric views

Anna Deichler and Jonas Beskow. Look and tell: A dataset for multimodal grounding across egocentric and exocentric views. InNeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, 2025

work page 2025
[16]

Exact sampling of determinantal point processes with sublinear time preprocessing

Michal Derezinski, Daniele Calandriello, and Michal Valko. Exact sampling of determinantal point processes with sublinear time preprocessing. InNeurIPS, 2019

work page 2019
[17]

Mica R. Endsley. Toward a theory of situation awareness in dynamic systems.Human Factors: The Journal of the Human Factors and Ergonomics Society, 1995. 11

work page 1995
[18]

PRVQL: Progressive knowledge-guided refinement for robust egocentric visual query localization

Bing Fan, Yunhe Feng, Yapeng Tian, James Chenhao Liang, Yuewei Lin, Yan Huang, and Heng Fan. PRVQL: Progressive knowledge-guided refinement for robust egocentric visual query localization. InICCV, 2025

work page 2025
[19]

Embodied VideoAgent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding

Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, and Qing Li. Embodied VideoAgent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding. InICCV, 2025

work page 2025
[20]

Object-shot enhanced grounding network for egocentric video

Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, and Liqiang Nie. Object-shot enhanced grounding network for egocentric video. InCVPR, 2025

work page 2025
[21]

Video-mme: The first-ever comprehen- sive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehen- sive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025

work page 2025
[22]

ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives

Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, and Luc Van Gool. ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives. InICCV, 2025

work page 2025
[23]

Continuous patient monitoring with AI: Real-time analysis of video in hospital care settings

Paolo Gabriel, Peter Rehani, Tyler Troy, Tiffany Wyatt, Michael Choma, and Narinder Singh. Continuous patient monitoring with AI: Real-time analysis of video in hospital care settings. Frontiers in Imaging, 2025

work page 2025
[24]

Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel J

Muhammad Fadhil Ginting, Dong-Ki Kim, Xiangyun Meng, Andrzej Reinke, Bandi Jai Krishna, Navid Kayhani, Oriana Peltzer, David D. Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel J. Kochenderfer, Ali-akbar Agha-mohammadi, and Shayegan Omidshafiei. Enter the mind palace: Reasoning and planning for long-term active embodied question answering. In CoRL, 2026

work page 2026
[25]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

work page 2022
[26]

Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024

work page 2024
[27]

Bridging perspectives: A survey on cross-view collaborative intelligence with egocentric- exocentric vision.International Journal on Computer Vision, 2026

Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, and Yoichi Sato. Bridging perspectives: A survey on cross-view collaborative intelligence with egocentric- exocentric vision.International Journal on Computer Vision, 2026

work page 2026
[28]

EgoExoBench: A benchmark for first-and third-person view video understanding in MLLMs

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. EgoExoBench: A benchmark for first-and third-person view video understanding in MLLMs. InNeurIPS, 2025

work page 2025
[29]

Cascaded dynamic memory refinement and semantic alignment for exo-to-ego cross-view video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Weipeng Hu, Jiun Tian Hoe, Jianhui Li, Haifeng Hu, Xudong Jiang, and Yap-Peng Tan. Cascaded dynamic memory refinement and semantic alignment for exo-to-ego cross-view video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[30]

Robust ego-exo correspondence with long-term memory

Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, and Libo Zhang. Robust ego-exo correspondence with long-term memory. InNeurIPS, 2025

work page 2025
[31]

Sound bridge: Associating egocentric and exocentric videos via audio cues

Sihong Huang, Jiaxin Wu, Xiaoyong Wei, Yi Cai, Dongmei Jiang, and Yaowei Wang. Sound bridge: Associating egocentric and exocentric videos via audio cues. InCVPR, 2025

work page 2025
[32]

EgoExoLearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, and Qiao Yu. EgoExoLearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. InCVPR, 2024

work page 2024
[33]

VideoRAG: Retrieval- augmented generation over video corpus

Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval- augmented generation over video corpus. InACL (Findings), 2025. 12

work page 2025
[34]

LEMMA: A multi-view dataset for learning multi-agent multi-task activities

Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-Chun Zhu. LEMMA: A multi-view dataset for learning multi-agent multi-task activities. InECCV, 2020

work page 2020
[35]

EgoTaskQA: Understanding human tasks in egocentric videos

Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. EgoTaskQA: Understanding human tasks in egocentric videos. InNeurIPS, 2022

work page 2022
[36]

Rehg, Vamsi Krishna Ithapu, and Ruohan Gao

Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, and Ruohan Gao. The audio-visual conversational graph: From an egocentric- exocentric perspective. InCVPR, 2024

work page 2024
[37]

Single-stage visual query localization in egocentric videos

Hanwen Jiang, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Single-stage visual query localization in egocentric videos. InNeurIPS, 2023

work page 2023
[38]

Is ‘right’ right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning

Ji Hyeok Jung, Eun Tae Kim, Seoyeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is ‘right’ right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. InCVPR, 2025

work page 2025
[39]

EgoExo-Con: Exploring view-invariant video temporal understanding.arXiv preprint arXiv:2510.26113, 2025

Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, and Angela Yao. EgoExo-Con: Exploring view-invariant video temporal understanding.arXiv preprint arXiv:2510.26113, 2025

work page arXiv 2025
[40]

Video captioning based on both egocentric and exocentric views of robot vision for human-robot interaction.International Journal of Social Robotics, 2023

Soo-Han Kang and Ji-Hyeong Han. Video captioning based on both egocentric and exocentric views of robot vision for human-robot interaction.International Journal of Social Robotics, 2023

work page 2023
[41]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

work page 2020
[42]

MA-EgoQA: Question answering over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, and Sung Ju Hwang. MA-EgoQA: Question answering over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

work page arXiv 2026
[43]

k-DPPs: Fixed-size determinantal point processes

Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. InICML, 2011

work page 2011
[44]

EgoVITA: Learning to plan and verify for egocentric video reasoning.arXiv preprint arXiv:2511.18242, 2025

Yogesh Kulkarni and Pooyan Fazli. EgoVITA: Learning to plan and verify for egocentric video reasoning.arXiv preprint arXiv:2511.18242, 2025

work page arXiv 2025
[45]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaV A-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Do MLLMs understand pointing? Benchmarking and enhancing referential reasoning in egocentric vision

Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng, and Jie Zhou. Do MLLMs understand pointing? Benchmarking and enhancing referential reasoning in egocentric vision. InACL, 2026

work page 2026
[47]

Learning situated awareness in the real world

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, and Xin Eric Wang. Learning situated awareness in the real world. InICML, 2026

work page 2026
[48]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaV A-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Col- laborated with hallucination: Enhancing egocentric grounded question answering via error demonstrations.IEEE Transactions on Image Processing, 2026

Shenshen Li, Xing Xu, Fumin Shen, Zhe Sun, Andrzej Cichocki, and Heng Tao Shen. Col- laborated with hallucination: Enhancing egocentric grounded question answering via error demonstrations.IEEE Transactions on Image Processing, 2026

work page 2026
[50]

SA V A-X: Ego-to-exo imitation error detection via scene-adaptive view alignment and bidirectional cross view fusion

Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, and Hongliang Li. SA V A-X: Ego-to-exo imitation error detection via scene-adaptive view alignment and bidirectional cross view fusion. InCVPR, 2026

work page 2026
[51]

EgoCross: Benchmarking multimodal large language models for cross- domain egocentric video question answering

Yanjun Li, Yuqian Fu, Tianwen Qian, Qi’Ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, and Xiaoling Wang. EgoCross: Benchmarking multimodal large language models for cross- domain egocentric video question answering. InAAAI, 2026. 13

work page 2026
[52]

Fine-grained spatiotem- poral grounding on egocentric videos

Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, and Liwei Wang. Fine-grained spatiotem- poral grounding on egocentric videos. InICCV, 2025

work page 2025
[53]

Objectfinder: An open-vocabulary assistive system for interactive object search by blind people

Ruiping Liu, Jiaming Zhang, Angela Schön, Karin Müller, Junwei Zheng, Kailun Yang, Anhong Guo, Kathrin Gerling, and Rainer Stiefelhagen. ObjectFinder: An open-vocabulary assistive system for interactive object search by blind people.arXiv preprint arXiv:2412.03118, 2024

work page arXiv 2024
[54]

BOLT: Boost large vision- language model without training for long-form video understanding

Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. BOLT: Boost large vision- language model without training for long-form video understanding. InCVPR, 2025

work page 2025
[55]

Aligning cyber space with physical world: A comprehensive survey on embodied AI

Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied AI. IEEE/ASME Transactions on Mechatronics, 2025

work page 2025
[56]

From screens to scenes: A survey of embodied ai in healthcare.Information Fusion, 119:103033, 2025

Yihao Liu, Xu Cao, Tingting Chen, Yankai Jiang, Junjie You, Minghua Wu, Xiaosong Wang, Mengling Feng, Yaochu Jin, and Jintai Chen. From screens to scenes: A survey of embodied ai in healthcare.Information Fusion, 119:103033, 2025

work page 2025
[57]

Tao Lu, Qian Zhu, Tiffany Ma, Wong Kam-Kwai, Anlan Xie, Alex Endert, and Yalong Yang. Ego vs. exo and active vs. passive: Investigating the individual and combined effects of viewpoint and navigation on spatial immersion and understanding in immersive storytelling. InCHI, 2025

work page 2025
[58]

Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, and Zongqing Lu. OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data. In NeurIPS, 2025.
[59]

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Grounded affordance from exocentric view. International Journal of Computer Vision, 2024.
[60]

Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. In ECCV, 2024.
[61]

Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. In NeurIPS, 2025.
[62]

Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, and Luc Van Gool. Exo2EgoSyn: Unlocking foundation video generation models for exocentric-to-egocentric video synthesis. arXiv preprint arXiv:2511.20186, 2025.
[63]

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul McVay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent-Pierre Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Alexander Sax, and... OpenEQA: Embodied question answering in the era of foundation models, 2024.
[64]

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023.
[65]

Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, and Jose J. Guerrero. O-MaMa: Learning object mask matching between egocentric and exocentric views. In ICCV, 2025.
[66]

Georgia Nigro and Ulric Neisser. Point of view in personal memories. Cognitive Psychology, 1983.
[67]

Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, and Yoichi Sato. Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025.
[68]

OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o, May 2024. Accessed: 2026-05-05.
[69]

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4, March 2026. Accessed: 2026-05-05.
[70]

Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, and Yuqian Fu. V2-SAM: Marrying SAM2 with multi-prompt experts for cross-view object correspondence. In CVPR, 2026.
[71]

Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Bootstrap your own views: Masked ego-exo modeling for fine-grained view-invariant video representations. In CVPR, 2025.
[72]

Junho Park, Andrew Sangwoo Ye, and Taein Kwon. EgoWorld: Translating exocentric view to egocentric view using rich exocentric observations. In ICLR, 2026.
[73]

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Jiangmiao Pang, and Yu Qiao. EgoThinker: Unveiling egocentric reasoning with spatio-temporal CoT. In NeurIPS, 2025.
[74]

Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of MLLM: Benchmarking egocentric video intent understanding with gaze-guided prompting. In NeurIPS, 2025.
[75]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[76]

Francesco Ragusa, Michele Mazzamuto, Rosario Forte, Irene D’Ambra, James Fort, Jakob Engel, Antonino Furnari, and Giovanni Maria Farinella. Ego-EXTRA: video-language egocentric dataset for expert-trainee assistance. In WACV, 2026.
[77]

Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D. Wilson, and Balasaravanan Thoravi Kumaravel. Out of sight, not out of context? Egocentric spatial reasoning in VLMs across disjoint frames. In EMNLP, 2025.
[78]

Dominick Reilly, Manish Kumar Govind, Le Xue, and Srijan Das. From my view to yours: Ego-to-exo transfer in VLMs for understanding activities of daily living. arXiv preprint arXiv:2501.05711, 2025.
[79]

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Information Retrieval, 2009.
[80]

Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, and Giovanni Maria Farinella. EASG-Bench: Video Q&A benchmark with egocentric action scene graphs. In ICCVW, 2025.
