
arxiv: 2605.19004 · v1 · pith:4MMYVLOGnew · submitted 2026-05-18 · 💻 cs.CV · cs.LG · cs.RO

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

Pith reviewed 2026-05-20 11:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords egocentric trajectory prediction · multimodal dataset · real-world navigation · head pose tracking · eye gaze · urban environments · wearable sensing · human trajectory forecasting

The pith

EgoTraj introduces 75 real-world sequences of egocentric urban navigation with synchronized head poses, gaze, and scene data to support multimodal trajectory prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EgoTraj, a dataset recorded with Meta Quest Pro headsets that captures 75 sequences of people navigating freely through diverse urban environments. It supplies synchronized RGB video along with continuous 6DoF head poses, per-frame 3D eye gaze vectors, and scene annotations from multiple participants. The collection targets a gap in existing datasets: long-horizon, self-directed egocentric navigation data. Accurate models built on such data would directly aid humanoid robotics, wearable sensing, and assistive navigation tools. To demonstrate the dataset's utility, the authors benchmark current prediction methods and test how gaze, scene, and motion cues each contribute.

Core claim

EgoTraj consists of 75 sequences of human navigation collected from multiple Meta Quest Pro wearers in real-world urban environments, providing synchronized RGB video together with ground-truth continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, it differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity.

What carries the argument

The EgoTraj dataset, which supplies synchronized multimodal signals from real urban navigation to train and evaluate trajectory prediction models.

If this is right

  • Prediction models gain access to combined gaze, scene, and motion cues that ablation studies show each improve performance.
  • The dataset enables direct benchmarking of state-of-the-art egocentric trajectory methods on long-horizon real-world data.
  • Applications in AR perception, navigation assistance, and humanoid robotics obtain a public resource for development and testing.
  • Open release of sequences, code, and the EgoViz Dashboard allows community extension of the multimodal approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The long-horizon nature of the sequences could support training of predictors that operate over longer time windows than most current short-term models.
  • Broad participant diversity may help future systems generalize across different walking styles and body types without additional data collection.
  • Integration of EgoTraj with existing third-person or simulated trajectory datasets could produce hybrid training regimes that combine real egocentric signals with scale.

Load-bearing premise

The ground-truth 6DoF head poses, eye gaze vectors, and scene annotations provided by the Meta Quest Pro are sufficiently accurate and time-synchronized for training and evaluating trajectory prediction models.

What would settle it

A test in which models trained on EgoTraj produce higher error rates than models trained on prior datasets when evaluated on held-out real urban walks, or direct measurements revealing significant timing offsets or pose inaccuracies in the released ground-truth tracks.
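
One way to look for the timing offsets mentioned above is to compare the timestamps of the video frames against the timestamps of the pose samples. The sketch below is a minimal illustration under assumptions that are not confirmed by the paper: it assumes both streams carry timestamps in seconds on a shared clock, and the function name `nearest_gaps` and the list-based layout are hypothetical, not the format of the released data.

```python
from bisect import bisect_left

def nearest_gaps(frame_ts, pose_ts):
    """Return, for each frame timestamp, the absolute gap in seconds
    to the nearest pose timestamp.

    Assumption: pose_ts is sorted ascending and non-empty, and both
    lists use the same clock.
    """
    gaps = []
    for t in frame_ts:
        i = bisect_left(pose_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = pose_ts[max(i - 1, 0): i + 1]
        gaps.append(min(abs(t - p) for p in candidates))
    return gaps

# Hypothetical timestamps, for illustration only.
frames = [0.0, 0.5, 1.0]
poses = [0.0, 0.4, 1.2]
gaps = nearest_gaps(frames, poses)
print(max(gaps))  # largest frame-to-pose gap in this toy example
```

A consistently large or growing gap would suggest the streams are not as tightly synchronized as assumed. A check like this says nothing about pose accuracy, which would still need an external reference.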

Figures

Figures reproduced from arXiv: 2605.19004 by Abduallah Mohamed, Ahmad Yehia, Christian Claudel, Jiseop Byeon, Junfeng Jiao, Kun Qian, Tianyi Wang.

Figure 1: Overview of EgoTraj. (a) Protocol design: the Meta Quest Pro headset records synchronized RGB video, 6DoF head pose, and gaze signals during in-the-wild navigation. (b) Multimodal EgoTraj capture: representative egocentric frames from a crosswalk navigation scenario with multiple social interactions. (c) Analysis and applications: the dataset supports downstream tasks such as assistive navigation for blind…
Figure 2: Dataset statistics of EgoTraj participants. (a) Nationality breakdown of the 75 recruited participants. (b) Age…
Figure 3: Examples of VLM-generated scene annotations. Each tab shows an egocentric frame with the gaze marker…
Figure 4: Gaze-to-pixel calibration examples. The green dot indicates the projected gaze fixation overlaid on egocentric…
Figure 5: A snapshot of the EgoViz Dashboard showing synchronized trajectory, gaze, video, and annotation streams.
Figure 6: Qualitative trajectory forecasting. Egocentric trajectory predictions from multiple baselines using motion data (ego-translation and rotation) on three scenarios from the EgoTraj test split. Dark blue: observed path; green dashed: ground truth; colors: predictions. Left: gentle segment. Center: moderate turn where attention-based models better follow the trajectory. Right: sharp ~90° intersection turn whe…
Figure 7: Multimodal observations across three consecutive timesteps. Each row corresponds to a frame at time t+1, t+2, and t+3 during a sidewalk navigation sequence. From left to right: egocentric RGB frame with gaze fixation (red dot), relative depth estimated by Depth Anything V2, semantic segmentation predicted by OneFormer, nearby pedestrians detected by YOLOv8-Pose ranked by depth proximity, and ground-truth (…
Figure 8: Active-transition windows where turning begins within the final 0.5 s of Tobs. Multimodal transformer-based predictors (CXA-Transformer, EgoCast) track the ground-truth trajectory through the turn, while motion-only baselines (Const-Vel, Lin-Ext, position-only) drift outward along the pre-turn heading, consistent with gaze leading motion by 1–2 s before turns.
Figure 9: Overview of the multimodal recording pipeline used to collect the EgoTraj dataset. A custom Unity application…
Figure 10: Overview of the EgoTraj preprocessing pipeline. Stage 1 selects sessions and scans the dataset structure.…
Figure 11: Area of interest for EgoTraj data collection. Colored lines denote recorded walking routes, and markers…
Figure 12: Examples of EgoTraj scene annotations. Each observation frame shows the gaze marker (red dot) and the…
Figure 13: (a) Overview of the proposed multimodal trajectory forecasting model. Each modality (ego-motion, social…
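
The Figure 8 caption contrasts multimodal predictors with motion-only baselines such as Const-Vel. The sketch below shows one common reading of a constant-velocity baseline, scored with average and final displacement error (ADE/FDE). It is an illustration under assumptions: this page does not give the paper's exact baseline definition or its metrics, and the function names, the 2D point layout, and the toy trajectory are hypothetical.

```python
import math

def constant_velocity_forecast(observed, horizon):
    """Repeat the last observed per-step displacement for `horizon` steps.

    observed: list of (x, y) positions with at least two entries.
    """
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]

def ade_fde(pred, truth):
    """Average and final Euclidean displacement error between two paths."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

# Toy example: the walker turns, but the baseline keeps going straight.
observed = [(0.0, 0.0), (1.0, 0.0)]
truth = [(2.0, 0.0), (3.0, 1.0), (4.0, 2.0)]
pred = constant_velocity_forecast(observed, horizon=3)
print(ade_fde(pred, truth))  # (1.0, 2.0)
```

In this toy case the error grows with the horizon, which matches the caption's description of motion-only baselines drifting outward along the pre-turn heading.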
Original abstract

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EgoTraj, an egocentric multimodal dataset of 75 real-world urban navigation sequences recorded with Meta Quest Pro headsets. Each sequence supplies synchronized RGB video together with ground-truth 6DoF head poses, per-frame 3D eye-gaze vectors, and scene annotations. The authors benchmark several existing trajectory-prediction methods and run ablations that isolate the contribution of gaze, scene, and motion cues.

Significance. If the supplied ground-truth labels are shown to be sufficiently accurate and synchronized, the dataset would be a useful addition for multimodal egocentric prediction research, particularly because it targets long-horizon, self-directed routes with participant diversity. The public release of raw data, code, and the EgoViz Dashboard supports reproducibility without introducing new fitted parameters or circular derivations.

major comments (1)
  1. [Dataset Collection / Abstract] The central claim that MQPro supplies usable ground-truth 6DoF head poses, 3D eye-gaze vectors, and scene annotations for training and evaluating trajectory models is not supported by any reported accuracy statistics, drift measurements, or external validation for long outdoor sequences under varying illumination and motion. Consumer headsets are known to accumulate error in GPS-denied conditions; without per-sequence error metrics the utility of the released labels cannot be assessed.
minor comments (2)
  1. [Abstract] The total recording duration and aggregate path length across the 75 sequences should be stated explicitly so readers can judge scale.
  2. [Experiments] Ablation tables would be clearer if the exact input modalities supplied to each baseline method were listed in a single summary table.
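
The scale figures requested in minor comment 1 could be derived from the pose logs themselves. The sketch below sums straight-line distances between consecutive head positions and takes the time span of a sequence. It rests on assumptions: the `(t, x, y, z)` tuple layout, the units (seconds and metres), and the function name are hypothetical and are not the documented format of the release.

```python
import math

def path_length_and_duration(samples):
    """Return (path length, duration) for one sequence.

    samples: list of (t_seconds, x, y, z) tuples sorted by time.
    Path length is the sum of distances between consecutive positions,
    so tracking jitter would inflate it; smoothing may be needed in practice.
    """
    if not samples:
        return 0.0, 0.0
    length = sum(math.dist(a[1:], b[1:]) for a, b in zip(samples, samples[1:]))
    duration = samples[-1][0] - samples[0][0]
    return length, duration

# Toy sequence: one 5-unit move, then standing still for a second.
toy = [(0.0, 0.0, 0.0, 0.0), (1.0, 3.0, 0.0, 4.0), (2.0, 3.0, 0.0, 4.0)]
print(path_length_and_duration(toy))  # (5.0, 2.0)
```

Summing these per-sequence values over all 75 sequences would give the aggregate totals the referee asks for.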

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback on the validation of the ground-truth labels. We address the major comment below and describe the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: the central claim that MQPro supplies usable ground-truth 6DoF head poses, 3D eye-gaze vectors, and scene annotations for training and evaluating trajectory models is not supported by any reported accuracy statistics, drift measurements, or external validation for long outdoor sequences under varying illumination and motion. Consumer headsets are known to accumulate error in GPS-denied conditions; without per-sequence error metrics the utility of the released labels cannot be assessed.

    Authors: We agree that the absence of quantitative accuracy statistics and drift measurements limits the ability to fully assess label utility. The manuscript presents the 6DoF poses and gaze vectors as provided by the Meta Quest Pro's built-in tracking without additional external validation, which is a limitation for long outdoor sequences. In the revised manuscript we will add a dedicated subsection under Dataset Collection that (1) cites prior work on Quest Pro and similar SLAM-based tracking accuracy in outdoor/GPS-denied settings, (2) discusses expected drift behavior over long horizons, and (3) includes qualitative observations from our sequences regarding tracking stability under varying illumination. We will also add a short clarifying sentence in the abstract and a limitations paragraph. These additions will increase transparency without altering the core dataset release. We cannot supply per-sequence numerical error metrics, as that would require new experiments with external reference systems that were not part of the original collection protocol.

    Revision: partial

standing simulated objections not resolved
  • We cannot provide per-sequence quantitative error metrics or external validation results without conducting additional data collection using high-precision reference systems, which is not feasible for this real-world outdoor dataset.

Circularity Check

0 steps flagged

Dataset release paper with external benchmarks exhibits no derivation chain

full rationale

The manuscript introduces and releases the EgoTraj dataset of 75 real-world sequences captured via Meta Quest Pro hardware, then evaluates existing trajectory-prediction algorithms on it. No first-principles derivation, fitted parameter, or mathematical claim is advanced whose output reduces to its own inputs by construction. The central contribution is the data collection and public release itself, which stands independently of any self-referential loop. External benchmarks and ablation studies further anchor the work outside any internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution is a data collection effort rather than a mathematical derivation; the main unstated premises are standard assumptions about the accuracy of commercial VR tracking hardware and the representativeness of the chosen urban routes and participants.

axioms (1)
  • domain assumption Meta Quest Pro provides sufficiently accurate and time-synchronized 6DoF head poses and eye gaze vectors for research use
    Invoked in the dataset description; no independent calibration study is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5791 in / 1214 out tokens · 38671 ms · 2026-05-20T11:08:42.171340+00:00 · methodology


