EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

Abduallah Mohamed; Ahmad Yehia; Christian Claudel; Jiseop Byeon; Junfeng Jiao; Kun Qian; Tianyi Wang

arxiv: 2605.19004 · v1 · pith:4MMYVLOGnew · submitted 2026-05-18 · 💻 cs.CV · cs.LG · cs.RO

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

Pith reviewed 2026-05-20 11:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO

keywords egocentric trajectory prediction · multimodal dataset · real-world navigation · head pose tracking · eye gaze · urban environments · wearable sensing · human trajectory forecasting

The pith

EgoTraj introduces 75 real-world sequences of egocentric urban navigation with synchronized head poses, gaze, and scene data to support multimodal trajectory prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EgoTraj, a dataset recorded with Meta Quest Pro headsets that captures 75 sequences of people navigating freely through diverse urban environments. It supplies synchronized RGB video along with continuous 6DoF head poses, per-frame 3D eye gaze vectors, and scene annotations from multiple participants. The collection targets a gap in existing datasets: long-horizon, self-directed egocentric navigation data. Accurate models built on such data would directly aid humanoid robotics, wearable sensing, and assistive navigation tools. The authors demonstrate the dataset's utility by benchmarking current prediction methods and testing how gaze, scene, and motion cues each contribute.

Core claim

EgoTraj consists of 75 sequences of human navigation collected from multiple Meta Quest Pro wearers in real-world urban environments, providing synchronized RGB video together with ground-truth continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, it differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity.
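To make "synchronized" concrete, here is a minimal sketch of how one time-stamped record and a nearest-timestamp lookup might look. It is an illustration only: the `EgoSample` class, its field names, the quaternion ordering, and the `nearest_sample` helper are all hypothetical, and the actual file layout of the EgoTraj release is not described on this page.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EgoSample:
    """One time-stamped sensor record. All field names are hypothetical."""
    t: float                                      # seconds since session start
    position: Tuple[float, float, float]          # head translation (x, y, z)
    rotation: Tuple[float, float, float, float]   # head orientation, unit quaternion (x, y, z, w), assumed order
    gaze: Tuple[float, float, float]              # 3D gaze direction vector


def nearest_sample(samples: List[EgoSample], t_frame: float) -> EgoSample:
    """Return the sample whose timestamp is closest to a video frame time.

    Assumes `samples` is non-empty and sorted by `t`.
    """
    times = [s.t for s in samples]
    i = bisect_left(times, t_frame)
    if i == 0:
        return samples[0]
    if i == len(samples):
        return samples[-1]
    before, after = samples[i - 1], samples[i]
    # Ties go to the earlier sample.
    return before if t_frame - before.t <= after.t - t_frame else after


if __name__ == "__main__":
    identity = (0.0, 0.0, 0.0, 1.0)
    log = [EgoSample(0.1 * k, (0.0, 0.0, float(k)), identity, (0.0, 0.0, 1.0)) for k in range(3)]
    print(nearest_sample(log, 0.14).position)
```

Nearest-timestamp matching is the simplest alignment choice; interpolating poses between neighbouring samples would be an alternative when the sensor and video rates differ noticeably.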

What carries the argument

The EgoTraj dataset, which supplies synchronized multimodal signals from real urban navigation to train and evaluate trajectory prediction models.

If this is right

Prediction models gain access to combined gaze, scene, and motion cues, whose individual contributions the paper's ablation studies analyze.
The dataset enables direct benchmarking of state-of-the-art egocentric trajectory methods on long-horizon real-world data.
Applications in AR perception, navigation assistance, and humanoid robotics obtain a public resource for development and testing.
Open release of sequences, code, and the EgoViz Dashboard allows community extension of the multimodal approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The long-horizon nature of the sequences could support training of predictors that operate over longer time windows than most current short-term models.
Broad participant diversity may help future systems generalize across different walking styles and body types without additional data collection.
Integration of EgoTraj with existing third-person or simulated trajectory datasets could produce hybrid training regimes that combine real egocentric signals with scale.

Load-bearing premise

The ground-truth 6DoF head poses, eye gaze vectors, and scene annotations provided by the Meta Quest Pro are sufficiently accurate and time-synchronized for training and evaluating trajectory prediction models.

What would settle it

A test in which models trained on EgoTraj produce higher error rates than models trained on prior datasets when evaluated on held-out real urban walks, or direct measurements revealing significant timing offsets or pose inaccuracies in the released ground-truth tracks.
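A comparison of "error rates" needs a concrete metric. This page does not say which metrics the paper reports, so the sketch below assumes the two most common ones in trajectory forecasting, average displacement error (ADE) and final displacement error (FDE), and pairs them with a constant-velocity predictor of the kind the Figure 8 caption calls Const-Vel. The function names and the toy numbers are hypothetical, and the paper's own baseline may be implemented differently.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def constant_velocity(observed: Sequence[Point], horizon: int) -> List[Point]:
    """Repeat the last observed step for `horizon` future steps.

    Assumes at least two observed points at a fixed sampling interval.
    """
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]


def ade_fde(pred: Sequence[Point], truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (mean per-step Euclidean error, error at the final step)."""
    errors = [math.dist(p, g) for p, g in zip(pred, truth)]
    return sum(errors) / len(errors), errors[-1]


if __name__ == "__main__":
    observed = [(0.0, 0.0), (1.0, 0.0)]           # walking straight along x
    truth = [(2.0, 0.0), (3.0, 1.0), (4.0, 2.0)]  # then veering to the side
    pred = constant_velocity(observed, horizon=3)
    print(ade_fde(pred, truth))
```

In the toy case the prediction keeps the pre-turn heading while the ground truth veers away, so the error grows with each step, which is the outward drift the Figure 8 caption attributes to motion-only baselines.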

Figures

Figures reproduced from arXiv: 2605.19004 by Abduallah Mohamed, Ahmad Yehia, Christian Claudel, Jiseop Byeon, Junfeng Jiao, Kun Qian, Tianyi Wang.

**Figure 1.** Overview of EgoTraj. (a) Protocol design: the Meta Quest Pro headset records synchronized RGB video, 6DoF head pose, and gaze signals during in-the-wild navigation. (b) Multimodal EgoTraj capture: representative egocentric frames from a crosswalk navigation scenario with multiple social interactions. (c) Analysis and applications: the dataset supports downstream tasks such as assistive navigation for blind…

**Figure 2.** Dataset statistics of EgoTraj participants. (a) Nationality breakdown of the 75 recruited participants. (b) Age…

**Figure 3.** Examples of VLM-generated scene annotations. Each tab shows an egocentric frame with the gaze marker…

**Figure 4.** Gaze-to-pixel calibration examples. The green dot indicates the projected gaze fixation overlaid on egocentric…

**Figure 5.** A snapshot of the EgoViz Dashboard showing synchronized trajectory, gaze, video, and annotation streams.

**Figure 6.** Qualitative trajectory forecasting. Egocentric trajectory predictions from multiple baselines using motion data (ego-translation and rotation) on three scenarios from the EgoTraj test split. Dark blue: observed path; green dashed: ground truth; colors: predictions. Left: gentle segment. Center: moderate turn where attention-based models better follow the trajectory. Right: sharp ~90° intersection turn whe…

**Figure 7.** Multimodal observations across three consecutive timesteps. Each row corresponds to a frame at time t+1, t+2, and t+3 during a sidewalk navigation sequence. From left to right: egocentric RGB frame with gaze fixation (red dot), relative depth estimated by Depth Anything V2, semantic segmentation predicted by OneFormer, nearby pedestrians detected by YOLOv8-Pose ranked by depth proximity, and ground-truth (…

**Figure 8.** Active-transition windows where turning begins within the final 0.5 s of Tobs. Multimodal transformer-based predictors (CXA-Transformer, EgoCast) track the ground-truth trajectory through the turn, while motion-only baselines (Const-Vel, Lin-Ext, position-only) drift outward along the pre-turn heading, consistent with gaze leading motion by 1–2 s before turns. …

**Figure 9.** Overview of the multimodal recording pipeline used to collect the EgoTraj dataset. A custom Unity application…

**Figure 10.** Overview of the EgoTraj preprocessing pipeline. Stage 1 selects sessions and scans the dataset structure. …

**Figure 11.** Area of interest for EgoTraj data collection. Colored lines denote recorded walking routes, and markers…

**Figure 12.** Examples of EgoTraj scene annotations. Each observation frame shows the gaze marker (red dot) and the…

**Figure 13.** (a) Overview of the proposed multimodal trajectory forecasting model. Each modality (ego-motion, social…
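Figure 4's caption describes a projected gaze fixation overlaid on the egocentric frame. The calibration procedure itself is not given on this page, so the following is only a minimal sketch of one common way to perform such a projection, a pinhole camera model. It assumes the gaze direction is already expressed in the RGB camera frame (x right, y down, z forward) and that the intrinsics `fx`, `fy`, `cx`, `cy` are known. The function name and all numeric values are hypothetical and are not taken from the EgoTraj release.

```python
from typing import Optional, Tuple


def project_gaze(
    gaze_cam: Tuple[float, float, float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float]]:
    """Project a 3D gaze direction to pixel coordinates with a pinhole model.

    Returns None when the gaze points behind or parallel to the image plane.
    """
    x, y, z = gaze_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 image.
    print(project_gaze((0.0, 0.0, 1.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

A gaze vector along the optical axis lands on the principal point. This sketch ignores lens distortion and the offset between the eyes and the camera, both of which a real calibration would likely need to handle.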

Original abstract

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EgoTraj releases 75 urban egocentric sequences from MQPro with synced modalities, but sensor accuracy for long outdoor use stays unverified.

The main thing to know is that EgoTraj releases 75 sequences of real-world urban navigation captured with Meta Quest Pro headsets. Each one has synced RGB video, 6DoF head poses, 3D eye gaze, and scene annotations, collected from multiple people on self-directed routes. The paper does well by making the full dataset, code, and a visualization dashboard available. They also benchmark a few prediction models and run ablations to show the value of combining gaze, scene, and motion cues. This gives a practical starting point for work in AR perception and navigation. The soft spot is the accuracy of the provided ground truth. Consumer headsets like the MQPro are prone to drift in pose and gaze during long outdoor walks with varying conditions. The description does not include error statistics, calibration procedures, or external validation, so it's unclear if the labels are precise enough for training or evaluating models reliably. This paper is for researchers in egocentric vision, multimodal prediction, and wearable robotics who need real urban data. A reader working on assistive systems or AR would get the most direct value from the benchmarks and the data itself. I would recommend peer review. The dataset has clear novelty in its scale and modalities, and referees can push for more details on data quality to strengthen it.

Referee Report

1 major / 2 minor

Summary. The paper introduces EgoTraj, an egocentric multimodal dataset of 75 real-world urban navigation sequences recorded with Meta Quest Pro headsets. Each sequence supplies synchronized RGB video together with ground-truth 6DoF head poses, per-frame 3D eye-gaze vectors, and scene annotations. The authors benchmark several existing trajectory-prediction methods and run ablations that isolate the contribution of gaze, scene, and motion cues.

Significance. If the supplied ground-truth labels are shown to be sufficiently accurate and synchronized, the dataset would be a useful addition for multimodal egocentric prediction research, particularly because it targets long-horizon, self-directed routes with participant diversity. The public release of raw data, code, and the EgoViz Dashboard supports reproducibility without introducing new fitted parameters or circular derivations.

major comments (1)

[Dataset Collection / Abstract] The central claim that MQPro supplies usable ground-truth 6DoF head poses, 3D eye-gaze vectors, and scene annotations for training and evaluating trajectory models is not supported by any reported accuracy statistics, drift measurements, or external validation for long outdoor sequences under varying illumination and motion. Consumer headsets are known to accumulate error in GPS-denied conditions; without per-sequence error metrics the utility of the released labels cannot be assessed.

minor comments (2)

[Abstract] The total recording duration and aggregate path length across the 75 sequences should be stated explicitly so readers can judge scale.
[Experiments] Ablation tables would be clearer if the exact input modalities supplied to each baseline method were listed in a single summary table.
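The scale figures requested in the first minor comment can be computed directly from the pose stream. A minimal sketch, assuming each sequence is available as a list of timestamps in seconds and a matching list of 3D head positions in metres; the function name and input layout are hypothetical:

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float, float]


def sequence_scale(
    timestamps: Sequence[float], positions: Sequence[Position]
) -> Tuple[float, float]:
    """Return (duration, path length) for one recorded sequence.

    Duration is last minus first timestamp; path length is the sum of
    straight-line distances between consecutive head positions.
    """
    duration = timestamps[-1] - timestamps[0]
    length = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    return duration, length


if __name__ == "__main__":
    ts = [0.0, 1.0, 2.0]
    pos = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    print(sequence_scale(ts, pos))
```

Summing these pairs over all 75 sequences would give the aggregate totals. Because head positions include vertical bob and tracking jitter, a length computed this way will tend to overstate the ground distance walked; projecting onto the ground plane or smoothing first would be a reasonable refinement.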

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback on the validation of the ground-truth labels. We address the major comment below and describe the changes we will make to the manuscript.

Referee: the central claim that MQPro supplies usable ground-truth 6DoF head poses, 3D eye-gaze vectors, and scene annotations for training and evaluating trajectory models is not supported by any reported accuracy statistics, drift measurements, or external validation for long outdoor sequences under varying illumination and motion. Consumer headsets are known to accumulate error in GPS-denied conditions; without per-sequence error metrics the utility of the released labels cannot be assessed.

Authors: We agree that the absence of quantitative accuracy statistics and drift measurements limits the ability to fully assess label utility. The manuscript presents the 6DoF poses and gaze vectors as provided by the Meta Quest Pro's built-in tracking without additional external validation, which is a limitation for long outdoor sequences. In the revised manuscript we will add a dedicated subsection under Dataset Collection that (1) cites prior work on Quest Pro and similar SLAM-based tracking accuracy in outdoor/GPS-denied settings, (2) discusses expected drift behavior over long horizons, and (3) includes qualitative observations from our sequences regarding tracking stability under varying illumination. We will also add a short clarifying sentence in the abstract and a limitations paragraph. These additions will increase transparency without altering the core dataset release. We cannot supply per-sequence numerical error metrics, as that would require new experiments with external reference systems that were not part of the original collection protocol.

Revision: partial

standing simulated objections (not resolved)

We cannot provide per-sequence quantitative error metrics or external validation results without conducting additional data collection using high-precision reference systems, which is not feasible for this real-world outdoor dataset.

Circularity Check

0 steps flagged

Dataset release paper with external benchmarks exhibits no derivation chain

The manuscript introduces and releases the EgoTraj dataset of 75 real-world sequences captured via Meta Quest Pro hardware, then evaluates existing trajectory-prediction algorithms on it. No first-principles derivation, fitted parameter, or mathematical claim is advanced whose output reduces to its own inputs by construction. The central contribution is the data collection and public release itself, which stands independently of any self-referential loop. External benchmarks and ablation studies further anchor the work outside any internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is a data collection effort rather than a mathematical derivation; the main unstated premises are standard assumptions about the accuracy of commercial VR tracking hardware and the representativeness of the chosen urban routes and participants.

axioms (1)

domain assumption: Meta Quest Pro provides sufficiently accurate and time-synchronized 6DoF head poses and eye gaze vectors for research use
Invoked in the dataset description; no independent calibration study is mentioned in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers... synchronized 6DoF head pose, per-frame 3D eye gaze vectors, scene annotations.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Ablation studies to analyze the contributions of gaze, scene, and motion cues.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

[1]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Alahi, A., Goel, K., Ramanathan, V ., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory pre- diction in crowded spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 961–971 (2016)

work page 2016
[2]

In: IEEE Intelligent Vehicles Symposium

Bock, J., Krajewski, R., Moers, T., Runde, S., Vater, L., Eckstein, L.: The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In: IEEE Intelligent Vehicles Symposium. pp. 1929–

work page 1929
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Caesar, H., Bankiti, V ., Lang, A.H., V ora, S., Liong, V .E., Xu, Q., Krishnan, A., Pan, Y ., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11621–11631 (2020)

work page 2020
[4]

In: IEEE International Conference on Robotics and Automation

Chen, C., Liu, Y ., Kreiss, S., Alahi, A.: Crowd-robot interaction: Crowd-aware robot navigation with attention- based deep reinforcement learning. In: IEEE International Conference on Robotics and Automation. pp. 6015–

work page
[5]

In: IEEE/CVF Winter Conference on Applications of Computer Vision

Escobar, M., Puentes, J., Forigua, C., Pont-Tuset, J., Maninis, K.K., Arbelaez, P.: Egocast: Forecasting egocentric human pose in the wild. In: IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5831–5841. IEEE (2025)

work page 2025
[6]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y ., Sapp, B., Qi, C.R., Zhou, Y ., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9710–9719 (2021)

work page 2021
[7]

https://blog.google/ outreach-initiatives/accessibility/project-guideline/(2021)

Google: Project guideline: Enabling those with low vision to run independently. https://blog.google/ outreach-initiatives/accessibility/project-guideline/(2021)

work page 2021
[8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)

work page 2022
[9]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V ., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024)

work page 2024
[10]

Behavior research methods56(7), 7307–7330 (2024)

Hermens, F.: Automatic object detection for behavioural research using yolov8. Behavior research methods56(7), 7307–7330 (2024)

work page 2024
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hu, Y ., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17853–17862 (2023)

work page 2023
[12]

In: Conference on Robot Learning

Jain, A., Casas, S., Liao, R., Xiong, Y ., Feng, S., Segal, S., Urtasun, R.: Discrete residual flow for probabilistic pedestrian behavior prediction. In: Conference on Robot Learning. pp. 407–419. PMLR (2020)

work page 2020
[13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jain, J., Li, J., Chiu, M.T., Hassani, A., Orlov, N., Shi, H.: Oneformer: One transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2989–2998 (2023)

work page 2023
[14]

In: Proceedings of the CHI Conference on Human Factors in Computing Systems

Kacorri, H., Kitani, K.M., Bigham, J.P., Asakawa, C.: People with visual impairment training personal object recognizers: Feasibility and challenges. In: Proceedings of the CHI Conference on Human Factors in Computing Systems. pp. 5839–5849 (2017)

work page 2017
[15]

IEEE Robotics and Automation Letters7(4), 11807–11814 (2022) 12

Karnan, H., Nair, A., Xiao, X., Warnell, G., Pirk, S., Toshev, A., Hart, J., Biswas, J., Stone, P.: Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters7(4), 11807–11814 (2022) 12

work page 2022
[16]

arXiv preprint arXiv:2412.00396 (2024)

Kim, D., Srouji, M., Chen, C., Zhang, J.: Armor: Egocentric perception for humanoid robot collision avoidance and motion planning. arXiv preprint arXiv:2412.00396 (2024)

work page arXiv 2024
[17]

Progress in Retinal and Eye Research 25(3), 296–324 (2006)

Land, M.F.: Eye movements and the control of actions in everyday life. Progress in Retinal and Eye Research 25(3), 296–324 (2006)

work page 2006
[18]

In: Computer Graphics Forum

Lerner, A., Chrysanthou, Y ., Lischinski, D.: Crowds by example. In: Computer Graphics Forum. vol. 26, pp. 655–664. Wiley Online Library (2007)

work page 2007
[19]

Aria Everyday Activities Dataset,

Lv, Z., Charron, N., Moulon, P., Gamino, A., Peng, C., Sweeney, C., Miller, E., Tang, H., Meissner, J., Dong, J., et al.: Aria everyday activities dataset. arXiv preprint arXiv:2402.13349 (2024)

work page arXiv 2024
[20]

In: European Conference on Computer Vision

Ma, L., Ye, Y ., Hong, F., Guzov, V ., Jiang, Y ., Postyeni, R., Pesqueira, L., Gamino, A., Baiyya, V ., Kim, H.J., et al.: Nymeria: A massive collection of multimodal egocentric daily motion in the wild. In: European Conference on Computer Vision. pp. 445–465. Springer (2024)

work page 2024
[21]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6688–6702 (2020)

Marchetti, F., Becattini, F., Seidenari, L., Del Bimbo, A.: Multiple trajectory prediction of moving agents with memory augmented networks. IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6688–6702 (2020)

work page 2020
[22]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6748–6765 (2021)

Martin-Martin, R., Patel, M., Rezatofighi, H., Shenoi, A., Gwak, J., Frankel, E., Sadeghian, A., Savarese, S.: Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6748–6765 (2021)

work page 2021
[23]

In: European Conference on Computer Vision

Mohamed, A., Zhu, D., Vu, W., Elhoseiny, M., Claudel, C.: Social-implicit: Rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. In: European Conference on Computer Vision. pp. 463–479. Springer (2022)

work page 2022
[24]

In: IEEE/RSJ International Conference on Intelligent Robots and Systems

Nguyen, D.M., Nazeri, M., Payandeh, A., Datar, A., Xiao, X.: Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 7442–7447. IEEE (2023)

work page 2023
[25]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Pan, B., Harley, A.W., Engelmann, F., Liu, C.K., Guibas, L.J.: Lookout: Real-world humanoid egocentric navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24977–24988 (2025)

work page 2025
[26]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Pan, X., Charron, N., Yang, Y ., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y .C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20133–20143 (2023)

work page 2023
[27]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Park, H.S., Hwang, J.J., Niu, Y ., Shi, J.: Egocentric future localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4697–4705 (2016)

work page 2016
[28]

In: Proceedings of the IEEE International Conference on Computer Vision

Pellegrini, S., Ess, A., Schindler, K., Van Gool, L.: You’ll never walk alone: Modeling social behavior for multi-target tracking. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 261–268. IEEE (2009)

work page 2009
[29]

In: IEEE International Conference on Robotics and Automation

Peng, C., Paredes, V ., Castillo, G.A., Hereid, A.: Real-time safe bipedal robot navigation using linear discrete control barrier functions. In: IEEE International Conference on Robotics and Automation. pp. 14903–14909. IEEE (2025)

work page 2025
[30]

IEEE Robotics and Automation Letters7(4), 8799–8806 (2022)

Qiu, J., Chen, L., Gu, X., Lo, F.P.W., Tsai, Y .Y ., Sun, J., Liu, J., Lo, B.: Egocentric human trajectory forecasting with a wearable camera and multi-modal fusion. IEEE Robotics and Automation Letters7(4), 8799–8806 (2022)

work page 2022
[31]

arXiv preprint arXiv:2511.17581 (2025)

Qiu, Z., Liu, Z., Niu, W., Bhattacharjee, T., Kalantari, S.: Egocognav: Cognition-aware human egocentric navigation. arXiv preprint arXiv:2511.17581 (2025)

work page arXiv 2025
[32]

Raina, N., Somasundaram, G., Zheng, K., Miglani, S., Saarinen, S., Meissner, J., Schwesinger, M., Pesqueira, L., Prasad, I., Miller, E., Gupta, P., Yan, M., Newcombe, R., Ren, C., Parkhi, O.: Egoblur model (2023)

work page 2023
[33]

In: European Conference on Computer Vision

Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: Human trajectory understanding in crowded scenes. In: European Conference on Computer Vision. pp. 549–565. Springer (2016)

work page 2016
[34]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Shi, L., Wang, L., Zhou, S., Hua, G.: Trajectory unified transformer for pedestrian trajectory prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9675–9684 (2023)

work page 2023
[35]

In: IEEE Winter Conference on Applications of Computer Vision

Singh, K.K., Fatahalian, K., Efros, A.A.: Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In: IEEE Winter Conference on Applications of Computer Vision. pp. 1–9. IEEE (2016)

work page 2016
[36]

In: Proceedings of the ACM International Symposium on Wearable Computers

Tang, T.J., Li, W.H.: An assistive eyewear prototype that interactively converts 3d object locations into spatial audio. In: Proceedings of the ACM International Symposium on Wearable Computers. pp. 119–126 (2014) 13

work page 2014
[37]

In: IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems

Tian, Y ., Liu, Y ., Tan, J.: Wearable navigation system for the blind people in dynamic environments. In: IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems. pp. 153–158. IEEE (2013)

work page 2013
[38]

In: IEEE International Conference on Robotics and Automation

Wang, H.C., Katzschmann, R.K., Teng, S., Araki, B., Giarré, L., Rus, D.: Enabling independent navigation for visually impaired people through a wearable vision-based feedback system. In: IEEE International Conference on Robotics and Automation. pp. 6533–6540. IEEE (2017)

work page 2017
[39]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)


[1] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 961–971 (2016)

[2] Bock, J., Krajewski, R., Moers, T., Runde, S., Vater, L., Eckstein, L.: The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In: IEEE Intelligent Vehicles Symposium. pp. 1929–

[3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11621–11631 (2020)

[4] Chen, C., Liu, Y., Kreiss, S., Alahi, A.: Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In: IEEE International Conference on Robotics and Automation. pp. 6015–

[5] Escobar, M., Puentes, J., Forigua, C., Pont-Tuset, J., Maninis, K.K., Arbelaez, P.: Egocast: Forecasting egocentric human pose in the wild. In: IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5831–5841. IEEE (2025)

[6] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R., Zhou, Y., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9710–9719 (2021)

[7] Google: Project guideline: Enabling those with low vision to run independently. https://blog.google/outreach-initiatives/accessibility/project-guideline/ (2021)

[8] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)

[9] Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024)

[10] Hermens, F.: Automatic object detection for behavioural research using yolov8. Behavior research methods 56(7), 7307–7330 (2024)

[11] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17853–17862 (2023)

[12] Jain, A., Casas, S., Liao, R., Xiong, Y., Feng, S., Segal, S., Urtasun, R.: Discrete residual flow for probabilistic pedestrian behavior prediction. In: Conference on Robot Learning. pp. 407–419. PMLR (2020)

[13] Jain, J., Li, J., Chiu, M.T., Hassani, A., Orlov, N., Shi, H.: Oneformer: One transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2989–2998 (2023)

[14] Kacorri, H., Kitani, K.M., Bigham, J.P., Asakawa, C.: People with visual impairment training personal object recognizers: Feasibility and challenges. In: Proceedings of the CHI Conference on Human Factors in Computing Systems. pp. 5839–5849 (2017)

[15] Karnan, H., Nair, A., Xiao, X., Warnell, G., Pirk, S., Toshev, A., Hart, J., Biswas, J., Stone, P.: Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters 7(4), 11807–11814 (2022)

[16] Kim, D., Srouji, M., Chen, C., Zhang, J.: Armor: Egocentric perception for humanoid robot collision avoidance and motion planning. arXiv preprint arXiv:2412.00396 (2024)

[17] Land, M.F.: Eye movements and the control of actions in everyday life. Progress in Retinal and Eye Research 25(3), 296–324 (2006)

[18] Lerner, A., Chrysanthou, Y., Lischinski, D.: Crowds by example. In: Computer Graphics Forum. vol. 26, pp. 655–664. Wiley Online Library (2007)

[19] Lv, Z., Charron, N., Moulon, P., Gamino, A., Peng, C., Sweeney, C., Miller, E., Tang, H., Meissner, J., Dong, J., et al.: Aria everyday activities dataset. arXiv preprint arXiv:2402.13349 (2024)

[20] Ma, L., Ye, Y., Hong, F., Guzov, V., Jiang, Y., Postyeni, R., Pesqueira, L., Gamino, A., Baiyya, V., Kim, H.J., et al.: Nymeria: A massive collection of multimodal egocentric daily motion in the wild. In: European Conference on Computer Vision. pp. 445–465. Springer (2024)

[21] Marchetti, F., Becattini, F., Seidenari, L., Del Bimbo, A.: Multiple trajectory prediction of moving agents with memory augmented networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 6688–6702 (2020)

[22] Martin-Martin, R., Patel, M., Rezatofighi, H., Shenoi, A., Gwak, J., Frankel, E., Sadeghian, A., Savarese, S.: Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 6748–6765 (2021)

[23] Mohamed, A., Zhu, D., Vu, W., Elhoseiny, M., Claudel, C.: Social-implicit: Rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. In: European Conference on Computer Vision. pp. 463–479. Springer (2022)

[24] Nguyen, D.M., Nazeri, M., Payandeh, A., Datar, A., Xiao, X.: Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 7442–7447. IEEE (2023)

[25] Pan, B., Harley, A.W., Engelmann, F., Liu, C.K., Guibas, L.J.: Lookout: Real-world humanoid egocentric navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24977–24988 (2025)

[26] Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20133–20143 (2023)

[27] Park, H.S., Hwang, J.J., Niu, Y., Shi, J.: Egocentric future localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4697–4705 (2016)

[28] Pellegrini, S., Ess, A., Schindler, K., Van Gool, L.: You’ll never walk alone: Modeling social behavior for multi-target tracking. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 261–268. IEEE (2009)

[29] Peng, C., Paredes, V., Castillo, G.A., Hereid, A.: Real-time safe bipedal robot navigation using linear discrete control barrier functions. In: IEEE International Conference on Robotics and Automation. pp. 14903–14909. IEEE (2025)

[30] Qiu, J., Chen, L., Gu, X., Lo, F.P.W., Tsai, Y.Y., Sun, J., Liu, J., Lo, B.: Egocentric human trajectory forecasting with a wearable camera and multi-modal fusion. IEEE Robotics and Automation Letters 7(4), 8799–8806 (2022)

[31] Qiu, Z., Liu, Z., Niu, W., Bhattacharjee, T., Kalantari, S.: Egocognav: Cognition-aware human egocentric navigation. arXiv preprint arXiv:2511.17581 (2025)

[32] Raina, N., Somasundaram, G., Zheng, K., Miglani, S., Saarinen, S., Meissner, J., Schwesinger, M., Pesqueira, L., Prasad, I., Miller, E., Gupta, P., Yan, M., Newcombe, R., Ren, C., Parkhi, O.: Egoblur model (2023)

[33] Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: Human trajectory understanding in crowded scenes. In: European Conference on Computer Vision. pp. 549–565. Springer (2016)

[34] Shi, L., Wang, L., Zhou, S., Hua, G.: Trajectory unified transformer for pedestrian trajectory prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9675–9684 (2023)

[35] Singh, K.K., Fatahalian, K., Efros, A.A.: Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In: IEEE Winter Conference on Applications of Computer Vision. pp. 1–9. IEEE (2016)

[36] Tang, T.J., Li, W.H.: An assistive eyewear prototype that interactively converts 3d object locations into spatial audio. In: Proceedings of the ACM International Symposium on Wearable Computers. pp. 119–126 (2014)

[37] Tian, Y., Liu, Y., Tan, J.: Wearable navigation system for the blind people in dynamic environments. In: IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems. pp. 153–158. IEEE (2013)

[38] Wang, H.C., Katzschmann, R.K., Teng, S., Araki, B., Giarré, L., Rus, D.: Enabling independent navigation for visually impaired people through a wearable vision-based feedback system. In: IEEE International Conference on Robotics and Automation. pp. 6533–6540. IEEE (2017)

[39] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

[40] Wang, T., Byeon, J., Yehia, A., Wang, H., Xu, Y., Zeng, T., Wang, Z., Jiao, J., Claudel, C.: Xr-dt: Extended reality-enhanced digital twin for agentic mobile robots. arXiv preprint arXiv:2512.05270 (2025)

[41] Wang, W., Liu, C.K., Kennedy III, M.: Egonav: Egocentric scene-aware human trajectory prediction. arXiv preprint arXiv:2403.19026 (2024)

[42] Yagi, T., Mangalam, K., Yonetani, R., Sato, Y.: Future person localization in first-person videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7593–7602 (2018)

[43] Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)

[44] Yehia, A., Byeon, J., Wang, T., Wang, H., Xu, Y., Jiao, J., Claudel, C.: Arcas: An augmented reality collision avoidance system with slam-based tracking for enhancing vru safety. arXiv preprint arXiv:2512.05299 (2025)

[45] Zheng, W., Song, R., Guo, X., Zhang, C., Chen, L.: Genad: Generative end-to-end autonomous driving. In: European Conference on Computer Vision. pp. 87–104. Springer (2024)

Appendix A Capture Setup and Recording Details

All data were collected using the Meta Quest Pro headset (MQPro), a mixed-reality device equipped with integrated eye-tracking cameras, ...

1. Initialize the Unity recording application on the Meta Quest Pro headset.

2. The application creates a new session directory and prepares the sensor logging files.

3. The participant wears the headset and selects the two waypoints in the recording environment.

4. The participant starts the session by pressing the controller A button, which activates the start_signal and begins both sensor logging and video recording.

5. During the session, the headset wearer navigates naturally through the environment while the system records synchronized RGB video, head pose, and gaze measurements.

6. When the recording is complete, the participant presses the controller B button, which sends the stop_signal and terminates both processes.

7. The recorded data are exported as synchronized sensor logs and video files for subsequent preprocessing.

A.4 Data Processing

After data collection, the recorded multimodal streams were processed using a custom preprocessing pipeline designed to synchronize and organize the data into a unified dataset format. The preprocessing workflow is illustrated in Fi...

1. Identify the environmental context (e.g., crosswalk, sidewalk, intersection).

2. Detect nearby dynamic agents such as pedestrians, vehicles, or cyclists.

3. Analyze traffic signals, obstacles, or navigation constraints.

4. Incorporate the projected gaze as an indicator of the user’s attention.

5. Infer the likely short-term motion or navigation intent of the camera wearer.

D.2 Annotation Quality Evaluation

To verify the reliability of the generated annotations (Figure 12), we evaluated the pipeline using several complementary metrics on the same 100-frame stratified sample. Structural compliance measured whether each annotation follows the predefi...