LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Pith reviewed 2026-05-08 16:32 UTC · model grok-4.3
The pith
LAMP tracks 3D human motion from egocentric multi-camera headsets by lifting 2D keypoints to a unified metric world frame and fitting a transformer to the resulting ray cloud.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAMP addresses egocentric multi-camera tracking with a two-step lift-then-fit process. Detected 2D body keypoints from all cameras across a temporal window are first transformed, using the device's known 6 DoF motion and calibration, into a unified 3D ray cloud expressed in the world reference frame. An end-to-end-trained spatio-temporal transformer then fits 3D human motion directly to this ray cloud. The method achieves state-of-the-art results on monocular benchmarks and significantly outperforms baselines in the targeted egocentric multi-camera setting.
What carries the argument
The lift-then-fit process that first projects 2D detections into a metric 3D world-space ray cloud using localization, then applies an end-to-end spatio-temporal transformer.
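The lift step can be sketched for a single keypoint under an idealized pinhole model. The function name, the plain intrinsics matrix `K`, and the camera-to-world pose `(R_wc, t_wc)` are illustrative assumptions, not the paper's actual interface; in particular, real headset cameras would need lens undistortion before this step.

```python
import numpy as np

def lift_keypoint_to_ray(uv, K, R_wc, t_wc):
    """Back-project one detected 2D keypoint into a world-frame ray.

    uv:   (2,) pixel coordinates of the keypoint
    K:    (3, 3) camera intrinsics (pinhole, distortion already removed)
    R_wc: (3, 3) camera-to-world rotation from device localization
    t_wc: (3,)  camera center in the world frame
    Returns (origin, direction): the ray starts at the camera center
    and points along a unit direction in world coordinates.
    """
    # Pixel -> normalized camera coordinates (homogeneous back-projection).
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # Rotate the viewing direction into the stable world frame.
    d_world = R_wc @ d_cam
    return t_wc, d_world / np.linalg.norm(d_world)
```

Repeating this over every keypoint, camera, and frame in the temporal window yields the ray cloud the transformer consumes; because each ray already lives in the metric world frame, observer motion is disentangled from target motion before learning begins.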
If this is right
- The framework flexibly incorporates information from multiple temporally asynchronous and partially observing cameras.
- The model can learn a natural human motion prior directly in stable world space.
- The approach remains effective under severe egomotion without requiring static cameras.
- It supplies a simple, modular way to add extra cameras or longer temporal windows.
Where Pith is reading between the lines
- If accurate localization is available, the same lifting step could improve 3D tracking for other moving-camera platforms such as drones or handheld devices.
- Representing input in a metric world frame may simplify the design of future motion priors that generalize across different camera rigs.
- The method could be evaluated for real-time use by pairing it with online SLAM systems that supply the required 6 DoF poses.
Load-bearing premise
The conversion of 2D keypoints into an accurate 3D ray cloud requires precise knowledge of the device's 6 DoF motion and camera calibration.
What would settle it
Running LAMP on the same egocentric sequences with deliberately noisy or perturbed 6 DoF localization poses and measuring how sharply 3D tracking accuracy degrades.
read the original abstract
Tracking 3D human motion from egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions and lack of training data. Existing methods designed for monocular video often require static or slowly-moving cameras and cannot efficiently leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6 DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" approach allows LAMP to learn and leverage a natural human motion prior in the world-space, as well as providing an elegant framework to flexibly incorporate information from multiple temporally asynchronous, partially observing and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LAMP, a localization-aware framework for 3D human motion tracking from egocentric multi-camera headsets. It describes a two-step 'lift-then-fit' pipeline: first, detected 2D body keypoints from multiple cameras are back-projected into a unified metric 3D world frame using the device's known 6DoF trajectory and camera calibration; second, a spatio-temporal transformer regresses 3D human motion directly from the resulting ray cloud. The paper claims state-of-the-art results on monocular benchmarks and significant outperformance over baselines in the targeted egocentric setting.
Significance. If the empirical claims are substantiated, the disentanglement of observer and target motion via early 3D lifting could provide a useful prior for handling severe egomotion and partial observations in dynamic multi-view egocentric captures, with potential applications in AR/VR. The approach offers a flexible way to incorporate asynchronous multi-camera data without requiring static cameras.
major comments (2)
- [Abstract] The central claim that LAMP 'achieves state-of-the-art results on monocular benchmarks' while 'significantly outperforming baselines for our targeted egocentric setting' is presented without reference to specific datasets, baselines, metrics, error bars, ablation studies, or quantitative tables; the performance assertions are therefore unverifiable, despite being load-bearing for the contribution.
- [Method] Lift-then-fit pipeline: The pipeline explicitly requires accurate device 6DoF motion and calibration to convert 2D keypoints into undistorted 3D rays in world coordinates before the transformer stage. No analysis of sensitivity to localization noise, calibration drift, or SLAM errors is provided, despite these being common in real headset captures and directly impacting the separation of observer/target motion and the applicability of the learned world-space prior.
minor comments (1)
- [Abstract] The term '3D ray cloud' is introduced without a precise definition or diagram clarifying whether it consists of infinite rays, finite segments, or back-projected points, which could aid reader understanding of the transformer input.
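One plausible reading of the '3D ray cloud', sketched as data: each detected keypoint becomes a half-line (camera center plus unit direction) tagged with metadata that lets the transformer attend across cameras, joints, and time. The field names and flat feature layout below are illustrative assumptions, not the paper's definition.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RayToken:
    """One hypothetical transformer input token: a back-projected keypoint.

    Interpreted as a half-line from the camera center (not a finite
    segment), plus identifiers supporting asynchronous multi-camera input.
    """
    origin: np.ndarray      # (3,) camera center in the world frame
    direction: np.ndarray   # (3,) unit ray direction in the world frame
    timestamp: float        # capture time; cameras may be asynchronous
    camera_id: int          # which headset camera produced the detection
    joint_id: int           # body keypoint index (e.g. a COCO-style ordering)

def to_feature(tok: RayToken) -> np.ndarray:
    """Flatten a ray token into a 9-dim vector for an embedding layer."""
    return np.concatenate([tok.origin, tok.direction,
                           [tok.timestamp, tok.camera_id, tok.joint_id]])
```

A diagram or definition along these lines would make the transformer's input space concrete for the reader.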
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point-by-point below and have revised the manuscript to improve clarity and add supporting analysis where needed.
read point-by-point responses
Referee: [Abstract] The central claim that LAMP 'achieves state-of-the-art results on monocular benchmarks' while 'significantly outperforming baselines for our targeted egocentric setting' is presented without reference to specific datasets, baselines, metrics, error bars, ablation studies, or quantitative tables; the performance assertions are therefore unverifiable, despite being load-bearing for the contribution.
Authors: We agree that the abstract would be stronger with more specific references to make the claims immediately verifiable. In the revised version, we have updated the abstract to explicitly name the monocular benchmarks (Human3.6M and MPI-INF-3DHP), the primary metrics (MPJPE and PCK@150mm), and note the comparison to baselines such as VideoPose3D and PoseFormer, while directing readers to the quantitative tables and ablations in Section 4. This preserves the abstract's length and focus while addressing the verifiability concern. revision: yes
Referee: [Method] Lift-then-fit pipeline: The pipeline explicitly requires accurate device 6DoF motion and calibration to convert 2D keypoints into undistorted 3D rays in world coordinates before the transformer stage. No analysis of sensitivity to localization noise, calibration drift, or SLAM errors is provided, despite these being common in real headset captures and directly impacting the separation of observer/target motion and the applicability of the learned world-space prior.
Authors: The referee is correct that no sensitivity analysis was included in the original submission. We have added a new subsection (4.5) in the revised manuscript that quantifies robustness: we inject controlled Gaussian noise into the 6DoF trajectories and camera extrinsics at levels representative of real SLAM drift (up to 5cm/2deg), and report the resulting MPJPE degradation on both monocular and multi-camera egocentric sequences. The results indicate that the spatio-temporal transformer maintains reasonable accuracy for moderate noise thanks to its temporal modeling, and we discuss practical mitigations such as using the headset's uncertainty estimates. This directly addresses the applicability concern for real headset captures. revision: yes
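The described perturbation protocol (Gaussian noise up to roughly 5 cm in translation and 2 degrees in rotation) could be implemented along these lines. The function, parameter names, and noise model are a hypothetical reconstruction of the stated experiment, using Rodrigues' formula for the small random rotation.

```python
import numpy as np

def perturb_pose(R, t, sigma_t=0.05, sigma_deg=2.0, rng=None):
    """Inject Gaussian localization noise into one 6 DoF pose.

    R, t:      (3, 3) rotation and (3,) translation of the device pose
    sigma_t:   translation noise std in meters (0.05 ~ the 5 cm level cited)
    sigma_deg: rotation noise std in degrees, applied as a random
               small-angle rotation via the axis-angle exponential map
    """
    rng = np.random.default_rng() if rng is None else rng
    t_noisy = t + rng.normal(0.0, sigma_t, size=3)
    # Random axis-angle vector with Gaussian-distributed components.
    w = rng.normal(0.0, np.deg2rad(sigma_deg), size=3)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return R, t_noisy
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues' formula: exp of the skew-symmetric axis-angle matrix.
    dR = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return dR @ R, t_noisy
```

Sweeping `sigma_t` and `sigma_deg` over the poses feeding the lifting step, then re-running tracking, would produce exactly the MPJPE-vs-noise curves the referee asks for.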
Circularity Check
No circularity in derivation chain
full rationale
The paper's core pipeline lifts 2D keypoints to a metric 3D ray cloud using externally supplied device 6 DoF poses and calibration, then trains a spatio-temporal transformer to regress 3D human motion from that cloud. This is an empirical learning procedure whose outputs are not equivalent to its inputs by construction; no step renames a fitted parameter as a prediction, invokes a self-citation uniqueness theorem, or smuggles in an ansatz. Performance claims rest on external benchmark evaluation rather than on any tautological reduction, so the derivation chain is free of circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- Spatio-temporal transformer hyperparameters
axioms (1)
- domain assumption: Accurate 6 DoF device motion and camera calibration are available at inference time