LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-timestamp 3D Human Pose Estimation
Pith reviewed 2026-05-24 04:36 UTC · model grok-4.3
The pith
LiCamPose fuses multi-view LiDAR point clouds with RGB images in a volumetric network to estimate 3D human poses from a single timestamp.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a volumetric network can integrate multi-view RGB features with sparse LiDAR point clouds to produce accurate single-timestamp 3D human poses, and that pretraining on synthetically generated data plus unsupervised domain adaptation suffices to transfer the model to real scenes without requiring any manual 3D annotations.
What carries the argument
Volumetric architecture that fuses multi-view RGB and sparse point cloud inputs, paired with a synthetic dataset generator and an unsupervised domain adaptation strategy.
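To make the fusion idea concrete, here is a minimal sketch of one common way to build a fused volume, in the style of voxel-based multi-view methods. Each voxel center is projected into every camera, the 2D feature at that pixel is sampled and averaged across views, and a LiDAR occupancy channel is appended. The function names, the 3x4 pinhole camera format, nearest-pixel sampling, single-channel features, and the occupancy channel are all assumptions made for illustration; the material on this page does not specify the paper's actual fusion mechanism.

```python
from itertools import product

def project(P, xyz):
    """Project a 3D point with a 3x4 pinhole matrix P; None if behind the camera."""
    x, y, z = xyz
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    if h[2] <= 1e-9:
        return None
    return h[0] / h[2], h[1] / h[2]

def voxel_centers(origin, size, dims):
    """Yield (index, center) for every voxel of a regular grid."""
    for idx in product(range(dims[0]), range(dims[1]), range(dims[2])):
        yield idx, tuple(origin[a] + (idx[a] + 0.5) * size for a in range(3))

def fuse_volume(feature_maps, cameras, points, origin, size, dims):
    """Return {voxel index: (mean image feature, LiDAR occupancy)}.

    feature_maps: one 2D grid (rows of floats) per view (hypothetical features).
    cameras: one 3x4 projection matrix per view.
    points: LiDAR points (x, y, z) in the same world frame as the grid.
    """
    # Assumed point-cloud channel: 1.0 if a voxel contains at least one point.
    occupied = set()
    for p in points:
        idx = tuple(int((p[a] - origin[a]) // size) for a in range(3))
        if all(0 <= idx[a] < dims[a] for a in range(3)):
            occupied.add(idx)

    volume = {}
    for idx, center in voxel_centers(origin, size, dims):
        samples = []
        for fmap, P in zip(feature_maps, cameras):
            uv = project(P, center)
            if uv is None:
                continue
            col, row = int(round(uv[0])), int(round(uv[1]))
            # Skip views where the voxel projects outside the image.
            if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
                samples.append(fmap[row][col])
        mean_feat = sum(samples) / len(samples) if samples else 0.0
        volume[idx] = (mean_feat, 1.0 if idx in occupied else 0.0)
    return volume
```

In a real pipeline the fused volume would feed a 3D network that regresses joint locations; this sketch stops at the input construction, which is the step the summary above describes.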
If this is right
- Single-frame multimodal estimation becomes feasible in scenarios where image-only methods lose accuracy due to motion blur or occlusion.
- Training costs drop because no manual 3D pose labels are required on target real-world data.
- The same pipeline can be applied across public datasets, synthetic data, and newly collected challenging recordings while maintaining reported generalization.
- Pose output is available at each timestamp independently, supporting applications that need immediate rather than smoothed multi-frame results.
Where Pith is reading between the lines
- The same fusion and adaptation steps could extend to other body-tracking tasks such as hand or animal pose if the volumetric backbone accepts the corresponding input densities.
- Performance on the BasketBall dataset suggests the method may tolerate fast articulated motion better than purely image-based estimators, which could be verified by measuring error versus speed.
- If the domain gap between synthetic and real point clouds proves larger than expected, adding a small number of real LiDAR-only frames for adaptation might still avoid full 3D annotation.
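The error-versus-speed check suggested in the list above could be sketched as follows. This is an illustrative analysis, not something the paper reports: it assumes reference poses are available for consecutive frames, estimates each joint's speed from those reference poses, and averages the per-joint position error within speed bins. The function names and the binning scheme are hypothetical.

```python
import bisect
import math

def joint_speeds(prev, curr, dt):
    """Per-joint speed between two consecutive reference poses (units per second)."""
    return [math.dist(a, b) / dt for a, b in zip(prev, curr)]

def error_by_speed(gt_frames, pred_frames, dt, bin_edges):
    """Mean per-joint error for each speed bin.

    gt_frames, pred_frames: lists of poses; a pose is a list of (x, y, z) joints.
    bin_edges: sorted speed thresholds; bin i covers speeds below bin_edges[i],
    and the last bin covers everything at or above the final edge.
    Returns one mean error per bin, or None for bins with no samples.
    """
    sums = [0.0] * (len(bin_edges) + 1)
    counts = [0] * (len(bin_edges) + 1)
    for t in range(1, len(gt_frames)):
        speeds = joint_speeds(gt_frames[t - 1], gt_frames[t], dt)
        for s, g, p in zip(speeds, gt_frames[t], pred_frames[t]):
            b = bisect.bisect_right(bin_edges, s)
            sums[b] += math.dist(g, p)
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

If the fused model tolerates fast motion better than an image-only estimator, its error curve across bins should rise more slowly than the baseline's.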
Load-bearing premise
The volumetric fusion step successfully merges the sparse point clouds with RGB features and the synthetic pretraining transfers to real data without any manual 3D labels.
What would settle it
Demonstrating that LiCamPose produces lower accuracy than single-modality baselines on the self-collected BasketBall dataset when run without any 3D annotations would falsify the transfer claim.
Original abstract
Several methods have been proposed to estimate 3D human pose from multi-view images, achieving satisfactory performance on public datasets collected under relatively simple conditions. However, there are limited approaches studying extracting 3D human skeletons from multimodal inputs, such as RGB and point cloud data. To address this gap, we introduce LiCamPose, a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses via single frame. We demonstrate the effectiveness of the volumetric architecture in combining these modalities. Furthermore, to circumvent the need for manually labeled 3D human pose annotations, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy to train a 3D human pose estimator without manual annotations. To validate the generalization capability of our method, LiCamPose is evaluated on four datasets, including two public datasets, one synthetic dataset, and one challenging self-collected dataset named BasketBall, covering diverse scenarios. The results demonstrate that LiCamPose exhibits great generalization performance and significant application potential. The code, generator, and datasets will be made available upon acceptance of this paper.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LiCamPose, a single-timestamp 3D human pose estimation pipeline that fuses multi-view RGB images with sparse LiDAR point clouds via a volumetric architecture. It pretrains on synthetically generated data and applies an unsupervised domain adaptation strategy to train without manual 3D annotations, then evaluates generalization on two public datasets, one synthetic dataset, and a self-collected BasketBall dataset, claiming strong performance and application potential.
Significance. If the volumetric fusion and unsupervised DA transfer succeed on real multimodal data, the approach could reduce reliance on expensive 3D annotations for pose estimation in challenging outdoor or sports scenarios. The release of code, generator, and datasets would further support reproducibility, but the current presentation provides insufficient quantitative grounding to evaluate these contributions.
major comments (2)
- [Abstract] Abstract: the central claims of 'great generalization performance' and 'significant application potential' on four datasets (including the challenging real BasketBall set) rest on unverified transfer from synthetic pretraining via unsupervised DA, yet the abstract supplies no MPJPE values, error bars, ablation results, or comparison to baselines on real vs. synthetic splits.
- [Method] Method description (paragraph on unsupervised domain adaptation): the strategy for aligning synthetic and real distributions without 3D annotations is stated at a high level only, with no specification of the adaptation losses, feature alignment procedure, or volumetric fusion mechanism; this is load-bearing for the no-annotation claim and the reported results on real data.
minor comments (1)
- [Abstract] The abstract states that 'the code, generator, and datasets will be made available upon acceptance' but does not indicate whether evaluation code or the exact synthetic generator parameters will be released, which would aid verification of the DA pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to strengthen the presentation of results and method details.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claims of 'great generalization performance' and 'significant application potential' on four datasets (including the challenging real BasketBall set) rest on unverified transfer from synthetic pretraining via unsupervised DA, yet the abstract supplies no MPJPE values, error bars, ablation results, or comparison to baselines on real vs. synthetic splits.
Authors: We agree that the abstract is concise and would benefit from quantitative support for the claims. In the revised manuscript we will add key MPJPE numbers, reference the baseline comparisons, and note the real-versus-synthetic splits to better ground the generalization statements.
Revision: yes
- Referee: [Method] Method description (paragraph on unsupervised domain adaptation): the strategy for aligning synthetic and real distributions without 3D annotations is stated at a high level only, with no specification of the adaptation losses, feature alignment procedure, or volumetric fusion mechanism; this is load-bearing for the no-annotation claim and the reported results on real data.
Authors: The method section describes the volumetric fusion architecture and the overall unsupervised adaptation pipeline. We acknowledge, however, that explicit specification of the adaptation losses, feature alignment steps, and fusion details would improve clarity and reproducibility. We will expand the relevant paragraph with these specifications in the revision.
Revision: yes
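For readers unfamiliar with the metric the referee asks for, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference joints. The sketch below shows the plain, unaligned form; the function name and the list-of-tuples input format are assumptions, and the paper may apply an alignment step before measuring that this page does not describe.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: lists of (x, y, z) joints in the same order and the same units.
    The result is in those units (often reported in millimetres).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
```

Reporting this value separately on the synthetic and real splits, with and without the adaptation step, would address the referee's first major comment directly.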
Circularity Check
No circularity: empirical pipeline with no derivation chain or self-referential reductions
The paper describes an engineering pipeline (volumetric fusion of LiDAR point clouds with RGB features, synthetic data generator for pretraining, and unsupervised domain adaptation) evaluated empirically on four datasets. No equations, fitted parameters, predictions, or uniqueness theorems are presented that reduce by construction to inputs or prior self-citations. The central claims rest on reported performance numbers rather than any self-definitional or load-bearing self-citation structure, making the work self-contained against external benchmarks.
Different Scanning Patterns of Point Cloud
There are various methods to obtain or scan the point cloud: 1) randomly sampling the depth map; 2) sampling the depth map using multiple equidistant horizontal lines to mimic Velodyne LiDARs; and 3) sampling the depth map with the "Rose curve" sampling equation as discussed in our paper to replicate Livox LiDARs...
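The three sampling strategies listed above can be sketched as pixel-selection patterns over a depth map. This is only an illustration under stated assumptions: the rose curve here is the textbook polar form r = R cos(k θ) with a hypothetical petal count, not the paper's sampling equation (which the excerpt does not reproduce), and the function names, line count, and depth-map format are invented for the example.

```python
import math
import random

def random_pattern(h, w, n, seed=0):
    """n distinct (row, col) pixels drawn uniformly at random."""
    rng = random.Random(seed)
    return [divmod(i, w) for i in rng.sample(range(h * w), n)]

def line_pattern(h, w, n, lines=4):
    """Pixels along `lines` equidistant horizontal rows (n should be a multiple of lines)."""
    rows = [int((k + 0.5) * h / lines) for k in range(lines)]
    per_line = n // lines
    return [(r, int((j + 0.5) * w / per_line)) for r in rows for j in range(per_line)]

def rose_pattern(h, w, n, petals=5):
    """n pixels along a rose curve r = R*cos(petals*theta) centred on the image."""
    cy, cx = (h - 1) / 2, (w - 1) / 2
    R = min(h, w) / 2 - 1
    pts = []
    for i in range(n):
        theta = math.pi * i / n  # for an odd petal count the curve closes after pi
        r = R * math.cos(petals * theta)
        row = int(round(cy + r * math.sin(theta)))
        col = int(round(cx + r * math.cos(theta)))
        pts.append((row, col))
    return pts

def sample_depth(depth, pattern):
    """Keep (row, col, depth) for sampled pixels with a valid (positive) depth."""
    return [(r, c, depth[r][c]) for r, c in pattern if depth[r][c] > 0]
```

Calling each pattern with the same `n` gives point sets of equal size, so the patterns can be compared at a fixed point budget. Back-projecting the sampled pixels with the camera intrinsics would then yield the simulated point cloud.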
BasketBall
BasketBall is an outdoor dataset capturing a basketball match using four sensor nodes, each comprising one Livox LiDAR and one RGB camera, in a convergent acquisition setup. The dataset presents challenges due to its extensive coverage, occlusions, and the dynamic motions of the players (Figure 8). We have developed an annotation tool to la-...
SyncHuman Generator
We can use our synthetic data generator, SyncHuman, to simulate any arrangement of sensors to observe a scene. As demonstrated in the experiments in our paper, using the same scene setting for both training and testing yields better transfer performance. Figure 8 compares the datasets we generated, PanopticSync and BasketBallSync, wi...
Human Prior Loss
We designed the human prior loss to encourage the network to generate human-like 3D keypoints. The human prior loss comprises three components: 1) the predicted bone lengths should be within a reasonable range; 2) the predicted lengths of symmetric bones should be similar; and 3) the predicted bone angles should be reasonable according to human kinematics. We set a limited length range for all bones. In our case, we set l_min = 0.05 m and l_max = 0.7 m. So th...
[Figure 7. Different scanning patterns of point clouds. All samples shown in this figure are from the same scene, captured at the same time, and contain the same number of points.]
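The first two components of the human prior loss described in this excerpt can be sketched as below. The length bounds of 0.05 m and 0.7 m come from the excerpt; everything else is an assumption made for illustration, including the hinge form of the range penalty, the absolute-difference symmetry term, the weights, and the joint and bone naming. The third component (bone angle plausibility) is left out because the excerpt is truncated before its form is given.

```python
import math

L_MIN, L_MAX = 0.05, 0.7  # metres, as quoted in the excerpt

def bone_length(joints, bone):
    """Length of a bone given as a (joint_a, joint_b) pair of keys into `joints`."""
    a, b = bone
    return math.dist(joints[a], joints[b])

def length_range_term(joints, bones):
    """Assumed hinge penalty: zero inside [L_MIN, L_MAX], linear outside it."""
    total = 0.0
    for bone in bones:
        length = bone_length(joints, bone)
        total += max(0.0, L_MIN - length) + max(0.0, length - L_MAX)
    return total / len(bones)

def symmetry_term(joints, symmetric_pairs):
    """Assumed penalty: mean absolute length difference between left/right bones."""
    diffs = [abs(bone_length(joints, left) - bone_length(joints, right))
             for left, right in symmetric_pairs]
    return sum(diffs) / len(diffs)

def human_prior_loss(joints, bones, symmetric_pairs, w_len=1.0, w_sym=1.0):
    # The bone-angle component is omitted here; its definition is not available.
    return (w_len * length_range_term(joints, bones)
            + w_sym * symmetry_term(joints, symmetric_pairs))
```

A pose with plausible, symmetric limbs scores zero under this sketch, while an over-long or lopsided limb is penalised in proportion to the violation, which is the behaviour the excerpt describes.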
Extended Experiments
In this section, we conduct experiments to verify the advantages of using point cloud input for pedestrian detection. Additionally, we present more examples to explain the relationship between entropy value and pose rationality.
10.1. Human Detection
For evaluating human detection, we assess performance using the established avera...