Towards Robust and Realistic Human Pose Estimation via WiFi Signals
Pith reviewed 2026-05-23 05:39 UTC · model grok-4.3
The pith
WiFi-based pose estimation improves by first learning domain-consistent signal features through contrastive pretraining and then decoding them with explicit skeletal topology rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reformulates WiFi human pose estimation in two phases. The first is domain-consistent representation learning: temporal consistency contrastive learning with uniformity regularization, run inside self-supervised masked pretraining. The second is topology-constrained pose decoding in a hybrid architecture. Together these are claimed to close the cross-domain gap caused by pose distribution shifts and the structural fidelity gap caused by missing spatial priors, leading to superior performance on 2D and 3D tasks across benchmarks.
What carries the argument
The DT-Pose two-phase framework: temporal consistency contrastive learning with uniformity regularization for domain-consistent WiFi representations, plus a hybrid decoder that adds explicit skeletal topology constraints on adjacent and overarching joint relationships.
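A minimal sketch of how the first-phase objective could be composed, assuming an InfoNCE-style loss over temporally adjacent views plus a log-mean Gaussian-potential uniformity term, a form common in the contrastive learning literature. The function names, the temperature, the kernel scale `t`, and the weight `lam` are illustrative assumptions; the material summarized here does not give the paper's exact loss.

```python
import math

def l2_normalize(v):
    # Project an embedding onto the unit sphere (guard against zero vectors).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: positives[i] is assumed to be a temporally adjacent
    view of anchors[i]; the other rows of the batch act as negatives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i in range(len(a)):
        logits = [dot(a[i], p[j]) / temperature for j in range(len(p))]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs (needs >= 2 rows).
    It is 0 when every embedding coincides and decreases as they spread."""
    z = [l2_normalize(v) for v in embeddings]
    vals = []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            sq = sum((x - y) ** 2 for x, y in zip(z[i], z[j]))
            vals.append(math.exp(-t * sq))
    return math.log(sum(vals) / len(vals))

def pretrain_loss(anchors, positives, lam=0.5):
    # lam is a hypothetical weight on the uniformity regularizer.
    return temporal_info_nce(anchors, positives) + lam * uniformity(anchors)
```

In this sketch the uniformity term is what penalizes collapse: a batch whose embeddings all coincide scores 0, the worst value, regardless of how well the contrastive term is satisfied.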
If this is right
- Predicted skeletons maintain correct joint adjacency and overall body proportions even when WiFi signals are sparse.
- Representations learned in one environment transfer more reliably to new rooms or sensor placements.
- Both 2D and 3D pose outputs improve without needing additional labeled camera data at test time.
- Mode collapse during pretraining is reduced, preserving motion-discriminative information in the signal embeddings.
Where Pith is reading between the lines
- The same pretraining-plus-constrained-decoding pattern could be tested on other sparse sensing modalities such as millimeter-wave radar or acoustic signals for pose recovery.
- If the topology constraints prove effective, they might be added as a lightweight post-processing step to existing camera-based pose estimators when depth is noisy.
- Real-time applications such as fall detection in homes could become feasible once cross-domain robustness is confirmed on larger, more varied collections of WiFi recordings.
Load-bearing premise
Temporal consistency contrastive learning and skeletal topology constraints will keep closing the cross-domain and structural fidelity gaps on data distributions and skeletal structures outside the specific benchmark datasets used in the experiments.
What would settle it
A new WiFi dataset collected in an environment whose pose distribution differs markedly from the training sets, where the method produces no measurable gain in joint accuracy or bone-length realism compared with prior WiFi pose estimators.
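The joint-accuracy half of this test is commonly scored with mean per-joint position error (MPJPE). A minimal sketch, assuming each pose sequence is a list of frames and each frame a list of (x, y, z) joint coordinates; the helper names and data layout are hypothetical, and the paper's own evaluation protocol may differ.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: Euclidean distance between predicted
    and ground-truth joints, averaged over all frames and joints."""
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)
            count += 1
    return total / count

def mpjpe_gain(baseline_pred, method_pred, gt):
    """Positive when the method is closer to ground truth than the baseline.
    A value near zero on a new domain is the outcome described above."""
    return mpjpe(baseline_pred, gt) - mpjpe(method_pred, gt)
```

A gain would still need a significance check across recordings before being called "measurable"; the sketch only computes the point estimate.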
Original abstract
Robust WiFi-based human pose estimation (HPE) is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. We revisit this problem and reveal two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant discrepancies in pose distributions between source and target domains; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal consistency contrastive learning strategy with uniformity regularization, integrated into a self-supervised masked pretraining paradigm. This design facilitates robust learning of domain-consistent and motion-discriminative WiFi representations while mitigating potential mode collapse caused by signal sparsity. Beyond this, we introduce an effective hybrid decoding architecture that incorporates explicit skeletal topology constraints. By compensating for the inherent absence of spatial priors in WiFi semantic vectors, the decoder enables structured modeling of both adjacent and overarching joint relationships, producing more realistic pose predictions. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in 2D/3D WiFi-based HPE tasks. The associated code is released at https://github.com/cseeyangchen/DT-Pose.
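To make the "self-supervised masked pretraining paradigm" concrete, here is a minimal sketch of the masking step and its reconstruction loss. It assumes that whole time frames of the WiFi sequence are masked and that the loss is a mean squared error over masked frames only; the abstract does not say what unit is masked, so the granularity, the mask ratio, and all names are assumptions.

```python
import random

def mask_frames(frames, mask_ratio=0.5, seed=0):
    """Split a sequence into visible frames (encoder input) and masked
    frames (reconstruction targets). mask_ratio is an arbitrary choice."""
    rng = random.Random(seed)
    n_mask = int(round(len(frames) * mask_ratio))
    masked = set(rng.sample(range(len(frames)), n_mask))
    visible = [(i, f) for i, f in enumerate(frames) if i not in masked]
    targets = [(i, f) for i, f in enumerate(frames) if i in masked]
    return visible, targets

def reconstruction_mse(predicted, targets):
    """Mean squared error over masked frames only. `predicted` maps a frame
    index to the reconstructed frame; assumes at least one masked frame."""
    total, count = 0.0, 0
    for i, frame in targets:
        for p, t in zip(predicted[i], frame):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In a two-term setup, a reconstruction loss of this kind would be summed with the contrastive objective during pretraining; how the paper weights the two is not stated in the abstract.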
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DT-Pose, a two-phase framework for WiFi-based 2D/3D human pose estimation that reformulates the task to address the cross-domain gap (via temporal consistency contrastive learning with uniformity regularization in self-supervised masked pretraining) and the structural fidelity gap (via a hybrid decoding architecture incorporating explicit skeletal topology constraints). It claims this yields superior performance on benchmark datasets and releases the associated code.
Significance. If the results hold with appropriate supporting evidence, the work could advance WiFi-based HPE by demonstrating mechanisms for domain-consistent representations and topologically constrained decoding; the public code release supports reproducibility and is a clear strength.
major comments (2)
- §5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.
- §4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.
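The domain-shift metric named in the first major comment can be sketched as a biased estimate of squared maximum mean discrepancy (MMD) with an RBF kernel. The bandwidth `sigma` and the representation of embeddings as plain lists of floats are assumptions made for illustration; nothing here is taken from the paper's code.

```python
import math

def rbf(x, y, sigma=1.0):
    # Gaussian (RBF) kernel between two embedding vectors.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings:
    mean k(s, s') + mean k(t, t') - 2 * mean k(s, t)."""
    def mean_k(xs, ys):
        return sum(rbf(x, y, sigma) for x in xs for y in ys) / (len(xs) * len(ys))
    return mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)
```

Reporting this value for source and target embeddings before and after pretraining would show whether the representation distributions actually move closer, which is the evidence the comment asks for. The result depends on `sigma`, so the bandwidth choice should be reported alongside it.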
minor comments (2)
- Abstract: the claim of 'superior performance' and 'extensive experiments' would be more informative if key quantitative results, baseline names, and dataset identifiers were included.
- Notation: the uniformity regularization weight and contrastive hyperparameters should be explicitly tabulated in the experimental protocol section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evaluation that can strengthen the presentation of our claims regarding domain-consistent representations and topology-constrained decoding. We address each major comment below and will revise the manuscript accordingly.
- Referee: §5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.
  Authors: We acknowledge that explicit quantification of domain shift would provide stronger support for the generalization claims of the temporal consistency contrastive learning. Our current evaluation demonstrates consistent improvements across multiple benchmarks with varying pose and signal distributions, which indirectly supports the domain-consistency objective. In the revised version, we will compute and report MMD distances between source and target representation distributions before and after pretraining, along with OOD skeletal error breakdowns on held-out pose distributions where feasible.
  Revision: yes
- Referee: §4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.
  Authors: We agree that an ablation isolating the skeletal topology constraints is necessary to clarify their contribution. The hybrid decoder combines multiple components, and while the overall gains are reported, we will add a targeted ablation that removes the explicit topology terms (both adjacent and overarching) and reports the impact on bone-length ratio error and joint displacement metrics in addition to standard pose errors.
  Revision: yes
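One plausible definition of the bone-length ratio error promised for the ablation is the mean absolute deviation of the predicted-to-true bone-length ratio from 1. The report does not define the metric, so this definition, the bone list, and the function names are assumptions.

```python
import math

# Hypothetical 4-joint chain; a real skeleton would list every bone
# as a (parent, child) pair of joint indices.
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_lengths(pose, bones):
    # Length of each bone as the distance between its two joints.
    return [math.dist(pose[i], pose[j]) for i, j in bones]

def bone_length_ratio_error(pred, gt, bones):
    """Mean over bones of |predicted length / true length - 1|.
    Assumes every ground-truth bone has non-zero length."""
    pred_len = bone_lengths(pred, bones)
    gt_len = bone_lengths(gt, bones)
    return sum(abs(p / g - 1.0) for p, g in zip(pred_len, gt_len)) / len(bones)
```

Under this definition a rigidly translated skeleton scores zero even though every joint is displaced, which is why the metric complements joint displacement rather than replacing it.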
Circularity Check
No circularity: empirical framework with independent experimental validation.
The DT-Pose framework is introduced as a two-phase empirical architecture (masked pretraining with temporal consistency contrastive learning plus uniformity regularization, followed by hybrid topology-constrained decoding). No equations or derivations are presented that reduce by construction to their own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or uniqueness theorems imported from the authors' prior work. The central claims rest on benchmark experiments and released code, which constitute external falsifiability rather than self-referential definition.
Axiom & Free-Parameter Ledger
free parameters (2)
- contrastive learning hyperparameters and uniformity regularization weight
- decoder architecture hyperparameters
axioms (2)
- domain assumption: WiFi signals contain sufficient information about human joint positions and motion
- domain assumption: Human skeletal topology provides useful explicit constraints for pose decoding
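The second axiom can be illustrated by how skeletal topology is typically turned into explicit structure for a decoder. In this sketch, "adjacent" relationships are read as one-hop links in the skeleton graph and "overarching" relationships as hop distances between all joint pairs. That reading, the 5-joint skeleton, and the function names are assumptions; the source does not say how the paper encodes either relationship.

```python
from collections import deque

# Hypothetical 5-joint skeleton: 0-1, 1-2, 1-3, 3-4.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]

def adjacency(num_joints, edges):
    """Symmetric 0/1 matrix marking directly connected joints."""
    adj = [[0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

def hop_distances(num_joints, edges):
    """Breadth-first search from every joint; entry [i][j] is the number of
    bones on the shortest path from joint i to joint j (None if unreachable)."""
    adj = adjacency(num_joints, edges)
    dist = [[None] * num_joints for _ in range(num_joints)]
    for start in range(num_joints):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(num_joints):
                if adj[u][v] and dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist
```

A decoder could use the adjacency matrix to restrict attention or graph convolution to neighboring joints, and the hop-distance matrix as a bias for longer-range relations; either use is a design option, not a description of DT-Pose.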
A. Appendix
A.1. Implementation Details
In the pre-training phase, the encoder-decoder is trained for 400 epochs using the AdamW optimizer, employing a batch size of 2...