arxiv: 2501.09411 · v3 · submitted 2025-01-16 · 💻 cs.CV

Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Pith reviewed 2026-05-23 05:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords WiFi-based human pose estimation · cross-domain gap · structural fidelity gap · contrastive learning · skeletal topology constraints · self-supervised pretraining · masked signal modeling · hybrid decoder

The pith

WiFi-based pose estimation improves by first learning domain-consistent signal features through contrastive pretraining and then decoding them with explicit skeletal topology rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two overlooked problems in turning WiFi signals into human skeletons: large differences in pose distributions across environments and unrealistic joint placements or bone lengths in the output. It addresses these by creating a two-phase system that first pretrains on masked WiFi data using temporal consistency contrastive learning plus uniformity regularization to produce robust, motion-aware representations, then decodes those representations with a hybrid architecture that adds direct constraints on how joints connect and relate. If successful, the approach yields more accurate 2D and 3D pose estimates on multiple benchmark datasets without requiring paired camera data in every new setting.
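
The masked-pretraining step described above can be pictured with a toy sketch. It assumes a hypothetical CSI array of shape (channels, time) and one simple strategy, zeroing random time steps. The paper compares several masking strategies (Figure 3) and this page does not say which one it adopts, so `mask_time_patches`, `reconstruction_loss`, and the default ratio are illustrative names and values, not the authors' implementation.

```python
import numpy as np


def mask_time_patches(csi, mask_ratio=0.5, rng=None):
    """Zero out a random subset of time steps in a CSI array.

    csi: array whose last axis is time (hypothetical layout).
    Returns (masked copy, boolean mask over time, True = masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_time = csi.shape[-1]
    n_masked = int(round(mask_ratio * n_time))
    masked_idx = rng.choice(n_time, size=n_masked, replace=False)
    mask = np.zeros(n_time, dtype=bool)
    mask[masked_idx] = True
    masked = csi.copy()
    masked[..., mask] = 0.0
    return masked, mask


def reconstruction_loss(pred, target, mask):
    """Mean squared error computed only on the masked time steps."""
    diff = (pred - target)[..., mask]
    return float(np.mean(diff ** 2))
```

In a real pipeline an encoder-decoder would produce `pred` from the masked input; scoring only the masked positions is the usual masked-autoencoder convention and is assumed here.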

Core claim

DT-Pose reformulates WiFi human pose estimation in two phases. First, temporal consistency contrastive learning with uniformity regularization, embedded in self-supervised masked pretraining, learns domain-consistent representations. Second, a hybrid architecture decodes poses under explicit topology constraints. The paper claims this closes the cross-domain gap caused by pose distribution shifts and the structural fidelity gap caused by missing spatial priors, leading to superior performance on 2D and 3D tasks across benchmarks.
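
To make the pretraining objective concrete, here is a minimal NumPy sketch under stated assumptions: temporal consistency is modeled as an InfoNCE loss between embeddings of temporally adjacent windows, and uniformity regularization uses the log-mean-exp pairwise Gaussian potential from the contrastive-learning literature. This page does not give the paper's exact loss, so the function names, `temperature`, `t`, and `weight` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def temporal_info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE where z_a[i] and z_b[i] embed temporally adjacent windows.

    Matching indices are positives; all other rows in the batch are negatives.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.

    Equals 0 when all embeddings coincide and decreases as they spread out.
    """
    z = l2_normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def pretraining_objective(z_a, z_b, weight=0.1):
    """Contrastive term plus a weighted uniformity term (weight is assumed)."""
    unif = 0.5 * (uniformity_loss(z_a) + uniformity_loss(z_b))
    return temporal_info_nce(z_a, z_b) + weight * unif
```

The uniformity term penalizes embeddings that bunch together, which is one plausible reading of how the regularizer counters collapse on sparse signals.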

What carries the argument

The DT-Pose two-phase framework: temporal consistency contrastive learning with uniformity regularization for domain-consistent WiFi representations, plus a hybrid decoder that adds explicit skeletal topology constraints on adjacent and overarching joint relationships.
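
One way to picture "adjacent and overarching joint relationships" is as a skeleton graph: adjacency marks joints that share a bone, and hop distance captures longer-range relations. The sketch below uses a hypothetical six-joint toy skeleton; the joint names, the bone list, and the reading of "overarching" as multi-hop structure are assumptions, not details taken from the paper.

```python
from collections import deque

# Hypothetical 6-joint toy skeleton; real datasets define their own layouts.
JOINTS = ["pelvis", "spine", "head", "l_shoulder", "l_elbow", "l_wrist"]
BONES = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]  # (parent, child) indices


def build_adjacency(n_joints, bones):
    """Symmetric 0/1 matrix: 1 where two joints share a bone."""
    adj = [[0] * n_joints for _ in range(n_joints)]
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1
    return adj


def hop_distances(adj):
    """All-pairs hop counts along the skeleton, via breadth-first search."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


ADJ = build_adjacency(len(JOINTS), BONES)
HOPS = hop_distances(ADJ)
```

A decoder could use `ADJ` as a mask or bias for local joint interactions and `HOPS` as a distance-aware bias for global ones; how DT-Pose actually injects the constraints is not specified on this page.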

If this is right

  • Predicted skeletons maintain correct joint adjacency and overall body proportions even when WiFi signals are sparse.
  • Representations learned in one environment transfer more reliably to new rooms or sensor placements.
  • Both 2D and 3D pose outputs improve without needing additional labeled camera data at test time.
  • Mode collapse during pretraining is reduced, preserving motion-discriminative information in the signal embeddings.
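
The last implication, reduced collapse, is checkable from the embeddings alone. A common diagnostic, offered here as an assumption rather than the paper's own statistic, is the effective rank of the embedding matrix: the exponential of the entropy of its normalized singular values. The name `effective_rank` is hypothetical.

```python
import numpy as np


def effective_rank(z, eps=1e-12):
    """Effective rank of centered embeddings z with shape (samples, dims).

    Close to 1 when embeddings vary along a single direction; close to
    the embedding dimension when variance is spread evenly.
    """
    z = z - z.mean(axis=0, keepdims=True)
    s = np.linalg.svd(z, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    return float(np.exp(-np.sum(p * np.log(p))))
```

Comparing this value with and without the uniformity term would show whether the regularizer keeps more embedding dimensions in use.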

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining-plus-constrained-decoding pattern could be tested on other sparse sensing modalities such as millimeter-wave radar or acoustic signals for pose recovery.
  • If the topology constraints prove effective, they might be added as a lightweight post-processing step to existing camera-based pose estimators when depth is noisy.
  • Real-time applications such as fall detection in homes could become feasible once cross-domain robustness is confirmed on larger, more varied collections of WiFi recordings.

Load-bearing premise

The temporal consistency contrastive learning and skeletal topology constraints will close the cross-domain and structural fidelity gaps on data beyond the specific benchmark datasets used in the experiments.

What would settle it

A new WiFi dataset collected in an environment whose pose distribution differs markedly from the training sets, where the method produces no measurable gain in joint accuracy or bone-length realism compared with prior WiFi pose estimators.
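
The two quantities in that test, joint accuracy and bone-length realism, can be measured as sketched below. Mean per-joint position error (MPJPE) is a standard pose metric; the bone-length error shown is one simple choice and may not match the paper's reporting. `bones` is a hypothetical list of (parent, child) joint indices.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and true joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def bone_lengths(pose, bones):
    """Length of each bone for poses shaped (..., joints, coords)."""
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    return np.linalg.norm(pose[..., children, :] - pose[..., parents, :], axis=-1)


def bone_length_error(pred, gt, bones):
    """Mean absolute difference between predicted and true bone lengths."""
    return float(np.mean(np.abs(bone_lengths(pred, bones) - bone_lengths(gt, bones))))
```

A rigid shift of the whole skeleton raises MPJPE but leaves the bone-length error at zero, which is why the two numbers are worth reporting separately.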

Figures

Figures reproduced from arXiv: 2501.09411 by Jingcai Guo, Yang Chen.

Figure 1: (a) shows the pose coordinates distribution between the source and target domains. (b) represents the predictions
Figure 2: The pipeline of our method, including the pre-training and pose decoding phases.
Figure 3: Original WiFi CSI signals and different masking strategies on the MM-Fi dataset.
Figure 4: Performance on the MM-Fi (Protocol 1 - Setting 1) dataset with different masking ratios.
Figure 5: t-SNE visualization of WiFi representations
Figure 6: Dimension collapse. (a) represents the statistics
Figure 7: WiFi visualizations on the MM-Fi (P3-S1). The first row represents the raw WiFi signals, the second row represents
Figure 8: Predicted poses of the MetaFi++ [45], HPE-Li [10], and our proposed method among three different datasets.
Figure 9: t-SNE visualization of WiFi representations. The first row denotes the WiFi representations extracted on the MM-Fi
Abstract

Robust WiFi-based human pose estimation (HPE) is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. We revisit this problem and reveal two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant discrepancies in pose distributions between source and target domains; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal consistency contrastive learning strategy with uniformity regularization, integrated into a self-supervised masked pretraining paradigm. This design facilitates robust learning of domain-consistent and motion-discriminative WiFi representations while mitigating potential mode collapse caused by signal sparsity. Beyond this, we introduce an effective hybrid decoding architecture that incorporates explicit skeletal topology constraints. By compensating for the inherent absence of spatial priors in WiFi semantic vectors, the decoder enables structured modeling of both adjacent and overarching joint relationships, producing more realistic pose predictions. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in 2D/3D WiFi-based HPE tasks. The associated code is released at https://github.com/cseeyangchen/DT-Pose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DT-Pose, a two-phase framework for WiFi-based 2D/3D human pose estimation that reformulates the task to address the cross-domain gap (via temporal consistency contrastive learning with uniformity regularization in self-supervised masked pretraining) and the structural fidelity gap (via a hybrid decoding architecture incorporating explicit skeletal topology constraints). It claims this yields superior performance on benchmark datasets and releases the associated code.

Significance. If the results hold with appropriate supporting evidence, the work could advance WiFi-based HPE by demonstrating mechanisms for domain-consistent representations and topologically constrained decoding; the public code release supports reproducibility and is a clear strength.

major comments (2)
  1. §5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.
  2. §4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.
minor comments (2)
  1. Abstract: the claim of 'superior performance' and 'extensive experiments' would be more informative if key quantitative results, baseline names, and dataset identifiers were included.
  2. Notation: the uniformity regularization weight and contrastive hyperparameters should be explicitly tabulated in the experimental protocol section for reproducibility.
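
The domain-shift metric requested in major comment 1 can be sketched as follows. This is a biased estimate of squared maximum mean discrepancy (MMD) with an RBF kernel, applied to hypothetical source and target embedding matrices; `bandwidth` is an assumed free parameter that in practice is often set from the median pairwise distance.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

Reporting `mmd2` between source and target representations before and after pretraining would indicate whether the learned features are in fact closer across domains.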

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evaluation that can strengthen the presentation of our claims regarding domain-consistent representations and topology-constrained decoding. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: §5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.

    Authors: We acknowledge that explicit quantification of domain shift would provide stronger support for the generalization claims of the temporal consistency contrastive learning. Our current evaluation demonstrates consistent improvements across multiple benchmarks with varying pose and signal distributions, which indirectly supports the domain-consistency objective. In the revised version, we will compute and report MMD distances between source and target representation distributions before and after pretraining, along with OOD skeletal error breakdowns on held-out pose distributions where feasible.

    revision: yes

  2. Referee: §4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.

    Authors: We agree that an ablation isolating the skeletal topology constraints is necessary to clarify their contribution. The hybrid decoder combines multiple components, and while the overall gains are reported, we will add a targeted ablation that removes the explicit topology terms (both adjacent and overarching) and reports the impact on bone-length ratio error and joint displacement metrics in addition to standard pose errors.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation.

Full rationale

The DT-Pose framework is introduced as a two-phase empirical architecture (masked pretraining with temporal consistency contrastive learning plus uniformity regularization, followed by hybrid topology-constrained decoding). No equations or derivations are presented that reduce by construction to their own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or uniqueness theorems imported from the authors' prior work. The central claims rest on benchmark experiments and released code, which constitute external falsifiability rather than self-referential definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that WiFi signals encode usable pose information and on standard machine-learning training assumptions; multiple unspecified hyperparameters are present but typical for the regime.

free parameters (2)
  • contrastive learning hyperparameters and uniformity regularization weight
    Standard training choices required to stabilize the masked pretraining and avoid mode collapse; not detailed in the abstract.
  • decoder architecture hyperparameters
    Choices for how the skeletal topology constraints are incorporated, tuned for the task.
axioms (2)
  • domain assumption: WiFi signals contain sufficient information about human joint positions and motion
    Invoked in the problem formulation and task definition.
  • domain assumption: Human skeletal topology provides useful explicit constraints for pose decoding
    Central to the second phase of the framework.

pith-pipeline@v0.9.0 · 5776 in / 1340 out tokens · 72785 ms · 2026-05-23T05:39:08.847858+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 3 internal anchors

  [1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
  [3] Weiya Chen, Chenchen Yu, Chenyu Tu, Zehua Lyu, Jing Tang, Shiqi Ou, Yan Fu, and Zhidong Xue. A survey on hand pose estimation with wearable sensors and computer-vision-based methods. Sensors, 20(4):1074, 2020.
  [4] Yang Chen, Jingcai Guo, Song Guo, and Dacheng Tao. Neuron: Learning context-aware evolving representations for zero-shot skeleton action recognition. arXiv preprint arXiv:2411.11288, 2024.
  [5] Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, and Ling Wang. Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 778–786, 2024.
  [6] Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, and Hong Cheng. Vision-language meets the skeleton: Progressively distillation with cross-modal knowledge for 3d action representation learning. IEEE Transactions on Multimedia, 2024.
  [7] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021.
  [8] Mingyue Cheng, Qi Liu, Zhiding Liu, Hao Zhang, Rujiao Zhang, and Enhong Chen. Timemae: Self-supervised representations of time series with decoupled masked autoencoders. arXiv preprint arXiv:2303.00320, 2023.
  [9] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
  [10] Toan D Gian, Tien Dac Lai, Thien Van Luong, Kok-Seng Wong, and Van-Dinh Nguyen. Hpe-li: Wifi-enabled lightweight dual selective kernel convolution for human pose estimation. In European Conference on Computer Vision, pages 93–111. Springer, 2025.
  [11] Fuxiang Deng, Emil Jovanov, Houbing Song, Weisong Shi, Yuan Zhang, and Wenyao Xu. Wildar: Wifi signal-based lightweight deep learning model for human activity recognition. IEEE Internet of Things Journal, 2023.
  [12] Jacob Devlin. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  [13] Junqiao Fan, Jianfei Yang, Yuecong Xu, and Lihua Xie. Diffusion model is a good pose estimator from 3d rf-vision. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
  [14] Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu, and Wanqing Li. Focal and global spatial-temporal transformer for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, pages 382–398, 2022.
  [15] Jia Gong, Lin Geng Foo, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, and Jun Liu. Diffpose: Toward more reliable 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13041–13051, 2023.
  [16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  [18] Tian He, Yang Chen, Xu Gao, Ling Wang, Ting Hu, and Hong Cheng. Enhancing skeleton-based action recognition with language descriptions from pre-trained large multimodal models. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  [19] Tian He, Yang Chen, Ling Wang, and Hong Cheng. An expert-knowledge-based graph convolutional network for skeleton-based physical rehabilitation exercises assessment. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2024.
  [20] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
  [21] Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. Towards 3d human pose construction using wifi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
  [22] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia, 25:1282–1293, 2022.
  [23] Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, et al. Group pose: A simple baseline for end-to-end multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15029–15038, 2023.
  [24] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
  [25] Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In Pattern recognition. ICPR international workshops and challenges: virtual event, January 10–15, 2021, Proceedings, Part III, pages 694–701. Springer, 2021.
  [26] Alec Radford. Improving language understanding by generative pre-training. 2018.
  [27] Yili Ren, Zi Wang, Sheng Tan, Yingying Chen, and Jie Yang. Winect: 3d human pose tracking for free-form activity using commodity wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–29, 2021.
  [28] Yili Ren, Zi Wang, Yichao Wang, Sheng Tan, Yingying Chen, and Jie Yang. Gopose: 3d human pose estimation using wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–25, 2022.
  [29] Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022.
  [30] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE transactions on pattern analysis and machine intelligence, 45(2):1474–1488, 2022.
  [31] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
  [32] Fei Wang, Stanislav Panev, Ziyi Dai, Jinsong Han, and Dong Huang. Can WiFi estimate person pose? arXiv preprint arXiv:1904.00277, 2019.
  [33] Fei Wang, Sanping Zhou, Stanislav Panev, Jinsong Han, and Dong Huang. Person-in-wifi: Fine-grained person perception using wifi. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5452–5461, 2019.
  [34] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
  [35] Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM international conference on Multimedia, pages 102–106, 2016.
  [36] Yiming Wang, Lingchao Guo, Zhaoming Lu, Xiangming Wen, Shuang Zhou, and Wanyu Meng. From point to space: 3d moving human pose estimation using commodity wifi. IEEE Communications Letters, 25(7):2235–2239, 2021.
  [37] Yihan Wang, Muyang Li, Han Cai, Wei-Ming Chen, and Song Han. Lite pose: Efficient architecture design for 2d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13126–13136, 2022.
  [38] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023.
  [39] Kangwei Yan, Fei Wang, Bo Qian, Han Ding, Jinsong Han, and Xing Wei. Person-in-wifi 3d: End-to-end multi-person 3d pose estimation with wi-fi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 969–978, 2024.
  [40] Jianfei Yang, Xinyan Chen, Han Zou, Dazhuo Wang, and Lihua Xie. Autofi: Toward automatic wi-fi human sensing via geometric self-supervised learning. IEEE Internet of Things Journal, 10(8):7416–7425, 2022.
  [41] Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie. Mm-fi: Multi-modal non-intrusive 4d human dataset for versatile wireless sensing. Advances in Neural Information Processing Systems, 36, 2024.
  [42] Feng Zhang, Xiatian Zhu, and Chen Wang. Single person pose estimation: a survey. arXiv preprint arXiv:2109.10056, 2021.
  [43] Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Sijie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1):1–37, 2023.
  [44] Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, et al. Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4478–4487, 2022.
  [45] Yunjiao Zhou, He Huang, Shenghai Yuan, Han Zou, Lihua Xie, and Jianfei Yang. Metafi++: Wifi-enabled transformer-based human pose estimation for metaverse avatar simulation. IEEE Internet of Things Journal, 10(16):14128–14136, 2023.
  [46] Yunjiao Zhou, Jianfei Yang, He Huang, and Lihua Xie. Adapose: Towards cross-site device-free human pose estimation with commodity wifi. IEEE Internet of Things Journal, 2024.
  [47] Yue Zhou, Aichun Zhu, Caojie Xu, Fangqiang Hu, and Yifeng Li. Perunet: Deep signal channel attention in unet for wifi-based human pose estimation. IEEE Sensors Journal, 22(20):19750–19760, 2022.

A. Appendix

A.1. Implementation Details

In the pre-training phase, the encoder-decoder is trained for 400 epochs using the AdamW, employing a batch size of 2...