Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Jingcai Guo; Yang Chen

arxiv: 2501.09411 · v3 · submitted 2025-01-16 · 💻 cs.CV


Pith reviewed 2026-05-23 05:39 UTC · model grok-4.3

keywords: WiFi-based human pose estimation · cross-domain gap · structural fidelity gap · contrastive learning · skeletal topology constraints · self-supervised pretraining · masked signal modeling · hybrid decoder

The pith

WiFi-based pose estimation improves by first learning domain-consistent signal features through contrastive pretraining and then decoding them with explicit skeletal topology rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two overlooked problems in turning WiFi signals into human skeletons: large differences in pose distributions across environments and unrealistic joint placements or bone lengths in the output. It addresses these by creating a two-phase system that first pretrains on masked WiFi data using temporal consistency contrastive learning plus uniformity regularization to produce robust, motion-aware representations, then decodes those representations with a hybrid architecture that adds direct constraints on how joints connect and relate. If successful, the approach yields more accurate 2D and 3D pose estimates on multiple benchmark datasets without requiring paired camera data in every new setting.
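To make the masked-pretraining step concrete, the sketch below hides a fraction of the time steps in a toy signal and scores a reconstruction only on the hidden steps, in the style of a masked autoencoder. It is a minimal illustration under stated assumptions: the function names, the 0.8 masking ratio, and the toy 10-step signal are hypothetical and are not taken from the paper.

```python
import random

def random_mask(num_tokens, mask_ratio, seed=0):
    """Split token indices into a visible list and a masked list."""
    rng = random.Random(seed)
    num_masked = round(num_tokens * mask_ratio)
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked

def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

# Toy sequence: 10 time steps, 3 hypothetical CSI amplitude values each.
signal = [[float(t + c) for c in range(3)] for t in range(10)]
visible, masked = random_mask(len(signal), mask_ratio=0.8)  # ratio is an assumption
encoder_input = [signal[i] for i in visible]  # the encoder sees only these steps
```

Scoring the loss on masked positions alone forces the encoder to infer the hidden steps from context rather than copy its input.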

Core claim

Reformulating WiFi human pose estimation as domain-consistent representation learning via temporal consistency contrastive learning with uniformity regularization inside self-supervised masked pretraining, followed by topology-constrained pose decoding in a hybrid architecture, closes the cross-domain gap caused by pose distribution shifts and the structural fidelity gap caused by missing spatial priors, leading to superior performance on 2D and 3D tasks across benchmarks.
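One plausible reading of "temporal consistency contrastive learning with uniformity regularization" is an InfoNCE-style loss that pulls together embeddings of temporally adjacent signal windows, combined with a term that rewards embeddings for spreading over the unit hypersphere. The sketch below assumes exactly that; the paper's actual losses are not given in this review, and the temperature `tau`, the weight `lam`, the kernel scale `t`, and every function name are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_info_nce(anchors, positives, tau=0.1):
    """InfoNCE: anchors[i] should match positives[i] (its temporal neighbour)
    and mismatch every other positives[j] in the batch."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i in range(len(a)):
        logits = [dot(a[i], pj) / tau for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.
    It is 0 when all embeddings coincide and decreases as they spread out."""
    z = [l2_normalize(v) for v in embeddings]
    pots = []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            sq = sum((x - y) ** 2 for x, y in zip(z[i], z[j]))
            pots.append(math.exp(-t * sq))
    return math.log(sum(pots) / len(pots))

def pretrain_loss(anchors, positives, lam=0.5):
    """Contrastive term plus a weighted uniformity regularizer."""
    return temporal_info_nce(anchors, positives) + lam * uniformity(anchors)
```

Minimizing the uniformity term penalizes batches whose embeddings cluster at one point, which is the collapse the abstract says the regularizer is meant to mitigate.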

What carries the argument

The DT-Pose two-phase framework: temporal consistency contrastive learning with uniformity regularization for domain-consistent WiFi representations, plus a hybrid decoder that adds explicit skeletal topology constraints on adjacent and overarching joint relationships.
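A common way to give a decoder explicit skeletal topology is to propagate per-joint features along the skeleton's adjacency matrix, as graph-convolution layers do. The sketch below shows one parameter-free propagation step on a hypothetical five-joint skeleton. It illustrates only the adjacent-joint half of the idea, and the joint layout, edge list, and function names are assumptions rather than the paper's decoder.

```python
# Hypothetical skeleton: 0=pelvis, 1=spine, 2=head, 3=left hand, 4=right hand
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def normalized_adjacency(num_joints, edges):
    """Adjacency with self-loops, row-normalized so each row sums to 1."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def propagate(features, adj):
    """One graph step: each joint's feature becomes the average of its own
    feature and the features of its skeletal neighbours."""
    dim = len(features[0])
    return [
        [sum(adj[i][k] * features[k][d] for k in range(len(features)))
         for d in range(dim)]
        for i in range(len(adj))
    ]

adj = normalized_adjacency(5, EDGES)
joint_features = [[0.0], [2.0], [4.0], [6.0], [8.0]]  # toy 1-D features
smoothed = propagate(joint_features, adj)
```

Relationships between non-adjacent joints would need a different mechanism, such as attention across all joints, which this sketch does not cover.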

If this is right

Predicted skeletons maintain correct joint adjacency and overall body proportions even when WiFi signals are sparse.
Representations learned in one environment transfer more reliably to new rooms or sensor placements.
Both 2D and 3D pose outputs improve without needing additional labeled camera data at test time.
Mode collapse during pretraining is reduced, preserving motion-discriminative information in the signal embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pretraining-plus-constrained-decoding pattern could be tested on other sparse sensing modalities such as millimeter-wave radar or acoustic signals for pose recovery.
If the topology constraints prove effective, they might be added as a lightweight post-processing step to existing camera-based pose estimators when depth is noisy.
Real-time applications such as fall detection in homes could become feasible once cross-domain robustness is confirmed on larger, more varied collections of WiFi recordings.

Load-bearing premise

The temporal consistency contrastive learning and skeletal topology constraints will close the identified gaps on data distributions and skeletal structures beyond the specific benchmark datasets used in the experiments.

What would settle it

A new WiFi dataset collected in an environment whose pose distribution differs markedly from the training sets, where the method produces no measurable gain in joint accuracy or bone-length realism compared with prior WiFi pose estimators.

Figures

Figures reproduced from arXiv: 2501.09411 by Jingcai Guo, Yang Chen.

**Figure 1.** (a) shows the pose coordinates distribution between the source and target domains. (b) represents the predictions

**Figure 2.** The pipeline of our method, including the pre-training and pose decoding phases.

**Figure 3.** Original WiFi CSI signals and different masking strategies on the MM-Fi dataset.

**Figure 6.** Here, we introduce explicit uniformity regulariza

**Figure 4.** Performance on the MM-Fi (Protocol 1 - Setting 1) dataset with different masking ratios.

**Figure 5.** t-SNE visualization of WiFi representations

**Figure 6.** Dimension collapse. (a) represents the statistics

**Figure 7.** WiFi visualizations on the MM-Fi (P3-S1). The first row represents the raw WiFi signals, the second row represents

**Figure 8.** Predicted poses of the MetaFi++ [45], HPE-Li [10], and our proposed method among three different datasets.

**Figure 9.** t-SNE visualization of WiFi representations. The first row denotes the WiFi representations extracted on the MM-Fi

Original abstract

Robust WiFi-based human pose estimation (HPE) is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. We revisit this problem and reveal two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant discrepancies in pose distributions between source and target domains; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal consistency contrastive learning strategy with uniformity regularization, integrated into a self-supervised masked pretraining paradigm. This design facilitates robust learning of domain-consistent and motion-discriminative WiFi representations while mitigating potential mode collapse caused by signal sparsity. Beyond this, we introduce an effective hybrid decoding architecture that incorporates explicit skeletal topology constraints. By compensating for the inherent absence of spatial priors in WiFi semantic vectors, the decoder enables structured modeling of both adjacent and overarching joint relationships, producing more realistic pose predictions. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in 2D/3D WiFi-based HPE tasks. The associated code is released at https://github.com/cseeyangchen/DT-Pose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DT-Pose combines masked contrastive pretraining and topology constraints for WiFi HPE, but the abstract leaves the performance claims unverified.

The main takeaway is that this paper identifies two practical gaps in WiFi-based human pose estimation (domain shift between source and target data, and distorted skeletal outputs) and proposes a two-phase DT-Pose pipeline to address them. The first phase uses self-supervised masked pretraining with temporal consistency contrastive learning plus uniformity regularization to learn domain-consistent representations. The second phase adds a hybrid decoder with explicit skeletal topology constraints to enforce realistic joint relations and bone lengths. Code is released, which is useful for anyone wanting to reproduce or extend it.

The framing of the gaps and the specific combination of contrastive learning with topology constraints is new for this task, even if the individual pieces draw from established techniques. It does a reasonable job of motivating why standard approaches fall short on sparse WiFi signals and why adding these priors helps.

The soft spots are clear from the abstract alone: no quantitative results, no baseline numbers, no ablation tables, and no error breakdowns are supplied, so the claim of superior performance cannot be checked. The stress-test concern about generalization also lands: the paper would need domain-shift metrics or out-of-distribution tests to show the mechanisms are not just fitting the benchmark distributions.

This work is aimed at researchers in wireless sensing and computer vision who care about privacy-preserving pose estimation. A reader looking for concrete ideas on contrastive pretraining for signals or constrained decoding could extract value. It deserves a serious referee because the problem is well-motivated, the approach is explicit, and the code release allows verification, even though the current evidence is thin.

Referee Report

2 major / 2 minor

Summary. The paper proposes DT-Pose, a two-phase framework for WiFi-based 2D/3D human pose estimation that reformulates the task to address the cross-domain gap (via temporal consistency contrastive learning with uniformity regularization in self-supervised masked pretraining) and the structural fidelity gap (via a hybrid decoding architecture incorporating explicit skeletal topology constraints). It claims this yields superior performance on benchmark datasets and releases the associated code.

Significance. If the results hold with appropriate supporting evidence, the work could advance WiFi-based HPE by demonstrating mechanisms for domain-consistent representations and topologically constrained decoding; the public code release supports reproducibility and is a clear strength.

major comments (2)

§5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.
§4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.
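The maximum mean discrepancy (MMD) named in the first major comment can be estimated directly from two sets of feature vectors. The sketch below uses the biased estimator with an RBF kernel. The bandwidth `gamma`, the function names, and the toy features are assumptions chosen for illustration, not values from the paper.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy features: the "target" set is the source set shifted by 3 in each dimension.
source_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_feats = [[x + 3.0, y + 3.0] for x, y in source_feats]
gap = mmd2(source_feats, target_feats)
```

Computing this quantity on encoder outputs before and after pretraining would show whether the source and target representations actually move closer together.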

minor comments (2)

Abstract: the claim of 'superior performance' and 'extensive experiments' would be more informative if key quantitative results, baseline names, and dataset identifiers were included.
Notation: the uniformity regularization weight and contrastive hyperparameters should be explicitly tabulated in the experimental protocol section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evaluation that can strengthen the presentation of our claims regarding domain-consistent representations and topology-constrained decoding. We address each major comment below and will revise the manuscript accordingly.

Referee: §5 (Experiments): the central claim that the two-phase framework closes the cross-domain gap rests on performance improvements on benchmark datasets, but the evaluation provides no quantified domain-shift metrics (e.g., MMD or Fréchet distance between source/target representation distributions) or OOD skeletal error breakdowns to show the temporal consistency contrastive learning generalizes beyond the specific pose and signal distributions of the evaluated benchmarks.

Authors: We acknowledge that explicit quantification of domain shift would provide stronger support for the generalization claims of the temporal consistency contrastive learning. Our current evaluation demonstrates consistent improvements across multiple benchmarks with varying pose and signal distributions, which indirectly supports the domain-consistency objective. In the revised version, we will compute and report MMD distances between source and target representation distributions before and after pretraining, along with OOD skeletal error breakdowns on held-out pose distributions where feasible.

revision: yes

Referee: §4.3 (Decoder architecture): the assertion that the hybrid topology-constrained decoder resolves the structural fidelity gap would require an ablation isolating the effect of the explicit skeletal constraints on metrics such as bone-length ratio error or joint displacement; without this, it is unclear whether the reported gains are due to the topology terms or other decoder components.

Authors: We agree that an ablation isolating the skeletal topology constraints is necessary to clarify their contribution. The hybrid decoder combines multiple components, and while the overall gains are reported, we will add a targeted ablation that removes the explicit topology terms (both adjacent and overarching) and reports the impact on bone-length ratio error and joint displacement metrics in addition to standard pose errors.

revision: yes
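The exchange above refers to a "bone-length ratio error" without defining it. One reasonable definition, assumed here, is the mean absolute deviation from 1 of the predicted-to-ground-truth length ratio over all bones. The sketch below implements that definition on a hypothetical three-joint leg chain; the bone list, joint names, and function names are illustrative only.

```python
import math

# Hypothetical 3-joint chain: 0=hip, 1=knee, 2=ankle
BONES = [(0, 1), (1, 2)]

def bone_lengths(pose, bones=BONES):
    """Euclidean length of each bone in a pose given as a list of 3-D points."""
    return [math.dist(pose[i], pose[j]) for i, j in bones]

def bone_length_ratio_error(pred, gt, bones=BONES):
    """Mean |predicted length / ground-truth length - 1| over all bones."""
    ratios = [p / g for p, g in zip(bone_lengths(pred, bones),
                                    bone_lengths(gt, bones))]
    return sum(abs(r - 1.0) for r in ratios) / len(ratios)

gt = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (0.0, 1.1, 0.0), (0.0, 2.0, 0.0)]
error = bone_length_ratio_error(pred, gt)  # thigh 10% too long, shin 10% too short
```

In this toy case the hip and ankle are placed exactly and only the knee is off, yet both bones are 10% wrong. A structural metric of this kind isolates how a joint error distorts body proportions, which is what the requested ablation is meant to expose.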

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation.

The DT-Pose framework is introduced as a two-phase empirical architecture (masked pretraining with temporal consistency contrastive learning plus uniformity regularization, followed by hybrid topology-constrained decoding). No equations or derivations are presented that reduce by construction to their own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or uniqueness theorems imported from the authors' prior work. The central claims rest on benchmark experiments and released code, which constitute external falsifiability rather than self-referential definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that WiFi signals encode usable pose information and on standard machine-learning training assumptions; multiple unspecified hyperparameters are present but typical for the regime.

free parameters (2)

contrastive learning hyperparameters and uniformity regularization weight
Standard training choices required to stabilize the masked pretraining and avoid mode collapse, not detailed in abstract.
decoder architecture hyperparameters
Choices for incorporating skeletal topology constraints, tuned for the task.

axioms (2)

domain assumption: WiFi signals contain sufficient information about human joint positions and motion
Invoked in the problem formulation and task definition.
domain assumption: Human skeletal topology provides useful explicit constraints for pose decoding
Central to the second phase of the framework.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 3 internal anchors

[1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
[3] Weiya Chen, Chenchen Yu, Chenyu Tu, Zehua Lyu, Jing Tang, Shiqi Ou, Yan Fu, and Zhidong Xue. A survey on hand pose estimation with wearable sensors and computer-vision-based methods. Sensors, 20(4):1074, 2020.
[4] Yang Chen, Jingcai Guo, Song Guo, and Dacheng Tao. Neuron: Learning context-aware evolving representations for zero-shot skeleton action recognition. arXiv preprint arXiv:2411.11288, 2024.
[5] Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, and Ling Wang. Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 778–786, 2024.
[6] Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, and Hong Cheng. Vision-language meets the skeleton: Progressively distillation with cross-modal knowledge for 3d action representation learning. IEEE Transactions on Multimedia, 2024.
[7] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021.
[8] Mingyue Cheng, Qi Liu, Zhiding Liu, Hao Zhang, Rujiao Zhang, and Enhong Chen. Timemae: Self-supervised representations of time series with decoupled masked autoencoders. arXiv preprint arXiv:2303.00320, 2023.
[9] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
[10] Toan D Gian, Tien Dac Lai, Thien Van Luong, Kok-Seng Wong, and Van-Dinh Nguyen. HPE-Li: Wifi-enabled lightweight dual selective kernel convolution for human pose estimation. In European Conference on Computer Vision, pages 93–111. Springer, 2025.
[11] Fuxiang Deng, Emil Jovanov, Houbing Song, Weisong Shi, Yuan Zhang, and Wenyao Xu. Wildar: Wifi signal-based lightweight deep learning model for human activity recognition. IEEE Internet of Things Journal, 2023.
[12] Jacob Devlin. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[13] Junqiao Fan, Jianfei Yang, Yuecong Xu, and Lihua Xie. Diffusion model is a good pose estimator from 3d rf-vision. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
[14] Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu, and Wanqing Li. Focal and global spatial-temporal transformer for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, pages 382–398, 2022.
[15] Jia Gong, Lin Geng Foo, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, and Jun Liu. Diffpose: Toward more reliable 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13041–13051, 2023.
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[18] Tian He, Yang Chen, Xu Gao, Ling Wang, Ting Hu, and Hong Cheng. Enhancing skeleton-based action recognition with language descriptions from pre-trained large multimodal models. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
[19] Tian He, Yang Chen, Ling Wang, and Hong Cheng. An expert-knowledge-based graph convolutional network for skeleton-based physical rehabilitation exercises assessment. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2024.
[20] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
[21] Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. Towards 3d human pose construction using wifi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
[22] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia, 25:1282–1293, 2022.
[23] Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, et al. Group pose: A simple baseline for end-to-end multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15029–15038, 2023.
[24] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[25] Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In Pattern recognition. ICPR international workshops and challenges: virtual event, January 10–15, 2021, Proceedings, Part III, pages 694–701. Springer, 2021.
[26] Alec Radford. Improving language understanding by generative pre-training. 2018.
[27] Yili Ren, Zi Wang, Sheng Tan, Yingying Chen, and Jie Yang. Winect: 3d human pose tracking for free-form activity using commodity wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–29, 2021.
[28] Yili Ren, Zi Wang, Yichao Wang, Sheng Tan, Yingying Chen, and Jie Yang. Gopose: 3d human pose estimation using wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–25, 2022.
[29] Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022.
[30] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE transactions on pattern analysis and machine intelligence, 45(2):1474–1488, 2022.
[31] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
[32] Fei Wang, Stanislav Panev, Ziyi Dai, Jinsong Han, and Dong Huang. Can WiFi estimate person pose? arXiv preprint arXiv:1904.00277, 2019.
[33] Fei Wang, Sanping Zhou, Stanislav Panev, Jinsong Han, and Dong Huang. Person-in-wifi: Fine-grained person perception using wifi. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5452–5461, 2019.
[34] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
[35] Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM international conference on Multimedia, pages 102–106, 2016.
[36] Yiming Wang, Lingchao Guo, Zhaoming Lu, Xiangming Wen, Shuang Zhou, and Wanyu Meng. From point to space: 3d moving human pose estimation using commodity wifi. IEEE Communications Letters, 25(7):2235–2239, 2021.
[37] Yihan Wang, Muyang Li, Han Cai, Wei-Ming Chen, and Song Han. Lite pose: Efficient architecture design for 2d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13126–13136, 2022.
[38] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023.
[39] Kangwei Yan, Fei Wang, Bo Qian, Han Ding, Jinsong Han, and Xing Wei. Person-in-wifi 3d: End-to-end multi-person 3d pose estimation with wi-fi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 969–978, 2024.
[40] Jianfei Yang, Xinyan Chen, Han Zou, Dazhuo Wang, and Lihua Xie. Autofi: Toward automatic wi-fi human sensing via geometric self-supervised learning. IEEE Internet of Things Journal, 10(8):7416–7425, 2022.
[41] Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie. MM-Fi: Multi-modal non-intrusive 4d human dataset for versatile wireless sensing. Advances in Neural Information Processing Systems, 36, 2024.
[42] Feng Zhang, Xiatian Zhu, and Chen Wang. Single person pose estimation: a survey. arXiv preprint arXiv:2109.10056, 2021.
[43] Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Sijie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1):1–37, 2023.
[44] Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, et al. Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4478–4487, 2022.
[45] Yunjiao Zhou, He Huang, Shenghai Yuan, Han Zou, Lihua Xie, and Jianfei Yang. MetaFi++: Wifi-enabled transformer-based human pose estimation for metaverse avatar simulation. IEEE Internet of Things Journal, 10(16):14128–14136, 2023.
[46] Yunjiao Zhou, Jianfei Yang, He Huang, and Lihua Xie. Adapose: Towards cross-site device-free human pose estimation with commodity wifi. IEEE Internet of Things Journal, 2024.
[47] Yue Zhou, Aichun Zhu, Caojie Xu, Fangqiang Hu, and Yifeng Li. Perunet: Deep signal channel attention in unet for wifi-based human pose estimation. IEEE Sensors Journal, 22(20):19750–19760, 2022.

A. Appendix

A.1. Implementation Details

In the pre-training phase, the encoder-decoder is trained for 400 epochs using the AdamW, employing a batch size of 2...

[1] [1]

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017

[3] Weiya Chen, Chenchen Yu, Chenyu Tu, Zehua Lyu, Jing Tang, Shiqi Ou, Yan Fu, and Zhidong Xue. A survey on hand pose estimation with wearable sensors and computer-vision-based methods. Sensors, 20(4):1074, 2020

[4] Yang Chen, Jingcai Guo, Song Guo, and Dacheng Tao. Neuron: Learning context-aware evolving representations for zero-shot skeleton action recognition. arXiv preprint arXiv:2411.11288, 2024

[5] Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, and Ling Wang. Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 778–786, 2024

[6] Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Ting Hu, and Hong Cheng. Vision-language meets the skeleton: Progressively distillation with cross-modal knowledge for 3d action representation learning. IEEE Transactions on Multimedia, 2024

[7] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13359–13368, 2021

[8] Mingyue Cheng, Qi Liu, Zhiding Liu, Hao Zhang, Rujiao Zhang, and Enhong Chen. Timemae: Self-supervised representations of time series with decoupled masked autoencoders. arXiv preprint arXiv:2303.00320, 2023

[9] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022

[10] Toan D Gian, Tien Dac Lai, Thien Van Luong, Kok-Seng Wong, and Van-Dinh Nguyen. Hpe-li: Wifi-enabled lightweight dual selective kernel convolution for human pose estimation. In European Conference on Computer Vision, pages 93–111. Springer, 2025

[11] Fuxiang Deng, Emil Jovanov, Houbing Song, Weisong Shi, Yuan Zhang, and Wenyao Xu. Wildar: Wifi signal-based lightweight deep learning model for human activity recognition. IEEE Internet of Things Journal, 2023

[12] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[13] Junqiao Fan, Jianfei Yang, Yuecong Xu, and Lihua Xie. Diffusion model is a good pose estimator from 3d rf-vision. In European Conference on Computer Vision, pages 1–18. Springer, 2025

[14] Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu, and Wanqing Li. Focal and global spatial-temporal transformer for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, pages 382–398, 2022

[15] Jia Gong, Lin Geng Foo, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, and Jun Liu. Diffpose: Toward more reliable 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13041–13051, 2023

[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

[18] Tian He, Yang Chen, Xu Gao, Ling Wang, Ting Hu, and Hong Cheng. Enhancing skeleton-based action recognition with language descriptions from pre-trained large multimodal models. IEEE Transactions on Circuits and Systems for Video Technology, 2024

[19] Tian He, Yang Chen, Ling Wang, and Hong Cheng. An expert-knowledge-based graph convolutional network for skeleton-based physical rehabilitation exercises assessment. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2024

[20] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022

[21] Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. Towards 3d human pose construction using wifi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020

[22] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia, 25:1282–1293, 2022

[23] Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, et al. Group pose: A simple baseline for end-to-end multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15029–15038, 2023

[24] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017

[25] Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In Pattern recognition. ICPR international workshops and challenges: virtual event, January 10–15, 2021, Proceedings, Part III, pages 694–701. Springer, 2021

[26] Alec Radford. Improving language understanding by generative pre-training. 2018

[27] Yili Ren, Zi Wang, Sheng Tan, Yingying Chen, and Jie Yang. Winect: 3d human pose tracking for free-form activity using commodity wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–29, 2021

[28] Yili Ren, Zi Wang, Yichao Wang, Sheng Tan, Yingying Chen, and Jie Yang. Gopose: 3d human pose estimation using wifi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–25, 2022

[29] Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022

[30] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE transactions on pattern analysis and machine intelligence, 45(2):1474–1488, 2022

[31] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022

[32] Fei Wang, Stanislav Panev, Ziyi Dai, Jinsong Han, and Dong Huang. Can wifi estimate person pose? arXiv preprint arXiv:1904.00277, 2019

[33] Fei Wang, Sanping Zhou, Stanislav Panev, Jinsong Han, and Dong Huang. Person-in-wifi: Fine-grained person perception using wifi. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5452–5461, 2019

[34] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023

[35] Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM international conference on Multimedia, pages 102–106, 2016

[36] Yiming Wang, Lingchao Guo, Zhaoming Lu, Xiangming Wen, Shuang Zhou, and Wanyu Meng. From point to space: 3d moving human pose estimation using commodity wifi. IEEE Communications Letters, 25(7):2235–2239, 2021

[37] Yihan Wang, Muyang Li, Han Cai, Wei-Ming Chen, and Song Han. Lite pose: Efficient architecture design for 2d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13126–13136, 2022

[38] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023

[39] Kangwei Yan, Fei Wang, Bo Qian, Han Ding, Jinsong Han, and Xing Wei. Person-in-wifi 3d: End-to-end multi-person 3d pose estimation with wi-fi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 969–978, 2024

[40] Jianfei Yang, Xinyan Chen, Han Zou, Dazhuo Wang, and Lihua Xie. Autofi: Toward automatic wi-fi human sensing via geometric self-supervised learning. IEEE Internet of Things Journal, 10(8):7416–7425, 2022

[41] Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie. Mm-fi: Multi-modal non-intrusive 4d human dataset for versatile wireless sensing. Advances in Neural Information Processing Systems, 36, 2024

[42] Feng Zhang, Xiatian Zhu, and Chen Wang. Single person pose estimation: a survey. arXiv preprint arXiv:2109.10056, 2021

[43] Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Sijie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1):1–37, 2023

[44] Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou, et al. Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4478–4487, 2022

[45] Yunjiao Zhou, He Huang, Shenghai Yuan, Han Zou, Lihua Xie, and Jianfei Yang. Metafi++: Wifi-enabled transformer-based human pose estimation for metaverse avatar simulation. IEEE Internet of Things Journal, 10(16):14128–14136, 2023

[46] Yunjiao Zhou, Jianfei Yang, He Huang, and Lihua Xie. Adapose: Towards cross-site device-free human pose estimation with commodity wifi. IEEE Internet of Things Journal, 2024

[47] Yue Zhou, Aichun Zhu, Caojie Xu, Fangqiang Hu, and Yifeng Li. Perunet: Deep signal channel attention in unet for wifi-based human pose estimation. IEEE Sensors Journal, 22(20):19750–19760, 2022