LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation

Fengyuan Ma; Guanggang Geng; Lili Wang; Shuyuan Lin; Yeying Jin; Zhaoxin Fan; Zhiying Li

arxiv: 2505.12009 · v2 · submitted 2025-05-17 · 💻 cs.CV

Zhiying Li, Guanggang Geng, Yeying Jin, Shuyuan Lin, Fengyuan Ma, Zhaoxin Fan, Lili Wang

Pith reviewed 2026-05-22 14:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: adversarial attacks, human pose estimation, latent space, expressive human pose and shape, imperceptible perturbations, digital human generation, security vulnerabilities, EHPS models

The pith

LatentStealth generates adversarial perturbations for human pose and shape models inside latent space rather than pixel space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LatentStealth to attack expressive human pose and shape estimation models used in digital human generation. Current attacks produce obvious visual changes that reduce their ability to reveal real security problems such as inappropriate content or offensive gestures. The method projects images into a structured latent space, creates adversarial patterns there, and refines them along optimized directions using only a few output queries from the target model. This yields perturbations that decode back to images with high visual imperceptibility while still degrading the estimator's performance. Tests on 3DPW and UBody datasets show competitive success rates at low computational cost, exposing vulnerabilities that existing defenses have not addressed.

Core claim

By projecting inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions, LatentStealth maintains high imperceptibility while preserving effectiveness and achieving competitive attack performance with low computational overhead using only a small number of model output queries.

What carries the argument

Projection of inputs into latent space followed by generation and progressive refinement of adversarial patterns along optimized directions guided by few model output queries.
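
The encode, perturb, query, decode loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode`/`decode` are a toy invertible pairwise transform standing in for the pretrained encoder, `toy_estimator` is a hypothetical black-box model, and the refinement step is plain random search swapped in for the paper's optimized directions. The defaults `eta=0.05` and `queries=3` echo values quoted elsewhere on this page; their exact role in the paper is not specified here.

```python
import math
import random


def encode(x):
    """Toy invertible 'encoder' (pairwise average/difference); a stand-in, not a VAE."""
    z = []
    for i in range(0, len(x), 2):
        z += [(x[i] + x[i + 1]) / 2.0, (x[i] - x[i + 1]) / 2.0]
    return z


def decode(z):
    """Exact inverse of encode()."""
    x = []
    for i in range(0, len(z), 2):
        x += [z[i] + z[i + 1], z[i] - z[i + 1]]
    return x


def toy_estimator(x):
    """Hypothetical black-box model: image vector -> 2-D 'pose' output."""
    return [sum(v * math.sin(i + 1) for i, v in enumerate(x)),
            sum(v * math.cos(i + 1) for i, v in enumerate(x))]


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def scaled(direction, eta):
    """Rescale a direction to L2 norm eta."""
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    return [eta * d / norm for d in direction]


def latent_attack(x, model, eta=0.05, queries=3, seed=0):
    rng = random.Random(seed)
    z = encode(x)
    clean_out = model(x)  # reference prediction, not counted in the budget here
    best_delta, best_dev = [0.0] * len(z), 0.0
    for _ in range(queries):
        # Perturb the current best direction with Gaussian noise, rescale to norm eta.
        proposal = [b + eta * rng.gauss(0.0, 1.0) for b in best_delta]
        delta = scaled(proposal, eta)
        x_adv = decode([zi + di for zi, di in zip(z, delta)])
        dev = l2(model(x_adv), clean_out)  # one model-output query
        if dev > best_dev:  # keep the direction that moves the output most
            best_delta, best_dev = delta, dev
    return decode([zi + di for zi, di in zip(z, best_delta)]), best_dev


x = [0.1 * i for i in range(8)]
x_adv, deviation = latent_attack(x, toy_estimator)
```

Because the toy decoder is linear, the pixel-space change here is bounded by a fixed multiple of `eta`. A learned decoder gives no such guarantee for free, which is why the review below asks for reconstruction and perceptual measurements.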

If this is right

EHPS models used in live-streaming and digital humans are exposed as vulnerable to stealthy attacks that current visual checks would miss.
Attack success with only a small number of queries makes the threat practical against deployed systems.
The approach outperforms prior pixel-space attacks on standard 3DPW and UBody benchmarks in both stealth and efficiency.
Security risks such as forced generation of violent or offensive poses can now be demonstrated without obvious tampering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same latent-space strategy could transfer to other regression-based vision tasks where pixel perturbations are too conspicuous.
If the pretrained encoder used for the latent space has reconstruction errors, those errors might either aid or hinder attack transferability.
Training EHPS models with latent-space robustness objectives would be a direct countermeasure suggested by the attack design.

Load-bearing premise

The premise has two parts: structured latent representations of natural images allow adversarial patterns that remain visually imperceptible after decoding while still fooling the EHPS model, and optimization with limited queries suffices without separate checks on reconstruction quality.
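
One way to make this premise checkable is a simple bound, stated here as an assumption and not as a result from the paper. Write $E$ for the encoder, $D$ for the decoder, $\delta$ for the latent perturbation, and suppose $D$ is $K$-Lipschitz near $z = E(x)$ (all four symbols are introduced for illustration only). The triangle inequality then separates the pixel-space change into a reconstruction term and a perturbation term:

```latex
% Assumption (not from the paper): D is K-Lipschitz near z = E(x),
% and x_adv = D(E(x) + \delta).
\[
  \lVert D(z + \delta) - D(z) \rVert_2 \;\le\; K \,\lVert \delta \rVert_2
  \quad\Longrightarrow\quad
  \lVert x_{\mathrm{adv}} - x \rVert_2
  \;\le\;
  \underbrace{\lVert D(E(x)) - x \rVert_2}_{\text{reconstruction error}}
  \;+\;
  K \,\lVert \delta \rVert_2 .
\]
```

The first term does not shrink with $\delta$, so a small latent step alone cannot guarantee a small pixel change, and a small $\ell_2$ change is itself only a proxy for perceptual imperceptibility.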

What would settle it

Decoded images from the perturbed latent codes show clearly visible artifacts or produce no measurable drop in EHPS accuracy on held-out test images.
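
The two failure conditions above can be turned into a small check. The sketch below is illustrative only: PSNR is used as a crude stand-in for a perceptual metric, and the threshold `min_psnr_db` and the function name `attack_claim_fails` are hypothetical choices, not values or names from the paper.

```python
import math


def psnr(original, perturbed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel vectors."""
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def attack_claim_fails(psnr_db, clean_error, adv_error,
                       min_psnr_db=30.0, min_error_gain=0.0):
    """True if either settling condition holds.

    visible: decoded image differs too much from the original (low PSNR).
    no_drop: estimator error did not rise beyond min_error_gain.
    Thresholds are hypothetical and would need to be justified empirically.
    """
    visible = psnr_db < min_psnr_db
    no_drop = (adv_error - clean_error) <= min_error_gain
    return visible or no_drop


quality = psnr([0.0, 0.0], [0.1, 0.1])  # roughly 20 dB
failed = attack_claim_fails(quality, clean_error=50.0, adv_error=120.0)
```

In practice one would evaluate this on held-out test images and replace PSNR with the perceptual distances the referee report asks for.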

Figures

Figures reproduced from arXiv: 2505.12009 by Fengyuan Ma, Guanggang Geng, Lili Wang, Shuyuan Lin, Yeying Jin, Zhaoxin Fan, Zhiying Li.

**Figure 2.** Overview of the LatentStealth pipeline. The framework consists of two stages: a) noise injection in latent space, and b) noise enhancement.

**Figure 3.** Visualizing various adversarial samples for digital human generation on UBody.

**Figure 4.** Visualizing various adversarial samples for digital human generation on PW3D.

**Figure 5.** Performance of different perturbation magnitude

Original abstract

Expressive human pose and shape estimation (EHPS) plays a central role in digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention on potential security vulnerabilities, such as generating inappropriate content, violent actions, or racially offensive gestures and expressions. Current adversarial attacks on EHPS models often generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address this limitation, we propose an unnoticeable adversarial method, termed LatentStealth, specifically tailored for EHPS models. The key idea is to exploit the structured latent representations of natural images as the medium for crafting perturbations. Instead of injecting noise directly into the pixel space, our method projects inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions. This latent-space manipulation enables the attack to maintain high imperceptibility while preserving its effectiveness. Furthermore, as the optimization process is guided by only a small number of model output queries, the framework achieves competitive attack performance with low computational overhead, making it both practical and efficient for real-world scenarios. Extensive experiments on the 3DPW and UBody datasets demonstrate the superiority of LatentStealth, revealing critical vulnerabilities in current systems. These findings highlight the urgent need to address and mitigate security risks in digital human generation technologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LatentStealth moves attacks on expressive human pose and shape models into latent space for better imperceptibility and efficiency, but the evidence for staying visually unnoticeable is not yet solid.

LatentStealth is an attempt to create unnoticeable adversarial attacks on expressive human pose and shape estimation models by operating in latent space. The authors project the input image into a latent representation, add and refine an adversarial pattern there using optimized directions, and decode it back to pixel space. This is supposed to keep the changes invisible while still fooling the EHPS model, all with only a small number of queries to the model outputs.

What is new is the specific application to EHPS and the progressive refinement in latent space for efficiency. Earlier adversarial methods on pose estimation often produced noticeable perturbations that would not work in real scenarios like live streaming. The paper does well in identifying the security risks, such as generating inappropriate or offensive content through fooled models, and in emphasizing the need for practical attacks that do not draw attention.

The soft spots come in the validation of the imperceptibility claim. The abstract talks about high imperceptibility and superiority on 3DPW and UBody datasets, but it does not include any quantitative measures like perceptual similarity scores or reconstruction quality from the latent space. Without those, or without showing how the method compares to direct pixel attacks on visual metrics, it is difficult to accept that the decoded images remain truly unnoticeable. The optimization process is described as guided by few queries, yet no specific budget or convergence details are given in the summary, which leaves the efficiency claim somewhat open.

This kind of work is aimed at researchers in adversarial machine learning within computer vision, especially those focused on 3D human modeling and its applications. Readers who care about robustness in digital human technologies could find useful ideas here for both attacks and potential defenses.

The paper engages honestly with the literature on pose estimation attacks and presents a clear empirical construction, so it qualifies for a serious referee. I would recommend putting this through peer review. The idea has merit for the field, and referees can help strengthen the experimental evidence around the perceptual aspects.

Referee Report

3 major / 2 minor

Summary. The paper proposes LatentStealth, an adversarial attack on Expressive Human Pose and Shape Estimation (EHPS) models. Inputs are projected into the latent space of a pretrained encoder; adversarial patterns are generated and refined along optimized directions before decoding back to images. The method claims to achieve high imperceptibility, preserved attack effectiveness, and competitive performance with low overhead via a small number of model output queries. Experiments on 3DPW and UBody datasets are presented as demonstrating superiority over prior attacks.

Significance. If validated, the work is significant for highlighting security risks in EHPS models used in digital human generation and live-streaming. The latent-space formulation offers a practical route to imperceptible perturbations and few-query efficiency, which could inform defense design. The empirical focus on real-world applicability is a strength, though it requires quantitative grounding to realize that potential.

major comments (3)

§3 (Method): The central mechanism projects images into latent space, adds refined adversarial patterns, and decodes. No encoder architecture is named, no reconstruction-error bound is given, and no analysis shows that small latent displacements map to imperceptible pixel changes. This assumption is load-bearing for the imperceptibility claim.
§4 (Experiments): Superiority is asserted on 3DPW and UBody, yet the results provide no attack success rates, MPJPE or other EHPS error metrics, LPIPS/SSIM perceptual distances, or visual fidelity comparisons against pixel-space baselines. Without these numbers the effectiveness-plus-imperceptibility claim cannot be evaluated.
§3.2 (Optimization): The few-query refinement process is described without a concrete query budget, convergence criterion, or trade-off curve relating attack success to reconstruction quality. This detail is required to substantiate the efficiency and imperceptibility guarantees.
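
For readers unfamiliar with the error metric named in the §4 comment, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows the standard definition on made-up coordinates; the function name and the sample joints are illustrative and carry no data from the paper.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance per joint.

    Both arguments are sequences of (x, y, z) joint positions in the same
    units (commonly millimetres) and in the same joint order.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Made-up two-joint example: the first joint matches, the second is off by 5 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
error = mpjpe(pred, truth)  # (0 + 5) / 2 = 2.5
```

An attack evaluation would report this value on clean and adversarial inputs side by side, so the increase in error can be read against the perceptual distance of the perturbation.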

minor comments (2)

Abstract: The phrase 'extensive experiments demonstrate the superiority' should be accompanied by at least one concrete metric or baseline name to orient readers.
§3 (Notation): Ensure the latent-space variables (e.g., direction vectors, refinement steps) are defined with consistent symbols when first introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below. Where revisions are needed to strengthen the presentation, we have indicated that changes will be made in the next version of the paper.

Referee: §3 (Method): The central mechanism projects images into latent space, adds refined adversarial patterns, and decodes. No encoder architecture is named, no reconstruction-error bound is given, and no analysis shows that small latent displacements map to imperceptible pixel changes. This assumption is load-bearing for the imperceptibility claim.

Authors: We appreciate the referee pointing out the need for greater specificity in the method description. While the manuscript outlines the latent-space projection and decoding process, we agree that explicit details on the encoder architecture, reconstruction error bounds, and a supporting analysis for imperceptibility were insufficient. In the revised manuscript, we will name the specific pretrained encoder architecture employed, provide quantitative bounds on the reconstruction error, and include an analysis (with supporting experiments or mathematical justification) demonstrating that small displacements in the latent space result in imperceptible pixel-level changes. This will better substantiate the imperceptibility claim.

Revision: yes

Referee: §4 (Experiments): Superiority is asserted on 3DPW and UBody, yet the results provide no attack success rates, MPJPE or other EHPS error metrics, LPIPS/SSIM perceptual distances, or visual fidelity comparisons against pixel-space baselines. Without these numbers the effectiveness-plus-imperceptibility claim cannot be evaluated.

Authors: We acknowledge that the experimental section would benefit from additional quantitative metrics to allow for a more rigorous evaluation. The manuscript includes comparisons on the 3DPW and UBody datasets demonstrating the advantages of LatentStealth, but we agree that explicit reporting of attack success rates, MPJPE and other EHPS-specific error metrics, along with LPIPS and SSIM for perceptual quality, and side-by-side comparisons to pixel-space attack baselines, would strengthen the claims. We will incorporate these metrics and comparisons into the revised experiments section.

Revision: yes

Referee: §3.2 (Optimization): The few-query refinement process is described without a concrete query budget, convergence criterion, or trade-off curve relating attack success to reconstruction quality. This detail is required to substantiate the efficiency and imperceptibility guarantees.

Authors: Thank you for this observation regarding the optimization details. The manuscript emphasizes the few-query nature of the refinement process, but we concur that concrete specifications are necessary for reproducibility and to support the efficiency claims. In the revision, we will provide the specific query budget utilized in our experiments, detail the convergence criteria employed, and include trade-off curves that illustrate the relationship between attack success and reconstruction quality under varying query counts.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the proposed empirical method

The paper presents LatentStealth as an algorithmic construction for generating adversarial perturbations in latent space of a pretrained encoder, followed by decoding and experimental validation on 3DPW and UBody datasets. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on empirical attack success and imperceptibility rather than any self-referential reduction to inputs by construction. This is the expected outcome for an applied CV attack paper whose validity is tested externally via experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of useful structured latent representations for natural images and on the effectiveness of query-based optimization in that space; these are standard in generative modeling but treated as given without new justification here.

free parameters (2)

number of optimization queries: The abstract states the attack uses only a small number of model output queries; the exact count and selection strategy are not specified and function as a tunable parameter.
latent direction refinement steps: Progressive refinement along optimized directions implies step count or learning rate choices that are not detailed.

axioms (2)

domain assumption: Natural images possess structured latent representations that can be manipulated to produce adversarial effects while preserving visual fidelity after decoding. Invoked when the method projects inputs into latent space instead of pixel space.
domain assumption: Few-query optimization is sufficient to discover effective adversarial directions in latent space for EHPS models. Stated as enabling low computational overhead and practicality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: "projects inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions... VAE... η=0.05... multi-task loss L = -E[||Δα||² + ...] + pixel L2 term"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · relation to the paper passage: unclear

Paper passage: "strict test budget... only three queries of the model outputs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2022

[18] [18]

Auto-encoding variational bayes,

D. P. Kingma, M. Wellinget al., “Auto-encoding variational bayes,” 2013

work page 2013

[19] [19]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,” inNeurIPS Workshop, 2021

work page 2021

[20] [20]

Adversarial robustness of vaes through the lens of local geometry,

A. Khan and A. Storkey, “Adversarial robustness of vaes through the lens of local geometry,” inAISTATS, 2023, pp. 8954–8967

work page 2023

[21] [21]

Generating out of distribution adver- sarial attack using latent space poisoning,

U. Upadhyay and P. Mukherjee, “Generating out of distribution adver- sarial attack using latent space poisoning,”SPL, vol. 28, pp. 523–527, 2021

work page 2021

[22] [22]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” inNeurIPS, vol. 30, 2017

work page 2017

[23] [23]

Image quality assessment: from error visibility to structural similarity,

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,”TIP, vol. 13, no. 4, pp. 600–612, 2004

work page 2004

[24] [24]

Sequential 3d human pose and shape estimation from point clouds,

K. Wang, J. Xie, G. Zhang, L. Liu, and J. Yang, “Sequential 3d human pose and shape estimation from point clouds,” inCVPR, 2020, pp. 7275– 7284

work page 2020

[25] [25]

Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation,

J. Li, C. Xu, Z. Chen, S. Bian, L. Yang, and C. Lu, “Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation,” inCVPR, 2021, pp. 3383–3393

work page 2021

[26] [26]

Learning to estimate 3d human pose and shape from a single color image,

G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, “Learning to estimate 3d human pose and shape from a single color image,” inCVPR, 2018, pp. 459–468

work page 2018

[27] [27]

Eventhpe: Event-based 3d human pose and shape estimation,

S. Zou, C. Guo, X. Zuo, S. Wang, P. Wang, X. Hu, S. Chen, M. Gong, and L. Cheng, “Eventhpe: Event-based 3d human pose and shape estimation,” inICCV, 2021, pp. 10 996–11 005

work page 2021

[28] [28]

Benchmarking and analyzing 3d human pose and shape estimation beyond algorithms,

H. E. Pang, Z. Cai, L. Yang, T. Zhang, and Z. Liu, “Benchmarking and analyzing 3d human pose and shape estimation beyond algorithms,” in NeurIPS, vol. 35, 2022, pp. 26 034–26 051

work page 2022

[29] [29]

Global-to- local modeling for video-based 3d human pose and shape estimation,

X. Shen, Z. Yang, X. Wang, J. Ma, C. Zhou, and Y . Yang, “Global-to- local modeling for video-based 3d human pose and shape estimation,” inCVPR, 2023, pp. 8887–8896

work page 2023

[30] [30]

Is-wars: Intelligent and stealthy adversarial attack to wi-fi-based human activity recognition systems,

P. Huang, X. Zhang, S. Yu, and L. Guo, “Is-wars: Intelligent and stealthy adversarial attack to wi-fi-based human activity recognition systems,” TDSC, vol. 19, no. 6, pp. 3899–3912, 2021

work page 2021

[31] [31]

Simple black-box adversarial attacks,

C. Guo, J. Gardner, Y . You, A. G. Wilson, and K. Weinberger, “Simple black-box adversarial attacks,” inICML, 2019, pp. 2484–2493

work page 2019

[32] [32]

Adversarial texture for fooling person detectors in the physical world,

Z. Hu, S. Huang, X. Zhu, F. Sun, B. Zhang, and X. Hu, “Adversarial texture for fooling person detectors in the physical world,” inCVPR, 2022, pp. 13 307–13 316

work page 2022

[33] [33]

Universal physical camouflage attacks on object detectors,

L. Huang, C. Gao, Y . Zhou, C. Xie, A. L. Yuille, C. Zou, and N. Liu, “Universal physical camouflage attacks on object detectors,” inCVPR, 2020, pp. 720–729

work page 2020

[34] [34]

A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks,

Y . Zhang, J. Hou, and Y . Yuan, “A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks,” inIJCV, vol. 132, no. 5, 2024, pp. 1592–1624

work page 2024

[35] [35]

Towards transferable targeted 3d adversarial attack in the physical world,

Y . Huang, Y . Dong, S. Ruan, X. Yang, H. Su, and X. Wei, “Towards transferable targeted 3d adversarial attack in the physical world,” in CVPR, 2024, pp. 24 512–24 522

work page 2024

[36] [36]

Content- based unrestricted adversarial attack,

Z. Chen, B. Li, S. Wu, K. Jiang, S. Ding, and W. Zhang, “Content- based unrestricted adversarial attack,” inNeurIPS, vol. 36, 2023, pp. 51 719–51 733

work page 2023

[37] [37]

Diffusion models for imperceptible and transferable adversarial attack,

J. Chen, H. Chen, K. Chen, Y . Zhang, Z. Zou, and Z. Shi, “Diffusion models for imperceptible and transferable adversarial attack,”TPAMI, 2024

work page 2024

[38] [38]

Generating adversarial attacks in the latent space,

N. Shukla and S. Banerjee, “Generating adversarial attacks in the latent space,” inCVPR, 2023, pp. 730–739

work page 2023

[39] [39]

Explaining and Harnessing Adversarial Examples

I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,”arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[40] [40]

Towards Deep Learning Models Resistant to Adversarial Attacks

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,”arXiv:1706.06083, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Enhanc- ing variational autoencoders with smooth robust latent encoding,

H. Lee, M. Kim, S. Jang, J. Jeong, and S. J. Hwang, “Enhanc- ing variational autoencoders with smooth robust latent encoding,” arXiv:2504.17219, 2025

work page arXiv 2025

[42] [42]

Recovering accurate 3d human pose in the wild using imus and a moving camera,

T. V on Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons- Moll, “Recovering accurate 3d human pose in the wild using imus and a moving camera,” inECCV, 2018, pp. 601–617

work page 2018

[43] [43]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

D. Alexey, “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv: 2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[44] [44]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595

work page 2018