arxiv: 2505.12009 · v2 · submitted 2025-05-17 · 💻 cs.CV

LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation

Pith reviewed 2026-05-22 14:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords: adversarial attacks, human pose estimation, latent space, expressive human pose and shape, imperceptible perturbations, digital human generation, security vulnerabilities, EHPS models

The pith

LatentStealth generates adversarial perturbations for human pose and shape models inside latent space rather than pixel space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LatentStealth to attack expressive human pose and shape estimation models used in digital human generation. Current attacks produce obvious visual changes that reduce their ability to reveal real security problems such as inappropriate content or offensive gestures. The method projects images into a structured latent space, creates adversarial patterns there, and refines them along optimized directions using only a few output queries from the target model. This yields perturbations that decode back to images with high visual imperceptibility while still degrading the estimator's performance. Tests on 3DPW and UBody datasets show competitive success rates at low computational cost, exposing vulnerabilities that existing defenses have not addressed.

Core claim

By projecting inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions, LatentStealth maintains high imperceptibility while preserving effectiveness and achieving competitive attack performance with low computational overhead using only a small number of model output queries.
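
One way to write this claim down, using notation that is assumed here rather than taken from the paper: E and D denote the pretrained encoder and decoder, f the target EHPS model, \mathcal{L} a measure of how far the model output moves, \epsilon a latent-norm budget, and Q the query budget. The norm constraint in particular is an assumption; the summary only states that perturbations remain imperceptible after decoding.

```latex
\begin{aligned}
z &= E(x), \qquad x_{\mathrm{adv}} = D(z + \delta), \\
\delta^{\star} &\in \arg\max_{\delta}\; \mathcal{L}\bigl(f(x_{\mathrm{adv}}),\, f(x)\bigr)
\quad \text{s.t.} \quad \|\delta\| \le \epsilon, \quad \#\{\text{queries to } f\} \le Q .
\end{aligned}
```

Under this reading, imperceptibility depends on how D maps small \delta to pixel changes, which is the point the referee report below asks the authors to quantify.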

What carries the argument

Projection of inputs into latent space followed by generation and progressive refinement of adversarial patterns along optimized directions guided by few model output queries.
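
The mechanism can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the paper's encoder, loss, and optimizer are not specified in this summary, so the sketch substitutes an orthogonal linear map for the pretrained encoder/decoder and a greedy random-direction search (in the spirit of simple black-box attacks) for the refinement step. The names `make_toy_autoencoder` and `latent_query_attack` are hypothetical.

```python
import numpy as np


def make_toy_autoencoder(dim, rng):
    """Hypothetical stand-in for a pretrained encoder/decoder pair.

    Uses a random orthogonal matrix so that decode(encode(x)) == x.
    """
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

    def encode(x):
        return q.T @ x

    def decode(z):
        return q @ z

    return encode, decode


def latent_query_attack(x, model, encode, decode,
                        step=0.05, max_queries=40, rng=None):
    """Greedy random-direction search in latent space.

    This is an assumed optimizer, not the paper's algorithm. It only
    reads model outputs (no gradients) and stops at max_queries.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    z = encode(x)
    clean_out = model(x)          # one query for the clean reference output
    queries = 1
    delta = np.zeros_like(z)      # accumulated latent perturbation
    best = 0.0                    # largest output deviation found so far
    while queries < max_queries:
        d = rng.standard_normal(z.shape)
        d /= np.linalg.norm(d)    # unit-norm candidate direction
        for sign in (1.0, -1.0):
            if queries >= max_queries:
                break
            cand = delta + sign * step * d
            out = model(decode(z + cand))
            queries += 1
            score = float(np.linalg.norm(out - clean_out))
            if score > best:      # keep the step only if it helps
                delta, best = cand, score
                break
    return decode(z + delta), best, queries
```

With a real autoencoder the decoder is nonlinear and lossy, so the pixel-space change caused by a latent step of size `step` would have to be measured rather than assumed.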

If this is right

  • EHPS models used in live-streaming and digital humans are exposed as vulnerable to stealthy attacks that current visual checks would miss.
  • Attack success with only a small number of queries makes the threat practical against deployed systems.
  • The paper reports that the approach outperforms prior pixel-space attacks on the 3DPW and UBody benchmarks in both stealth and efficiency.
  • Security risks such as forced generation of violent or offensive poses can now be demonstrated without obvious tampering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-space strategy could transfer to other regression-based vision tasks where pixel perturbations are too conspicuous.
  • If the pretrained encoder used for the latent space has reconstruction errors, those errors might either aid or hinder attack transferability.
  • Training EHPS models with latent-space robustness objectives would be a direct countermeasure suggested by the attack design.

Load-bearing premise

The paper assumes that structured latent representations of natural images admit adversarial patterns that remain visually imperceptible after decoding while still fooling the EHPS model, and that optimization with limited queries suffices without separate checks on reconstruction quality.

What would settle it

Decoded images from the perturbed latent codes show clearly visible artifacts or produce no measurable drop in EHPS accuracy on held-out test images.
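
A minimal sketch of how this test could be run on a single image pair. The thresholds `min_psnr_db` and `min_err_increase` are hypothetical placeholders, and PSNR is used only as a crude stand-in for visibility; the referee report below asks for LPIPS and SSIM, which would be more appropriate perceptual measures.

```python
import numpy as np


def psnr(clean, adv, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    clean = np.asarray(clean, dtype=float)
    adv = np.asarray(adv, dtype=float)
    mse = np.mean((clean - adv) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def claim_is_undermined(clean_img, adv_img, clean_err, adv_err,
                        min_psnr_db=40.0, min_err_increase=0.0):
    """Return True if either failure condition holds.

    clean_err / adv_err are the estimator's errors (for example MPJPE)
    on the clean and adversarial image. Both thresholds are assumed
    values chosen for illustration, not numbers from the paper.
    """
    visible_artifacts = psnr(clean_img, adv_img) < min_psnr_db
    no_degradation = (adv_err - clean_err) <= min_err_increase
    return visible_artifacts or no_degradation
```

A real evaluation would aggregate this over a held-out test set rather than judge one image.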

Figures

Figures reproduced from arXiv: 2505.12009 by Fengyuan Ma, Guanggang Geng, Lili Wang, Shuyuan Lin, Yeying Jin, Zhaoxin Fan, Zhiying Li.

Figure 1: Motivation and comparison of adversarial attacks on digital human
Figure 2: Overview of the LatentStealth pipeline. The framework consists of two stages: a) noise injection in latent space, and b) noise enhancement.
Figure 3: Visualizing various adversarial samples for digital human generation on UBody.
Figure 4: Visualizing various adversarial samples for digital human generation on PW3D.
Figure 5: Performance of different perturbation magnitude

Original abstract

Expressive human pose and shape estimation (EHPS) plays a central role in digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention to potential security vulnerabilities, such as generating inappropriate content, violent actions, or racially offensive gestures and expressions. Current adversarial attacks on EHPS models often generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address this limitation, we propose an unnoticeable adversarial method, termed LatentStealth, specifically tailored for EHPS models. The key idea is to exploit the structured latent representations of natural images as the medium for crafting perturbations. Instead of injecting noise directly into the pixel space, our method projects inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions. This latent-space manipulation enables the attack to maintain high imperceptibility while preserving its effectiveness. Furthermore, as the optimization process is guided by only a small number of model output queries, the framework achieves competitive attack performance with low computational overhead, making it both practical and efficient for real-world scenarios. Extensive experiments on the 3DPW and UBody datasets demonstrate the superiority of LatentStealth, revealing critical vulnerabilities in current systems. These findings highlight the urgent need to address and mitigate security risks in digital human generation technologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LatentStealth, an adversarial attack on Expressive Human Pose and Shape Estimation (EHPS) models. Inputs are projected into the latent space of a pretrained encoder; adversarial patterns are generated and refined along optimized directions before decoding back to images. The method claims to achieve high imperceptibility, preserved attack effectiveness, and competitive performance with low overhead via a small number of model output queries. Experiments on 3DPW and UBody datasets are presented as demonstrating superiority over prior attacks.

Significance. If validated, the work is significant for highlighting security risks in EHPS models used in digital human generation and live-streaming. The latent-space formulation offers a practical route to imperceptible perturbations and few-query efficiency, which could inform defense design. The empirical focus on real-world applicability is a strength, though it requires quantitative grounding to realize that potential.

major comments (3)
  1. [§3] §3 (Method): The central mechanism projects images into latent space, adds refined adversarial patterns, and decodes. No encoder architecture is named, no reconstruction-error bound is given, and no analysis shows that small latent displacements map to imperceptible pixel changes. This assumption is load-bearing for the imperceptibility claim.
  2. [§4] §4 (Experiments): Superiority is asserted on 3DPW and UBody, yet the results provide no attack success rates, MPJPE or other EHPS error metrics, LPIPS/SSIM perceptual distances, or visual fidelity comparisons against pixel-space baselines. Without these numbers the effectiveness-plus-imperceptibility claim cannot be evaluated.
  3. [§3.2] §3.2 (Optimization): The few-query refinement process is described without a concrete query budget, convergence criterion, or trade-off curve relating attack success to reconstruction quality. This detail is required to substantiate the efficiency and imperceptibility guarantees.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'extensive experiments demonstrate the superiority' should be accompanied by at least one concrete metric or baseline name to orient readers.
  2. [§3] Notation: Ensure the latent-space variables (e.g., direction vectors, refinement steps) are defined with consistent symbols when first introduced.
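
For reference, the MPJPE metric named in the second major comment is the mean Euclidean distance between predicted and ground-truth 3D joints. The sketch below is a generic NumPy implementation; the function name, the optional root alignment, and the choice of units are assumptions, not details taken from the paper.

```python
import numpy as np


def mpjpe(pred, gt, root_index=None):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3) in the same units (often mm).
    root_index: if given, both skeletons are translated so that this
    joint sits at the origin before the error is computed.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must share a shape of (..., J, 3)")
    if root_index is not None:
        pred = pred - pred[..., root_index:root_index + 1, :]
        gt = gt - gt[..., root_index:root_index + 1, :]
    # Euclidean distance per joint, then the mean over joints and batch.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

Attack strength could then be reported as the increase `mpjpe(adv_pred, gt) - mpjpe(clean_pred, gt)`, alongside a perceptual distance between the clean and adversarial images.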

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below. Where revisions are needed to strengthen the presentation, we have indicated that changes will be made in the next version of the paper.

point-by-point responses
  1. Referee: [§3] §3 (Method): The central mechanism projects images into latent space, adds refined adversarial patterns, and decodes. No encoder architecture is named, no reconstruction-error bound is given, and no analysis shows that small latent displacements map to imperceptible pixel changes. This assumption is load-bearing for the imperceptibility claim.

    Authors: We appreciate the referee pointing out the need for greater specificity in the method description. While the manuscript outlines the latent-space projection and decoding process, we agree that explicit details on the encoder architecture, reconstruction error bounds, and a supporting analysis for imperceptibility were insufficient. In the revised manuscript, we will name the specific pretrained encoder architecture employed, provide quantitative bounds on the reconstruction error, and include an analysis (with supporting experiments or mathematical justification) demonstrating that small displacements in the latent space result in imperceptible pixel-level changes. This will better substantiate the imperceptibility claim.
    revision: yes

  2. Referee: [§4] §4 (Experiments): Superiority is asserted on 3DPW and UBody, yet the results provide no attack success rates, MPJPE or other EHPS error metrics, LPIPS/SSIM perceptual distances, or visual fidelity comparisons against pixel-space baselines. Without these numbers the effectiveness-plus-imperceptibility claim cannot be evaluated.

    Authors: We acknowledge that the experimental section would benefit from additional quantitative metrics to allow for a more rigorous evaluation. The manuscript includes comparisons on the 3DPW and UBody datasets demonstrating the advantages of LatentStealth, but we agree that explicit reporting of attack success rates, MPJPE and other EHPS-specific error metrics, along with LPIPS and SSIM for perceptual quality, and side-by-side comparisons to pixel-space attack baselines, would strengthen the claims. We will incorporate these metrics and comparisons into the revised experiments section.
    revision: yes

  3. Referee: [§3.2] §3.2 (Optimization): The few-query refinement process is described without a concrete query budget, convergence criterion, or trade-off curve relating attack success to reconstruction quality. This detail is required to substantiate the efficiency and imperceptibility guarantees.

    Authors: Thank you for this observation regarding the optimization details. The manuscript emphasizes the few-query nature of the refinement process, but we concur that concrete specifications are necessary for reproducibility and to support the efficiency claims. In the revision, we will provide the specific query budget utilized in our experiments, detail the convergence criteria employed, and include trade-off curves that illustrate the relationship between attack success and reconstruction quality under varying query counts.
    revision: yes
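
The trade-off curve promised in the third response could be tabulated along the following lines. This is a sketch under assumed inputs: for each test image, the number of queries the attack needed before it first succeeded (or `None` if it never did) and a reconstruction-quality score for the resulting adversarial image (for example PSNR or SSIM). The function name and record format are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def tradeoff_curve(
    queries_to_success: Sequence[Optional[int]],
    quality: Sequence[float],
    budgets: Sequence[int],
) -> List[Tuple[int, float, Optional[float]]]:
    """For each query budget, return (budget, success rate, mean quality).

    An image counts as a success at budget b if the attack succeeded
    using at most b queries. Mean quality is taken over the successful
    images only and is None when there are none.
    """
    if len(queries_to_success) != len(quality):
        raise ValueError("inputs must have one entry per image")
    n = len(quality)
    if n == 0:
        raise ValueError("at least one image is required")
    rows = []
    for b in budgets:
        hits = [s for q, s in zip(queries_to_success, quality)
                if q is not None and q <= b]
        rate = len(hits) / n
        mean_quality = sum(hits) / len(hits) if hits else None
        rows.append((b, rate, mean_quality))
    return rows
```

Plotting success rate against budget, with mean quality as a second axis, would show whether higher success is bought with lower reconstruction quality.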

Circularity Check

0 steps flagged

No significant circularity in the proposed empirical method

The paper presents LatentStealth as an algorithmic construction for generating adversarial perturbations in latent space of a pretrained encoder, followed by decoding and experimental validation on 3DPW and UBody datasets. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on empirical attack success and imperceptibility rather than any self-referential reduction to inputs by construction. This is the expected outcome for an applied CV attack paper whose validity is tested externally via experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of useful structured latent representations for natural images and on the effectiveness of query-based optimization in that space; these are standard in generative modeling but treated as given without new justification here.

free parameters (2)
  • number of optimization queries
    The abstract states the attack uses only a small number of model output queries; the exact count and selection strategy are not specified and function as a tunable parameter.
  • latent direction refinement steps
    Progressive refinement along optimized directions implies step count or learning rate choices that are not detailed.
axioms (2)
  • domain assumption: Natural images possess structured latent representations that can be manipulated to produce adversarial effects while preserving visual fidelity after decoding.
    Invoked when the method projects inputs into latent space instead of pixel space.
  • domain assumption: Few-query optimization is sufficient to discover effective adversarial directions in latent space for EHPS models.
    Stated as enabling low computational overhead and practicality.

