arxiv: 2604.23651 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

Navid Aslankhani Khameneh , Marco Carletti , Cigdem Beyan

Pith reviewed 2026-05-08 06:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords in-bed pose estimation · blanket occlusion · latent diffusion model · geometry conditioning · data augmentation · human pose estimation · synthetic data · occlusion robustness

The pith

A diffusion model conditioned only on body keypoints generates synthetic blanket-covered images that raise pose estimation accuracy under severe occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that treating data augmentation for occluded in-bed poses as a geometry-conditioned generative problem lets a Latent Diffusion Model produce realistic blanket-covered images directly from skeletal keypoints. This removes the reliance on visible-occluded image pairs or multi-modal sensors that current methods require. When these synthetic images are used to train a standard pose estimator, the resulting model achieves the highest strict localization accuracy on heavily occluded test cases while keeping overall detection rates comparable to models trained with paired diffusion data. Its performance comes close to that of fully supervised training on real occluded labels, even though no such labels are used during augmentation. This matters because labeled data for people under blankets is scarce, and the method scales by accepting arbitrary pose inputs without pixel-level source conditioning.

Core claim

We reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task and compare deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose-LDM achieves the highest strict localization accuracy under severe blanket,

What carries the argument

Pose-conditioned Latent Diffusion Model (Pose-LDM), which takes skeletal keypoints as input and outputs synthetic images of people covered by blankets to serve as augmentation data.
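
One way to picture a keypoint-only conditioning input is a stack of per-joint Gaussian heatmaps rasterized from the skeleton. The sketch below is an illustration under that assumption only: the text above does not say how Pose-LDM encodes keypoints, and the function name `keypoints_to_heatmaps`, the `sigma` default, and the grid size are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]


def keypoints_to_heatmaps(
    keypoints: Sequence[Tuple[float, float]],
    height: int,
    width: int,
    sigma: float = 2.0,
) -> List[Heatmap]:
    """Rasterize (x, y) pixel keypoints into one Gaussian heatmap per joint.

    Assumption: this encoding is illustrative; the actual conditioning
    format used by Pose-LDM is not described in the surrounding text.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    heatmaps: List[Heatmap] = []
    for kx, ky in keypoints:
        # Each map is indexed as heatmap[row y][column x].
        heatmap = [
            [
                math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        heatmaps.append(heatmap)
    return heatmaps


if __name__ == "__main__":
    # Two hypothetical joints on a tiny 8x8 grid.
    maps = keypoints_to_heatmaps([(2, 3), (5, 5)], height=8, width=8)
    print(len(maps), maps[0][3][2])
```

A generator conditioned this way never receives pixels from a source image, which is the property the summary above attributes to Pose-LDM; any pose that can be written as a keypoint list yields a valid conditioning input.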

If this is right

The method produces training images from arbitrary poses without requiring any paired visible-occluded source data.
Pose estimation under heavy occlusion reaches the highest strict localization accuracy among the tested augmentation strategies.
Overall detection performance remains comparable to paired diffusion translation while approaching fully supervised levels.
No changes to the original camera-based sensing pipeline are needed to gain the robustness benefit.
Generation can be performed at scale for any desired pose configuration once the model is trained.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same keypoint-to-image synthesis pipeline could supply augmentation data for other occluded human pose tasks such as hospital monitoring or clothed-body estimation.
Because the model never sees source images, it may enable rapid adaptation to new camera angles or blanket types by simply changing the input keypoints.
Combining the generated images with a small amount of real occluded labels might close the remaining gap to fully supervised performance.
The independence from paired data suggests the technique could be tested on non-bed occlusion problems such as people partially hidden by furniture.

Load-bearing premise

Images synthesized from keypoints alone are realistic and diverse enough to improve a downstream pose estimator on real occluded data without introducing artifacts that harm generalization.

What would settle it

Training the same pose estimator on real data augmented with Pose-LDM images yields no improvement or a drop in strict localization accuracy on a held-out set of severely occluded real bed images compared with training on real data alone.

Figures

Figures reproduced from arXiv: 2604.23651 by Cigdem Beyan, Marco Carletti, Navid Aslankhani Khameneh.

**Figure 1.** Overview of the heuristic blanket augmentation baseline. Lower …

**Figure 3.** Brownian Bridge Diffusion Model (BBDM) for paired domain …

**Figure 4.** Pose-Conditioned Latent Diffusion Model (Pose-LDM). A covered …

**Figure 5.** Qualitative comparison of pose estimation results under blanket occlusion. Each column corresponds to a pose estimator trained using a different …

**Figure 6.** Examples of samples generated under covered settings using different conditioning variants of our Pose-LDM with two different bedsheet types.

read the original abstract

Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose- LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust inbed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/GeoDiffPose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proposes generating occluded in-bed images directly from keypoints via a latent diffusion model to augment pose estimation, claiming better strict accuracy than masking or translation baselines, but the supporting details are thin.

The main thing here is a geometry-conditioned diffusion model called Pose-LDM that takes only skeletal keypoints and outputs blanket-covered images for training a pose estimator. This lets them create occluded examples from any pose without paired visible-occluded data or image conditioning, which is a practical shift from prior image-to-image methods. They compare it directly against deterministic masking, unpaired translation, and paired diffusion on the same fixed downstream backbone, and report that Pose-LDM gives the highest strict localization accuracy under severe occlusion while staying close to fully supervised levels overall.

Referee Report

3 major / 3 minor

Summary. The paper introduces Pose-LDM, a latent diffusion model conditioned only on skeletal keypoints, to synthesize blanket-occluded in-bed images for data augmentation. It compares this geometry-conditioned approach against deterministic masking, unpaired translation, and paired diffusion-based translation, evaluating all strategies solely via their effect on a fixed downstream pose estimator. The central claim is that Pose-LDM yields the highest strict localization accuracy under severe occlusion while preserving overall detection performance comparable to paired methods and approaching fully supervised baselines, without requiring visible-image conditioning or paired supervision.

Significance. If the reported gains hold under rigorous controls, the work demonstrates a supervision-efficient route to occlusion robustness that decouples generation from source imagery and enables arbitrary pose inputs. The systematic head-to-head comparison of augmentation strategies on a fixed backbone is a methodological strength, and the public code release supports reproducibility.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): the claim that Pose-LDM achieves the 'highest strict localization accuracy under severe occlusion' is presented without the underlying metrics (e.g., PCK@0.05, AP, or strict IoU thresholds), error bars, dataset splits, or number of runs. This prevents verification that the reported superiority is statistically meaningful rather than an artifact of a single split or uncontrolled variance.
[§3.2 and §4.3] §3.2 (Pose-LDM) and §4.3 (Generation quality): no independent metrics (FID, KID, or perceptual user study) or distribution-shift analysis are reported for the synthetic blanket-covered images. Without these, it remains unclear whether the downstream gains stem from improved realism/diversity or from incidental factors in the fixed-backbone protocol, directly testing the weakest assumption that keypoints alone suffice to match the real occluded distribution.
[§4.1] §4.1 (Baselines): the comparison to 'paired diffusion models' does not specify whether the paired baseline uses the same latent space, noise schedule, or training compute as Pose-LDM; any mismatch could confound the claim that geometry conditioning alone drives the improvement.
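
For reference, the strict metric named in the first major comment can be sketched as follows. This is a minimal illustration of PCK at a given threshold, assuming each image comes with a reference length (for example a bounding-box or torso size) used for normalization; the paper's exact normalization and threshold are not stated above, and `pck` and its parameters are hypothetical names.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(
    predicted: Sequence[Sequence[Point]],
    ground_truth: Sequence[Sequence[Point]],
    reference_lengths: Sequence[float],
    threshold: float = 0.05,
) -> float:
    """Fraction of keypoints whose error is at most threshold * reference length.

    Assumption: one reference length per image; the normalization used
    in the paper is not specified in the surrounding text.
    """
    if not (len(predicted) == len(ground_truth) == len(reference_lengths)):
        raise ValueError("inputs must describe the same number of images")
    correct = 0
    total = 0
    for pred_pts, gt_pts, ref in zip(predicted, ground_truth, reference_lengths):
        if len(pred_pts) != len(gt_pts):
            raise ValueError("each image needs matching keypoint counts")
        tolerance = threshold * ref
        for (px, py), (gx, gy) in zip(pred_pts, gt_pts):
            # Euclidean distance between predicted and annotated joint.
            if math.hypot(px - gx, py - gy) <= tolerance:
                correct += 1
            total += 1
    if total == 0:
        raise ValueError("no keypoints to evaluate")
    return correct / total


if __name__ == "__main__":
    # Hypothetical single image with two joints and a reference length of 100 px.
    pred = [[(10.0, 10.0), (20.0, 20.0)]]
    gt = [[(10.0, 12.0), (20.0, 30.0)]]
    print(pck(pred, gt, [100.0], threshold=0.05))
```

A small threshold such as 0.05 makes the metric strict: a joint that is off by 10 pixels on a 100-pixel reference counts as a miss, which is why the referee asks for the exact threshold and normalization to be reported.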

minor comments (3)

[Abstract] Abstract: 'inbed' should be hyphenated as 'in-bed' for consistency with the title and body.
[Abstract] Abstract: 'Pose- LDM' contains an extraneous space after the hyphen.
[§4] The manuscript would benefit from an explicit statement of the exact occlusion severity thresholds used to define 'severe occlusion' subsets in the evaluation tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that will improve the clarity and rigor of the manuscript without altering its core contributions.

Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that Pose-LDM achieves the 'highest strict localization accuracy under severe occlusion' is presented without the underlying metrics (e.g., PCK@0.05, AP, or strict IoU thresholds), error bars, dataset splits, or number of runs. This prevents verification that the reported superiority is statistically meaningful rather than an artifact of a single split or uncontrolled variance.

Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript we will explicitly report the underlying metrics (PCK@0.05, AP at multiple thresholds, and strict IoU), include mean and standard deviation computed over at least five independent runs with different random seeds, specify the exact train/validation/test splits, and add a short statistical significance analysis. revision: yes
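
The multi-seed reporting promised in this response could look like the sketch below, which summarizes a metric over independent runs and computes Welch's t statistic between two augmentation strategies. The helper names `summarize_runs` and `welch_t` are hypothetical, the scores in the usage lines are placeholders invented for illustration and are not results from the paper, and the choice of Welch's test is an assumption since the response does not name a specific significance analysis.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of one metric over several seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "runs": len(scores),
    }


def welch_t(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Welch's t statistic for two independent groups of runs."""
    if len(scores_a) < 2 or len(scores_b) < 2:
        raise ValueError("need at least two runs per group")
    var_a = statistics.variance(scores_a)
    var_b = statistics.variance(scores_b)
    standard_error = math.sqrt(var_a / len(scores_a) + var_b / len(scores_b))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / standard_error


if __name__ == "__main__":
    # Placeholder scores for five seeds each; NOT values from the paper.
    method_a = [0.62, 0.64, 0.63, 0.65, 0.61]
    method_b = [0.60, 0.61, 0.59, 0.62, 0.60]
    print(summarize_runs(method_a))
    print(welch_t(method_a, method_b))
```
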
Referee: [§3.2 and §4.3] §3.2 (Pose-LDM) and §4.3 (Generation quality): no independent metrics (FID, KID, or perceptual user study) or distribution-shift analysis are reported for the synthetic blanket-covered images. Without these, it remains unclear whether the downstream gains stem from improved realism/diversity or from incidental factors in the fixed-backbone protocol, directly testing the weakest assumption that keypoints alone suffice to match the real occluded distribution.

Authors: While the fixed-backbone downstream evaluation is the most direct test of augmentation utility for the target task, we acknowledge the value of complementary generation-quality metrics. We will add FID and KID scores (computed against real occluded images) together with a brief distribution-shift analysis in the revised §4.3. A perceptual user study lies outside the present scope and would require additional human-subject resources; we will note this limitation explicitly. revision: partial
Referee: [§4.1] §4.1 (Baselines): the comparison to 'paired diffusion models' does not specify whether the paired baseline uses the same latent space, noise schedule, or training compute as Pose-LDM; any mismatch could confound the claim that geometry conditioning alone drives the improvement.

Authors: We will revise §4.1 to state explicitly that the paired diffusion baseline employs the identical latent diffusion architecture, VAE latent space, noise schedule, number of training steps, batch size, and total compute budget as Pose-LDM. The only controlled difference is the conditioning input (paired visible images versus keypoints). This isolates the effect of geometry conditioning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation is externally grounded

The paper proposes Pose-LDM as a geometry-conditioned generator and evaluates its utility solely through downstream pose estimation accuracy on real occluded test images using a fixed backbone. No derivation reduces to self-definition, no fitted parameters are relabeled as predictions, and no load-bearing claims rely on self-citations or uniqueness theorems. All reported gains are measured against external benchmarks (deterministic masking, unpaired translation, paired diffusion, and fully supervised baselines), making the chain self-contained rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method assumes standard properties of latent diffusion models for conditional image synthesis and that skeletal keypoints provide sufficient geometry for realistic blanket rendering; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Latent diffusion models can produce realistic images when conditioned on pose geometry inputs.
Core to the Pose-LDM design for generating occluded scenes without source image conditioning.

invented entities (1)

Pose-LDM (no independent evidence)
purpose: Generate blanket-covered images directly from skeletal keypoints.
Proposed model variant for supervision-efficient augmentation.

pith-pipeline@v0.9.0 · 5515 in / 1122 out tokens · 24527 ms · 2026-05-08T06:30:28.423941+00:00 · methodology
