Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

Ariel Herrera; Atal Anil Kumar; Xueyang Kang

arxiv: 2606.06292 · v1 · pith:ZD4ZCKHInew · submitted 2026-06-04 · 💻 cs.CV · cs.RO

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

Ariel Herrera , Xueyang Kang , Atal Anil Kumar This is my paper

Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords synthetic data generationkeypoint detectionwrinkle detectioncloth manipulationbimanual roboticszero-shot transferdeformable object perceptioncomputer vision

0 comments

The pith

A Blender synthetic pipeline with limited real labels trains cloth keypoint and wrinkle detectors that transfer zero-shot to physical fabrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a synthetic data generation method in Blender that automatically annotates keypoints on rendered cloth images and combines these with manually labeled real images to train detection models. These models feed a perception framework consisting of a CNN for keypoints and a YOLOv8-OpenCV pipeline for wrinkles, which in turn drives a bimanual manipulation sequence that stretches folded garments using wrinkle grasps before switching to keypoint-based ironing. The keypoint detector reports a mean position error of 1.7615 pixels and the full system operates on real fabrics without further training while beating baselines that break under heavy occlusion or produce false positives on deep folds. A reader cares because visual state estimation for continuously deforming textiles has been a persistent barrier to reliable robotic handling of laundry and garments. If the transfer result holds, it shows that mixed synthetic-real training can bypass the usual requirement for large manually annotated real-world datasets in this domain.

Core claim

The authors present a perception framework that integrates a CNN for permutation-invariant keypoint detection with a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm first uses detected wrinkles to stretch fully folded garments and then switches to keypoint-based ironing once corners become visible. The keypoint model reaches a mean position error of 1.7615 pixels. The perception system transfers directly to physical fabrics without fine-tuning and outperforms baselines that fail in high-occlusion states or yield false positives on severe folds.

What carries the argument

The Blender-based synthetic rendering pipeline that auto-generates keypoint annotations, mixed with manually labeled real images, to train the CNN keypoint detector and YOLO wrinkle extractor.

If this is right

The bimanual algorithm can stretch fully folded garments by first grasping on structural wrinkles before corners are visible.
Once corners emerge the system switches to keypoint detection for subsequent ironing motions without additional training.
The perception pipeline operates on real fabrics in zero-shot fashion and maintains performance where prior methods produce false positives or fail under occlusion.
Mean position error of 1.7615 pixels on keypoints provides sufficient precision for the described grasping and ironing sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthetic-plus-limited-real strategy could be applied to other classes of deformable objects such as ropes or soft packaging if analogous auto-annotation pipelines are built.
Pairing the vision output with force or tactile sensing might reduce reliance on perfect visual detection during dynamic manipulation phases.
Extending the evaluation to a broader range of fabric materials, colors, and lighting would test whether the reported transfer holds outside the current test distribution.

Load-bearing premise

The mixture of Blender synthetic renders and manually labeled real data is representative enough of real-world cloth deformations, self-occlusions, and appearance variations to support zero-shot transfer to physical fabrics.

What would settle it

Run the trained keypoint detector on a new collection of physical fabric images containing folds and occlusions absent from the training mix and check whether the mean position error stays near 1.76 pixels or rises sharply.

Figures

Figures reproduced from arXiv: 2606.06292 by Ariel Herrera, Atal Anil Kumar, Xueyang Kang.

**Figure 3.** Figure 3: Training and validation metrics over epochs. The left plot [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Vision framework evaluation on synthetic data. (a) Synthetic [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Sim-to-Real transfer performance on physical fabrics. (a) Keypoint [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

read the original abstract

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a usable synthetic-plus-real pipeline for cloth keypoint and wrinkle detection that feeds a bimanual manipulation heuristic, but the zero-shot transfer numbers rest on thin evaluation details.

read the letter

This paper gives a working example of using Blender synthetic data plus limited real labels to train models for cloth wrinkle and keypoint detection that then support a bimanual folding strategy. The transfer to real fabrics is the part to watch.

It combines standard CNN and YOLOv8 with a custom heuristic for when to use which feature. The auto-annotation in synthetic renders is a practical step that avoids heavy manual labeling.

The results look promising on paper with the low MPE and outperformance in occlusion cases. But the lack of dataset sizes, training protocols, and clear separation between training real data and test real data makes it difficult to judge how robust the zero-shot claim really is. The MPE is reported in pixels without context on image resolution or physical error, which is a minor but noticeable gap.

If the full paper includes ablations and statistical tests, it strengthens the case. Otherwise the soundness rests mostly on the empirical transfer success.

This is for roboticists working on textile handling who might want to replicate or build on the pipeline. It is not a theoretical advance but a solid application piece.

I would recommend sending it for peer review so the community can see the full methods and verify the real-world results.

Referee Report

2 major / 0 minor

Summary. The paper develops a Blender-based synthetic data pipeline that auto-generates annotated keypoints on cloth meshes, augments this with manually labeled real images, and trains a CNN for permutation-invariant keypoint detection together with a YOLOv8-OpenCV pipeline for structural wrinkle extraction. These components feed a bimanual manipulation algorithm that first stretches fully folded garments using detected wrinkles and then switches to keypoint-based ironing once corners become visible. The keypoint detector is reported to achieve a mean position error of 1.7615 pixels, and the overall perception system is stated to transfer zero-shot to physical fabrics while outperforming baselines that fail under high occlusion or produce false positives on severe folds.

Significance. If the zero-shot transfer result is substantiated with proper held-out real-world evaluation, the work would offer a concrete route to mitigating the annotated-data bottleneck for deformable-object perception and could support more reliable bimanual textile handling. The synthetic pipeline and the wrinkle-to-keypoint transition strategy constitute practical engineering contributions that other groups could replicate.

major comments (2)

[Abstract] Abstract: The central claim that the perception system 'transfers to physical fabrics without fine-tuning' and 'outperforms baselines' in high-occlusion states rests on the reported MPE of 1.7615 pixels, yet the abstract supplies no information on whether the test images are held-out real photographs, synthetic renders, or a mixture, nor any conversion to physical units or failure-rate statistics under the exact occlusion conditions cited. This information is load-bearing for the transfer assertion.
[Abstract] Abstract / Results: No dataset sizes, training schedules, baseline implementations, or statistical significance tests are described, preventing assessment of whether the claimed superiority over baselines that 'fail in high-occlusion states' is robust or potentially affected by post-hoc case selection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional clarity is needed to substantiate the zero-shot transfer claims and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the perception system 'transfers to physical fabrics without fine-tuning' and 'outperforms baselines' in high-occlusion states rests on the reported MPE of 1.7615 pixels, yet the abstract supplies no information on whether the test images are held-out real photographs, synthetic renders, or a mixture, nor any conversion to physical units or failure-rate statistics under the exact occlusion conditions cited. This information is load-bearing for the transfer assertion.

Authors: We agree the abstract should explicitly state the evaluation details. The reported MPE is measured on a held-out set of real photographs (distinct from training data), as described in the Experiments section. We will revise the abstract to clarify this and add failure-rate statistics for high-occlusion cases. Pixel error follows standard practice in keypoint detection; physical-unit conversion requires per-setup calibration and is not reported in the literature for this task. revision: yes
Referee: [Abstract] Abstract / Results: No dataset sizes, training schedules, baseline implementations, or statistical significance tests are described, preventing assessment of whether the claimed superiority over baselines that 'fail in high-occlusion states' is robust or potentially affected by post-hoc case selection.

Authors: Dataset sizes, training schedules, and baseline implementations (including adaptations) are provided in the Experiments section with a fixed held-out real test set to ensure reproducibility and avoid post-hoc selection. The abstract is space-constrained, but we will add a brief reference to these details or the relevant section. Statistical significance tests were not performed; we can include them in a revision if requested. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical results without derivations or self-referential fits

full rationale

The paper presents a synthetic Blender pipeline combined with manual labeling to train CNN and YOLOv8 models for keypoint and wrinkle detection on cloth. The central claims rest on reported empirical metrics (MPE of 1.7615 pixels) and observed real-world transfer performance, with no equations, derivations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing steps reduce to inputs by construction; the work is a standard data-driven computer vision pipeline whose validity is externally falsifiable via the described physical fabric tests.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; standard computer-vision training assumptions are implicit but not detailed.

pith-pipeline@v0.9.1-grok · 5687 in / 1078 out tokens · 29045 ms · 2026-06-28T02:12:33.910538+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 3 linked inside Pith

[1]

Learning to estimate the pose of deformable objects,

T. Lipset al., “Learning to estimate the pose of deformable objects,” in IEEE International Conference on Robotics and Automation (ICRA), 2022

2022
[2]

Design and implementation of fabric wrinkle detection system based on yolov5 algorithm,

C. Liet al., “Design and implementation of fabric wrinkle detection system based on yolov5 algorithm,”Research Article, 2023

2023
[3]

Recognition of grasp points for cloth manipulation,

L. M. Mart ´ınez and J. Ruiz-del Solar, “Recognition of grasp points for cloth manipulation,” in2013 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2013, pp. 2399–2404

2013
[4]

Dressing-as-a-service: A cloud-based framework for assistive dressing with a bimanual robot,

C. Xuet al., “Dressing-as-a-service: A cloud-based framework for assistive dressing with a bimanual robot,” inCompanion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 2024, pp. 1116–1120

2024
[5]

Robust data association for object-level semantic slam,

X. Kang and S. Yuan, “Robust data association for object-level semantic slam,”arXiv preprint arXiv:1909.13493, 2019

arXiv 1909
[6]

Few-click-driven interactive 3d segmentation with semantic embedding,

X. Kang, Z. Yu, K. Khoshelham, and L. Nan, “Few-click-driven interactive 3d segmentation with semantic embedding,”arXiv preprint arXiv:2605.08925, 2026

Pith/arXiv arXiv 2026
[7]

Hierarchical point-patch fusion with adaptive patch codebook for 3d shape anomaly detection,

X. Kang, Z. Li, T. Lan, D. Gong, K. Khoshelham, and L. Nan, “Hierarchical point-patch fusion with adaptive patch codebook for 3d shape anomaly detection,”arXiv preprint arXiv:2604.03972, 2026

Pith/arXiv arXiv 2026
[8]

Look beyond: Two-stage scene view generation via panorama and video diffusion,

X. Kang, Z. Xiang, Z. Zhang, and K. Khoshelham, “Look beyond: Two-stage scene view generation via panorama and video diffusion,” inProceedings of the 33rd ACM International Conference on Multi- media, 2025, pp. 9375–9384

2025
[9]

A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai,

L. H. K. Wong, X. Kang, K. Bai, and J. Zhang, “A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai,”arXiv preprint arXiv:2505.01458, 2025

Pith/arXiv arXiv 2025
[10]

Blender - a 3d modelling and rendering package,

Blender Online Community, “Blender - a 3d modelling and rendering package,” Blender Foundation, Stichting Blender Foundation, Amsterdam, 2021. [Online]. Available: http://www.blender.org

2021
[11]

Ambientcg public texture library,

AmbientCG, “Ambientcg public texture library,” 2023. [Online]. Available: https://ambientcg.com/

2023
[12]

Cgbookcase public texture library,

CGBookcase, “Cgbookcase public texture library,” 2023. [Online]. Available: https://www.cgbookcase.com/

2023
[13]

Polyhaven public texture library,

PolyHaven, “Polyhaven public texture library,” 2023. [Online]. Available: https://polyhaven.com/

2023
[14]

Wrinkle detector 2.0 dataset,

Fabric Accessor, “Wrinkle detector 2.0 dataset,” Roboflow Uni- verse, May 2024, [Online]. Available: https://universe.roboflow.com/ fabric-accessor/wrinkle-detector-2.0. [Accessed: Sep. 22, 2025]

2024
[15]

AnyLabeling: A Segment Anything Model (SAM) based labeling tool,

V .-A. Nguyen, “AnyLabeling: A Segment Anything Model (SAM) based labeling tool,” https://github.com/vietanhdev/anylabeling, 2023

2023
[16]

Ultralytics YOLOv8,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” https: //github.com/ultralytics/ultralytics, 2023

2023
[17]

Newton dynamics physics engine,

J. Jerez, “Newton dynamics physics engine,” 2024, [Online]. Avail- able: http://newtondynamics.com

2024

[1] [1]

Learning to estimate the pose of deformable objects,

T. Lipset al., “Learning to estimate the pose of deformable objects,” in IEEE International Conference on Robotics and Automation (ICRA), 2022

2022

[2] [2]

Design and implementation of fabric wrinkle detection system based on yolov5 algorithm,

C. Liet al., “Design and implementation of fabric wrinkle detection system based on yolov5 algorithm,”Research Article, 2023

2023

[3] [3]

Recognition of grasp points for cloth manipulation,

L. M. Mart ´ınez and J. Ruiz-del Solar, “Recognition of grasp points for cloth manipulation,” in2013 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2013, pp. 2399–2404

2013

[4] [4]

Dressing-as-a-service: A cloud-based framework for assistive dressing with a bimanual robot,

C. Xuet al., “Dressing-as-a-service: A cloud-based framework for assistive dressing with a bimanual robot,” inCompanion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 2024, pp. 1116–1120

2024

[5] [5]

Robust data association for object-level semantic slam,

X. Kang and S. Yuan, “Robust data association for object-level semantic slam,”arXiv preprint arXiv:1909.13493, 2019

arXiv 1909

[6] [6]

Few-click-driven interactive 3d segmentation with semantic embedding,

X. Kang, Z. Yu, K. Khoshelham, and L. Nan, “Few-click-driven interactive 3d segmentation with semantic embedding,”arXiv preprint arXiv:2605.08925, 2026

Pith/arXiv arXiv 2026

[7] [7]

Hierarchical point-patch fusion with adaptive patch codebook for 3d shape anomaly detection,

X. Kang, Z. Li, T. Lan, D. Gong, K. Khoshelham, and L. Nan, “Hierarchical point-patch fusion with adaptive patch codebook for 3d shape anomaly detection,”arXiv preprint arXiv:2604.03972, 2026

Pith/arXiv arXiv 2026

[8] [8]

Look beyond: Two-stage scene view generation via panorama and video diffusion,

X. Kang, Z. Xiang, Z. Zhang, and K. Khoshelham, “Look beyond: Two-stage scene view generation via panorama and video diffusion,” inProceedings of the 33rd ACM International Conference on Multi- media, 2025, pp. 9375–9384

2025

[9] [9]

A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai,

L. H. K. Wong, X. Kang, K. Bai, and J. Zhang, “A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai,”arXiv preprint arXiv:2505.01458, 2025

Pith/arXiv arXiv 2025

[10] [10]

Blender - a 3d modelling and rendering package,

Blender Online Community, “Blender - a 3d modelling and rendering package,” Blender Foundation, Stichting Blender Foundation, Amsterdam, 2021. [Online]. Available: http://www.blender.org

2021

[11] [11]

Ambientcg public texture library,

AmbientCG, “Ambientcg public texture library,” 2023. [Online]. Available: https://ambientcg.com/

2023

[12] [12]

Cgbookcase public texture library,

CGBookcase, “Cgbookcase public texture library,” 2023. [Online]. Available: https://www.cgbookcase.com/

2023

[13] [13]

Polyhaven public texture library,

PolyHaven, “Polyhaven public texture library,” 2023. [Online]. Available: https://polyhaven.com/

2023

[14] [14]

Wrinkle detector 2.0 dataset,

Fabric Accessor, “Wrinkle detector 2.0 dataset,” Roboflow Uni- verse, May 2024, [Online]. Available: https://universe.roboflow.com/ fabric-accessor/wrinkle-detector-2.0. [Accessed: Sep. 22, 2025]

2024

[15] [15]

AnyLabeling: A Segment Anything Model (SAM) based labeling tool,

V .-A. Nguyen, “AnyLabeling: A Segment Anything Model (SAM) based labeling tool,” https://github.com/vietanhdev/anylabeling, 2023

2023

[16] [16]

Ultralytics YOLOv8,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” https: //github.com/ultralytics/ultralytics, 2023

2023

[17] [17]

Newton dynamics physics engine,

J. Jerez, “Newton dynamics physics engine,” 2024, [Online]. Avail- able: http://newtondynamics.com

2024