FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction
Pith reviewed 2026-05-19 01:12 UTC · model grok-4.3
The pith
Conditioning bimanual action values on frozen 3D geometric features lets robots smooth unseen garments with far smaller performance loss than baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FCBV-Net conditions bimanual action value prediction on pre-trained and frozen dense geometric features extracted from 3D point clouds. The geometric front end stays static across intra-category variations while trainable downstream modules learn task-specific bimanual policies. This decoupling produces superior category-level generalization in simulated garment smoothing, evidenced by substantially lower efficiency degradation and higher final coverage on garments never seen during training.
What carries the argument
The Feature-Conditioned Bimanual Value Network (FCBV-Net), which feeds fixed geometric features from 3D point clouds into a trainable value head to separate perception from action evaluation.
If this is right
- Bimanual policies can be learned more reliably once geometric structure is supplied by a static feature extractor.
- Category-level performance improves when geometric understanding is not updated jointly with action values.
- 3D point-cloud conditioning outperforms both 2D image baselines and fixed-primitive 3D methods on coverage and generalization metrics.
- Downstream task-specific training on frozen features suffices for high final coverage in simulation.
Where Pith is reading between the lines
- The same separation of frozen geometry from value learning could be tested on related tasks such as folding or unfolding.
- Real-robot deployment would reveal whether simulation gains survive contact dynamics and sensor noise.
- Replacing the frozen feature extractor with newer pre-trained models might further reduce the observed 11.5 percent drop.
Load-bearing premise
Pre-trained dense geometric features from 3D point clouds remain informative enough for bimanual value prediction even when garment shapes vary within a category and receive no further adaptation.
What would settle it
Running the same trained FCBV-Net on a new collection of garments whose shape variations exceed those in the CLOTH3D unseen set and finding that its efficiency drop matches or exceeds the 96 percent drop seen in the 2D baseline.
Figures
read the original abstract
Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite Category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FCBV-Net, a Feature-Conditioned Bimanual Value Network that operates on 3D point clouds for category-level robotic garment smoothing. It conditions bimanual action value prediction on pre-trained, frozen dense geometric features to promote robustness to intra-category variations, with only the downstream components trained for the task-specific policy. In PyFlex simulations on the CLOTH3D dataset, FCBV-Net reports an 11.5% efficiency drop on unseen garments (compared to 96.2% for a 2D image-based baseline) and 89% final coverage (outperforming 83% for a 3D correspondence-based baseline using identical features but a fixed primitive), attributing the gains to the decoupling of geometric understanding from bimanual action value learning.
Significance. If the central claim holds, the work provides a useful separation of concerns for deformable object manipulation, allowing pre-trained geometric features to handle category-level variations while learning focuses on synergistic bimanual value prediction. The public release of code, videos, and supplementary materials is a clear strength for reproducibility. The simulation results on held-out garments offer moderate evidence for improved generalization over the reported baselines, though the overall significance for robotic applications remains limited by the absence of real-world validation and detailed statistical analysis.
major comments (2)
- Method section: The central claim that decoupling geometric understanding from bimanual action value learning enables better category-level generalization rests on the assumption that pre-trained frozen dense geometric features from 3D point clouds are sufficiently informative and robust across intra-category CLOTH3D variations without adaptation. No ablations are described that compare frozen vs. fine-tuned features or test alternative backbones, leaving open the possibility that performance (11.5% efficiency drop, 89% coverage) is an artifact of the specific feature extractor rather than the proposed conditioning architecture.
- Experiments section: The quantitative gains over baselines are presented as evidence for superior generalization, but the results lack reported statistical significance, variance across runs, number of trials, or hyperparameter sensitivity analysis. This weakens the ability to interpret the 11.5% vs. 96.2% efficiency drop and 89% vs. 83% coverage as robust support for the decoupling hypothesis.
minor comments (2)
- Abstract: The metric 'Steps80' is referenced without definition or pointer to its precise formulation; this should be clarified when first introduced in the main text.
- The manuscript would benefit from an explicit statement of the pre-training dataset and backbone architecture used for the dense geometric features to allow readers to assess transfer assumptions.
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have helped us identify areas to improve the clarity and rigor of our work. We address the major comments point by point below and indicate the revisions we plan to make.
read point-by-point responses
-
Referee: Method section: The central claim that decoupling geometric understanding from bimanual action value learning enables better category-level generalization rests on the assumption that pre-trained frozen dense geometric features from 3D point clouds are sufficiently informative and robust across intra-category CLOTH3D variations without adaptation. No ablations are described that compare frozen vs. fine-tuned features or test alternative backbones, leaving open the possibility that performance (11.5% efficiency drop, 89% coverage) is an artifact of the specific feature extractor rather than the proposed conditioning architecture.
Authors: We agree that ablations on the feature extractor would provide additional support for our central claim. In the revised manuscript, we will add experiments comparing frozen versus fine-tuned features using the same backbone. Additionally, we will evaluate an alternative pre-trained feature backbone to show that the performance gains stem from the proposed conditioning and bimanual value prediction architecture rather than the specific choice of features. revision: yes
-
Referee: Experiments section: The quantitative gains over baselines are presented as evidence for superior generalization, but the results lack reported statistical significance, variance across runs, number of trials, or hyperparameter sensitivity analysis. This weakens the ability to interpret the 11.5% vs. 96.2% efficiency drop and 89% vs. 83% coverage as robust support for the decoupling hypothesis.
Authors: We agree that including statistical details would strengthen the presentation of our results. In the revised manuscript, we will report the number of evaluation trials, include error bars or standard deviations from multiple runs, perform statistical significance tests on the key metrics, and provide a hyperparameter sensitivity analysis in the supplementary materials. revision: yes
Circularity Check
No significant circularity detected; empirical claims rest on held-out evaluation
full rationale
The paper's central claim—that decoupling geometric feature extraction from bimanual value learning improves category-level generalization—is advanced via an empirical architecture (pre-trained frozen dense features from 3D point clouds feeding a trainable value head) and validated on held-out garments from the CLOTH3D dataset in PyFlex simulation. Comparisons to a 2D baseline and a 3D correspondence baseline using identical per-point features but a fixed primitive provide external grounding. No load-bearing equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the design choice is presented as a hypothesis tested against independent test splits rather than derived by construction from its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- Network hyperparameters and training schedule
axioms (1)
- domain assumption Pre-trained dense geometric features capture all necessary intra-category variations for bimanual smoothing
Reference graph
Works this paper leans on
-
[1]
Ide ntifying the potential for robotics to assist older adults in differe nt living environments,
T. L. Mitzner, T. L. Chen, C. C. Kemp, and W. A. Rogers, “Ide ntifying the potential for robotics to assist older adults in differe nt living environments,” International Journal of Social Robotics , vol. 6, no. 2, pp. 213–227, Apr 2014
work page 2014
-
[2]
Managing activity difficulties at ho me: A survey of medicare beneficiaries,
B. J. Dudgeon, J. M. Hoffman, M. A. Ciol, A. Shumway-Cook, K. M. Y orkston, and L. Chan, “Managing activity difficulties at ho me: A survey of medicare beneficiaries,” Archives of Physical Medicine and Rehabilitation, vol. 89, no. 7, pp. 1256–1261, Jul 2008
work page 2008
-
[3]
Challenges and outlook in robotic manipula tion of deformable objects,
J. Zhu, A. Cherubini, C. Dune, D. Navarro-Alarcon, F. Ala mbeigi, D. Berenson, F. Ficuciello, K. Harada, J. Kober, X. Li, J. Pan , W. Y uan, and M. Gienger, “Challenges and outlook in robotic manipula tion of deformable objects,” IEEE Robotics & Automation Magazine , vol. 29, no. 3, pp. 67–77, 2022
work page 2022
-
[4]
R. Wu, H. Lu, Y . Wang, Y . Wang, and H. Dong, “Unigarmentman ip: A unified framework for category-level garment manipulatio n via dense visual correspondence,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2024, pp. 16 340–16 350
work page 2024
-
[5]
Speedfolding: Learning efficient bimanual folding of garm ents,
Y . Avigal, L. Berscheid, T. Asfour, T. Kr¨ oger, and K. Gol dberg, “Speedfolding: Learning efficient bimanual folding of garm ents,” in 2022 IEEE/RSJ International Conference on Intelligent Rob ots and Systems (IROS), 2022, pp. 1–8
work page 2022
-
[6]
Cloth funnels: Canonicalized-alignment for mult i-purpose garment manipulation,
A. Canberk, C. Chi, H. Ha, B. Burchfiel, E. Cousineau, S. Fe ng, and S. Song, “Cloth funnels: Canonicalized-alignment for mult i-purpose garment manipulation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA) , 2023, pp. 5872–5879
work page 2023
-
[7]
Unifolding : Towards sample-efficient, scalable, and generalizable robotic gar ment folding,
H. Xue, Y . Li, W. Xu, H. Li, D. Zheng, and C. Lu, “Unifolding : Towards sample-efficient, scalable, and generalizable robotic gar ment folding,” arXiv preprint arXiv:2311.01267 , 2023
-
[8]
Folding clothes autonomously: A complete pipeline,
A. Doumanoglou, J. Stria, G. Peleka, I. Mariolis, V . Petr ´ ık, A. Kargakos, L. Wagner, V . Hlav´ aˇ c, T.-K. Kim, and S. Malassiotis, “Folding clothes autonomously: A complete pipeline,” IEEE Transactions on Robotics , vol. 32, no. 6, pp. 1461–1478, 2016
work page 2016
-
[9]
J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P . Ab beel, “Cloth grasp point detection based on multiple-view geometric cue s with appli- cation to robotic towel folding,” in 2010 IEEE International Conference on Robotics and Automation , 2010, pp. 2308–2315
work page 2010
-
[10]
Skeleton merger: An uns upervised aligned keypoint detector,
R. Shi, Z. Xue, Y . Y ou, and C. Lu, “Skeleton merger: An uns upervised aligned keypoint detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2021, pp. 43–52
work page 2021
-
[11]
Clothesnet: An information-rich 3d garment model repository with simulat ed clothes environment,
B. Zhou, H. Zhou, T. Liang, Q. Y u, S. Zhao, Y . Zeng, J. Lv, S . Luo, Q. Wang, X. Y u, H. Chen, C. Lu, and L. Shao, “Clothesnet: An information-rich 3d garment model repository with simulat ed clothes environment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , October 2023, pp. 20 428–20 438
work page 2023
-
[12]
Learning foresightful dens e visual affordance for deformable object manipulation,
R. Wu, C. Ning, and H. Dong, “Learning foresightful dens e visual affordance for deformable object manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV), October 2023, pp. 10 947–10 956
work page 2023
-
[13]
Dualafford: Learning collaborative visual affordance fo r dual-gripper manipulation,
Y . Zhao, R. Wu, Z. Chen, Y . Zhang, Q. Fan, K. Mo, and H. Dong , “Dualafford: Learning collaborative visual affordance fo r dual-gripper manipulation,” in International Conference on Learning Representa- tions, 2023
work page 2023
-
[14]
A heuristic-based approach for flattening wrinkled clothe s,
L. Sun, G. Aragon-Camarasa, P . Cockshott, S. Rogers, an d J. P . Siebert, “A heuristic-based approach for flattening wrinkled clothe s,” in Towards Autonomous Robotic Systems , A. Natraj, S. Cameron, C. Melhuish, and M. Witkowski, Eds. Berlin, Heidelberg: Springer Berlin Hei delberg, 2014, pp. 148–160
work page 2014
-
[15]
Model for unf olding laundry using interactive perception,
B. Willimon, S. Birchfield, and I. Walker, “Model for unf olding laundry using interactive perception,” in 2011 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems , 2011, pp. 4871–4876
work page 2011
-
[16]
D. Seita, P . Florence, J. Tompson, E. Coumans, V . Sindhw ani, K. Gold- berg, and A. Zeng, “Learning to rearrange deformable cables , fabrics, and bags with goal-conditioned transporter networks,” in 2021 IEEE International Conference on Robotics and Automation (ICRA ), 2021, pp. 4568–4575
work page 2021
-
[17]
Learning dense visual corresponde nces in simulation to smooth and fold real fabrics,
A. Ganapathi, P . Sundaresan, B. Thananjeyan, A. Balakr ishna, D. Seita, J. Grannen, M. Hwang, R. Hoque, J. E. Gonzalez, N. Jamali, K. Y amane, S. Iba, and K. Goldberg, “Learning dense visual corresponde nces in simulation to smooth and fold real fabrics,” in 2021 IEEE International Conference on Robotics and Automation (ICRA) , 2021, pp. 11 515– 11 522
work page 2021
-
[18]
Flingbot: The unreasonable effectiv eness of dynamic manipulation for cloth unfolding,
H. Ha and S. Song, “Flingbot: The unreasonable effectiv eness of dynamic manipulation for cloth unfolding,” in Proceedings of the 5th Conference on Robot Learning , ser. Proceedings of Machine Learning Research, A. Faust, D. Hsu, and G. Neumann, Eds., vol. 164. PM LR, 08–11 Nov 2022, pp. 24–33
work page 2022
-
[19]
VisuoSpatial foresight for multi-step, multi-task fabric manipulation,
R. Hoque, D. Seita, A. Balakrishna, A. Ganapathi, A. K. T anwani, N. Jamali, K. Y amane, S. Iba, and K. Goldberg, “VisuoSpatial foresight for multi-step, multi-task fabric manipulation,” 2020
work page 2020
-
[20]
Pybullet, a python module for phy sics simulation for games, robotics and machine learning,
E. Coumans and Y . Bai, “Pybullet, a python module for phy sics simulation for games, robotics and machine learning,” 2016
work page 2016
-
[21]
Softgym: Benchma rking deep reinforcement learning for deformable object manipul ation,
X. Lin, Y . Wang, J. Olkin, and D. Held, “Softgym: Benchma rking deep reinforcement learning for deformable object manipul ation,” in Proceedings of the 2020 Conference on Robot Learning , ser. Proceedings of Machine Learning Research, J. Kober, F. Ramos, and C. Toml in, Eds., vol. 155. PMLR, 16–18 Nov 2021, pp. 432–448
work page 2020
-
[22]
Un ified particle physics for real-time applications,
M. Macklin, M. M¨ uller, N. Chentanez, and T.-Y . Kim, “Un ified particle physics for real-time applications,” ACM Trans. Graph. , vol. 33, no. 4, Jul. 2014
work page 2014
-
[23]
Garmentlab: A unified simulation and benchmark for garment manipulation,
H. Lu, R. Wu, Y . Li, S. Li, Z. Zhu, C. Ning, Y . Shen, L. Luo, Y . Chen, and H. Dong, “Garmentlab: A unified simulation and benchmark for garment manipulation,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, I nc., 2024, pp. 11 866–11 903
work page 2024
-
[24]
Z. Zhou, J. Song, X. Xie, Z. Shu, L. Ma, D. Liu, J. Yin, and S. See, “Towards building ai-cps with nvidia isaac sim: An in dustrial benchmark and case study for robotics manipulation,” in Proceedings of the 46th International Conference on Software Engineering : Software Engineering in Practice , ser. ICSE-SEIP ’24. New Y ork, NY , USA: Association for Co...
work page 2024
-
[25]
Pointnet++: Deep hierarchical feature learning on point sets in a metric space,
C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems , I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. , vol. 30. Curran Associates, Inc., 2017
work page 2017
-
[26]
H. Bertiche, M. Madadi, and S. Escalera, “Cloth3d: Clot hed 3d humans,” in Computer Vision – ECCV 2020 , A. V edaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing , 2020, pp. 344–359
work page 2020
-
[27]
Adam: A Method for Stochastic Optimization
K. D. B. J. Adam et al. , “A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, vol. 1412, no. 6, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.