GarmNet: Improving Global with Local Perception for Robotic Laundry Folding
Pith reviewed 2026-05-25 12:23 UTC · model grok-4.3
The pith
GarmNet performs garment localization and landmark detection together in one network, cutting localization error by 24.7 percent on the CloPeMa dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GarmNet simultaneously localizes the garment as a whole and detects landmarks for grasping. Localization supplies global information to recognize garment category, while landmark detection supports grasping actions. When landmark detection is included, garment localization error drops by 24.7 percent compared with localization alone.
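The 24.7 percent figure is most naturally read as a relative reduction in error rather than a drop of 24.7 percentage points, although the abstract's wording leaves this open. The sketch below shows the arithmetic for that reading. The function name and the two error rates are hypothetical, chosen only for illustration, and are not values reported in the paper.

```python
def relative_error_reduction(baseline_error: float, joint_error: float) -> float:
    """Return the error reduction as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - joint_error) / baseline_error


# Hypothetical error rates, used only to illustrate the arithmetic.
# They are NOT the numbers reported for GarmNet-A or GarmNet-B.
baseline = 0.20  # localization-only model
joint = 0.15     # localization + landmark model

reduction = relative_error_reduction(baseline, joint)
print(f"relative reduction: {reduction:.1f}%")  # prints "relative reduction: 25.0%"
```

Under this reading, the same two numbers correspond to a drop of only 5 percentage points in absolute terms, which is why the distinction matters when comparing against other work.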
What carries the argument
GarmNet, an end-to-end deep learning model that jointly outputs garment localization and landmark detections.
If this is right
- Robots obtain both category recognition and grasping cues from one forward pass, reducing separate processing steps.
- The combined representation supports handling a wider range of crumpled garment configurations than single-task models.
- Memory and compute stay low enough for deployment on robotic platforms that must run multiple domestic tasks.
- The same joint-perception pattern can be applied to other garment types in the dataset without redesigning separate networks.
Where Pith is reading between the lines
- If the joint-training benefit generalizes, similar multi-task networks could reduce error in other robotic perception problems that combine global scene understanding with local action points.
- Real-robot folding trials would be needed to check whether the dataset error reduction produces higher end-to-end success rates under variable lighting and fabric stretch.
- The approach leaves open whether adding more auxiliary tasks, such as grasp quality prediction, would yield further localization gains.
Load-bearing premise
The reported error reduction is produced by the joint training of localization and landmark detection rather than by differences in model size, training details, or data handling.
What would settle it
Training two models of identical capacity on the identical CloPeMa split, one with only localization and one with both tasks, and finding no meaningful difference in localization error.
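One way to make "no meaningful difference" concrete is to pair the two models by random seed and look at the per-seed difference in localization error. The following is a minimal sketch under that assumption. The helper name and all error values are hypothetical, and the paper does not describe such an analysis.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(errors_a, errors_b):
    """Summarize per-seed differences (A minus B) between two models.

    errors_a[i] and errors_b[i] must come from the same seed and data split.
    Returns (mean difference, standard error of the mean difference).
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Hypothetical per-seed localization errors, for illustration only.
loc_only = [0.21, 0.19, 0.20, 0.22]  # localization-only model
joint = [0.16, 0.15, 0.17, 0.16]     # localization + landmark model

mean_diff, sem = paired_summary(loc_only, joint)
print(f"mean difference: {mean_diff:.3f}, standard error: {sem:.4f}")
```

A mean difference that is small relative to its standard error would be the outcome that undermines the claim; where exactly to draw that line is a judgment call that the page does not specify.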
Original abstract
Developing autonomous assistants to help with domestic tasks is a vital topic in robotics research. Among these tasks, garment folding is one of them that is still far from being achieved mainly due to the large number of possible configurations that a crumpled piece of clothing may exhibit. Research has been done on either estimating the pose of the garment as a whole or detecting the landmarks for grasping separately. However, such works constrain the capability of the robots to perceive the states of the garment by limiting the representations for one single task. In this paper, we propose a novel end-to-end deep learning model named GarmNet that is able to simultaneously localize the garment and detect landmarks for grasping. The localization of the garment represents the global information for recognising the category of the garment, whereas the detection of landmarks can facilitate subsequent grasping actions. We train and evaluate our proposed GarmNet model using the CloPeMa Garment dataset that contains 3,330 images of different garment types in different poses. The experiments show that the inclusion of landmark detection (GarmNet-B) can largely improve the garment localization, with an error rate of 24.7% lower. Solutions as ours are important for robotics applications, as these offer scalable to many classes, memory and processing efficient solutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GarmNet, an end-to-end CNN for simultaneous garment localization (global category recognition) and landmark detection (for grasping) in robotic laundry folding. It evaluates the model on the CloPeMa Garment dataset (3,330 images) and reports that adding landmark detection (GarmNet-B) reduces garment localization error by 24.7% compared to localization-only training.
Significance. If the reported improvement is attributable to joint training rather than capacity differences and generalizes beyond the fixed dataset, the multi-task formulation could provide a scalable, efficient perception module for domestic robotics tasks involving deformable objects. The work addresses a practical gap between separate global-pose and local-landmark pipelines.
major comments (3)
- [Abstract / Experiments] Abstract and experiments section: The central claim of a 24.7% localization error reduction for GarmNet-B is presented without any description of the baseline architecture (GarmNet-A), parameter counts, backbone depth, training schedule, data augmentation, or loss-weighting scheme. Without an explicit statement that the only difference is the added landmark head and multi-task loss, the improvement cannot be isolated from confounding factors such as increased model capacity.
- [Experiments] Experiments: No information is supplied on train/test splits, cross-validation, statistical significance testing, or variance across runs. The reported error reduction is therefore an empirical fit on a single fixed dataset whose robustness to different partitions or hyperparameter choices remains unverified.
- [Introduction / Conclusion] Introduction and conclusion: The paper assumes that performance on the CloPeMa dataset will transfer to real robotic folding scenarios, yet no domain-shift, sim-to-real, or physical-robot experiments are described to support this transfer claim.
minor comments (2)
- [Abstract] The abstract states the model is 'scalable to many classes, memory and processing efficient' but provides no supporting measurements (e.g., FLOPs, parameter counts, inference time) relative to single-task baselines.
- [Methods] Notation for the two variants (GarmNet-A vs. GarmNet-B) is introduced only in the abstract; a clear definition and diagram in the methods section would improve readability.
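The capacity question raised in these comments can be checked with simple parameter arithmetic: a convolution with a k x k kernel has in_channels * out_channels * k * k weights plus one bias per output channel. The sketch below applies this to a shared backbone with and without an extra head. All layer shapes are hypothetical placeholders, not GarmNet's actual architecture, which is not described on this page.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameter count of a standard 2D convolution with a square kernel."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical layer shapes, for illustration only (not GarmNet's real layers).
shared_backbone = conv2d_params(3, 64, 3) + conv2d_params(64, 128, 3)
localization_head = linear_params(128, 4)   # e.g. one bounding box
landmark_head = linear_params(128, 10)      # e.g. ten landmark scores

model_a_params = shared_backbone + localization_head                  # localization only
model_b_params = shared_backbone + localization_head + landmark_head  # both tasks

extra = model_b_params - model_a_params
print(f"A: {model_a_params}, B: {model_b_params}, added by landmark head: {extra}")
```

Reporting the added count next to the shared count, as in the last line, would let a reader judge whether the extra head is a plausible source of the improvement on its own.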
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our multi-task approach. We address each point below and will revise the manuscript accordingly to improve clarity and rigor.
point-by-point responses
- Referee: [Abstract / Experiments] Abstract and experiments section: The central claim of a 24.7% localization error reduction for GarmNet-B is presented without any description of the baseline architecture (GarmNet-A), parameter counts, backbone depth, training schedule, data augmentation, or loss-weighting scheme. Without an explicit statement that the only difference is the added landmark head and multi-task loss, the improvement cannot be isolated from confounding factors such as increased model capacity.
  Authors: We agree that additional architectural and training details are needed to isolate the effect of joint training. In the revised manuscript, we will expand the experiments section with a table comparing GarmNet-A and GarmNet-B, explicitly stating that the backbone, parameter counts (except for the added landmark head), training schedule, data augmentation, and loss weighting remain identical, with the sole difference being the addition of the landmark detection head and its multi-task loss term.
  Revision: yes
- Referee: [Experiments] Experiments: No information is supplied on train/test splits, cross-validation, statistical significance testing, or variance across runs. The reported error reduction is therefore an empirical fit on a single fixed dataset whose robustness to different partitions or hyperparameter choices remains unverified.
  Authors: We will add the train/test split details (proportions and any randomization seed) used for the 3,330-image CloPeMa dataset to the experiments section. The original evaluation was performed on a single fixed partition without multiple runs; we will note this limitation explicitly and, where possible, report results from additional runs with varied seeds to provide variance estimates.
  Revision: partial
- Referee: [Introduction / Conclusion] Introduction and conclusion: The paper assumes that performance on the CloPeMa dataset will transfer to real robotic folding scenarios, yet no domain-shift, sim-to-real, or physical-robot experiments are described to support this transfer claim.
  Authors: The manuscript evaluates the perception module on the CloPeMa dataset and discusses its relevance to robotic applications. We will revise the introduction and conclusion to remove any implication of direct transfer, instead stating that the results demonstrate improved perception on this dataset and that validation on physical robots or under domain shift remains future work.
  Revision: yes
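The split details promised in the second response (proportions and randomization seed) are easiest to verify when the partition is generated from a recorded seed. The sketch below shows one way to do that for 3,330 image indices. The 80/20 proportion, the seed value, and the function name are assumptions for illustration; the actual split used in the paper is not stated on this page.

```python
import random


def split_indices(n_items, test_fraction, seed):
    """Return (train, test) index lists from a reproducible shuffle."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]


# 3,330 is the dataset size given in the abstract; the 0.2 fraction and
# seed 0 are hypothetical choices, not the paper's settings.
train_idx, test_idx = split_indices(3330, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))  # prints "2664 666"
```

Repeating the call with several different seeds gives the varied partitions from which the variance estimates mentioned in the response could be computed.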
Circularity Check
No circularity: empirical performance comparison on fixed dataset with no load-bearing derivations or self-citations
full rationale
The paper presents an empirical ML model (GarmNet) trained and evaluated on the CloPeMa Garment dataset. The central claim is a measured 24.7% error reduction when adding a landmark detection head, reported directly from experimental results rather than any mathematical derivation, prediction, or first-principles chain. No equations, ansatzes, uniqueness theorems, or self-citations are invoked as load-bearing steps. The result is a standard train/evaluate comparison on a fixed dataset and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-task loss weighting and backbone choice
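For orientation, a common generic form of a two-task objective with a single weighting parameter is shown below. This is an assumed textbook formulation, not the loss stated in the paper, and the symbols are introduced here for illustration only.

```latex
% Assumed generic form, not taken from the paper:
%   \mathcal{L}_{\text{loc}}  localization loss
%   \mathcal{L}_{\text{lm}}   landmark detection loss
%   \lambda \ge 0             task weight (the free parameter)
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{loc}} + \lambda \, \mathcal{L}_{\text{lm}}
```

In this form, setting \lambda = 0 recovers a localization-only objective, so the value of \lambda is exactly the kind of choice that would need to be reported alongside the 24.7 percent result.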
axioms (1)
- domain assumption: The CloPeMa Garment dataset of 3,330 images is representative of garment configurations encountered in robotic folding.