Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation
Pith reviewed 2026-06-28 09:24 UTC · model grok-4.3
The pith
Given a single human demonstration, a policy can infer and execute multiple manipulation modes for deformable objects without gradient updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning.
What carries the argument
A flow-matching transformer policy conditioned on the demonstration, supported by deformation-aware visual representations learned via temporal contrastive pretraining in simulation.
Load-bearing premise
The deformation-aware visual representations obtained from temporal contrastive pretraining in simulation are sufficient to support generalization across manipulation modes and zero-shot transfer to real-world settings without additional data or finetuning.
What would settle it
A test in which the policy cannot produce a new spatial layout or ordering from the same demonstration, or requires finetuning to succeed on real hardware, would show the central claim does not hold.
Figures
read the original abstract
Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Instant-Fold, an in-context imitation learning framework for deformable object manipulation (DOM). It first pretrains deformation-aware visual representations via temporal contrastive learning in simulation, then deploys a flow-matching transformer policy conditioned on a single human demonstration to infer and execute diverse manipulation modes (including spatial and ordering variations) without gradient updates. The approach is trained entirely in simulation and claims zero-shot transfer to real-world settings without additional data or finetuning.
Significance. If the central claims hold, the work would advance sample-efficient, mode-generalizing policies for high-dimensional, topology-changing DOM tasks by demonstrating in-context inference from one demonstration and sim-to-real transfer via contrastive representations. This could reduce the data requirements typical in deformable manipulation and support broader applicability of flow-matching policies in robotics.
major comments (2)
- [Abstract] Abstract: The central claim that temporal contrastive pretraining in simulation alone yields deformation-aware features sufficient for mode inference, diverse execution, and zero-shot real-world transfer is load-bearing, yet the provided text supplies no quantitative results, ablation studies on representation robustness, or tests addressing sim-real gaps in friction, elasticity, or sensor noise. This leaves the sufficiency of the representations unverified.
- [Abstract] Abstract: The assertion of generalization across folding modes and zero-shot transfer without domain randomization, real data mixing, or explicit mode-inference ablations is not supported by any reported metrics or error analysis in the given text, undermining evaluation of the weakest assumption regarding representation invariance.
minor comments (1)
- [Abstract] The abstract references a project website for videos but does not indicate whether quantitative results or ablations appear in the main text or supplementary material.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract accordingly to better highlight the quantitative support present in the full manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that temporal contrastive pretraining in simulation alone yields deformation-aware features sufficient for mode inference, diverse execution, and zero-shot real-world transfer is load-bearing, yet the provided text supplies no quantitative results, ablation studies on representation robustness, or tests addressing sim-real gaps in friction, elasticity, or sensor noise. This leaves the sufficiency of the representations unverified.
Authors: The full manuscript contains quantitative results on success rates for mode inference and execution, ablation studies evaluating the temporal contrastive pretraining's impact on representation robustness, and sim-to-real experiments that explicitly test and report performance under variations in friction, elasticity, and sensor noise. We will revise the abstract to include key metrics and reference these supporting experiments and ablations. revision: yes
-
Referee: [Abstract] Abstract: The assertion of generalization across folding modes and zero-shot transfer without domain randomization, real data mixing, or explicit mode-inference ablations is not supported by any reported metrics or error analysis in the given text, undermining evaluation of the weakest assumption regarding representation invariance.
Authors: The manuscript reports specific metrics on generalization across folding modes (including spatial and ordering variations), zero-shot transfer without domain randomization or real data mixing, and includes explicit ablations on mode inference along with error analysis in the experiments section. We will update the abstract to incorporate these metrics, ablations, and error analysis to substantiate the claims. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents a two-stage learning pipeline (temporal contrastive pretraining for deformation-aware features, followed by conditioning a flow-matching transformer on a single demonstration) whose performance claims rest on empirical generalization from simulation to real rather than any definitional equivalence, fitted-parameter renaming, or self-citation chain. No equations, uniqueness theorems, or ansatzes are shown that reduce outputs to inputs by construction; the in-context mode inference and zero-shot transfer are presented as learned behaviors, not tautological restatements of the training procedure. This is the normal non-circular outcome for an empirical robotics method whose central results are externally falsifiable via held-out demonstrations and real-robot trials.
Axiom & Free-Parameter Ledger
free parameters (2)
- contrastive loss weights and augmentation parameters
- flow-matching network hyperparameters
axioms (1)
- domain assumption Simulation accurately models real-world deformable object dynamics for zero-shot transfer
Reference graph
Works this paper leans on
-
[1]
Brown, B
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
1901
-
[2]
O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Zhang, B
K. Zhang, B. Li, K. Hauser, and Y . Li. Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation. InProceedings of Robotics: Science and Systems (RSS), 2024
2024
-
[4]
S. Chen, Y . Xu, C. Yu, L. Li, X. Ma, Z. Xu, and D. Hsu. Daxbench: Benchmarking deformable object manipulation with differentiable physics. InThe Eleventh International Conference on Learning Representations
-
[5]
X. Lin, Y . Wang, J. Olkin, and D. Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. InConference on Robot Learning, pages 432–448. PMLR, 2021
2021
-
[6]
Bender, M
J. Bender, M. M ¨uller, and M. Macklin. Position-based simulation methods in computer graph- ics. InEurographics (tutorials), pages 1–32, 2015
2015
-
[7]
Pfaff, M
T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. Battaglia. Learning mesh-based simula- tion with graph networks. InInternational Conference on Learning Representations
-
[8]
M. Macklin, M. M ¨uller, and N. Chentanez. Xpbd: position-based simulation of compliant con- strained dynamics. InProceedings of the 9th International Conference on Motion in Games, MIG ’16, page 49–54, New York, NY , USA, 2016. Association for Computing Machinery. ISBN 9781450345927. doi:10.1145/2994258.2994272. URLhttps://doi.org/10.1145/ 2994258.2994272
-
[9]
Canberk, C
A. Canberk, C. Chi, H. Ha, B. Burchfiel, E. Cousineau, S. Feng, and S. Song. Cloth funnels: Canonicalized-alignment for multi-purpose garment manipulation. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5872–5879. IEEE, 2023
2023
-
[10]
X. Ma, D. Hsu, and W. S. Lee. Learning latent graph dynamics for visual manipulation of deformable objects. In2022 International Conference on Robotics and Automation (ICRA), pages 8266–8273. IEEE, 2022
2022
-
[11]
Deng and D
Y . Deng and D. Hsu. General-purpose clothes manipulation with semantic keypoints. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13181–13187. IEEE, 2025
2025
-
[12]
R. Wu, H. Lu, Y . Wang, Y . Wang, and H. Dong. Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16340– 16350, June 2024
2024
-
[13]
Lips, V .-L
T. Lips, V .-L. De Gusseme, and F. Wyffels. Learning keypoints for robotic cloth manipulation using synthetic data.IEEE Robotics and Automation Letters, 9(7):6528–6535, 2024
2024
-
[14]
Lippi, P
M. Lippi, P. Poklukar, M. C. Welle, A. Varava, H. Yin, A. Marino, and D. Kragic. Latent space roadmap for visual action planning of deformable and rigid object manipulation. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5619–5626. IEEE, 2020. 32
2020
-
[15]
W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto. Learning predictive representations for deformable objects using contrastive estimation. InConference on Robot Learning, pages 564–574. PMLR, 2021
2021
-
[16]
Zhang, B
K. Zhang, B. Li, K. Hauser, and Y . Li. Particle-grid neural dynamics for learning deformable object models from rgb-d videos. InProceedings of Robotics: Science and Systems (RSS), 2025
2025
-
[17]
M. Song, J. Ha, B. Park, and D. Park. Implicit neural-representation learning for elastic deformable-object manipulations. InRobotics: Science and Systems (RSS). Robotics: Science and Systems Foundation, 2025
2025
-
[18]
T. Tian, H. Li, B. Ai, X. Yuan, Z. Huang, and H. Su. Diffusion dynamics models with gener- ative state estimation for cloth manipulation. InConference on Robot Learning, pages 1703–
-
[19]
R. Shi, Z. Xue, Y . You, and C. Lu. Skeleton merger: an unsupervised aligned keypoint detector. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 43–52, 2021
2021
-
[20]
B. Zhou, H. Zhou, T. Liang, Q. Yu, S. Zhao, Y . Zeng, J. Lv, S. Luo, Q. Wang, X. Yu, et al. Clothesnet: An information-rich 3d garment model repository with simulated clothes environ- ment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20428–20438, 2023
2023
-
[21]
Ha and S
H. Ha and S. Song. Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. InConference on Robot Learning, pages 24–33. PMLR, 2022
2022
-
[22]
H. Xue, Y . Li, W. Xu, H. Li, D. Zheng, and C. Lu. Unifolding: Towards sample-efficient, scalable, and generalizable robotic garment folding. InConference on Robot Learning, pages 3321–3341. PMLR, 2023
2023
-
[24]
Sunil, M
N. Sunil, M. Tippur, A. S. Portillo, E. H. Adelson, and A. R. Garcia. Reactive in-air clothing manipulation with confidence-aware dense correspondence and visuotactile affordance. In Conference on Robot Learning, pages 93–104. PMLR, 2025
2025
-
[25]
T. Weng, S. M. Bajracharya, Y . Wang, K. Agrawal, and D. Held. Fabricflownet: Bimanual cloth manipulation with a flow-based policy. InConference on Robot Learning, pages 192–
-
[26]
Hoque, D
R. Hoque, D. Seita, A. Balakrishna, A. Ganapathi, A. Tanwani, N. Jamali, K. Yamane, S. Iba, and K. Goldberg. VisuoSpatial Foresight for Multi-Step, Multi-Task Fabric Manipulation. In Robotics: Science and Systems (RSS), 2020
2020
-
[27]
Longhini, M
A. Longhini, M. C. Welle, Z. Erickson, and D. Kragic. Adafold: Adapting folding trajectories of cloths via feedback-loop manipulation.IEEE Robotics and Automation Letters, 2024
2024
-
[28]
V osylius and E
V . V osylius and E. Johns. Few-shot in-context imitation learning via implicit graph alignment. InConference on Robot Learning, 2023
2023
-
[29]
Di Palo and E
N. Di Palo and E. Johns. Dinobot: Robot manipulation via retrieval and alignment with vi- sion foundation models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2798–2805. IEEE, 2024
2024
-
[30]
Zhang and A
X. Zhang and A. Boularias. One-shot imitation learning with invariance matching for robotic manipulation. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. 33
2024
-
[31]
Wang and E
Y . Wang and E. Johns. One-shot dual-arm imitation learning.2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 5660–5668, 2025
2025
-
[32]
Y . Duan, M. Andrychowicz, B. Stadie, O. Jonathan Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning.Advances in neural information processing systems, 30, 2017
2017
-
[33]
Di Palo and E
N. Di Palo and E. Johns. Keypoint action tokens enable in-context imitation learning in robotics. InProceedings of Robotics: Science and Systems (RSS), 2024
2024
- [34]
-
[35]
Zhang, S
X. Zhang, S. Liu, P. Huang, W. J. Han, Y . Lyu, M. Xu, and D. Zhao. Dynamics as prompts: In- context learning for sim-to-real system identifications.IEEE Robotics and Automation Letters, 2025
2025
-
[36]
V osylius and E
V . V osylius and E. Johns. Instant policy: In-context imitation learning via graph diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), 2025
2025
-
[37]
M ¨uller, B
M. M ¨uller, B. Heidelberger, M. Hennix, and J. Ratcliff. Position based dynamics. In Journal of Visual Communication and Image Representation, 2007. URLhttps://api. semanticscholar.org/CorpusID:6159986
2007
-
[38]
Khosla, P
P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y . Tian, P. Isola, A. Maschinot, C. Liu, and D. Kr- ishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020
2020
-
[39]
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations, 2022
2022
-
[40]
In Defense of the Triplet Loss for Person Re-Identification
A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
James and A
S. James and A. J. Davison. Q-attention: Enabling efficient learning for vision-based robotic manipulation.IEEE Robotics and Automation Letters, 7(2):1612–1619, 2022
2022
-
[42]
arXiv preprint arXiv:2508.11002 (2025)
N. Gkanatsios, J. Xu, M. Bronars, A. Mousavian, T.-W. Ke, and K. Fragkiadaki. 3d flowmatch actor: Unified 3d policy for single-and dual-arm manipulation.arXiv preprint arXiv:2508.11002, 2025
- [43]
-
[44]
J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
2024
-
[45]
Press, N
O. Press, N. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. InInternational Conference on Learning Representations, 2021
2021
-
[46]
Scalable Diffusion Models with Transformers
W. Peebles and S. Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
X. Liu, C. Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2022
2022
-
[48]
M. Macklin, M. M ¨uller, N. Chentanez, and T.-Y . Kim. Unified particle physics for real-time applications.ACM Trans. Graph., 33(4), July 2014. ISSN 0730-0301. doi:10.1145/2601097. 2601152. URLhttps://doi.org/10.1145/2601097.2601152. 34
-
[49]
Bertiche, M
H. Bertiche, M. Madadi, and S. Escalera. Cloth3d: clothed 3d humans. InEuropean Confer- ence on Computer Vision, pages 344–359. Springer, 2020
2020
-
[50]
J. D. Robinson, C.-Y . Chuang, S. Sra, and S. Jegelka. Contrastive learning with hard negative samples. InInternational Conference on Learning Representations, 2021
2021
-
[51]
N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, et al. Sam 2: Segment anything in images and videos. InInternational Confer- ence on Learning Representations, volume 2025, pages 28085–28128, 2025
2025
-
[52]
C. Qian, J. Urain, K. Zakka, and J. Peters. Pianomime: Learning a generalist, dexterous piano player from internet demonstrations. InConference on Robot Learning, pages 1194–1215. PMLR, 2025. 35
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.