LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3
The pith
A single vision-language model learns to both generate actions from language and explain actions in language, creating a self-improving cycle that generates its own training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LACY trains one vision-language model on three tasks at once: language-to-action generation, action-to-language explanation, and language consistency verification. The resulting cycle lets the model autonomously produce and filter new training pairs by targeting uncertain cases, then uses those pairs to update itself. Experiments show this raises average success rates by 56.46 percent on pick-and-place tasks in simulation and on physical robots while producing more stable language-action alignment.
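One way to picture the cycle is the minimal sketch below. It assumes the three skills are exposed as plain callables (`l2a`, `a2l`, `l2c`), that the round trip runs from instruction to action to explanation to consistency check, and that an action is a four-number pick-and-place tuple. All of these names and the ordering are illustrative assumptions rather than the paper's interface, and the image input is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical action format: (pick_x, pick_y, place_x, place_y).
Action = Tuple[float, float, float, float]


@dataclass
class Pair:
    instruction: str
    action: Action


def cycle_once(
    instructions: List[str],
    l2a: Callable[[str], Action],     # language -> action (assumed signature)
    a2l: Callable[[Action], str],     # action -> language (assumed signature)
    l2c: Callable[[str, str], bool],  # are the two descriptions consistent?
) -> List[Pair]:
    """Keep only the pairs whose explanation agrees with the instruction."""
    kept: List[Pair] = []
    for text in instructions:
        action = l2a(text)           # propose an action for the instruction
        explanation = a2l(action)    # describe that action back in language
        if l2c(text, explanation):   # filter: keep consistent round trips only
            kept.append(Pair(text, action))
    return kept
```

In this sketch the kept pairs would be appended to the training set before the next round; how the paper weights or mixes them is not stated in this summary.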
What carries the argument
The Language-Action Cycle, which jointly optimizes the language-to-action (L2A), action-to-language (A2L), and language-consistency (L2C) mappings inside one model so that low-confidence outputs can be turned into new, filtered training data.
If this is right
- Task success rates increase by 56.46 percent on average across simulated and real pick-and-place settings.
- Language-action grounding becomes more robust without requiring extra human annotations.
- The model improves through repeated cycles of self-generated data focused on uncertain predictions.
- The same model can both execute instructions and describe its own actions in language.
- Joint training on the three tasks supports the closed-loop data augmentation process.
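To make the joint-training point concrete, the sketch below shows how a single labeled demonstration could be expanded into text samples for all three tasks. The prompt templates, the `build_samples` helper, and the four-number action encoding are hypothetical and chosen only for illustration; the paper's actual formats are not given in this summary, and the image that would accompany each prompt is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical action format: (pick_x, pick_y, place_x, place_y).
Action = Tuple[float, float, float, float]


def format_action(action: Action) -> str:
    """Serialize an action as space-separated numbers with two decimals."""
    return " ".join(f"{value:.2f}" for value in action)


def build_samples(
    instruction: str,
    action: Action,
    paraphrase: str,  # a description with the same meaning as the instruction
    unrelated: str,   # a description of a different task
) -> List[Dict[str, str]]:
    """Expand one demonstration into L2A, A2L, and L2C training samples."""
    act = format_action(action)
    return [
        {"task": "L2A", "prompt": f"Instruction: {instruction}\nAction:", "target": act},
        {"task": "A2L", "prompt": f"Action: {act}\nDescription:", "target": instruction},
        {"task": "L2C", "prompt": f"A: {instruction}\nB: {paraphrase}\nSame meaning?", "target": "yes"},
        {"task": "L2C", "prompt": f"A: {instruction}\nB: {unrelated}\nSame meaning?", "target": "no"},
    ]
```

Mixing all four samples into one fine-tuning set is what lets a single model serve as generator, explainer, and verifier in this sketch.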
Where Pith is reading between the lines
- The bidirectional capability could allow robots to communicate their intentions to human supervisors in shared workspaces.
- Extending the cycle to other manipulation skills might reduce the need for task-specific labeled datasets across robotics.
- If the verification step reliably catches errors, similar self-improvement loops could apply to navigation or assembly domains.
- The added explanation ability might make it easier to debug why a policy fails on particular instructions.
Load-bearing premise
The strategy of generating new data only from low-confidence cases will produce accurate examples that improve the model rather than adding noise or shifting the data distribution.
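The premise depends on how "low confidence" is measured, which this summary does not specify. As a sketch under a stated assumption, the code below scores each generated sequence by the geometric mean of its token probabilities and selects the cases that fall below a threshold. Both the scoring rule and the helper names are assumptions made for illustration, not the paper's definition.

```python
import math
from typing import List, Sequence, Tuple


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (an assumed confidence score)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def select_low_confidence(
    candidates: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Return the names of candidates whose confidence is below the threshold."""
    return [
        name
        for name, logprobs in candidates
        if sequence_confidence(logprobs) < threshold
    ]
```

A poorly calibrated score would pick the wrong cases to augment, which is exactly the risk the premise carries.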
What would settle it
Run the active augmentation loop on a held-out set of low-confidence predictions, manually verify the generated language-action pairs for correctness, then measure whether retraining on the filtered set still raises or instead lowers task success rates.
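The bookkeeping for such a check is simple and could look like the sketch below. The function names and the boolean encoding of manual labels and trial outcomes are hypothetical; any numbers passed in would come from the audit itself, not from the paper.

```python
from typing import Dict, Sequence


def fraction_true(flags: Sequence[bool]) -> float:
    """Share of True entries in a non-empty sequence."""
    if not flags:
        raise ValueError("need at least one entry")
    return sum(flags) / len(flags)


def audit(
    manual_labels: Sequence[bool],  # True if a generated pair was judged correct
    before: Sequence[bool],         # per-trial success before retraining
    after: Sequence[bool],          # per-trial success after retraining
) -> Dict[str, float]:
    """Summarize generated-pair precision and the change in task success."""
    success_before = fraction_true(before)
    success_after = fraction_true(after)
    return {
        "pair_precision": fraction_true(manual_labels),
        "success_before": success_before,
        "success_after": success_after,
        "delta": success_after - success_before,
    }
```

A low `pair_precision` together with a negative `delta` would indicate that the loop is adding noise rather than signal.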
Original abstract
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LACY, a unified vision-language model framework trained jointly on language-to-action (L2A), action-to-language (A2L), and language-consistency (L2C) tasks. This bidirectional setup enables an active augmentation cycle that autonomously generates and filters new training data from low-confidence cases without additional human labels. Experiments on pick-and-place tasks report a 56.46% average improvement in success rates across simulation and real-world settings, with ablations isolating the contribution of the cycle and checks on generated data quality.
Significance. If the reported gains and data-quality checks hold, the approach offers a concrete mechanism for self-supervised improvement in robotic manipulation policies, reducing dependence on labeled data while strengthening language-action grounding. The explicit ablations and real-world validation strengthen the case for broader applicability in VLM-based robotics.
minor comments (3)
- The abstract states the 56.46% figure without referencing the number of trials, baselines, or statistical tests; move a concise version of the experimental protocol summary from §4 into the abstract for immediate evaluability.
- Figure 3 (cycle diagram) and the accompanying text in §3.2 use slightly inconsistent notation for the confidence threshold; standardize the symbol and add a one-sentence definition in the caption.
- Table 2 reports per-task success rates but does not list the exact number of real-world trials per condition; add this information to support reproducibility claims.
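The first and third comments both come down to how much a success rate can be trusted at a given trial count. A standard way to show this is the Wilson score interval for a binomial proportion, sketched below. The function name is illustrative, and any counts passed to it would need to come from the paper's own tables.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half
```

The interval narrows as the number of trials grows, so the same observed rate supports a much stronger claim at 100 trials than at 10.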
Simulated Author's Rebuttal
We thank the referee for the constructive summary and positive assessment of LACY, including recognition of the bidirectional training, active augmentation cycle, ablations, and real-world results. The recommendation for minor revision is appreciated. No specific major comments were raised in the report.
Circularity Check
No significant circularity detected
The paper introduces an empirical framework (LACY) for bidirectional language-action mapping in a VLM, trained jointly on L2A/A2L/L2C tasks plus active augmentation for self-improvement. No equations, derivations, or first-principles claims appear that reduce the reported 56.46% success-rate gains to quantities defined by the method itself. Experimental results in simulation and real-world pick-and-place tasks, with ablations isolating the cycle's contribution and checks on generated data quality, stand as independent empirical evidence rather than tautological fits or self-citation chains. The central claims rest on observable performance lifts, not on renaming or re-deriving inputs.