RT-H: Action Hierarchies Using Language
Pith reviewed 2026-05-17 06:49 UTC · model grok-4.3
The pith
Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. These policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions.
What carries the argument
The two-stage policy that first predicts language motion phrases from task and visuals, then conditions action prediction on those predicted phrases plus the original task and visuals.
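In code, the two-stage factorization might look like the sketch below. The function names (`predict_motion`, `predict_action`, `rt_h_step`) and the stand-in lookup table are illustrative assumptions, not the paper's API; RT-H realizes both stages as a single vision-language model queried twice per control step.

```python
# Hedged sketch of RT-H's two-stage inference. All names and the
# hard-coded phrase/action table are illustrative stand-ins for VLM queries.

def predict_motion(task, image):
    """Stage 1: (visuals, high-level task) -> fine-grained motion phrase."""
    return "move arm forward"  # stand-in for a VLM query

def predict_action(task, motion, image):
    """Stage 2: (visuals, task, motion phrase) -> low-level action."""
    # Stand-in lookup; a real system decodes a tokenized end-effector action.
    deltas = {
        "move arm forward": [0.01, 0.0, 0.0],
        "move arm left": [0.0, 0.01, 0.0],
    }
    return deltas.get(motion, [0.0, 0.0, 0.0])

def rt_h_step(task, image, correction=None):
    # The hierarchy makes corrections cheap: a human-supplied phrase
    # simply replaces the stage-1 prediction before stage 2 runs.
    motion = correction if correction is not None else predict_motion(task, image)
    return predict_action(task, motion, image)
```

The key property is visible in `rt_h_step`: a language correction intervenes between the two stages without touching either model.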
If this is right
- Policies can reuse low-level motion data across tasks that share no high-level vocabulary, such as picking objects and pouring liquids.
- During deployment a human can interrupt with corrective phrases like 'move arm left' instead of taking over the joystick.
- Training on language interventions yields higher final performance than training on equivalent teleoperated interventions.
- The same hierarchy makes multi-task imitation learning more sample-efficient without requiring task-specific architectural changes.
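The third point above, learning from language interventions, admits a simple data-relabeling reading. The episode schema below is an assumption for illustration, not the paper's data format:

```python
# Hedged sketch: converting deployment episodes that contain language
# interventions into training data for both stages of the hierarchy.

def relabel_interventions(episode):
    """Split an episode into stage-1 (motion) and stage-2 (action) examples.

    Each step is a dict with keys: image, task, motion (the phrase actually
    executed, whether predicted or human-given), action, intervened (bool).
    """
    motion_data, action_data = [], []
    for step in episode:
        # Every executed step supervises stage 2: phrase -> action.
        action_data.append((step["image"], step["task"], step["motion"], step["action"]))
        # Human corrections are targeted stage-1 labels: they show the right
        # phrase exactly where the policy's own prediction went wrong.
        if step["intervened"]:
            motion_data.append((step["image"], step["task"], step["motion"]))
    return motion_data, action_data
```

Under this reading, language interventions are cheaper than teleoperated ones because the human supplies only a phrase, yet the episode still yields action-level supervision.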
Where Pith is reading between the lines
- Collecting short motion-phrase annotations could become a cheaper way to label existing robot datasets than full action or task labels.
- The same intermediate-language idea might transfer to other long-horizon control problems such as game agents or autonomous driving.
- If motion phrases turn out to be largely task-agnostic, new tasks could be specified mostly by composing existing phrases rather than collecting fresh demonstrations.
Load-bearing premise
Fine-grained language motion phrases capture enough shared low-level structure across semantically different tasks that predicting them measurably improves downstream action accuracy and enables useful language corrections.
What would settle it
An ablation on a multi-task dataset comparing a model that predicts language motions against a direct task-to-action baseline, matched in data and backbone: no gain in action success rate or correction success rate would falsify the premise.
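The decision rule implied by this test can be written down explicitly. The 2-point margin below is a placeholder, not a number from the paper:

```python
# Hedged sketch of how the ablation would be read: same data, same
# backbone, hierarchy on vs. off. The margin value is a placeholder.

def hierarchy_verdict(hier_success, flat_success, margin=0.02):
    """Interpret paired success rates from the matched ablation."""
    if hier_success - flat_success > margin:
        return "hierarchy helps"      # supports the load-bearing premise
    if flat_success - hier_success > margin:
        return "hierarchy hurts"
    return "no measurable gain"       # would undercut the central claim
```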
read the original abstract
Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are found at https://rt-hierarchy.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RT-H, a hierarchical imitation-learning policy for robots that inserts an intermediate prediction of fine-grained 'language motions' (e.g., 'move arm forward') between high-level task descriptions and low-level actions. The architecture first predicts language motions from visual observations and task language, then conditions action prediction on the predicted motions, the task, and visuals at every stage. The central claims are that this hierarchy improves robustness and data efficiency on semantically diverse multi-task datasets and enables effective learning from and response to human language interventions during execution, outperforming teleoperated intervention baselines.
Significance. If the quantitative results hold under rigorous ablations, the work would demonstrate a practical way to leverage language for reusable low-level motion representations, reducing the demonstration burden for cross-task generalization and introducing a flexible language-based correction interface that can be used for online policy improvement.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): the claim that the language-motion prediction step 'forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks' is load-bearing for both the robustness and intervention-learning results, yet no ablation isolates whether gains arise from the hierarchical conditioning versus simply adding extra language supervision; without this, the central assumption that fine-grained phrases capture reusable primitives remains unverified.
- [§4] §4 (Experiments): the abstract reports comparative experiments on robustness and language interventions, but provides no details on how language-motion phrases are obtained (human annotation protocol, automatic generation, or consistency checks across tasks), which directly affects whether the intermediate representation enforces the desired shared structure or merely adds noisy supervision.
minor comments (2)
- [Abstract] Abstract: the link to the project website is useful, but the summary omits concrete metrics, dataset sizes, or baseline names, which would help readers quickly gauge the scale of reported gains.
- [§3] Notation: the distinction between 'language motions' and the high-level task language is introduced informally; a short table or diagram in §3 clarifying the three levels (task, motion, action) and their conditioning would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of our claims and experimental details. We address each major comment below and propose revisions to strengthen the manuscript.
read point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): the claim that the language-motion prediction step 'forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks' is load-bearing for both the robustness and intervention-learning results, yet no ablation isolates whether gains arise from the hierarchical conditioning versus simply adding extra language supervision; without this, the central assumption that fine-grained phrases capture reusable primitives remains unverified.
Authors: We agree that an ablation isolating the hierarchical conditioning from additional language supervision would provide stronger evidence for our central claim. Our current results demonstrate that RT-H outperforms direct language-to-action baselines on multi-task robustness and intervention tasks, and the intervention capability relies on explicit motion conditioning. However, we did not include a dedicated ablation removing the hierarchy while retaining extra language labels. We will add this ablation in the revised §4 to verify that gains stem from the structured hierarchy rather than supervision alone. revision: yes
Referee: [§4] §4 (Experiments): the abstract reports comparative experiments on robustness and language interventions, but provides no details on how language-motion phrases are obtained (human annotation protocol, automatic generation, or consistency checks across tasks), which directly affects whether the intermediate representation enforces the desired shared structure or merely adds noisy supervision.
Authors: We appreciate this point on reproducibility. The language-motion phrases were obtained via human annotation of low-level motions in the demonstration trajectories, with phrases selected for semantic consistency across tasks (e.g., reusing 'move arm forward' for similar primitives). We will expand the experimental section to detail the annotation protocol, including guidelines provided to annotators and any inter-annotator consistency checks performed. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical hierarchical imitation-learning architecture that predicts language motions as an intermediate step before actions. No equations, fitted parameters renamed as predictions, or self-referential derivations appear in the abstract or method description. The central claims rest on experimental evaluation against baselines rather than any mathematical reduction that equates outputs to inputs by construction. The approach is self-contained with external benchmarks and does not invoke load-bearing self-citations or uniqueness theorems.
Axiom & Free-Parameter Ledger
free parameters (1)
- language motion vocabulary size and phrasing
axioms (2)
- domain assumption: Language motions capture shared low-level structure across tasks
- domain assumption: Human language interventions provide useful training signal
invented entities (1)
- language motion (no independent evidence)
Forward citations
Cited by 19 Pith papers
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- Using large language models for embodied planning introduces systematic safety risks
  LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- Towards Generalizable Robotic Manipulation in Dynamic Environments
  DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
- ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
  ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
  GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
  VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
- VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
  VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
  PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
- Continually Evolving Skill Knowledge in Vision Language Action Model
  Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results wit...
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
  DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
  A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
  DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
  FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
- RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
  RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
  AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
Reference graph
Works this paper leans on
-
[1]
Do as i can, not as i say: Grounding language in robotic affordances
Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning , pages 287–318. PMLR, 2023
work page 2023
-
[2]
No, to the right: Online language corrections for robotic manipulation via shared autonomy
Yuchen Cui, Siddharth Karamcheti, Raj Palleti, Nidhya Shivakumar, Percy Liang, and Dorsa Sadigh. No, to the right: Online language corrections for robotic manipulation via shared autonomy. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI '23, page 93–101, New York, NY, USA, 2023. Association for Computing M...
-
[3]
Correcting robot plans with natural language feedback
Pratyusha Sharma, Balakumar Sundaralingam, Valts Blukis, Chris Paxton, Tucker Hermans, Antonio Torralba, Jacob Andreas, and Dieter Fox. Correcting robot plans with natural language feedback. ArXiv, abs/2204.05186, 2022. URL https://api.semanticscholar.org/CorpusID:248085271
-
[5]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Human-in-the-loop imitation learning using remote teleoperation
Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in-the-loop imitation learning using remote teleoperation. CoRR, abs/2012.06733, 2020. URL https://arxiv.org/abs/2012.06733
-
[7]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
BC-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-z: Zero-shot task generalization with robotic imitation learning. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv
work page 2021
-
[9]
Language-conditioned imitation learning for robot manipulation tasks
Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, and Heni Ben Amor. Language-conditioned imitation learning for robot manipulation tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13139–13150. Curran Associates, Inc., 2020...
-
[10]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), 2021
work page 2021
-
[11]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022
work page 2022
-
[12]
KITE: Keypoint-conditioned policies for semantic manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, and Jeannette Bohg. KITE: Keypoint-conditioned policies for semantic manipulation. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=veGdf4L4Xz
work page 2023
-
[13]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...
work page 2021
-
[14]
R3m: A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909. PMLR, 2023
work page 2023
-
[15]
Language-driven representation learning for robotics
Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems (RSS), 2023
work page 2023
-
[16]
Vip: Towards universal visual reward and representation via value-implicit pre-training
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2022
work page 2022
-
[17]
Palm-e: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Robot learning from demonstration by constructing skill trees
George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012. doi: 10.1177/0278364911428653. URL https://doi.org/10.1177/0278364911428653
work page 2012
-
[19]
Scott Niekum, Sarah Osentoski, George Konidaris, and Andrew G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 5239–5246, 2012. doi: 10.1109/IROS.2012.6386006
-
[20]
Ddco: Discovery of deep continuous options for robot learning from demonstrations
Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. Ddco: Discovery of deep continuous options for robot learning from demonstrations. In Conference on robot learning, pages 418–437. PMLR, 2017
work page 2017
-
[21]
Learning robot skills with temporal variational inference
Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–
-
[22]
Compile: Compositional imitation learning and execution
Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, and Peter Battaglia. Compile: Compositional imitation learning and execution. In International Conference on Machine Learning, pages 3418–3428. PMLR, 2019
work page 2019
-
[23]
Discovering motor programs by re-composing demonstrations
Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, and Abhinav Gupta. Discovering motor programs by re-composing demonstrations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgHY0NYwr
work page 2020
-
[24]
Skid raw: Skill discovery from raw trajectories
Daniel Tanneberg, Kai Ploeger, Elmar Rueckert, and Jan Peters. Skid raw: Skill discovery from raw trajectories. IEEE robotics and automation letters , 6(3):4696–4703, 2021
work page 2021
-
[25]
Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation
Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022
work page 2022
-
[26]
Hierarchical few-shot imitation with skill transition models
Kourosh Hakhamaneshi, Ruihan Zhao, Albert Zhan, Pieter Abbeel, and Michael Laskin. Hierarchical few-shot imitation with skill transition models. In International Conference on Learning Representations , 2021
work page 2021
-
[27]
Robust imitation of diverse behaviors
Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[28]
Learning latent plans from play
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 1113–1132. PMLR, 30 Oct–01 Nov...
work page 2020
-
[29]
PLATO: Predicting latent affordances through object-centric play
Suneel Belkhale and Dorsa Sadigh. PLATO: Predicting latent affordances through object-centric play. In 6th Annual Conference on Robot Learning , 2022. URL https://openreview.net/forum?id=UAA5bNospA0
work page 2022
-
[30]
Coarse-to-fine imitation learning: Robot manipulation from a single demonstration
Edward Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. In 2021 IEEE international conference on robotics and automation (ICRA), pages 4613–4619. IEEE, 2021
work page 2021
-
[31]
Hydra: Hybrid robot actions for imitation learning
Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning, pages 2113–2133. PMLR, 2023
work page 2023
-
[32]
Inner monologue: Embodied reasoning through planning with language models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, pages 1769–
-
[33]
Robovqa: Multimodal long-horizon reasoning for robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, and Yuan Cao. Robovqa: Multimodal long-horizon re...
-
[34]
ELLA: Exploration through learned language abstraction
Suvir Mirchandani, Siddharth Karamcheti, and Dorsa Sadigh. ELLA: Exploration through learned language abstraction. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=VvUldGZ3izR
work page 2021
-
[35]
Improving long-horizon imitation through instruction prediction
Joey Hejna, Pieter Abbeel, and Lerrel Pinto. Improving long-horizon imitation through instruction prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7857–7865, 2023
work page 2023
-
[36]
Thought Cloning: Learning to think while acting by imitating human thinking
Shengran Hu and Jeff Clune. Thought Cloning: Learning to think while acting by imitating human thinking. Advances in Neural Information Processing Systems, 2023
work page 2023
-
[37]
Skill induction and planning with latent language
Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1713–1726, 2022
work page 2022
-
[38]
Skill generalization with verbs
Rachel Ma, Lyndon Lam, Benjamin A Spiegel, Aditya Ganeshan, Roma Patel, Ben Abbatematteo, David Paulius, Stefanie Tellex, and George Konidaris. Skill generalization with verbs. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5844–5851. IEEE, 2023
work page 2023
-
[39]
Interactive imitation learning in robotics based on simulations, 2022
Xinjie Liu. Interactive imitation learning in robotics based on simulations, 2022
work page 2022
-
[40]
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. CoRR, abs/1011.0686, 2010. URL http://arxiv.org/abs/1011.0686
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[41]
Hg-dagger: Interactive imitation learning with human experts
Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019
work page 2019
-
[42]
Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning
Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning. In Conference on Robot Learning, pages 598–608. PMLR, 2022
work page 2022
-
[43]
Lazydagger: Reducing context switching in interactive imitation learning
Ryan Hoque, Ashwin Balakrishna, Carl Putterman, Michael Luo, Daniel S. Brown, Daniel Seita, Brijen Thananjeyan, Ellen R. Novoseller, and Ken Goldberg. Lazydagger: Reducing context switching in interactive imitation learning. In CASE, pages 502–509, 2021. URL https://doi.org/10.1109/CASE49439.2021.9551469
-
[44]
Query-efficient imitation learning for end-to-end simulated driving
Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end simulated driving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, page 2891–2897. AAAI Press, 2017
work page 2017
-
[45]
Kunal Menda, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048, 2019. doi: 10.1109/IROS40897.2019.8968287
-
[46]
Learning human objectives from sequences of physical corrections
Mengxi Li, Alper Canberk, Dylan P Losey, and Dorsa Sadigh. Learning human objectives from sequences of physical corrections. In 2021 IEEE International Conference on Robotics and Automation (ICRA) , pages 2877–2883. IEEE, 2021
work page 2021
-
[47]
Physical interaction as communication: Learning robot objectives online from human corrections
Dylan P Losey, Andrea Bajcsy, Marcia K O’Malley, and Anca D Dragan. Physical interaction as communication: Learning robot objectives online from human corrections. The International Journal of Robotics Research , 41(1): 20–44, 2022
work page 2022
-
[48]
Distilling and retrieving generalizable knowledge for robot manipulation via language corrections
Lihan Zha, Yuchen Cui, Li-Heng Lin, Minae Kwon, Montserrat Gonzalez Arenas, Andy Zeng, Fei Xia, and Dorsa Sadigh. Distilling and retrieving generalizable knowledge for robot manipulation via language corrections. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023
work page 2023
-
[49]
Real-time natural language corrections for assistive robotic manipulators
Alexander Broad, Jacob Arkin, Nathan Ratliff, Thomas Howard, and Brenna Argall. Real-time natural language corrections for assistive robotic manipulators. The International Journal of Robotics Research, 36(5-7):684–698, 2017
work page 2017
-
[50]
Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, and Rogerio Bonatti. Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 978–984. IEEE, 2022
work page 2022
-
[51]
Latte: Language trajectory transformer, 2022
Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, Sai Vemprala, and Rogerio Bonatti. Latte: Language trajectory transformer, 2022
work page 2022
-
[52]
Guiding policies with language via meta-learning
John D Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, Jacob Andreas, John DeNero, Pieter Abbeel, and Sergey Levine. Guiding policies with language via meta-learning. In International Conference on Learning Representations, 2018
work page 2018
-
[53]
Interactive language: Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters , pages 1–8, 2023. doi: 10.1109/LRA.2023.3295255
-
[54]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 , 2023
- [55]
-
[56]
S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982. doi: 10.1109/TIT.1982.1056489
-
[57]
Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang H...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
What matters in learning from offline human demonstrations for robot manipulation
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=JrsfBJtDFdI
work page 2021
-
[59]
Data quality in imitation learning
Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Data quality in imitation learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=FwmvbuDiMk
work page 2023
-
[60]
Gpt-4v(ision) system card. 2023. URL https://api. semanticscholar.org/CorpusID:263218031
work page 2023
-
[61]
Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024
-
[62]
There is significant contextuality of language motions required when solving precise manipulation tasks (see Fig. 4, e.g., the speed or direction variety for a single language motion) – there was no single predefined primitive for many language motions that could safely and efficiently progress at the task. See Appendix D for a quantitative analysis of th...
-
[63]
LLMs would inherently struggle to predict language motions because they are not grounded in the visual context of the scene. Therefore we would not expect these models to understand directions like “left” and “up” or to know when to close the gripper with just a textual description of the scene (as provided in SayCan). Thus VLMs are much better suited for...