pith. machine review for the scientific record.

arxiv: 2403.01823 · v2 · submitted 2024-03-04 · 💻 cs.RO · cs.AI


RT-H: Action Hierarchies Using Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 06:49 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robot imitation learning · language-conditioned policies · action hierarchies · multi-task learning · human intervention · language motions

The pith

Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RT-H, which inserts an intermediate prediction step where the model first outputs language phrases describing low-level motions such as 'move arm forward'. Conditioned on those phrases plus the high-level task and visual input, it then generates the actual robot actions. This hierarchy is meant to force the policy to discover motion patterns that recur even when high-level tasks have little semantic overlap. A reader would care because current language-conditioned imitation methods struggle to transfer knowledge between dissimilar tasks without enormous amounts of data, and this approach also opens a route for humans to steer execution with words rather than full teleoperation.
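To make the hierarchy concrete, the following is a minimal sketch of a two-stage policy of this shape. It is not the paper's implementation (RT-H fine-tunes a vision-language model rather than the small feed-forward heads shown here); the module sizes, the discrete motion vocabulary, and the motion_override hook for language corrections are illustrative assumptions.

```python
# Minimal sketch of the two-stage action hierarchy described above.
# NOT the paper's implementation; dimensions and vocabulary are assumptions.
import torch
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    def __init__(self, image_dim=512, text_dim=256, num_motions=64, action_dim=7):
        super().__init__()
        # Stage 1: classify the language motion from visual + task features.
        self.motion_head = nn.Sequential(
            nn.Linear(image_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, num_motions),
        )
        self.motion_embed = nn.Embedding(num_motions, text_dim)
        # Stage 2: predict the action from visual + task + motion features.
        self.action_head = nn.Sequential(
            nn.Linear(image_dim + 2 * text_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image_feat, task_feat, motion_override=None):
        logits = self.motion_head(torch.cat([image_feat, task_feat], dim=-1))
        # A human correction (or a ground-truth label at training time)
        # can replace the predicted motion here.
        motion = motion_override if motion_override is not None else logits.argmax(dim=-1)
        motion_feat = self.motion_embed(motion)
        action = self.action_head(torch.cat([image_feat, task_feat, motion_feat], dim=-1))
        return logits, motion, action
```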

Core claim

Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. These policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions.

What carries the argument

The two-stage policy that first predicts language motion phrases from task and visuals, then conditions action prediction on those predicted phrases plus the original task and visuals.
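A hedged sketch of how such a two-stage policy could be trained, assuming the HierarchicalPolicy sketch above: a classification loss on the annotated language motion plus an action loss computed with the ground-truth motion as conditioning (teacher forcing). How RT-H actually balances or sequences these objectives is not specified in the abstract, so the joint loss below is an assumption.

```python
# Hedged training-objective sketch for the two-stage policy above.
import torch.nn.functional as F

def training_step(policy, batch):
    image_feat, task_feat = batch["image_feat"], batch["task_feat"]
    motion_label, action_label = batch["motion_label"], batch["action"]
    # Stage-1 loss: predict the annotated language motion.
    logits, _, _ = policy(image_feat, task_feat)
    motion_loss = F.cross_entropy(logits, motion_label)
    # Stage-2 loss: predict the action given the ground-truth motion (teacher forcing).
    _, _, action_pred = policy(image_feat, task_feat, motion_override=motion_label)
    action_loss = F.mse_loss(action_pred, action_label)
    return motion_loss + action_loss
```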

If this is right

  • Policies can reuse low-level motion data across tasks that share no high-level vocabulary, such as picking objects and pouring liquids.
  • During deployment a human can interrupt with corrective phrases like 'move arm left' instead of taking over the joystick (a sketch of this intervention loop follows this list).
  • Training on language interventions yields higher final performance than training on equivalent teleoperated interventions.
  • The same hierarchy makes multi-task imitation learning more sample-efficient without requiring task-specific architectural changes.
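On the intervention point above, here is a minimal sketch of a deployment loop in which a human phrase overrides the predicted motion and the corrected step is logged for later fine-tuning. The phrase-to-index lookup, the env and get_human_phrase interfaces, and the intervention buffer are assumptions for illustration, not the paper's API.

```python
# Deployment-loop sketch: language corrections override the predicted motion.
import torch

# Hypothetical phrase-to-index lookup; the real vocabulary is a free parameter.
MOTION_VOCAB = {"move arm forward": 0, "move arm left": 1, "close gripper": 2}

def run_episode(policy, env, task_feat, get_human_phrase, intervention_buffer):
    """Assumed interfaces: env.reset()/env.step(action) return an observation dict
    with a precomputed 'image_feat' plus a done flag; get_human_phrase() returns a
    correction string or None when the operator stays silent."""
    obs, done = env.reset(), False
    while not done:
        image_feat = obs["image_feat"]
        phrase = get_human_phrase()
        override = None
        if phrase is not None and phrase in MOTION_VOCAB:
            override = torch.tensor([MOTION_VOCAB[phrase]])
        _, motion, action = policy(image_feat, task_feat, motion_override=override)
        if override is not None:
            # Corrected (context, motion) pairs can later fine-tune the motion head,
            # i.e. learning from language interventions rather than teleoperation.
            intervention_buffer.append((image_feat, task_feat, override))
        obs, done = env.step(action)
    return intervention_buffer
```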

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Collecting short motion-phrase annotations could become a cheaper way to label existing robot datasets than full action or task labels.
  • The same intermediate-language idea might transfer to other long-horizon control problems such as game agents or autonomous driving.
  • If motion phrases turn out to be largely task-agnostic, new tasks could be specified mostly by composing existing phrases rather than collecting fresh demonstrations.

Load-bearing premise

Fine-grained language motion phrases capture enough shared low-level structure across semantically different tasks that predicting them measurably improves downstream action accuracy and enables useful language corrections.

What would settle it

An ablation on a multi-task dataset where a model that predicts language motions shows no gain in action success rate or correction success rate over a direct task-to-action baseline.
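For reference, the flat task-to-action baseline such a test implies might look like the sketch below: identical inputs, no intermediate language-motion prediction. Matching parameter counts and training data between this and the hierarchical model is what would make the ablation informative; the class itself is only illustrative.

```python
# Illustrative flat baseline for the ablation described above.
import torch
import torch.nn as nn

class FlatPolicy(nn.Module):
    """Direct task-to-action baseline: same inputs as the hierarchical sketch,
    no intermediate language-motion prediction."""
    def __init__(self, image_dim=512, text_dim=256, action_dim=7):
        super().__init__()
        self.action_head = nn.Sequential(
            nn.Linear(image_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image_feat, task_feat):
        return self.action_head(torch.cat([image_feat, task_feat], dim=-1))
```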

read the original abstract

Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are found at https://rt-hierarchy.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RT-H, a hierarchical imitation-learning policy for robots that inserts an intermediate prediction of fine-grained 'language motions' (e.g., 'move arm forward') between high-level task descriptions and low-level actions. The architecture first predicts language motions from visual observations and task language, then conditions action prediction on the predicted motions, the task, and visuals at every stage. The central claims are that this hierarchy improves robustness and data efficiency on semantically diverse multi-task datasets and enables effective learning from and response to human language interventions during execution, outperforming teleoperated intervention baselines.

Significance. If the quantitative results hold under rigorous ablations, the work would demonstrate a practical way to leverage language for reusable low-level motion representations, reducing the demonstration burden for cross-task generalization and introducing a flexible language-based correction interface that can be used for online policy improvement.

major comments (2)
  1. [Abstract, §3 Method] The claim that the language-motion prediction step 'forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks' is load-bearing for both the robustness and intervention-learning results, yet no ablation isolates whether gains arise from the hierarchical conditioning or simply from adding extra language supervision; without this, the central assumption that fine-grained phrases capture reusable primitives remains unverified.
  2. [§4 Experiments] The abstract reports comparative experiments on robustness and language interventions but provides no details on how language-motion phrases are obtained (human annotation protocol, automatic generation, or consistency checks across tasks), which directly affects whether the intermediate representation enforces the desired shared structure or merely adds noisy supervision.
minor comments (2)
  1. [Abstract] The link to the project website is useful, but the summary omits concrete metrics, dataset sizes, and baseline names, which would help readers quickly gauge the scale of reported gains.
  2. [§3] Notation: the distinction between 'language motions' and the high-level task language is introduced informally; a short table or diagram in §3 clarifying the three levels (task, motion, action) and their conditioning would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our claims and experimental details. We address each major comment below and propose revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract, §3 Method] The claim that the language-motion prediction step 'forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks' is load-bearing for both the robustness and intervention-learning results, yet no ablation isolates whether gains arise from the hierarchical conditioning or simply from adding extra language supervision; without this, the central assumption that fine-grained phrases capture reusable primitives remains unverified.

    Authors: We agree that an ablation isolating the hierarchical conditioning from additional language supervision would provide stronger evidence for our central claim. Our current results demonstrate that RT-H outperforms direct language-to-action baselines on multi-task robustness and intervention tasks, and the intervention capability relies on explicit motion conditioning. However, we did not include a dedicated ablation removing the hierarchy while retaining extra language labels. We will add this ablation in the revised §4 to verify that gains stem from the structured hierarchy rather than supervision alone. revision: yes

  2. Referee: [§4 Experiments] The abstract reports comparative experiments on robustness and language interventions but provides no details on how language-motion phrases are obtained (human annotation protocol, automatic generation, or consistency checks across tasks), which directly affects whether the intermediate representation enforces the desired shared structure or merely adds noisy supervision.

    Authors: We appreciate this point on reproducibility. The language-motion phrases were obtained via human annotation of low-level motions in the demonstration trajectories, with phrases selected for semantic consistency across tasks (e.g., reusing 'move arm forward' for similar primitives). We will expand the experimental section to detail the annotation protocol, including guidelines provided to annotators and any inter-annotator consistency checks performed. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical hierarchical imitation-learning architecture that predicts language motions as an intermediate step before actions. No equations, fitted parameters renamed as predictions, or self-referential derivations appear in the abstract or method description. The central claims rest on experimental evaluation against baselines rather than any mathematical reduction that equates outputs to inputs by construction. The approach is self-contained with external benchmarks and does not invoke load-bearing self-citations or uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The method rests on the domain assumption that natural language phrases can reliably describe and share low-level motion primitives across tasks, plus standard imitation-learning assumptions about demonstration quality and visual feature extraction. No new physical constants or mathematical axioms are introduced.

free parameters (1)
  • language motion vocabulary size and phrasing
    The set of fine-grained phrases used as intermediate targets must be chosen or learned; the abstract does not specify how this vocabulary is constructed or whether it is fixed in advance (one possible construction is sketched after this ledger).
axioms (2)
  • domain assumption · Language motions capture shared low-level structure across tasks
    Invoked when claiming that predicting language motions forces the policy to learn shared structure (abstract paragraph on bridging tasks and actions).
  • domain assumption · Human language interventions provide useful training signal
    Assumed when stating that policies can learn from language interventions and outperform teleoperated ones.
invented entities (1)
  • language motion · no independent evidence
    purpose: Intermediate representation that bridges high-level task language and low-level actions while remaining human-interpretable for correction.
    New postulated layer introduced to organize the hierarchy; no independent evidence outside the training loop is provided in the abstract.
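One possible, assumed way to construct such a vocabulary is to discretize per-step end-effector deltas into a small set of phrase templates, as sketched below. The thresholds, axis conventions, and phrase names are illustrative, and the abstract does not say whether RT-H's vocabulary was built this way.

```python
# Illustrative vocabulary construction by discretizing action deltas.
import numpy as np

def label_motion(delta, pos_thresh=0.01, grip_thresh=0.5):
    """Map a 7-D action delta (x, y, z, roll, pitch, yaw, gripper) to a coarse phrase.
    Thresholds, axis signs, and phrase names are assumptions for illustration."""
    dx, dy, dz = delta[:3]
    grip = delta[6]
    if abs(grip) > grip_thresh:
        return "close gripper" if grip > 0 else "open gripper"
    moves = [dx, dy, dz]
    axis = int(np.argmax(np.abs(moves)))
    if abs(moves[axis]) < pos_thresh:
        return "stay"
    names = [("move arm forward", "move arm backward"),
             ("move arm left", "move arm right"),
             ("move arm up", "move arm down")]
    return names[axis][0] if moves[axis] > 0 else names[axis][1]

# The vocabulary is then the set of phrases observed across the demonstrations:
# vocabulary = sorted({label_motion(a) for a in demo_actions})
```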

pith-pipeline@v0.9.0 · 5638 in / 1455 out tokens · 36829 ms · 2026-05-17T06:49:12.759689+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  3. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  4. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  5. ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

    cs.RO 2026-02 unverdicted novelty 7.0

    ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

  6. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  7. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  8. VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...

  9. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  10. Continually Evolving Skill Knowledge in Vision Language Action Model

    cs.RO 2025-11 unverdicted novelty 6.0

    Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results wit...

  11. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV 2025-07 unverdicted novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  12. Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    cs.AI 2025-03 conditional novelty 6.0

    Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.

  13. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    cs.RO 2025-02 unverdicted novelty 6.0

    A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.

  14. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  15. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  16. RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    cs.RO 2024-12 accept novelty 6.0

    RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.

  17. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  18. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  19. AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

    cs.LG 2025-11 unverdicted novelty 5.0

    AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 18 Pith papers · 5 internal anchors

  1. [1]

Do as I can, not as I say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning , pages 287–318. PMLR, 2023

  2. [2]

No, to the right: Online language corrections for robotic manipulation via shared autonomy

    Yuchen Cui, Siddharth Karamcheti, Raj Palleti, Nidhya Shivakumar, Percy Liang, and Dorsa Sadigh. No, to the right: Online language corrections for robotic ma- nipulation via shared autonomy. In Proceedings of the 2023 ACM/IEEE International Conference on Human- Robot Interaction , HRI ’23, page 93–101, New York, NY , USA, 2023. Association for Computing M...

  3. [3]

    Correcting robot plans with natural language feedback

    Pratyusha Sharma, Balakumar Sundaralingam, Valts Blukis, Chris Paxton, Tucker Hermans, Antonio Torralba, Jacob Andreas, and Dieter Fox. Correcting robot plans with natural language feedback. ArXiv, abs/2204.05186,

  4. [4]

    URL https://api.semanticscholar.org/CorpusID: 248085271

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  6. [6]

    Human-in-the-loop imitation learning using remote teleoperation

    Ajay Mandlekar, Danfei Xu, Roberto Mart ´ın-Mart´ın, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in- the-loop imitation learning using remote teleoperation. CoRR, abs/2012.06733, 2020. URL https://arxiv.org/abs/ 2012.06733

  7. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  8. [8]

    BC-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-z: Zero-shot task generalization with robotic imitation learning. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id= 8kbp23tSGYv

  9. [9]

    Language-conditioned imitation learning for robot manipulation tasks

    Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, and Heni Ben Amor. Language- conditioned imitation learning for robot manipulation tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems , volume 33, pages 13139–13150. Curran Associates, Inc., 2020....

  10. [10]

    Cliport: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), 2021

  11. [11]

    What matters in language conditioned robotic imitation learning over unstructured data

    Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022

  12. [12]

    KITE: Keypoint-conditioned policies for semantic manipulation

    Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, and Jeannette Bohg. KITE: Keypoint-conditioned policies for semantic manipulation. In 7th Annual Conference on Robot Learning , 2023. URL https://openreview.net/ forum?id=veGdf4L4Xz

  13. [13]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

  14. [14]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909. PMLR, 2023

  15. [15]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems (RSS) , 2023

  16. [16]

    Vip: Towards universal visual reward and representation via value-implicit pre-training

    Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2022

  17. [17]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  18. [18]

    Robot learning from demonstration by constructing skill trees

    George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research , 31(3):360–375, 2012. doi: 10. 1177/0278364911428653. URL https://doi.org/10.1177/ 0278364911428653

  19. [19]

    Scott Niekum, Sarah Osentoski, George Konidaris, and Andrew G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 5239–5246, 2012. doi: 10.1109/IROS.2012.6386006

  20. [20]

    Ddco: Discovery of deep continuous options for robot learning from demonstrations

    Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Gold- berg. Ddco: Discovery of deep continuous options for robot learning from demonstrations. In Conference on robot learning, pages 418–437. PMLR, 2017

  21. [21]

    Learning robot skills with temporal variational inference

    Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In Interna- tional Conference on Machine Learning , pages 8624–

  22. [22]

    Compile: Compositional imitation learning and execution

    Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Push- meet Kohli, and Peter Battaglia. Compile: Compositional imitation learning and execution. In International Con- ference on Machine Learning, pages 3418–3428. PMLR, 2019

  23. [23]

    Discovering motor programs by recomposing demonstrations

    Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, and Abhinav Gupta. Discovering motor programs by re- composing demonstrations. In International Confer- ence on Learning Representations , 2020. URL https: //openreview.net/forum?id=rkgHY0NYwr

  24. [24]

    Skid raw: Skill discovery from raw trajectories

    Daniel Tanneberg, Kai Ploeger, Elmar Rueckert, and Jan Peters. Skid raw: Skill discovery from raw trajectories. IEEE robotics and automation letters , 6(3):4696–4703, 2021

  25. [25]

    Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation

    Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022

  26. [26]

    Hierarchical few-shot imitation with skill transition models

    Kourosh Hakhamaneshi, Ruihan Zhao, Albert Zhan, Pieter Abbeel, and Michael Laskin. Hierarchical few-shot imitation with skill transition models. In International Conference on Learning Representations , 2021

  27. [27]

    Robust imitation of diverse behaviors

    Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Fre- itas, Gregory Wayne, and Nicolas Heess. Robust imita- tion of diverse behaviors. Advances in Neural Informa- tion Processing Systems , 30, 2017

  28. [28]

    Learning latent plans from play

    Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Ku- mar, Jonathan Tompson, Sergey Levine, and Pierre Ser- manet. Learning latent plans from play. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning , vol- ume 100 of Proceedings of Machine Learning Research , pages 1113–1132. PMLR, 30 Oct–01 Nov...

  29. [29]

    PLATO: Predicting latent affordances through object-centric play

    Suneel Belkhale and Dorsa Sadigh. PLATO: Predicting latent affordances through object-centric play. In 6th Annual Conference on Robot Learning , 2022. URL https://openreview.net/forum?id=UAA5bNospA0

  30. [30]

    Coarse-to-fine imitation learning: Robot manipulation from a single demonstration

    Edward Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. In 2021 IEEE international conference on robotics and automation (ICRA), pages 4613–4619. IEEE, 2021

  31. [31]

    Hydra: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Confer- ence on Robot Learning, pages 2113–2133. PMLR, 2023

  32. [32]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning , pages 1769–

  33. [33]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Chris- tine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, and Yuan Cao. Robovqa: Multimodal long-horizon re...

  34. [34]

    ELLA: Exploration through learned language abstraction

    Suvir Mirchandani, Siddharth Karamcheti, and Dorsa Sadigh. ELLA: Exploration through learned language abstraction. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems , 2021. URL https:// openreview.net/forum?id=VvUldGZ3izR

  35. [35]

    Improving long-horizon imitation through instruction prediction

    Joey Hejna, Pieter Abbeel, and Lerrel Pinto. Improving long-horizon imitation through instruction prediction. In Proceedings of the AAAI Conference on Artificial Intel- ligence, volume 37, pages 7857–7865, 2023

  36. [36]

    Thought Cloning: Learning to think while acting by imitating human thinking

    Shengran Hu and Jeff Clune. Thought Cloning: Learning to think while acting by imitating human thinking. Ad- vances in Neural Information Processing Systems , 2023

  37. [37]

    Skill induction and planning with latent language

    Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In Pro- ceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1713–1726, 2022

  38. [38]

    Skill generalization with verbs

    Rachel Ma, Lyndon Lam, Benjamin A Spiegel, Aditya Ganeshan, Roma Patel, Ben Abbatematteo, David Paulius, Stefanie Tellex, and George Konidaris. Skill gen- eralization with verbs. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 5844–5851. IEEE, 2023

  39. [39]

    Interactive imitation learning in robotics based on simulations, 2022

    Xinjie Liu. Interactive imitation learning in robotics based on simulations, 2022

  40. [40]

    A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

    St ´ephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. No-regret reductions for imitation learning and structured prediction. CoRR, abs/1011.0686, 2010. URL http://arxiv.org/abs/1011.0686

  41. [41]

    Hg-dagger: Interactive imitation learning with human experts

    Michael Kelly, Chelsea Sidrane, Katherine Driggs- Campbell, and Mykel J Kochenderfer. Hg-dagger: Inter- active imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

  42. [42]

    Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning

    Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning. In Conference on Robot Learning, pages 598–608. PMLR, 2022

  43. [43]

    Lazydagger: Reducing context switching in interactive imitation learning

    Ryan Hoque, Ashwin Balakrishna, Carl Putterman, Michael Luo, Daniel S. Brown, Daniel Seita, Brijen Thananjeyan, Ellen R. Novoseller, and Ken Goldberg. Lazydagger: Reducing context switching in interactive imitation learning. In CASE, pages 502–509, 2021. URL https://doi.org/10.1109/CASE49439.2021.9551469

  44. [44]

    Query-efficient imitation learning for end-to-end simulated driving

    Jiakai Zhang and Kyunghyun Cho. Query-efficient im- itation learning for end-to-end simulated driving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 2891–2897. AAAI Press, 2017

  45. [45]

    Ensembledagger: A bayesian approach to safe imitation learning

    Kunal Menda, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 5041–5048, 2019. doi: 10.1109/IROS40897.2019. 8968287

  46. [46]

    Learning human objectives from sequences of physical corrections

    Mengxi Li, Alper Canberk, Dylan P Losey, and Dorsa Sadigh. Learning human objectives from sequences of physical corrections. In 2021 IEEE International Conference on Robotics and Automation (ICRA) , pages 2877–2883. IEEE, 2021

  47. [47]

    Physical interaction as communication: Learning robot objectives online from human corrections

    Dylan P Losey, Andrea Bajcsy, Marcia K O’Malley, and Anca D Dragan. Physical interaction as communication: Learning robot objectives online from human corrections. The International Journal of Robotics Research , 41(1): 20–44, 2022

  48. [48]

    Distilling and retrieving generalizable knowledge for robot manipulation via language corrections

    Lihan Zha, Yuchen Cui, Li-Heng Lin, Minae Kwon, Montserrat Gonzalez Arenas, Andy Zeng, Fei Xia, and Dorsa Sadigh. Distilling and retrieving generalizable knowledge for robot manipulation via language cor- rections. In 2nd Workshop on Language and Robot Learning: Language as Grounding , 2023

  49. [49]

    Real-time natural language corrections for assistive robotic manipulators

    Alexander Broad, Jacob Arkin, Nathan Ratliff, Thomas Howard, and Brenna Argall. Real-time natural language corrections for assistive robotic manipulators. The Inter- national Journal of Robotics Research, 36(5-7):684–698, 2017

  50. [50]

    Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers

    Arthur Bucker, Luis Figueredo, Sami Haddadinl, Ashish Kapoor, Shuang Ma, and Rogerio Bonatti. Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers. In 2022 IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS) , pages 978–984. IEEE, 2022

  51. [51]

    Latte: Language trajectory transformer, 2022

    Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, Sai Vemprala, and Rogerio Bonatti. Latte: Language trajectory transformer, 2022

  52. [52]

    Guiding policies with language via meta-learning

    John D Co-Reyes, Abhishek Gupta, Suvansh San- jeev, Nick Altieri, Jacob Andreas, John DeNero, Pieter Abbeel, and Sergey Levine. Guiding policies with language via meta-learning. In International Conference on Learning Representations , 2018

  53. [53]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters , pages 1–8, 2023. doi: 10.1109/LRA.2023.3295255

  54. [54]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 , 2023

  55. [55]

    Edwin B. Wilson. Probable inference, the law of succes- sion, and statistical inference. Journal of the American Statistical Association , 22(158):209–212, 1927. ISSN 01621459. URL http://www.jstor.org/stable/2276774

  56. [56]

    S. Lloyd. Least squares quantization in pcm. IEEE Trans- actions on Information Theory , 28(2):129–137, 1982. doi: 10.1109/TIT.1982.1056489

  57. [57]

    Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Her- zog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bern- hard Sch ¨olkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang H...

  58. [58]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demonstra- tions for robot manipulation. In 5th Annual Conference on Robot Learning , 2021. URL https://openreview.net/ forum?id=JrsfBJtDFdI

  59. [59]

    Data quality in imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Data quality in imitation learning. In Thirty-seventh Confer- ence on Neural Information Processing Systems , 2023. URL https://openreview.net/forum?id=FwmvbuDiMk

  60. [60]

    Gpt-4v(ision) system card. 2023. URL https://api. semanticscholar.org/CorpusID:263218031

  61. [61]

    Pivot: Iterative visual prompting elicits actionable knowledge for VLMs

    Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024

  62. [62]

    There is significant contextuality of language motions required when solving precise manipulation tasks (see Fig. 4, e.g., the speed or direction variety for a single language motion) – there was no single predefined primitive for many language motions that could safely and efficiently progress at the task. See Appendix D for a quantitative analysis of th...

  63. [63]

    left” and “up

    LLMs would inherently struggle to predict language motions because they are not grounded in the visual context of the scene. Therefore we would not expect these models to understand directions like “left” and “up” or to know when to close the gripper with just a textual description of the scene (as provided in SayCan). Thus VLMs are much better suited for...