Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Pith reviewed 2026-05-15 22:49 UTC · model grok-4.3
The pith
A hierarchical vision-language-action model lets robots interpret complex instructions and real-time feedback to choose and carry out the next step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that separating high-level reasoning from low-level control through a vision-language model allows a robot to process intricate prompts and incorporate corrective feedback during execution, enabling it to complete multi-step tasks that direct instruction-following methods cannot handle.
What carries the argument
Hierarchical vision-language-action model: a high-level VLM maps language and visual feedback to the next sub-goal, while a separate low-level policy translates that sub-goal into robot actions.
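To make the division of labor concrete, here is a minimal control-loop sketch of that hierarchy. It assumes an observation consisting of an image plus optional verbal feedback; the names (HighLevelVLM, LowLevelPolicy) and the seven-dimensional action are illustrative placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of the two-level loop: a high-level VLM turns the prompt,
# the current image, and any verbal feedback into a language sub-goal, and a
# separate low-level policy turns that sub-goal into motor commands.
# HighLevelVLM, LowLevelPolicy, and the 7-dim action are placeholders.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Observation:
    image: bytes                          # current camera frame
    user_feedback: Optional[str] = None   # e.g. "that's not trash"


@dataclass
class HighLevelVLM:
    """Maps the task prompt, latest image, and feedback to the next sub-goal."""
    history: List[str] = field(default_factory=list)

    def next_subgoal(self, prompt: str, obs: Observation) -> str:
        # A real system would query a pre-trained VLM here; this stub only
        # shows the context the decision is conditioned on.
        context = prompt
        if obs.user_feedback:
            context += f" | feedback: {obs.user_feedback}"
        subgoal = f"next step for: {context}"
        self.history.append(subgoal)
        return subgoal


class LowLevelPolicy:
    """Translates a language sub-goal plus the image into robot actions."""

    def act(self, subgoal: str, obs: Observation) -> List[float]:
        return [0.0] * 7  # placeholder joint / end-effector command


def run_episode(prompt: str,
                get_observation: Callable[[], Observation],
                send_action: Callable[[List[float]], None],
                max_steps: int = 50) -> None:
    vlm, policy = HighLevelVLM(), LowLevelPolicy()
    for _ in range(max_steps):
        obs = get_observation()                  # image + any new verbal feedback
        subgoal = vlm.next_subgoal(prompt, obs)  # high-level reasoning step
        action = policy.act(subgoal, obs)        # low-level execution step
        send_action(action)
```

Under this sketch's assumptions, swapping embodiments would amount to replacing only LowLevelPolicy while reusing the high-level reasoner unchanged.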
If this is right
- Robots can now respond to verbal corrections mid-task instead of requiring all instructions upfront.
- The same high-level model can be reused across single-arm, dual-arm, and mobile platforms with only the low-level policy swapped.
- Tasks that combine object manipulation with semantic understanding, such as distinguishing trash from food, become executable without custom code for each scenario.
Where Pith is reading between the lines
- The architecture may reduce the amount of robot-specific demonstration data needed for new tasks by leveraging the pre-trained vision-language model's reasoning.
- Extending the high-level layer to longer-horizon planning could allow robots to generate entire task sequences from a single high-level goal.
- Real-time feedback integration opens the possibility of safer shared workspaces where humans can verbally redirect the robot without physical intervention.
Load-bearing premise
The high-level vision-language model reliably converts open-ended instructions and visual feedback into correct next-step decisions without misinterpreting context or inventing invalid actions.
What would settle it
Run the robot on a table-cleaning task with an item the user labels 'that's not trash' and observe whether it correctly avoids removing that item while still clearing the rest of the table.
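A hedged sketch of how that single trial could be scored, assuming the experimenter records which items were on the table, which item the user protected with "that's not trash", and which items the robot removed. The item names and the scoring rule are illustrative; the paper specifies no such protocol.

```python
# Possible scoring rule for the settling trial described above: the robot
# should clear the table while leaving the one item the user flagged with
# "that's not trash". Item names and the protocol are illustrative.

def score_trial(items_on_table: set, protected_item: str, items_removed: set) -> dict:
    kept_protected = protected_item not in items_removed     # did the feedback land?
    others = items_on_table - {protected_item}
    cleared_fraction = len(items_removed & others) / max(len(others), 1)
    return {
        "kept_protected_item": kept_protected,
        "cleared_fraction": cleared_fraction,                # did the task still finish?
        "success": kept_protected and cleared_fraction == 1.0,
    }


# Example: the user says "that's not trash" about the mug.
print(score_trial({"mug", "wrapper", "napkin"}, "mug", {"wrapper", "napkin"}))
```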
read the original abstract
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Hi Robot, a hierarchical vision-language-action architecture in which a high-level VLM first interprets open-ended natural-language instructions and situated visual feedback to select the next appropriate step, after which low-level action models execute the chosen primitive. The system is demonstrated qualitatively across three robot platforms (single-arm, dual-arm, and mobile dual-arm) on tasks such as table cleaning, sandwich assembly, and grocery shopping, with emphasis on its ability to incorporate corrective feedback such as “that’s not trash.”
Significance. A reliably working hierarchical decomposition could advance generalist robotics by letting robots handle nuanced, context-dependent instructions that direct VLM prompting struggles with. The multi-platform evaluation suggests some degree of transferability, yet the complete absence of quantitative metrics prevents any assessment of how large or consistent the claimed advantage actually is.
major comments (2)
- [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.
- [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.
minor comments (2)
- [Implementation details] The manuscript should state the exact VLM checkpoints and prompting templates used for the high-level reasoner so that the qualitative results can be reproduced.
- [Figures and videos] Figure captions and video descriptions should explicitly link each clip to the specific feedback-handling behavior being illustrated.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of hierarchical decomposition for handling nuanced instructions and feedback. We agree that stronger quantitative evaluation is needed and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.
Authors: We agree that the current evaluation is primarily qualitative. In the revised manuscript we will add quantitative success rates obtained from repeated trials on the demonstrated tasks (table cleaning, sandwich assembly, grocery shopping) across the three platforms. We will also include a controlled test measuring the high-level VLM’s accuracy in correctly updating the plan when given corrective feedback phrases such as “that’s not trash.” revision: yes
-
Referee: [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.
Authors: We acknowledge the absence of a direct baseline comparison. We will add experiments that run the identical tasks using direct (non-hierarchical) VLM prompting and report comparative success rates, thereby quantifying the advantage of the hierarchical separation of high-level reasoning from low-level control. revision: yes
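To illustrate the quantitative reporting promised in the two responses above, here is a small sketch that aggregates repeated trials into per-task success rates for the hierarchical system and a direct-prompting baseline, and reports the gap between them. The trial records, task names, and condition labels are hypothetical; nothing here comes from the paper.

```python
# Sketch of the promised evaluation: per-task success rates for the
# hierarchical system vs. direct (flat) VLM prompting, plus the gap between
# them. All trial data, task names, and condition labels are hypothetical.
from collections import defaultdict


def success_rates(trials):
    """trials: iterable of (task, condition, succeeded) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (task, condition) -> [successes, total]
    for task, condition, ok in trials:
        counts[(task, condition)][0] += int(ok)
        counts[(task, condition)][1] += 1
    return {key: successes / total for key, (successes, total) in counts.items()}


trials = [
    ("table cleaning", "hierarchical", True),
    ("table cleaning", "hierarchical", True),
    ("table cleaning", "flat prompting", False),
    ("table cleaning", "flat prompting", True),
]
rates = success_rates(trials)
for task in {t for t, _, _ in trials}:
    gap = rates[(task, "hierarchical")] - rates[(task, "flat prompting")]
    print(f"{task}: hierarchical {rates[(task, 'hierarchical')]:.0%}, "
          f"flat {rates[(task, 'flat prompting')]:.0%}, gap {gap:+.0%}")
```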
Circularity Check
No circularity: empirical system description without derivation chain
full rationale
The paper describes a hierarchical VLM-based robotic control system for open-ended instructions and feedback, evaluated via qualitative demonstrations on three platforms for tasks such as sandwich-making and grocery shopping. No equations, fitted parameters, uniqueness theorems, or self-citations that reduce claims to inputs appear in the provided text. The central claims rest on described experiments rather than any mathematical reduction or self-referential construction, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained vision-language models can accurately deduce appropriate next physical steps from complex natural-language instructions and visual feedback.
Forward citations
Cited by 20 Pith papers
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight
QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselin...
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
G-Zero: Self-Play for Open-Ended Generation from Zero Data
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
-
ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions
ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effectiv...
-
ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...
-
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
-
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
Reference graph
Works this paper leans on
-
[1]
RT-H: Action hierarchies using language
Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., and Sadigh, D. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024
-
[2]
PaliGemma: A versatile 3B VLM for transfer
Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
-
[4]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
-
[5]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023 a
-
[6]
Do as i can, not as i say: Grounding language in robotic affordances
Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on robot learning, pp. 287--318. PMLR, 2023b
-
[7]
Automating robot failure recovery using vision-language models with optimized prompts
Chen, H., Yao, Y., Liu, R., Liu, C., and Ichnowski, J. Automating robot failure recovery using vision-language models with optimized prompts. arXiv preprint arXiv:2409.03966, 2024
-
[8]
Diffusion policy: Visuomotor policy learning via action diffusion
Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
-
[9]
Racer: Rich language-guided failure recovery policies for imitation learning
Dai, Y., Lee, J., Fazeli, N., and Chai, J. Racer: Rich language-guided failure recovery policies for imitation learning. arXiv preprint arXiv:2409.14674, 2024
-
[10]
PaLM-E: An Embodied Multimodal Language Model
Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
-
[11]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Fu, Z., Zhao, T. Z., and Finn, C. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024
-
[12]
Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023
Hu, Y., Lin, F., Zhang, T., Yi, L., and Gao, Y. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023. URL https://arxiv.org/abs/2311.17842
-
[13]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pp. 9118--9147. PMLR, 2022
-
[14]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., and Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023
-
[15]
Bc-z: Zero-shot task generalization with robotic imitation learning
Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., and Finn, C. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp. 991--1002. PMLR, 2022
-
[16]
Kahneman, D. Thinking, fast and slow. Farrar, Straus and Giroux, New York, 2011. ISBN 9780374275631 0374275637
-
[17]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
-
[18]
Interactive task planning with language models, 2025 a
Li, B., Wu, P., Abbeel, P., and Malik, J. Interactive task planning with language models, 2025 a . URL https://arxiv.org/abs/2310.10645
-
[19]
Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024
-
[20]
HAMSTER: Hierarchical action models for open-world robot manipulation
Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C. R., Ramos, F., Fox, D., Li, A., Gupta, A., and Goyal, A. Hamster: Hierarchical action models for open-world robot manipulation, 2025b. URL https://arxiv.org/abs/2502.05485
-
[21]
Code as policies: Language model programs for embodied control
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493--9500. IEEE, 2023
-
[22]
Moka: Open-vocabulary robotic manipulation through mark-based visual prompting
Liu, F., Fang, K., Abbeel, P., and Levine, S. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024 a
-
[23]
Interactive robot learning from verbal correction
Liu, H., Chen, A., Zhu, Y., Swaminathan, A., Kolobov, A., and Cheng, C.-A. Interactive robot learning from verbal correction. arXiv preprint arXiv:2310.17555, 2023
- [24]
-
[25]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024 c
-
[26]
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2017
-
[27]
Learning to parse natural language commands to a robot control system
Matuszek, C., Herbst, E., Zettlemoyer, L., and Fox, D. Learning to parse natural language commands to a robot control system. In Experimental Robotics: The 13th International Symposium on Experimental Robotics, volume 88, pp. 403. Springer, 2013
-
[28]
Is feedback all you need? leveraging natural language feedback in goal-conditioned rl
McCallum, S., Taylor-Davies, M., Albrecht, S., and Suglia, A. Is feedback all you need? leveraging natural language feedback in goal-conditioned rl. In NeurIPS 2023 Workshop on Goal-Conditioned Reinforcement Learning
-
[29]
Learning neuro-symbolic programs for language guided robot manipulation
Namasivayam, K., Singh, H., Bindal, V., Tuli, A., Agrawal, V., Jain, R., Singla, P., and Paul, R. Learning neuro-symbolic programs for language guided robot manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 7973--7980. IEEE, 2023
-
[30]
Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z., et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024
-
[31]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Chen, L. Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024
-
[32]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892--6903. IEEE, 2024
-
[33]
Patki, S., Daniele, A. F., Walter, M. R., and Howard, T. M. Inferring compact representations for efficient natural language understanding of robot instructions. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6926--6933. IEEE, 2019
-
[34]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
-
[35]
Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps
Qiu, D., Ma, W., Pan, Z., Xiong, H., and Liang, J. Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps. arXiv preprint arXiv:2406.18115, 2024
-
[36]
Robust speech recognition via large-scale weak supervision
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp. 28492--28518. PMLR, 2023
-
[37]
Shah, R., Yu, A., Zhu, Y., Zhu, Y., and Martín-Martín, R. Bumble: Unifying reasoning and acting with vision-language models for building-wide mobile manipulation. arXiv preprint arXiv:2410.06237, 2024
-
[38]
Shi, L. X., Hu, Z., Zhao, T. Z., Sharma, A., Pertsch, K., Luo, J., Levine, S., and Finn, C. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024
-
[39]
Progprompt: Generating situated robot task plans using large language models
Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523--11530. IEEE, 2023
- [40]
-
[41]
RLVF: Learning from verbal feedback without overgeneralization
Stephan, M., Khazatsky, A., Mitchell, E., Chen, A. S., Hsu, S., Sharma, A., and Finn, C. Rlvf: Learning from verbal feedback without overgeneralization. arXiv preprint arXiv:2402.10893, 2024
-
[42]
Language-conditioned imitation learning for robot manipulation tasks
Stepputtis, S., Campbell, J., Phielipp, M., Lee, S., Baral, C., and Ben Amor, H. Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33:13139--13150, 2020
-
[43]
Open-world object manipulation using pre-trained vision-language models
Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023
-
[44]
Swadzba, A., Vorwerg, C., Wachsmuth, S., and Rickheit, G. A computational model for the alignment of hierarchical scene representations in human-robot interaction. In Twenty-First International Joint Conference on Artificial Intelligence. Citeseer, 2009
-
[45]
Wang, S., Han, M., Jiao, Z., Zhang, Z., Wu, Y. N., Zhu, S.-C., and Liu, H. LLM^3: Large language model-based task and motion planning with motion failure reasoning. arXiv preprint arXiv:2403.11552, 2024
-
[46]
Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation
Wen, J., Zhu, Y., Li, J., Zhu, M., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., Peng, Y., et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024
-
[47]
Robi butler: Remote multimodal interactions with household robot assistant
Xiao, A., Janaka, N., Hu, T., Gupta, A., Li, K., Yu, C., and Hsu, D. Robi butler: Remote multimodal interactions with household robot assistant. arXiv preprint arXiv:2409.20548, 2024
-
[48]
Robotic Control via Embodied Chain-of-Thought Reasoning
Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., and Levine, S. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024
-
[49]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
-
[50]
Universal actions for enhanced embodied foundation models
Zheng, J., Li, J., Liu, D., Zheng, Y., Wang, Z., Ou, Z., Liu, Y., Liu, J., Zhang, Y.-Q., and Zhan, X. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025
-
[51]
Closed-loop open-vocabulary mobile manipulation with gpt-4v
Zhi, P., Zhang, Z., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., and Huang, S. Closed-loop open-vocabulary mobile manipulation with gpt-4v. arXiv preprint arXiv:2404.10220, 2024