Adapting Generalist Robot Policies with Semantic Reinforcement Learning

Andrew Wagenmaker; Jagdeep Singh Bhatia; Sergey Levine; William Chen

arxiv: 2606.31958 · v1 · pith:B5ISACWYnew · submitted 2026-06-30 · 💻 cs.RO

Jagdeep Singh Bhatia, Andrew Wagenmaker, William Chen, Sergey Levine

Pith reviewed 2026-07-01 05:06 UTC · model grok-4.3

classification 💻 cs.RO

keywords generalist robot policies · semantic reinforcement learning · language prompts · skill composition · robot adaptation · visual language action models · online policy improvement

The pith

Modulating language prompts lets generalist robot policies adapt to complex tasks via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generalist robot policies pretrained on large datasets hold a wide range of behaviors, yet standard reinforcement learning that directly adjusts their action outputs often fails on tasks outside the original training distribution. The paper argues that the space of language prompts is a workable alternative to optimize over, because changes to the prompt can draw out and combine skills the policy already possesses. Semantic Action Reinforcement Learning (SARL) implements this idea by running online RL in the prompt space while holding the base policy fixed as a skill prior. The reported result is structured exploration, rapid performance gains, and the ability to solve long-horizon tasks in both simulation and the real world. A reader would care because the approach turns an existing generalist model into a controllable starting point rather than requiring the model to learn entirely new behaviors from scratch.

Core claim

The paper's central claim is that, for sufficiently expressive generalist policies, language prompts form an effective optimization space for reinforcement learning: modulating the language input elicits skills already within the policy's repertoire and allows their composition to solve tasks beyond zero-shot performance. SARL learns a policy over prompt space through online interaction, treating the frozen generalist as a controllable skill prior. This yields semantically meaningful exploration, efficient online improvement, and prompts that become grounded in the actual behaviors they induce.

What carries the argument

Semantic Action Reinforcement Learning (SARL), the algorithm that performs reinforcement learning over the space of language prompts supplied to a fixed generalist policy rather than over the policy's raw action outputs.
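As a rough illustration of what learning over prompts rather than actions can look like, the sketch below runs tabular Q-learning over a small, hypothetical prompt vocabulary while a stand-in frozen policy executes each prompted skill. Everything in it is an assumption made for illustration: the prompt list, the four-stage toy task, the deterministic `frozen_policy_step` function, and the choice of tabular Q-learning. The abstract does not specify SARL's prompt parameterization or update rule, so this should not be read as the authors' algorithm.

```python
import random

# Hypothetical prompt vocabulary (not taken from the paper).
PROMPTS = ["open drawer", "pick up block", "place block in drawer", "close drawer"]
NUM_STAGES = 4  # toy task: stage i is completed only by prompt i


def frozen_policy_step(stage: int, prompt_idx: int) -> int:
    """Stand-in for a frozen generalist policy executing one prompted skill.

    Assumption: the prompt matching the current stage always advances the
    task by one stage; any other prompt leaves the stage unchanged.
    """
    return stage + 1 if prompt_idx == stage else stage


def train_prompt_policy(episodes=2000, max_steps=20, alpha=0.5,
                        gamma=0.95, epsilon=0.3, seed=0):
    """Epsilon-greedy tabular Q-learning where the RL action is a prompt index."""
    rng = random.Random(seed)
    # q[stage][prompt]; the extra row is the terminal stage and stays at zero.
    q = [[0.0] * len(PROMPTS) for _ in range(NUM_STAGES + 1)]
    for _ in range(episodes):
        stage = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(PROMPTS))  # explore over prompts
            else:
                a = max(range(len(PROMPTS)), key=lambda i: q[stage][i])
            nxt = frozen_policy_step(stage, a)
            done = nxt == NUM_STAGES
            reward = 1.0 if done else 0.0  # sparse reward on task completion
            target = reward if done else gamma * max(q[nxt])
            q[stage][a] += alpha * (target - q[stage][a])
            stage = nxt
            if done:
                break
    return q


def greedy_prompts(q):
    """Read out the highest-value prompt for each non-terminal stage."""
    return [PROMPTS[max(range(len(PROMPTS)), key=lambda i: q[s][i])]
            for s in range(NUM_STAGES)]
```

In this toy setup, `greedy_prompts(train_prompt_policy())` recovers the prompt order that completes the task. With a real VLA, `frozen_policy_step` would be replaced by a rollout of the base policy, and the state would be an observation rather than a stage index.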

If this is right

Complex, long-horizon tasks outside the pretraining distribution become solvable by adapting vision-language-action (VLA) models.
Exploration during learning remains structured and semantically meaningful instead of unstructured in action space.
Prompts become grounded in induced real-world behaviors, supporting robust task execution.
Online improvement is substantially more sample-efficient than methods that learn new skills from scratch.
The approach outperforms prior methods that attempt to improve deployed robot behavior through direct action-space RL.
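The contrast between structured and unstructured exploration in the list above can be made concrete with a toy counting argument. The sketch below assumes uniform random exploration and a task with exactly one correct sequence of decisions. All of the numbers are hypothetical and chosen only for illustration; none come from the paper. It is also a deliberately crude bound, since action-space RL methods start from the base policy rather than from uniform noise.

```python
def random_search_success_prob(num_choices: int, decisions_needed: int) -> float:
    """Chance that uniform random choices hit one specific decision sequence."""
    return (1.0 / num_choices) ** decisions_needed


# Hypothetical task sizes (assumptions, not values from the paper).
stages = 4                # sub-goals in a long-horizon task
prompts_per_stage = 10    # candidate prompts available at each stage
low_level_steps = 50      # control steps needed per stage
discretized_actions = 10  # coarse bins if raw actions were discretized

# Prompt space: one decision per stage.
p_prompt = random_search_success_prob(prompts_per_stage, stages)

# Action space: one decision per low-level control step.
p_action = random_search_success_prob(discretized_actions,
                                      stages * low_level_steps)

print(f"prompt-space random search: {p_prompt:.1e}")
print(f"action-space random search: {p_action:.1e}")
```

Under these assumptions the number of decisions that must be right drops from 200 to 4, which is the intuition behind calling prompt-space exploration structured.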

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prompt-space optimization idea could be tested on generalist policies that accept other conditioning signals besides language.
If prompt optimization works reliably, it may reduce the data and compute needed to specialize large robot models for new environments.
Limits of the method could be probed by measuring how performance scales with the diversity of skills present in the base policy.
Techniques developed for optimizing prompts in language models might transfer directly to this robot setting.

Load-bearing premise

The generalist policy must already contain the component skills that language-prompt changes can reliably elicit and combine for the new task.

What would settle it

SARL produces no measurable improvement over the base policy or random prompting on a task for which no relevant sub-skills can be elicited from the generalist even after exhaustive prompt search.
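Whether an improvement over random prompting counts as measurable depends on the statistical comparison used. One possible check, sketched below, is a one-sided two-proportion z-test on episode success counts. The function name and the example counts are hypothetical and are not results from the paper. With small numbers of robot trials the normal approximation is rough, so an exact test may be the better choice in practice.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """One-sided z-test for H1: success rate of A exceeds success rate of B.

    Returns (z, p_value) using the pooled-variance normal approximation.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both arms all-success or all-failure: no evidence of a difference.
        return 0.0, 0.5
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    return z, p_value


# Hypothetical counts for illustration only.
z, p = two_proportion_z(success_a=18, n_a=20,   # learned prompt policy
                        success_b=4, n_b=20)    # random prompting
print(f"z = {z:.2f}, one-sided p = {p:.2g}")
```

A large p-value on a task where the base policy holds no relevant sub-skills would be the kind of null result this section describes.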

Figures

Figures reproduced from arXiv: 2606.31958 by Andrew Wagenmaker, Jagdeep Singh Bhatia, Sergey Levine, William Chen.

**Figure 2.** Adapting VLAs to solve new tasks requires decomposing task-goals into achievable stages, and grounding each stage in behaviors that can be executed successfully in the environment. VLAs prompted zero-shot struggle with both decomposition and grounding, and prompting VLAs with VLMs achieves decomposition but not grounding. Only SARL achieves both by learning to optimize a VLA’s prompt inputs with RL. In a…

**Figure 3.** Across four complex, real-world tasks, learning over semantic actions with SARL enables the …

**Figure 4.** SARL outperforms prior methods for deployment-time adaptation -- DSRL …

**Figure 5.** We test our approach on four complex, long-horizon tasks on the WidowX robot, illustrated above, …

**Figure 6.** On complex, long-horizon tasks, the base policy fails when zero-shot prompted. SARL solves them …

**Figure 7.** Across tasks, an in-context learning VLM …

**Figure 8.** Across four complex, real-world tasks, learning over semantic actions with SARL enables the best …

**Figure 9.** Across tasks, an in-context learning VLM …

**Figure 10.** Prompt for open-ended real-world semantic action candidate generation

**Figure 11.** Prompt for real-world semantic action candidate generation from cache

**Figure 12.** Prompt for real-world in-context learning VLM baseline

**Figure 13.** Prompt for open-ended sim semantic action candidate generation

**Figure 14.** Prompt for sim semantic action candidate generation from cache

**Figure 15.** Prompt for sim in-context learning VLM baseline

Original abstract

Generalist robot policies learn a diverse repertoire of behaviors from large-scale pretraining. In principle, this makes them excellent priors for downstream adaptation via reinforcement learning (RL). In practice, however, standard RL methods leveraging this prior optimize directly over robot actions, requiring the base policy's action distribution to be close to that of a performant policy from the start. This assumption breaks down for complex or long-horizon tasks that fall outside the pretraining distribution. Our key insight is that, for sufficiently expressive generalist policies, language prompts are an effective alternative space for learning to solve such tasks: modulating language inputs elicits skills already within the policy's repertoire, which can be composed to solve tasks beyond its zero-shot capabilities. We propose Semantic Action Reinforcement Learning (SARL), which learns to optimize this prompt space through online interaction, treating the generalist policy as a controllable skill prior. Importantly, leveraging pretrained skills rather than learning new ones from scratch yields structured, semantically meaningful exploration and highly efficient online improvement, and learning to modulate prompts through experience grounds them in induced real-world behaviors for robust task-solving. Across real-world settings and simulated benchmarks, we show SARL unlocks fundamentally new capabilities -- adapting VLA behavior to solve complex, long-horizon tasks -- and significantly outperforms existing approaches for improving robot behavior in deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SARL reframes adaptation by running RL over language prompts instead of actions, which looks workable for long-horizon robot tasks if the base policy already holds the needed skills.

The main point is that SARL optimizes language prompts with reinforcement learning to adapt generalist robot policies, rather than working directly in action space. This lets the method handle tasks outside the pretraining distribution by composing existing skills through prompt modulation.

The paper presents this as a way to get structured exploration and efficient improvement. They evaluate on real-world settings and simulated benchmarks, claiming it unlocks new capabilities and beats existing methods.

What is new is the choice of prompt space as the RL action space. Prior approaches focused on actions or parameters, but here the semantic level provides meaningful steps. The work also shows how online interaction grounds the prompts in real behaviors.

This approach does well when the base policy has a rich enough repertoire. The experiments appear to support the efficiency gains that come from reusing pretrained skills rather than learning from scratch.

A soft spot is the reliance on the policy being sufficiently expressive. The paper conditions on this and reports validation, but if prompt changes don't reliably trigger the right skills, results could vary. More detail on failure cases or sensitivity to the language model would help clarify the limits. The assumption is stated upfront, so it's not hidden.

The paper targets robot learning researchers focused on deploying and adapting large pretrained models like VLAs. Readers looking for practical adaptation techniques in embodied AI will find it relevant.

It has a clear problem statement, a distinct method, and empirical claims that merit review. I would recommend sending it for peer review.

Referee Report

0 major / 2 minor

Summary. The paper claims that for sufficiently expressive generalist robot policies, language prompts provide an effective alternative optimization space for RL adaptation on out-of-distribution tasks. SARL learns to modulate prompts online to elicit and compose pretrained skills, yielding structured exploration and efficient improvement over action-space RL. The method is reported to unlock new capabilities for complex long-horizon tasks and to outperform baselines across real-world settings and simulated benchmarks.

Significance. If the empirical claims hold, the work offers a promising direction for leveraging large-scale pretrained generalist policies in robotics by shifting RL to a semantically meaningful prompt space rather than raw actions. This could enable more practical adaptation without requiring the base policy's action distribution to already be near-optimal, with potential for broader impact on deployment of VLA-style models.

minor comments (2)

The description of how the prompt space is parameterized (e.g., continuous vs. discrete embeddings, vocabulary constraints) and how the RL update is performed on it would benefit from an explicit algorithmic outline or pseudocode for reproducibility.
The abstract states that SARL 'significantly outperforms existing approaches'; a concise table or set of key metrics (success rate, sample efficiency) comparing against the strongest baselines should be referenced early in the introduction or results overview.
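One way to report the sample-efficiency metric mentioned in the second comment is the number of episodes needed before a rolling success rate first reaches a target. The sketch below is an illustrative definition only. The function name, window size, and threshold are assumptions, and the paper may define sample efficiency differently.

```python
from collections import deque
from typing import Iterable, Optional


def episodes_to_threshold(outcomes: Iterable[bool],
                          window: int = 10,
                          threshold: float = 0.8) -> Optional[int]:
    """Return the 1-based episode at which the rolling success rate over the
    last `window` episodes first reaches `threshold`, or None if it never does.
    """
    recent = deque(maxlen=window)  # keeps only the most recent outcomes
    for episode, success in enumerate(outcomes, start=1):
        recent.append(1 if success else 0)
        # Only evaluate once a full window of episodes is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Hypothetical learning curve: five failures followed by ten successes.
curve = [False] * 5 + [True] * 10
print(episodes_to_threshold(curve))
```

Reporting this number next to the final success rate for each method would give the concise comparison the comment asks for.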

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, accurate summary of our contributions, and recommendation for minor revision. We appreciate the recognition of SARL's potential impact on adapting generalist VLA policies for long-horizon tasks via prompt-space optimization.

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes SARL as a method for adapting generalist policies via language prompt modulation in RL, with the central insight stated directly in the abstract and no derivation chain, equations, or fitted parameters presented. Claims rest on empirical evaluation across real-world and simulated settings rather than reducing to self-citations, ansatzes, or inputs by construction. No load-bearing steps match any of the enumerated circularity patterns, and the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that language modulation can access and compose skills in the pretrained policy; no free parameters or invented entities are identifiable from the abstract.

axioms (1)

domain assumption: For sufficiently expressive generalist policies, modulating language inputs elicits skills already within the policy's repertoire that can be composed for new tasks.
This is the key insight stated in the abstract and is load-bearing for the entire approach.

Reference graph

Works this paper leans on

87 extracted references · 49 canonical work pages · 26 internal anchors

[1]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manip- ulation.arXiv preprint arXiv:2507.05331, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

J. Gu, S. Kirmani, P. Wohlhart, Y . Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches.arXiv preprint arXiv:2311.01977, 2023

work page arXiv 2023
[7]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

2025
[11]

L. Zha, A. J. Hancock, M. Zhang, T. Yin, Y . Huang, D. Shah, A. Z. Ren, and A. Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer.arXiv preprint arXiv:2602.10556, 2026

work page arXiv 2026
[12]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control, 2018. URLhttps://arxiv.org/abs/1812. 03201

2018
[13]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement-residual rl for precise assembly. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 01–08. IEEE, 2025

2025
[14]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model.arXiv preprint arXiv:2412.13630, 2024. 10

work page arXiv 2024
[15]

J ¨ulg, W

T. J ¨ulg, W. Burgard, and F. Walter. Refined policy distillation: From vla generalists to rl experts.arXiv preprint arXiv:2503.05833, 2025

work page arXiv 2025
[16]

P. Dong, Q. Li, D. Sadigh, and C. Finn. Expo: Stable reinforcement learning with expressive policies. arXiv preprint arXiv:2507.07986, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Y . Xie, F. Hu, J. Wu, Z. Luo, L. Fan, et al. Self-improving vision-language-action models with data generation via residual rl.arXiv preprint arXiv:2511.00091, 2025

work page arXiv 2025
[18]

C. Xu, J. T. Springenberg, M. Equi, A. Amin, A. Esmail, S. Levine, and L. Ke. Rl token: Bootstrapping online rl with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Sun and S

Z. Sun and S. Song. From prior to pro: Efficient skill mastery via distribution contractive rl finetuning. arXiv preprint arXiv:2603.10263, 2026

work page arXiv 2026
[20]

Wagenmaker, M

A. Wagenmaker, M. Nakamoto, Y . Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning.Conference on Robot Learning, 2025

2025
[21]

A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, et al.π ∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

M. S. Mark, T. Gao, G. G. Sampaio, M. K. Srirama, A. Sharma, C. Finn, and A. Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone.arXiv preprint arXiv:2412.06685, 2024

work page arXiv 2024
[23]

Nakamoto, O

M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foun- dation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

work page arXiv 2024
[24]

Y . Chen, S. Tian, S. Liu, Y . Zhou, H. Li, and D. Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy.arXiv preprint arXiv:2502.05450, 2025

work page arXiv 2025
[25]

J. Hu, R. Hendrix, A. Farhadi, A. Kembhavi, R. Mart ´ın-Mart´ın, P. Stone, K.-H. Zeng, and K. Ehsani. Flare: Achieving masterful and adaptive robot policies with large-scale reinforcement learning fine- tuning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3617–3624. IEEE, 2025

2025
[26]

Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen. Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664, 2025

work page arXiv 2025
[27]

G. Lu, W. Guo, C. Zhang, Y . Zhou, H. Jiang, Z. Gao, Y . Tang, and Z. Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y . Wu, C. Yu, and Y . Wang. What can rl bring to vla generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025

work page arXiv 2025
[29]

Daniel, G

C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search.Journal of Machine Learning Research, 17(93):1–50, 2016

2016
[30]

T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation.Advances in neural information processing systems, 29, 2016. 11

2016
[31]

X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning.Acm transactions on graphics (tog), 36(4):1–13, 2017

2017
[32]

Riedmiller, R

M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V . Mnih, N. Heess, and J. T. Springenberg. Learning by playing solving sparse reward tasks from scratch. InInternational confer- ence on machine learning, pages 4344–4353. PMLR, 2018

2018
[33]

Nachum, S

O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning.Ad- vances in neural information processing systems, 31, 2018

2018
[34]

Gehring, G

J. Gehring, G. Synnaeve, A. Krause, and N. Usunier. Hierarchical skills for efficient exploration.Ad- vances in Neural Information Processing Systems, 34:11553–11564, 2021

2021
[35]

Neural probabilistic motor primitives for humanoid control

J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V . Pham, G. Wayne, Y . W. Teh, and N. Heess. Neural probabilistic motor primitives for humanoid control.arXiv preprint arXiv:1811.11711, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum. Opal: Offline primitive discovery for accelerating offline reinforcement learning.arXiv preprint arXiv:2010.13611, 2020

work page arXiv 2010
[37]

Singh, H

A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhinehart, and S. Levine. Parrot: Data-driven behavioral priors for reinforcement learning.arXiv preprint arXiv:2011.10024, 2020

work page arXiv 2011
[38]

Pertsch, Y

K. Pertsch, Y . Lee, and J. Lim. Accelerating reinforcement learning with learned skill priors. InCon- ference on robot learning, pages 188–204. PMLR, 2021

2021
[39]

Nasiriany, T

S. Nasiriany, T. Gao, A. Mandlekar, and Y . Zhu. Learning and retrieval from prior data for skill-based imitation learning.arXiv preprint arXiv:2210.11435, 2022

work page arXiv 2022
[40]

Wilcoxson, Q

M. Wilcoxson, Q. Li, K. Frans, and S. Levine. Leveraging skills from unlabeled prior data for efficient online exploration.arXiv preprint arXiv:2410.18076, 2024

work page arXiv 2024
[41]

Huang, P

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

2022
[42]

A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. Ryoo, V . Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language.arXiv preprint arXiv:2204.00598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Inner Monologue: Embodied Reasoning through Planning with Language Models

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Cheb- otar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Liang, W

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

2023
[47]

B. Li, P. Wu, P. Abbeel, and J. Malik. Interactive task planning with language models.arXiv preprint arXiv:2310.10645, 2023. 12

work page arXiv 2023
[48]

Zhang, J

J. Zhang, J. Zhang, K. Pertsch, Z. Liu, X. Ren, M. Chang, S.-H. Sun, and J. J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance.arXiv preprint arXiv:2310.10021, 2023

work page arXiv 2023
[49]

H. Ha, P. Florence, and S. Song. Scaling up and distilling down: Language-guided robot skill acquisi- tion. InConference on Robot Learning, pages 3766–3777. PMLR, 2023

2023
[50]

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[51]

L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo, S. Levine, and C. Finn. Yell at your robot: Improving on-the-fly from language corrections.arXiv preprint arXiv:2403.12910, 2024

work page arXiv 2024
[52]

L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control. arXiv preprint arXiv:2602.13193, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

A. Shah, W. Chen, A. Godbole, F. Mora, S. A. Seshia, and S. Levine. Learning affordances at inference- time for vision-language-action models.arXiv preprint arXiv:2510.19752, 2025

work page arXiv 2025
[55]

J. Kwok, X. Zhang, M. Xu, Y . Liu, A. Mirhoseini, C. Finn, and M. Pavone. Scaling verification can be more effective than scaling policy learning for vision-language-action alignment.arXiv preprint arXiv:2602.12281, 2026

work page arXiv 2026
[56]

T. Shin, Y . Razeghi, R. L. L. Iv, E. Wallace, and S. Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

2020
[57]

Y . Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers. InThe eleventh international conference on learning representations, 2022

2022
[58]

gradient descent

R. Pryzant, D. Iter, J. Li, Y . Lee, C. Zhu, and M. Zeng. Automatic prompt optimization with “gradient descent” and beam search. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 7957–7968, 2023

2023
[59]

M. Deng, J. Wang, C.-P. Hsieh, Y . Wang, H. Guo, T. Shu, M. Song, E. Xing, and Z. Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3369–3391, 2022

2022
[60]

Zhang, X

T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and J. E. Gonzalez. Tempera: Test-time prompting via reinforcement learning.arXiv preprint arXiv:2211.11890, 2022

work page arXiv 2022
[61]

Jung and K.-J

H. Jung and K.-J. Kim. Discrete prompt compression with reinforcement learning.IEEE Access, 12: 72578–72587, 2024

2024
[62]

X. Wang, C. Li, Z. Wang, F. Bai, H. Luo, J. Zhang, N. Jojic, E. Xing, and Z. Hu. Promptagent: Strategic planning with language models enables expert-level prompt optimization. InInternational Conference on Learning Representations, volume 2024, pages 23967–24001, 2024. 13

2024
[63]

Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y . Yang. Evoprompt: Connecting llms with evolutionary algorithms yields powerful prompt optimizers.arXiv e-prints, pages arXiv–2309, 2023

2023
[64]

P. Lu, L. Qiu, K.-W. Chang, Y . N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan. Dy- namic prompt learning via policy gradient for semi-structured mathematical reasoning.arXiv preprint arXiv:2209.14610, 2022

work page arXiv 2022
[65]

Jafari, D

Y . Jafari, D. Mekala, R. Yu, and T. Berg-Kirkpatrick. Morl-prompt: An empirical analysis of multi- objective reinforcement learning for discrete prompt optimization. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9878–9889, 2024

2024
[66]

PRL: Prompts from Reinforcement Learning

P. Batorski, A. Kosmala, and P. Swoboda. Prl: Prompts from reinforcement learning.arXiv preprint arXiv:2505.14412, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

W. Kong, S. Hombaiah, M. Zhang, Q. Mei, and M. Bendersky. Prewrite: Prompt rewriting with rein- forcement learning. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 594–601, 2024

2024
[68]

M. Kwon, G. Kim, J. Kim, H. Lee, and J. Kim. Stableprompt: Automatic prompt tuning using re- inforcement learning for large language model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9868–9884, 2024

2024
[69]

Mirchandani, S

S. Mirchandani, S. Karamcheti, and D. Sadigh. Ella: Exploration through learned language abstraction. Advances in neural information processing systems, 34:29529–29540, 2021

2021
[70]

J. Mu, V . Zhong, R. Raileanu, M. Jiang, N. Goodman, T. Rockt¨aschel, and E. Grefenstette. Improving intrinsic exploration with language abstractions.Advances in Neural Information Processing Systems, 35:33947–33960, 2022

2022
[71]

A. Tam, N. Rabinowitz, A. Lampinen, N. A. Roy, S. Chan, D. Strouse, J. Wang, A. Banino, and F. Hill. Semantic exploration from language abstractions and pretrained representations.Advances in neural information processing systems, 35:25377–25389, 2022

2022
[72]

Y . Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas. Guiding pre- training in reinforcement learning with large language models. InInternational Conference on Machine Learning, pages 8657–8677. PMLR, 2023

2023
[73]

T. Carta, P.-Y. Oudeyer, O. Sigaud, and S. Lamprier. EAGER: Asking and answering questions for automatic reward shaping in language-guided RL. Advances in Neural Information Processing Systems, 35:12478–12490, 2022.

[74]

P. Goyal, S. Niekum, and R. J. Mooney. Using natural language for reward shaping in reinforcement learning. arXiv preprint arXiv:1903.02020, 2019.

[75]

A. Peng, I. Sucholutsky, B. Li, T. Sumers, T. L. Griffiths, J. Andreas, and J. Shah. Learning with language-guided state abstractions. In International Conference on Learning Representations, volume 2024, pages 38711–38744, 2024.

[76]

S. Branavan, D. Silver, and R. Barzilay. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research, 43:661–704, 2012.

[77]

K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551, 2017.

[78]

D. Misra, J. Langford, and Y. Artzi. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1004–1015, 2017.

[79]

K. Narasimhan, R. Barzilay, and T. Jaakkola. Grounding language for transfer in deep reinforcement learning. Journal of Artificial Intelligence Research, 63:849–874, 2018.

[80]

A. G. Barto. Reinforcement learning: An introduction. by richard’s sutton. SIAM Rev, 6(2):423, 2021.

Showing first 80 references.
