
arxiv: 2606.31958 · v1 · pith:B5ISACWYnew · submitted 2026-06-30 · 💻 cs.RO

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

Pith reviewed 2026-07-01 05:06 UTC · model grok-4.3

classification 💻 cs.RO
keywords generalist robot policies · semantic reinforcement learning · language prompts · skill composition · robot adaptation · visual language action models · online policy improvement

The pith

Modulating language prompts lets generalist robot policies adapt to complex tasks via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generalist robot policies pretrained on large datasets hold a wide range of behaviors, yet standard reinforcement learning that directly adjusts their action outputs often fails on tasks outside the original training distribution. The paper establishes that optimizing language prompts instead serves as a workable alternative space, because changes to the prompt can draw out and combine skills the policy already possesses. Semantic Action Reinforcement Learning implements this idea by running online RL in the prompt space while holding the base policy fixed as a skill prior. The result is structured exploration, rapid performance gains, and the ability to solve long-horizon tasks in both simulation and the real world. A reader would care because the approach turns an existing generalist model into a controllable starting point rather than requiring the model to learn entirely new behaviors from scratch.

Core claim

The paper's central claim is that, for sufficiently expressive generalist policies, language prompts form an effective optimization space for reinforcement learning: modulating the language input elicits skills already within the policy's repertoire and allows their composition to solve tasks beyond zero-shot performance. SARL learns a policy over prompt space through online interaction, treating the frozen generalist as a controllable skill prior. This yields semantically meaningful exploration, efficient online improvement, and prompts that become grounded in the actual behaviors they induce.
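One way to write the claim down, as a sketch rather than the paper's own notation: let \pi_{\text{base}} be the frozen generalist policy and \pi_\theta a learned policy that selects a prompt \ell from a prompt space. The symbols, the discount factor \gamma, and the per-step prompt selection are illustrative assumptions; in practice a prompt would likely be held fixed over several control steps.

```latex
\max_{\theta}\; J(\theta)
  = \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\ell_t \sim \pi_{\theta}(\cdot \mid o_t),
\quad
a_t \sim \pi_{\text{base}}(\cdot \mid o_t, \ell_t)
```

Only \theta is optimized. The parameters of \pi_{\text{base}} never change, so every behavior the learner can reach is one the generalist already produces under some prompt.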

What carries the argument

Semantic Action Reinforcement Learning (SARL), the algorithm that performs reinforcement learning over the space of language prompts supplied to a fixed generalist policy rather than over the policy's raw action outputs.
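The idea of learning over prompts instead of raw actions can be illustrated with a toy sketch. This is not the paper's algorithm: the prompt list, the three-stage task, the stand-in `frozen_policy_step`, and the use of tabular Q-learning are all assumptions made for illustration.

```python
import random

# Hypothetical prompt candidates ("semantic actions"); not taken from the paper.
PROMPTS = ["pick up the block", "open the drawer", "place the block in the drawer"]

# Toy stand-in for the frozen generalist: each prompt elicits a skill that
# only advances the task from one particular stage (assumed for illustration).
SKILL_STAGE = {"open the drawer": 0, "pick up the block": 1, "place the block in the drawer": 2}
GOAL = 3  # the task is solved once stage 3 is reached


def frozen_policy_step(stage, prompt):
    """Execute the (never updated) base policy under `prompt`; return the next stage."""
    return stage + 1 if SKILL_STAGE[prompt] == stage else stage


def train_prompt_policy(episodes=1000, horizon=6, alpha=0.5, gamma=0.95, eps=0.3, seed=0):
    """Tabular Q-learning where the RL action is the prompt, not a motor command."""
    rng = random.Random(seed)
    q = {(s, p): 0.0 for s in range(GOAL) for p in PROMPTS}
    for _ in range(episodes):
        stage = 0
        for _ in range(horizon):
            # Epsilon-greedy exploration over prompts only.
            if rng.random() < eps:
                prompt = rng.choice(PROMPTS)
            else:
                prompt = max(PROMPTS, key=lambda p: q[(stage, p)])
            nxt = frozen_policy_step(stage, prompt)
            if nxt == GOAL:
                target = 1.0  # sparse reward on task completion
            else:
                target = gamma * max(q[(nxt, p)] for p in PROMPTS)
            q[(stage, prompt)] += alpha * (target - q[(stage, prompt)])
            stage = nxt
            if stage == GOAL:
                break
    return q


def greedy_prompts(q):
    """Prompt the learned policy prefers at each stage."""
    return [max(PROMPTS, key=lambda p: q[(s, p)]) for s in range(GOAL)]
```

Calling `greedy_prompts(train_prompt_policy())` is expected to recover the open, pick, place ordering in this toy setup. Exploration here only ever tries prompts, so every trial executes some coherent pretrained skill, which is the sense in which exploration stays semantically meaningful.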

If this is right

  • Complex, long-horizon tasks outside the pretraining distribution become solvable by adapting vision-language-action (VLA) models.
  • Exploration during learning remains structured and semantically meaningful instead of unstructured in action space.
  • Prompts become grounded in induced real-world behaviors, supporting robust task execution.
  • Online improvement is substantially more sample-efficient than methods that learn new skills from scratch.
  • The approach outperforms prior methods that attempt to improve deployed robot behavior through direct action-space RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-space optimization idea could be tested on generalist policies that accept other conditioning signals besides language.
  • If prompt optimization works reliably, it may reduce the data and compute needed to specialize large robot models for new environments.
  • Limits of the method could be probed by measuring how performance scales with the diversity of skills present in the base policy.
  • Techniques developed for optimizing prompts in language models might transfer directly to this robot setting.

Load-bearing premise

The generalist policy must already contain the component skills that language-prompt changes can reliably elicit and combine for the new task.

What would settle it

The claim would be undermined if SARL produced no measurable improvement over the base policy or over random prompting on a task for which no relevant sub-skills can be elicited from the generalist, even after exhaustive prompt search.
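A minimal sketch of how such a comparison could be scored, assuming each method is summarized by success counts over a fixed number of evaluation rollouts. The one-sided pooled two-proportion z-test, the 0.05 threshold, and all function names are illustrative choices, not a protocol taken from the paper.

```python
import math


def one_sided_p_value(successes_a, trials_a, successes_b, trials_b):
    """p-value for H1: rate_a > rate_b, using a pooled two-proportion z-test."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0.0:
        return 1.0  # both methods always succeed or always fail: no evidence of a gap
    z = (p_a - p_b) / se
    return 0.5 * math.erfc(z / math.sqrt(2))


def improves_over_baselines(sarl, baselines, alpha=0.05):
    """True only if the learned prompt policy beats every baseline at level alpha."""
    s, n = sarl
    return all(one_sided_p_value(s, n, bs, bn) < alpha for bs, bn in baselines.values())
```

For example, with hypothetical counts, `improves_over_baselines((3, 20), {"zero_shot": (2, 20), "random_prompt": (3, 20)})` returns `False`, which is the outcome that would count against the method on that task.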

Figures

Figures reproduced from arXiv: 2606.31958 by Andrew Wagenmaker, Jagdeep Singh Bhatia, Sergey Levine, William Chen.

Figure 1: Steering VLA priors at deployment to solve complex, long-horizon tasks is challenging. To over…
Figure 2: Adapting VLAs to solve new tasks requires decomposing task-goals into achievable stages, and grounding each stage in behaviors that can be executed successfully in the environment. VLAs prompted zero-shot struggle with both decomposition and grounding, and prompting VLAs with VLMs achieves decomposition but not grounding. Only SARL achieves both by learning to optimize a VLA’s prompt inputs with RL. In a…
Figure 3: Across four complex, real-world tasks, learning over semantic actions with SARL enables the…
Figure 4: SARL outperforms prior methods for deployment-time adaptation -- DSRL [ …
Figure 5: We test our approach on four complex, long-horizon tasks on the WidowX robot, illustrated above, …
Figure 6: On complex, long-horizon tasks, the base policy fails when zero-shot prompted. SARL solves them…
Figure 7: Across tasks, an in-context learning VLM [ …
Figure 8: Across four complex, real-world tasks, learning over semantic actions with SARL enables the best…
Figure 9: Across tasks, an in-context learning VLM [ …
Figure 10: Prompt for open-ended real-world semantic action candidate generation
Figure 11: Prompt for real-world semantic action candidate generation from cache
Figure 12: Prompt for real-world in-context learning VLM baseline
Figure 13: Prompt for open-ended sim semantic action candidate generation
Figure 14: Prompt for sim semantic action candidate generation from cache
Figure 15: Prompt for sim in-context learning VLM baseline

Original abstract

Generalist robot policies learn a diverse repertoire of behaviors from large-scale pretraining. In principle, this makes them excellent priors for downstream adaptation via reinforcement learning (RL). In practice, however, standard RL methods leveraging this prior optimize directly over robot actions, requiring the base policy's action distribution to be close to that of a performant policy from the start. This assumption breaks down for complex or long-horizon tasks that fall outside the pretraining distribution. Our key insight is that, for sufficiently expressive generalist policies, language prompts are an effective alternative space for learning to solve such tasks: modulating language inputs elicits skills already within the policy's repertoire, which can be composed to solve tasks beyond its zero-shot capabilities. We propose Semantic Action Reinforcement Learning (SARL), which learns to optimize this prompt space through online interaction, treating the generalist policy as a controllable skill prior. Importantly, leveraging pretrained skills rather than learning new ones from scratch yields structured, semantically meaningful exploration and highly efficient online improvement, and learning to modulate prompts through experience grounds them in induced real-world behaviors for robust task-solving. Across real-world settings and simulated benchmarks, we show SARL unlocks fundamentally new capabilities -- adapting VLA behavior to solve complex, long-horizon tasks -- and significantly outperforms existing approaches for improving robot behavior in deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that for sufficiently expressive generalist robot policies, language prompts provide an effective alternative optimization space for RL adaptation on out-of-distribution tasks. SARL learns to modulate prompts online to elicit and compose pretrained skills, yielding structured exploration and efficient improvement over action-space RL. The method is reported to unlock new capabilities for complex long-horizon tasks and to outperform baselines across real-world settings and simulated benchmarks.

Significance. If the empirical claims hold, the work offers a promising direction for leveraging large-scale pretrained generalist policies in robotics by shifting RL to a semantically meaningful prompt space rather than raw actions. This could enable more practical adaptation without requiring the base policy's action distribution to already be near-optimal, with potential for broader impact on deployment of VLA-style models.

minor comments (2)
  1. The description of how the prompt space is parameterized (e.g., continuous vs. discrete embeddings, vocabulary constraints) and how the RL update is performed on it would benefit from an explicit algorithmic outline or pseudocode for reproducibility.
  2. The abstract states that SARL 'significantly outperforms existing approaches'; a concise table or set of key metrics (success rate, sample efficiency) comparing against the strongest baselines should be referenced early in the introduction or results overview.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, accurate summary of our contributions, and recommendation for minor revision. We appreciate the recognition of SARL's potential impact on adapting generalist VLA policies for long-horizon tasks via prompt-space optimization.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper proposes SARL as a method for adapting generalist policies via language prompt modulation in RL, with the central insight stated directly in the abstract and no derivation chain, equations, or fitted parameters presented. Claims rest on empirical evaluation across real-world and simulated settings rather than reducing to self-citations, ansatzes, or inputs by construction. No load-bearing steps match any of the enumerated circularity patterns, and the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that language modulation can access and compose skills in the pretrained policy; no free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption: For sufficiently expressive generalist policies, modulating language inputs elicits skills already within the policy's repertoire that can be composed for new tasks.
    This is the key insight stated in the abstract and is load-bearing for the entire approach.

pith-pipeline@v0.9.1-grok · 5763 in / 1103 out tokens · 37581 ms · 2026-07-01T05:06:51.431845+00:00 · methodology

