pith. sign in

arxiv: 2311.18232 · v1 · pith:3MD5E4EJnew · submitted 2023-11-30 · 💻 cs.CL · cs.AI· cs.LG

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

classification 💻 cs.CL cs.AIcs.LG
keywords languagellmsmulti-turnlearningreinforcementalgorithmsgoal-directedinteractions
0
0 comments X
read the original abstract

Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

    cs.LG 2026-05 conditional novelty 6.0

    ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...

  2. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  3. Alignment has a Fantasia Problem

    cs.AI 2026-04 unverdicted novelty 6.0

    AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

  4. Specificity-aware reinforcement learning for fine-grained open-world classification

    cs.CV 2026-03 unverdicted novelty 6.0

    SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.

  5. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  6. Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

    cs.CL 2026-06 unverdicted novelty 5.0

    This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environm...

  7. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

    cs.LG 2025-12 unverdicted novelty 5.0

    Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.