arxiv: 2511.14460 · v2 · pith:KM3JL6PWnew · submitted 2025-11-18 · 💻 cs.CL

Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

classification 💻 cs.CL
keywords agentic · framework · increasingly · optimization · agent-r1 · agents · assignment · compatible

Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon tasks, reinforcement learning (RL) is becoming increasingly important for shaping their behavior. This shift is especially visible in agentic RL, where models must interact with tools and environments across multiple rounds rather than produce a single standalone response. In this regime, the usual view of a trajectory as one ever-growing token sequence becomes increasingly inadequate: it makes context evolution rigid and creates representation mismatches between rollout and training. This paper presents Agent-R1, a unified and modular framework for agentic RL built around step-level trajectory representation, flexible context management, and layered interfaces for workflows, environments and optimization. The key idea is to treat each interaction step as the basic reinforcement-learning transition, while keeping the optimization layer flexible: once the interaction is modeled at the step level, the framework can support token-level credit assignment, step-level credit assignment, or other compatible designs. These design choices make the framework compatible with a range of optimization strategies rather than tying it to a single algorithm. Together, these components provide a principled, extensible, and reusable substrate for agentic RL.
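
The step-level view described in the abstract can be illustrated with a minimal sketch. This is not Agent-R1's actual API: the `Step` class, its fields, and both helper functions are hypothetical names chosen for illustration, and the discounted-return rule is only one simple credit-assignment choice among the "compatible designs" the abstract mentions. The sketch stores a trajectory as a list of steps, each with its own context, then derives step-level returns and broadcasts them to the action tokens of each step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One interaction step, treated as a single RL transition (hypothetical structure)."""
    context_tokens: List[int]  # context the policy saw at this step; may be rewritten between steps
    action_tokens: List[int]   # tokens the policy generated at this step
    reward: float              # scalar reward observed after the environment or tool responded


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Step-level credit: discounted return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def token_level_weights(steps: List[Step], gamma: float = 1.0) -> List[List[float]]:
    """Token-level credit: copy each step's return onto every action token of that step."""
    return [
        [g] * len(step.action_tokens)
        for step, g in zip(steps, step_returns(steps, gamma))
    ]


# A three-step trajectory with a reward only at the final step (toy token ids).
trajectory = [
    Step(context_tokens=[1, 2], action_tokens=[10, 11], reward=0.0),
    Step(context_tokens=[1, 2, 3], action_tokens=[12], reward=0.0),
    Step(context_tokens=[3, 4], action_tokens=[13, 14, 15], reward=1.0),
]

print(step_returns(trajectory, gamma=0.5))         # [0.25, 0.5, 1.0]
print(token_level_weights(trajectory, gamma=0.5))  # one weight per action token, grouped by step
```

Because each step carries its own `context_tokens`, the third step can use a shortened or rewritten context without breaking the trajectory, which a single ever-growing token sequence cannot express.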

This paper has not been read by Pith yet.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...

  2. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  3. Tools as Continuous Flow for Evolving Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...

  4. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

    cs.LG 2026-04 unverdicted novelty 6.0

    Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

  5. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

  6. SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 6.0

    SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

  7. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 4.0

    StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.

  8. EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

    cs.AI 2026-04 unverdicted novelty 4.0

    Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.

  9. Toward a Safe Internet of Agents

    cs.MA 2025-11 unverdicted novelty 4.0

    The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, multi-agent systems, a...