hub Canonical reference

Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen

Haizhong Zheng, Yang Zhou, Brian R · 2025 · arXiv 2506.02177

Canonical reference. 86% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 86% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 1

citation-polarity summary

background 6 baseline 1

representative citing papers

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

cs.LG · 2026-05-07 · conditional · novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO

cs.DB · 2026-04-08 · unverdicted · novelty 7.0

LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Cost-Aware Learning

cs.LG · 2026-04-30

citing papers explorer

Showing 15 of 15 citing papers.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs cs.LG · 2026-05-15 · unverdicted · none · ref 47
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 51
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 26
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 43
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States cs.LG · 2026-05-08 · unverdicted · none · ref 43 · 2 links
POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 9
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO cs.DB · 2026-04-08 · unverdicted · none · ref 49
LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 184
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning cs.LG · 2026-05-11 · unverdicted · none · ref 16
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation cs.LG · 2026-05-09 · unverdicted · none · ref 61
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 94 · 2 links
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 45
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 57
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models cs.AI · 2026-04-18 · unverdicted · none · ref 33
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
Cost-Aware Learning cs.LG · 2026-04-30 · unreviewed · ref 23

Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer