Proximal policy optimization algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov · 2017

Mixed citation behavior. Most common role is background (60%).

29 Pith papers citing it

Background: 60% of classified citations (3 of 5)

citation-role summary

background 3 · baseline 1 · method 1

citation-polarity summary

background 3 · baseline 1 · use method 1

representative citing papers

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

cs.AI · 2025-03-07 · conditional · novelty 7.0

RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

cs.LG · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

cs.AI · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

cs.AI · 2026-05-07 · conditional · novelty 6.0 · 2 refs

Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.

Distributional Reinforcement Learning via the Cramér Distance

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.

COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.

RTMC: Step-Level Credit Assignment via Rollout Trees

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

RTMC aggregates returns across rollout trees to produce step-level Q-values and advantages, improving pass@1 by 3.2 points over GRPO on SWE-bench Verified.

RAGEN-2: Reasoning Collapse in Agentic RL

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

cs.CV · 2025-09-30 · unverdicted · novelty 6.0

PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.

SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning

cs.LG · 2025-08-07 · unverdicted · novelty 6.0

SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

cs.CL · 2025-05-30 · conditional · novelty 6.0

Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

cs.DC · 2025-04-14 · unverdicted · novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

Generalizing from a few environments in safety-critical reinforcement learning

cs.LG · 2019-07-02 · unverdicted · novelty 6.0

RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.

Efficient 3D Content Reconstruction and Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.

AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

AdaGamma stabilizes state-dependent discounting in deep actor-critic RL by adding a return-consistency regularizer, delivering gains on continuous-control benchmarks and a real-world logistics A/B test.

PokeRL: Reinforcement Learning for Pokemon Red

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

PokeRL trains PPO agents to finish early Pokemon Red tasks using a loop-aware environment wrapper, multi-layer anti-loop mechanisms, and dense hierarchical rewards.

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

cs.LG · 2025-11-10 · unverdicted · novelty 5.0

Introduces an off-policy adversarial imitation learning method with double Q stabilization that reduces samples required to match expert behavior.

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

cs.LG · 2025-10-27 · unverdicted · novelty 5.0

GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.

Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

cs.CL · 2025-09-20 · unverdicted · novelty 5.0

HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

cs.AI · 2025-05-22 · unverdicted · novelty 5.0

Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.

citing papers explorer

Showing 29 of 29 citing papers.

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model cs.AI · 2025-03-07 · conditional · none · ref 11
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training cs.LG · 2026-05-16 · unverdicted · none · ref 27 · 2 links
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning cs.AI · 2026-05-10 · unverdicted · none · ref 25 · 2 links
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight cs.AI · 2026-05-07 · conditional · none · ref 31 · 2 links
Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.
Distributional Reinforcement Learning via the Cramér Distance cs.LG · 2026-04-26 · unverdicted · none · ref 28
C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.
COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams cs.LG · 2026-04-20 · unverdicted · none · ref 18 · 2 links
COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.
RTMC: Step-Level Credit Assignment via Rollout Trees cs.LG · 2026-04-13 · unverdicted · none · ref 16
RTMC aggregates returns across rollout trees to produce step-level Q-values and advantages, improving pass@1 by 3.2 points over GRPO on SWE-bench Verified.
RAGEN-2: Reasoning Collapse in Agentic RL cs.LG · 2026-04-07 · unverdicted · none · ref 42
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection cs.CV · 2025-09-30 · unverdicted · none · ref 60
PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.
SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning cs.LG · 2025-08-07 · unverdicted · none · ref 27
SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.
RewardBench 2: Advancing Reward Model Evaluation cs.CL · 2025-06-02 · unverdicted · none · ref 50
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models cs.CL · 2025-05-30 · conditional · none · ref 17
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training cs.DC · 2025-04-14 · unverdicted · none · ref 59
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 51
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs cs.LG · 2024-02-22 · conditional · none · ref 41
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
Generalizing from a few environments in safety-critical reinforcement learning cs.LG · 2019-07-02 · unverdicted · none · ref 31
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
Efficient 3D Content Reconstruction and Generation cs.CV · 2026-05-18 · unverdicted · none · ref 218
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning cs.LG · 2026-05-16 · unverdicted · none · ref 44
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 7
AdaGamma stabilizes state-dependent discounting in deep actor-critic RL by adding a return-consistency regularizer, delivering gains on continuous-control benchmarks and a real-world logistics A/B test.
PokeRL: Reinforcement Learning for Pokemon Red cs.LG · 2026-04-12 · unverdicted · none · ref 9
PokeRL trains PPO agents to finish early Pokemon Red tasks using a loop-aware environment wrapper, multi-layer anti-loop mechanisms, and dense hierarchical rewards.
Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization cs.LG · 2025-11-10 · unverdicted · none · ref 20
Introduces an off-policy adversarial imitation learning method with double Q stabilization that reduces samples required to match expert behavior.
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA cs.LG · 2025-10-27 · unverdicted · none · ref 5
GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature cs.CL · 2025-09-20 · unverdicted · none · ref 21
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
Partner Modelling Emerges in Recurrent Agents (But Only When It Matters) cs.AI · 2025-05-22 · unverdicted · none · ref 45
Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.
InvDesFlow-AL: active learning-based workflow for inverse design of functional materials cond-mat.mtrl-sci · 2025-05-14 · unverdicted · none · ref 63
InvDesFlow-AL combines active learning with diffusion generative models to improve crystal structure prediction accuracy by 33% and identifies Li2AuH6 as a candidate BCS superconductor with 140 K transition temperature.
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning cs.CL · 2025-02-20 · unverdicted · none · ref 12
Rule-based RL on 5K logic puzzles induces advanced reasoning in a 7B model that transfers to AIME and AMC.
Reinforcement Learning for LLM Post-Training: A Survey cs.CL · 2024-07-23 · unverdicted · none · ref 72
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes cs.AI · 2026-05-11 · unreviewed · ref 26
Hint Tuning: Less Data Makes Better Reasoners cs.CL · 2026-05-09 · unreviewed · ref 19
