mega hub Mixed citations

Proximal Policy Optimization Algorithms

Alec Radford, Filip Wolski, John Schulman, Oleg Klimov, Prafulla Dhariwal · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. Most common role is background (52%).

1546 Pith papers citing it

Background 52% of classified citations

open full Pith review browse 1546 citing papers more from Alec Radford arXiv PDF

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 155 method 113 baseline 15 dataset 4

citation-polarity summary

background 150 use method 109 baseline 15 unclear 7 use dataset 4 support 2

claims ledger

abstract We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more ge

authors

Alec Radford Filip Wolski John Schulman Oleg Klimov Prafulla Dhariwal

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

cs.RO · 2024-03-14 · accept · novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

citing papers explorer

Showing 50 of 1546 citing papers.

Reinforcement learning for ion shuttling on trapped-ion quantum computers quant-ph · 2026-05-21 · unverdicted · none · ref 50 · internal anchor
Reinforcement learning optimizes ion shuttling on trapped-ion quantum chips and reduces operations by up to 36.3% versus heuristics across multiple architectures.
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA cs.CL · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.
Unified Data Selection for LLM Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 54 · internal anchor
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors cs.RO · 2026-05-21 · unverdicted · none · ref 71 · 2 links · internal anchor
Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking only base/hands/object keypoints inside a BFM latent space, and training with progressive simple rewards for mocap deployment.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
Emergence of agriculture in an artificial society of reinforcement learning agents cs.MA · 2026-05-21 · unverdicted · none · ref 20 · 2 links · internal anchor
Agriculture emerges spontaneously in an RL agent society through planning for delayed rewards, social learning that counters cheaters, and an irreversible lock-in effect.
CLORE: Content-Level Optimization for Reasoning Efficiency cs.AI · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles cs.LG · 2026-05-21 · unverdicted · none · ref 44 · internal anchor
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation cs.RO · 2026-05-21 · unverdicted · none · ref 30 · 2 links · internal anchor
CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta
Token-weighted Direct Preference Optimization with Attention cs.CL · 2026-05-21 · unverdicted · none · ref 2 · 2 links · internal anchor
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 29 · 2 links · internal anchor
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
Value-Gradient Hypothesis of RL for LLMs cs.LG · 2026-05-20 · unverdicted · none · ref 9 · internal anchor
Shows that under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs are value-gradient-like in expectation, motivating a decomposition into value signal and reward headroom for when RL is most effective.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 67 · internal anchor
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 35 · internal anchor
DeCoR co-optimizes crosswalk placement and signal control via reinforcement learning on a real 750 m urban corridor, reporting 23% faster pedestrian access to crossings and 79%/65% reductions in pedestrian/vehicle wait times versus fixed-time baselines.
Behavior-Consistent Deep Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 155 · 2 links · internal anchor
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving cs.AI · 2026-05-20 · unverdicted · none · ref 3 · 2 links · internal anchor
ScenePilot uses RSS-derived physical feasibility score and online-learned AV-risk predictor in constrained RL with feasibility-aware shielding to generate boundary-band scenarios, yielding +6.2 pp higher collision rates on SafeBench while preserving validity.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation cs.LG · 2026-05-20 · unverdicted · none · ref 33 · 2 links · internal anchor
GRPO suffers advantage collapse on uniform-reward groups; ACR quantifies it and AVSPO adds virtual samples to restore gradients, yielding 4-6% accuracy gains on math benchmarks across 0.5B-14B models.
Towards Context-Invariant Safety Alignment for Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 98 · internal anchor
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front cs.LG · 2026-05-20 · unverdicted · none · ref 82 · internal anchor
SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.
Spectral Souping: A Unified Framework for Online Preference Alignment cs.LG · 2026-05-19 · unverdicted · none · ref 26 · internal anchor
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework cs.RO · 2026-05-19 · unverdicted · none · ref 53 · internal anchor
SUGAR turns diverse human videos into deployable humanoid loco-manipulation policies via automated prior extraction, physics refinement, and hierarchical distillation, showing scaling with data volume and zero-shot real-world transfer on six tasks.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 30 · internal anchor
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 29 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents cs.AI · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR cs.LG · 2026-05-19 · unverdicted · none · ref 21 · internal anchor
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning cs.LG · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
General Preference Reinforcement Learning cs.LG · 2026-05-18 · unverdicted · none · ref 3 · 3 links · internal anchor
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors cs.RO · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
State-dependent adversarial motion priors with a gravity-threshold gate enable one frozen policy to unify walking, running, and recovery on humanoid hardware.
DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization cs.LG · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
DiPRL trains nearly discrete programmatic policies in RL by adding architecture entropy regularization to gradient-based optimization, avoiding performance collapse from post-hoc discretization.
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights cs.LG · 2026-05-18 · conditional · none · ref 41 · internal anchor
A maximum entropy reinforcement learning framework generates realistic customer trajectories in retail spaces that match real data better than TSP or PNN heuristics and support more accurate layout optimization decisions.
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning cs.AI · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning cs.LG · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization cs.AI · 2026-05-18 · unverdicted · none · ref 14 · internal anchor
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization cs.LG · 2026-05-17 · unverdicted · none · ref 12 · internal anchor
DEPPA reformulates the denoising process of pocket-aware diffusion models as a multi-step MDP and applies RL fine-tuning with a coarse scheduler to optimize ligands for binding affinity, drug-likeness, synthesizability and diversity on CrossDocked2020.
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning cs.LG · 2026-05-17 · conditional · none · ref 21 · internal anchor
Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
Self-Supervised On-Policy Distillation for Reasoning Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 33 · internal anchor
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment cs.CL · 2026-05-17 · unverdicted · none · ref 93 · internal anchor
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
QPLEX Decision Processes: Formulation via Nonlinear Markov Chains and Optimization via Policy Gradients math.OC · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
QPLEX Decision Processes embed QPLEX transient approximations into a nonlinear MDP framework and optimize policies via deterministic gradients and natural-gradient methods, demonstrated on a dynamic pricing problem with waiting costs or chance constraints.
A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems cs.CR · 2026-05-16 · unverdicted · none · ref 22 · internal anchor
A hybrid LLM-RL red teaming framework generates adaptive attack campaigns in simulated enterprise networks to evaluate the robustness of AI-enabled SOAR systems.
Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning cs.RO · 2026-05-16 · unverdicted · none · ref 10 · internal anchor
XDiffuser combines extrinsic graph planning with diffusion models to guide denoising and improve performance on long-horizon robotic tasks including multi-agent coordination and TSP-style problems.
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning cs.LG · 2026-05-16 · unverdicted · none · ref 48 · internal anchor
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play cs.AI · 2026-05-16 · unverdicted · none · ref 4 · internal anchor
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing cs.LG · 2026-05-15 · unverdicted · none · ref 189 · internal anchor
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making cs.LG · 2026-05-15 · unverdicted · none · ref 154 · internal anchor
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization cs.CV · 2026-05-15 · unverdicted · none · ref 20 · 2 links · internal anchor
Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models? cs.CV · 2026-05-15 · unverdicted · none · ref 46 · internal anchor
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR cs.AI · 2026-05-15 · unverdicted · none · ref 17 · internal anchor
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
Offline Reinforcement Learning with Universal Horizon Models cs.LG · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
Universal horizon models extend geometric horizon models to arbitrary horizons and apply winsorized distributions for stable offline RL value learning, outperforming baselines on 100 OGBench tasks.