Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. Most common role is background (52%).

1479 Pith papers citing it

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
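
The abstract does not spell out the "surrogate" objective. As a minimal sketch, assuming the clipped variant that the paper is best known for, the objective can be computed per minibatch as below. The function name, argument names, and the default `clip_eps=0.2` are illustrative choices, not taken from this page.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a minibatch (to be maximized).

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same samples.
    All names here are hypothetical, chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two candidate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

Because the clipped term removes any incentive to push the ratio far from 1, the same batch of samples can plausibly be reused for several epochs of minibatch updates, which is the property the abstract highlights.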

citation-role summary

background: 155 · method: 113 · baseline: 15 · dataset: 4

citation-polarity summary

background: 150 · use method: 109 · baseline: 15 · unclear: 7 · use dataset: 4 · support: 2
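
The page does not say how the 52% background figure is derived. As a quick check, assuming it is the background count divided by the total of classified citations, the short sketch below recomputes the share from both summaries. The variable names are illustrative, and the interpretation is an assumption rather than documented behavior.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 155, "method": 113, "baseline": 15, "dataset": 4}
polarity_counts = {
    "background": 150, "use method": 109, "baseline": 15,
    "unclear": 7, "use dataset": 4, "support": 2,
}

def background_share(counts):
    """Background citations as a percentage of all classified citations."""
    return 100.0 * counts["background"] / sum(counts.values())

total_role = sum(role_counts.values())          # 287
total_polarity = sum(polarity_counts.values())  # 287

print(round(background_share(role_counts)))      # 54
print(round(background_share(polarity_counts)))  # 52
```

Under this assumption, the headline 52% matches the polarity summary (150 of 287) rather than the role summary (155 of 287).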

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

cs.RO · 2024-03-14 · accept · novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

STEMGym benchmark demonstrates that perception pipelines dominate dose efficiency in autonomous STEM over navigation methods across 33 agent setups.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

citing papers explorer

Showing 50 of 1479 citing papers.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks cs.NI · 2026-05-03 · unverdicted · none · ref 22 · 2 links · internal anchor
A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning cs.CL · 2026-05-03 · unverdicted · none · ref 4 · 2 links · internal anchor
Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition cs.MA · 2026-05-03 · unverdicted · none · ref 43 · internal anchor
Coopetition-Gym v1 provides twenty calibrated environments for mixed-motive MARL with parameterized private/integrated/cooperative rewards, game-theoretic oracles, and validation against four historical coopetitive cases at 81-98% accuracy.
The Control Plant as A Communication Channel: Implicit Communication for Decentralized LQG Control math.OC · 2026-05-03 · unverdicted · none · ref 39 · internal anchor
By treating the control plant as a communication channel and using joint source-channel coding ideas, the leader implicitly conveys the target state to the follower whose estimation error decreases monotonically to zero, achieving coordination with control cost close to the explicit-communication 2.
MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models cs.CV · 2026-05-02 · unverdicted · none · ref 24 · internal anchor
MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 280 · internal anchor
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
PACE: Parameter Change for Unsupervised Environment Design cs.LG · 2026-05-02 · unverdicted · none · ref 11 · internal anchor
PACE uses the squared L2 norm of policy parameter changes from a first-order approximation as an efficient proxy for environment value in UED, outperforming baselines with higher IQM and lower optimality gap on MiniGrid and Craftax OOD tests.
VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU cs.OS · 2026-05-02 · unverdicted · none · ref 49 · internal anchor
VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis cs.CL · 2026-05-02 · unverdicted · none · ref 72 · internal anchor
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
Forager: a lightweight testbed for continual learning with partial observability in RL cs.LG · 2026-05-01 · unverdicted · none · ref 61 · internal anchor
Forager is a lightweight partially-observable continual RL environment that exposes loss of plasticity in current agents and highlights the value of state construction for ongoing learning.
Deep Variational Inference Symbolic Regression cs.LG · 2026-05-01 · unverdicted · none · ref 19 · internal anchor
DVISR performs variational inference over symbolic expression trees and constants by training a neural network with the ELBO as reward, recovering true posteriors in simple test cases.
Your Loss is My Gain: Low Stake Attacks on Liquid Staking Pools cs.GT · 2026-05-01 · unverdicted · none · ref 74 · internal anchor
A low-stake adversary can degrade a liquid staking pool's performance via consensus manipulation and profit from the resulting drop in its LST value through application-layer financial positions.
Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting cs.CV · 2026-05-01 · unverdicted · none · ref 6 · 2 links · internal anchor
LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity cs.LG · 2026-05-01 · unverdicted · none · ref 34 · internal anchor
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners cs.CL · 2026-04-30 · conditional · none · ref 25 · internal anchor
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
BoostLoRA: Growing Effective Rank by Boosting Adapters cs.LG · 2026-04-30 · unverdicted · none · ref 31 · internal anchor
BoostLoRA grows effective adapter rank linearly via iterative boosting on hard examples with orthogonal low-rank updates, outperforming both single-shot ultra-low-rank adapters and full fine-tuning on math and code tasks with zero added inference overhead.
TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models cs.CL · 2026-04-29 · unverdicted · none · ref 4 · internal anchor
TLPO mitigates language confusion in LLMs via token-level policy updates that outperform sequence-level methods while preserving general capabilities.
HiPAN: Hierarchical Posture-Adaptive Navigation for Quadruped Robots in Unstructured 3D Environments cs.RO · 2026-04-29 · unverdicted · none · ref 34 · internal anchor
HiPAN enables quadruped robots to navigate unstructured 3D environments more successfully by combining a high-level posture-adaptive policy with a low-level controller and curriculum learning on depth images.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning cs.RO · 2026-04-28 · unverdicted · none · ref 36 · internal anchor
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
EOS-Bench: A Comprehensive Benchmark for Earth Observation Satellite Scheduling cs.NI · 2026-04-28 · conditional · none · ref 107 · internal anchor
EOS-Bench creates thousands of satellite scheduling test cases spanning small to large scales and evaluates multiple solver types across five performance metrics.
Ember: An Extensible Benchmark Suite for Quantum Annealing Embedding Algorithms quant-ph · 2026-04-28 · unverdicted · none · ref 22 · 2 links · internal anchor
Ember provides the first standardized, reproducible benchmark framework with 24,016 diverse graph instances for quantum annealing embedding algorithms, showing that no single algorithm performs best across all graph families.
HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness cs.RO · 2026-04-28 · unverdicted · none · ref 26 · internal anchor
HANDFUL learns resource-aware grasps using finger contact rewards and curriculum learning to improve success on sequential dexterous tasks in simulation and on a real LEAP hand.
GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility cs.LG · 2026-04-27 · unverdicted · none · ref 23 · internal anchor
GradMAP enables fast offline training of fully decentralized neural policies for grid-edge flexibility by embedding a differentiable three-phase AC power-flow model and applying proximal surrogates in action space.
Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs cs.AI · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
AVES-DPO mitigates hallucinations in LVLMs by creating in-distribution preference pairs through the model's self-correction, outperforming baselines with only 5.2k samples.
MUSIC: Learning Muscle-Driven Dexterous Hand Control cs.GR · 2026-04-26 · unverdicted · none · ref 2 · internal anchor
A hierarchical RL controller with VAE distillation enables muscle-driven hands to synthesize accurate bimanual piano performances for novel music in physics simulation.
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines cs.AI · 2026-04-26 · unverdicted · none · ref 67 · internal anchor
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
Replay-buffer engineering for noise-robust quantum circuit optimization quant-ph · 2026-04-23 · unverdicted · none · ref 52 · internal anchor
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation cs.CV · 2026-04-22 · unverdicted · none · ref 39 · internal anchor
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback cs.CV · 2026-04-22 · unverdicted · none · ref 32 · internal anchor
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation cs.CV · 2026-04-21 · unverdicted · none · ref 30 · internal anchor
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration cs.AR · 2026-04-21 · unverdicted · none · ref 28 · internal anchor
ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.
Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty cs.RO · 2026-04-21 · unverdicted · none · ref 31 · internal anchor
MATCH trains hybrid position-force RL policies that achieve up to 10% higher success rates and 5x fewer breaks than pose-only policies in fragile peg-in-hole tasks under localization uncertainty, with strong sim-to-real results.
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 24 · internal anchor
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation cs.CV · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 56 · internal anchor
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs cs.RO · 2026-04-21 · unverdicted · none · ref 28 · internal anchor
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning cs.LG · 2026-04-21 · unverdicted · none · ref 44 · internal anchor
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Bounded Ratio Reinforcement Learning cs.LG · 2026-04-20 · conditional · none · ref 24 · internal anchor
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 32 · internal anchor
DART is the first claimed framework for non-prehensile dual-arm tray manipulation, integrating MPC with physics-based, online regression, and reinforcement learning dynamics models, validated in simulation.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 63 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception cs.AI · 2026-04-19 · unverdicted · none · ref 6 · internal anchor
SPECTRA enables supervision-free bootstrapping of agentic capabilities in SVLMs via cascaded tool rollout alignment, multi-objective rewards, and the TIU metric, yielding up to 5% higher task accuracy and 9% better tool efficiency.
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair cs.SE · 2026-04-19 · unverdicted · none · ref 64 · internal anchor
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Dynamic locking of an interacting spin system via periodic driving quant-ph · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
Detuning from resonance combined with pulse structure in periodic driving produces a structured effective Rabi field that enables dynamic spin locking, reversible Zeeman-dipolar order interconversion, and heterospin polarization transfer in interacting spin systems.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning cs.CL · 2026-04-18 · unverdicted · none · ref 16 · internal anchor
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.
Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation cs.IR · 2026-04-17 · unverdicted · none · ref 26 · internal anchor
AdaTTA is an actor-critic RL framework that selects sequence-specific test-time augmentations and improves recommendation metrics by up to 26% over fixed augmentation strategies on four datasets.
S-GRPO: Unified Post-Training for Large Vision-Language Models cs.LG · 2026-04-17 · unverdicted · none · ref 39 · internal anchor
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning cs.GT · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
Robustness applied to policy-gradient variance rather than return distributions expands the basin of cooperative equilibria under partner noise in coordination games, quantified via the new Price of Paranoia metric.
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees cs.LG · 2026-04-17 · unverdicted · none · ref 21 · internal anchor
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 40 · internal anchor
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning cs.AI · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.