Proximal Policy Optimization Algorithms

Alec Radford, Filip Wolski, John Schulman, Oleg Klimov, Prafulla Dhariwal · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. The most common role is background (52% of classified citations).

1991 Pith papers citing it

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
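
The abstract does not spell out the surrogate objective. As a minimal sketch, assuming the clipped variant of the objective introduced in the PPO paper, the per-sample term is min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is an advantage estimate. The function name clipped_surrogate, its argument names, and the default clip_eps = 0.2 are illustrative assumptions, not taken from this page.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a minibatch (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for the same samples.
    clip_eps: clipping range (hypothetical default, chosen for illustration).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Usage: when both policies agree, the ratio is 1 and the objective reduces
# to the mean advantage.
same_policy_value = clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, -1.0])
```

Because the clipped term removes any gain from pushing the ratio outside [1 - eps, 1 + eps], the same batch can be reused for several epochs of minibatch updates without the policy drifting far from the one that collected the data.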

citation-role summary

background: 156 · method: 114 · baseline: 15 · dataset: 4

citation-polarity summary

background: 151 · use method: 110 · baseline: 15 · unclear: 7 · use dataset: 4 · support: 2

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0 · 2 refs

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

IRumAI: Reinforcement Learning for Indian Rummy

cs.AI · 2026-06-20 · unverdicted · novelty 8.0

IRumAI is the first RL agent for Indian Rummy, trained on weak heuristics to beat strong search opponents at 7000x speed.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

cs.AI · 2026-06-17 · conditional · novelty 8.0

DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.

Efficient AI-Inspired Reduction of Feynman Integrals via Tube Seeding

hep-ph · 2026-06-09 · unverdicted · novelty 8.0

Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

citing papers explorer

Showing 50 of 1991 citing papers.

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
cs.LG · 2026-06-07 · unverdicted · none · ref 15 · internal anchor
K-nearest neighbor from a knowledge graph beats most methods on out-of-distribution transcriptomic perturbation prediction, and an RL-trained reasoning LLM matches SOTA on Replogle et al. (2022) cell lines while improving downstream differential expression prediction.
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff
cs.LG · 2026-06-07 · unverdicted · none · ref 39 · internal anchor
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking
cs.RO · 2026-06-07 · unverdicted · none · ref 44 · internal anchor
IR-SIM is a YAML-defined simulator for mobile robot navigation that supports text-prompt scenario creation, policy training, benchmarking, and bridging to higher-fidelity or real-world settings.
Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation
cs.LG · 2026-06-07 · unverdicted · none · ref 18 · internal anchor
AdaGRPO gates GRPO reinforcement learning with supervised NLL using per-sample binary clips based on policy difficulty and reward discriminability, raising HR@10 from 11.01% to 12.18% while keeping hallucination below 0.22% on large-scale e-commerce data and showing A/B gains.
Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking
cs.RO · 2026-06-06 · unverdicted · none · ref 36 · internal anchor
A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.
Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation
cs.RO · 2026-06-05 · unverdicted · none · ref 29 · internal anchor
A three-level hierarchical RL framework uses pose affordances to guide navigation and interaction-point affordances to guide pedipulation, enabling autonomous object manipulation by quadrupeds in simulation and real-world tests.
QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation
cs.RO · 2026-06-05 · unverdicted · none · ref 49 · internal anchor
QuadVerse integrates 3D Gaussian Splatting scene reconstruction, friction calibration via trajectory search, and a residual dynamics compensator to improve quadruped simulation fidelity and enable zero-shot policy transfer.
T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion
cs.RO · 2026-06-05 · unverdicted · none · ref 48 · internal anchor
T-GMP learns a terrain-conditioned latent motion manifold via CVAE from demonstrations and integrates it into an adversarial pipeline with a foothold penalty for versatile, natural humanoid locomotion.
AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO
cs.CV · 2026-06-05 · unverdicted · none · ref 16 · internal anchor
AdaGRPO enhances GRPO for flow models via online curriculum filtering of prompts and cross-level advantage fusion, yielding performance gains and training stability.
Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards
cs.CL · 2026-06-05 · unverdicted · none · ref 21 · internal anchor
Progress-SQL introduces a multi-turn RL framework with ODT-based structural alignment and progressive rewards that measure improvement across refinement turns, yielding gains on BIRD, Spider, and robustness benchmarks.
Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension
cs.RO · 2026-06-05 · unverdicted · none · ref 56 · 2 links · internal anchor
Reinforcement learning produces a single unified controller that lets an actively suspended planetary rover autonomously cross heterogeneous rough terrains after sim training and zero-shot hardware transfer.
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
cs.RO · 2026-06-04 · unverdicted · none · ref 13 · internal anchor
HANDOFF is a distilled mixture-of-experts humanoid whole-body controller that follows a compact task-space interface, matches SOTA velocity tracking, provides large manipulation workspace on Unitree G1, and supports VLM-driven agentic planning with no task-specific data.
EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading
cs.CL · 2026-06-04 · unverdicted · none · ref 35 · internal anchor
EDIT improves LLM rubric grading faithfulness by diagnosing problematic reasoning steps via posterior belief and grounding scores then applying local SFT revisions and belief-penalizing RL.
L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation
cs.RO · 2026-06-04 · unverdicted · none · ref 38 · internal anchor
L-SDPPO optimizes a spiking diffusion policy with RL and adds SDLI to handle microgravity dynamics, reporting higher success rates and lower energy use than prior methods on five intra-vehicular tasks.
LadderMan: Learning Humanoid Perceptive Ladder Climbing
cs.RO · 2026-06-04 · unverdicted · none · ref 35 · internal anchor
A hybrid motion-tracking and imitation-reinforcement pipeline produces a depth-based visuomotor policy that lets humanoids climb varied ladders zero-shot on hardware and perform teleoperated manipulation while climbing.
EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction
cs.HC · 2026-06-04 · unverdicted · none · ref 62 · internal anchor
EEGDancer integrates VQ-VAE latent space learning, masked Transformer modeling, and SAC reinforcement learning to improve continuous EEG emotion prediction over prior methods on SEED datasets.
BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection
cs.CV · 2026-06-04 · unverdicted · none · ref 41 · internal anchor
BMCR uses RL to adaptively compose modules from CNN and ViT backbones with an OT alignment interface, reporting mAP gains of up to 2.5 points on DOTA and DIOR-R datasets.
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
cs.LG · 2026-06-04 · unverdicted · none · ref 33 · internal anchor
MR.Q combines predictive auxiliary tasks with high-capacity value functions in a model-free architecture to achieve strong multitask RL performance without planning.
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
cs.LG · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
SA-AH-GRPO applies asymmetric entropy-based discounting only to negative-advantage trajectories in GRPO, yielding similar peak Pass@1 accuracy with 3.6x lower training variance on GSM8K for Qwen 2.5 models.
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
cs.RO · 2026-06-03 · unverdicted · none · ref 55 · internal anchor
GRAIL creates over 20,000 synthetic loco-manipulation sequences from known 3D configurations and video priors, then trains policies that achieve 84% pick-up and 90% stair-climbing success on a real Unitree G1 humanoid using only the generated data.
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
cs.LG · 2026-06-03 · unverdicted · none · ref 34 · internal anchor
Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
cs.RO · 2026-06-03 · unverdicted · none · ref 36 · internal anchor
CoRe-MoE uses a two-stage RL framework with contrastive reweighting in a Mixture-of-Experts architecture to enable gait transitions and multi-terrain adaptation for humanoid locomotion.
Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
cs.CL · 2026-06-03 · unverdicted · none · ref 71 · internal anchor
DIA is a training-free method that dynamically adjusts anchor positions in diffusion LLMs to improve format compliance and accuracy on reasoning benchmarks like GSM8K and MATH.
Self-Optimizing Control of Continuous Processes Based on Reinforcement Learning
eess.SY · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
Reinforcement learning optimizes controlled variable selection for self-optimizing control by embedding the structure in an actor network and using economic rewards, showing better dynamic performance than a steady-state baseline in a CSTR simulation under disturbances.
CoPark: Learning Reactive Parking via Self-Play
cs.RO · 2026-06-02 · unverdicted · none · ref 37 · internal anchor
CoPark uses multi-agent self-play RL with a residual policy and threat-modulated asymmetric prior release to achieve 70-85% success and 3-6% collision rates in reactive parking benchmarks.
QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards
cs.CL · 2026-06-02 · unverdicted · none · ref 10 · internal anchor
QUBRIC co-designs queries and rubrics via teacher key points, contrastive generation, and learnability filtering to support GRPO training, yielding +5.5 on ArenaHard and +6.3 average transfer to legal/moral/narrative benchmarks.
Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning
cs.LG · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
TAO-RL improves agentic RL by filtering degenerate trajectories and reshaping advantages with tool-aware entropy bonuses, yielding better performance on reasoning benchmarks.
PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion
cs.RO · 2026-06-02 · unverdicted · none · ref 32 · internal anchor
PerchRL applies two-stage RL with randomized trajectories, temporal augmentation, and visibility-aware rewards to achieve vision-based perching on irregularly moving inclined platforms.
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
cs.LG · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.
When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
cs.LG · 2026-06-02 · unverdicted · none · ref 5 · internal anchor
An empirical study of RLHF pipelines classifies failure modes such as reward hacking by analyzing directions of change in learned reward and judge scores across training checkpoints and shows they can be localized and partially predicted.
ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control
cs.RO · 2026-06-02 · unverdicted · none · ref 40 · internal anchor
ConTrack introduces a constrained RL method with online dual-variable adaptation and adaptive resets for improved long-horizon hand tracking in simulation and on real robots.
Constitutional On-Policy Safe Distillation
cs.LG · 2026-06-02 · unverdicted · none · ref 33 · internal anchor
COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.
Efficient Hyperparameter Optimization for LLM Reinforcement Learning
cs.LG · 2026-06-02 · unverdicted · none · ref 17 · internal anchor
JF-HPO jointly adapts model size and training budget as fidelity for efficient HPO in LLM RL, reporting up to 14.9x trial speedup and performance gains of 5.8-111.6% over the VeRL recipe.
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
cs.AI · 2026-06-01 · unverdicted · none · ref 47 · internal anchor
SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.
MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching
cs.CV · 2026-06-01 · unverdicted · none · ref 41 · internal anchor
MT-EditFlow applies flow-matching RL with multi-reward aggregation to improve multi-turn image editing performance on models like FLUX.1-Kontext-dev by 6.85 points at turn-3.
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
cs.AI · 2026-06-01 · unverdicted · none · ref 15 · internal anchor
S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
cs.LG · 2026-05-31 · unverdicted · none · ref 44 · internal anchor
POPO uses recency-based prioritized group replay and decoupled off-policy optimization to avoid zero-variance ineffective samples in RLVR, accelerating LLM reasoning finetuning with fewer rollouts.
Trust Region On-Policy Distillation
cs.LG · 2026-05-31 · unverdicted · none · ref 229 · internal anchor
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space
cs.LG · 2026-05-31 · unverdicted · none · ref 47 · internal anchor
COLLIE constructs a semantically coherent skill latent space from unsupervised data to enable training-free guidance with sparse online feedback in guided skill discovery.
Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance
cs.RO · 2026-05-30 · unverdicted · none · ref 15 · internal anchor
A hybrid framework uses MARL value guidance to steer diffusion-generated trajectories for coordinated multi-robot planning, cutting interference from 55.4% to 41.8% in a 4-robot maze simulation.
Certificate-Guided Evaluation of Reinforcement Learning Generalization
cs.AI · 2026-05-30 · unverdicted · none · ref 20 · internal anchor
A logic-driven framework defines inductive reach-avoid tasks and uses neural certificates to certify RL generalization, with empirical results linking fewer violations to more solved test tasks.
MESA: Improving MoE Safety Alignment via Decentralized Expertise
cs.LG · 2026-05-30 · unverdicted · none · ref 14 · internal anchor
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion
cs.RO · 2026-05-30 · unverdicted · none · ref 25 · internal anchor
GLAD decomposes terrain encoding via coarse-to-fine attention on elevation maps to separate broad awareness from precise foothold selection in perceptive humanoid locomotion.
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
cs.CL · 2026-05-30 · unverdicted · none · ref 6 · 2 links · internal anchor
SPADER is an RL method for multi-answer QA that claims better recall and F1 via peer-aligned step-level advantages and diversity rewards on four benchmarks.
Improving Visual Representation Alignment Generation with GRPO
cs.CV · 2026-05-30 · unverdicted · none · ref 26 · internal anchor
VRPO applies generative representation policy optimization to dynamically align diffusion features with pretrained visual encoders, claiming +1.8 FID gains and 2.3x faster training versus REPA.
CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
cs.AI · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.
Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning
cs.LG · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
Linear recurrent filters exactly reproduce HMM belief logits under deterministic transitions and achieve near-zero decoding error under nearly deterministic ones, extending to action-controlled cases.
Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning
cs.LG · 2026-05-29 · unverdicted · none · ref 38 · internal anchor
PVPO is a sample-efficient RL method that improves semantic, geometric, and physical quality in LLM LEGO assembly generation by mitigating the PhysHack failure mode where validity alone fails to ensure fidelity.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
cs.LG · 2026-05-29 · unverdicted · none · ref 100 · internal anchor
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.
Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments
cs.LG · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
Policy gradient methods suffer from zero collapse in discontinuous reward environments such as first-price auctions, where exploration causes policies to enter flat zero-reward regions from which recovery is sample-inefficient due to absent gradient signals.