Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. Most common role is background (52%).

1991 Pith papers citing it

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
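
The abstract does not spell out the "surrogate" objective. As a rough illustration, the sketch below computes the clipped variant described in the paper, which is the form most often meant by "PPO". The function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, not taken from this page, and the sketch covers only the objective value for one minibatch, not a full training loop.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over one minibatch (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same samples.
    All names and the default clip_eps are assumptions for illustration.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound, so
        # moving the ratio far from 1 earns no extra credit.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# With identical policies the ratio is 1 and the objective is the mean advantage.
same = clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, -1.0])
# A ratio of 2 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
```

Because the objective stops rewarding large ratio changes, the same batch of samples can be reused for several epochs of minibatch updates, which is the property the abstract highlights.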

citation-role summary

background: 156 · method: 114 · baseline: 15 · dataset: 4

citation-polarity summary

background: 151 · use method: 110 · baseline: 15 · unclear: 7 · use dataset: 4 · support: 2

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0 · 2 refs

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

IRumAI: Reinforcement Learning for Indian Rummy

cs.AI · 2026-06-20 · unverdicted · novelty 8.0

IRumAI is the first RL agent for Indian Rummy, trained on weak heuristics to beat strong search opponents at 7000x speed.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

cs.AI · 2026-06-17 · conditional · novelty 8.0

DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.

Efficient AI-Inspired Reduction of Feynman Integrals via Tube Seeding

hep-ph · 2026-06-09 · unverdicted · novelty 8.0

Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

citing papers explorer

Showing 50 of 1991 citing papers.

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences cs.LG · 2026-05-29 · unverdicted · none · ref 28 · internal anchor
FedVPA-GP applies variational preference learning in a federated setting with a mixture prior and orthogonal loss to disentangle user preferences on the HH-RLHF dataset.
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning cs.LG · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
DARTS accelerates LLM RL training up to 1.77x by distribution-aware trajectory sampling and adaptive redundancy allocation that shapes rollouts toward conciseness without performance loss.
SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World cs.RO · 2026-05-29 · unverdicted · none · ref 39 · internal anchor
SSR is an end-to-end vision-based framework for humanoid traversal that learns imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to enable safe locomotion on diverse real-world terrains.
Representation Collapse in Sequential Post-Training of Large Language Models cs.LG · 2026-05-28 · unverdicted · none · ref 33 · internal anchor
Sequential post-training of LLMs induces representation collapse that correlates with reduced plasticity, weaker generalization, and poorer calibration, with lightweight interventions tested to mitigate it.
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning cs.LG · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
Introduces SVEB benchmark and Numca/Hista methods claiming more accurate state value estimates and better RL training performance for LLMs.
Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models cs.AI · 2026-05-28 · unverdicted · none · ref 28 · internal anchor
An iterative writer-editor multi-agent LLM process improves perceived story quality in simulations of child collaborative storytelling.
Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory cs.IR · 2026-05-28 · unverdicted · none · ref 53 · 2 links · internal anchor
PPRO improves user-aware memory retrieval in conversational agents by using derived user profiles for ranking and training a query rewriter via Group Relative Policy Optimization, with reported gains on LoCoMo and LongMemEval-S benchmarks.
Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design cs.CL · 2026-05-28 · unverdicted · none · ref 10 · internal anchor
SkillPCF is a closed-loop agent framework with a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded evolution that improves design quality and efficiency for photonic crystal fiber inverse design under limited simulation budgets.
SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints cs.RO · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
SPRINT generates sprint trajectories for humanoids via spectral priors from five human motion sequences, achieving 6 m/s peak velocity with zero-shot sim-to-real transfer on Unitree G1.
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs cs.AI · 2026-05-27 · unverdicted · none · ref 35 · internal anchor
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback cs.AI · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
COSE uses LLM intrinsic confidence to weight PPO updates and prioritize replay, yielding better average performance than base models on reasoning and math benchmarks across multiple small backbones.
ABot-OCR Technical Report cs.CV · 2026-05-27 · unverdicted · none · ref 40 · internal anchor
ABot-OCR is a new end-to-end VLM for direct image-to-Markdown transcription using a custom data engine and structure-constrained RL optimization, reporting SOTA scores of 92.81/93.30 on OmniDocBench v1.5/v1.6.
SANTS: A State-Adaptive Scheduler for World Action Models cs.RO · 2026-05-27 · unverdicted · none · ref 41 · internal anchor
SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization cs.CL · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment cs.LG · 2026-05-26 · unverdicted · none · ref 21 · internal anchor
Introduces kernel contracts framework with derived bounds on divergence from logit drift to reward drift, specialized for RL post-training under support and norm assumptions.
Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation cs.LG · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
Proposes Bayesian posterior inference on probabilistic landing capability to enable sequential approve/reject/continue decisions for RL landing controllers under finite validation evidence.
Quantifying Uncertainty in Space Debris Capture with Active Tether-Net Systems Caused by Noisy Observations eess.SY · 2026-05-26 · unverdicted · none · ref 33 · internal anchor
Presents a UQ pipeline applying Sobol sensitivity analysis and perturbation methods to quantify noisy-observation effects on Capture Quality Index for fixed-control and neuro-controlled active tether-net systems, using high- and low-fidelity simulators.
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments cs.AI · 2026-05-26 · unverdicted · none · ref 31 · internal anchor
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
LitSeg: Narrative-Aware Document Segmentation for Literary RAG cs.CL · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
LitSeg segments literary texts using narrative analysis via multi-stage prompting and offers a distilled lightweight version for efficient use in RAG systems.
Ratio-Variance Regularized Policy Optimization cs.LG · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
R²VPO uses ratio-variance regularization as a distributional soft brake on policy updates, claiming better performance than PPO on math reasoning and robotic control without hard clipping.
KARMA: Karma-Aligned Reward Model Adaptation cs.CL · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
KARMA adapts reward models from Reddit karma data to align LLMs with conversational pragmatics, finding that context-only rewards outperform karma-predictive ones downstream while reducing factuality across conditions.
Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents cs.AI · 2026-05-26 · unverdicted · none · ref 28 · internal anchor
A GRPO-based RL framework with probabilistic risk minimization, disagreement-aware synergy rewards, and entropy-guided sampling enables instance-level tool selection that closes the single-oracle risk gap on medical benchmarks.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization cs.LG · 2026-05-25 · unverdicted · none · ref 66 · internal anchor
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation cs.CR · 2026-05-25 · unverdicted · none · ref 16 · internal anchor
The paper releases two adversarial malware datasets (44k family-labelled, 33k type-labelled) with high evasion rates and demonstrates that 0.5% poisoning injection raises evasion from 26.1% to 92.8%.
ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour Locomotion cs.RO · 2026-05-25 · unverdicted · none · ref 40 · internal anchor
ParkourFormer achieves 93.85% average success on multi-terrain humanoid parkour by fusing Transformer sequence modeling with supervised future-state prediction.
Reinforcement Learning from Denoising Feedback cs.CL · 2026-05-25 · unverdicted · none · ref 19 · internal anchor
RLDF is a new RL paradigm for diffusion language models that optimizes toward clipped clean states with weighted timestep sampling and reports substantial gains on reasoning benchmarks for LLaDA and Dream.
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning cs.CL · 2026-05-25 · unverdicted · none · ref 20 · internal anchor
DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training cs.LG · 2026-05-25 · unverdicted · none · ref 19 · internal anchor
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion cs.RO · 2026-05-24 · unverdicted · none · ref 8 · internal anchor
Targeted changes to policy initialization, critic targets, and return estimation let SAC match PPO performance across legged locomotion tasks in massively parallel simulation.
Integrated Sensing, Communication, and Computing for NR-V2X: A Cross-Layer Resource Allocation Framework Using Multi-Agent Reinforcement Learning cs.IT · 2026-05-24 · unverdicted · none · ref 35 · internal anchor
MAPPO-SPS applies multi-agent proximal policy optimization to a cooperative partially observable Markov game formulation of ISCC-aware SB-SPS scheduling in NR-V2X, yielding balanced simulation tradeoffs across CRLB sensing accuracy, PRR, throughput, energy, and delay.
DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting cs.CL · 2026-05-24 · unverdicted · none · ref 46 · internal anchor
DTO is a new differentiable objective combining fidelity to reference rewrites and semantic consistency that outperforms MLE and preference baselines while matching LLMs on TimeTravel and ART datasets.
Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems cs.AI · 2026-05-23 · unverdicted · none · ref 30 · internal anchor
MRC computes coalition Shapley credits from performance histories to weight three LLM agents, stabilized by Bayesian mixture and regime multipliers, achieving SR 1.51 and 440.1% cumulative return over 1037 days on 13 crypto assets.
Vision-Guided Outdoor Flight and Obstacle Evasion via Reinforcement Learning cs.RO · 2026-05-23 · unverdicted · none · ref 16 · internal anchor
A sensorimotor policy with a pre-trained autoencoder perception head and LSTM controller, trained in two stages via privileged learning and curriculum reinforcement learning with domain randomization, achieves zero-shot transfer for outdoor obstacle evasion on unseen environments and platforms.
SafeSABR: Risk-Calibrated Adaptive Bitrate Streaming over Starlink Networks eess.SY · 2026-05-22 · unverdicted · none · ref 38 · 2 links · internal anchor
SafeSABR cuts severe-stall sessions in Starlink video streaming from 22.8% to 7.2% and worst-5% rebuffering from 54.30 s to 22.68 s at 1.8% QoE cost via behavior-cloning pretraining, risk-calibrated RL, and safe-capacity auditing.
StepAudio 2.5 Technical Report eess.AS · 2026-05-22 · unverdicted · none · ref 40 · internal anchor
StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization cs.PL · 2026-05-22 · unverdicted · none · ref 44 · internal anchor
MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.
TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization cs.IR · 2026-05-22 · unverdicted · none · ref 26 · internal anchor
TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation cs.LG · 2026-05-21 · unverdicted · none · ref 24 · internal anchor
A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals cs.LG · 2026-05-21 · unverdicted · none · ref 17 · internal anchor
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems cs.AI · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
Meta-learning framework adapting iMAML for rapid controller tuning on uncertain nonlinear systems via offline source data and limited online target adaptation, shown with neural state-space and DQN variants.
ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps cs.MA · 2026-05-21 · unverdicted · none · ref 100 · internal anchor
ACCoRD trains an ANN with PPO-Clip reinforcement learning to select conflict resolution actions in O-RAN, reducing negative network events versus rule-based methods in medium and high traffic simulations.
Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
RGoT uses RL to adaptively generate task-specific graphs of operations for GoT-style LLM prompting from a human-provided set, with results suggesting feasibility under constraints.
One-Way Policy Optimization for Self-Evolving LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.
OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization cs.CV · 2026-05-21 · unverdicted · none · ref 27 · internal anchor
OPERA jointly optimizes restoration planning via RL over tool compositions and execution via agent-guided co-training of tools, claiming consistent gains over all-in-one models and prior agent methods on multi-degradation benchmarks.
ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 16 · internal anchor
ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation cs.LG · 2026-05-20 · unverdicted · none · ref 55 · internal anchor
The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control cs.RO · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
A closed-loop sim-to-real RL policy trained in a simplified frictionless simulator achieves sub-millimeter microfiber shape control on physical hardware via visual feedback without retraining.
torchtune: PyTorch native post-training library cs.LG · 2026-05-20 · unverdicted · none · ref 71 · internal anchor
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent cs.LG · 2026-05-20 · unverdicted · none · ref 48 · 2 links · internal anchor
Stochastic MeanFlow Policies enable one-step generative control in off-policy mirror descent by mapping noise through a MeanFlow transform, yielding tractable entropy and improved MuJoCo performance over Gaussian and generative baselines.
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.