mega hub Mixed citations

Proximal Policy Optimization Algorithms

Alec Radford, Filip Wolski, John Schulman, Oleg Klimov, Prafulla Dhariwal · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. Most common role is background (52%).

1631 Pith papers citing it

Background 52% of classified citations

open full Pith review browse 1631 citing papers more from Alec Radford arXiv PDF

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 155 method 113 baseline 15 dataset 4

citation-polarity summary

background 150 use method 109 baseline 15 unclear 7 use dataset 4 support 2

claims ledger

abstract We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more ge

authors

Alec Radford Filip Wolski John Schulman Oleg Klimov Prafulla Dhariwal

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

cs.RO · 2024-03-14 · accept · novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

citing papers explorer

Showing 50 of 1631 citing papers.

Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling cs.LG · 2026-04-10 · unverdicted · none · ref 28 · internal anchor
TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning cs.CL · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
STACK reduces average reasoning response length by 59.9% and raises accuracy by 4.8 points over prior methods on three math benchmarks via state-aware compression, knowledge guidance, and early stopping.
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training cs.DC · 2026-04-10 · unverdicted · none · ref 29 · internal anchor
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination cs.MA · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
C2T learns an LLM-derived common-sense reward function to improve cooperative multi-intersection traffic control policies, outperforming standard MARL baselines on efficiency, safety, and energy proxies while allowing prompt-based policy tuning.
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation cs.RO · 2026-04-10 · unverdicted · none · ref 32 · internal anchor
HTNav combines imitation and reinforcement learning in a staged, tiered structure with map learning to reach state-of-the-art performance on the CityNav benchmark for urban aerial navigation.
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks cs.AI · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
SPPO enables stable, sample-efficient alignment of LLMs on long-horizon reasoning tasks by using a decoupled scalar value function for low-variance advantages without multi-sampling.
Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning cs.RO · 2026-04-09 · unverdicted · none · ref 58 · internal anchor
Morphology-conditioned quadrupedal world model enables zero-shot generalization to new robot embodiments for locomotion tasks.
Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation cs.RO · 2026-04-09 · unverdicted · none · ref 38 · internal anchor
Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection cs.AI · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
ProMedical builds a 50k preference dataset with fine-grained rubrics and a multi-dimensional reward model that disentangles safety from proficiency, yielding 22.3% accuracy and 21.7% safety gains on Qwen3-8B via GRPO while generalizing to UltraMedical.
EvoGymCM: Harnessing Continuous Material Stiffness for Soft Robot Co-Design cs.RO · 2026-04-09 · unverdicted · none · ref 23 · internal anchor
EvoGymCM introduces continuous material stiffness as a first-class variable in soft robot co-design, with reactive and invariant settings that improve task performance over discrete baselines.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation cs.AI · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC cs.LG · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV · 2026-04-09 · unverdicted · none · ref 24 · internal anchor
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation cs.CL · 2026-04-09 · unverdicted · none · ref 55 · internal anchor
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
MemReader: From Passive to Active Extraction for Long-Term Agent Memory cs.CL · 2026-04-09 · unverdicted · none · ref 18 · internal anchor
MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning cs.IR · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
LPM 1.0: Video-based Character Performance Model cs.CV · 2026-04-09 · unverdicted · none · ref 64 · internal anchor
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model cs.CL · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
A bridge model rewrites vague instructions into specific ones, raising tool retrieval NDCG scores substantially on a new vague-instruction benchmark.
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection cs.RO · 2026-04-08 · unverdicted · none · ref 41 · internal anchor
CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization cs.CL · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling cs.LG · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 43 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs cs.CE · 2026-04-07 · unverdicted · none · ref 21 · internal anchor
Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.
Target Policy Optimization cs.LG · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing cs.GT · 2026-04-07 · unverdicted · none · ref 35 · internal anchor
JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.
Precise Aggressive Aerial Maneuvers with Sensorimotor Policies cs.RO · 2026-04-07 · unverdicted · none · ref 44 · internal anchor
Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents cs.AI · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming cs.RO · 2026-04-07 · unverdicted · none · ref 31 · internal anchor
DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.
GaussFly: Contrastive Reinforcement Learning for Visuomotor Policies in 3D Gaussian Fields cs.RO · 2026-04-06 · unverdicted · none · ref 36 · internal anchor
GaussFly decouples representation learning from policy optimization via 3D Gaussian Splatting reconstruction and contrastive features to achieve superior sample efficiency and zero-shot sim-to-real transfer for AAV visuomotor policies.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control cs.LG · 2026-04-06 · unverdicted · none · ref 72 · 2 links · internal anchor
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
One Model for All: Multi-Objective Controllable Language Models cs.LG · 2026-04-06 · unverdicted · none · ref 23 · internal anchor
Multi-Objective Control trains a single LLM as a preference-conditioned policy using multi-objective optimization in RLHF to produce outputs in user-specified regions of the Pareto front.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 33 · internal anchor
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Regime-Calibrated Fleet Repositioning with a Spatial Queue-Regret Decomposition cs.LG · 2026-04-04 · unverdicted · none · ref 18 · internal anchor
A leakage-safe similarity gate and spatial queue-regret decomposition reduce mean passenger wait times to 82.3 seconds in New York City ride-hailing simulations by linking demand-field error directly to wait via queueing and allocator sensitivities.
Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
CSRS improves MLLM self-evolution stability by using retracing mechanisms and softened continuous rewards instead of majority voting, reaching SOTA on geometric reasoning benchmarks like MathVision.
InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 31 · internal anchor
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control cs.RO · 2026-04-03 · unverdicted · none · ref 38 · internal anchor
A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.
Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots cs.RO · 2026-04-03 · unverdicted · none · ref 31 · internal anchor
DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb success versus 10% for baselines.
Vision-Based End-to-End Learning for UAV Traversal of Irregular Gaps via Differentiable Simulation cs.RO · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
An end-to-end vision-based framework enables UAVs to traverse complex irregular gaps in unseen environments by mapping depth images to SE(3) control commands using differentiable simulation.
Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments cs.LG · 2026-04-02 · unverdicted · none · ref 41 · internal anchor
AMC models memory consolidation via a Liquid-Glass-Crystal process governed by an SDE with proven convergence to a Beta distribution, yielding 34-43% better forward transfer and 67-80% less forgetting on standard continual RL benchmarks.
Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models cs.LG · 2026-04-02 · unverdicted · none · ref 37 · internal anchor
MI-VAE generates physics-constrained synthetic trajectories from scarce real data to improve offline RL policy performance on planetary lander tasks over standard VAEs.
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering cs.CV · 2026-04-02 · unverdicted · none · ref 23 · internal anchor
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning cs.SE · 2026-04-02 · unverdicted · none · ref 48 · internal anchor
By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.
Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation cs.SD · 2026-04-02 · unverdicted · none · ref 21 · internal anchor
RAVN improves audio-visual navigation by learning audio-derived reliability cues via an Acoustic Geometry Reasoner and using them to modulate visual features through Reliability-Aware Geometric Modulation.
Heterogeneous Self-Play for Realistic Highway Traffic Simulation cs.AI · 2026-03-31 · accept · none · ref 18 · internal anchor
PHASE uses heterogeneous self-play and context-conditioned policies to achieve realistic, zero-shot highway traffic simulation that outperforms traditional rule-based and self-play models on real-world datasets.
MemFactory: Unified Inference & Training Framework for Agent Memory cs.CL · 2026-03-31 · unverdicted · none · ref 10 · internal anchor
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
Learning Unified Control of Intrinsic Nonlinear Spin Dynamics in Atomic Qudits for Magnetometry quant-ph · 2026-03-30 · unverdicted · none · ref 60 · internal anchor
Reinforcement learning stabilizes more than 4 dB of fixed-axis spin squeezing under continuous nonlinear Zeeman evolution in the f=21/2 manifold of 161Dy, yielding a single-atom sensitivity of 13.9 pT/sqrt(Hz) that is 3 dB beyond the standard quantum limit.
Competitor-aware Race Management for Electric Endurance Racing eess.SY · 2026-03-30 · unverdicted · none · ref 30 · internal anchor
A bi-level game-theoretic optimal control plus reinforcement learning framework enables competitor-aware energy management and pit-stop scheduling that exploits aerodynamic drafting in simulated electric endurance races.
AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning cs.AI · 2026-03-22 · conditional · none · ref 9 · internal anchor
AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.
Delightful Distributed Policy Gradient cs.LG · 2026-03-20 · unverdicted · none · ref 23 · internal anchor
Delightful Policy Gradient gates updates with advantage times surprisal to suppress rare failures while preserving rare successes in distributed RL with stale or buggy data.
PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 36 · internal anchor
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.