pith. machine review for the scientific record.

arxiv: 1506.02438 · v6 · submitted 2015-06-08 · 💻 cs.LG · cs.RO · cs.SY

Recognition: 1 theorem link

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Michael Jordan, Philipp Moritz, Pieter Abbeel, Sergey Levine

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 04:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.RO · cs.SY
keywords reinforcement learning · policy gradients · continuous control · advantage estimation · neural networks · locomotion · trust region optimization · value functions

The pith

Generalized advantage estimation reduces variance in policy gradients for high-dimensional continuous control by exponentially weighting temporal difference residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how value functions can cut the variance of policy gradient estimates in reinforcement learning while accepting some bias, through an exponentially-weighted advantage estimator similar to TD(lambda). It pairs this with trust region optimization to keep both policy and value function updates stable as data arrives. This combination lets neural network policies learn complex behaviors, such as running gaits for simulated 3D robots, mapping directly from raw joint positions and velocities to torques. The approach succeeds on bipedal and quadrupedal locomotion and standing-up tasks using only model-free simulation data equivalent to weeks of real time. A reader would care if they want practical ways to apply policy gradients to high-dimensional problems without excessive sample requirements or hand-designed features.
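
A minimal sketch of that estimator, assuming NumPy arrays for a single trajectory with one bootstrap value appended; the function and variable names are illustrative, and the gamma/lam defaults are typical choices rather than the paper's reported settings.

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """Exponentially weighted advantage estimates for one trajectory.

        rewards: array of r_t, length T
        values:  array of V(s_t), length T + 1 (last entry is the bootstrap
                 value for the state after the final step; 0 if terminal)
        """
        rewards = np.asarray(rewards, dtype=float)
        values = np.asarray(values, dtype=float)
        T = len(rewards)
        # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        deltas = rewards + gamma * values[1:] - values[:-1]
        advantages = np.zeros(T)
        running = 0.0
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
        for t in reversed(range(T)):
            running = deltas[t] + gamma * lam * running
            advantages[t] = running
        return advantages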

Core claim

We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques.

What carries the argument

The generalized advantage estimator: an exponentially-weighted sum of temporal difference residuals analogous to TD(lambda), which trades bias for lower variance in advantage estimates used by policy gradients.
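
Written out explicitly (a transcription of the paper's GAE(gamma, lambda) definition into LaTeX), the estimator sums discounted TD residuals:

    \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t),
    \qquad
    \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}^{V}

Setting lambda between 0 and 1 interpolates between a one-step estimate dominated by value-function bias and a Monte Carlo style estimate dominated by variance.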

If this is right

  • Neural network policies map directly from raw kinematics to joint torques without hand-crafted representations.
  • Trust region optimization stabilizes improvement for both policy and value functions despite nonstationary incoming data.
  • Model-free learning succeeds on running gaits for simulated bipeds and quadrupeds plus standing-up tasks.
  • The amount of simulated experience needed corresponds to 1-2 weeks of real time for the biped tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • GAE may apply to other high-variance policy optimization settings such as robotic manipulation or game playing with continuous actions.
  • The bias-variance tradeoff in the estimator could be tuned per-task to optimize sample efficiency beyond the fixed lambda used here.
  • Success in simulation raises the question of whether the same direct mapping from kinematics to torques would transfer to physical robots, though sim-to-real gaps are outside the paper's scope.

Load-bearing premise

A neural network value function approximator can be trained sufficiently accurately to deliver useful advantage estimates without introducing bias that negates the variance reduction.

What would settle it

If the learned policies on the 3D locomotion tasks require sample counts comparable to or higher than high-variance Monte Carlo policy gradients, or fail to produce stable gaits, this would show the variance reduction is not effective in practice.

read the original abstract

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Generalized Advantage Estimation (GAE), an exponentially-weighted estimator of the advantage function (analogous to TD(λ)) derived from standard returns and value functions, to reduce variance in policy gradient estimates at the cost of bias. It combines GAE with trust-region optimization applied to both policy and value function neural networks for stable learning. Empirical results demonstrate success on challenging 3D locomotion tasks, including learning running gaits for bipedal and quadrupedal robots and standing up from a lying position, using model-free policies that map raw kinematics directly to joint torques, with simulated experience equivalent to 1-2 weeks of real time.

Significance. If the results hold, this provides a practical method for high-dimensional continuous control with neural network policies in model-free RL, addressing variance and non-stationarity issues. Strengths include the first-principles derivation of GAE from RL quantities (returns, value functions) independent of the final performance metric, the combination with trust-region constraints, and the demonstration of complex behaviors without hand-crafted representations. The work advances empirical RL for robotics-like tasks.

major comments (1)
  1. [Experiments] Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.
minor comments (2)
  1. [Abstract] The claim that simulated experience corresponds to '1-2 weeks of real time' should be supported by the exact number of timesteps or episodes in the main text or a table for reproducibility.
  2. [Method] The GAE(λ) estimator would benefit from an explicit equation and notation definition in the early sections before the empirical results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The positive assessment of GAE combined with trust-region optimization for high-dimensional continuous control is appreciated. We address the single major comment below.

read point-by-point responses
  1. Referee: Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.

    Authors: We agree that the manuscript does not include direct empirical measurements of bias or variance for the advantage estimates under the learned policies, nor an explicit ablation separating value-function approximation quality from the trust-region mechanism. Computing ground-truth advantages is intractable for these tasks without an optimal value function. Our defense of the central claim rests on the observed outcomes: the algorithm learns stable running gaits and stand-up behaviors on 3D bipeds and quadrupeds from raw kinematics, using only model-free experience equivalent to 1-2 weeks of real time. Such complex, high-dimensional policies would be unlikely to emerge if value-function bias dominated or if variance reduction were ineffective. The trust-region updates on both policy and value networks are presented as a joint mechanism for stability rather than isolated components. We will add a clarifying paragraph in the discussion section noting the reliance on end-to-end empirical success and the practical difficulty of direct bias/variance diagnostics in this setting. revision: partial

Circularity Check

0 steps flagged

GAE derivation is self-contained from standard RL definitions

full rationale

The paper derives the exponentially-weighted advantage estimator directly from the definitions of the advantage function A_t = Q_t - V_t and the TD residual delta_t = r_t + gamma V(s_{t+1}) - V(s_t), yielding the standard GAE(lambda) sum without any reduction to fitted parameters, self-citations, or input data by construction. Trust-region policy optimization is referenced separately and does not enter the estimator derivation. No quoted step equates a claimed prediction or result to its own inputs; the method remains falsifiable via external benchmarks on variance reduction and bias in continuous control.
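
For concreteness, the algebra behind that rationale is a standard telescoping identity (stated here from the definitions quoted above, not copied from the paper): a partial sum of discounted TD residuals collapses to a k-step return minus the value baseline, so no fitted quantity appears on both sides of its own definition.

    \sum_{l=0}^{k-1} \gamma^{l}\, \delta_{t+l}^{V}
    \;=\; -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k})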

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard MDP assumptions and the existence of a sufficiently accurate value function approximator; the only explicit free parameter is the GAE lambda that trades bias for variance.

free parameters (1)
  • lambda
    Exponential weighting factor in the advantage estimator that controls the bias-variance tradeoff; chosen by the practitioner. (Limiting cases are sketched just after this ledger.)
axioms (1)
  • domain assumption: The environment is a Markov decision process with stationary dynamics.
    Required for the definition of returns and advantage functions used in the estimator.
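
As noted in the free-parameter entry above, the two limits of lambda make the tradeoff concrete; both limiting forms appear in the paper. At lambda = 0 the estimator reduces to a single TD residual, low variance but biased whenever V is inexact; at lambda = 1 it recovers the discounted return minus the value baseline, which stays unbiased regardless of the accuracy of V but carries high variance.

    \hat{A}_t^{\mathrm{GAE}(\gamma,0)} = \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t),
    \qquad
    \hat{A}_t^{\mathrm{GAE}(\gamma,1)} = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)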

pith-pipeline@v0.9.0 · 5526 in / 1127 out tokens · 20900 ms · 2026-05-11T04:15:59.942358+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  2. Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 7.0

    CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

  3. Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

    cs.RO 2026-05 unverdicted novelty 7.0

    An adaptive smooth Tchebycheff controller for multi-objective RL lets agents reach non-convex Pareto regions in robotic tasks while avoiding the instability of static non-linear scalarizations.

  4. Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.

  5. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  6. TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency

    quant-ph 2026-05 unverdicted novelty 7.0

    TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

  7. Controllability in preference-conditioned multi-objective reinforcement learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.

  8. Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

    cs.RO 2026-05 unverdicted novelty 7.0

    OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-...

  9. Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

    cs.AI 2026-05 unverdicted novelty 7.0

    CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.

  10. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  11. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  12. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

    cs.LG 2026-05 unverdicted novelty 7.0

    Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...

  13. Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

    cs.NI 2026-05 conditional novelty 7.0

    Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

  14. Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 7.0

    LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

  15. Financial Market as a Self-Organized Ecosystem: Simulation via Learning with Heterogeneous Preferences

    q-fin.CP 2026-04 unverdicted novelty 7.0

    Multi-agent reinforcement learning with heterogeneous preferences leads to emergent role specialization whose interactions produce fat-tailed returns and volatility clustering, offering a computational realization of ...

  16. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  17. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 7.0

    EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

  18. Bounded Ratio Reinforcement Learning

    cs.LG 2026-04 conditional novelty 7.0

    BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

  19. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  20. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  21. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  22. Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

    cs.MA 2026-04 unverdicted novelty 7.0

    PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...

  23. Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies

    cs.NI 2026-04 unverdicted novelty 7.0

    CayleyTopo uses reinforcement learning to optimize Cayley graph generators for lower diameter, yielding faster and more resilient information flow in multi-agent systems than hand-crafted sparse topologies.

  24. Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters

    cs.AI 2026-04 unverdicted novelty 7.0

    A hybrid neural policy operating in impulse space enables physics-based characters to track exaggerated, dynamically infeasible motions that standard DRL methods cannot stabilize.

  25. A semicontinuous relaxation of Saito's criterion and freeness as angular minimization

    math.AG 2026-04 conditional novelty 7.0

    A new functional S vanishes precisely on free line arrangements and enables discovery of verified free examples for every admissible exponent pair with up to 20 lines.

  26. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  27. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG 2019-12 accept novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  28. Concrete Problems in AI Safety

    cs.AI 2016-06 accept novelty 7.0

    The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

  29. What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

    cs.RO 2026-05 conditional novelty 6.0

    PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...

  30. Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion

    cs.RO 2026-05 unverdicted novelty 6.0

    Explicit conditioning of a PPO policy on interpretable stair parameters (height, depth, yaw) yields improved generalization to unseen stairs and reliable real-world traversal on the Unitree G1, including 33 consecutiv...

  31. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  32. OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

    cs.AI 2026-05 unverdicted novelty 6.0

    OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks p...

  33. Actor-Critic Algorithm for Dynamic Expectile and CVaR

    cs.LG 2026-05 unverdicted novelty 6.0

    A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical out...

  34. Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

    cs.LG 2026-05 conditional novelty 6.0

    Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.

  35. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  36. Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    An amortized reinforcement learning method enables immediate, observation-driven sequential optimization of genetic circuits while accounting for both intrinsic stochasticity and cross-laboratory variability without r...

  37. Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

    cs.LG 2026-05 unverdicted novelty 6.0

    S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.

  38. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  39. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  40. OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination

    cs.LG 2026-05 unverdicted novelty 6.0

    OpenG2G is a new extensible simulation platform that lets users implement and compare classic, optimization, and learning-based controllers for AI datacenter power flexibility coordinated with the grid.

  41. Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Hidden states in recurrent RL policies correspond to PMP co-states, so a derived co-state loss structures the dynamics and yields robust performance on partially observable continuous control tasks.

  42. SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.

  43. Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

    cs.MM 2026-05 unverdicted novelty 6.0

    SeqLight maps music to multi-light HSV control via SkipBART for global color prediction followed by hybrid imitation learning in a goal-conditioned MDP to decompose colors across lights.

  44. ANO: A Principled Approach to Robust Policy Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...

  45. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  46. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  47. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  48. Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.

  49. Reinforcement Learning for Public Safety Power Shutoffs Under Decision-Dependent Uncertainty and Nonlinear Wildfire Ignition Models

    math.OC 2026-04 unverdicted novelty 6.0

    Reinforcement learning learns optimal PSPS topology adjustments via simulation of any nonlinear line failure model, reducing costs versus MIP baselines on 54-bus and 138-bus systems.

  50. Sample-efficient Neuro-symbolic Proximal Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    H-PPO-Product and H-PPO-SymLoss achieve faster learning and higher final returns than standard PPO and Reward Machine baselines on OfficeWorld, WaterWorld, and DoorKey by transferring imperfect logical policy specific...

  51. Compute Aligned Training: Optimizing for Test Time Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    Compute Aligned Training derives new loss functions by modeling test-time strategies as operators on the base policy, yielding empirical gains in test-time compute scaling over standard SFT and RL.

  52. K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.

  53. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  54. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

  55. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  56. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  57. Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

    cs.LG 2026-04 conditional novelty 6.0

    CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...

  58. Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms

    cs.RO 2026-04 unverdicted novelty 6.0

    A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.

  59. Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    IPVRM learns prefix values to produce reliable step rewards from sequence outcomes using TD learning, enabling distribution-level RL that improves reasoning when paired with calibrated rewards.

  60. Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models

    cs.LG 2026-04 unverdicted novelty 6.0

    MI-VAE generates physics-constrained synthetic trajectories from scarce real data to improve offline RL policy performance on planetary lander tasks over standard VAEs.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 70 Pith papers · 1 internal anchor

  1. [1]

    Neuronlike adaptive elements that can solve difficult learning control problems

    Barto, Andrew G, Sutton, Richard S, and Anderson, Charles W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, (5): 834--846, 1983

  2. [2]

    Reinforcement learning in POMDPs via direct gradient ascent

    Baxter, Jonathan and Bartlett, Peter L. Reinforcement learning in POMDPs via direct gradient ascent. In ICML, pp. 41--48, 2000

  3. [3]

    Dynamic programming and optimal control, volume 2

    Bertsekas, Dimitri P. Dynamic programming and optimal control, volume 2. Athena Scientific, 2012

  4. [4]

    Convergent temporal-difference learning with arbitrary smooth function approximation

    Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S, Maei, Hamid R, and Szepesvári, Csaba. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pp. 1204--1212, 2009

  5. [5]

    Variance reduction techniques for gradient estimates in reinforcement learning

    Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5: 1471--1530, 2004

  6. [6]

    Reinforcement learning in feedback control

    Hafner, Roland and Riedmiller, Martin. Reinforcement learning in feedback control. Machine Learning, 84(1-2): 137--169, 2011

  7. [7]

    Learning continuous control policies by stochastic value gradients

    Heess, Nicolas, Wayne, Greg, Silver, David, Lillicrap, Timothy, Tassa, Yuval, and Erez, Tom. Learning continuous control policies by stochastic value gradients. arXiv preprint arXiv:1510.09142, 2015

  8. [8]

    Principles of behavior

    Hull, Clark. Principles of behavior. 1943

  9. [9]

    A natural policy gradient

    Kakade, Sham. A natural policy gradient. In NIPS, volume 14, pp. 1531--1538, 2001a

  10. [10]

    Optimizing average reward using discounted rewards

    Kakade, Sham. Optimizing average reward using discounted rewards. In Computational Learning Theory, pp. 605--615. Springer, 2001b

  11. [11]

    An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function

    Kimura, Hajime and Kobayashi, Shigenobu. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In ICML, pp. 278--286, 1998

  12. [12]

    On actor-critic algorithms

    Konda, Vijay R and Tsitsiklis, John N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4): 1143--1166, 2003

  13. [13]

    Continuous control with deep reinforcement learning

    Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  14. [14]

    Approximate gradient methods in policy-space optimization of markov reward processes

    Marbach, Peter and Tsitsiklis, John N. Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems, 13(1-2): 111--148, 2003

  15. [15]

    Steps toward artificial intelligence

    Minsky, Marvin. Steps toward artificial intelligence. Proceedings of the IRE, 49(1): 8--30, 1961

  16. [16]

    Policy invariance under reward transformations: Theory and application to reward shaping

    Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278--287, 1999

  17. [17]

    Natural actor-critic

    Peters, Jan and Schaal, Stefan. Natural actor-critic. Neurocomputing, 71(7): 1180--1190, 2008

  18. [18]

    Trust region policy optimization

    Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015

  19. [19]

    Introduction to reinforcement learning

    Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. MIT Press, 1998

  20. [20]

    Policy gradient methods for reinforcement learning with function approximation

    Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057--1063. Citeseer, 1999

  21. [21]

    Bias in natural actor-critic algorithms

    Thomas, Philip. Bias in natural actor-critic algorithms. In Proceedings of The 31st International Conference on Machine Learning, pp. 441--448, 2014

  22. [22]

    Mujoco: A physics engine for model-based control

    Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026--5033. IEEE, 2012

  23. [23]

    Real-time reinforcement learning by sequential actor-critics and experience replay

    Wawrzyński, Paweł. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10): 1484--1497, 2009

  24. [24]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229--256, 1992

  25. [25]

    Numerical optimization

    Wright, Stephen J and Nocedal, Jorge. Numerical optimization. Springer New York, 1999