Machine Learning

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Opponent-aware peer-learning corrections in finite-unroll Meta-MAPG increase entry probability into target stable-Nash basins relative to standard policy gradient, with annealing to recover local convergence.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

Learning to Cut: Reinforcement Learning for Benders Decomposition

math.OC · 2026-05-07 · unverdicted · novelty 6.0

RLBD trains a neural policy with REINFORCE to select cuts adaptively in Benders decomposition, yielding faster convergence and better generalization than standard BD or SVM-based LearnBD on an EV charging problem.

citing papers explorer

Showing 3 of 3 citing papers.

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

cs.LG · 2026-05-18 · unverdicted · none · ref 16

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · none · ref 33 · 2 links

Learning to Cut: Reinforcement Learning for Benders Decomposition

math.OC · 2026-05-07 · unverdicted · none · ref 35
