Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256

Ronald J Williams · 1992

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

cs.CV · 2025-12-14 · unverdicted · novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

MDPO improves differentiable planning by injecting gradient-sensitivity-adapted noise into the action space, outperforming both deterministic variants and PPO on nonlinear and hybrid benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space · cs.CV · 2025-12-14 · unverdicted · none · ref 49
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient · cs.LG · 2026-04-28 · unverdicted · none · ref 90
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling · cs.LG · 2026-04-08 · unverdicted · none · ref 16
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration · cs.AI · 2026-05-08 · unverdicted · none · ref 18
MDPO improves differentiable planning by injecting gradient-sensitivity-adapted noise into the action space, outperforming both deterministic variants and PPO on nonlinear and hybrid benchmarks.
