arXiv preprint arXiv:1711.00123

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, David Duvenaud · 2017 · cs.LG · arXiv 1711.00123

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.

representative citing papers

PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

PS-PPO samples prefixes of trajectories in critic-free RLHF and uses importance-weighted updates to reduce compute and memory while claiming to preserve the full-trajectory objective.

Learning to Theorize the World from Observation

cs.LG · 2026-05-05

citing papers explorer

PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF · cs.LG · 2026-06-29 · unverdicted · none · ref 73 · internal anchor
Learning to Theorize the World from Observation · cs.LG · 2026-05-05 · unreviewed · ref 162
