Policy Gradient with Kernel Quadrature

Satoshi Hayakawa; Tetsuro Morimura

arxiv: 2310.14768 · v2 · submitted 2023-10-23 · 💻 cs.LG · cs.AI

Policy Gradient with Kernel Quadrature

Satoshi Hayakawa , Tetsuro Morimura This is my paper

Pith reviewed 2026-05-24 06:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords policy gradientkernel quadraturegaussian processreinforcement learningepisode selectionreward evaluationMuJoCo

0 comments

The pith

A Gaussian process on discounted returns induces a kernel on episodes that lets kernel quadrature select a small representative subset for policy gradient updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to ease the bottleneck of reward evaluation in policy gradient reinforcement learning by computing rewards only on a compressed subset of episodes rather than the full batch. It places a Gaussian process prior on discounted returns to define a positive definite kernel over entire episodes, then applies an episodic kernel quadrature rule to identify and weight a small number of representative episodes. These reduced episodes are fed to the policy network for gradient computation. The approach comes with theoretical justification and is illustrated on standard MuJoCo continuous-control tasks. A reader would care because full-batch reward evaluation often dominates wall-clock time when episodes are long or simulators are expensive.

Core claim

Placing a Gaussian process on the discounted returns of episodes produces a positive definite kernel on the space of episodes. Kernel quadrature with respect to this kernel then yields a small set of weighted episodes whose aggregate contribution to the policy gradient closely matches that of the full sample batch, allowing gradient updates to proceed on the reduced set alone.

What carries the argument

Episodic kernel quadrature, obtained by first modeling discounted returns with a Gaussian process and then using the induced kernel to compress batches of episodes into a representative subset for gradient steps.

If this is right

Reward computation is needed only for the quadrature-selected episodes rather than every sampled trajectory.
Policy updates remain valid because the weighted subset approximates the full-batch gradient contribution.
The procedure applies to any policy-gradient algorithm that relies on Monte-Carlo estimates of returns.
Numerical results on MuJoCo tasks show that the method can be implemented and produces usable policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same GP-induced kernel could be reused across multiple gradient steps if the return distribution changes slowly.
If the kernel is cheap to evaluate, the compression step itself might be performed on even larger batches than current practice allows.
The approach opens a route to combining kernel quadrature with other variance-reduction techniques such as baselines or control variates.

Load-bearing premise

A Gaussian process placed on discounted returns induces a kernel on episodes whose quadrature rule yields a representative subset whose policy-gradient contribution is close to the full-batch contribution.

What would settle it

On a MuJoCo task, compute the policy gradient both on the full episode batch and on the quadrature-reduced subset; if the two gradients differ by more than a small relative error and produce visibly different learning curves, the compression step fails to preserve the necessary information.

Figures

Figures reproduced from arXiv: 2310.14768 by Satoshi Hayakawa, Tetsuro Morimura.

**Figure 2.** Figure 2: Learning curves in Mujoco Tasks: reward vs step [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗

read the original abstract

Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an ``episodic" kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper uses a GP on returns to induce an episode kernel and applies kernel quadrature to compress batches before reward evaluation in policy gradients, but the RKHS error bound does not automatically cover the score-weighted integrand.

read the letter

The main contribution is a concrete procedure that models discounted returns with a Gaussian process, derives a positive definite kernel over trajectories, runs episodic kernel quadrature to pick a small weighted subset, and feeds only those episodes into the policy gradient step. This targets settings where reward computation is expensive. The abstract and the described numerical work on MuJoCo tasks show the method can be implemented inside standard RL loops without changing the policy network itself. That framing of compression via return kernels appears new relative to the cited prior work. The theoretical background section is mentioned, which is a plus for a methods paper. The experiments give at least a basic sanity check on continuous-control benchmarks. The central soft spot is the one raised in the stress test. Kernel quadrature gives worst-case error control only for functions inside the RKHS induced by the return kernel. The actual quantity needed for the gradient is the score function times the return, and nothing in the abstract shows that this product remains in the same RKHS or that the quadrature weights are adjusted for it. If the paper does not supply an additional argument or a modified error analysis for the policy-gradient estimator, the claimed reduction in variance or bias for the update rests on an unverified step. Minor additional issues are the usual ones for quadrature-based methods: how sensitive the compression is to the choice of kernel hyperparameters and whether the reduced set still covers rare but high-reward trajectories. The work is aimed at researchers who already run large-batch policy gradients and want to cut the number of expensive reward calls. It is coherent on its own terms and shows clear thinking about the integration problem, so it is worth sending to referees who can check the derivations and the empirical controls. I would bring it to a reading group for discussion of the RKHS transfer question.

Referee Report

1 major / 1 minor

Summary. The paper claims that modeling discounted returns via a Gaussian process induces a positive-definite kernel on episode space; an episodic kernel quadrature rule then selects a small representative subset of episodes whose rewards are evaluated, after which the reduced set is passed directly to the policy network for gradient updates. Theoretical background for the construction is provided together with numerical illustrations on MuJoCo tasks.

Significance. If the quadrature error bound transfers to the policy-gradient integrand, the approach would reduce the number of expensive reward evaluations while preserving gradient accuracy, offering a practical efficiency gain for batch policy-gradient methods.

major comments (1)

[theoretical background / derivation of episodic kernel quadrature] The kernel quadrature is constructed so that its worst-case error is controlled for every integrand in the RKHS H_k induced by the episode kernel. The policy-gradient update, however, requires approximating ∫ ∇_θ log π_θ(τ) R(τ) dμ(τ). No argument is given that multiplication by the score function maps H_k into itself, so the quadrature error bound does not automatically apply to the quantity actually being estimated. This transfer is load-bearing for the central claim that the compressed episodes produce representative gradient updates.

minor comments (1)

[abstract and method description] Clarify whether the GP is placed on the return function R(τ) or on the reward sequence, and how the induced kernel is explicitly defined on the space of trajectories.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's thorough review and the identification of a key theoretical consideration regarding the applicability of the kernel quadrature error bounds to the policy gradient estimation. Our point-by-point response follows.

read point-by-point responses

Referee: The kernel quadrature is constructed so that its worst-case error is controlled for every integrand in the RKHS H_k induced by the episode kernel. The policy-gradient update, however, requires approximating ∫ ∇_θ log π_θ(τ) R(τ) dμ(τ). No argument is given that multiplication by the score function maps H_k into itself, so the quadrature error bound does not automatically apply to the quantity actually being estimated. This transfer is load-bearing for the central claim that the compressed episodes produce representative gradient updates.

Authors: We agree that the manuscript provides no explicit argument showing that multiplication by the score function ∇_θ log π_θ(τ) maps H_k into itself, and therefore the worst-case error bound for integrands in the RKHS does not automatically transfer to the policy-gradient integrand. The episodic kernel is induced by a GP model on the return function R(τ), so the quadrature guarantees controlled error only for functions well-represented in that RKHS. The method selects a compressed subset of episodes that are representative with respect to return variability and then forms an empirical gradient estimate from the reduced set. We will revise the theoretical background section to clarify this distinction, to state the conditions (e.g., bounded score multipliers) under which the bound could extend, and to emphasize that the primary support for using the compressed episodes in the gradient update is the combination of the return-kernel construction with the empirical results on MuJoCo tasks. revision: yes

Circularity Check

0 steps flagged

No circularity; forward construction from GP modeling to quadrature selection.

full rationale

The paper describes a sequential procedure: place a GP prior on discounted returns to induce a kernel on episode space, apply episodic kernel quadrature to select a compressed subset, and feed the subset into the policy gradient update. No quoted equations or steps reduce any claimed benefit or prediction to a quantity defined by the same procedure. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described chain. The skeptic concern targets whether the quadrature error bound transfers to the score-weighted integrand, which is a validity question rather than a definitional reduction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central procedure rests on the domain assumption that discounted returns admit a useful Gaussian-process model that induces a kernel on the space of episodes; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Discounted returns of episodes can be modeled by a Gaussian process that induces a positive definite kernel on the episode space
Invoked to derive the kernel used by the quadrature rule.

pith-pipeline@v0.9.0 · 5619 in / 1074 out tokens · 20957 ms · 2026-05-24T06:19:37.328019+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 4 internal anchors

[1]

SOBER: Highly parallel Bayesian optimization and bayesian quadrature over discrete and mixed spaces

Masaki Adachi, Satoshi Hayakawa, Saad Hamid, Martin Jørgensen, Harald Oberhauser, and Micheal A Osborne. SOBER: Highly parallel Bayesian optimization and bayesian quadrature over discrete and mixed spaces. arXiv preprint arXiv:2301.11832, 2023a. 2All the experiments with MuJoCo were conducted with a Google Cloud Vertex AI notebook with an NVIDIA T4 (16-co...

work page arXiv
[2]

Practical Coreset Constructions for Machine Learning

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Carathéodory sampling for stochastic gradient descent.arXiv preprint arXiv:2006.01819,

Francesco Cosentino, Harald Oberhauser, and Alessandro Abate. Carathéodory sampling for stochastic gradient descent.arXiv preprint arXiv:2006.01819,

work page arXiv 2006
[4]

Kernel quadrature with randomly pivoted cholesky.arXiv preprint arXiv:2306.03955,

Ethan N Epperly and Elvira Moreno. Kernel quadrature with randomly pivoted cholesky.arXiv preprint arXiv:2306.03955,

work page arXiv
[5]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Car- bune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. InPro- ceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 1928–1937. PMLR,

work page 1928
[7]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Kazuma Tsuji, Ken’ichiro Tanaka, and Sebastian Pokutta

URL https://zenodo.org/record/8127025. Kazuma Tsuji, Ken’ichiro Tanaka, and Sebastian Pokutta. Pairwise conditional gradients without swap steps and sparser kernel herding. InInternational Conference on Machine Learning, pp. 21864–21883. PMLR,

work page arXiv
[9]

Dataset Distillation

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation.arXiv preprint arXiv:1811.10959,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Suppose f∼GP(0,K )

A Proofs A.1 Proof of Proposition 1 Proof. Suppose f∼GP(0,K ). We first compute the following term: EGP [µ(f )2] = EGP [∫ ∫ f (x)f (y) dµ(x) dµ(y) ] . By Cauchy-Schwarz, we have ∫∫ |f (x)f (y)|dµ(x) dµ(y) ≤( ∫ f (x)2 dµ(x))1/2( ∫ f (y)2 dµ(y))1/2 =∫ f (x)2 dµ(x), so we have EGP [∫ ∫ |f (x)f (y)|dµ(x) dµ(y) ] ≤EGP [∫ f (x)2 dµ(x) ] = ∫ EGP [ f (x)2] dµ(x) ...

work page 2006
[11]

We can also apply PGKQ to PPO (Schulman et al., 2017)

The observationLvI =Lvpg[Rt−EGP [Rt]] andLvII =Lvpg[EGP [Rt]−b] allows a unified implemen- tation. We can also apply PGKQ to PPO (Schulman et al., 2017). In the usual PPO, we use the probability ratio (as a functional of an episode)qθ t (e) := πθ(at|st)/πθold (at|st), whereθold is the policy parameter at which we assume the episodese1,...,eN have been dra...

work page 2017
[12]

Here, b is again the value estimatorVφtrained by fake rewards

is given by a straightforward replacement ofLvpg byLppo (p2) When combining PPO with the Option 2, we can also imitate the decomposition of (v2): (Loss) = ∑ i∈I wiLpI(ei)+ 1 N N∑ i=1 LpII(ei), where LpI :=Lppo[Rt−EGP [Rt]], LpII :=Lppo[EGP [Rt]−b]. Here, b is again the value estimatorVφtrained by fake rewards. We can also consider the use of other advanta...

work page 2017

[1] [1]

SOBER: Highly parallel Bayesian optimization and bayesian quadrature over discrete and mixed spaces

Masaki Adachi, Satoshi Hayakawa, Saad Hamid, Martin Jørgensen, Harald Oberhauser, and Micheal A Osborne. SOBER: Highly parallel Bayesian optimization and bayesian quadrature over discrete and mixed spaces. arXiv preprint arXiv:2301.11832, 2023a. 2All the experiments with MuJoCo were conducted with a Google Cloud Vertex AI notebook with an NVIDIA T4 (16-co...

work page arXiv

[2] [2]

Practical Coreset Constructions for Machine Learning

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Carathéodory sampling for stochastic gradient descent.arXiv preprint arXiv:2006.01819,

Francesco Cosentino, Harald Oberhauser, and Alessandro Abate. Carathéodory sampling for stochastic gradient descent.arXiv preprint arXiv:2006.01819,

work page arXiv 2006

[4] [4]

Kernel quadrature with randomly pivoted cholesky.arXiv preprint arXiv:2306.03955,

Ethan N Epperly and Elvira Moreno. Kernel quadrature with randomly pivoted cholesky.arXiv preprint arXiv:2306.03955,

work page arXiv

[5] [5]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Car- bune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. InPro- ceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 1928–1937. PMLR,

work page 1928

[7] [7]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Kazuma Tsuji, Ken’ichiro Tanaka, and Sebastian Pokutta

URL https://zenodo.org/record/8127025. Kazuma Tsuji, Ken’ichiro Tanaka, and Sebastian Pokutta. Pairwise conditional gradients without swap steps and sparser kernel herding. InInternational Conference on Machine Learning, pp. 21864–21883. PMLR,

work page arXiv

[9] [9]

Dataset Distillation

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation.arXiv preprint arXiv:1811.10959,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Suppose f∼GP(0,K )

A Proofs A.1 Proof of Proposition 1 Proof. Suppose f∼GP(0,K ). We first compute the following term: EGP [µ(f )2] = EGP [∫ ∫ f (x)f (y) dµ(x) dµ(y) ] . By Cauchy-Schwarz, we have ∫∫ |f (x)f (y)|dµ(x) dµ(y) ≤( ∫ f (x)2 dµ(x))1/2( ∫ f (y)2 dµ(y))1/2 =∫ f (x)2 dµ(x), so we have EGP [∫ ∫ |f (x)f (y)|dµ(x) dµ(y) ] ≤EGP [∫ f (x)2 dµ(x) ] = ∫ EGP [ f (x)2] dµ(x) ...

work page 2006

[11] [11]

We can also apply PGKQ to PPO (Schulman et al., 2017)

The observationLvI =Lvpg[Rt−EGP [Rt]] andLvII =Lvpg[EGP [Rt]−b] allows a unified implemen- tation. We can also apply PGKQ to PPO (Schulman et al., 2017). In the usual PPO, we use the probability ratio (as a functional of an episode)qθ t (e) := πθ(at|st)/πθold (at|st), whereθold is the policy parameter at which we assume the episodese1,...,eN have been dra...

work page 2017

[12] [12]

Here, b is again the value estimatorVφtrained by fake rewards

is given by a straightforward replacement ofLvpg byLppo (p2) When combining PPO with the Option 2, we can also imitate the decomposition of (v2): (Loss) = ∑ i∈I wiLpI(ei)+ 1 N N∑ i=1 LpII(ei), where LpI :=Lppo[Rt−EGP [Rt]], LpII :=Lppo[EGP [Rt]−b]. Here, b is again the value estimatorVφtrained by fake rewards. We can also consider the use of other advanta...

work page 2017