On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go

Nima H. Siboni

arxiv: 2604.04686 · v1 · submitted 2026-04-06 · 💻 cs.AI

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords policy gradient · REINFORCE · reward-to-go · causality · score function · trajectory decomposition · Markov decision process

The pith

Reward-to-go in policy gradients arises directly from decomposing the objective over prefix trajectories, recovering causality as a corollary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the step where full-trajectory returns are replaced by reward-to-go in REINFORCE-style derivations. It shows this replacement occurs automatically when the objective is rewritten as an expectation over successive prefix trajectories and the score-function identity is applied to each prefix. A reader would care because the past-reward terms cancel explicitly due to the decomposition and the Markov property, turning the familiar causality argument into a derived consequence rather than a separate rule. The estimator itself stays unchanged; only the justification is made explicit.

Core claim

Expressing the policy-gradient objective as an expectation over prefix trajectory distributions and applying the score-function identity at each prefix causes all terms involving rewards before the current time step to drop out of the gradient, producing the reward-to-go estimator directly. The conventional appeal to causality then follows as a simple corollary of the same decomposition.
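The core claim can be sketched as a short derivation. This is only an illustrative reconstruction, not the paper's own equations: it assumes a finite horizon T, a reward r_t that depends only on (s_t, a_t), a parametrized policy π_θ, and that differentiation and expectation may be interchanged. The prefix is written τ_{0:t} = (s_0, a_0, ..., s_t, a_t).

```latex
\begin{align*}
J(\pi_\theta)
  &= \mathbb{E}_{\tau \sim p(\tau \mid \pi_\theta)}\Big[\sum_{t=0}^{T} r_t\Big]
   = \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p(\tau_{0:t} \mid \pi_\theta)}\big[r_t\big], \\
\nabla_\theta J(\pi_\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t}}\big[r_t \,\nabla_\theta \log p(\tau_{0:t} \mid \pi_\theta)\big]
   = \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t}}\Big[r_t \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\Big] \\
  &= \mathbb{E}_{\tau}\Big[\sum_{k=0}^{T} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \sum_{t=k}^{T} r_t\Big].
\end{align*}
```

The second line uses the MDP factorization: the initial-state and transition factors of p(τ_{0:t} | π_θ) do not depend on θ, so only the policy terms up to step t survive in the log-gradient. The last line swaps the order of the two sums. In this sketch a reward r_t is never paired with the score of an action taken after step t, which is the reward-to-go form.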

What carries the argument

Prefix trajectory distributions that decompose the full trajectory measure under the MDP Markov property, combined with the score-function identity applied stepwise to each prefix.

If this is right

The reward-to-go form of the policy gradient is obtained as the direct result of the decomposition rather than by a later substitution.
Past rewards are seen to cancel because their score-function contributions average to zero over the conditional distribution of later actions.
The same estimator is recovered, but the derivation no longer requires treating causality as an extra principle.
Any setting where the prefix decomposition is valid inherits the same cancellation automatically.
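The cancellation described in the list above can be checked exactly on a toy problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP with horizon 3 and a tabular softmax policy; every number and name in it is an illustrative assumption, not something taken from the paper. It enumerates all trajectories and evaluates E[r_k ∇_θ log π(a_t | s_t)] for every pair (k, t).

```python
import itertools
import math

# Hypothetical toy MDP; every number below is an illustrative assumption.
N_STATES, N_ACTIONS, HORIZON = 2, 2, 3
MU = [0.6, 0.4]                                   # initial state distribution
P_NEXT = [[[0.7, 0.3], [0.2, 0.8]],               # P_NEXT[s][a][s_next]
          [[0.5, 0.5], [0.9, 0.1]]]
REWARD = [[1.0, -0.5], [0.3, 2.0]]                # REWARD[s][a]
THETA = [[0.2, -0.4], [0.1, 0.7]]                 # softmax logits THETA[s][a]
DIM = N_STATES * N_ACTIONS


def policy(s):
    """Softmax action probabilities in state s."""
    weights = [math.exp(x) for x in THETA[s]]
    total = sum(weights)
    return [w / total for w in weights]


def score(s, a):
    """Gradient of log pi(a|s) with respect to all logits, as a flat list."""
    grad = [0.0] * DIM
    probs = policy(s)
    for b in range(N_ACTIONS):
        grad[s * N_ACTIONS + b] = (1.0 if a == b else 0.0) - probs[b]
    return grad


def trajectories():
    """Yield (probability, states, actions) for every trajectory."""
    for states in itertools.product(range(N_STATES), repeat=HORIZON):
        for actions in itertools.product(range(N_ACTIONS), repeat=HORIZON):
            prob = MU[states[0]]
            for t in range(HORIZON):
                prob *= policy(states[t])[actions[t]]
                if t + 1 < HORIZON:
                    prob *= P_NEXT[states[t]][actions[t]][states[t + 1]]
            yield prob, states, actions


def cross_term(k, t):
    """Exact E[ r_k * grad log pi(a_t | s_t) ] by full enumeration."""
    total = [0.0] * DIM
    for prob, states, actions in trajectories():
        r_k = REWARD[states[k]][actions[k]]
        g = score(states[t], actions[t])
        total = [acc + prob * r_k * g_i for acc, g_i in zip(total, g)]
    return total


# Largest absolute component of each cross term, split into past (k < t)
# and present-or-future (k >= t) rewards.
past = {(k, t): max(abs(x) for x in cross_term(k, t))
        for t in range(HORIZON) for k in range(t)}
future = {(k, t): max(abs(x) for x in cross_term(k, t))
          for t in range(HORIZON) for k in range(t, HORIZON)}

print("past-reward terms:  ", past)
print("reward-to-go terms: ", future)
```

In this setup the entries of `past` should vanish up to floating-point rounding, while the entries of `future` generally do not. Exact enumeration is used instead of sampling so that the result is not blurred by Monte Carlo noise.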

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prefix-decomposition logic could be tried on other estimators that currently rely on separate causality or baseline arguments.
In environments that violate the Markov assumption the explicit cancellation would fail, offering a concrete test of where the argument breaks.
Teaching materials could adopt the prefix view to reduce the number of heuristic steps students must accept on faith.

Load-bearing premise

The trajectory distribution admits a clean decomposition into prefix distributions under the standard MDP Markov property, and the score-function identity holds without modification.

What would settle it

A direct expansion of the gradient under the prefix decomposition that leaves nonzero expectation terms involving rewards from earlier timesteps would show the cancellation does not occur.
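One way to run such a check numerically is sketched below. It assumes the same kind of hypothetical toy MDP (two states, two actions, horizon 3, tabular softmax policy; all values are illustrative assumptions). It computes the full-return gradient and the reward-to-go gradient exactly by enumeration and compares both with a central finite difference of the objective.

```python
import itertools
import math

# Hypothetical toy MDP; every number below is an illustrative assumption.
N_S, N_A, T = 2, 2, 3
MU = [0.6, 0.4]                                   # initial state distribution
P_NEXT = [[[0.7, 0.3], [0.2, 0.8]],               # P_NEXT[s][a][s_next]
          [[0.5, 0.5], [0.9, 0.1]]]
REWARD = [[1.0, -0.5], [0.3, 2.0]]                # REWARD[s][a]
DIM = N_S * N_A


def policy(theta, s):
    """Softmax action probabilities in state s for logits theta[s][a]."""
    weights = [math.exp(x) for x in theta[s]]
    total = sum(weights)
    return [w / total for w in weights]


def score(theta, s, a):
    """Gradient of log pi(a|s) with respect to all logits, flattened."""
    grad = [0.0] * DIM
    probs = policy(theta, s)
    for b in range(N_A):
        grad[s * N_A + b] = (1.0 if a == b else 0.0) - probs[b]
    return grad


def trajectories(theta):
    """Yield (probability, states, actions) for every trajectory."""
    for ss in itertools.product(range(N_S), repeat=T):
        for aa in itertools.product(range(N_A), repeat=T):
            prob = MU[ss[0]]
            for t in range(T):
                prob *= policy(theta, ss[t])[aa[t]]
                if t + 1 < T:
                    prob *= P_NEXT[ss[t]][aa[t]][ss[t + 1]]
            yield prob, ss, aa


def expected_return(theta):
    """Objective J: expected sum of rewards over the horizon."""
    return sum(p * sum(REWARD[s][a] for s, a in zip(ss, aa))
               for p, ss, aa in trajectories(theta))


def exact_gradients(theta):
    """Return (full-return gradient, reward-to-go gradient), both exact."""
    full, rtg = [0.0] * DIM, [0.0] * DIM
    for p, ss, aa in trajectories(theta):
        rewards = [REWARD[s][a] for s, a in zip(ss, aa)]
        for t in range(T):
            g = score(theta, ss[t], aa[t])
            for i in range(DIM):
                full[i] += p * sum(rewards) * g[i]      # weight: whole return
                rtg[i] += p * sum(rewards[t:]) * g[i]   # weight: rewards from t on
    return full, rtg


def finite_difference(theta, eps=1e-6):
    """Central finite-difference gradient of expected_return."""
    grad = []
    for s in range(N_S):
        for a in range(N_A):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][a] += eps
            minus[s][a] -= eps
            grad.append((expected_return(plus) - expected_return(minus)) / (2 * eps))
    return grad


THETA = [[0.2, -0.4], [0.1, 0.7]]
grad_full, grad_rtg = exact_gradients(THETA)
grad_fd = finite_difference(THETA)
gap_estimators = max(abs(x - y) for x, y in zip(grad_full, grad_rtg))
gap_fd = max(abs(x - y) for x, y in zip(grad_rtg, grad_fd))
print("full return vs reward-to-go:", gap_estimators)
print("reward-to-go vs finite difference:", gap_fd)
```

A non-negligible value of `gap_estimators` on a valid MDP would be the kind of evidence the paragraph above describes. Under the standard assumptions both gaps are expected to sit at the level of numerical error.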

Original abstract

In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper gives a clean algebraic derivation of the reward-to-go estimator from prefix trajectories but does not change the estimator or add new results.

This paper isolates the step in policy gradient derivations where past rewards drop out of the estimator. It does so by rewriting the objective as an expectation over prefix trajectories and applying the score-function identity directly to the prefix distribution. The earlier reward terms cancel algebraically, recovering the standard reward-to-go form as a direct consequence rather than an added rule. The steps follow from the usual MDP factorization and the score-function lemma with no extra assumptions or fitting. The algebra is transparent and checks out. If you are explaining this material to students, the explicit cancellation may reduce the sense that causality is being inserted by hand. The limitation is that the final estimator is identical to the known REINFORCE-with-reward-to-go. There are no new algorithms, convergence results, or experiments. The contribution is strictly in how one step is presented. This makes the paper narrow in scope. It is aimed at readers who are learning or teaching the basics of policy gradients and want to see that particular cancellation spelled out. Experienced researchers who already use the estimator will not gain new tools or insights. I would not cite it in my own work. It is worth sending to peer review for a venue that publishes short pedagogical clarifications, since the reasoning is sound and the presentation is careful.

Referee Report

0 major / 2 minor

Summary. The manuscript provides a pedagogical derivation of the REINFORCE policy gradient estimator. It decomposes the objective J(π) = E_{τ ~ p(τ|π)} [R(τ)] over prefix trajectories τ_{0:t}, applies the score-function identity to the prefix distribution p(τ_{0:t}|π), and shows that the gradient yields the reward-to-go form after algebraic cancellation of past-reward terms via E[∇_θ log p(τ_{0:t})] = 0 under the MDP factorization. The causality principle is recovered as a corollary rather than an inserted heuristic.

Significance. This work clarifies a commonly opaque step in policy gradient derivations without altering the estimator or introducing new assumptions beyond the standard score-function lemma and MDP Markov property. It strengthens the conceptual understanding by deriving reward-to-go directly from prefix distributions, which may aid teaching and reduce reliance on post-hoc justifications. The derivation is parameter-free and relies on established identities.

minor comments (2)

[Introduction] The introduction could benefit from citing one or two specific common references (e.g., a textbook section or lecture note) where the causality step is presented as a heuristic replacement, to sharpen the contrast with the proposed derivation.
[Derivation] In the derivation section, the transition from the full-trajectory expectation to the prefix decomposition (around the point where the objective is rewritten as a sum over t) would be clearer with an explicit equation number for the intermediate step J(π) = E[∑_t r_t].

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the clear summary of its contribution, and the recommendation to accept. We are pleased that the pedagogical value of deriving reward-to-go directly from prefix-trajectory decompositions was recognized.

Circularity Check

0 steps flagged

No significant circularity; derivation uses external score-function identity

The paper decomposes the policy-gradient objective over prefix trajectories and applies the zero-mean property of the score function, E[∇_θ log p(τ_{0:t})] = 0, to cancel past-reward terms algebraically under the MDP factorization. This property is an externally established fact (∫ p ∇log p dτ = ∇ ∫ p dτ = 0), not derived or fitted inside the paper, and the Markov property is a background RL assumption rather than a self-referential definition. No self-citations are load-bearing, no parameters are estimated from data, and the reward-to-go form emerges directly from the decomposition without renaming a known result or smuggling in an ansatz. The central claim therefore remains independent of its inputs.
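For reference, the zero-mean fact and its use on a past reward can be written out as follows. This is a sketch under assumed notation (π_θ for the parametrized policy, r_k for the reward at step k, k < t), and it assumes that differentiation and integration can be interchanged.

```latex
\begin{align*}
\mathbb{E}_{\tau_{0:t}}\big[\nabla_\theta \log p(\tau_{0:t} \mid \pi_\theta)\big]
  &= \int p(\tau_{0:t} \mid \pi_\theta)\, \nabla_\theta \log p(\tau_{0:t} \mid \pi_\theta)\, d\tau_{0:t}
   = \nabla_\theta \int p(\tau_{0:t} \mid \pi_\theta)\, d\tau_{0:t}
   = \nabla_\theta 1 = 0, \\
\mathbb{E}\big[r_k \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]
  &= \mathbb{E}\Big[r_k \;\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]\Big]
   = 0 \qquad (k < t).
\end{align*}
```

The second line conditions on the history up to s_t. Because r_k with k < t is already fixed by that history, it factors out of the inner expectation, which vanishes by the same argument as in the first line applied to π_θ(· | s_t).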

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on two standard background items from RL theory and no new free parameters or postulated entities.

axioms (2)

[standard math] Score-function identity: the gradient of an expectation equals the expectation of the score times the function.
Invoked to convert the gradient of the policy-gradient objective into an expectation over trajectories.
[domain assumption] The trajectory distribution factors into prefix distributions under the Markov property.
Allows the objective to be decomposed so that past rewards become independent of the current action.
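The first axiom can be illustrated on a single categorical distribution. The sketch below assumes a hypothetical three-action softmax policy and an arbitrary function f over actions (both are illustrative choices, not taken from the paper), and compares E[f(a) ∇ log π(a)] with a finite-difference gradient of E[f(a)].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(logits, f):
    """E_{a ~ softmax(logits)}[f(a)]."""
    return sum(p * f_a for p, f_a in zip(softmax(logits), f))


def score_function_gradient(logits, f):
    """E[f(a) * grad log pi(a)], computed exactly over all actions."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            # d/d logit_i of log softmax(logits)[a] is 1[i == a] - probs[i].
            score_i = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += probs[a] * f[a] * score_i
    return grad


def finite_difference_gradient(logits, f, eps=1e-6):
    """Central finite-difference gradient of the expectation."""
    grad = []
    for i in range(len(logits)):
        plus, minus = list(logits), list(logits)
        plus[i] += eps
        minus[i] -= eps
        grad.append((expectation(plus, f) - expectation(minus, f)) / (2 * eps))
    return grad


LOGITS = [0.3, -1.2, 0.8]      # hypothetical policy parameters
F = [2.0, -1.0, 0.5]           # hypothetical per-action values

sf_grad = score_function_gradient(LOGITS, F)
fd_grad = finite_difference_gradient(LOGITS, F)
identity_gap = max(abs(x - y) for x, y in zip(sf_grad, fd_grad))

# With a constant f the same formula gives the mean score, which should be zero.
mean_score = score_function_gradient(LOGITS, [1.0, 1.0, 1.0])

print("identity gap:", identity_gap)
print("mean score:", mean_score)
```

The constant-f case at the end is the zero-mean property that the cancellation of past rewards relies on.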

pith-pipeline@v0.9.0 · 5443 in / 1258 out tokens · 32669 ms · 2026-05-10T18:59:32.902588+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Sergey Levine. CS 285/185: Deep Reinforcement Learning, Lecture 5: Policy Gradients. University of California, Berkeley. Available at the course website.
[2] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[3] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 1999.
[4] OpenAI. Spinning Up in Deep RL: Part 3, Intro to Policy Optimization. Documentation resource.
