Incrementally Learning Functions of the Return

Brendan Bennett; Muhammad Zaheer; Vincent Liu; Wesley Chung

arxiv: 1907.04651 · v1 · pith:N5EBY4PKnew · submitted 2019-07-05 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 02:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: reinforcement learning · temporal difference learning · return moments · Taylor expansion · analytic functions · online learning · policy evaluation

The pith

Functions of the return can be approximated by learning its moments incrementally with a modified TD algorithm and inserting them into a Taylor expansion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for estimating analytic functions of the return that cannot be targeted directly by standard temporal difference updates. Moments of the return are first learned online through a modified TD procedure, then combined via Taylor series centered at the mean return to produce the approximation. This matters for reinforcement learning because many practically useful objectives, such as certain risk or utility measures, are nonlinear functions of total reward rather than simple expectations. A sympathetic reader would see the approach as extending the reach of incremental, model-free learning to those objectives while preserving the online character of TD methods.

Core claim

Any analytic function of the return can be estimated by first obtaining the moments of the return distribution through a modified temporal difference algorithm and then substituting those moments into a Taylor expansion around the expected return.
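Written out, the claim rests on a standard identity. The block below is a sketch in notation chosen here (G for the return, μ for its mean, f for the target function, n for the truncation order), which may differ from the paper's own symbols. It assumes f is analytic on the range of G and that the expectation and the series can be exchanged.

```latex
\mathbb{E}[f(G)]
  = \sum_{k=0}^{\infty} \frac{f^{(k)}(\mu)}{k!}\,\mathbb{E}\!\left[(G-\mu)^{k}\right]
  \approx f(\mu) + \frac{f''(\mu)}{2}\,\operatorname{Var}(G)
    + \sum_{k=3}^{n} \frac{f^{(k)}(\mu)}{k!}\,\mathbb{E}\!\left[(G-\mu)^{k}\right],
\qquad \mu = \mathbb{E}[G].
```

The first-order term vanishes because E[G − μ] = 0, so the lowest-order correction to f(μ) comes from the variance of the return.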

What carries the argument

A modified TD algorithm that incrementally learns the moments of the return, used inside a Taylor series to approximate the target function.
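This page does not reproduce the paper's update rule. The following is a minimal tabular sketch of one well-known way to learn a second moment incrementally, not necessarily the authors' algorithm. It uses the recursion E[G² | s] = E[R² + 2γ R V(s') + γ² M(s')], which follows from G = R + γG' and the Markov property. The function name `td_moments`, the transition tuple layout, and the default step size are assumptions made for illustration.

```python
def td_moments(transitions, num_states, gamma=0.9, alpha=0.05):
    """Tabular TD(0) sketch for the first two raw moments of the return.

    transitions: iterable of (s, r, s_next, done) tuples (hypothetical layout).
    Returns (v, m2), where v[s] estimates E[G | s] and m2[s] estimates E[G^2 | s].
    """
    v = [0.0] * num_states
    m2 = [0.0] * num_states
    for s, r, s_next, done in transitions:
        # Terminal successors contribute zero to both moments.
        v_next = 0.0 if done else v[s_next]
        m2_next = 0.0 if done else m2[s_next]
        # Ordinary TD target for the mean.
        target_v = r + gamma * v_next
        # Second-moment target: expand (r + gamma * G')^2 and bootstrap
        # E[G'] and E[G'^2] from the current estimates.
        target_m2 = r * r + 2.0 * gamma * r * v_next + gamma ** 2 * m2_next
        v[s] += alpha * (target_v - v[s])
        m2[s] += alpha * (target_m2 - m2[s])
    return v, m2
```

The variance estimate for a state is then `m2[s] - v[s] ** 2`. Both estimates are updated from the same transition, so no extra data is needed beyond what ordinary TD already consumes.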

If this is right

Standard value functions become a special case when the target function is the identity.
Higher moments can be learned in the same incremental pass, enabling approximations that capture variance or skewness of returns.
The method remains fully online and model-free, requiring only the same data stream used for ordinary TD learning.
Any analytic function becomes representable without deriving a new Bellman operator for that specific function.
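To illustrate how learned moments could feed the approximation, here is a small sketch that converts raw moments E[G^k] to central moments and evaluates a truncated Taylor estimate of E[f(G)]. The helper names `central_moments` and `taylor_expectation` are hypothetical, and the caller is assumed to supply the derivatives of f at the mean.

```python
from math import comb, factorial

def central_moments(raw):
    """Convert raw moments to central moments.

    raw[k] = E[G^k] for k = 0..n, with raw[0] == 1.
    Returns c with c[k] = E[(G - mu)^k], using the binomial expansion.
    """
    mu = raw[1]
    return [
        sum(comb(k, j) * raw[j] * (-mu) ** (k - j) for j in range(k + 1))
        for k in range(len(raw))
    ]

def taylor_expectation(derivs_at_mu, central):
    """Truncated Taylor estimate of E[f(G)].

    derivs_at_mu[k] = k-th derivative of f evaluated at mu = E[G].
    central[k]      = k-th central moment of G.
    """
    return sum(
        d * c / factorial(k)
        for k, (d, c) in enumerate(zip(derivs_at_mu, central))
    )
```

For a polynomial target the truncated series is exact once the order reaches the polynomial's degree, which makes polynomials a convenient sanity check before trying functions such as exp or log.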

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same moment estimates could support risk-sensitive policy improvement by swapping different utility functions into the Taylor approximator after learning.
In environments with heavy-tailed returns the truncation order of the Taylor series would need to be chosen adaptively to keep approximation error bounded.
The approach supplies a lightweight alternative to full distributional RL when only a specific functional of the return is required rather than the entire distribution.

Load-bearing premise

The functions to be estimated must be analytic so that a Taylor expansion around the mean return yields an accurate approximation from the learned moments.

What would settle it

Run the modified TD updates on a simple Markov reward process whose return distribution is known exactly; if the estimated moments diverge from the true moments or if the resulting Taylor approximations deviate substantially from the true function values, the method does not work as claimed.
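A check of this kind could be set up as follows. This is a sketch under stated assumptions, not the paper's experiment: the MRP encoding (a list of `(prob, reward, next_state)` outcomes per state, with `None` marking termination) and the names `exact_moments` and `sample_return` are invented here. Ground-truth first and second moments come from iterating the moment recursions with the known model, and sampled returns provide an independent reference.

```python
import random

def exact_moments(mrp, gamma, sweeps=1000):
    """Iterate the first- and second-moment recursions using the known model."""
    n = len(mrp)
    v, m2 = [0.0] * n, [0.0] * n
    for _ in range(sweeps):
        new_v, new_m2 = [], []
        for outcomes in mrp:
            ev = em = 0.0
            for p, r, nxt in outcomes:
                vn = 0.0 if nxt is None else v[nxt]
                mn = 0.0 if nxt is None else m2[nxt]
                ev += p * (r + gamma * vn)
                em += p * (r * r + 2.0 * gamma * r * vn + gamma ** 2 * mn)
            new_v.append(ev)
            new_m2.append(em)
        v, m2 = new_v, new_m2
    return v, m2

def sample_return(mrp, gamma, start, rng):
    """Draw one discounted return by simulating the MRP until termination."""
    g, discount, s = 0.0, 1.0, start
    while s is not None:
        u, acc = rng.random(), 0.0
        for p, r, nxt in mrp[s]:
            acc += p
            if u < acc:
                break
        g += discount * r
        discount *= gamma
        s = nxt
    return g
```

As a usage example, a single state that terminates with reward 1 with probability 0.5, and otherwise loops with reward 0, has E[G] = 2/3 and E[G²] = 4/7 when γ = 0.5. Learned moment estimates can then be compared against both the exact values and the sample averages.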

Original abstract

Temporal difference methods enable efficient estimation of value functions in reinforcement learning in an incremental fashion, and are of broader interest because they correspond to learning as observed in biological systems. Standard value functions correspond to the expected value of a sum of discounted returns. While this formulation is often sufficient for many purposes, it would often be useful to be able to represent functions of the return as well. Unfortunately, most such functions cannot be estimated directly using TD methods. We propose a means of estimating functions of the return using its moments, which can be learned online using a modified TD algorithm. The moments of the return are then used as part of a Taylor expansion to approximate analytic functions of the return.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches learning moments of the return with a modified TD algorithm and then approximating functions via a Taylor series, but the approximation step looks unreliable for typical RL return distributions.

The main takeaway is that this work tries to extend standard TD value learning to other functions of the return by first estimating moments online and then using a Taylor expansion. The framing is straightforward and identifies a real limitation in what standard value functions can represent. It does a reasonable job motivating why you might want risk-aware or higher-moment objectives without switching to full distributional RL. The modified TD rule for moments plus the Taylor step is presented as a new combination, and nothing in the abstract suggests it is just re-deriving prior results. That part earns some credit for trying to keep things incremental and simple.

The central weakness is the approximation quality. Return distributions in RL are usually far from Gaussian and can have heavy tails or skewness, so even a low-order Taylor series around the mean can have a large remainder term that the method does not bound. The abstract supplies no derivation of the TD update, no convergence argument, and no experiments, which leaves the soundness hard to assess. If the full paper contains those elements plus checks on the truncation error, the contribution would be clearer. As written, the idea is still at the proposal stage.

This is the kind of paper that might interest people working on extensions of value functions or risk-sensitive RL. A reader already familiar with distributional methods would see the connection but would also spot the missing error analysis. I would send it to peer review so referees can evaluate whether the modified TD converges and whether the Taylor approximation holds up in any concrete setting.

Referee Report

2 major / 1 minor

Summary. The paper proposes a method to estimate analytic functions of the discounted return in reinforcement learning by learning the moments of the return online via a modified temporal difference algorithm and then approximating the target function with a Taylor expansion around the mean.

Significance. If the central construction holds, the approach would extend standard TD learning beyond expectations to a wider class of return functionals, which could be useful for risk-sensitive or other non-linear objectives in RL.

major comments (2)

[Abstract] The abstract claims that moments of the return can be learned incrementally with a modified TD algorithm but supplies neither the update rule nor a convergence argument; both are load-bearing for the incremental-learning part of the proposal.
[Abstract] No analysis or bound is given for the Lagrange remainder (or equivalent truncation error) of the Taylor expansion; because return distributions are typically non-Gaussian and can exhibit high variance or skewness, the approximation error may dominate even with perfect moment estimates.
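For reference, the truncation error in question has the following standard form. The notation (G for the return, μ for its mean, n for the truncation order, T_n for the truncated estimate, C for a derivative bound) is chosen here and is not taken from the paper. The bound assumes f is (n+1)-times differentiable with |f^{(n+1)}| ≤ C on an interval containing the support of G.

```latex
f(g) = \sum_{k=0}^{n} \frac{f^{(k)}(\mu)}{k!}\,(g-\mu)^{k}
       + \frac{f^{(n+1)}(\xi_g)}{(n+1)!}\,(g-\mu)^{n+1},
\qquad \xi_g \text{ between } \mu \text{ and } g,

\left|\, \mathbb{E}[f(G)] - T_n \,\right|
  \le \frac{C}{(n+1)!}\;\mathbb{E}\!\left[\,|G-\mu|^{\,n+1}\right],
\qquad
T_n = \sum_{k=0}^{n} \frac{f^{(k)}(\mu)}{k!}\,\mathbb{E}\!\left[(G-\mu)^{k}\right].
```

The bound scales with the absolute central moment of order n+1, which is exactly the quantity that grows quickly for heavy-tailed or highly skewed returns.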

minor comments (1)

The abstract does not specify the order of the Taylor expansion or give concrete examples of target functions of the return.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments. We address each major comment below.

Referee: [Abstract] The abstract claims that moments of the return can be learned incrementally with a modified TD algorithm but supplies neither the update rule nor a convergence argument; both are load-bearing for the incremental-learning part of the proposal.

Authors: The abstract is a high-level summary. The modified TD update for the moments appears in Equation (4) and the associated convergence result is stated as Theorem 1 (under standard Robbins-Monro conditions on the step sizes and the usual contraction mapping for the Bellman operator). We can revise the abstract to include a one-sentence pointer to these results.
Revision: partial

Referee: [Abstract] No analysis or bound is given for the Lagrange remainder (or equivalent truncation error) of the Taylor expansion; because return distributions are typically non-Gaussian and can exhibit high variance or skewness, the approximation error may dominate even with perfect moment estimates.

Authors: We agree that the manuscript provides no general bound on the Lagrange remainder. Deriving a useful, distribution-free bound is difficult because it would require control of all higher-order moments and the radius of analyticity of the target function. The paper instead validates the approximation empirically for the functions considered. We will add a short discussion of the truncation error and the practical regimes in which low-order expansions remain accurate.
Revision: yes

Circularity Check

0 steps flagged

No circularity; forward proposal of moment-based approximation

The paper's core claim is a constructive proposal: learn return moments incrementally via a modified TD algorithm, then insert those moments into a Taylor series to approximate analytic functions of the return. No equation reduces a target quantity to a fitted parameter by definition, no prediction is statistically forced by a prior fit on related data, and the provided text contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The Taylor expansion and TD update are standard external tools applied to a new target; the analyticity assumption is stated explicitly as a prerequisite rather than derived. This is the common case of a self-contained algorithmic proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proposal rests on the domain assumption that return functions of interest are analytic and that moments suffice for a useful Taylor approximation; no free parameters or invented entities are mentioned.

axioms (2)

domain assumption: Functions of the return are analytic and can be approximated by Taylor expansion around the mean using its moments.
Required for the approximation step to be valid.
domain assumption: A modified TD algorithm can learn the moments of the return online.
Central premise enabling incremental estimation.
