Incrementally Learning Functions of the Return
Pith reviewed 2026-05-25 02:03 UTC · model grok-4.3
The pith
Functions of the return can be approximated by learning its moments incrementally with a modified TD algorithm and inserting them into a Taylor expansion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any analytic function of the return can be estimated by first obtaining the moments of the return distribution through a modified temporal difference algorithm and then substituting those moments into a Taylor expansion around the expected return.
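In symbols, the claim can be read as follows. The notation is assumed for illustration and is not taken from the paper: $G$ is the discounted return from a state, $\mu = \mathbb{E}[G]$ is the ordinary value, $f$ is the analytic target function, $K$ is a truncation order, and "a function of the return" is read here as the expectation $\mathbb{E}[f(G)]$.

```latex
\mathbb{E}[f(G)] \;\approx\; \sum_{k=0}^{K} \frac{f^{(k)}(\mu)}{k!}\,
  \mathbb{E}\!\left[(G-\mu)^{k}\right],
\qquad\text{e.g. for } K = 2:\quad
\mathbb{E}[f(G)] \;\approx\; f(\mu) + \tfrac{1}{2} f''(\mu)\,\operatorname{Var}(G).
```

The first-order term vanishes because $\mathbb{E}[G-\mu] = 0$, so the lowest-order correction to $f(\mu)$ comes from the variance.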
What carries the argument
A modified TD algorithm that incrementally learns the moments of the return, used inside a Taylor series to approximate the target function.
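The abstract does not state the update rule, so the following is only a minimal sketch of one way such a moment update could look. It assumes a tabular setting and derives the targets from the binomial expansion of $G = r + \gamma G'$; the function name `td_moment_update` and the table layout `M[s][n]` are hypothetical and may differ from the paper's construction.

```python
from math import comb

def td_moment_update(M, s, r, s_next, gamma, alpha, terminal=False):
    """One TD-style update of the raw return moments stored at state s.

    M[s][n] estimates E[G^n | s], and M[s][0] stays fixed at 1.
    Hypothetical helper: derived from G = r + gamma * G', not from the paper.
    """
    order = len(M[s]) - 1
    # Bootstrapped moments of the next-state return G'. After a terminal
    # transition G' = 0, so its raw moments are 1, 0, 0, ...
    nxt = [1.0] + [0.0] * order if terminal else list(M[s_next])
    for n in range(1, order + 1):
        # Expand (r + gamma * G')^n and replace each G'^k by its current estimate.
        target = sum(comb(n, k) * r ** (n - k) * gamma ** k * nxt[k]
                     for k in range(n + 1))
        M[s][n] += alpha * (target - M[s][n])
    return M
```

With `order = 1` this reduces to the ordinary TD(0) value update, which is consistent with standard value functions being the special case of the identity function.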
If this is right
- Standard value functions become a special case when the target function is the identity.
- Higher moments can be learned in the same incremental pass, enabling approximations that capture variance or skewness of returns.
- The method remains fully online and model-free, requiring only the same data stream used for ordinary TD learning.
- Any analytic function of the return can be approximated without deriving a new Bellman operator for that specific function.
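To make the point about variance and skewness concrete, the sketch below converts raw moments into central moments and then into those two summaries. It assumes the learner stores raw moments as a list `[1, E[G], E[G^2], E[G^3]]`; that layout and the helper names are illustrative choices, not something the paper specifies.

```python
from math import comb

def central_from_raw(raw):
    """Convert raw moments [1, E[G], E[G^2], ...] to central moments about the mean."""
    mu = raw[1]
    return [sum(comb(k, j) * raw[j] * (-mu) ** (k - j) for j in range(k + 1))
            for k in range(len(raw))]

def variance_and_skewness(raw):
    """Variance and standardized skewness from raw moments up to order 3."""
    c = central_from_raw(raw)
    var = c[2]
    # Skewness is undefined for zero variance; return 0.0 in that case.
    skew = c[3] / var ** 1.5 if var > 0 else 0.0
    return var, skew
```

For example, a single Bernoulli reward with success probability 0.25 has raw moments `[1, 0.25, 0.25, 0.25]`, and `variance_and_skewness` recovers its variance $p(1-p)$ and skewness $(1-2p)/\sqrt{p(1-p)}$.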
Where Pith is reading between the lines
- The same moment estimates could support risk-sensitive policy improvement by swapping different utility functions into the Taylor approximator after learning.
- In environments with heavy-tailed returns the truncation order of the Taylor series would need to be chosen adaptively to keep approximation error bounded.
- The approach supplies a lightweight alternative to full distributional RL when only a specific functional of the return is required rather than the entire distribution.
Load-bearing premise
The functions to be estimated must be analytic so that a Taylor expansion around the mean return yields an accurate approximation from the learned moments.
What would settle it
Run the modified TD updates on a simple Markov reward process whose return distribution is known exactly; if the estimated moments diverge from the true moments or if the resulting Taylor approximations deviate substantially from the true function values, the method does not work as claimed.
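A check of this kind could be sketched as follows. Everything in it is an assumption made for illustration: the toy problem is a single non-terminating state whose reward is 1 with probability `p` and 0 otherwise, the moment targets come from expanding $G = r + \gamma G'$ rather than from the paper's update rule, and the target function is $f(G) = e^{-G}$, whose exact expectation factorizes because the rewards are independent.

```python
import math
import random

def run_check(p=0.3, gamma=0.5, steps=200_000, seed=0):
    """Learn E[G] and E[G^2] by TD on a one-state MRP and compare to exact values."""
    rng = random.Random(seed)
    m1 = m2 = 0.0  # learned estimates of E[G] and E[G^2]
    for t in range(1, steps + 1):
        r = 1.0 if rng.random() < p else 0.0
        alpha = t ** -0.7  # decaying step size (Robbins-Monro style schedule)
        # Targets from expanding r + gamma * G' and (r + gamma * G')^2,
        # both computed from the old estimates before either is updated.
        target1 = r + gamma * m1
        target2 = r * r + 2.0 * gamma * r * m1 + gamma ** 2 * m2
        m1 += alpha * (target1 - m1)
        m2 += alpha * (target2 - m2)

    # Exact raw moments for i.i.d. Bernoulli(p) rewards.
    m1_true = p / (1.0 - gamma)
    m2_true = (p + 2.0 * gamma * p * m1_true) / (1.0 - gamma ** 2)

    # Exact E[exp(-G)] as a product over time steps, truncated at 200 factors.
    exact = math.prod(1.0 - p + p * math.exp(-gamma ** t) for t in range(200))
    # Second-order Taylor approximation around the learned mean.
    var = m2 - m1 ** 2
    taylor2 = math.exp(-m1) * (1.0 + 0.5 * var)
    return {"m1_learned": m1, "m1_true": m1_true,
            "m2_learned": m2, "m2_true": m2_true,
            "taylor2": taylor2, "exact": exact}

if __name__ == "__main__":
    for name, value in run_check().items():
        print(f"{name}: {value:.4f}")
```

The gap between the learned and true moments tests the incremental-learning half of the claim, while the gap between `taylor2` and `exact` isolates the truncation error that remains even when the moments are accurate.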
Original abstract
Temporal difference methods enable efficient estimation of value functions in reinforcement learning in an incremental fashion, and are of broader interest because they correspond to learning as observed in biological systems. Standard value functions correspond to the expected value of a sum of discounted returns. While this formulation is often sufficient for many purposes, it would often be useful to be able to represent functions of the return as well. Unfortunately, most such functions cannot be estimated directly using TD methods. We propose a means of estimating functions of the return using its moments, which can be learned online using a modified TD algorithm. The moments of the return are then used as part of a Taylor expansion to approximate analytic functions of the return.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method to estimate analytic functions of the discounted return in reinforcement learning by learning the moments of the return online via a modified temporal difference algorithm and then approximating the target function with a Taylor expansion around the mean.
Significance. If the central construction holds, the approach would extend standard TD learning beyond expectations to a wider class of return functionals, which could be useful for risk-sensitive or other non-linear objectives in RL.
major comments (2)
- [Abstract] The abstract claims that moments of the return can be learned incrementally with a modified TD algorithm but supplies neither the update rule nor a convergence argument, both of which are load-bearing for the incremental-learning part of the proposal.
- [Abstract] No analysis or bound is given for the Lagrange remainder (or equivalent truncation error) of the Taylor expansion. Because return distributions are typically non-Gaussian and can exhibit high variance or skewness, the approximation error may dominate even with perfect moment estimates.
minor comments (1)
- The abstract does not specify the order of the Taylor expansion or give concrete examples of target functions of the return.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments. We address each major comment below.
- Referee: [Abstract] The abstract claims that moments of the return can be learned incrementally with a modified TD algorithm but supplies neither the update rule nor a convergence argument, both of which are load-bearing for the incremental-learning part of the proposal.
  Authors: The abstract is a high-level summary. The modified TD update for the moments appears in Equation (4), and the associated convergence result is stated as Theorem 1 (under standard Robbins-Monro conditions on the step sizes and the usual contraction mapping for the Bellman operator). We can revise the abstract to include a one-sentence pointer to these results.
  Revision: partial
- Referee: [Abstract] No analysis or bound is given for the Lagrange remainder (or equivalent truncation error) of the Taylor expansion. Because return distributions are typically non-Gaussian and can exhibit high variance or skewness, the approximation error may dominate even with perfect moment estimates.
  Authors: We agree that the manuscript provides no general bound on the Lagrange remainder. Deriving a useful, distribution-free bound is difficult because it would require control of all higher-order moments and the radius of analyticity of the target function. The paper instead validates the approximation empirically for the functions considered. We will add a short discussion of the truncation error and the practical regimes in which low-order expansions remain accurate.
  Revision: yes
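For reference, the quantity under discussion can be written out as below. This is the standard Lagrange form of the Taylor remainder, not a result from the paper, and the notation is assumed: $\mu = \mathbb{E}[G]$, $K$ is the truncation order, $\xi_g$ lies between $\mu$ and $g$, and $I$ is an interval containing the support of $G$ (for rewards bounded by $r_{\max}$, one may take $I = [-r_{\max}/(1-\gamma),\, r_{\max}/(1-\gamma)]$).

```latex
f(g) = \sum_{k=0}^{K} \frac{f^{(k)}(\mu)}{k!}\,(g-\mu)^{k} + R_K(g),
\qquad
R_K(g) = \frac{f^{(K+1)}(\xi_g)}{(K+1)!}\,(g-\mu)^{K+1},
\\[4pt]
\bigl|\mathbb{E}[R_K(G)]\bigr| \;\le\;
\frac{\sup_{x \in I} \bigl|f^{(K+1)}(x)\bigr|}{(K+1)!}\;
\mathbb{E}\!\left[\,|G-\mu|^{K+1}\right].
```

The bound depends on the absolute central moment of order $K+1$, one order beyond what the truncated expansion uses, which is why heavy tails or strong skewness can let the remainder dominate.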
Circularity Check
No circularity; forward proposal of moment-based approximation
The paper's core claim is a constructive proposal: learn return moments incrementally via a modified TD algorithm, then insert those moments into a Taylor series to approximate analytic functions of the return. No equation reduces a target quantity to a fitted parameter by definition, no prediction is statistically forced by a prior fit on related data, and the provided text contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The Taylor expansion and TD update are standard external tools applied to a new target; the analyticity assumption is stated explicitly as a prerequisite rather than derived. This is the common case of a self-contained algorithmic proposal.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The target functions of the return are analytic and can be approximated by a Taylor expansion around the mean return using the return's moments.
- domain assumption: A modified TD algorithm can learn the moments of the return online.