arXiv: 2604.18972 · v2 · submitted 2026-04-21 · stat.ML · cs.LG · math.OC


Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

Yaowei Zheng, Richong Zhang, Shenxi Wu, Shirui Bian, Haosong Zhang, Li Zeng, Xingjian Ma, Yichi Zhang


Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.OC

keywords: continuous-time policy evaluation · high-order generator regression · Bellman recursion · moment-matching coefficients · time-inhomogeneous dynamics · value function estimation · error decomposition · finite-horizon evaluation


The pith

High-order generator regression from multi-step trajectories yields more accurate continuous-time policy values than the Bellman baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in continuous-time policy evaluation with discrete data under time-inhomogeneous dynamics, the standard one-step Bellman recursion only achieves first-order accuracy in the time grid. By estimating the time-dependent generator using moment-matching coefficients on multi-step transitions to cancel lower-order truncation terms, and then performing backward regression, higher-order accuracy becomes possible. This approach comes with a full error decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, plus a decision-frequency regime map showing when the gains appear. Experiments confirm consistent improvements over the baseline in calibration studies, benchmarks, and stress tests. A sympathetic reader would care because many real systems evolve continuously but data arrives in discrete steps, and better value estimates can improve downstream decisions.

Core claim

The central claim is that high-order generator regression outperforms first-order Bellman methods in calibration and benchmark tests. The method fits moment-matching coefficients to multi-step closed-loop trajectories to approximate the time-dependent generator, then applies backward regression. It comes with an end-to-end error decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, and with a decision-frequency regime map that identifies where the higher-order benefits are observable.

What carries the argument

The high-order generator estimator using moment-matching coefficients from multi-step transitions that cancel truncation terms, combined with backward regression on the surrogate generator.
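The cancellation condition can be made concrete with a small solver. The sketch below assumes that an order-i estimator uses steps j = 1, ..., i and that its coefficients satisfy sum_j a_j j^k = 1 for k = 1 and 0 for k = 2, ..., i. The index ranges are an assumption: the passage from the paper's (2.9) quoted later on this page states only "k = 1 and 0 otherwise". The function name `moment_matching_coefficients` is hypothetical and is not taken from the paper.

```python
from fractions import Fraction


def moment_matching_coefficients(order):
    """Solve sum_j a_j * j**k = (1 if k == 1 else 0) for k = 1..order, j = 1..order.

    Assumption (not confirmed by the source): steps j and moments k both run
    from 1 to `order`. Exact rational arithmetic avoids round-off.
    """
    n = order
    # Augmented matrix: the row for moment k holds [1**k, 2**k, ..., n**k | rhs_k].
    rows = [
        [Fraction(j) ** k for j in range(1, n + 1)] + [Fraction(1 if k == 1 else 0)]
        for k in range(1, n + 1)
    ]
    # Gauss-Jordan elimination with row swaps for a nonzero pivot.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    # After elimination the last column holds the solution a_1..a_n.
    return [rows[r][n] for r in range(n)]


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, moment_matching_coefficients(order))
```

Under these assumptions, order 2 gives a_1 = 2 and a_2 = -1/2, which is the familiar second-order one-sided difference stencil. Whether the paper's Gen2 uses exactly these values is not confirmed by this page.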

If this is right

The second-order estimator achieves visible accuracy gains in the decision-frequency regimes identified by the theory.
The full error decomposition allows separate diagnosis of misspecification, projection, pooling, finite-sample, and start-up contributions.
Performance remains stable across feature ablations, start-up variations, and gain-mismatch stress tests.
Consistent outperformance over the Bellman baseline holds in four-scale benchmarks and calibration studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regime map could guide discretization choices when deploying the method on real trajectory data with varying sampling rates.
Extensions to third- or higher-order coefficients might be tested in low-noise settings to check for further error reduction.
The approach may connect to numerical schemes for solving parabolic PDEs in related optimal control problems.
Adaptive collection of multi-step trajectories could be explored to tighten finite-sample error bounds in online settings.

Load-bearing premise

Multi-step closed-loop trajectories permit reliable estimation of the time-dependent generator via moment-matching without new dominant biases under the stated dynamics.

What would settle it

Observing no improvement or increased error from the second-order method over Bellman in high-frequency decision regimes with known dynamics would falsify the claim of visible higher-order gains.

Figures

Figures reproduced from arXiv: 2604.18972 by Haosong Zhang, Li Zeng, Richong Zhang, Shenxi Wu, Shirui Bian, Xingjian Ma, Yaowei Zheng, Yichi Zhang.

**Figure 1.** Decision-frequency regime map. The left panel shows the three regimes implied by the optimized nonstationarity floor. The right panel reports the extended heavy-suite gain of Gen2 over the Bellman baseline as the number of logged episodes and the nonstationarity scale vary. Darker cells correspond to smaller gains; the annotations give the percentage reduction in integrated RMSE.

**Figure 2.** Calibration of the discretization rate. The Bellman baseline, Gen2, and Gen3 exhibit the predicted O(Δt), O(Δt^2), and O(Δt^3) scaling in a controlled time-varying diffusion example.

**Figure 3.** Over-time RMSE profiles for the Bellman baseline and Gen2 on the four benchmark families. Shaded bands show 95% confidence intervals across seeds. The largest improvements appear on the medium and larger tasks, where first-order Bellman error accumulates most strongly over the horizon.

**Figure 4.** Feature-family ablation for the Bellman baseline and Gen2. Each panel shows only the feature families actually evaluated for that task, which removes the appearance of missing data. High-order gains on the harder tasks require richer approximation classes.

**Figure 5.** Gain-mismatch stress test for the Bellman baseline and Gen2. The Bellman-to-Gen2 improvement persists over a wider mismatch range on the medium and large families than on the small family, which makes the empirical failure boundary easy to see. Adjacent paper text: … the start-up and bandwidth diagnostics help explain why Gen2 is usually the safest practical choice even though higher-order approximations are mathematically avail…

**Figure 6.** Mean Bellman-minus-Gen2 error gap under gain mismatch, with ribbons showing across-seed variability. Positive values indicate a Gen2 advantage. The small family loses that advantage quickly, while the medium and large families retain a positive gap over a broader mismatch range. Adjacent paper text: … benchmark table, and the over-time profiles all point to the same conclusion: second-order generator regression reliably improves…

**Figure 1.** Seed-wise distribution of integrated RMSE for the four-scale benchmark suite. The medium, large, and extra-large improvements are distributional shifts rather than isolated seeds.

**Figure 2.** Selection frequencies for the temporal-pooling window across benchmark seeds. On the main benchmark families the selector often prefers the largest tested window, which is why the regime-map evidence in the main paper is interpreted qualitatively rather than as a fully asymptotic verification.

**Figure 3.** Start-up ablation for second- and third-order methods. Bellman-based start-up materially improves the backward multistep recursion, especially on the medium task.

**Figure 4.** Nonstationarity diagnostic heat map. The average relative gain of Gen2 over the Bellman baseline shrinks as nonstationarity strengthens, consistent with the temporal-pooling trade-off.

**Figure 5.** Decision-frequency refinement under the enlarged bandwidth grid. The sweep is diagnostic because large selected windows remain common on part of the grid. Adjacent paper text: The near-off-policy stress test perturbs the target controller rather than introducing a separate behavior-learning problem. In the underlying benchmark files, three perturbation families are available: multiplicative gain shift, covariance inflation of …

**Figure 6.** Mean Bellman-minus-Gen2 integrated-RMSE gap under gain mismatch, with ribbons showing across-seed variability. Positive values indicate a Gen2 advantage. The small family crosses zero early, while the medium and large families remain positive over a wider mismatch range.

**Figure 7.** Selection frequency of the Gen2 pooling window under gain mismatch. On the medium and large families the widest tested bandwidth remains dominant across the sweep, whereas the small family gradually shifts mass toward smaller windows as mismatch grows. Adjacent paper text: Taken together, Figures 6 and 7 show that the breakdown boundary is not a single phenomenon. On the smallest family, the Bellman-minus-Gen2 gap shrinks quic…

**Figure 8.** Runtime scaling and dimension stress test. On the D4, D8, and D12 families, Gen2 costs about twice the Bellman-baseline runtime while substantially reducing error.

**Figure 9.** Median conditioning diagnostic as a function of the pooling window. Conditioning does not deteriorate monotonically with larger windows on these benchmark families.

**Figure 10.** Calibration of data scaling in the time-varying 10-dimensional OU process.

**Figure 11.** Calibration of data scaling in the time-varying linear-quadratic benchmark.

Original abstract

We study finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The target value surface solves a backward parabolic equation, but the Bellman baseline obtained from one-step recursion is only first-order in the grid width. We estimate the time-dependent generator from multi-step transitions using moment-matching coefficients that cancel lower-order truncation terms, and combine the resulting surrogate with backward regression. The main theory gives an end-to-end decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, together with a decision-frequency regime map explaining when higher-order gains should be visible. Across calibration studies, four-scale benchmarks, feature and start-up ablations, and gain-mismatch stress tests, the second-order estimator consistently improves on the Bellman baseline and remains stable in the regime where the theory predicts visible gains. These results position high-order generator regression as an interpretable continuous-time policy-evaluation method with a clear operating region.
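The abstract's contrast between a first-order one-step recursion and higher-order multi-step estimates can be illustrated on a deterministic toy problem. The sketch below is not the paper's estimator: it replaces the conditional expectation along a trajectory with a known smooth function (`math.sin`), and it hardcodes coefficient sets that satisfy sum_j a_j j^k = 1 for k = 1 and 0 for k = 2, ..., order (the index range is an assumption). The names `COEFFS`, `derivative_estimate`, and `observed_rate` are hypothetical.

```python
import math

# Coefficients a_1..a_order on steps j = 1..order (assumed stencil layout).
COEFFS = {
    1: (1.0,),                   # one-step difference, first order
    2: (2.0, -0.5),              # two-step combination, second order
    3: (3.0, -1.5, 1.0 / 3.0),   # three-step combination, third order
}


def derivative_estimate(u, t, dt, order):
    """Weighted multi-step difference: (1/dt) * sum_j a_j * (u(t + j*dt) - u(t))."""
    weighted = sum(
        a * (u(t + j * dt) - u(t)) for j, a in enumerate(COEFFS[order], start=1)
    )
    return weighted / dt


def observed_rate(order, t=0.3, dt=0.02):
    """Empirical convergence rate from halving dt, using u = sin and u' = cos."""
    exact = math.cos(t)
    err_coarse = abs(derivative_estimate(math.sin, t, dt, order) - exact)
    err_fine = abs(derivative_estimate(math.sin, t, dt / 2, order) - exact)
    return math.log2(err_coarse / err_fine)


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(f"order {order}: observed rate {observed_rate(order):.2f}")
```

The observed rates should sit near 1, 2, and 3, mirroring the scaling the calibration figure reports for the Bellman baseline, Gen2, and Gen3. The stochastic, regression-based setting of the paper adds projection, pooling, and sampling errors that this toy check does not capture.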

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a second-order moment-matching method for continuous-time policy evaluation with a full error decomposition and regime map, but the time-inhomogeneous handling may carry an uncancelled bias.


The paper introduces moment-matching on multi-step transitions to get a second-order estimate of the time-dependent generator for continuous-time policy evaluation. It pairs this with backward regression and supplies an error decomposition plus a regime map for when the extra order helps. This moves past the standard first-order Bellman recursion in a concrete way for finite-horizon settings with discrete trajectories under time-inhomogeneous dynamics.

The decomposition covers generator misspecification, projection error, pooling bias, finite-sample error, and start-up error. This breakdown is concrete and ties directly to the experiments, which show gains over the Bellman baseline across benchmarks, ablations, and stress tests. The regime map also lines up with where the improvements appear.

The soft spot is the time-inhomogeneous part. If the moment-matching coefficients come from a fixed-time Taylor expansion, cross terms between time and spatial derivatives could leave an uncancelled O(h) remainder. The abstract claims the construction works for time-inhomogeneous dynamics, but the full derivation would need to show how those terms are handled without extra uniformity assumptions. If they are not, the bounds and the map could be optimistic.

This work is for researchers focused on continuous-time reinforcement learning with discrete trajectory data. It has enough new structure and supporting evidence to go to a serious referee. I would recommend sending it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a high-order generator regression method for finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. It estimates the time-dependent generator via multi-step moment-matching coefficients designed to cancel lower-order truncation terms, combines the surrogate with backward regression, and supplies an end-to-end error decomposition (generator misspecification, projection error, pooling bias, finite-sample error, start-up error) together with a decision-frequency regime map. Empirical results across calibration studies, benchmarks, ablations, and stress tests show consistent improvement over the first-order Bellman baseline in the predicted regimes.

Significance. If the central derivation and cancellation hold, the work supplies a principled, interpretable continuous-time alternative to Bellman recursion with explicit error sources and operating conditions. The end-to-end decomposition and regime map are potentially valuable for RL and control applications where decision frequency and time variation matter; the reported empirical stability across multiple studies adds practical support.

major comments (2)

[Main theory section (error decomposition and generator estimation)] The end-to-end error decomposition and regime map rest on the claim that multi-step moment-matching coefficients cancel lower-order truncation terms for time-inhomogeneous generators. The Taylor expansion of the transition kernel around a fixed t generally contains cross terms between time derivatives and spatial derivatives; if the coefficients are derived under a frozen-time assumption, an O(h) remainder may remain uncancelled and is not absorbed into the stated generator-misspecification term. This directly affects the validity of the higher-order accuracy claim and the regime map. Please supply the explicit derivation of the moment-matching coefficients (including any time-derivative handling) and the precise remainder bound.
[Empirical studies and stress-test section] The weakest assumption identified is that multi-step closed-loop trajectories permit reliable estimation of the time-dependent generator without introducing dominant new biases. The empirical gain-mismatch stress tests should be augmented with a controlled time-inhomogeneity sweep (e.g., varying the magnitude of the time derivative of the generator) to verify that the observed gains align with the predicted regime rather than being masked by the potential O(h) bias.
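The kind of controlled time-inhomogeneity sweep requested in the second comment can be prototyped on a toy system before touching the benchmarks. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses deterministic dynamics ds/dτ = (a0 + eps·τ)·s with a closed-form flow, the test function φ(s) = s, and the two-step coefficients (2, -1/2). The parameter `eps` scales the time derivative of the generator; all names (`flow`, `generator_error`, `sweep`) are hypothetical.

```python
import math

GEN2 = (2.0, -0.5)  # assumed two-step coefficients a_1, a_2


def flow(s, t, delta, a0, eps):
    """Exact state after elapsed time `delta` for ds/dtau = (a0 + eps*tau)*s, s(t) = s."""
    # Integral of (a0 + eps*tau) over [t, t + delta].
    exponent = a0 * delta + eps * (t * delta + 0.5 * delta ** 2)
    return s * math.exp(exponent)


def generator_error(coeffs, h, eps, s=1.0, t=0.5, a0=1.0):
    """Absolute error of the multi-step estimate of (L_t phi)(s) for phi(s) = s."""
    estimate = sum(
        a * (flow(s, t, j * h, a0, eps) - s) for j, a in enumerate(coeffs, start=1)
    ) / h
    exact = (a0 + eps * t) * s  # generator applied to phi at time t
    return abs(estimate - exact)


def sweep(eps_values=(0.0, 1.0, 5.0), h=0.01):
    """For each eps: one-step error, two-step error, and two-step convergence rate."""
    rows = []
    for eps in eps_values:
        err_one_step = generator_error((1.0,), h, eps)
        err_gen2 = generator_error(GEN2, h, eps)
        err_gen2_half = generator_error(GEN2, h / 2, eps)
        rows.append((eps, err_one_step, err_gen2, math.log2(err_gen2 / err_gen2_half)))
    return rows


if __name__ == "__main__":
    for eps, e1, e2, rate in sweep():
        print(f"eps={eps:.1f}  one-step={e1:.2e}  two-step={e2:.2e}  rate={rate:.2f}")
```

In this toy setting the two-step rate should stay near 2 as `eps` grows, while the one-step error increases with `eps`. That outcome says nothing about the paper's stochastic estimator, but the same sweep structure (vary the time-derivative scale, track the empirical order) is what the comment asks for.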

minor comments (1)

[Abstract] The abstract refers to a 'decision-frequency regime map' without a one-sentence gloss; a brief parenthetical description would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments raise important points about the handling of time inhomogeneity in the theoretical derivation and the scope of the empirical validation. We address each major comment below and indicate the revisions we will make.


Referee: [Main theory section (error decomposition and generator estimation)] The end-to-end error decomposition and regime map rest on the claim that multi-step moment-matching coefficients cancel lower-order truncation terms for time-inhomogeneous generators. The Taylor expansion of the transition kernel around a fixed t generally contains cross terms between time derivatives and spatial derivatives; if the coefficients are derived under a frozen-time assumption, an O(h) remainder may remain uncancelled and is not absorbed into the stated generator-misspecification term. This directly affects the validity of the higher-order accuracy claim and the regime map. Please supply the explicit derivation of the moment-matching coefficients (including any time-derivative handling) and the precise remainder bound.

Authors: We thank the referee for this precise observation. The moment-matching coefficients are obtained from the full Taylor expansion of the transition kernel that retains both the time derivatives of the generator and the spatial derivatives of the test functions; the coefficients are solved to match the first two moments exactly, which cancels all O(h) contributions including the cross terms. The resulting remainder is O(h^2) under the standing assumption that the generator is twice continuously differentiable in time and space, with the explicit bound stated in the supplementary material. To address the request for transparency we will insert the full derivation of the coefficients together with the remainder bound into the main text of the revised manuscript. revision: yes
Referee: [Empirical studies and stress-test section] The weakest assumption identified is that multi-step closed-loop trajectories permit reliable estimation of the time-dependent generator without introducing dominant new biases. The empirical gain-mismatch stress tests should be augmented with a controlled time-inhomogeneity sweep (e.g., varying the magnitude of the time derivative of the generator) to verify that the observed gains align with the predicted regime rather than being masked by the potential O(h) bias.

Authors: We agree that a controlled sweep over the degree of time inhomogeneity would provide stronger empirical corroboration of the regime map. We will extend the existing gain-mismatch stress-test suite by introducing a parameter that scales the magnitude of the time derivative of the generator, rerun the experiments, and report the resulting performance gains against the predicted decision-frequency regimes. revision: yes
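The cancellation claimed in the first response can be checked with a short expansion. The sketch below is not the paper's derivation. It assumes a time-independent test function $\varphi$, a generator $\mathcal{L}_t$ smooth enough in time and space for the expansion to hold, and two-step coefficients $a_1, a_2$; the symbols $u$, $\varphi$, and $\mathcal{L}_t$ are introduced here for illustration only.

```latex
% Conditional expectation along the true (non-frozen) dynamics
u(\tau) := \mathbb{E}\left[\varphi(s_\tau) \mid s_t = s\right], \qquad
u'(\tau) = \mathbb{E}\left[(\mathcal{L}_\tau \varphi)(s_\tau) \mid s_t = s\right], \qquad
u''(\tau) = \mathbb{E}\left[\big((\partial_\tau \mathcal{L}_\tau + \mathcal{L}_\tau^2)\varphi\big)(s_\tau) \mid s_t = s\right].

% Taylor expansion at \tau = t with step jh
u(t + jh) = \varphi(s) + jh\,(\mathcal{L}_t \varphi)(s)
  + \tfrac{(jh)^2}{2}\,\big((\partial_t \mathcal{L}_t + \mathcal{L}_t^2)\varphi\big)(s) + O(h^3).

% Weighted multi-step difference
\frac{1}{h}\sum_{j=1}^{2} a_j \left[u(t + jh) - \varphi(s)\right]
  = \Big(\sum_{j} a_j\, j\Big)(\mathcal{L}_t \varphi)(s)
  + \frac{h}{2}\Big(\sum_{j} a_j\, j^2\Big)\big((\partial_t \mathcal{L}_t + \mathcal{L}_t^2)\varphi\big)(s)
  + O(h^2).
```

With $\sum_j a_j j = 1$ and $\sum_j a_j j^2 = 0$ (for example $a_1 = 2$, $a_2 = -1/2$), the $O(h)$ term vanishes as a whole, cross term $\partial_t \mathcal{L}_t$ included, because it multiplies the same coefficient moment as $\mathcal{L}_t^2$. In this sketch the time derivative of the generator then enters only through the constant hidden in the $O(h^2)$ remainder, which is where an explicit bound would need to track it.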

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper's central contribution is a multi-step moment-matching procedure for estimating the time-dependent generator from closed-loop trajectories, followed by backward regression and an end-to-end error decomposition into misspecification, projection, pooling, finite-sample, and start-up terms. The abstract describes the estimator construction and the accompanying regime map without any equations or statements that define a target quantity in terms of its own fitted values or that rename a fitted parameter as an independent prediction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The derivation therefore rests on standard moment-matching algebra and regression analysis whose validity can be checked against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard continuous-time MDP assumptions plus the new high-order estimation technique; no new physical entities are introduced.

free parameters (1)

moment-matching coefficients
Chosen to cancel lower-order truncation terms in the generator estimate; their exact construction is part of the proposed method.

axioms (1)

domain assumption: The target value surface solves a backward parabolic equation under time-inhomogeneous dynamics.
Stated directly as the mathematical foundation for the policy-evaluation problem.

pith-pipeline@v0.9.0 · 5493 in / 1255 out tokens · 63244 ms · 2026-05-11T00:49:19.581615+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Choose coefficients $a^{(i)}$ so that $\sum_j a^{(i)}_j\, j^k = 1$ for $k = 1$ and $0$ otherwise (2.9); the weighted combination yields $GU + O(\Delta t^i)$ remainder (2.10).

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem.

End-to-end recursion with $C_0$, $C_{\mathrm{ms}}$, regime map $F_{\mathrm{ns}} \asymp \big(d\,(L_{\mu,t} + L_{\Sigma,t})\,\Delta t / M\big)^{1/3}$ (2.24).
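The cube-root exponent in the regime-map passage (2.24) is the kind of rate that a bias-variance trade-off over a temporal-pooling window produces. The heuristic below is an assumption made for illustration, not the paper's argument: it supposes that pooling $w$ adjacent grid points incurs a bias proportional to $L\, w\, \Delta t$, with $L$ standing in for $L_{\mu,t} + L_{\Sigma,t}$, and a variance proportional to $d / (M w)$, with $M$ the number of episodes. The symbols $w$, $B$, and $V$ are introduced here and do not come from the source.

```latex
% Assumed bias and variance of a pooled estimate with window w
B(w) \approx L\, w\, \Delta t, \qquad V(w) \approx \frac{d}{M w}.

% Minimize squared bias plus variance over w
\frac{d}{dw}\Big[ L^2 \Delta t^2 w^2 + \frac{d}{M w} \Big] = 0
\;\Longrightarrow\;
w_\ast^3 = \frac{d}{2 M L^2 \Delta t^2}.

% Resulting error floor
B(w_\ast)^2 + V(w_\ast) = 3 \Big( \frac{d\, L\, \Delta t}{2 M} \Big)^{2/3},
\qquad
\sqrt{B(w_\ast)^2 + V(w_\ast)} \asymp \Big( \frac{d\, L\, \Delta t}{M} \Big)^{1/3}.
```

Under these assumed scalings the optimized floor matches the form quoted from (2.24). Whether the paper's $F_{\mathrm{ns}}$ arises from this exact trade-off cannot be confirmed from the excerpt.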

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references

[1]

Proofs for Section 2. 1.1. Proof of Proposition 2.1

Let $\rho(s', \tau \mid s, t)$ denote the transition density of $s_\tau$ given $s_t = s$, and define

$$f(s', \tau) := e^{-\beta(\tau - t)}\, r(s', \tau), \qquad g(\tau) := \int_{\mathbb{R}^d} f(s', \tau)\, \rho(s', \tau \mid s, t)\, ds'.$$

Then

$$V(s, t) = \int_t^T g(\tau)\, d\tau, \qquad \tilde{V}(s, t) = \sum_{j=0}^{M_t - 1} \Delta t\, g(t + j\Delta t), \qquad M_t := \frac{T - t}{\Delta t}.$$

The Bellman baseline error is therefore the left-Riemann error for $g$. Since...
[2]

Additional Experimental Details and Supplementary Displays

The displays in this section provide the detailed benchmark summaries, start-up checks, runtime diagnostics, and mismatch diagnostics that support the main paper. They are reported here so that the main text can remain centered on the theory-driven benchmark narrative. Table 1: Detailed heavy...
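The left-Riemann characterization in the Proposition 2.1 excerpt (entry [1]) can be illustrated numerically. The sketch below assumes a constant unit reward, so that g(τ) = e^{-β(τ - t)}, which is a hypothetical choice made only to obtain a closed-form integral. The function names `left_riemann` and `bellman_style_error` are also hypothetical.

```python
import math


def left_riemann(g, t, T, n_steps):
    """Left-endpoint sum: dt * sum_{j=0}^{n_steps-1} g(t + j*dt), with dt = (T - t)/n_steps."""
    dt = (T - t) / n_steps
    return dt * sum(g(t + j * dt) for j in range(n_steps))


def bellman_style_error(n_steps, beta=1.0, t=0.0, T=1.0):
    """Error of the left-Riemann sum for g(tau) = exp(-beta*(tau - t)) (unit reward assumed)."""
    def g(tau):
        return math.exp(-beta * (tau - t))

    # Closed-form integral of g over [t, T].
    exact = (1.0 - math.exp(-beta * (T - t))) / beta
    return abs(left_riemann(g, t, T, n_steps) - exact)


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(f"n_steps={n:3d}  error={bellman_style_error(n):.3e}")
```

Doubling the number of steps should roughly halve the error, which is the first-order behavior in the grid width that the abstract attributes to the Bellman baseline.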