Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Haicheng Wang; Vin Bhaskara

arxiv: 2604.18701 · v3 · pith:FY7XKCPDnew · submitted 2026-04-20 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: intrinsic reward · curiosity · world models · prediction error · exploration · reinforcement learning · epistemic uncertainty · aleatoric uncertainty

The pith

Curiosity-Critic uses cumulative prediction error improvement as an intrinsic reward for world model training, estimated via a co-trained critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Curiosity-Critic to create better intrinsic rewards for training world models by measuring improvement in the total prediction error across all visited transitions rather than local errors alone. It derives a practical per-step reward as the gap between the immediate prediction error and the long-run error baseline for that transition, with the baseline learned online by a critic network. This directs the agent toward transitions that are still reducible while de-emphasizing irreducible stochasticity. A reader would care if this leads to quicker and more accurate learning of environment dynamics without needing prior knowledge of the environment's noise level.

Core claim

Curiosity-Critic grounds its intrinsic reward in the improvement of the cumulative prediction error objective across visited transitions. This admits a tractable per-step surrogate given by the difference between the current prediction error and the asymptotic error baseline of the current state transition. The baseline is estimated online by a learned critic that is co-trained with the world model; because the critic regresses only a single scalar, it converges early. The reward therefore stays high for learnable transitions and falls toward zero for stochastic ones, whose error approaches the baseline, thereby separating epistemic from aleatoric prediction error without an oracle noise floor.

What carries the argument

The per-step surrogate of cumulative error improvement, computed as current prediction error minus the asymptotic baseline estimated by a co-trained critic network.
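
Written out, the surrogate takes the following form. The notation is borrowed from the Figure 1 caption on this page; the two limiting cases are an illustrative reading of the abstract's description, not a derivation taken from the paper.

```latex
% Per-step intrinsic reward: pre-update error minus the critic's baseline.
\[
  r_t \;=\; e(s_t, a_t \mid \theta_t) \;-\; \phi_{t+1}(s_t, a_t)
\]
% e(s_t, a_t | \theta_t): world-model prediction error before the update
% \phi_{t+1}(s_t, a_t):   critic estimate of the asymptotic error baseline
%
% Limiting behaviour (informal):
\[
  \text{learnable transition: } e \gg \phi \;\Rightarrow\; r_t > 0,
  \qquad
  \text{purely stochastic transition: } e \approx \phi \;\Rightarrow\; r_t \approx 0 .
\]
```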

If this is right

Prior local prediction-error curiosity methods appear as special cases under particular choices of the error baseline approximation.
On a stochastic grid world, Curiosity-Critic trains faster and reaches higher final world model accuracy than prediction-error, visitation-count, and Random Network Distillation approaches.
Exploration is redirected to learnable transitions because the reward collapses toward zero for stochastic transitions, whose error approaches the baseline.
The critic provides a reliable estimate of the error floor without requiring oracle knowledge.
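
The first point above, that earlier methods correspond to particular baseline choices, can be made concrete with a small sketch. It assumes the reward has the form "error before the update minus a baseline". The function names and the mapping of each baseline to a named family of methods are illustrative assumptions, not the paper's own taxonomy.

```python
from typing import Callable

# A baseline maps (error before update, error after update) to a scalar.
Baseline = Callable[[float, float], float]


def intrinsic_reward(err_before: float, err_after: float, baseline: Baseline) -> float:
    """Reward = pre-update prediction error minus the chosen baseline."""
    return err_before - baseline(err_before, err_after)


def zero_baseline(err_before: float, err_after: float) -> float:
    # Assumption: a zero baseline leaves the raw prediction error as the reward.
    return 0.0


def post_update_baseline(err_before: float, err_after: float) -> float:
    # Assumption: using the single post-update error as the baseline gives a
    # one-step error-reduction ("learning progress" style) reward.
    return err_after


def make_fixed_baseline(estimate: float) -> Baseline:
    # Stand-in for a critic output: a scalar estimate of the long-run error.
    def baseline(err_before: float, err_after: float) -> float:
        return estimate
    return baseline


# Hypothetical numbers, chosen only to show how the three variants differ.
raw = intrinsic_reward(0.40, 0.30, zero_baseline)
progress = intrinsic_reward(0.40, 0.30, post_update_baseline)
critic_style = intrinsic_reward(0.40, 0.30, make_fixed_baseline(0.25))
```

Only the baseline changes between the three calls; the rest of the reward computation is shared, which is the sense in which one formulation can contain the others.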

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be tested in more complex environments like continuous control tasks to see if the critic still converges reliably.
It might connect to uncertainty estimation techniques in Bayesian world models by providing a simple scalar baseline.
Future work could explore whether the same critic can be used for other intrinsic motivation signals beyond prediction error.

Load-bearing premise

The learned critic converges well before the world model saturates, providing a reliable online estimate of the asymptotic error baseline without oracle knowledge of the noise floor.

What would settle it

The claim would be undermined if the stochastic grid world experiments showed that Curiosity-Critic trains no faster and reaches no higher accuracy than standard prediction-error curiosity, or if the critic failed to provide a stable baseline early in training.

Figures

Figures reproduced from arXiv: 2604.18701 by Haicheng Wang, Vin Bhaskara.

**Figure 1.** The Curiosity-Critic architecture. Solid black arrows: forward-pass flow. Dashed gray arrows: backward-pass training signals. Steps: (1) world model computes error $e(s_t, a_t \mid \theta_t)$; (2) world model updates to $\theta_{t+1}$ on $s_{t+1}$; (3) critic regresses onto the post-update error $e(s_t, a_t \mid \theta_{t+1})$; (4) critic outputs baseline $\phi_{t+1}(s_t, a_t)$; (5) reward $r_t = e(s_t, a_t \mid \theta_t) - \phi_{t+1}(s_t, a_t)$; (6) the curiosity agent updates $\pi_t$ …

**Figure 2.** Mean L2 prediction error on deterministic cells versus environment steps, averaged over five …

**Figure 3.** Fraction of environment steps spent in the deterministic region (columns 0–14) over training, …

**Figure 4.** Mean neural critic estimate over all deterministic cells (left) and all stochastic cells (right) …

**Figure 5.** Snapshot of agent trajectories at environment step 30,000. Each panel shows one method; …

**Figure 6.** Seed-averaged visitation heatmaps at the end of training (final 5,000 of 35,000 steps), …

**Figure 7.** Seed-averaged visitation heatmaps at three training windows: Early (steps 0–5k), Mid …

Original abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
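
The update order described in the Figure 1 caption can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the authors' implementation: the world model and critic are one scalar per transition rather than neural networks, there are only two hypothetical transitions (`det` always yields 1.0, `noisy` yields 0 or 1 with equal probability), both are visited equally so the policy update of step (6) is omitted, and the learning rates are arbitrary.

```python
import random


def run_toy(steps=2000, lr_model=0.05, lr_critic=0.2, seed=0):
    """Tabular toy of steps (1)-(5); returns per-transition logs of
    (error_before, baseline, reward)."""
    rng = random.Random(seed)
    keys = ("det", "noisy")
    model = {k: 0.0 for k in keys}    # predicted outcome per transition
    critic = {k: 0.0 for k in keys}   # estimated long-run error per transition
    log = {k: [] for k in keys}
    for _ in range(steps):
        for k in keys:
            # Hypothetical environment: fixed outcome vs. fair coin flip.
            target = 1.0 if k == "det" else float(rng.random() < 0.5)
            err_before = (model[k] - target) ** 2             # (1) pre-update error
            model[k] += lr_model * (target - model[k])        # (2) world-model update
            err_after = (model[k] - target) ** 2              #     post-update error
            critic[k] += lr_critic * (err_after - critic[k])  # (3) critic regression
            baseline = critic[k]                              # (4) critic output
            reward = err_before - baseline                    # (5) intrinsic reward
            log[k].append((err_before, baseline, reward))
    return log


def tail_mean(rows, column, n=500):
    """Mean of one logged column over the last n steps."""
    tail = rows[-n:]
    return sum(row[column] for row in tail) / len(tail)


log = run_toy()
first_det_reward = log["det"][0][2]
late_det_reward = tail_mean(log["det"], 2)
late_noisy_error = tail_mean(log["noisy"], 0)
late_noisy_reward = tail_mean(log["noisy"], 2)
```

In this toy the deterministic transition earns a large reward at first and almost none once it is learned. The noisy transition keeps a high raw error, but its reward stays close to zero because the critic's baseline tracks that error. The residual on the noisy transition is small rather than exactly zero, since the toy model keeps chasing the noise.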

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Curiosity-Critic reframes intrinsic rewards around cumulative prediction-error improvement via a co-trained critic baseline, but the single grid-world domain does not verify that the critic actually converges early enough to deliver the claimed separation.

The core idea is to replace raw prediction error with the improvement in total cumulative error across visited transitions. The authors derive a per-step surrogate that subtracts a learned asymptotic baseline from the current error, and they train a critic to estimate that baseline online. Older prediction-error curiosity methods then appear as special cases, for example where the baseline is zero or fixed in advance. That unifying angle is the clearest contribution here, and the math for the surrogate looks straightforward on paper.

The claim that the critic converges faster than the world model, letting the reward ignore irreducible noise without an oracle, is a reasonable way to target epistemic uncertainty in stochastic settings. The abstract says this works on a stochastic grid world, beating the usual baselines on speed and final model accuracy. That is the extent of the positive evidence.

The main soft spot is that nothing in the reported results actually checks the timing assumption. There are no learning curves comparing critic loss to world-model error, no ablations on critic capacity or learning rate, and no error bars or statistical tests. If the critic keeps improving in step with the model, the reward signal collapses and the separation fails. A single domain with no protocol details makes it hard to judge robustness.

This is for researchers already working on intrinsic rewards and world-model training in RL. Someone looking for a new reward formulation might borrow the baseline trick, but the current write-up is too light on verification to build on directly. Send it to peer review so the experiments can be expanded and the convergence claim tested properly.

Referee Report

3 major / 2 minor

Summary. The paper introduces Curiosity-Critic, an intrinsic reward for world model training grounded in improvement of cumulative prediction error across visited transitions. It derives a tractable per-step surrogate as the difference between current prediction error and an asymptotic error baseline estimated online by a co-trained critic. The method is claimed to separate epistemic from aleatoric error, with prior prediction-error formulations emerging as special cases of baseline approximations. Experiments on a stochastic grid world report faster training and higher final accuracy than prediction-error, visitation-count, and Random Network Distillation baselines.

Significance. If the central assumption holds, the approach supplies a principled online mechanism for directing exploration toward reducible uncertainty in stochastic settings, which could enhance sample efficiency for model-based RL without requiring oracle noise-floor knowledge. The clean derivation of the surrogate and its unification of prior methods constitute a conceptual contribution.

major comments (3)

[Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
[Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
[Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.

minor comments (2)

[Abstract] The abstract and experimental description omit concrete environment parameters (grid size, transition stochasticity level, episode length) needed to reproduce the grid-world results.
Consider adding a dedicated figure or table that plots critic convergence trajectory against world-model error reduction to directly address the temporal-separation assumption.
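
The temporal-separation check requested above could be quantified along these lines. This is only a sketch: `convergence_step`, its tolerance rule, and the two synthetic curves are hypothetical and are not taken from the paper or its experiments.

```python
from typing import Sequence


def convergence_step(curve: Sequence[float], tol: float = 0.05, tail: int = 10) -> int:
    """First step after which the curve stays within a band around its final level.

    The final level is the mean of the last `tail` points; the band half-width
    is `tol` times the curve's overall range.
    """
    window = curve[-tail:]
    final = sum(window) / len(window)
    band = tol * (max(curve) - min(curve))
    last_outside = -1
    for i, value in enumerate(curve):
        if abs(value - final) > band:
            last_outside = i
    return last_outside + 1


# Synthetic placeholder curves, NOT experimental data:
# a critic estimate that settles quickly and a model error that decays slowly.
critic_curve = [0.25 * (1 - 0.5 ** k) for k in range(50)]
model_error_curve = [0.9 ** k for k in range(50)]

critic_step = convergence_step(critic_curve)
model_step = convergence_step(model_error_curve)
critic_converges_first = critic_step < model_step
```

Reporting the two step counts per seed, alongside the curves themselves, would give a direct number for how much earlier the critic settles than the world model.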

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree that revisions are warranted and outlining the specific changes we will make to strengthen the manuscript.

Referee: [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.

Authors: We agree that empirical support for the critic's faster convergence is necessary to substantiate the epistemic/aleatoric separation. In the revised manuscript we will add side-by-side learning curves of critic loss and world-model prediction error (generated from the same experimental runs already performed) to the experiments section and reference them from the abstract and method discussion.

revision: yes

Referee: [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.

Authors: We accept that the current experimental reporting is insufficient for rigorous evaluation. The revised version will report all results with error bars over 5 independent random seeds, include ablations on critic learning rate and training frequency, and add statistical significance tests (paired t-tests with p-values) comparing Curiosity-Critic against the baselines.

revision: yes

Referee: [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.

Authors: This concern is well-founded. While the derivation shows that the critic solves a simpler scalar regression task, the manuscript currently offers only the convergence claim. In revision we will expand the method section with (i) a short theoretical argument bounding the rate difference and (ii) empirical plots of the intrinsic reward magnitude over training, confirming it remains informative rather than collapsing. We will also note a simple safeguard (periodic low-rate critic warm-up) that can be activated if needed.

revision: partial
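
The paired test promised in the second response could look roughly like the sketch below for five seeds. The per-seed numbers are made-up placeholders, not results from the paper, and the helper name is hypothetical. Instead of computing a p-value, the sketch compares the statistic against the standard two-sided 5% critical value of Student's t with 4 degrees of freedom (2.776).

```python
import math
import statistics

# Two-sided 5% critical value of Student's t for df = 4 (five paired seeds).
T_CRIT_DF4 = 2.776


def paired_t_statistic(a, b):
    """t statistic of the per-seed differences a[i] - b[i].

    Raises ZeroDivisionError if all differences are identical.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


# PLACEHOLDER final-error values per seed (illustrative only, not real results).
method_errors = [0.10, 0.12, 0.11, 0.09, 0.13]
baseline_errors = [0.20, 0.19, 0.22, 0.18, 0.21]

t_stat = paired_t_statistic(method_errors, baseline_errors)
significant_at_5_percent = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed matters here: the test works on per-seed differences, so seed-to-seed variation shared by both methods cancels out.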

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper defines its intrinsic reward explicitly as the per-step surrogate (current prediction error minus critic-estimated asymptotic baseline) and asserts that the co-trained critic converges faster as an empirical property of regressing a single scalar. This is a modeling choice and assumption rather than a mathematical reduction in which the reward or separation of epistemic/aleatoric error is forced to equal its inputs by construction. No equations are provided that equate the claimed cumulative improvement to the fitted critic output tautologically, and the unification with prior methods is framed as special cases of baseline approximation rather than self-referential. The derivation therefore retains independent content outside any fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Abstract provides no explicit list of axioms or free parameters; the critic is a learned component whose training details are unspecified. The asymptotic baseline functions as a fitted quantity whose independent evidence is not supplied.

free parameters (1)

critic convergence rate
The claim that the critic converges before the world model requires an implicit rate or schedule that is chosen or fitted during training.

invented entities (1)

asymptotic error baseline (no independent evidence)
purpose: Provides the tractable per-step surrogate for cumulative prediction-error improvement
Introduced as a learned scalar without external validation or falsifiable prediction outside the training loop.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards
cs.LG 2026-06 unverdicted novelty 5.0

OHIRL separates next-packet prediction, residual dynamics, a fixed recovery-positive evaluator, and policy learning to achieve high sign and action accuracy in reward-free perceptual tasks where standard reward proxies fail.