Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Haicheng Wang; Vin Bhaskara

arXiv: 2604.18701 · v2 · submitted 2026-04-20 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3

keywords: intrinsic reward · curiosity · world models · prediction error · exploration · reinforcement learning · epistemic uncertainty · aleatoric uncertainty

The pith

Curiosity-Critic uses cumulative prediction error improvement as an intrinsic reward for world model training, estimated via a co-trained critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Curiosity-Critic to create better intrinsic rewards for training world models by measuring improvement in the total prediction error across all visited transitions rather than local errors alone. It derives a practical per-step reward as the gap between the immediate prediction error and the long-run error baseline for that transition, with the baseline learned online by a critic network. This directs the agent toward transitions that are still reducible while de-emphasizing irreducible stochasticity. A reader would care if this leads to quicker and more accurate learning of environment dynamics without needing prior knowledge of the environment's noise level.

Core claim

Curiosity-Critic grounds its intrinsic reward in the improvement of the cumulative prediction error objective across visited transitions. This admits a tractable per-step surrogate given by the difference between the current prediction error and the asymptotic error baseline of the current state transition. The baseline is estimated online by a learned critic co-trained with the world model. Because the critic regresses only a single scalar, it converges early, so the reward favors learnable transitions and collapses toward the baseline for stochastic ones, thereby separating epistemic from aleatoric prediction error without an oracle noise floor.
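Written out in the notation of the paper's Figure 1 caption, the surrogate and the critic's regression target look roughly as follows. The squared-error form of the critic loss is an assumption made here for illustration; this page does not state which loss the critic uses.

```latex
% Per-step intrinsic reward (Figure 1, step 5)
r_t = e(s_t, a_t \mid \theta_t) - \phi_{t+1}(s_t, a_t)

% Critic regression onto the post-update error (Figure 1, step 3);
% the squared loss is an assumption, not taken from the paper
\mathcal{L}_{\text{critic}} = \bigl(\phi(s_t, a_t) - e(s_t, a_t \mid \theta_{t+1})\bigr)^2
```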

What carries the argument

The per-step surrogate of cumulative error improvement, computed as current prediction error minus the asymptotic baseline estimated by a co-trained critic network.
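To make the surrogate concrete, here is a minimal sketch that swaps the learned critic for an oracle error floor, which is only possible because the toy is fully known. Everything in it is hypothetical: the two transition keys, the scalar tabular "world model", the learning rate, and the function names `surrogate_reward` and `simulate` are illustrative choices, not the paper's setup.

```python
import random


def surrogate_reward(err_now, asymptotic_err):
    """Current prediction error minus the asymptotic error baseline."""
    return err_now - asymptotic_err


def simulate(steps=400, lr=0.05, seed=0):
    """Toy with one deterministic and one purely random transition.

    Assumption: the baseline is an ORACLE floor, standing in for what
    the learned critic is meant to estimate online.
    """
    rng = random.Random(seed)
    pred = {"det": 0.0, "noisy": 0.0}    # one scalar prediction per transition
    floor = {"det": 0.0, "noisy": 1.0}   # best achievable squared error in this toy
    rewards = {"det": [], "noisy": []}
    for _ in range(steps):
        for key in ("det", "noisy"):
            # "det" always yields 1.0; "noisy" yields +1 or -1 with equal odds.
            target = 1.0 if key == "det" else rng.choice([-1.0, 1.0])
            err = (pred[key] - target) ** 2
            rewards[key].append(surrogate_reward(err, floor[key]))
            # Plain gradient step on the squared error.
            pred[key] += lr * (target - pred[key])
    return rewards


if __name__ == "__main__":
    out = simulate()
    print("early mean reward, det  :", sum(out["det"][:20]) / 20)
    print("overall mean reward, noisy:", sum(out["noisy"]) / len(out["noisy"]))
```

In this toy the deterministic transition earns a large reward while it is still being learned and almost none afterwards, while the random transition hovers near zero throughout, which is the behaviour the critic-estimated baseline is intended to approximate without oracle access.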

If this is right

Prior local prediction-error curiosity methods appear as special cases under particular choices of the error baseline approximation.
On a stochastic grid world, Curiosity-Critic trains faster and reaches higher final world model accuracy than prediction-error, visitation-count, and Random Network Distillation approaches.
Exploration is redirected to learnable transitions as the reward collapses to the baseline for stochastic transitions.
The critic provides a reliable estimate of the error floor without requiring oracle knowledge.
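Two limiting cases make the first and third points concrete. Both are a reading of the summary above under stated assumptions, not equations quoted from the paper: the first assumes a baseline fixed at zero, the second assumes the critic has already matched the error of a purely stochastic transition.

```latex
% Baseline fixed at zero: the reward reduces to raw prediction error
\phi \equiv 0 \;\Longrightarrow\; r_t = e(s_t, a_t \mid \theta_t)

% Stochastic transition whose error already sits at the critic's estimate:
% the reward carries almost no signal
e(s_t, a_t \mid \theta_t) \approx \phi_{t+1}(s_t, a_t) \;\Longrightarrow\; r_t \approx 0
```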

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach could be tested in more complex environments like continuous control tasks to see if the critic still converges reliably.
It might connect to uncertainty estimation techniques in Bayesian world models by providing a simple scalar baseline.
Future work could explore whether the same critic can be used for other intrinsic motivation signals beyond prediction error.

Load-bearing premise

The learned critic converges well before the world model saturates, providing a reliable online estimate of the asymptotic error baseline without oracle knowledge of the noise floor.

What would settle it

A direct counterexample would be if the stochastic grid world experiments reveal that Curiosity-Critic trains no faster and achieves no higher accuracy than standard prediction-error curiosity, or if the critic fails to provide a stable baseline early in training.

Figures

Figures reproduced from arXiv: 2604.18701 by Haicheng Wang, Vin Bhaskara.

**Figure 1.** The Curiosity-Critic architecture. Solid black arrows: forward-pass flow. Dashed gray arrows: backward-pass training signals. Steps: (1) world model computes error $e(s_t, a_t \mid \theta_t)$; (2) world model updates to $\theta_{t+1}$ on $s_{t+1}$; (3) critic regresses onto the post-update error $e(s_t, a_t \mid \theta_{t+1})$; (4) critic outputs baseline $\phi_{t+1}(s_t, a_t)$; (5) reward $r_t = e(s_t, a_t \mid \theta_t) - \phi_{t+1}(s_t, a_t)$; (6) the curiosity agent updates $\pi_t$ …

**Figure 2.** Mean L2 prediction error on deterministic cells versus environment steps, averaged over five …

**Figure 3.** Fraction of environment steps spent in the deterministic region (columns 0–14) over training, …

**Figure 4.** Mean neural critic estimate over all deterministic cells (left) and all stochastic cells (right) …

**Figure 5.** Snapshot of agent trajectories at environment step 30,000. Each panel shows one method; …

**Figure 6.** Seed-averaged visitation heatmaps at the end of training (final 5,000 of 35,000 steps), …

**Figure 7.** Seed-averaged visitation heatmaps at three training windows: Early (steps 0–5k), Mid …
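The ordering of steps (1) to (5) in the Figure 1 caption can be sketched on a tabular toy as below. This is a sketch under loud assumptions: the scalar per-transition model and critic, the learning rates, and the name `curiosity_critic_step` are all hypothetical, and step (6), the policy update, is left out.

```python
def curiosity_critic_step(pred, baseline, key, target,
                          model_lr=0.05, critic_lr=0.2):
    """One update following the step order in the Figure 1 caption.

    pred and baseline are dicts holding one scalar per transition key;
    both are hypothetical stand-ins for the world model and the critic.
    """
    # (1) world model computes its error before any update
    err_before = (pred[key] - target) ** 2
    # (2) world model takes a gradient step on the observed outcome
    pred[key] += model_lr * (target - pred[key])
    # (3) critic regresses onto the post-update error
    err_after = (pred[key] - target) ** 2
    baseline[key] += critic_lr * (err_after - baseline[key])
    # (4) critic outputs its updated baseline
    phi = baseline[key]
    # (5) intrinsic reward: pre-update error minus the baseline
    reward = err_before - phi
    # (6) the policy update would consume `reward`; omitted in this sketch
    return reward, err_before, err_after


if __name__ == "__main__":
    pred, baseline = {"a": 0.0}, {"a": 0.0}
    print(curiosity_critic_step(pred, baseline, "a", 1.0))
```

Because this toy model takes its gradient step on the very sample it is scored on, the post-update error is always slightly lower than the pre-update error whenever the latter is nonzero. The sketch therefore shows the order of operations only and should not be read as reproducing the paper's results.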

Original abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Curiosity-Critic reframes intrinsic rewards around cumulative prediction-error improvement via a co-trained critic baseline, but the single grid-world domain does not verify that the critic actually converges early enough to deliver the claimed separation.

The core idea is to replace raw prediction error with the improvement in total cumulative error across visited transitions. They derive a per-step surrogate that subtracts a learned asymptotic baseline from the current error, and they train a critic to estimate that baseline online. Older methods like basic prediction-error curiosity and RND then appear as special cases where the baseline is zero or fixed in advance. That unifying angle is the clearest contribution here, and the math for the surrogate looks straightforward on paper.

The claim that the critic converges faster than the world model, letting the reward ignore irreducible noise without an oracle, is a reasonable way to target epistemic uncertainty in stochastic settings. The abstract says this works on a stochastic grid world, beating the usual baselines on speed and final model accuracy. That is the extent of the positive evidence.

The main soft spot is that nothing in the reported results actually checks the timing assumption. There are no learning curves comparing critic loss to world-model error, no ablations on critic capacity or learning rate, and no error bars or statistical tests. If the critic keeps improving in step with the model, the reward signal collapses and the separation fails. A single domain with no protocol details makes it hard to judge robustness.

This is for researchers already working on intrinsic rewards and world-model training in RL. Someone looking for a new reward formulation might borrow the baseline trick, but the current write-up is too light on verification to build on directly. Send it to peer review so the experiments can be expanded and the convergence claim tested properly.

Referee Report

3 major / 2 minor

Summary. The paper introduces Curiosity-Critic, an intrinsic reward for world model training grounded in improvement of cumulative prediction error across visited transitions. It derives a tractable per-step surrogate as the difference between current prediction error and an asymptotic error baseline estimated online by a co-trained critic. The method is claimed to separate epistemic from aleatoric error, with prior prediction-error formulations emerging as special cases of baseline approximations. Experiments on a stochastic grid world report faster training and higher final accuracy than prediction-error, visitation-count, and Random Network Distillation baselines.

Significance. If the central assumption holds, the approach supplies a principled online mechanism for directing exploration toward reducible uncertainty in stochastic settings, which could enhance sample efficiency for model-based RL without requiring oracle noise-floor knowledge. The clean derivation of the surrogate and its unification of prior methods constitute a conceptual contribution.

major comments (3)

[Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
[Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
[Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.

minor comments (2)

[Abstract] The abstract and experimental description omit concrete environment parameters (grid size, transition stochasticity level, episode length) needed to reproduce the grid-world results.
Consider adding a dedicated figure or table that plots critic convergence trajectory against world-model error reduction to directly address the temporal-separation assumption.
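One way to turn the temporal-separation question into a number is to compare when each curve settles. The sketch below is an illustration only: defining "converged" as staying within a tolerance of the final value is an assumption made here, and the function names and the example curves are hypothetical rather than taken from the paper.

```python
def settling_step(curve, tol):
    """First index from which the curve stays within tol of its final value."""
    final = curve[-1]
    idx = len(curve) - 1
    # Walk backwards while the preceding point is still inside the band.
    while idx > 0 and abs(curve[idx - 1] - final) <= tol:
        idx -= 1
    return idx


def critic_leads(critic_curve, model_curve, tol):
    """True if the critic estimate settles strictly before the model error."""
    return settling_step(critic_curve, tol) < settling_step(model_curve, tol)


if __name__ == "__main__":
    # Placeholder curves for illustration; not data from the paper.
    critic = [0.0, 0.8, 1.0, 1.0, 1.0, 1.0]
    model = [2.0, 1.5, 1.0, 0.6, 0.3, 0.3]
    print(settling_step(critic, 0.1), settling_step(model, 0.1))
    print(critic_leads(critic, model, 0.1))
```

Applied per seed to logged critic estimates and world-model errors, a check of this kind would give a direct, if crude, answer to whether the critic settles first.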

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree that revisions are warranted and outlining the specific changes we will make to strengthen the manuscript.

Referee: [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.

Authors: We agree that empirical support for the critic's faster convergence is necessary to substantiate the epistemic/aleatoric separation. In the revised manuscript we will add side-by-side learning curves of critic loss and world-model prediction error (generated from the same experimental runs already performed) to the experiments section and reference them from the abstract and method discussion.

Revision: yes

Referee: [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.

Authors: We accept that the current experimental reporting is insufficient for rigorous evaluation. The revised version will report all results with error bars over 5 independent random seeds, include ablations on critic learning rate and training frequency, and add statistical significance tests (paired t-tests with p-values) comparing Curiosity-Critic against the baselines.

Revision: yes

Referee: [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.

Authors: This concern is well-founded. While the derivation shows that the critic solves a simpler scalar regression task, the manuscript currently offers only the convergence claim. In revision we will expand the method section with (i) a short theoretical argument bounding the rate difference and (ii) empirical plots of the intrinsic reward magnitude over training, confirming it remains informative rather than collapsing. We will also note a simple safeguard (periodic low-rate critic warm-up) that can be activated if needed.

Revision: partial
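For the paired t-tests over five seeds mentioned in the second response, the test statistic itself needs nothing beyond the standard library. The sketch below is illustrative: the function name `paired_t_statistic` is hypothetical, the per-seed numbers are made-up placeholders rather than results from the paper, and it compares against a fixed critical value instead of computing a p-value.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom,
# i.e. the threshold for five paired seeds.
T_CRIT_DF4 = 2.776


def paired_t_statistic(xs, ys):
    """t statistic for paired samples, e.g. one value per shared seed."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Placeholder final errors per seed; NOT results from the paper.
    method_a = [0.10, 0.12, 0.11, 0.09, 0.13]
    method_b = [0.20, 0.25, 0.19, 0.22, 0.24]
    t = paired_t_statistic(method_a, method_b)
    print("t =", t, "significant at 5%:", abs(t) > T_CRIT_DF4)
```

Pairing by seed matters here because runs that share a seed share much of their variance, so the test is applied to per-seed differences rather than to the two groups independently.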

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper defines its intrinsic reward explicitly as the per-step surrogate (current prediction error minus critic-estimated asymptotic baseline) and asserts that the co-trained critic converges faster as an empirical property of regressing a single scalar. This is a modeling choice and assumption rather than a mathematical reduction in which the reward or separation of epistemic/aleatoric error is forced to equal its inputs by construction. No equations are provided that equate the claimed cumulative improvement to the fitted critic output tautologically, and the unification with prior methods is framed as special cases of baseline approximation rather than self-referential. The derivation therefore retains independent content outside any fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Abstract provides no explicit list of axioms or free parameters; the critic is a learned component whose training details are unspecified. The asymptotic baseline functions as a fitted quantity whose independent evidence is not supplied.

free parameters (1)

critic convergence rate
The claim that the critic converges before the world model requires an implicit rate or schedule that is chosen or fitted during training.

invented entities (1)

asymptotic error baseline (no independent evidence)
purpose: Provides the tractable per-step surrogate for cumulative prediction-error improvement
Introduced as a learned scalar without external validation or falsifiable prediction outside the training loop.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

[1] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems, 29, 2016.

[2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013.

[3] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.

[4] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. International conference on learning representations, 2019.

[5] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.

[6] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29, 2016.

[7] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.

[8] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE conference on computational intelligence and games (CIG), pages 1–8. IEEE, 2016.

[9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[10] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:108, 2007.

[11] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–

[12] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019.

[13] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991.

[14] Jürgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pages 1458–1463, 1991.

[15] Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010.

[16] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

[17] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990.

Appendix A, Learnable vs Unlearnable Decomposition: Let $D_t^+$ and $D_t^-$ denote the subsets of learnable and highly noisy state transitions in the transition history $D_t$ up to...