Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3
The pith
Curiosity-Critic uses cumulative prediction error improvement as an intrinsic reward for world model training, estimated via a co-trained critic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Curiosity-Critic grounds its intrinsic reward in the improvement of the cumulative prediction error objective across visited transitions. This admits a tractable per-step surrogate given by the difference between the current prediction error and the asymptotic error baseline of the current state transition. The baseline is estimated online by a learned critic co-trained with the world model that regresses a single scalar and converges early, allowing the reward to favor learnable transitions and approach the baseline for stochastic ones, thereby separating epistemic from aleatoric prediction error without an oracle noise floor.
What carries the argument
The per-step surrogate of cumulative error improvement, computed as current prediction error minus the asymptotic baseline estimated by a co-trained critic network.
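In symbols, the surrogate described above can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $e_t$ is the world model's prediction error on the current transition $(s_t, a_t, s_{t+1})$, $f_\theta$ is the world model, $\ell$ is its loss, and $\hat{b}_\phi$ is the critic's estimate of that transition's asymptotic error.

```latex
r_t \;=\; e_t \;-\; \hat{b}_\phi(s_t, a_t),
\qquad
e_t \;=\; \ell\big(f_\theta(s_t, a_t),\; s_{t+1}\big)
```

Under this reading, a transition whose error has already reached its floor has $e_t \approx \hat{b}_\phi(s_t, a_t)$, so the difference shrinks toward zero. Fixing $\hat{b}_\phi \equiv 0$ gives $r_t = e_t$, the plain prediction-error reward, which is one natural candidate for the special cases the abstract mentions.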
If this is right
- Prior local prediction-error curiosity methods appear as special cases under particular choices of the error baseline approximation.
- On a stochastic grid world, Curiosity-Critic trains faster and reaches higher final world model accuracy than prediction-error, visitation-count, and Random Network Distillation approaches.
- Exploration is redirected to learnable transitions as the reward collapses to the baseline for stochastic transitions.
- The critic provides a reliable estimate of the error floor without requiring oracle knowledge.
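The separation claimed in the points above can be illustrated with a small numeric sketch. It assumes a squared-error loss on one-hot next-state predictions and a hypothetical environment with K = 4 possible next states. Neither choice comes from the paper, and the baselines are supplied by hand rather than learned by a critic.

```python
def squared_error(pred, target_index):
    """Squared error between a predicted distribution and a one-hot target."""
    return sum(
        (p - (1.0 if i == target_index else 0.0)) ** 2
        for i, p in enumerate(pred)
    )


def surrogate_reward(pred_error, baseline):
    """Per-step surrogate: current prediction error minus the error baseline."""
    return pred_error - baseline


K = 4  # hypothetical number of possible next states (assumption)
uniform = [1.0 / K] * K

# Learnable (deterministic) transition seen by an untrained model that still
# predicts uniformly. A perfect model could reach zero error, so the
# asymptotic baseline is taken to be 0.
err_learnable = squared_error(uniform, target_index=2)
r_learnable = surrogate_reward(err_learnable, baseline=0.0)

# Stochastic transition with K equally likely outcomes. The uniform prediction
# already minimizes expected squared error, which equals 1 - 1/K.
err_stochastic = squared_error(uniform, target_index=2)
r_stochastic = surrogate_reward(err_stochastic, baseline=1.0 - 1.0 / K)

print("learnable:  error", err_learnable, "reward", r_learnable)
print("stochastic: error", err_stochastic, "reward", r_stochastic)
```

Both transitions produce the same raw prediction error, so a reward based on raw error alone cannot tell them apart. Only the baseline term distinguishes the reducible case from the irreducible one.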
Where Pith is reading between the lines
- This approach could be tested in more complex environments like continuous control tasks to see if the critic still converges reliably.
- It might connect to uncertainty estimation techniques in Bayesian world models by providing a simple scalar baseline.
- Future work could explore whether the same critic can be used for other intrinsic motivation signals beyond prediction error.
Load-bearing premise
The learned critic converges well before the world model saturates, providing a reliable online estimate of the asymptotic error baseline without oracle knowledge of the noise floor.
What would settle it
The claim would be undermined if the stochastic grid world experiments showed that Curiosity-Critic trains no faster and reaches no higher accuracy than standard prediction-error curiosity, or if the critic failed to provide a stable baseline early in training.
Original abstract
Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Curiosity-Critic, an intrinsic reward for world model training grounded in improvement of cumulative prediction error across visited transitions. It derives a tractable per-step surrogate as the difference between current prediction error and an asymptotic error baseline estimated online by a co-trained critic. The method is claimed to separate epistemic from aleatoric error, with prior prediction-error formulations emerging as special cases of baseline approximations. Experiments on a stochastic grid world report faster training and higher final accuracy than prediction-error, visitation-count, and Random Network Distillation baselines.
Significance. If the central assumption holds, the approach supplies a principled online mechanism for directing exploration toward reducible uncertainty in stochastic settings, which could enhance sample efficiency for model-based RL without requiring oracle noise-floor knowledge. The clean derivation of the surrogate and its unification of prior methods constitute a conceptual contribution.
major comments (3)
- [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
- [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
- [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.
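The lockstep concern in the last major comment can be made concrete with a toy simulation. Everything here is an assumption for illustration: the world model's error on a learnable transition is modeled as an exponential decay, and the critic is replaced by an exponential moving average (EMA) that chases the current error. This stand-in is not the paper's critic. It only shows what the reward looks like if the baseline tracks the current error instead of the asymptotic one.

```python
import math


def world_model_error(step, floor=0.0, tau=50.0):
    """Hypothetical error curve: exponential decay from 1.0 toward `floor`."""
    return floor + (1.0 - floor) * math.exp(-step / tau)


def rewards_with_tracking_critic(alpha, steps=200):
    """Reward = error - baseline, where the baseline is an EMA of the error.

    `alpha` is the EMA step size of the stand-in critic (an assumption).
    """
    baseline = world_model_error(0)  # start at the first observed error
    rewards = []
    for t in range(steps):
        err = world_model_error(t)
        rewards.append(err - baseline)
        baseline += alpha * (err - baseline)  # baseline chases current error
    return rewards


def rewards_with_ideal_baseline(steps=200, floor=0.0):
    """Reward when the baseline equals the true asymptotic error `floor`."""
    return [world_model_error(t) - floor for t in range(steps)]


lockstep = rewards_with_tracking_critic(alpha=0.5)
ideal = rewards_with_ideal_baseline()

for t in (1, 10, 50):
    print(t, round(ideal[t], 4), round(lockstep[t], 4))
```

In this toy setting the tracking baseline leaves the learnable transition with a reward that is never positive, while the ideal baseline keeps the reward large until the error has actually been reduced. A plot of reward magnitude over training would show whether the real critic behaves more like the second case.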
minor comments (2)
- [Abstract] The abstract and experimental description omit concrete environment parameters (grid size, transition stochasticity level, episode length) needed to reproduce the grid-world results.
- Consider adding a dedicated figure or table that plots critic convergence trajectory against world-model error reduction to directly address the temporal-separation assumption.
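One way to quantify the temporal-separation assumption alongside such a plot is to compare the step at which each curve settles. The sketch below is a hypothetical helper, not something the paper defines, and the two curves are synthetic placeholders rather than measured results.

```python
def convergence_step(curve, rel_tol=0.05):
    """Return the first index from which the curve stays within
    rel_tol (relative) of its final value."""
    final = curve[-1]
    band = rel_tol * max(abs(final), 1e-12)
    step = len(curve) - 1
    # Walk backwards and stop at the first point that leaves the band.
    for i in range(len(curve) - 1, -1, -1):
        if abs(curve[i] - final) > band:
            break
        step = i
    return step


# Synthetic placeholder curves (assumptions, not data from the paper).
critic_loss = [1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
world_model_error = [1.0, 0.9, 0.8, 0.6, 0.4, 0.3, 0.25, 0.25]

critic_step = convergence_step(critic_loss)
model_step = convergence_step(world_model_error)
print("critic settles at step", critic_step)
print("world model settles at step", model_step)
print("critic converges first:", critic_step < model_step)
```

Reporting this gap per seed, rather than for a single run, would show how robust the ordering is.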
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree that revisions are warranted and outlining the specific changes we will make to strengthen the manuscript.
- Referee: [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
  Authors: We agree that empirical support for the critic's faster convergence is necessary to substantiate the epistemic/aleatoric separation. In the revised manuscript we will add side-by-side learning curves of critic loss and world-model prediction error (generated from the same experimental runs already performed) to the experiments section and reference them from the abstract and method discussion.
  revision: yes
- Referee: [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
  Authors: We accept that the current experimental reporting is insufficient for rigorous evaluation. The revised version will report all results with error bars over 5 independent random seeds, include ablations on critic learning rate and training frequency, and add statistical significance tests (paired t-tests with p-values) comparing Curiosity-Critic against the baselines.
  revision: yes
- Referee: [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.
  Authors: This concern is well-founded. While the derivation shows that the critic solves a simpler scalar regression task, the manuscript currently offers only the convergence claim. In revision we will expand the method section with (i) a short theoretical argument bounding the rate difference and (ii) empirical plots of the intrinsic reward magnitude over training, confirming it remains informative rather than collapsing. We will also note a simple safeguard (periodic low-rate critic warm-up) that can be activated if needed.
  revision: partial
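The seed-level reporting promised in the second response could look like the following sketch, which uses only the Python standard library. The accuracy values are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (degrees of freedom n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder final accuracies for 5 seeds (assumptions, not paper results).
method_acc = [0.90, 0.92, 0.91, 0.93, 0.89]
baseline_acc = [0.85, 0.86, 0.88, 0.87, 0.84]

m_mean, m_std = mean_and_std(method_acc)
b_mean, b_std = mean_and_std(baseline_acc)
t_stat = paired_t_statistic(method_acc, baseline_acc)

print(f"method:   {m_mean:.3f} +/- {m_std:.3f}")
print(f"baseline: {b_mean:.3f} +/- {b_std:.3f}")
print(f"paired t statistic (df={len(method_acc) - 1}): {t_stat:.3f}")
```

Pairing by seed only makes sense if each seed fixes the same environment randomness for both methods. If SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.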
Circularity Check
No significant circularity in the derivation chain
The paper defines its intrinsic reward explicitly as the per-step surrogate (current prediction error minus critic-estimated asymptotic baseline) and asserts that the co-trained critic converges faster as an empirical property of regressing a single scalar. This is a modeling choice and assumption rather than a mathematical reduction in which the reward or separation of epistemic/aleatoric error is forced to equal its inputs by construction. No equations are provided that equate the claimed cumulative improvement to the fitted critic output tautologically, and the unification with prior methods is framed as special cases of baseline approximation rather than self-referential. The derivation therefore retains independent content outside any fitted quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- critic convergence rate
invented entities (1)
- asymptotic error baseline (no independent evidence)
Reference graph
Works this paper leans on
- [1] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems, 29, 2016
- [2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013
- [3] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018
- [4] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. International conference on learning representations, 2019
- [5] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018
- [6] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29, 2016
- [7] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021
- [8] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE conference on computational intelligence and games (CIG), pages 1–8. IEEE, 2016
- [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
- [10] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:108, 2007
- [11] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–
- [12] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019
- [13] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991
- [14] Jürgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pages 1458–1463, 1991
- [15] Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010
- [16] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008
- [17] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990
A Learnable vs Unlearnable Decomposition
Let $D_t^{+}$ and $D_t^{-}$ denote the subsets of learnable and highly noisy state transitions in the transition history $D_t$ up to...