arXiv: 2604.18701 · v2 · submitted 2026-04-20 · cs.LG · cs.AI · stat.ML

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3

keywords: intrinsic reward, curiosity, world models, prediction error, exploration, reinforcement learning, epistemic uncertainty, aleatoric uncertainty

The pith

Curiosity-Critic uses cumulative prediction error improvement as an intrinsic reward for world model training, estimated via a co-trained critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Curiosity-Critic to create better intrinsic rewards for training world models by measuring improvement in the total prediction error across all visited transitions rather than local errors alone. It derives a practical per-step reward as the gap between the immediate prediction error and the long-run error baseline for that transition, with the baseline learned online by a critic network. This directs the agent toward transitions that are still reducible while de-emphasizing irreducible stochasticity. A reader would care if this leads to quicker and more accurate learning of environment dynamics without needing prior knowledge of the environment's noise level.

Core claim

Curiosity-Critic grounds its intrinsic reward in the improvement of the cumulative prediction error objective across visited transitions. This admits a tractable per-step surrogate given by the difference between the current prediction error and the asymptotic error baseline of the current state transition. The baseline is estimated online by a learned critic co-trained with the world model that regresses a single scalar and converges early, allowing the reward to favor learnable transitions and approach the baseline for stochastic ones, thereby separating epistemic from aleatoric prediction error without an oracle noise floor.
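
In the notation of the paper's Figure 1 caption, the per-step surrogate described above can be written out as follows. This is a transcription of the caption's reward expression; the zero-baseline remark after it is an editorial illustration of how a special case could arise, not a statement of the exact correspondence the paper draws.

```latex
r_t \;=\; e(s_t, a_t \mid \theta_t) \;-\; \phi_{t+1}(s_t, a_t)
```

Here $e(s_t, a_t \mid \theta_t)$ is the world model's prediction error on the current transition under parameters $\theta_t$, and $\phi_{t+1}(s_t, a_t)$ is the critic's estimate of the error baseline after its own update. If the baseline were fixed at $\phi \equiv 0$, the expression would reduce to $r_t = e(s_t, a_t \mid \theta_t)$, that is, a plain prediction-error reward.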

What carries the argument

The per-step surrogate of cumulative error improvement, computed as current prediction error minus the asymptotic baseline estimated by a co-trained critic network.
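
A minimal sketch of that computation, assuming a toy setup that is not taken from the paper: two hypothetical transitions (`"det"` and `"noisy"`), a tabular world model that predicts a single scalar, a tabular critic, squared error, and hand-picked learning rates. The function name `run_toy` and all parameters are illustrative assumptions. The step order follows the Figure 1 caption: error before the update, world-model update, critic regression onto the post-update error, then reward.

```python
import random


def run_toy(steps=4000, lr_model=0.05, lr_critic=0.1, noise_std=1.0, seed=0):
    """Toy illustration only: tabular world model and critic on two transitions."""
    rng = random.Random(seed)
    model = {"det": 0.0, "noisy": 0.0}    # predicted next-state value per transition
    critic = {"det": 0.0, "noisy": 0.0}   # estimated error baseline per transition
    log = {"det": [], "noisy": []}        # (pre-update error, reward) pairs

    for t in range(steps):
        key = "det" if t % 2 == 0 else "noisy"
        # "det" always yields 1.0; "noisy" adds irreducible Gaussian noise.
        noise = rng.gauss(0.0, noise_std) if key == "noisy" else 0.0
        target = 1.0 + noise

        err_before = (model[key] - target) ** 2                # e(s_t, a_t | theta_t)
        model[key] += lr_model * (target - model[key])         # world-model update
        err_after = (model[key] - target) ** 2                 # e(s_t, a_t | theta_{t+1})
        critic[key] += lr_critic * (err_after - critic[key])   # critic regresses on it
        reward = err_before - critic[key]                      # r_t = e - phi_{t+1}

        log[key].append((err_before, reward))
    return log


if __name__ == "__main__":
    log = run_toy()
    for key, rows in log.items():
        late = rows[-500:]
        mean_err = sum(e for e, _ in late) / len(late)
        mean_rew = sum(r for _, r in late) / len(late)
        print(f"{key}: late mean error={mean_err:.4f}, late mean reward={mean_rew:.4f}")
```

In this toy, the raw prediction error on the noisy transition stays near the noise variance indefinitely, while the baseline-subtracted reward shrinks to a small fraction of it; on the deterministic transition both go to zero once the model has fit it. Whether the same behavior holds with neural networks and a learned policy is exactly what the paper's experiments are meant to show.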

If this is right

  • Prior local prediction-error curiosity methods appear as special cases under particular choices of the error baseline approximation.
  • On a stochastic grid world, Curiosity-Critic trains faster and reaches higher final world-model accuracy than prediction-error, visitation-count, and Random Network Distillation approaches.
  • Exploration is redirected to learnable transitions as the reward collapses to the baseline for stochastic transitions.
  • The critic provides a reliable estimate of the error floor without requiring oracle knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be tested in more complex environments, such as continuous control tasks, to check whether the critic still converges reliably.
  • It might connect to uncertainty estimation techniques in Bayesian world models by providing a simple scalar baseline.
  • Future work could explore whether the same critic can be used for other intrinsic motivation signals beyond prediction error.

Load-bearing premise

The learned critic converges well before the world model saturates, providing a reliable online estimate of the asymptotic error baseline without oracle knowledge of the noise floor.

What would settle it

The claim would be undercut if the stochastic grid-world experiments showed that Curiosity-Critic trains no faster and reaches no higher accuracy than standard prediction-error curiosity, or if the critic failed to provide a stable baseline early in training.

Figures

Figures reproduced from arXiv: 2604.18701 by Haicheng Wang, Vin Bhaskara.

Figure 1: The Curiosity-Critic architecture. Solid black arrows: forward-pass flow. Dashed gray arrows: backward-pass training signals. Steps: (1) world model computes error $e(s_t, a_t \mid \theta_t)$; (2) world model updates to $\theta_{t+1}$ on $s_{t+1}$; (3) critic regresses onto the post-update error $e(s_t, a_t \mid \theta_{t+1})$; (4) critic outputs baseline $\phi_{t+1}(s_t, a_t)$; (5) reward $r_t = e(s_t, a_t \mid \theta_t) - \phi_{t+1}(s_t, a_t)$; (6) the curiosity agent updates $\pi_t$ …
Figure 2: Mean L2 prediction error on deterministic cells versus environment steps, averaged over five …
Figure 3: Fraction of environment steps spent in the deterministic region (columns 0–14) over training, …
Figure 4: Mean neural critic estimate over all deterministic cells (left) and all stochastic cells (right) …
Figure 5: Snapshot of agent trajectories at environment step 30,000. Each panel shows one method; …
Figure 6: Seed-averaged visitation heatmaps at the end of training (final 5,000 of 35,000 steps), …
Figure 7: Seed-averaged visitation heatmaps at three training windows: Early (steps 0–5k), Mid …
Original abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Curiosity-Critic, an intrinsic reward for world model training grounded in improvement of cumulative prediction error across visited transitions. It derives a tractable per-step surrogate as the difference between current prediction error and an asymptotic error baseline estimated online by a co-trained critic. The method is claimed to separate epistemic from aleatoric error, with prior prediction-error formulations emerging as special cases of baseline approximations. Experiments on a stochastic grid world report faster training and higher final accuracy than prediction-error, visitation-count, and Random Network Distillation baselines.

Significance. If the central assumption holds, the approach supplies a principled online mechanism for directing exploration toward reducible uncertainty in stochastic settings, which could enhance sample efficiency for model-based RL without requiring oracle noise-floor knowledge. The clean derivation of the surrogate and its unification of prior methods constitute a conceptual contribution.

major comments (3)
  1. [Abstract] The assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
  2. [Experiments] The reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
  3. [Method, surrogate derivation] Because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.
minor comments (2)
  1. [Abstract] The abstract and experimental description omit the concrete environment parameters (grid size, transition stochasticity level, episode length) needed to reproduce the grid-world results.
  2. Consider adding a dedicated figure or table that plots critic convergence trajectory against world-model error reduction to directly address the temporal-separation assumption.
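
One way to make the temporal-separation check in these comments concrete is to compare when each logged curve settles. The sketch below is an assumption-laden illustration: the helper name `convergence_step`, the tolerance, and both curves are hypothetical, and the curves are synthetic exponentials rather than data from the paper.

```python
import math


def convergence_step(curve, tol=0.05):
    """Return the first index from which every value stays within tol of the final value."""
    if not curve:
        raise ValueError("curve must be non-empty")
    final = curve[-1]
    step = len(curve) - 1
    # Walk backwards until a value leaves the tolerance band around the final value.
    for i in range(len(curve) - 1, -1, -1):
        if abs(curve[i] - final) > tol:
            break
        step = i
    return step


# Synthetic placeholder curves (NOT results from the paper): a fast and a slow decay.
critic_loss = [math.exp(-t / 20.0) for t in range(500)]
model_error = [math.exp(-t / 100.0) for t in range(500)]

if __name__ == "__main__":
    print("critic settles at step", convergence_step(critic_loss))
    print("world model settles at step", convergence_step(model_error))
```

Applied to real per-seed logs of critic loss and world-model error, a consistently smaller settling step for the critic would support the premise; the tolerance would need to be chosen relative to each curve's scale.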

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree that revisions are warranted and outlining the specific changes we will make to strengthen the manuscript.

  1. Referee: [Abstract] The assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.

    Authors: We agree that empirical support for the critic's faster convergence is necessary to substantiate the epistemic/aleatoric separation. In the revised manuscript we will add side-by-side learning curves of critic loss and world-model prediction error (generated from the same experimental runs already performed) to the experiments section and reference them from the abstract and method discussion. revision: yes

  2. Referee: [Experiments] The reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.

    Authors: We accept that the current experimental reporting is insufficient for rigorous evaluation. The revised version will report all results with error bars over 5 independent random seeds, include ablations on critic learning rate and training frequency, and add statistical significance tests (paired t-tests with p-values) comparing Curiosity-Critic against the baselines. revision: yes

  3. Referee: [Method, surrogate derivation] Because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.

    Authors: This concern is well-founded. While the derivation shows that the critic solves a simpler scalar regression task, the manuscript currently offers only the convergence claim. In revision we will expand the method section with (i) a short theoretical argument bounding the rate difference and (ii) empirical plots of the intrinsic reward magnitude over training, confirming it remains informative rather than collapsing. We will also note a simple safeguard (periodic low-rate critic warm-up) that can be activated if needed. revision: partial
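
The paired t-test over seeds mentioned in the second response could be computed along these lines. This is a sketch under stated assumptions: the function name `paired_t_statistic` is hypothetical, and the per-seed numbers are invented placeholders, not results from the paper. Only the t statistic is computed here; a p-value would come from a Student t distribution with n − 1 degrees of freedom, for example via `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
import statistics


def paired_t_statistic(method_a, method_b):
    """Paired t statistic for per-seed scores; both lists must be ordered by seed."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equal-length lists with at least two seeds")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / std_err


# Placeholder per-seed final errors for 5 seeds (invented for illustration only).
baseline_err = [5.0, 6.0, 7.0, 8.0, 9.0]
candidate_err = [4.0, 4.0, 4.0, 4.0, 4.0]
t_stat = paired_t_statistic(baseline_err, candidate_err)

if __name__ == "__main__":
    print(f"paired t statistic over {len(baseline_err)} seeds: {t_stat:.3f}")
```

Pairing by seed matters only if both methods share the same seeds and environment instances; otherwise an unpaired test would be the more appropriate choice.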

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper defines its intrinsic reward explicitly as the per-step surrogate (current prediction error minus critic-estimated asymptotic baseline) and asserts that the co-trained critic converges faster as an empirical property of regressing a single scalar. This is a modeling choice and assumption rather than a mathematical reduction in which the reward or separation of epistemic/aleatoric error is forced to equal its inputs by construction. No equations are provided that equate the claimed cumulative improvement to the fitted critic output tautologically, and the unification with prior methods is framed as special cases of baseline approximation rather than self-referential. The derivation therefore retains independent content outside any fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The abstract provides no explicit list of axioms or free parameters, and the critic is a learned component whose training details are unspecified. The asymptotic baseline functions as a fitted quantity for which no independent evidence is supplied.

free parameters (1)
  • critic convergence rate
    The claim that the critic converges before the world model requires an implicit rate or schedule that is chosen or fitted during training.
invented entities (1)
  • asymptotic error baseline (no independent evidence)
    purpose: Provides the tractable per-step surrogate for cumulative prediction-error improvement
    Introduced as a learned scalar without external validation or falsifiable prediction outside the training loop.



Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

  1. Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems, 29, 2016.
  2. Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013.
  3. Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
  4. Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. International conference on learning representations, 2019.
  5. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.
  6. Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29, 2016.
  7. Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  8. Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE conference on computational intelligence and games (CIG), pages 1–8. IEEE, 2016.
  9. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  10. Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:108, 2007.
  11. Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–
  12. Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International conference on machine learning, pages 5062–5071. PMLR, 2019.
  13. Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991.
  14. Jürgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pages 1458–1463, 1991.
  15. Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010.
  16. Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  17. Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990.

A Learnable vs Unlearnable Decomposition

Let $D_t^{+}$ and $D_t^{-}$ denote the subsets of learnable and highly noisy state transitions in the transition history $D_t$ up to...