Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3
The pith
Curiosity-Critic uses cumulative prediction error improvement as an intrinsic reward for world model training, estimated via a co-trained critic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Curiosity-Critic grounds its intrinsic reward in the improvement of the cumulative prediction error objective across visited transitions. This objective admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. The baseline is estimated online by a learned critic that is co-trained with the world model, regresses a single scalar, and is claimed to converge early. The reward is therefore higher for learnable transitions and collapses toward zero for stochastic ones, whose prediction error settles at the baseline. This separates epistemic from aleatoric prediction error without an oracle noise floor.
What carries the argument
The per-step surrogate of cumulative error improvement, computed as current prediction error minus the asymptotic baseline estimated by a co-trained critic network.
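Written out, the surrogate described above takes roughly the following form. The symbols are assumptions introduced here for illustration (the summary does not give the paper's notation): e_t is the world model's current prediction error on the transition, and \hat{b}_\phi is the critic's scalar estimate of that transition's asymptotic error.

```latex
% Assumed notation, not taken from the paper:
%   e_t            current prediction error on (s_t, a_t, s_{t+1})
%   \hat{b}_\phi   critic estimate of the asymptotic (irreducible) error
\[
  r_t^{\mathrm{int}} \;=\; e_t(s_t, a_t, s_{t+1}) \;-\; \hat{b}_\phi(s_t, a_t)
\]
% Learnable transition:  e_t starts well above \hat{b}_\phi, so r_t^{int} > 0.
% Stochastic transition: e_t settles at the noise floor, so r_t^{int} \to 0.
```

Under this reading, fixing \hat{b}_\phi \equiv 0 would leave the raw prediction error as the reward, which is presumably the kind of baseline approximation under which earlier prediction-error curiosity methods appear as special cases.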
If this is right
- Prior local prediction-error curiosity methods appear as special cases under particular choices of the error baseline approximation.
- On a stochastic grid world, Curiosity-Critic trains faster and reaches higher final world model accuracy than prediction-error, visitation-count, and Random Network Distillation approaches.
- Exploration is redirected toward learnable transitions because the reward collapses toward zero for stochastic transitions, whose prediction error settles at the baseline.
- The critic provides a reliable estimate of the error floor without requiring oracle knowledge.
Where Pith is reading between the lines
- This approach could be tested in more complex environments like continuous control tasks to see if the critic still converges reliably.
- It might connect to uncertainty estimation techniques in Bayesian world models by providing a simple scalar baseline.
- Future work could explore whether the same critic can be used for other intrinsic motivation signals beyond prediction error.
Load-bearing premise
The learned critic converges well before the world model saturates, providing a reliable online estimate of the asymptotic error baseline without oracle knowledge of the noise floor.
What would settle it
The claim would be refuted if the stochastic grid world experiments showed that Curiosity-Critic trains no faster and reaches no higher accuracy than standard prediction-error curiosity, or if the critic failed to provide a stable baseline early in training.
Original abstract
Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
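The behaviour the abstract describes (a reward that is high for learnable transitions and near zero for stochastic ones) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the world model is a single scalar prediction per transition, the critic is replaced by a hand-set baseline that stands in for an already-converged estimate, and the names `intrinsic_reward`, `run`, and `NOISE_STD` are hypothetical. How the paper actually trains the critic is not specified in this summary.

```python
import random


def intrinsic_reward(pred_error, baseline):
    """Per-step surrogate: current prediction error minus the error baseline."""
    return pred_error - baseline


def run(target_fn, baseline, steps=5000, lr=0.05, seed=0):
    """Train a one-parameter 'world model' on one transition and log rewards.

    ASSUMPTION: `baseline` is a fixed stand-in for a converged critic output.
    """
    rng = random.Random(seed)
    w = 0.0  # the model's scalar prediction of the next state
    rewards = []
    for _ in range(steps):
        y = target_fn(rng)            # observed next state
        err = (y - w) ** 2            # current squared prediction error
        rewards.append(intrinsic_reward(err, baseline))
        w += lr * (y - w)             # one gradient-style update of the model
    return rewards


NOISE_STD = 0.5  # hypothetical noise level of the stochastic transition

# Learnable transition: deterministic target, so the asymptotic error is 0.
learnable = run(lambda rng: 1.0, baseline=0.0)

# Stochastic transition: pure noise, so the asymptotic error is the variance.
stochastic = run(lambda rng: rng.gauss(0.0, NOISE_STD), baseline=NOISE_STD ** 2)

late = stochastic[1000:]
print("learnable: first reward", learnable[0], "last reward", learnable[-1])
print("stochastic: mean late reward", sum(late) / len(late))
```

In this toy setting the learnable transition yields a large reward at first that decays as the model fits it, while the stochastic transition averages close to zero from the start because its error never moves far from the baseline.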
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Curiosity-Critic, an intrinsic reward for world model training grounded in improvement of cumulative prediction error across visited transitions. It derives a tractable per-step surrogate as the difference between current prediction error and an asymptotic error baseline estimated online by a co-trained critic. The method is claimed to separate epistemic from aleatoric error, with prior prediction-error formulations emerging as special cases of baseline approximations. Experiments on a stochastic grid world report faster training and higher final accuracy than prediction-error, visitation-count, and Random Network Distillation baselines.
Significance. If the central assumption holds, the approach supplies a principled online mechanism for directing exploration toward reducible uncertainty in stochastic settings, which could enhance sample efficiency for model-based RL without requiring oracle noise-floor knowledge. The clean derivation of the surrogate and its unification of prior methods constitute a conceptual contribution.
major comments (3)
- [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
- [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
- [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.
minor comments (2)
- [Abstract] The abstract and experimental description omit concrete environment parameters (grid size, transition stochasticity level, episode length) needed to reproduce the grid-world results.
- Consider adding a dedicated figure or table that plots critic convergence trajectory against world-model error reduction to directly address the temporal-separation assumption.
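The temporal-separation check requested in these comments could be made quantitative along the following lines. This is a sketch under stated assumptions: `steps_to_converge` is a hypothetical helper, the tolerance is arbitrary, and the two curves are synthetic exponentials used only to exercise the function, not data from the paper.

```python
def steps_to_converge(curve, tol):
    """Return the first step from which the curve stays within tol of its final value."""
    final = curve[-1]
    # Walk backwards to find the last step that is still outside the tolerance band.
    for i in range(len(curve) - 1, -1, -1):
        if abs(curve[i] - final) > tol:
            return i + 1
    return 0


# SYNTHETIC curves for illustration only (not results from the paper):
# a fast-decaying critic loss and a slowly saturating world-model error.
critic_loss = [0.5 * 0.8 ** t for t in range(200)]
world_model_error = [0.25 + 0.75 * 0.98 ** t for t in range(200)]

critic_step = steps_to_converge(critic_loss, tol=0.01)
model_step = steps_to_converge(world_model_error, tol=0.01)
print("critic settles at step", critic_step)
print("world model settles at step", model_step)
print("critic converges first:", critic_step < model_step)
```

Applied to logged critic-loss and world-model-error curves from the same run, a comparison like this would show directly whether the critic settles first, as the paper's premise requires.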
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating where we agree that revisions are warranted and outlining the specific changes we will make to strengthen the manuscript.
- Referee: [Abstract] Abstract: the assertion that 'the critic converges well before the world model saturates' is load-bearing for the claimed separation of epistemic from aleatoric error, yet the experimental section supplies no supporting evidence such as side-by-side learning curves of critic loss versus world-model prediction error.
  Authors: We agree that empirical support for the critic's faster convergence is necessary to substantiate the epistemic/aleatoric separation. In the revised manuscript we will add side-by-side learning curves of critic loss and world-model prediction error (generated from the same experimental runs already performed) to the experiments section and reference them from the abstract and method discussion.
  Revision: yes
- Referee: [Experiments] Experiments: the reported outperformance on the stochastic grid world lacks error bars, number of random seeds, ablation studies on critic training schedule or learning rate, and any statistical significance tests, leaving the superiority claim difficult to evaluate.
  Authors: We accept that the current experimental reporting is insufficient for rigorous evaluation. The revised version will report all results with error bars over 5 independent random seeds, include ablations on critic learning rate and training frequency, and add statistical significance tests (paired t-tests with p-values) comparing Curiosity-Critic against the baselines.
  Revision: yes
- Referee: [Method] Method (surrogate derivation): because the critic is co-trained on the identical data stream as the world model, the paper must demonstrate that the reward signal does not collapse toward zero when the two improve in lockstep; the current text provides only the assertion of faster critic convergence without analysis or safeguards.
  Authors: This concern is well-founded. While the derivation shows that the critic solves a simpler scalar regression task, the manuscript currently offers only the convergence claim. In revision we will expand the method section with (i) a short theoretical argument bounding the rate difference and (ii) empirical plots of the intrinsic reward magnitude over training, confirming it remains informative rather than collapsing. We will also note a simple safeguard (periodic low-rate critic warm-up) that can be activated if needed.
  Revision: partial
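The per-seed paired comparison promised in the second response can be sketched as below. The helper name `paired_t_statistic` is hypothetical, and the numbers in the example are placeholders chosen for easy arithmetic, not results from the paper. The sketch computes only the t statistic and degrees of freedom using the standard library; turning that into a p-value needs the Student t distribution.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed scores of two methods."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# PLACEHOLDER numbers for illustration only: one score per seed, 5 seeds.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [0.0, 2.0, 1.0, 4.0, 3.0]

t_stat, dof = paired_t_statistic(method_a, method_b)
print("t =", round(t_stat, 3), "with", dof, "degrees of freedom")
```

Pairing by seed is appropriate only if both methods were run with the same seeds. If SciPy is available, `scipy.stats.ttest_rel` performs the same test and also returns the p-value. With only 5 seeds the test has little power, so reporting the per-seed differences alongside it would be more informative.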
Circularity Check
No significant circularity in the derivation chain
The paper defines its intrinsic reward explicitly as the per-step surrogate (current prediction error minus critic-estimated asymptotic baseline) and asserts that the co-trained critic converges faster as an empirical property of regressing a single scalar. This is a modeling choice and assumption rather than a mathematical reduction in which the reward or separation of epistemic/aleatoric error is forced to equal its inputs by construction. No equations are provided that equate the claimed cumulative improvement to the fitted critic output tautologically, and the unification with prior methods is framed as special cases of baseline approximation rather than self-referential. The derivation therefore retains independent content outside any fitted quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- critic convergence rate
invented entities (1)
- asymptotic error baseline: no independent evidence
Forward citations
Cited by 1 Pith paper
- Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards
  OHIRL separates next-packet prediction, residual dynamics, a fixed recovery-positive evaluator, and policy learning to achieve high sign and action accuracy in reward-free perceptual tasks where standard reward proxies fail.