Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

2026 · cs.LG · arXiv 2604.18701

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
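The per-step surrogate described in the abstract can be sketched numerically. This is a minimal illustration, not the paper's implementation: the function name `intrinsic_reward`, the array names, and all numbers are hypothetical, and the baseline values stand in for whatever the learned critic would output (how the critic is trained is not specified here).

```python
import numpy as np


def intrinsic_reward(pred_error, baseline_estimate):
    """Per-step surrogate from the abstract: current prediction error
    minus the estimated asymptotic (irreducible) error of the transition.

    Both arguments are assumed to be non-negative error values; the
    baseline would come from a learned critic in the actual method.
    """
    pred_error = np.asarray(pred_error, dtype=float)
    baseline_estimate = np.asarray(baseline_estimate, dtype=float)
    return pred_error - baseline_estimate


# Hypothetical example: two transitions with the same current error.
# Index 0: learnable transition (low noise floor).
# Index 1: stochastic transition (error is almost all irreducible noise).
current_errors = np.array([0.50, 0.50])
critic_baselines = np.array([0.05, 0.48])  # assumed critic outputs

rewards = intrinsic_reward(current_errors, critic_baselines)

# With a zero baseline the expression reduces to the raw prediction error.
raw_error_rewards = intrinsic_reward(current_errors, np.zeros(2))
```

Under these assumed numbers the learnable transition keeps most of its reward while the stochastic one is close to zero, even though a raw prediction-error reward would rate them equally. Setting the baseline to zero is one way to read the abstract's claim that earlier prediction-error formulations correspond to particular approximations of the baseline.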

representative citing papers

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 5.0

OHIRL separates next-packet prediction, residual dynamics, a fixed recovery-positive evaluator, and policy learning to achieve high sign and action accuracy in reward-free perceptual tasks where standard reward proxies fail.
