Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
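The per-step surrogate described in the abstract (current prediction error minus the asymptotic error baseline of the transition) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of squared error as the error measure, and the hand-supplied baseline values are all assumptions made for illustration; in the paper the baseline is estimated online by a learned critic.

```python
def squared_error(prediction: float, target: float) -> float:
    """Assumed error measure for this sketch: squared error on a scalar outcome."""
    return (prediction - target) ** 2


def curiosity_critic_reward(prediction_error: float, baseline_error: float) -> float:
    """Per-step surrogate: current prediction error minus the asymptotic
    (irreducible) error baseline of the same transition."""
    return prediction_error - baseline_error


# Learnable transition: the outcome is always 1.0, so the noise floor is 0.0.
# A world model that currently predicts 0.5 still has reducible error.
learnable = curiosity_critic_reward(squared_error(0.5, 1.0), baseline_error=0.0)

# Stochastic transition: the outcome is +1 or -1 with equal probability.
# The best possible prediction (0.0) still leaves squared error 1.0, so the
# hand-supplied baseline is 1.0 and no reducible error remains.
noisy = curiosity_critic_reward(squared_error(0.0, 1.0), baseline_error=1.0)

print(learnable, noisy)
```

With these illustrative numbers the learnable transition receives a positive reward while the stochastic one receives zero, matching the abstract's claim that the reward separates reducible from irreducible prediction error.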
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
citing papers explorer
- Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards
OHIRL separates next-packet prediction, residual dynamics, a fixed recovery-positive evaluator, and policy learning to achieve high sign and action accuracy in reward-free perceptual tasks where standard reward proxies fail.