
arxiv: 2605.20878 · v1 · pith:KL52XIKDnew · submitted 2026-05-20 · 💻 cs.LG

CIG: Exploration via Conditional Information Gain

Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords intrinsic rewards · exploration · reinforcement learning · conditional information gain · ensemble disagreement · model-based RL · information gain

The pith

Conditional Information Gain gives a scalable intrinsic reward for exploration that conditions on both lifetime experience and the current rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build an intrinsic reward in reinforcement learning that accounts for both long-term progress against all past data and short-term redundancy inside one ongoing trajectory. Existing rewards tend to capture only one of those signals or rely on methods that break down in high-dimensional spaces. The authors derive Conditional Information Gain as a practical surrogate that uses a log-determinant objective over an ensemble disagreement kernel. Cholesky factorization then splits this into causal per-step rewards that keep both conditioning sets intact. If the surrogate works, agents can explore more efficiently without ignoring either lifetime learning or intra-episode repetition.

Core claim

Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and the rollout prefix, yet this remains intractable for deep models. The Conditional Information Gain (CIG) reward serves as a tractable surrogate through a log-determinant objective over an ensemble disagreement kernel. Its Cholesky factorization then produces causal per-step rewards that preserve both conditioning sets and scale to high-dimensional state spaces. The method is instantiated in a model-based setting and tested across twelve tasks in discrete and continuous control, including stochastic-distractor variants.

What carries the argument

The Conditional Information Gain (CIG) reward, defined as a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields per-step rewards that retain joint conditioning on replay buffer and rollout prefix.
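
As a minimal sketch of this mechanism, the NumPy snippet below builds a kernel from inner products of per-step ensemble disagreement vectors and reads per-step terms off its Cholesky factor. The paper's exact kernel (Eq. 6) and reward (Eq. 8) are not reproduced on this page, so the mean-centering, the `jitter` regularizer, the random placeholder predictions, and the function names `disagreement_kernel` and `per_step_rewards` are all assumptions made for illustration.

```python
import numpy as np

def disagreement_kernel(preds, jitter=1e-6):
    """preds: array (M, T, d) of next-state predictions from M ensemble members
    over a T-step rollout. Returns a (T, T) kernel of inner products between
    per-step disagreement vectors (an assumed form, not the paper's Eq. 6)."""
    M, T, d = preds.shape
    delta = preds - preds.mean(axis=0, keepdims=True)   # deviation from the ensemble mean
    feats = delta.transpose(1, 0, 2).reshape(T, M * d)  # one stacked vector per step
    K = feats @ feats.T / M
    return K + jitter * np.eye(T)                       # keep K positive definite

def per_step_rewards(K):
    """Causal per-step terms whose sum equals log det K."""
    L = np.linalg.cholesky(K)        # K = L @ L.T with L lower triangular
    return 2.0 * np.log(np.diag(L))  # L[t, t]**2 is what remains after the prefix

rng = np.random.default_rng(0)
preds = rng.normal(size=(3, 4, 5))   # placeholder data: M=3 members, T=4 steps, d=5
# Make the last step nearly repeat the one before it, so the prefix explains it.
preds[:, 3] = preds[:, 2] + 0.05 * rng.normal(size=(3, 5))

K = disagreement_kernel(preds)
r = per_step_rewards(K)
print("per-step terms:", r)
print("sum vs log det:", r.sum(), np.linalg.slogdet(K)[1])
```

In this toy rollout the first step keeps its full diagonal term `log K[0, 0]`, while the last step, which is almost spanned by its prefix, receives a much smaller value than `log K[3, 3]` alone would give.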

If this is right

  • CIG combines lifelong and episodic signals without needing heuristic weights or low-dimensional assumptions.
  • The reward scales to high-dimensional state spaces where Gaussian-process methods fail.
  • It remains robust when stochastic distractors appear in the environment.
  • Performance holds across both discrete grid tasks and continuous control benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition technique could be applied to other information measures in sequential decision problems.
  • Testing CIG outside the model-based setting with short rollouts would reveal how far the per-step approximation generalizes.
  • Varying the ensemble size could serve as a direct way to study the accuracy of the disagreement kernel approximation.
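
The ensemble-size question in the last bullet can be illustrated on a toy problem where the exact kernel is known. The sketch below is not the paper's setup: it assumes linear ensemble members whose weights are drawn from a standard Gaussian, as a stand-in for a trained deep ensemble, so that the exact predictive covariance is `X @ X.T`. The function name `ensemble_kernel` and all sizes are hypothetical.

```python
import numpy as np

def ensemble_kernel(X, W):
    """X: (T, d) inputs; W: (M, d) weights of M linear ensemble members.
    Returns the (T, T) sample covariance of the members' predictions."""
    preds = W @ X.T                     # (M, T): prediction of member k at input t
    delta = preds - preds.mean(axis=0)  # disagreement around the ensemble mean
    return delta.T @ delta / (W.shape[0] - 1)

rng = np.random.default_rng(0)
T, d = 6, 3
X = rng.normal(size=(T, d))
K_true = X @ X.T                        # exact kernel when weights follow N(0, I)

errors = {}
for M in (4, 64, 4096):
    W = rng.normal(size=(M, d))         # assumed toy "ensemble", not a trained model
    K_hat = ensemble_kernel(X, W)
    errors[M] = np.linalg.norm(K_hat - K_true) / np.linalg.norm(K_true)

print(errors)  # relative Frobenius error of the kernel estimate per ensemble size
```

In this toy setting the estimate tightens as the number of members grows, which is the kind of curve such a study would trace for the real disagreement kernel.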

Load-bearing premise

The ensemble disagreement kernel approximates the true trajectory-level conditional information gain closely enough that the Cholesky decomposition preserves the joint conditioning without material loss.

What would settle it

Running the method on high-dimensional tasks and finding that CIG yields no exploration gain over prior methods, or that the resulting per-step rewards show low correlation with actual trajectory information gain, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20878 by J. Marius Zöllner, Karam Daaboul, Marcus Fechner, Philipp Stegmaier, Tim Joseph.

Figure 1. Conditioning contexts of intrinsic rewards in model-based RL. The policy plans over short imagined rollouts. Neural network weights w, trained on the replay buffer D, summarize lifetime uncertainty. Solid states and arrows mark the contexts each reward class conditions on; dashed/faded elements are not used as context. In the model-based setting, the rollout prefix $s_{<t}$ is too short to supply a rich episodi…
Figure 2. Mechanism of the CIG reward on a four-step imagined rollout (T=4, M=3). (a) Disagreement vectors $\delta_k^{(t)}$ at each step. (b) The kernel $K_{jt}$ (Eq. 6) built from their inner products; the column $k_{<4}$ links step 4 to its prefix. (c) Eq. 8 subtracts the prefix-explained portion from the lifelong term: step 3, nearly orthogonal to the prefix, retains most of its bonus; step 4, largely spanned by the prefix $s_{<4}$, is…
Figure 3. Cumulative successes (total completed episodes) on three MiniGrid tasks (columns) in clean …
Figure 4. Coverage curves on six continuous-control tasks: four clean (top row and bottom right) …
Figure 5. Aggregate normalised exploration scores across all twelve tasks (six MiniGrid, six contin…
Figure 6. Ablation on AntMaze. Each variant removes one component of the CIG reward (Eq. (8)). Lines show the IQM; shaded regions are 95% stratified bootstrap confidence intervals. We isolate the three components of Eq. (8) on AntMaze in …
Figure 7. RGB observations from the evaluation environments (the agent sees 64 …
Figure 8. Prefix-redundancy activation across training (AntMaze-medium). Each panel shows the column-normalised conditional density $P(y \mid x_{\text{bin}})$ on a 10 × 10 grid. Top row: prefix-explained fraction $k_{<t}^{\top} \tilde{K}_{<t}^{-1} k_{<t} / K_{tt}$ (the share of the lifelong disagreement at step t accounted for by the prefix) vs. lifelong disagreement $\log K_{tt}$, the diagonal term of Eq. (8). Bottom row: Spearman rank correlation $\rho(\text{CIG}, \log K_{tt})$ …
Figure 9. Per-episode visit-distribution entropy (x-axis) versus final cumulative successes (y-axis, log …
Figure 10. Total button flips over training on Puzzle …
Original abstract

Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Conditional Information Gain (CIG) as an intrinsic reward for exploration in RL. It derives a tractable surrogate for trajectory-level conditional information gain via a log-determinant objective over an ensemble disagreement kernel; Cholesky factorization then produces causal per-step rewards that simultaneously condition on the replay buffer and rollout prefix. The approach is instantiated in a model-based setting with short rollouts and evaluated on twelve tasks spanning MiniGrid (discrete) and OGBench (continuous control) in both clean and stochastic-distractor environments, where it outperforms or matches prior methods while remaining robust to distractors.

Significance. If the derivation holds and the Cholesky decomposition preserves joint conditioning with bounded error, the result would offer a scalable, principled hybrid of lifelong and episodic exploration signals that avoids heuristic weighting and extends beyond low-dimensional GP methods. The empirical scope across twelve tasks in discrete/continuous and clean/noisy settings is a concrete strength that supports practical utility.

major comments (2)
  1. [§3 (CIG Derivation)] The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.
  2. [§3.2 (Ensemble Disagreement Kernel)] The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.
minor comments (2)
  1. [Abstract and §4 (Experiments)] The claim of evaluation 'across twelve tasks' would benefit from an explicit list of environments, number of random seeds, and statistical significance tests to allow direct replication and assessment of robustness.
  2. [Notation] The distinction between lifetime (replay buffer) and episodic (rollout prefix) conditioning sets should be denoted consistently with subscripts or superscripts throughout the equations to improve readability.
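
The reporting asked for in the first minor comment can be made concrete with the statistic named in the Figure 6 caption: the interquartile mean (IQM) with stratified bootstrap confidence intervals. The sketch below is an illustration only. The score matrix is synthetic, the count of five seeds per task is an assumption, and the function names `iqm` and `stratified_bootstrap_ci` are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: pool all scores, drop the lowest and highest
    floor(n/4) values, and average what remains."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    n = x.size
    start, stop = n // 4, n - n // 4
    return x[start:stop].mean()

def stratified_bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """scores: (n_tasks, n_seeds) normalised scores.
    Seeds are resampled with replacement independently within each task,
    so every bootstrap replicate keeps all tasks represented."""
    scores = np.asarray(scores, dtype=float)
    n_tasks, n_seeds = scores.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_seeds, size=(n_tasks, n_seeds))
        stats[b] = iqm(np.take_along_axis(scores, idx, axis=1))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return iqm(scores), (lower, upper)

# Synthetic placeholder: 12 tasks (as in the paper) x 5 seeds (assumed).
rng = np.random.default_rng(1)
scores = rng.uniform(0.0, 1.0, size=(12, 5))
point, (lower, upper) = stratified_bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling within each task, rather than over the pooled scores, keeps an easy or hard task from being over- or under-represented in any single replicate.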

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating the revisions we will make to strengthen the presentation of the CIG derivation and the ensemble kernel.

Point-by-point responses
  1. Referee: [§3 (CIG Derivation)] The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.

    Authors: We agree that an explicit demonstration would improve clarity. In the revised manuscript we will add a detailed derivation in the appendix showing that the Cholesky factorization of the joint kernel matrix decomposes the log-determinant exactly into a sum of per-step conditional log-determinants. Because the kernel matrix is assembled from all points in the replay buffer together with the rollout prefix, the conditioning on both sets is retained by construction; no additional low-rank or independence assumptions are imposed beyond the positive-definiteness of the ensemble kernel. We will also include a brief error-bound discussion based on the fact that the Cholesky factors yield the exact conditional variances at each step, with any numerical error bounded by standard floating-point analysis. revision: yes

  2. Referee: [§3.2 (Ensemble Disagreement Kernel)] The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.

    Authors: We will clarify this point in the revision. The reward is the log-determinant of the kernel matrix whose entries are pairwise disagreements between ensemble members evaluated at the relevant state-action pairs. Once the kernel matrix is formed, the reward expression depends only on these matrix entries and not on the internal parameters of the individual ensemble members. In the revised §3.2 we will explicitly rewrite the reward formula in terms of the kernel matrix alone, thereby showing that the information-gain interpretation is preserved as a measure of predictive uncertainty reduction and is not circular with respect to the fitting procedure. revision: yes
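
The exact decomposition invoked in the first response is a standard linear-algebra identity, written out below as a sketch. It assumes a positive-definite T × T kernel K with Cholesky factor L (so K = L Lᵀ), and uses $K_{<t}$ for the leading block over steps before t and $k_{<t}$ for the column linking step t to that prefix, following the notation of the Figure 2 and Figure 8 captions. The captions also mention a matrix $\tilde{K}_{<t}$ whose definition is not given on this page; the plain $K_{<t}$ is used here instead, so this is not necessarily the paper's Eq. (8).

```latex
\begin{aligned}
\log\det K
  &= \sum_{t=1}^{T} 2 \log L_{tt} \\
  &= \sum_{t=1}^{T} \log\!\left( K_{tt} - k_{<t}^{\top} K_{<t}^{-1} k_{<t} \right) \\
  &= \sum_{t=1}^{T} \left[ \log K_{tt}
     + \log\!\left( 1 - \frac{k_{<t}^{\top} K_{<t}^{-1} k_{<t}}{K_{tt}} \right) \right],
\end{aligned}
```

with the convention that the prefix term vanishes for t = 1. Each summand depends only on steps up to t, which is what makes the per-step terms causal. The first part, $\log K_{tt}$, does not involve the prefix, and the second part is a non-positive correction that grows in magnitude as the prefix explains more of step t.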

Circularity Check

0 steps flagged

No significant circularity in derivation of CIG surrogate

Full rationale

The paper claims to derive the Conditional Information Gain reward as a log-determinant objective over an ensemble disagreement kernel, with Cholesky factorization yielding per-step rewards. This is presented as a mathematical construction of a tractable surrogate for the intractable trajectory-level conditional information gain. No equations or steps are shown that reduce the final reward expression to fitted parameters or prior self-citations by construction. The central claim remains a proposed approximation whose validity is evaluated empirically on external benchmarks (MiniGrid, OGBench tasks), making the derivation self-contained rather than tautological. No load-bearing self-citation chains or renamed known results are identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract alone; full derivation, parameter choices, and supporting lemmas are unavailable, so ledger entries are limited to those explicitly named in the abstract.

axioms (1)
  • domain assumption: Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and rollout prefix.
    Directly stated in the abstract as the starting point for the CIG derivation.
invented entities (1)
  • Conditional Information Gain (CIG) reward (no independent evidence)
    purpose: Tractable surrogate objective that retains joint conditioning while scaling to high-dimensional spaces.
    Introduced in the abstract as the central new construct realized via ensemble disagreement kernel and Cholesky factorization.

pith-pipeline@v0.9.0 · 5718 in / 1356 out tokens · 45156 ms · 2026-05-21T06:43:40.080863+00:00 · methodology


