CIG: Exploration via Conditional Information Gain

J. Marius Z\"ollner; Karam Daaboul; Marcus Fechner; Philipp Stegmaier; Tim Joseph

arxiv: 2605.20878 · v1 · pith:KL52XIKDnew · submitted 2026-05-20 · 💻 cs.LG

CIG: Exploration via Conditional Information Gain

Tim Joseph , Marcus Fechner , Philipp Stegmaier , Karam Daaboul , J. Marius Z\"ollner This is my paper

Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords intrinsic rewardsexplorationreinforcement learningconditional information gainensemble disagreementmodel-based RLinformation gain

0 comments

The pith

Conditional Information Gain gives a scalable intrinsic reward for exploration that conditions on both lifetime experience and the current rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build an intrinsic reward in reinforcement learning that accounts for both long-term progress against all past data and short-term redundancy inside one ongoing trajectory. Existing rewards tend to capture only one of those signals or rely on methods that break down in high-dimensional spaces. The authors derive Conditional Information Gain as a practical surrogate that uses a log-determinant objective over an ensemble disagreement kernel. Cholesky factorization then splits this into causal per-step rewards that keep both conditioning sets intact. If the surrogate works, agents can explore more efficiently without ignoring either lifetime learning or intra-episode repetition.

Core claim

Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and the rollout prefix, yet this remains intractable for deep models. The Conditional Information Gain (CIG) reward serves as a tractable surrogate through a log-determinant objective over an ensemble disagreement kernel. Its Cholesky factorization then produces causal per-step rewards that preserve both conditioning sets and scale to high-dimensional state spaces. The method is instantiated in a model-based setting and tested across twelve tasks in discrete and continuous control, including stochastic-distractor variants.

What carries the argument

The Conditional Information Gain (CIG) reward, defined as a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields per-step rewards that retain joint conditioning on replay buffer and rollout prefix.

If this is right

CIG combines lifelong and episodic signals without needing heuristic weights or low-dimensional assumptions.
The reward scales to high-dimensional state spaces where Gaussian-process methods fail.
It remains robust when stochastic distractors appear in the environment.
Performance holds across both discrete grid tasks and continuous control benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition technique could be applied to other information measures in sequential decision problems.
Testing CIG outside the model-based setting with short rollouts would reveal how far the per-step approximation generalizes.
Varying the ensemble size could serve as a direct way to study the accuracy of the disagreement kernel approximation.

Load-bearing premise

The ensemble disagreement kernel approximates the true trajectory-level conditional information gain closely enough that the Cholesky decomposition preserves the joint conditioning without material loss.

What would settle it

Running the method on high-dimensional tasks and finding that CIG yields no exploration gain over prior methods, or that the resulting per-step rewards show low correlation with actual trajectory information gain, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20878 by J. Marius Z\"ollner, Karam Daaboul, Marcus Fechner, Philipp Stegmaier, Tim Joseph.

**Figure 1.** Figure 1: Conditioning contexts of intrinsic rewards in model-based RL. The policy plans over short imagined rollouts. Neural network weights w, trained on the replay buffer D, summarize lifetime uncertainty. Solid states and arrows mark the contexts each reward class conditions on; dashed/faded elements are not used as context. In the model-based setting, the rollout prefix s<t is too short to supply a rich episodi… view at source ↗

**Figure 2.** Figure 2: Mechanism of the CIG reward on a four-step imagined rollout (T=4, M=3). (a) Disagreement vectors δ (t) k at each step. (b) The kernel Kjt (Eq. 6) built from their inner products; the column k<4 links step 4 to its prefix. (c) Eq. 8 subtracts the prefix-explained portion from the lifelong term: step 3, nearly orthogonal to the prefix, retains most of its bonus; step 4, largely spanned by the prefix s<4, is… view at source ↗

**Figure 3.** Figure 3: Cumulative successes (total completed episodes) on three MiniGrid tasks (columns) in clean [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Coverage curves on six continuous-control tasks: four clean (top row and bottom right) [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Aggregate normalised exploration scores across all twelve tasks (six MiniGrid, six contin [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation on AntMaze. Each variant removes one component of the CIG reward (Eq. (8)). Lines show the IQM; shaded regions are 95 % stratified bootstrap confidence intervals. We isolate the three components of Eq. (8) on AntMaze in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: RGB observations from the evaluation environments (the agent sees 64 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Prefix-redundancy activation across training (AntMaze-medium). Each panel shows the column-normalised conditional density P(y | xbin) on a 10 × 10 grid. Top row: prefix-explained fraction k ⊤ <tK˜ −1 <t k<t/Ktt (the share of the lifelong disagreement at step t accounted for by the prefix) vs. lifelong disagreement log Ktt, the diagonal term of Eq. (8). Bottom row: Spearman rank correlation ρ(CIG, log Ktt) … view at source ↗

**Figure 9.** Figure 9: Per-episode visit-distribution entropy (x-axis) versus final cumulative successes (y-axis, log [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Total button flips over training on Puzzle [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

read the original abstract

Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CIG gives a concrete ensemble-kernel plus Cholesky route to per-step rewards that try to keep both replay-buffer and rollout-prefix conditioning, but the abstract leaves the preservation claim hard to check.

read the letter

The punchline is that this paper supplies a specific, scalable surrogate for trajectory-level conditional information gain in RL exploration. It uses an ensemble disagreement kernel inside a log-determinant objective and then applies Cholesky factorization to produce causal per-step rewards that are meant to condition simultaneously on the lifetime replay buffer and the current rollout prefix. That construction is the main novelty relative to the heuristic hybrids and non-scaling GP baselines mentioned in the abstract.

Referee Report

2 major / 2 minor

Summary. The paper proposes Conditional Information Gain (CIG) as an intrinsic reward for exploration in RL. It derives a tractable surrogate for trajectory-level conditional information gain via a log-determinant objective over an ensemble disagreement kernel; Cholesky factorization then produces causal per-step rewards that simultaneously condition on the replay buffer and rollout prefix. The approach is instantiated in a model-based setting with short rollouts and evaluated on twelve tasks spanning MiniGrid (discrete) and OGBench (continuous control) in both clean and stochastic-distractor environments, where it outperforms or matches prior methods while remaining robust to distractors.

Significance. If the derivation holds and the Cholesky decomposition preserves joint conditioning with bounded error, the result would offer a scalable, principled hybrid of lifelong and episodic exploration signals that avoids heuristic weighting and extends beyond low-dimensional GP methods. The empirical scope across twelve tasks in discrete/continuous and clean/noisy settings is a concrete strength that supports practical utility.

major comments (2)

[§3 (CIG Derivation)] §3 (CIG Derivation): The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.
[§3.2 (Ensemble Disagreement Kernel)] §3.2 (Ensemble Disagreement Kernel): The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.

minor comments (2)

[Abstract and §4 (Experiments)] Abstract and §4 (Experiments): The claim of evaluation 'across twelve tasks' would benefit from an explicit list of environments, number of random seeds, and statistical significance tests to allow direct replication and assessment of robustness.
[Notation] Notation: The distinction between lifetime (replay buffer) and episodic (rollout prefix) conditioning sets should be denoted consistently with subscripts or superscripts throughout the equations to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating the revisions we will make to strengthen the presentation of the CIG derivation and the ensemble kernel.

read point-by-point responses

Referee: [§3 (CIG Derivation)] The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.

Authors: We agree that an explicit demonstration would improve clarity. In the revised manuscript we will add a detailed derivation in the appendix showing that the Cholesky factorization of the joint kernel matrix decomposes the log-determinant exactly into a sum of per-step conditional log-determinants. Because the kernel matrix is assembled from all points in the replay buffer together with the rollout prefix, the conditioning on both sets is retained by construction; no additional low-rank or independence assumptions are imposed beyond the positive-definiteness of the ensemble kernel. We will also include a brief error-bound discussion based on the fact that the Cholesky factors yield the exact conditional variances at each step, with any numerical error bounded by standard floating-point analysis. revision: yes
Referee: [§3.2 (Ensemble Disagreement Kernel)] The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.

Authors: We will clarify this point in the revision. The reward is the log-determinant of the kernel matrix whose entries are pairwise disagreements between ensemble members evaluated at the relevant state-action pairs. Once the kernel matrix is formed, the reward expression depends only on these matrix entries and not on the internal parameters of the individual ensemble members. In the revised §3.2 we will explicitly rewrite the reward formula in terms of the kernel matrix alone, thereby showing that the information-gain interpretation is preserved as a measure of predictive uncertainty reduction and is not circular with respect to the fitting procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of CIG surrogate

full rationale

The paper claims to derive the Conditional Information Gain reward as a log-determinant objective over an ensemble disagreement kernel, with Cholesky factorization yielding per-step rewards. This is presented as a mathematical construction of a tractable surrogate for the intractable trajectory-level conditional information gain. No equations or steps are shown that reduce the final reward expression to fitted parameters or prior self-citations by construction. The central claim remains a proposed approximation whose validity is evaluated empirically on external benchmarks (MiniGrid, OGBench tasks), making the derivation self-contained rather than tautological. No load-bearing self-citation chains or renamed known results are identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract alone; full derivation, parameter choices, and supporting lemmas are unavailable, so ledger entries are limited to those explicitly named in the abstract.

axioms (1)

domain assumption Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and rollout prefix.
Directly stated in the abstract as the starting point for the CIG derivation.

invented entities (1)

Conditional Information Gain (CIG) reward no independent evidence
purpose: Tractable surrogate objective that retains joint conditioning while scaling to high-dimensional spaces
Introduced in the abstract as the central new construct realized via ensemble disagreement kernel and Cholesky factorization.

pith-pipeline@v0.9.0 · 5718 in / 1356 out tokens · 45156 ms · 2026-05-21T06:43:40.080863+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

[1]

Curious model-building control systems

Jürgen Schmidhuber. Curious model-building control systems. InProceedings of the 1991 IEEE International Joint Conference on Neural Networks (IJCNN ’91), Singapore, volume 2, pages 1458–1463. IEEE, 1991. doi: 10.1109/IJCNN.1991.170605

work page doi:10.1109/ijcnn.1991.170605 1991
[2]

What is intrinsic motivation? A typology of computational approaches.Frontiers in Neurorobotics, V olume 1 - 2007, 2007

Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches.Frontiers in Neurorobotics, V olume 1 - 2007, 2007. ISSN 1662-

work page 2007
[3]

doi: 10.3389/neuro.12.006.2007

work page doi:10.3389/neuro.12.006.2007 2007
[4]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-Driven Explo- ration by Self-Supervised Prediction. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2778–2787, Sydney, NSW, Australia, 2017. JMLR.org

work page 2017
[5]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. InInternational Conference on Learning Representations, 2019

work page 2019
[6]

Planning to Explore via Self-Supervised World Models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to Explore via Self-Supervised World Models. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 8583–8592. PMLR, July 2020

work page 2020
[7]

Nicklas Hansen, Hao Su, and Xiaolong Wang

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, April 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2

work page doi:10.1038/s41586-025-08744-2 2025
[8]

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018
[9]

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2555–

work page
[10]

Lillicrap, Mohammad Norouzi, and Jimmy Ba

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. InInternational Conference on Learning Representations, 2021

work page 2021
[11]

Self-Supervised Exploration via Dis- agreement

Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-Supervised Exploration via Dis- agreement. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5062–5071. PMLR, June 2019

work page 2019
[12]

Exploration via Elliptical Episodic Bonuses

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via Elliptical Episodic Bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

work page 2022
[13]

A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs

Mikael Henaff, Minqi Jiang, and Roberta Raileanu. A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th Inter- national Conference on Machine Learning, volume 202 ofProceedings of Machine Learning ...

work page 2023
[14]

Never Give Up: Learning Directed Exploration Strategies, February 2020

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andew Bolt, and Charles Blundell. Never Give Up: Learning Directed Exploration Strategies, February 2020

work page 2020
[15]

Gonzalez, and Yuandong Tian

Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, and Yuandong Tian. NovelD: A Simple yet Effective Exploration Criterion. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. 10

work page 2021
[16]

Exploration via Planning for Information about the Optimal Trajectory

Viraj Mehta, Ian Char, Joseph Abbate, Rory Conlin, Mark Boyer, Stefano Ermon, Jeff Schneider, and Willie Neiswanger. Exploration via Planning for Information about the Optimal Trajectory. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 28761–28775. Curran Ass...

work page 2022
[17]

D. V . Lindley. On a measure of the information provided by an experiment.The Annals of Mathematical Statistics, 27(4):986–1005, 1956. doi: 10.1214/aoms/1177728069

work page doi:10.1214/aoms/1177728069 1956
[18]

Unifying Count-Based Exploration and Intrinsic Motivation

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying Count-Based Exploration and Intrinsic Motivation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

work page 2016
[19]

Episodic Curiosity through Reachability

Nikolay Savinov, Anton Raichuk, Damien Vincent, Raphael Marinier, Marc Pollefeys, Tim- othy Lillicrap, and Sylvain Gelly. Episodic Curiosity through Reachability. InInternational Conference on Learning Representations, 2019

work page 2019
[20]

LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward

Daejin Jo, Sungwoong Kim, Daniel Nam, Taehwan Kwon, Seungeun Rho, Jongmin Kim, and Donghoon Lee. LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 30432–30445. Curran Associates, Inc., 2022

work page 2022
[21]

Agent57: Outperforming the Atari Human Benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari Human Benchmark. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages...

work page 2020
[22]

RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments

Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments. InInternational Conference on Learning Representa- tions, 2020

work page 2020
[23]

Bayesian experimental design: A review.Statistical Science, 10(3):273–304, 1995

Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review.Statistical Science, 10(3):273–304, 1995. doi: 10.1214/ss/1177009939

work page doi:10.1214/ss/1177009939 1995
[24]

Ivanova, and Freddie Bickford Smith

Tom Rainforth, Adam Foster, Desi R. Ivanova, and Freddie Bickford Smith. Modern bayesian experimental design.Statistical Science, 39(1):100–114, 2024. doi: 10.1214/23-STS915

work page doi:10.1214/23-sts915 2024
[25]

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[26]

Reprint of the 1993 original.https://doi.org/10.1137/1.9780898719109 MR2376769

Friedrich Pukelsheim.Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006. doi: 10.1137/1.9780898719109

work page doi:10.1137/1.9780898719109 2006
[27]

Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo De Lazcano Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and J K Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. InThirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmar...

work page 2023
[28]

OGBench: Benchmark- ing Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmark- ing Offline Goal-Conditioned RL. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[29]

Behavior From the V oid: Unsupervised Active Pre-Training

Hao Liu and Pieter Abbeel. Behavior From the V oid: Unsupervised Active Pre-Training. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. 11

work page 2021
[30]

Deep Reinforcement Learning at the Edge of the Statistical Precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Belle- mare. Deep Reinforcement Learning at the Edge of the Statistical Precipice. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc., 2021

work page 2021
[31]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,ICLR (Poster), 2015

work page 2015
[32]

Horn and Charles R

Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 1990. ISBN 0-521-38632-2

work page 1990
[33]

Bellemare, Aäron van den Oord, and Rémi Munos

Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-Based Exploration with Neural Density Models. In Doina Precup and Yee Whye Teh, editors,Pro- ceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2721–2730. PMLR, August 2017

work page 2017
[34]

Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning

Sam Lobel, Akhil Bagaria, and George Konidaris. Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learn...

work page 2023
[35]

Exploration and Anti-Exploration with Distributional Random Network Distillation

Kai Yang, Jian Tao, Jiafei Lyu, and Xiu Li. Exploration and Anti-Exploration with Distributional Random Network Distillation. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of...

work page 2024
[36]

How to Stay Curi- ous while avoiding Noisy TVs using Aleatoric Uncertainty Estimation

Augustine Mavor-Parker, Kimberly Young, Caswell Barry, and Lewis Griffin. How to Stay Curi- ous while avoiding Noisy TVs using Aleatoric Uncertainty Estimation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedin...

work page 2022
[37]

State Entropy Maximization with Random Encoders for Efficient Exploration

Younggyo Seo, Lili Chen, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. State Entropy Maximization with Random Encoders for Efficient Exploration. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 9443–9454. PMLR, July 2021

work page 2021
[38]

Rethinking Exploration in Reinforce- ment Learning with Effective Metric-Based Exploration Bonus

Yiming Wang, Kaiyan Zhao, Furui Liu, and Leong Hou U. Rethinking Exploration in Reinforce- ment Learning with Effective Metric-Based Exploration Bonus. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[39]

VIME: Variational Information Maximizing Exploration

Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational Information Maximizing Exploration. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

work page 2016
[40]

Model-Based Active Exploration

Pranav Shyam, Wojciech Ja´skowski, and Faustino Gomez. Model-Based Active Exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5779–5788. PMLR, June 2019

work page 2019
[41]

Curiosity-Driven Exploration via Latent Bayesian Surprise.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (7):7752–7760, June 2022

Pietro Mazzaglia, Ozan Catal, Tim Verbelen, and Bart Dhoedt. Curiosity-Driven Exploration via Latent Bayesian Surprise.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (7):7752–7760, June 2022. doi: 10.1609/aaai.v36i7.20743

work page doi:10.1609/aaai.v36i7.20743 2022
[42]

Episodic Novelty Through Temporal Distance

Yuhua Jiang, Qihan Liu, Yiqin Yang, Xiaoteng Ma, Dianyu Zhong, Hao Hu, Jun Yang, Bin Liang, Bo XU, Chongjie Zhang, and Qianchuan Zhao. Episodic Novelty Through Temporal Distance. InThe Thirteenth International Conference on Learning Representations, 2025. 12

work page 2025
[43]

Go Beyond Imagination: Maximizing Episodic Reacha- bility with World Models, 2023

Yao Fu, Run Peng, and Honglak Lee. Go Beyond Imagination: Maximizing Episodic Reacha- bility with World Models, 2023

work page 2023
[44]

Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments

Daochen Zha, Wenye Ma, Lei Yuan, Xia Hu, and Ji Liu. Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments. InInternational Conference on Learning Representations, 2021

work page 2021
[45]

Gonzalez, and Stuart Russell

Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph E. Gonzalez, and Stuart Russell. MADE: Exploration via Maximizing Deviation from Explored Regions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021

work page 2021
[46]

Zico Kolter, and Roberta Raileanu

Yiding Jiang, J. Zico Kolter, and Roberta Raileanu. On the Importance of Exploration for Generalization in Reinforcement Learning. InThirty-Seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[47]

A brief note on the Bayesian D-optimality criterion, 2023

Alen Alexanderian. A brief note on the Bayesian D-optimality criterion, 2023

work page 2023
[48]

An experimental design perspective on model-based reinforcement learning

Viraj Mehta, Biswajit Paria, Jeff Schneider, Willie Neiswanger, and Stefano Ermon. An experimental design perspective on model-based reinforcement learning. InInternational Conference on Learning Representations, 2022

work page 2022
[49]

Distractor

Alberto Caron, Vasilios Mavroudis, and Chris Hicks. On efficient bayesian exploration in model-based reinforcement learning.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. 13 Algorithm 1Conditional Information Gain reward (Algorithm referenced in §3.4). Require: Imagined latent rollout (s0, a0, . . . , sT−1 , aT−1 ); ensemble of one-step ...

work page 2025
[50]

OGBench (https://github.com/seohongpark/ogbench) — MIT License

work page
[51]

Limitations

MiniGrid (https://github.com/Farama-Foundation/Minigrid) — Apache License 2.0 B Theoretical Analysis B.1 Tightness of the Gaussian Entropy Bound (A3) Approximation A3 replaces the mixture entropy H(s 1:T ) with the entropy of the moment-matched Gaussian q=N( ¯µ,Σ) , where Σ =σ 2IT d +C . Because q shares the mean and covariance of the mixture p= 1 M P k N...

work page
[52]

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

Curious model-building control systems

Jürgen Schmidhuber. Curious model-building control systems. InProceedings of the 1991 IEEE International Joint Conference on Neural Networks (IJCNN ’91), Singapore, volume 2, pages 1458–1463. IEEE, 1991. doi: 10.1109/IJCNN.1991.170605

work page doi:10.1109/ijcnn.1991.170605 1991

[2] [2]

What is intrinsic motivation? A typology of computational approaches.Frontiers in Neurorobotics, V olume 1 - 2007, 2007

Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches.Frontiers in Neurorobotics, V olume 1 - 2007, 2007. ISSN 1662-

work page 2007

[3] [3]

doi: 10.3389/neuro.12.006.2007

work page doi:10.3389/neuro.12.006.2007 2007

[4] [4]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-Driven Explo- ration by Self-Supervised Prediction. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2778–2787, Sydney, NSW, Australia, 2017. JMLR.org

work page 2017

[5] [5]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. InInternational Conference on Learning Representations, 2019

work page 2019

[6] [6]

Planning to Explore via Self-Supervised World Models

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to Explore via Self-Supervised World Models. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 8583–8592. PMLR, July 2020

work page 2020

[7] [7]

Nicklas Hansen, Hao Su, and Xiaolong Wang

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, April 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2

work page doi:10.1038/s41586-025-08744-2 2025

[8] [8]

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018

[9] [9]

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2555–

work page

[10] [10]

Lillicrap, Mohammad Norouzi, and Jimmy Ba

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. InInternational Conference on Learning Representations, 2021

work page 2021

[11] [11]

Self-Supervised Exploration via Dis- agreement

Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-Supervised Exploration via Dis- agreement. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5062–5071. PMLR, June 2019

work page 2019

[12] [12]

Exploration via Elliptical Episodic Bonuses

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via Elliptical Episodic Bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

work page 2022

[13] [13]

A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs

Mikael Henaff, Minqi Jiang, and Roberta Raileanu. A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th Inter- national Conference on Machine Learning, volume 202 ofProceedings of Machine Learning ...

work page 2023

[14] [14]

Never Give Up: Learning Directed Exploration Strategies, February 2020

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andew Bolt, and Charles Blundell. Never Give Up: Learning Directed Exploration Strategies, February 2020

work page 2020

[15] [15]

Gonzalez, and Yuandong Tian

Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, and Yuandong Tian. NovelD: A Simple yet Effective Exploration Criterion. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. 10

work page 2021

[16] [16]

Exploration via Planning for Information about the Optimal Trajectory

Viraj Mehta, Ian Char, Joseph Abbate, Rory Conlin, Mark Boyer, Stefano Ermon, Jeff Schneider, and Willie Neiswanger. Exploration via Planning for Information about the Optimal Trajectory. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 28761–28775. Curran Ass...

work page 2022

[17] [17]

D. V . Lindley. On a measure of the information provided by an experiment.The Annals of Mathematical Statistics, 27(4):986–1005, 1956. doi: 10.1214/aoms/1177728069

work page doi:10.1214/aoms/1177728069 1956

[18] [18]

Unifying Count-Based Exploration and Intrinsic Motivation

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying Count-Based Exploration and Intrinsic Motivation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

work page 2016

[19] [19]

Episodic Curiosity through Reachability

Nikolay Savinov, Anton Raichuk, Damien Vincent, Raphael Marinier, Marc Pollefeys, Tim- othy Lillicrap, and Sylvain Gelly. Episodic Curiosity through Reachability. InInternational Conference on Learning Representations, 2019

work page 2019

[20] [20]

LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward

Daejin Jo, Sungwoong Kim, Daniel Nam, Taehwan Kwon, Seungeun Rho, Jongmin Kim, and Donghoon Lee. LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 30432–30445. Curran Associates, Inc., 2022

work page 2022

[21] [21]

Agent57: Outperforming the Atari Human Benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari Human Benchmark. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages...

work page 2020

[22] [22]

RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments

Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments. InInternational Conference on Learning Representa- tions, 2020

work page 2020

[23] [23]

Bayesian experimental design: A review.Statistical Science, 10(3):273–304, 1995

Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review.Statistical Science, 10(3):273–304, 1995. doi: 10.1214/ss/1177009939

work page doi:10.1214/ss/1177009939 1995

[24] [24]

Ivanova, and Freddie Bickford Smith

Tom Rainforth, Adam Foster, Desi R. Ivanova, and Freddie Bickford Smith. Modern bayesian experimental design.Statistical Science, 39(1):100–114, 2024. doi: 10.1214/23-STS915

work page doi:10.1214/23-sts915 2024

[25] [25]

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017

[26] [26]

Reprint of the 1993 original.https://doi.org/10.1137/1.9780898719109 MR2376769

Friedrich Pukelsheim.Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006. doi: 10.1137/1.9780898719109

work page doi:10.1137/1.9780898719109 2006

[27] [27]

Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo De Lazcano Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and J K Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. InThirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmar...

work page 2023

[28] [28]

OGBench: Benchmark- ing Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmark- ing Offline Goal-Conditioned RL. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025

[29] [29]

Behavior From the V oid: Unsupervised Active Pre-Training

Hao Liu and Pieter Abbeel. Behavior From the V oid: Unsupervised Active Pre-Training. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. 11

work page 2021

[30] [30]

Deep Reinforcement Learning at the Edge of the Statistical Precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Belle- mare. Deep Reinforcement Learning at the Edge of the Statistical Precipice. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc., 2021

work page 2021

[31] [31]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,ICLR (Poster), 2015

work page 2015

[32] [32]

Horn and Charles R

Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 1990. ISBN 0-521-38632-2

work page 1990

[33] [33]

Bellemare, Aäron van den Oord, and Rémi Munos

Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-Based Exploration with Neural Density Models. In Doina Precup and Yee Whye Teh, editors,Pro- ceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2721–2730. PMLR, August 2017

work page 2017

[34] [34]

Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning

Sam Lobel, Akhil Bagaria, and George Konidaris. Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learn...

work page 2023

[35] [35]

Exploration and Anti-Exploration with Distributional Random Network Distillation

Kai Yang, Jian Tao, Jiafei Lyu, and Xiu Li. Exploration and Anti-Exploration with Distributional Random Network Distillation. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of...

work page 2024

[36] [36]

How to Stay Curi- ous while avoiding Noisy TVs using Aleatoric Uncertainty Estimation

Augustine Mavor-Parker, Kimberly Young, Caswell Barry, and Lewis Griffin. How to Stay Curi- ous while avoiding Noisy TVs using Aleatoric Uncertainty Estimation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedin...

work page 2022

[37] [37]

State Entropy Maximization with Random Encoders for Efficient Exploration

Younggyo Seo, Lili Chen, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. State Entropy Maximization with Random Encoders for Efficient Exploration. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 9443–9454. PMLR, July 2021

work page 2021

[38] [38]

Rethinking Exploration in Reinforce- ment Learning with Effective Metric-Based Exploration Bonus

Yiming Wang, Kaiyan Zhao, Furui Liu, and Leong Hou U. Rethinking Exploration in Reinforce- ment Learning with Effective Metric-Based Exploration Bonus. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[39] [39]

VIME: Variational Information Maximizing Exploration

Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational Information Maximizing Exploration. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

work page 2016

[40] [40]

Model-Based Active Exploration

Pranav Shyam, Wojciech Ja´skowski, and Faustino Gomez. Model-Based Active Exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5779–5788. PMLR, June 2019

work page 2019

[41] [41]

Curiosity-Driven Exploration via Latent Bayesian Surprise.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (7):7752–7760, June 2022

Pietro Mazzaglia, Ozan Catal, Tim Verbelen, and Bart Dhoedt. Curiosity-Driven Exploration via Latent Bayesian Surprise.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (7):7752–7760, June 2022. doi: 10.1609/aaai.v36i7.20743

work page doi:10.1609/aaai.v36i7.20743 2022

[42] [42]

Episodic Novelty Through Temporal Distance

Yuhua Jiang, Qihan Liu, Yiqin Yang, Xiaoteng Ma, Dianyu Zhong, Hao Hu, Jun Yang, Bin Liang, Bo XU, Chongjie Zhang, and Qianchuan Zhao. Episodic Novelty Through Temporal Distance. InThe Thirteenth International Conference on Learning Representations, 2025. 12

work page 2025

[43] [43]

Go Beyond Imagination: Maximizing Episodic Reacha- bility with World Models, 2023

Yao Fu, Run Peng, and Honglak Lee. Go Beyond Imagination: Maximizing Episodic Reacha- bility with World Models, 2023

work page 2023

[44] [44]

Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments

Daochen Zha, Wenye Ma, Lei Yuan, Xia Hu, and Ji Liu. Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments. InInternational Conference on Learning Representations, 2021

work page 2021

[45] [45]

Gonzalez, and Stuart Russell

Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph E. Gonzalez, and Stuart Russell. MADE: Exploration via Maximizing Deviation from Explored Regions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021

work page 2021

[46] [46]

Zico Kolter, and Roberta Raileanu

Yiding Jiang, J. Zico Kolter, and Roberta Raileanu. On the Importance of Exploration for Generalization in Reinforcement Learning. InThirty-Seventh Conference on Neural Information Processing Systems, 2023

work page 2023

[47] [47]

A brief note on the Bayesian D-optimality criterion, 2023

Alen Alexanderian. A brief note on the Bayesian D-optimality criterion, 2023

work page 2023

[48] [48]

An experimental design perspective on model-based reinforcement learning

Viraj Mehta, Biswajit Paria, Jeff Schneider, Willie Neiswanger, and Stefano Ermon. An experimental design perspective on model-based reinforcement learning. InInternational Conference on Learning Representations, 2022

work page 2022

[49] [49]

Distractor

Alberto Caron, Vasilios Mavroudis, and Chris Hicks. On efficient bayesian exploration in model-based reinforcement learning.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. 13 Algorithm 1Conditional Information Gain reward (Algorithm referenced in §3.4). Require: Imagined latent rollout (s0, a0, . . . , sT−1 , aT−1 ); ensemble of one-step ...

work page 2025

[50] [50]

OGBench (https://github.com/seohongpark/ogbench) — MIT License

work page

[51] [51]

Limitations

MiniGrid (https://github.com/Farama-Foundation/Minigrid) — Apache License 2.0 B Theoretical Analysis B.1 Tightness of the Gaussian Entropy Bound (A3) Approximation A3 replaces the mixture entropy H(s 1:T ) with the entropy of the moment-matched Gaussian q=N( ¯µ,Σ) , where Σ =σ 2IT d +C . Because q shares the mean and covariance of the mixture p= 1 M P k N...

work page

[52] [52]

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page