CIG: Exploration via Conditional Information Gain
Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3
The pith
Conditional Information Gain gives a scalable intrinsic reward for exploration that conditions on both lifetime experience and the current rollout.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and the rollout prefix, yet this remains intractable for deep models. The Conditional Information Gain (CIG) reward serves as a tractable surrogate through a log-determinant objective over an ensemble disagreement kernel. Its Cholesky factorization then produces causal per-step rewards that preserve both conditioning sets and scale to high-dimensional state spaces. The method is instantiated in a model-based setting and tested across twelve tasks in discrete and continuous control, including stochastic-distractor variants.
What carries the argument
The Conditional Information Gain (CIG) reward, defined as a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields per-step rewards that retain joint conditioning on replay buffer and rollout prefix.
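The log-determinant and Cholesky step described above can be checked numerically. The sketch below is not the paper's implementation. It assumes one scalar prediction per ensemble member per rollout step, a kernel of the form noise_var * I + C with C the ensemble covariance, and a reward normalized as 0.5 * log(L_tt^2 / noise_var). The function names and the value of noise_var are hypothetical.

```python
import numpy as np

NOISE_VAR = 1e-2  # hypothetical observation-noise variance


def disagreement_kernel(preds, noise_var=NOISE_VAR):
    """preds has shape (M, T): one scalar prediction per member per step.

    Returns the T x T matrix noise_var * I + C, where C is the covariance
    of the predictions across the M ensemble members (an assumed form).
    """
    centered = preds - preds.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / preds.shape[0]
    return noise_var * np.eye(preds.shape[1]) + cov


def per_step_rewards(kernel, noise_var=NOISE_VAR):
    """Causal rewards r_t = 0.5 * log(L_tt^2 / noise_var) from the Cholesky factor."""
    chol = np.linalg.cholesky(kernel)
    return 0.5 * np.log(np.diag(chol) ** 2 / noise_var)


rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 4))   # 5 hypothetical members, 4 rollout steps
preds[:, 3] = preds[:, 1]         # step 3 repeats the state-action pair of step 1

K = disagreement_kernel(preds)
rewards = per_step_rewards(K)

# Normalized trajectory-level objective: 0.5 * (log det K - T * log noise_var).
total = 0.5 * (np.linalg.slogdet(K)[1] - K.shape[0] * np.log(NOISE_VAR))
print("per-step rewards:", rewards)
print("sum of rewards:", rewards.sum(), "log-det objective:", total)
```

Under these assumptions the per-step rewards sum exactly to the normalized log-determinant, and a step that repeats an earlier state-action pair earns less than 0.5 * log 2, however strongly the members disagree there.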
If this is right
- CIG combines lifelong and episodic signals without needing heuristic weights or low-dimensional assumptions.
- The reward scales to high-dimensional state spaces where Gaussian-process dynamics models do not.
- It remains robust when stochastic distractors appear in the environment.
- Performance holds across both discrete grid tasks and continuous control benchmarks.
Where Pith is reading between the lines
- The same decomposition technique could be applied to other information measures in sequential decision problems.
- Testing CIG beyond the short-rollout, model-based setting would reveal how far the per-step approximation generalizes.
- Varying the ensemble size could serve as a direct way to study the accuracy of the disagreement kernel approximation.
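The ensemble-size study suggested in the last item can be prototyped on synthetic data. The sketch below is a hypothetical setup, not an experiment from the paper: it assumes the "true" disagreement covariance is known, draws M ensemble members from a Gaussian with that covariance, and measures how far the log-determinant objective computed from the empirical covariance is from the exact value. All names and constants are illustrative.

```python
import numpy as np


def ensemble_cov(preds):
    """Empirical T x T covariance across M members; preds has shape (M, T)."""
    centered = preds - preds.mean(axis=0, keepdims=True)
    return centered.T @ centered / preds.shape[0]


def logdet_objective(cov, noise_var):
    """log det(noise_var * I + cov)."""
    return np.linalg.slogdet(noise_var * np.eye(cov.shape[0]) + cov)[1]


rng = np.random.default_rng(0)
T, noise_var = 6, 0.1
A = rng.normal(size=(T, T))
true_cov = A @ A.T / T            # synthetic ground-truth covariance
exact = logdet_objective(true_cov, noise_var)

errors = {}
for M in (4, 32, 256, 4096):
    # Each row is one hypothetical ensemble member drawn from N(0, true_cov).
    preds = rng.normal(size=(M, T)) @ A.T / np.sqrt(T)
    errors[M] = abs(logdet_objective(ensemble_cov(preds), noise_var) - exact)

for M, err in errors.items():
    print(f"M={M:5d}  |log-det error|={err:.3f}")
```

With fewer members than rollout steps the empirical covariance is rank-deficient, so the noise term is what keeps the log-determinant finite in this toy setting.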
Load-bearing premise
The ensemble disagreement kernel approximates the true trajectory-level conditional information gain closely enough that the Cholesky decomposition preserves the joint conditioning without material loss.
What would settle it
Running the method on high-dimensional tasks and finding that CIG yields no exploration gain over prior methods, or that the resulting per-step rewards show low correlation with actual trajectory information gain, would falsify the claim.
Original abstract
Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Conditional Information Gain (CIG) as an intrinsic reward for exploration in RL. It derives a tractable surrogate for trajectory-level conditional information gain via a log-determinant objective over an ensemble disagreement kernel; Cholesky factorization then produces causal per-step rewards that simultaneously condition on the replay buffer and rollout prefix. The approach is instantiated in a model-based setting with short rollouts and evaluated on twelve tasks spanning MiniGrid (discrete) and OGBench (continuous control) in both clean and stochastic-distractor environments, where it outperforms or matches prior methods while remaining robust to distractors.
Significance. If the derivation holds and the Cholesky decomposition preserves joint conditioning with bounded error, the result would offer a scalable, principled hybrid of lifelong and episodic exploration signals that avoids heuristic weighting and extends beyond low-dimensional GP methods. The empirical scope across twelve tasks in discrete/continuous and clean/noisy settings is a concrete strength that supports practical utility.
major comments (2)
- [§3 (CIG Derivation)] The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.
- [§3.2 (Ensemble Disagreement Kernel)] The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.
minor comments (2)
- [Abstract and §4 (Experiments)] The claim of evaluation 'across twelve tasks' would benefit from an explicit list of environments, number of random seeds, and statistical significance tests to allow direct replication and assessment of robustness.
- [Notation] The distinction between lifetime (replay buffer) and episodic (rollout prefix) conditioning sets should be denoted consistently with subscripts or superscripts throughout the equations to improve readability.
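One way to act on the request for seeds and uncertainty estimates is to report a robust aggregate with a bootstrap confidence interval. The sketch below is only an illustration of that reporting style: the interquartile mean is one common choice, not necessarily the one the authors use, the trimming is approximate when the number of seeds is not divisible by four, and the scores are placeholder values rather than results from the paper.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: average of the scores left after dropping
    the lowest and highest quarter (rounded down)."""
    s = np.sort(np.asarray(scores, dtype=float))
    k = s.size // 4
    return s[k:s.size - k].mean()


def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, resampling seeds with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    idx = rng.integers(0, scores.size, size=(n_boot, scores.size))
    stats = np.array([stat(scores[row]) for row in idx])
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high


# Placeholder per-seed scores for one method on one task (not real data).
seed_scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
point = iqm(seed_scores)
low, high = bootstrap_ci(seed_scores)
print(f"IQM = {point:.2f}, 95% bootstrap CI = [{low:.2f}, {high:.2f}]")
```

Non-overlapping intervals between two methods give a rough indication of a real difference; a formal significance test would need to be chosen by the authors.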
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating the revisions we will make to strengthen the presentation of the CIG derivation and the ensemble kernel.
- Referee: [§3 (CIG Derivation)] The central claim that Cholesky factorization of the ensemble disagreement kernel yields per-step rewards retaining simultaneous conditioning on both the replay buffer and rollout prefix requires an explicit demonstration or error bound. The current presentation leaves open whether the kernel construction introduces low-rank or independence assumptions that produce material information loss, which is load-bearing for the tractability and faithfulness assertions.
  Authors: We agree that an explicit demonstration would improve clarity. In the revised manuscript we will add a detailed derivation in the appendix showing that the Cholesky factorization of the joint kernel matrix decomposes the log-determinant exactly into a sum of per-step conditional log-determinants. Because the kernel matrix is assembled from all points in the replay buffer together with the rollout prefix, the conditioning on both sets is retained by construction; no additional low-rank or independence assumptions are imposed beyond the positive-definiteness of the ensemble kernel. We will also include a brief error-bound discussion based on the fact that the Cholesky factors yield the exact conditional variances at each step, with any numerical error bounded by standard floating-point analysis.
  Revision: yes
- Referee: [§3.2 (Ensemble Disagreement Kernel)] The surrogate is defined via an ensemble disagreement kernel whose parameters are learned from data. The manuscript must show that the final reward expression remains independent of these fitted quantities; absent such verification, the information-gain interpretation risks circularity, as the exploration signal could reduce to quantities defined by the fitted model itself.
  Authors: We will clarify this point in the revision. The reward is the log-determinant of the kernel matrix whose entries are pairwise disagreements between ensemble members evaluated at the relevant state-action pairs. Once the kernel matrix is formed, the reward expression depends only on these matrix entries and not on the internal parameters of the individual ensemble members. In the revised §3.2 we will explicitly rewrite the reward formula in terms of the kernel matrix alone, thereby showing that the information-gain interpretation is preserved as a measure of predictive uncertainty reduction and is not circular with respect to the fitting procedure.
  Revision: yes
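The decomposition the authors refer to in their first response is a standard linear-algebra identity and can be written compactly. The following is a sketch, assuming K is the positive-definite joint kernel over replay-buffer points D and rollout points x_1, ..., x_T. The block notation and the symbol S are introduced here for illustration and are not taken from the paper.

```latex
\begin{align}
K &= \begin{pmatrix} K_{DD} & K_{DR} \\ K_{RD} & K_{RR} \end{pmatrix},
\qquad
S = K_{RR} - K_{RD} K_{DD}^{-1} K_{DR}, \\
\log\det K &= \log\det K_{DD} + \log\det S, \\
S &= L L^{\top}
\;\Rightarrow\;
\log\det S = \sum_{t=1}^{T} \log L_{tt}^{2}, \\
L_{tt}^{2} &= S_{tt} - S_{t,<t}\, S_{<t,<t}^{-1}\, S_{<t,t}
            = k(x_t, x_t \mid D, x_{<t}).
\end{align}
```

For t = 1 the prefix block is empty and L_{11}^2 = S_{11}. The term log det K_{DD} does not depend on the rollout, so only the sum over t varies with the imagined trajectory, and each summand is a conditional variance given both the buffer and the rollout prefix. Every quantity is a function of kernel entries alone, which is the point of the second response.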
Circularity Check
No significant circularity in derivation of CIG surrogate
The paper claims to derive the Conditional Information Gain reward as a log-determinant objective over an ensemble disagreement kernel, with Cholesky factorization yielding per-step rewards. This is presented as a mathematical construction of a tractable surrogate for the intractable trajectory-level conditional information gain. No equations or steps are shown that reduce the final reward expression to fitted parameters or prior self-citations by construction. The central claim remains a proposed approximation whose validity is evaluated empirically on external benchmarks (MiniGrid, OGBench tasks), making the derivation self-contained rather than tautological. No load-bearing self-citation chains or renamed known results are identified.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Trajectory-level information gain decomposes into per-step terms that condition simultaneously on the replay buffer and rollout prefix.
invented entities (1)
- Conditional Information Gain (CIG) reward: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Excerpts from the paper
Algorithm 1: Conditional Information Gain reward (referenced in §3.4). Require: imagined latent rollout $(s_0, a_0, \ldots, s_{T-1}, a_{T-1})$; ensemble of one-step ...
Licenses: OGBench (https://github.com/seohongpark/ogbench), MIT License; MiniGrid (https://github.com/Farama-Foundation/Minigrid), Apache License 2.0.
B Theoretical Analysis
B.1 Tightness of the Gaussian Entropy Bound (A3)
Approximation A3 replaces the mixture entropy $H(s_{1:T})$ with the entropy of the moment-matched Gaussian $q = \mathcal{N}(\bar{\mu}, \Sigma)$, where $\Sigma = \sigma^2 I_{Td} + C$. Because $q$ shares the mean and covariance of the mixture $p = \frac{1}{M} \sum_k \mathcal{N}$...