
arxiv: 2504.06355 · v2 · submitted 2025-04-08 · 💻 cs.LG

An Information-Geometric Approach to Artificial Curiosity

Pith reviewed 2026-05-22 19:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords artificial curiosity, intrinsic rewards, information geometry, reinforcement learning, exploration, occupancy measure, count-based exploration

The pith

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that artificial curiosity in sparse-reward reinforcement learning can be placed on a firmer footing by applying principles from information geometry. It shows that the dual requirements of information monotonicity and invariance under agent-environment interactions force intrinsic rewards to take the form of strictly concave functions of the reciprocal occupancy measure. Requiring in addition a coherent way to balance exploration against exploitation further restricts the rewards to a one-parameter family obtained by geodesic interpolation on the occupancy manifold. A reader should care because this derivation replaces heuristic choices of curiosity bonuses with a small set of candidates that recover familiar methods as special cases.

Core claim

Leveraging information monotonicity and invariance under the agent-environment interaction, the authors show that intrinsic rewards are uniquely constrained to strictly concave functions of the reciprocal occupancy. When these rewards must also support a principled exploration-exploitation trade-off via information geodesic interpolation on the occupancy manifold, the candidates reduce to a one-parameter family. Special values of the parameter recover count-based exploration and maximum-entropy exploration.
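The geodesic interpolation named in the core claim can be sketched numerically. This is a minimal sketch, assuming the standard α-geodesic form from information geometry, γ(t) ∝ ((1 − t) p^k + t q^k)^(1/k) with k = (1 − α)/2, renormalized to sum to one. The function name, the three-state distributions, and the treatment of α = 1 as the limiting geometric mixture are illustrative assumptions, not the authors' code.

```python
import math

def alpha_geodesic(p, q, t, alpha):
    """Point at time t on the normalized alpha-geodesic from p to q.

    Assumed form: gamma(t) is proportional to ((1-t) p^k + t q^k)^(1/k)
    with k = (1 - alpha) / 2. All entries of p and q must be strictly positive.
    """
    k = (1.0 - alpha) / 2.0
    if abs(k) < 1e-12:
        # alpha = 1: the k -> 0 limit is the geometric mixture p^(1-t) q^t.
        raw = [pi ** (1.0 - t) * qi ** t for pi, qi in zip(p, q)]
    else:
        raw = [((1.0 - t) * pi ** k + t * qi ** k) ** (1.0 / k)
               for pi, qi in zip(p, q)]
    total = sum(raw)  # normalizing factor, so the result sums to one
    return [r / total for r in raw]

# Two hypothetical occupancy distributions over three states.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

mixture = alpha_geodesic(p, q, 0.5, alpha=-1.0)   # k = 1: linear mixture
midpoint = alpha_geodesic(p, q, 0.5, alpha=0.0)   # k = 1/2: square-root mixing
```

With α = −1 the path reduces to the ordinary linear mixture of the two occupancies, which is consistent with the flat geometry that the Figure 1 caption associates with that value. Other values of α bend the path between the same two endpoints.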

What carries the argument

Information monotonicity together with invariance under agent-environment interaction, which pins intrinsic rewards to strictly concave functions of the reciprocal occupancy on the occupancy manifold.

If this is right

  • All valid intrinsic rewards for artificial curiosity share the mathematical form of a strictly concave function applied to the reciprocal occupancy.
  • Exploration and exploitation trade off by moving along an information geodesic controlled by a single scalar parameter.
  • Count-based exploration corresponds to one specific value of the scalar parameter.
  • Maximum-entropy exploration corresponds to another specific value of the scalar parameter.
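The consequences listed above can be made concrete with a small sketch. The family below is a hypothetical parametrization, f_α(x) = (2 / (1 + α)) (x^((1 + α)/2) − 1) applied to the reciprocal occupancy x = 1/p, chosen only because it is strictly concave in x for α < 1 and reproduces an inverse-square-root bonus at α = 0 and −log p at α = −1, the two values named in the Figure 1 caption. The paper's own definition may differ in constants or form, and all function names here are assumptions.

```python
import math

def f_alpha(x, alpha):
    """Hypothetical one-parameter family of concave functions of x = 1/p."""
    c = (1.0 + alpha) / 2.0
    if abs(c) < 1e-12:
        return math.log(x)          # alpha -> -1 limit
    return (x ** c - 1.0) / c       # equals (2 / (1 + alpha)) * (x**c - 1)

def intrinsic_reward(occupancy, alpha):
    """Reward for a state visited with probability `occupancy`."""
    return f_alpha(1.0 / occupancy, alpha)

def is_strictly_midpoint_concave(f, xs):
    """Check f((a+b)/2) > (f(a)+f(b))/2 for every distinct pair in xs."""
    return all(
        f((a + b) / 2.0) > (f(a) + f(b)) / 2.0
        for i, a in enumerate(xs) for b in xs[i + 1:]
    )

occupancy = [0.5, 0.3, 0.15, 0.05]
count_like = [intrinsic_reward(p, 0.0) for p in occupancy]     # 2 * (p**-0.5 - 1)
entropy_like = [intrinsic_reward(p, -1.0) for p in occupancy]  # -log p

xs = [1.5, 2.0, 4.0, 10.0, 50.0]
concave_ok = all(
    is_strictly_midpoint_concave(lambda x, a=a: f_alpha(x, a), xs)
    for a in (-3.0, -1.0, 0.0, 0.5)
)
```

In both special cases rarely visited states receive larger rewards, and the midpoint test passes for each sampled α below 1. The midpoint test is a numerical spot check on a handful of points, not a proof of concavity.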

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rewards that explicitly depend on a chosen state encoding would violate the invariance and could produce inconsistent exploration across equivalent representations of the same environment.
  • Different choices of strictly concave function within the allowed family could be tested empirically to discover new exploration bonuses with desirable properties.
  • The same geometric construction might be applied in continuous state spaces once suitable occupancy measures and geodesics are defined.
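The first extension above can be illustrated with the simplest possible re-representation, a relabeling of states. This is a sketch under stated assumptions: the log-reciprocal reward stands in for any function of occupancy alone, and `label_bonus` is a deliberately encoding-dependent bonus invented for contrast. The invariance the paper requires is broader than permutation of labels.

```python
import math

def expected_occupancy_reward(occupancy):
    """Expected reward when each state's reward is log(1/p), a function of occupancy only."""
    return sum(p * math.log(1.0 / p) for p in occupancy)

def label_bonus(occupancy):
    """Encoding-dependent contrast: the reward of a state is its integer label."""
    return sum(p * index for index, p in enumerate(occupancy))

occupancy = [0.5, 0.3, 0.2]
# Same environment and same visitation statistics, with the states renamed.
relabeled = [occupancy[i] for i in (2, 0, 1)]

invariant_gap = abs(expected_occupancy_reward(occupancy)
                    - expected_occupancy_reward(relabeled))
encoding_gap = abs(label_bonus(occupancy) - label_bonus(relabeled))
```

The occupancy-based value is unchanged by the renaming, while the label-based bonus changes, so an agent optimizing the latter would explore differently in two descriptions of the same environment.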

Load-bearing premise

Intrinsic rewards are required to be representation-agnostic and to depend only on the agent's information about the environment.

What would settle it

Constructing or exhibiting an intrinsic reward function that is not strictly concave in the reciprocal occupancy yet still obeys both information monotonicity and invariance under agent-environment interaction would disprove the claimed uniqueness.

Figures

Figures reproduced from arXiv: 2504.06355 by Alexander Nedergaard, Pablo A. Morales.

Figure 1: Artificial curiosity with α-information rewards on the curved occupancy manifold. (Top) The Amari-Čencov tensor constant α ∈ R encodes the occupancy manifold curvature (red–spherical, blue–flat, green–hyperbolic). Count-based exploration corresponds to the Riemannian geometry with α = 0, and maximum entropy exploration to the flat geometry with α = −1 (Theorem 3.3). (Bottom) The intrinsic rewards sca…
Original abstract

Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration, however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to its representation -- an invariance central to information geometry. Leveraging this, we show that information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring these rewards to yield a principled exploration-exploitation trade-off, via information geodesic interpolation on the occupancy manifold, effectively limits the candidates to those determined by a scalar parameter. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration. This framework provides important constraints to the engineering of intrinsic rewards while integrating foundational exploration methods into a single, cohesive model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an information-geometric framework for artificial curiosity in sparse-reward RL. It claims that information monotonicity together with invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring a principled exploration-exploitation trade-off via information-geodesic interpolation on the occupancy manifold then reduces the family to a one-parameter model whose special values recover count-based and maximum-entropy exploration.

Significance. If the uniqueness result can be placed on a fully rigorous footing, the work would supply a principled geometric unification of several existing exploration heuristics and useful constraints on the design of intrinsic rewards. The recovery of known methods as special cases of the scalar parameter is a constructive feature, though the framework largely selects among previously published strategies rather than generating new falsifiable predictions.

major comments (3)
  1. Abstract: the central uniqueness claim states that 'information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy,' yet no explicit functional equation or axiom set is supplied that defines how the interaction map acts on the information measure or reward functional. Without this, it is impossible to confirm that the constraint excludes non-concave or non-reciprocal candidates.
  2. Derivation of the one-parameter family (via geodesic interpolation): the reduction to a scalar parameter inherits the same ambiguity; the manuscript must state any regularity conditions imposed on the occupancy manifold and show that the geodesic step does not introduce hidden assumptions that weaken the uniqueness result.
  3. Representation-agnostic premise (abstract, paragraph 3): the assumption that intrinsic rewards must depend solely on the agent's information and remain representation-agnostic is taken as a direct consequence of information geometry, but the precise invariance property is not formalized before the derivation; this is load-bearing for the uniqueness conclusion.
minor comments (2)
  1. Abstract: the term 'reciprocal occupancy' is used without a brief inline definition or reference to its standard definition in the literature; adding one sentence would improve accessibility.
  2. The manuscript introduces the 'occupancy manifold' as a central object; a short paragraph clarifying its construction from the occupancy measure and its relation to existing information-geometric structures would aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments correctly identify areas where additional formalization will strengthen the presentation. We address each major comment below and will revise the manuscript accordingly to make the axiomatic foundations explicit while preserving the core information-geometric results.

Point-by-point responses
  1. Referee: Abstract: the central uniqueness claim states that 'information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy,' yet no explicit functional equation or axiom set is supplied that defines how the interaction map acts on the information measure or reward functional. Without this, it is impossible to confirm that the constraint excludes non-concave or non-reciprocal candidates.

    Authors: The abstract necessarily condenses the result. In the body (Section 3), information monotonicity is defined via the data-processing inequality on the chosen divergence, and invariance under the agent-environment interaction is stated as invariance of the reward functional under the push-forward of the occupancy measure. To address the concern directly, we will insert a new subsection (2.3) that states the two axioms as explicit functional equations before deriving the uniqueness theorem. This will make the exclusion of non-concave candidates fully verifiable from the axioms alone. revision: yes

  2. Referee: Derivation of the one-parameter family (via geodesic interpolation): the reduction to a scalar parameter inherits the same ambiguity; the manuscript must state any regularity conditions imposed on the occupancy manifold and show that the geodesic step does not introduce hidden assumptions that weaken the uniqueness result.

    Authors: The occupancy manifold is the simplex equipped with the Fisher-Rao metric; geodesics are the standard information-geometric interpolations between occupancy measures. Regularity assumptions are smoothness of the occupancy functions and a compact state space ensuring geodesic existence. The interpolation step is applied only after the concavity constraint has been established and does not relax it. We will add an explicit paragraph in Section 4 listing these conditions and a short verification that the interpolated family remains strictly concave for all admissible parameter values. revision: yes

  3. Referee: Representation-agnostic premise (abstract, paragraph 3): the assumption that intrinsic rewards must depend solely on the agent's information and remain representation-agnostic is taken as a direct consequence of information geometry, but the precise invariance property is not formalized before the derivation; this is load-bearing for the uniqueness conclusion.

    Authors: The representation-agnostic property follows from the invariance of the information measure under sufficient statistics, which is a standard axiom in information geometry. While motivated in the introduction, we agree that a self-contained statement of this invariance axiom should precede the main derivation. We will restructure Section 2 to list all three axioms (monotonicity, interaction invariance, and representation invariance) before the uniqueness theorem, thereby clarifying the logical order. revision: yes
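The data-processing inequality invoked in the first response can be checked numerically on a toy case. This is a minimal sketch, assuming the Kullback-Leibler divergence as a stand-in for the paper's chosen divergence and using two hypothetical row-stochastic kernels, one that merges states and one that adds noise. None of the names or numbers come from the paper.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def push_forward(kernel, p):
    """Apply a row-stochastic kernel: out[j] = sum_i p[i] * kernel[i][j]."""
    return [sum(p[i] * kernel[i][j] for i in range(len(p)))
            for j in range(len(kernel[0]))]

p = [0.6, 0.3, 0.1]
q = [0.2, 0.3, 0.5]

merge_last_two = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # deterministic coarse-graining
noisy = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]           # stochastic processing

before = kl(p, q)
after_merge = kl(push_forward(merge_last_two, p), push_forward(merge_last_two, q))
after_noisy = kl(push_forward(noisy, p), push_forward(noisy, q))
```

Both processed divergences come out smaller than the original one, which is the monotonicity property in its usual form: processing two distributions through the same channel cannot make them easier to tell apart.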

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on external information-geometry axioms.

Full rationale

The paper invokes information monotonicity and invariance under agent-environment interaction as central properties of information geometry to constrain intrinsic rewards to strictly concave functions of reciprocal occupancy. These axioms are presented as independent inputs rather than defined in terms of the target reward form. The subsequent geodesic interpolation step introduces a scalar parameter whose special values recover count-based and maximum-entropy methods; this is an integration of existing strategies, not a statistical fit or self-definition that forces the result by construction. No load-bearing self-citations, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation are evident. The framework remains self-contained against the stated monotonicity and invariance premises.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The claim rests on two domain assumptions from information geometry and introduces one free scalar parameter plus the occupancy manifold as a geometric construct.

free parameters (1)
  • scalar parameter
    Single adjustable value that selects the specific concave function within the derived family; special values recover count-based and maximum-entropy exploration.
axioms (2)
  • Information monotonicity (domain assumption)
    Processing or coarse-graining the agent's information about the environment must not increase the information measure behind the reward, in the sense of the data-processing inequality (abstract).
  • Invariance under agent-environment interaction (domain assumption)
    Intrinsic reward must remain unchanged under re-representations of the same information (abstract).
invented entities (1)
  • occupancy manifold (no independent evidence)
    purpose: Geometric space on which information geodesic interpolation is performed to enforce exploration-exploitation trade-off.
    Introduced to carry out the interpolation step that reduces the reward family to a scalar parameter.

pith-pipeline@v0.9.0 · 5668 in / 1371 out tokens · 125159 ms · 2026-05-22T19:45:11.349727+00:00 · methodology


