
arxiv: 2605.19215 · v1 · pith:KEVMUIRRnew · submitted 2026-05-19 · 💻 cs.AI

Not all uncertainty is alike: volatility, stochasticity, and exploration

Pith reviewed 2026-05-20 06:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords exploration · volatility · stochasticity · uncertainty · bandits · reinforcement learning · decision making · Gittins index

The pith

Volatility boosts optimal exploration while stochasticity suppresses it despite both increasing uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that uncertainty from volatility and from stochasticity drive exploration in opposite directions. Volatility in latent reward states increases optimal exploration to keep up with changes. Stochasticity in observations decreases it because new data provides less reliable information. This is formally derived by extending the Gittins index to Gaussian state-space models with dynamics, yielding the CAUSE exploration bonus. The result suggests that exploration algorithms and biological decision processes should treat these uncertainty sources differently to perform well.
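The two noise sources can be made concrete with a scalar Kalman filter, the standard tracker for a Gaussian random-walk state observed through Gaussian noise. The sketch below is illustrative only. The function names, the parameter grid, and the use of v for the random-walk variance (volatility) and s for the observation-noise variance (stochasticity) are assumptions made here. It computes a learning rate and a posterior variance, not the paper's Gittins index or CAUSE bonus.

```python
def kalman_step(mean, var, obs, v, s):
    """One predict-and-update step of a scalar Kalman filter.

    v: variance of the latent random-walk step (volatility).
    s: variance of the observation noise (stochasticity).
    """
    prior_var = var + v                  # the latent state may have drifted
    gain = prior_var / (prior_var + s)   # learning rate applied to this outcome
    new_mean = mean + gain * (obs - mean)
    new_var = (1.0 - gain) * prior_var
    return new_mean, new_var, gain


def steady_state(v, s, n_steps=2000):
    """Iterate the variance recursion until it settles; return (gain, var)."""
    var, gain = 1.0, 0.0
    for _ in range(n_steps):
        # The variance recursion does not depend on the observed values.
        _, var, gain = kalman_step(0.0, var, 0.0, v, s)
    return gain, var


if __name__ == "__main__":
    # Hypothetical parameter grid: raise v alone, then raise s alone.
    for v, s in [(1.0, 9.0), (4.0, 9.0), (1.0, 25.0)]:
        gain, var = steady_state(v, s)
        print(f"v={v} s={s} learning rate={gain:.3f} posterior variance={var:.3f}")
```

Under these assumptions, raising either v or s raises the steady-state posterior variance, but the learning rate rises with v and falls with s. This is the learning-side counterpart of the exploration asymmetry: with high stochasticity, each new outcome moves the estimate less.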

Core claim

Volatility enhances optimal exploration while stochasticity suppresses it in latent reward environments. This asymmetry is established by extending the Gittins index to Gaussian state-space bandits with dynamics. The authors derive CAUSE, a closed-form exploration bonus from control-as-inference that follows the same pattern. CAUSE beats standard strategies in mixed-noise settings and improves on Gittins policies for restless bandits. Faulty noise inference can reverse exploration patterns.

What carries the argument

The extension of the Gittins index framework to Gaussian state-space bandits with latent dynamics that distinguishes volatility from stochasticity in computing exploration value.

Load-bearing premise

The environment is accurately described by a Gaussian state-space model with known latent dynamics allowing exact index derivations.

What would settle it

Run an optimal policy on a Gaussian bandit, increase volatility while holding stochasticity fixed, and check that the exploration rate rises. Then increase stochasticity while holding volatility fixed and check that the exploration rate falls.
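A measurement harness for such a test might look like the sketch below. Everything in it is an assumption made for illustration: the `bonus` callable interface, the definition of exploration rate as the fraction of non-greedy choices, the initial conditions, and the UCB-like stand-in policy. The optimal (Gittins) policy that the test calls for is not implemented here; it would be supplied through `bonus`.

```python
import random


def exploration_rate(v, s, bonus, n_arms=4, n_steps=200, seed=0):
    """Fraction of trials on which the chosen arm is not the greedy arm.

    v: random-walk variance of each latent reward (volatility).
    s: observation-noise variance (stochasticity).
    bonus: callable (posterior_var, v, s) -> float. Hypothetical interface
           for the policy under test; not taken from the paper.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    mean = [0.0] * n_arms
    var = [1.0] * n_arms
    non_greedy = 0
    for _ in range(n_steps):
        # Every arm drifts and its posterior widens, whether played or not.
        latent = [x + rng.gauss(0.0, v ** 0.5) for x in latent]
        var = [w + v for w in var]
        greedy = max(range(n_arms), key=lambda a: mean[a])
        choice = max(range(n_arms), key=lambda a: mean[a] + bonus(var[a], v, s))
        non_greedy += choice != greedy
        # Kalman update for the chosen arm only.
        outcome = latent[choice] + rng.gauss(0.0, s ** 0.5)
        gain = var[choice] / (var[choice] + s)
        mean[choice] += gain * (outcome - mean[choice])
        var[choice] *= 1.0 - gain
    return non_greedy / n_steps


def ucb_like_bonus(posterior_var, v, s, c=0.5):
    """Stand-in bonus that sees only total posterior uncertainty (assumed form)."""
    return c * posterior_var ** 0.5


if __name__ == "__main__":
    # Hypothetical grid: raise v with s fixed, then raise s with v fixed.
    for v, s in [(1.0, 9.0), (4.0, 9.0), (1.0, 25.0)]:
        print(v, s, exploration_rate(v, s, ucb_like_bonus))
```

The stand-in bonus ignores which noise source produced the uncertainty, so it is useful only as a baseline. The prediction concerns how the rate changes across the grid once an optimal or cause-aware bonus is plugged in.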

Figures

Figures reproduced from arXiv: 2605.19215 by Payam Piray.

Figure 1: Cumulative discounted regret over T = 200 steps in three regimes (K = 4, γ = 0.95, 1000 Monte Carlo runs). CAUSE achieves the lowest regret in all three regimes.
Adjacent text (Section 6.1, Regret in heterogeneous-arms restless bandits): Standard exploration strategies treat uncertainty as a single quantity and prescribe more exploration whenever uncertainty is high. The structural analysis of Section 4 predicts this is suboptimal…
Figure 2: Exploration bonus as a function of stochasticity s, normalized to a common range. CAUSE tracks the Gittins shape; UCB is insensitive to s.
Adjacent text: CAUSE is derived independently of the Gittins framework, via control-as-inference under an optimality constraint. Whether its closed form captures the structural shape of the optimal Gittins bonus is an empirical question. We compute the Gittins bonus by value iter…
Figure 3: Learning rate (top row) and CAUSE exploration bonus (bottom row) for healthy, …
Figure 4: Exploration bonus as a function of volatility
Figure 5: Cumulative discounted regret at v = 0 for K = 4 arms over T = 200 steps, γ = 0.95, 1000 Monte Carlo runs. Left: s ∈ {9, 25}. Right: s ∈ {9, 900}. CAUSE and Gittins-per-arm overlap within Monte Carlo precision in both configurations. At extreme stochasticity, CAUSE achieves 42.71±2.40 and Gittins-per-arm achieves 41.36±2.04 (mean ± SEM, 1000 runs); the difference of 1.35 is well within the combined sampling…
Figure 6: Cumulative discounted regret in the mixed regime across discount factors
Figure 7: Cumulative discounted regret in the mixed regime at …
Figure 8: UCB regret across the three regimes for c ∈ {0.5, 1, 2, 3}, with CAUSE shown for comparison (K = 4, T = 200, γ = 0.95, 1000 Monte Carlo runs).
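The Figure 5 comparison at extreme stochasticity can be checked with a few lines of arithmetic. The sketch below assumes that the two regret estimates are independent, so that the standard error of their difference is the root sum of squares of the two SEMs. The caption is truncated, so the paper's exact combination rule is not known. The variable names are illustrative.

```python
import math

# Values quoted in the Figure 5 caption (mean ± SEM over 1000 runs).
cause_mean, cause_sem = 42.71, 2.40
gittins_mean, gittins_sem = 41.36, 2.04

difference = cause_mean - gittins_mean
# Assumed rule: standard error of a difference of two independent means.
combined_sem = math.sqrt(cause_sem ** 2 + gittins_sem ** 2)
ratio = difference / combined_sem

print(f"difference   = {difference:.2f}")    # 1.35
print(f"combined SEM = {combined_sem:.2f}")  # 3.15
print(f"ratio        = {ratio:.2f}")         # 0.43
```

Under that assumption the difference is less than half of one combined standard error, which is consistent with the caption's statement that the two policies overlap within Monte Carlo precision.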
Original abstract

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces reversed rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that volatility (drifting latent reward states over time) and stochasticity (noisy observations) are distinct sources of uncertainty that drive optimal exploration in opposite directions: volatility enhances exploration while stochasticity suppresses it. This asymmetry is established formally by extending the Gittins index to Gaussian state-space bandits with latent dynamics and deriving a closed-form Cause-Aware Uncertainty-Sensitive Exploration (CAUSE) bonus via control-as-inference. CAUSE is shown to outperform standard exploration strategies and a Gittins-per-arm policy in restless settings with heterogeneous noise, with implications for learning under noise inference and computational psychiatry.

Significance. If the central claims hold, the work provides a significant formal distinction between types of uncertainty in adaptive decision-making, extending classical bandit theory to restless environments. The closed-form derivation, monotonicity results, and prediction of reversed exploration under pathological noise inference are notable strengths that could inform both algorithmic design in AI and models of exploration deficits in psychiatric conditions.

major comments (2)
  1. [Formal extension of Gittins index and CAUSE derivation] The extension of the Gittins index and the closed-form derivation of the CAUSE bonus both presuppose known Gaussian latent transition and emission dynamics so that the value function remains quadratic. This modeling choice is load-bearing for the directional claims on volatility versus stochasticity; the manuscript should explicitly test or discuss whether the asymmetry survives under unknown or non-Gaussian dynamics (as noted in the abstract's formal extension).
  2. [Comparison to Gittins-per-arm policy] The claim that a Gittins-per-arm policy's rested-bandit optimality does not transfer to restless settings is central to positioning CAUSE, yet the manuscript provides no explicit counter-example or derivation showing where the transfer fails when arms are coupled through shared latent dynamics.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym CAUSE without a brief parenthetical expansion on first use; adding this would improve readability for a broad audience.
  2. Notation for the state-space model (latent states, volatility parameter, stochasticity variance) could be accompanied by a simple diagram or table summarizing the roles of each noise source.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and positioning of our results. We address each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Formal extension of Gittins index and CAUSE derivation] The extension of the Gittins index and the closed-form derivation of the CAUSE bonus both presuppose known Gaussian latent transition and emission dynamics so that the value function remains quadratic. This modeling choice is load-bearing for the directional claims on volatility versus stochasticity; the manuscript should explicitly test or discuss whether the asymmetry survives under unknown or non-Gaussian dynamics (as noted in the abstract's formal extension).

    Authors: We agree that the Gaussian assumption with known dynamics is essential for preserving the quadratic value function and deriving the exact monotonicities and closed-form CAUSE bonus. This choice enables the precise separation of volatility and stochasticity effects that is the paper's core contribution. While the control-as-inference perspective underlying CAUSE suggests the directional asymmetry may generalize, we do not claim it holds universally. In the revised manuscript we will add an explicit limitations paragraph in the discussion that (i) states the Gaussian known-dynamics assumption, (ii) notes that non-Gaussian or unknown-dynamics cases would require approximate methods such as particle filters or variational inference, and (iii) cites related work on restless bandits under more general dynamics. We will also revise the abstract to make clear that the formal extension applies to the Gaussian state-space setting.
    revision: partial

  2. Referee: [Comparison to Gittins-per-arm policy] The claim that a Gittins-per-arm policy's rested-bandit optimality does not transfer to restless settings is central to positioning CAUSE, yet the manuscript provides no explicit counter-example or derivation showing where the transfer fails when arms are coupled through shared latent dynamics.

    Authors: The classical Gittins index is optimal only when each arm's latent state evolves independently of the others when not played. In our restless Gaussian state-space model the latent dynamics are shared (common volatility process), so that selecting one arm updates the joint posterior over all arms. This coupling means the per-arm Gittins indices, computed in isolation, ignore the cross-arm information gain and the resulting change in opportunity cost. We will insert a short subsection containing (a) a two-arm analytic counter-example in which the shared latent state causes the per-arm policy to select the wrong arm, and (b) a brief derivation showing that the index decomposition fails once the value function depends on the joint posterior rather than on independent marginals.
    revision: yes
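One ingredient of the rested versus restless distinction can be sketched in a few lines. In a rested bandit an unplayed arm is frozen, whereas under a Gaussian random walk its latent reward keeps drifting, so the uncertainty about it grows while it is ignored. The function name and the numbers below are illustrative assumptions, and the sketch does not reproduce the shared-latent-state counter-example promised in the response above.

```python
def unplayed_variance(var0, v, n_steps):
    """Posterior variance of an arm's latent reward after n_steps unplayed.

    Under a Gaussian random walk with step variance v, every skipped step
    adds v to the variance. With v = 0 the arm stays frozen, as in a
    rested bandit.
    """
    var = var0
    for _ in range(n_steps):
        var += v
    return var


if __name__ == "__main__":
    # Hypothetical starting variance and volatility levels.
    for v in (0.0, 1.0, 4.0):
        trace = [unplayed_variance(2.0, v, n) for n in (0, 5, 10)]
        print(f"v={v}: variance after 0, 5, 10 skipped steps = {trace}")
```

An index computed per arm as if the other arms stay frozen does not account for this growth, which is one reason rested-bandit optimality need not carry over to the restless setting.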

Circularity Check

0 steps flagged

No circularity: formal extension of Gittins index and CAUSE derivation are model-based and independent of the target result.

full rationale

The paper derives the volatility-stochasticity asymmetry by extending the Gittins index to Gaussian state-space bandits with latent dynamics and obtaining CAUSE as a closed-form bonus via control-as-inference. These steps presuppose the linear-Gaussian model class but do not define the monotonicities in terms of themselves or reduce a prediction to a fitted input by construction. No self-citation chain, ansatz smuggling, or renaming of known results is required for the central claim. The derivation remains self-contained against the stated model assumptions and external Gittins framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard modeling assumptions for bandits and introduces a derived exploration quantity without explicit free parameters or new postulated entities beyond the framework extension.

axioms (1)
  • domain assumption: Gaussian state-space model for latent reward dynamics
    Assumed to allow closed-form extension of Gittins indices and control-as-inference derivation.
invented entities (1)
  • CAUSE exploration bonus (no independent evidence)
    purpose: Closed-form term that adjusts exploration for the inferred cause of uncertainty
    Derived from the framework rather than postulated independently; no external falsifiable prediction stated in abstract.



