Not all uncertainty is alike: volatility, stochasticity, and exploration

Payam Piray

arxiv: 2605.19215 · v1 · pith:KEVMUIRRnew · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-20 06:38 UTC · model grok-4.3

classification 💻 cs.AI

keywords exploration · volatility · stochasticity · uncertainty · bandits · reinforcement learning · decision making · Gittins index

The pith

Volatility boosts optimal exploration while stochasticity suppresses it despite both increasing uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that uncertainty from volatility and from stochasticity drive exploration in opposite directions. Volatility in latent reward states increases optimal exploration to keep up with changes. Stochasticity in observations decreases it because new data provides less reliable information. This is formally derived by extending the Gittins index to Gaussian state-space models with dynamics, yielding the CAUSE exploration bonus. The result suggests that exploration algorithms and biological decision processes should treat these uncertainty sources differently to perform well.
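The shared starting point, that both noise sources raise posterior uncertainty while pulling learning in opposite directions, can be illustrated with a scalar Kalman filter. This is a minimal sketch, not the paper's Gittins or CAUSE computation. It assumes a random-walk latent state with drift variance `v` (volatility) and observation-noise variance `s` (stochasticity); the function names and parameter values are hypothetical.

```python
def kalman_step(mean, var, obs, v, s):
    """One update of a scalar random-walk Kalman filter.

    v: volatility, variance of the latent drift per step (assumed model)
    s: stochasticity, variance of the observation noise (assumed model)
    """
    pred_var = var + v                     # the latent state drifts before we observe
    gain = pred_var / (pred_var + s)       # learning rate (Kalman gain)
    new_mean = mean + gain * (obs - mean)  # move the estimate toward the observation
    new_var = (1.0 - gain) * pred_var      # posterior variance after observing
    return new_mean, new_var, gain


def steady_state(v, s, n_steps=500):
    """Iterate the variance recursion until it settles; return (variance, gain)."""
    var, gain = 1.0, 0.0
    for _ in range(n_steps):
        # The variance and gain do not depend on the observed values.
        _, var, gain = kalman_step(0.0, var, 0.0, v, s)
    return var, gain


base_var, base_gain = steady_state(v=1.0, s=4.0)
vol_var, vol_gain = steady_state(v=4.0, s=4.0)   # more volatility
sto_var, sto_gain = steady_state(v=1.0, s=16.0)  # more stochasticity

print(f"baseline:        var={base_var:.3f} gain={base_gain:.3f}")
print(f"more volatile:   var={vol_var:.3f} gain={vol_gain:.3f}")
print(f"more stochastic: var={sto_var:.3f} gain={sto_gain:.3f}")
```

Under these assumptions, raising either `v` or `s` increases the steady-state posterior variance, but the learning rate rises with `v` and falls with `s`. The abstract states that exploration follows the same asymmetry; the sketch shows only the learning side.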

Core claim

Volatility enhances optimal exploration while stochasticity suppresses it in latent reward environments. This asymmetry is established by extending the Gittins index to Gaussian state-space bandits with dynamics. The authors derive CAUSE, a closed-form exploration bonus from control-as-inference that follows the same pattern. CAUSE beats standard strategies in mixed-noise settings and improves on Gittins policies for restless bandits. Faulty noise inference can reverse exploration patterns.
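To see why a strategy that treats uncertainty as a single quantity cannot reproduce this asymmetry, consider a generic UCB-style bonus proportional to the posterior standard deviation. The sketch below is an illustration under assumptions, not the baseline implemented in the paper: it assumes a scalar random-walk Kalman tracker, uses the closed-form steady-state posterior variance of that filter, and the names `steady_state_posterior_var` and `ucb_bonus` are hypothetical.

```python
import math


def steady_state_posterior_var(v, s):
    """Steady-state posterior variance of a scalar random-walk Kalman filter.

    Assumed model: latent drift variance v, observation-noise variance s.
    This is the positive root of P = (P + v) * s / (P + v + s).
    """
    return (math.sqrt(v * v + 4.0 * v * s) - v) / 2.0


def ucb_bonus(posterior_var, c=1.0):
    """Generic UCB-style bonus: c times the posterior standard deviation."""
    return c * math.sqrt(posterior_var)


baseline = ucb_bonus(steady_state_posterior_var(v=1.0, s=4.0))
more_volatile = ucb_bonus(steady_state_posterior_var(v=4.0, s=4.0))
more_noisy = ucb_bonus(steady_state_posterior_var(v=1.0, s=16.0))

print(f"baseline={baseline:.3f} more_volatile={more_volatile:.3f} more_noisy={more_noisy:.3f}")
```

In this sketch the bonus grows when either noise source grows, because `v` and `s` enter only through the variance they induce. For a fixed posterior variance the bonus ignores `s` altogether.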

What carries the argument

The extension of the Gittins index framework to Gaussian state-space bandits with latent dynamics that distinguishes volatility from stochasticity in computing exploration value.

Load-bearing premise

The environment is accurately described by a Gaussian state-space model with known latent dynamics allowing exact index derivations.

What would settle it

Running an optimal policy on a Gaussian bandit in which volatility is increased while stochasticity is held fixed, and verifying that the exploration rate rises; conversely, increasing stochasticity while holding volatility fixed, and verifying that the exploration rate falls.

Figures

Figures reproduced from arXiv: 2605.19215 by Payam Piray.

**Figure 1.** Cumulative discounted regret over T = 200 steps in three regimes (K = 4, γ = 0.95, 1000 Monte Carlo runs). CAUSE achieves the lowest regret in all three regimes. Accompanying text (Section 6.1, Regret in heterogeneous-arms restless bandits): Standard exploration strategies treat uncertainty as a single quantity and prescribe more exploration whenever uncertainty is high. The structural analysis of Section 4 predicts this is suboptimal…

**Figure 2.** Exploration bonus as a function of stochasticity s, normalized to a common range. CAUSE tracks the Gittins shape; UCB is insensitive to s. CAUSE is derived independently of the Gittins framework, via control-as-inference under an optimality constraint. Whether its closed form captures the structural shape of the optimal Gittins bonus is an empirical question. We compute the Gittins bonus by value iter…

**Figure 3.** Learning rate (top row) and CAUSE exploration bonus (bottom row) for healthy, …

**Figure 4.** Exploration bonus as a function of volatility

**Figure 5.** Cumulative discounted regret at v = 0 for K = 4 arms over T = 200 steps, γ = 0.95, 1000 Monte Carlo runs. Left: s ∈ {9, 25}. Right: s ∈ {9, 900}. CAUSE and Gittins-per-arm overlap within Monte Carlo precision in both configurations. At extreme stochasticity, CAUSE achieves 42.71 ± 2.40 and Gittins-per-arm achieves 41.36 ± 2.04 (mean ± SEM, 1000 runs); the difference of 1.35 is well within the combined sampling…

**Figure 6.** Cumulative discounted regret in the mixed regime across discount factors

**Figure 7.** Cumulative discounted regret in the mixed regime at …

**Figure 8.** UCB regret across the three regimes for c ∈ {0.5, 1, 2, 3}, with CAUSE shown for comparison (K = 4, T = 200, γ = 0.95, 1000 Monte Carlo runs).
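The regret figures above report cumulative discounted regret and compare policies by mean ± SEM. The sketch below shows one way such numbers could be computed and compared. The regret definition used here (discounted gap between the best latent mean and the chosen arm's latent mean at each step) is an assumption, since the paper's exact definition is not reproduced on this page. The toy inputs are hypothetical, and the independence of the two sets of Monte Carlo runs is also assumed.

```python
import math


def discounted_regret(true_means, choices, gamma=0.95):
    """Cumulative discounted regret of a sequence of arm choices.

    true_means[t][k] is the latent mean reward of arm k at step t;
    choices[t] is the index of the arm pulled at step t.
    """
    total = 0.0
    for t, (means, arm) in enumerate(zip(true_means, choices)):
        total += gamma ** t * (max(means) - means[arm])
    return total


# Toy check with two arms over three steps (hypothetical numbers).
toy_means = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_regret = discounted_regret(toy_means, choices=[0, 1, 0], gamma=0.5)

# Values quoted in the Figure 5 caption (mean and SEM over 1000 runs).
cause_mean, cause_sem = 42.71, 2.40
gittins_mean, gittins_sem = 41.36, 2.04

diff = cause_mean - gittins_mean
# Standard error of the difference, assuming the two sets of runs are independent.
diff_sem = math.sqrt(cause_sem ** 2 + gittins_sem ** 2)
z_score = diff / diff_sem

print(f"toy regret = {toy_regret:.2f}")
print(f"difference = {diff:.2f}, SEM of difference = {diff_sem:.2f}, z = {z_score:.2f}")
```

With the caption's numbers the difference is well under one standard error of the difference, which is consistent with the caption's statement that the two policies overlap within Monte Carlo precision.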

Original abstract

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces reversed rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Volatility and stochasticity have opposite effects on exploration in Gaussian state-space bandits, derived cleanly but tied tightly to known linear dynamics.

The main thing is that volatility increases optimal exploration while stochasticity decreases it, at least inside this model. They show the asymmetry by extending the Gittins index to restless bandits with latent Gaussian dynamics and then pulling a closed-form CAUSE bonus out of control-as-inference. That derivation step is the actual contribution, and it looks internally consistent on the math they present. The simulations apparently beat both standard bonuses and a per-arm Gittins policy that does not carry over to drifting settings, which is a reasonable check. The shared asymmetry between learning and exploration is a nice unifying observation too.

The soft spot is exactly what the stress-test note flags: the whole argument needs the latent transition and emission to be known and Gaussian so the value function stays quadratic and the index stays closed-form. Outside that class the monotonicities are not established, yet the paper frames the result as a general distinction between two sources of uncertainty rather than a property of one linear-Gaussian restless bandit. If the full text has any robustness checks or extensions to unknown dynamics, they would strengthen the case; from the abstract it reads as model-specific.

This is worth sending to referees who work on bandit theory or RL exploration. The formal part has enough substance to benefit from detailed review even if the broader claims need tightening. Readers in computational psychiatry might also find the reversed-exploration prediction useful as a hypothesis, though that part is still thin.

Referee Report

2 major / 2 minor

Summary. The paper claims that volatility (drifting latent reward states over time) and stochasticity (noisy observations) are distinct sources of uncertainty that drive optimal exploration in opposite directions: volatility enhances exploration while stochasticity suppresses it. This asymmetry is established formally by extending the Gittins index to Gaussian state-space bandits with latent dynamics and deriving a closed-form Cause-Aware Uncertainty-Sensitive Exploration (CAUSE) bonus via control-as-inference. CAUSE is shown to outperform standard exploration strategies and a Gittins-per-arm policy in restless settings with heterogeneous noise, with implications for learning under noise inference and computational psychiatry.

Significance. If the central claims hold, the work provides a significant formal distinction between types of uncertainty in adaptive decision-making, extending classical bandit theory to restless environments. The closed-form derivation, monotonicity results, and prediction of reversed exploration under pathological noise inference are notable strengths that could inform both algorithmic design in AI and models of exploration deficits in psychiatric conditions.

major comments (2)

[Formal extension of Gittins index and CAUSE derivation] The extension of the Gittins index and the closed-form derivation of the CAUSE bonus both presuppose known Gaussian latent transition and emission dynamics so that the value function remains quadratic. This modeling choice is load-bearing for the directional claims on volatility versus stochasticity; the manuscript should explicitly test or discuss whether the asymmetry survives under unknown or non-Gaussian dynamics (as noted in the abstract's formal extension).
[Comparison to Gittins-per-arm policy] The claim that a Gittins-per-arm policy's rested-bandit optimality does not transfer to restless settings is central to positioning CAUSE, yet the manuscript provides no explicit counter-example or derivation showing where the transfer fails when arms are coupled through shared latent dynamics.

minor comments (2)

[Abstract] The abstract introduces the acronym CAUSE without a brief parenthetical expansion on first use; adding this would improve readability for a broad audience.
Notation for the state-space model (latent states, volatility parameter, stochasticity variance) could be accompanied by a simple diagram or table summarizing the roles of each noise source.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and positioning of our results. We address each major comment below and indicate the planned revisions.

Referee: [Formal extension of Gittins index and CAUSE derivation] The extension of the Gittins index and the closed-form derivation of the CAUSE bonus both presuppose known Gaussian latent transition and emission dynamics so that the value function remains quadratic. This modeling choice is load-bearing for the directional claims on volatility versus stochasticity; the manuscript should explicitly test or discuss whether the asymmetry survives under unknown or non-Gaussian dynamics (as noted in the abstract's formal extension).

Authors: We agree that the Gaussian assumption with known dynamics is essential for preserving the quadratic value function and deriving the exact monotonicities and closed-form CAUSE bonus. This choice enables the precise separation of volatility and stochasticity effects that is the paper's core contribution. While the control-as-inference perspective underlying CAUSE suggests the directional asymmetry may generalize, we do not claim it holds universally. In the revised manuscript we will add an explicit limitations paragraph in the discussion that (i) states the Gaussian known-dynamics assumption, (ii) notes that non-Gaussian or unknown-dynamics cases would require approximate methods such as particle filters or variational inference, and (iii) cites related work on restless bandits under more general dynamics. We will also revise the abstract to make clear that the formal extension applies to the Gaussian state-space setting. revision: partial
Referee: [Comparison to Gittins-per-arm policy] The claim that a Gittins-per-arm policy's rested-bandit optimality does not transfer to restless settings is central to positioning CAUSE, yet the manuscript provides no explicit counter-example or derivation showing where the transfer fails when arms are coupled through shared latent dynamics.

Authors: The classical Gittins index is optimal only when each arm's latent state evolves independently of the others when not played. In our restless Gaussian state-space model the latent dynamics are shared (common volatility process), so that selecting one arm updates the joint posterior over all arms. This coupling means the per-arm Gittins indices, computed in isolation, ignore the cross-arm information gain and the resulting change in opportunity cost. We will insert a short subsection containing (a) a two-arm analytic counter-example in which the shared latent state causes the per-arm policy to select the wrong arm, and (b) a brief derivation showing that the index decomposition fails once the value function depends on the joint posterior rather than on independent marginals. revision: yes
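The coupling argument in this response can be made concrete with a standard Gaussian conditioning step. The sketch below is an illustration only: it assumes a joint Gaussian belief over two arm values and a noisy observation of one arm, and the correlation of 0.8 is a hypothetical number, not a value from the paper. It shows that pulling one arm leaves the other arm's belief untouched when the arms are independent, and shifts it when they are correlated.

```python
def observe_arm(mean, cov, arm, obs, s):
    """Condition a joint Gaussian belief over arm values on one noisy pull.

    mean: list of current means, one per arm
    cov:  covariance matrix as a list of lists
    arm:  index of the pulled arm
    obs:  observed reward, modeled as value[arm] plus N(0, s) noise
    """
    n = len(mean)
    innovation_var = cov[arm][arm] + s
    # Each arm's gain is its covariance with the pulled arm over the innovation variance.
    gain = [cov[i][arm] / innovation_var for i in range(n)]
    error = obs - mean[arm]
    new_mean = [mean[i] + gain[i] * error for i in range(n)]
    new_cov = [[cov[i][j] - gain[i] * cov[arm][j] for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov


prior_mean = [0.0, 0.0]
independent = [[1.0, 0.0], [0.0, 1.0]]
coupled = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical correlation between arms

m_ind, c_ind = observe_arm(prior_mean, independent, arm=0, obs=2.0, s=1.0)
m_cpl, c_cpl = observe_arm(prior_mean, coupled, arm=0, obs=2.0, s=1.0)

print("independent arms:", m_ind, c_ind[1][1])
print("coupled arms:    ", m_cpl, c_cpl[1][1])
```

A per-arm index computed from each arm's marginal belief in isolation has no way to account for the change in the second arm's mean and variance in the coupled case, which is the failure mode the response describes.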

Circularity Check

0 steps flagged

No circularity: formal extension of Gittins index and CAUSE derivation are model-based and independent of the target result.

The paper derives the volatility-stochasticity asymmetry by extending the Gittins index to Gaussian state-space bandits with latent dynamics and obtaining CAUSE as a closed-form bonus via control-as-inference. These steps presuppose the linear-Gaussian model class but do not define the monotonicities in terms of themselves or reduce a prediction to a fitted input by construction. No self-citation chain, ansatz smuggling, or renaming of known results is required for the central claim. The derivation remains self-contained against the stated model assumptions and external Gittins framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard modeling assumptions for bandits and introduces a derived exploration quantity without explicit free parameters or new postulated entities beyond the framework extension.

axioms (1)

domain assumption: Gaussian state-space model for latent reward dynamics
Assumed to allow closed-form extension of Gittins indices and control-as-inference derivation.

invented entities (1)

CAUSE exploration bonus (no independent evidence)
purpose: Closed-form term that adjusts exploration for the inferred cause of uncertainty
Derived from the framework rather than postulated independently; no external falsifiable prediction stated in abstract.

pith-pipeline@v0.9.0 · 5724 in / 1228 out tokens · 67322 ms · 2026-05-20T06:38:00.738911+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, washburn_uniqueness_aczel) · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics... derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean (higher-derivative calibration of CostAlphaLog) · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The exploration bonus B(P, s, v, γ) is nonincreasing in the observation noise s... nondecreasing in the innovation variance v"
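The monotonicity claim in the second quoted passage can be written as a pair of inequalities. This is only a restatement of the quoted sentence; reading P as the posterior variance and γ as the discount factor is an assumption, since the passage does not define them.

```latex
\begin{aligned}
s_1 \le s_2 &\;\Longrightarrow\; B(P, s_2, v, \gamma) \le B(P, s_1, v, \gamma),\\
v_1 \le v_2 &\;\Longrightarrow\; B(P, s, v_1, \gamma) \le B(P, s, v_2, \gamma).
\end{aligned}
```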

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

work page 2002

[2] [2]

Asymptotically efficient adaptive allocation rules.Advances in applied mathematics, 6(1):4–22, 1985

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.Advances in applied mathematics, 6(1):4–22, 1985

work page 1985

[3] [3]

On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933

work page 1933

[4] [4]

Learning to optimize via information-directed sampling.Advances in neural information processing systems, 27, 2014

Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling.Advances in neural information processing systems, 27, 2014. 11

work page 2014

[5] [5]

Humans use directed and random exploration to solve the explore–exploit dilemma.Journal of experimental psychology: General, 143(6):2074, 2014

Robert C Wilson, Andra Geana, John M White, Elliot A Ludvig, and Jonathan D Cohen. Humans use directed and random exploration to solve the explore–exploit dilemma.Journal of experimental psychology: General, 143(6):2074, 2014

work page 2074

[6] [6]

Uncertainty and exploration.Decision, 6(3):277, 2019

Samuel J Gershman. Uncertainty and exploration.Decision, 6(3):277, 2019

work page 2019

[7] [7]

Cortical substrates for exploratory decisions in humans.Nature, 441(7095):876–879, 2006

Nathaniel D Daw, John P O’doherty, Peter Dayan, Ben Seymour, and Raymond J Dolan. Cortical substrates for exploratory decisions in humans.Nature, 441(7095):876–879, 2006

work page 2006

[8] [8]

Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation.Nature neuroscience, 12(8):1062–1068, 2009

Michael J Frank, Bradley B Doll, Jen Oas-Terpstra, and Francisco Moreno. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation.Nature neuroscience, 12(8):1062–1068, 2009

work page 2009

[9] [9]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[10] Emanuel Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pages 4286–4292. IEEE, 2008.

[11] Payam Piray and Nathaniel D Daw. A model for learning based on the joint estimation of stochasticity and volatility. Nature Communications, 12(1):6587, 2021.

[12] Payam Piray and Nathaniel D Daw. Computational processes of simultaneous learning of stochasticity and volatility in humans. Nature Communications, 15(1):9073, 2024.

[13] Jonathan D Cohen, Samuel M McClure, and Angela J Yu. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481):933–942, 2007.

[14] Samuel J Gershman. Deconstructing the human algorithms for exploration. Cognition, 173:34–42, 2018.

[15] Timothy EJ Behrens, Mark W Woolrich, Mark E Walton, and Matthew FS Rushworth. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9):1214–1221, 2007.

[16] Christoph Mathys, Jean Daunizeau, Karl J Friston, and Klaas E Stephan. A Bayesian foundation for individual learning under uncertainty. Frontiers in Human Neuroscience, 5:39, 2011.

[17] Matthew R Nassar, Robert C Wilson, Benjamin Heasly, and Joshua I Gold. An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. Journal of Neuroscience, 30(37):12366–12378, 2010.

[18] Matthew R Nassar, Katherine M Rumsey, Robert C Wilson, Kinjan Parikh, Benjamin Heasly, and Joshua I Gold. Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience, 15(7):1040–1046, 2012.

[19] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–99, 2018.

[20] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pages 592–600. PMLR, 2012.

[21] John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(2):148–164, 1979.

[22] Yi-Ching Yao. Some results on the Gittins index for a normal reward process. Lecture Notes-Monograph Series, pages 284–294, 2006.

[23] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.

[24] Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.

[25] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. arXiv preprint arXiv:1803.02855, 2018.

[26] Yueyang Liu, Benjamin Van Roy, and Kuang Xu. Nonstationary bandit learning via predictive sampling. In International Conference on Artificial Intelligence and Statistics, pages 6215–6244. PMLR, 2023.

[27] Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.

[28] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.

[29] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

[30] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1049–1056, 2009.

[31] Matthew Botvinick and Marc Toussaint. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012.

[32] Michael Browning, Timothy E Behrens, Gerhard Jocham, Jill X O’Reilly, and Sonia J Bishop. Anxious individuals have difficulty learning the causal statistics of aversive environments. Nature Neuroscience, 18(4):590–596, 2015.

[33] He Huang, Wesley Thompson, and Martin P Paulus. Computational dysfunctions in anxiety: failure to differentiate signal from noise. Biological Psychiatry, 82(6):440–446, 2017.

[34] Muhammad H Satti, Katharina Wille, Matthew R Nassar, Radoslaw M Cichy, Nicolas W Schuck, Peter Dayan, and Rasmus Bruckner. Absence of systematic effects of internalizing psychopathology on learning under uncertainty. bioRxiv, pages 2025–05, 2025.

[35] Albert R Powers, Christoph Mathys, and Philip Robert Corlett. Pavlovian conditioning–induced hallucinations result from overweighting of perceptual priors. Science, 357(6351):596–600, 2017.

[36] Erdem Pulcu and Michael Browning. Affective bias as a rational response to the statistics of rewards and punishments. eLife, 6:e27879, 2017.

[37] Jessica Aylward, Vincent Valton, Woo-Young Ahn, Rebecca L Bond, Peter Dayan, Jonathan P Roiser, and Oliver J Robinson. Altered learning under uncertainty in unmedicated mood and anxiety disorders. Nature Human Behaviour, 3(10):1116–1123, 2019.

[38] Haoxue Fan, Samuel J Gershman, and Elizabeth A Phelps. Trait somatic anxiety is associated with reduced directed exploration and underestimation of uncertainty. Nature Human Behaviour, 7(1):102–113, 2023.

[39] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

A Proofs of monotonicity results

This appendix provides proofs of the three monotonicity results stated in Section 4 of the main text: the index decomposition (Proposition 1), the monotonicity of the exploration bonus in the observation noises (Theor...

$\sum_{n=1}^{\xi} \gamma^{n-1} r_n + \gamma^{\xi} \frac{\lambda}{1-\gamma} \Big] = \mathbb{E}$

We fix c = 1/2 throughout this paper.

C Experimental setup

This appendix provides implementation details for the experiments of Section 6. The scale parameter c (Eq. 7) was held at 0.5 across all experiments and not tuned per-condition; reported regret reflects this fixed setting.

C.1 Baseline policies

All baselines use the same Kalman tracker as CAUSE, ...
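The shared Kalman tracker mentioned in the experimental setup can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a scalar random-walk latent reward per arm, known noise variances, and hypothetical names (`KalmanTracker`, `step_all`, `volatility`, `stochasticity`) chosen here for illustration only. The scale parameter c and the CAUSE bonus itself are not modelled.

```python
from dataclasses import dataclass


@dataclass
class KalmanTracker:
    """Scalar Kalman filter for one arm (illustrative assumption, not the paper's code)."""
    mean: float = 0.0            # posterior mean of the latent reward
    var: float = 1.0             # posterior variance of the latent reward
    volatility: float = 0.1      # process-noise variance, assumed known
    stochasticity: float = 1.0   # observation-noise variance, assumed known

    def predict(self) -> None:
        # The latent reward diffuses between trials, so uncertainty grows.
        self.var += self.volatility

    def update(self, reward: float) -> float:
        # Kalman gain: fraction of the prediction error attributed to the state.
        gain = self.var / (self.var + self.stochasticity)
        self.mean += gain * (reward - self.mean)
        self.var *= 1.0 - gain
        return gain


def step_all(trackers, chosen, reward):
    """Diffuse every arm (restless setting), then update only the pulled arm."""
    for tracker in trackers:
        tracker.predict()
    return trackers[chosen].update(reward)
```

In this sketch, higher volatility inflates the predicted variance and therefore the gain, while higher stochasticity lowers the gain. That mirrors the opposite roles the two noise sources play in learning, although the exploration bonus built on top of the tracker is a separate quantity.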