A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics

Youzhen Li; Zixi Li

arxiv: 2605.06741 · v1 · submitted 2026-05-07 · 💻 cs.LG

Pith reviewed 2026-05-11 01:10 UTC · model grok-4.3

keywords learning rate · belief space · probability simplex · KL divergence · contractivity · upper bound · admissible steps

The pith

A closed-form formula gives the maximum admissible step size for belief updates on the probability simplex.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper models belief updates as projected forward steps on the probability simplex and defines admissibility through contractivity in the KL/Bregman geometry. It derives an explicit upper bound on the learning rate step that ensures this contractivity. A sympathetic reader would care because it replaces heuristic tuning of learning rates with a direct calculation based on the current belief state. The result applies to any dynamics where beliefs evolve via such projected steps, turning stability into a verifiable local property.

Core claim

Under the model of a projected forward step on the probability simplex, admissibility equates to contractivity in the natural KL/Bregman geometry, from which follows a closed-form expression for the upper bound on the admissible step size.

What carries the argument

The KL/Bregman contractivity condition on the simplex that bounds the step size from above.

If this is right

The learning rate no longer requires manual tuning but can be computed directly from the belief distribution.
Each local update can be guaranteed stable by respecting the derived bound.
Belief-space algorithms gain a built-in safeguard against divergence without additional checks.
Hyperparameter search in probabilistic learning reduces to selecting rates below this explicit limit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bound might extend to other Bregman divergences if the geometry changes.
Implementing this could improve reliability in reinforcement learning with belief states.
Future work could test the bound's tightness against observed divergence points in simulations.

Load-bearing premise

Belief updates behave exactly as single projected steps whose stability is fully captured by contraction in KL divergence.

What would settle it

Observing stable convergence or non-divergence with a step size exceeding the formula in a simple simplex update would disprove the bound.

Figures

Figures reproduced from arXiv: 2605.06741 by Youzhen Li, Zixi Li.

**Figure 2.** Closed-form admissible cross-entropy step on a binary belief slice. Panel A shows the raw …

**Figure 3.** ADS supplies an entropy backoff factor. The loss geometry supplies the upper bound. In …

**Figure 4.** Belief-space distribution-shift experiment.

**Figure 5.** The forward pass as a discrete dynamical system. Each hidden-state transfer …

Original abstract

Learning-rate steps are usually treated as hyperparameters. This paper isolates a local belief-space calculation: when an update is modeled as a projected forward step on the probability simplex, admissibility means contractivity in the natural KL/Bregman geometry. Under this model, the upper bound of an admissible step is not a tuning slogan but a formula.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript claims that learning-rate steps, typically treated as hyperparameters, can be bounded in closed form for belief-space dynamics. It models updates as projected forward steps on the probability simplex and defines admissibility via contractivity in the KL/Bregman geometry; under this scoped model the admissible step size is given by an explicit formula rather than left to tuning.

Significance. If the derivation holds, the result supplies a model-specific but exact upper bound that replaces empirical tuning with a direct calculation. This is a clear strength for any setting that already uses projected simplex updates and Bregman divergences (e.g., certain online learning or belief-propagation algorithms). The paper correctly limits its claim to the stated modeling assumptions and does not assert parameter-freeness or global optimality outside that setting.

minor comments (1)

The abstract would be strengthened by a single sentence stating the explicit form of the derived bound (or the key quantities it depends on) so that readers can immediately see what the formula involves.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept the manuscript. The summary accurately captures the scope of our contribution: a closed-form admissible step-size bound derived under the specific modeling assumptions of projected simplex updates and contractivity in KL/Bregman geometry.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated model

full rationale

The paper explicitly scopes its claim to a specific modeling choice: updates as projected forward steps on the probability simplex, with admissibility defined as contractivity in the KL/Bregman geometry. Under this model it supplies a closed-form upper bound on the step size. No evidence in the abstract, title, or modeling statements indicates that the bound reduces by construction to a fitted parameter, a self-citation chain, or a renamed input. The derivation is presented as a direct consequence of the chosen geometry and projection, which is independent of the target result. This is the most common honest outcome for a scoped theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two modeling choices: treating the update as a projected forward step on the simplex and equating admissibility with contractivity in KL/Bregman geometry. No free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: An update can be modeled as a projected forward step on the probability simplex. (Stated in the abstract as the local belief-space calculation.)
domain assumption: Admissibility of the step is equivalent to contractivity in the natural KL/Bregman geometry. (Directly asserted in the abstract.)

pith-pipeline@v0.9.0 · 5341 in / 1228 out tokens · 47214 ms · 2026-05-11T01:10:47.168356+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

unclear: Relation between the paper passage and the cited Recognition theorem.

the admissible cross-entropy step must satisfy 0 < η < 2μ/L² ... η_CE^max(p) = 2 min_i(p_i)² / max_i(p_i)
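The quoted expression can be evaluated directly from a belief vector. The following is a minimal sketch, assuming only the formula as quoted above; the function names `eta_ce_max` and `is_admissible` are hypothetical, and the code does not reproduce the paper's derivation or its definitions of μ and L.

```python
import math


def eta_ce_max(p):
    """Evaluate the quoted expression 2 * min_i(p_i)^2 / max_i(p_i).

    `p` is assumed to be a point in the interior of the probability simplex:
    strictly positive entries that sum to 1.
    """
    p = [float(x) for x in p]
    if not p or any(x <= 0.0 for x in p):
        raise ValueError("p must be non-empty with strictly positive entries")
    if not math.isclose(sum(p), 1.0, abs_tol=1e-9):
        raise ValueError("p must sum to 1")
    return 2.0 * min(p) ** 2 / max(p)


def is_admissible(eta, p):
    """Check the strict condition 0 < eta < eta_ce_max(p) (hypothetical helper)."""
    return 0.0 < eta < eta_ce_max(p)


if __name__ == "__main__":
    # Print the bound for a few illustrative belief vectors.
    for belief in ([0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]):
        print(belief, eta_ce_max(belief))
```

For a uniform belief over n outcomes the expression reduces to 2/n, and it shrinks quadratically as the smallest probability approaches zero, so under this formula near-deterministic beliefs admit only very small steps.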

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 2 internal anchors

[1] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
[2] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, 2017.
[3] Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, and Masashi Sugiyama. Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum. arXiv preprint arXiv:2006.15815, 2020.
[4] Aaron Defazio, Ashok Cutkosky, Harsh Mehta, and Konstantin Mishchenko. Optimal Linear Decay Learning Rate Schedules and Further Refinements. arXiv preprint arXiv:2310.07831, 2023.
[5] Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-Specific Adaptive Learning Rates for Deep Networks. arXiv preprint arXiv:1510.04609, 2015.
[6] Patrick M. Wensing and Jean-Jacques E. Slotine. Beyond Convexity: Contraction and Global Convergence of Gradient Descent. arXiv preprint arXiv:1806.06655, 2018.
[7] Andre Uschmajew and Bart Vandereycken. A Note on the Optimal Convergence Rate of Descent Methods with Fixed Step Sizes for Smooth Strongly Convex Functions. arXiv preprint arXiv:2106.08020, 2021.
[8] Mengmou Li, Khaled Laib, Takeshi Hatanaka, and Ioannis Lestas. Convergence Rate Bounds for the Mirror Descent Method: IQCs, Popov Criterion and Bregman Divergence. arXiv preprint arXiv:2304.03886, 2023.
[9] Amanjit Singh Kainth, Ting-Kam Leonard Wong, and Frank Rudzicz. Conformal Mirror Descent with Logarithmic Divergences. arXiv preprint arXiv:2209.02938, 2022.
[10] James Melbourne. Strongly Convex Divergences. arXiv preprint arXiv:2009.10838, 2020.