Propagation of Chaos in Contextual Flow Maps

arXiv: 2605.16747 · v1 · pith:HL3PWH3B · submitted 2026-05-16 · 💻 cs.LG · math.AP · math.OC · math.PR · math.ST · stat.TH

Shi Chen, Zhengjiang Lin, Kaizhao Liu, Philippe Rigollet

Pith reviewed 2026-05-19 20:15 UTC · model grok-4.3

keywords contextual flow maps · propagation of chaos · transformers · large context regime · Wasserstein distance · online gradient descent · attention blocks

The pith

Finite-context models converge to infinite-context versions uniformly in depth and training steps at optimal Wasserstein rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models transformers as contextual flow maps, dynamical systems that evolve a token under a contextual measure through stacked attention blocks. It proves that the finite-context version stays close to an idealized infinite-context system, in which the contextual measure is replaced by its underlying population measure. This treats context length as a statistical resource rather than a fixed hyperparameter. Forward bounds control the deviation between the two systems uniformly across the depth of the stack. Backward bounds control the deviation between their training trajectories under online gradient descent. Both sets of bounds recover the optimal Wasserstein rate $n^{-1/d}$ for general maps and the faster parametric rate $n^{-1/2}$ for a subclass that includes transformers.

Core claim

Contextual flow maps evolve a distinguished token in the presence of a contextual measure across attention blocks. The finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its population counterpart. Exploiting the structure of the dynamics, forward bounds control the deviation between finite- and infinite-context versions uniformly along depth, while backward bounds control the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general maps and the parametric rate $n^{-1/2}$ for a restricted class that includes transformers.

What carries the argument

Contextual flow maps: dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks.
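
As a rough illustration of this definition, the sketch below implements one possible discretized contextual flow map in plain Python. It is not the paper's construction: the softmax kernel with inverse temperature `beta`, the residual step size `h`, the identity query/key/value maps, and the choice to update the context particles with the same rule as the distinguished token are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

def attention_drift(x, context, beta=1.0):
    """Softmax-weighted barycentre of the context particles, as seen from x."""
    scores = [beta * sum(a * b for a, b in zip(x, z)) for z in context]
    top = max(scores)                                  # stabilise the exponentials
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * z[k] for w, z in zip(weights, context)) / total
            for k in range(len(x))]

def step(point, context, h, beta):
    """One residual block: move `point` a fraction h towards its attention output."""
    target = attention_drift(point, context, beta)
    return [(1.0 - h) * p + h * t for p, t in zip(point, target)]

def contextual_flow_map(x0, context0, depth=8, h=0.1, beta=1.0):
    """Evolve a distinguished token and its context through `depth` blocks."""
    x, context = list(x0), [list(z) for z in context0]
    for _ in range(depth):
        # Both updates read the same current empirical measure of the context.
        x, context = (step(x, context, h, beta),
                      [step(z, context, h, beta) for z in context])
    return x, context

rng = random.Random(0)
d, n = 3, 64                                           # token dimension, context length
x0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
context0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
x_out, context_out = contextual_flow_map(x0, context0)
print(x_out)
```

Because each block takes a convex combination of the current point and a softmax average of the context, every particle stays inside the convex hull of the initial points in this toy version; real attention blocks with learned value matrices need not have that property.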

If this is right

The deviation between finite- and infinite-context models remains controlled uniformly in the number of attention blocks.
Training trajectories of finite- and infinite-context models stay close uniformly across iterations of online gradient descent.
The approximation achieves the optimal Wasserstein rate $n^{-1/d}$ for general contextual flow maps.
A faster parametric rate $n^{-1/2}$ holds for a restricted class that includes transformers.
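
The claim about training trajectories can be pictured with a toy experiment. The sketch below is an assumption-laden illustration, not the paper's setting: a single scalar softmax attention readout scaled by one trainable parameter `theta`, standard Gaussian context tokens (for which the infinite-context readout equals `beta * x` exactly, by Gaussian tilting), a squared loss, and online gradient descent driven by the same stream of queries for both models. The function names and all constants are hypothetical.

```python
import math
import random

def attention_readout(x, context, beta=1.0):
    """Softmax attention output of a scalar query x over scalar context tokens."""
    scores = [beta * x * z for z in context]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * z for w, z in zip(weights, context)) / sum(weights)

def online_gd_gap(n, steps=200, lr=0.5, theta_star=1.0, beta=1.0, seed=0):
    """Largest gap between finite- and infinite-context parameters along training."""
    rng = random.Random(seed)
    theta_n = theta_inf = 0.0                       # shared initialisation
    worst = 0.0
    for _ in range(steps):
        x = rng.uniform(0.5, 1.0)                   # fresh query, shared by both runs
        context = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a_n = attention_readout(x, context, beta)   # finite-context feature
        a_inf = beta * x                            # exact population feature for N(0, 1)
        target = theta_star * a_inf
        # Gradient of 0.5 * (theta * a - target)**2 with respect to theta.
        theta_n -= lr * (theta_n * a_n - target) * a_n
        theta_inf -= lr * (theta_inf * a_inf - target) * a_inf
        worst = max(worst, abs(theta_n - theta_inf))
    return worst, theta_n, theta_inf

for n in (50, 2000):
    gap, theta_n, theta_inf = online_gd_gap(n)
    print(f"n={n:5d}  max gap={gap:.3f}  theta_n={theta_n:.3f}  theta_inf={theta_inf:.3f}")
```

In this toy model the maximal gap over iterations should shrink as the context length grows, which is the qualitative behaviour the backward bound describes; the experiment says nothing about the constants or the rate in the paper's theorem.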

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Context length can be treated as an independent statistical resource that improves fidelity without altering model depth or width.
The uniform-in-depth and uniform-in-iteration control suggests that infinite-context proxies could be used to study scaling behavior of attention-based models.
The same bounding technique may apply to other sequence architectures whose block dynamics permit a similar convergence of empirical measures.

Load-bearing premise

The attention dynamics must admit a structure in which the finite-context empirical measure converges to a population measure as context length grows large.

What would settle it

Measure the Wasserstein distance between the outputs of finite-context and infinite-context models for successively larger context lengths $n$ and check whether the distance fails to decay at rate $n^{-1/d}$.
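
A minimal version of such a check can be sketched as follows. The setup is an assumption chosen so that the infinite-context output is known in closed form: a scalar softmax attention readout with standard Gaussian context tokens, whose population output equals `beta * x`. For a fixed query both outputs are single points, so the Wasserstein distance between them reduces to an absolute difference. The function names and sample sizes are hypothetical.

```python
import math
import random

def attention_readout(x, context, beta=1.0):
    """Softmax attention output of a scalar query x over scalar context tokens."""
    scores = [beta * x * z for z in context]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * z for w, z in zip(weights, context)) / sum(weights)

def mean_abs_error(n, x=0.5, beta=1.0, trials=200, rng=None):
    """Monte Carlo estimate of E|finite-context output - infinite-context output|."""
    if rng is None:
        rng = random.Random(0)
    limit = beta * x          # exact population output for N(0, 1) context tokens
    total = 0.0
    for _ in range(trials):
        context = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += abs(attention_readout(x, context, beta) - limit)
    return total / trials

def loglog_slope(ns, errs):
    """Least-squares slope of log(err) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

rng = random.Random(0)
ns = [50, 200, 800, 3200]
errs = [mean_abs_error(n, rng=rng) for n in ns]
slope = loglog_slope(ns, errs)
print(f"estimated decay exponent: {slope:.2f}")
```

This one-dimensional softmax toy sits in the smooth, parametric regime, so the fitted exponent should be close to -1/2 rather than -1/d. Probing the $n^{-1/d}$ rate would require tokens in dimension $d \ge 3$, a less regular map, and a Wasserstein distance computed between output distributions rather than between single points.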

Figures

Figures reproduced from arXiv: 2605.16747 by Kaizhao Liu, Philippe Rigollet, Shi Chen, Zhengjiang Lin.

**Figure 1.** Three formalizations of transformers. The contextual measure $\mu_0 = \frac{1}{n}\sum_i \delta_{z^{(i)}}$ can be regarded as an $n$-sample empirical estimator of an underlying population measure $\mu_0^\infty$. The contextual flow map (1.1) then takes this empirical measure as input and produces an output $x_1$ that approximates an idealized infinite-context system, obtained by replacing $\mu_0$ with $\mu_0^\infty$ in (1.1). The infinite-context system is a n…

Original abstract

We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length $n$ becomes a statistical resource. Exploiting the McKean--Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general CFMs and parametric rate $n^{-1/2}$ for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward--adjoint system, both of which may be of independent interest.
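
For orientation, the two rates quoted in the abstract echo a standard fact about empirical measures, recalled here as a sketch rather than as the paper's theorem. The assumptions are illustrative and not taken from the paper: i.i.d. samples $z^{(i)}$ from a measure $\mu$ on $\mathbb{R}^d$ with enough finite moments (compact support suffices), the distance $W_1$, and a bounded test function $\varphi$.

```latex
\mu_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{z^{(i)}},
\qquad
\mathbb{E}\, W_1(\mu_n, \mu) \;\lesssim\;
\begin{cases}
n^{-1/2}, & d = 1,\\
n^{-1/2}\log(1+n), & d = 2,\\
n^{-1/d}, & d \ge 3,
\end{cases}
\qquad
\mathbb{E}\,\Bigl|\int \varphi \, d\mu_n - \int \varphi \, d\mu\Bigr|
\;\le\; \sqrt{\frac{\operatorname{Var}_{\mu}(\varphi)}{n}} .
```

The first bound is the source of the dimension-dependent rate; the second shows why a quantity that sees the measure only through averages of fixed test functions can enjoy the $n^{-1/2}$ rate in any dimension. Whether the paper's restricted class is exactly of this form is not stated in the abstract.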

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper uses contextual flow maps to get explicit Wasserstein rates for finite-context transformers approximating their infinite-context mean-field limit, with uniform control in depth and training steps, but the uniformity claim needs the stability estimates checked against possible expansion in attention maps.

The main takeaway is that the authors model stacked attention as contextual flow maps and then apply propagation of chaos to bound the gap between finite-n context and the population limit. They claim both a forward bound that stays controlled across layers and a backward bound that tracks online gradient descent trajectories, both at the usual n^{-1/d} rate in general and n^{-1/2} for a parametric subclass that includes transformers. The Eulerian adjoint they introduce for the loss gradient is the piece that lets them close the backward analysis cleanly.

Referee Report

2 major / 2 minor

Summary. The paper introduces the abstraction of contextual flow maps (CFMs) to model transformers in the large-context regime. It exploits the McKean-Vlasov structure of attention dynamics to apply propagation-of-chaos techniques, deriving a forward bound on the Wasserstein deviation between finite-context and infinite-context CFMs that is uniform in depth, together with a backward bound on the deviation of online gradient-descent trajectories that is uniform in iteration count. Both bounds attain the optimal rate n^{-1/d} in general and the parametric rate n^{-1/2} for a restricted class that includes transformers; the analysis relies on a new Eulerian adjoint formulation of the loss gradient and associated stability estimates for the forward-adjoint system.

Significance. If the uniformity claims hold, the work supplies the first quantitative statistical theory linking context length to approximation and optimization error in transformer-like models, with rates that match known optimal rates for empirical-measure convergence. The new adjoint formulation and forward-adjoint stability estimates are presented as potentially reusable tools. The manuscript correctly invokes classical McKean-Vlasov propagation-of-chaos results rather than deriving rates circularly.

major comments (2)

[forward bound derivation] Forward bound (abstract and the section deriving the uniform-in-depth PoC estimate): the claimed uniformity in depth L is not obviously inherited from standard Gronwall-based PoC estimates, which produce a factor C(L) that grows at least linearly (and typically exponentially) with the time horizon. The manuscript invokes stability estimates on the forward-adjoint system to obtain uniformity, but does not state an explicit dissipativity, contractivity, or non-expansiveness condition on the CFM vector field or attention kernel that would prevent accumulation of deviations across blocks. Without such a condition, the n^{-1/d} rate may hold only for fixed L rather than uniformly in L.
[McKean-Vlasov structure paragraph] Assumptions on the attention kernel (paragraph invoking McKean-Vlasov structure): the reduction to a McKean-Vlasov process is asserted for general CFMs, yet the precise Lipschitz or boundedness requirements on the interaction kernel that justify application of the classical theorems are not listed explicitly. This makes it difficult to verify that the stated rates apply to the transformer case without additional post-hoc restrictions.
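
To make the first major comment concrete, here is the textbook Gronwall computation it alludes to, written for a generic flow. The symbols are illustrative and not the paper's notation: $\Delta(t)$ is the deviation between the finite- and infinite-context states at depth $t \in [0, L]$ with $\Delta(0) = 0$, $\varepsilon_n$ is the sampling error injected at each depth (of order $n^{-1/d}$ or $n^{-1/2}$), $K > 0$ is a Lipschitz constant, and $\kappa > 0$ is a hypothetical dissipativity constant.

```latex
% Lipschitz drift only: the constant grows with depth.
\frac{d}{dt}\Delta(t) \le K\,\Delta(t) + \varepsilon_n
\;\Longrightarrow\;
\Delta(L) \le \varepsilon_n\,\frac{e^{KL} - 1}{K},
\qquad
\frac{d}{dt}\Delta(t) \le \varepsilon_n
\;\Longrightarrow\;
\Delta(L) \le \varepsilon_n\, L .

% One-sided Lipschitz drift with negative constant: the bound is uniform in depth.
\frac{d}{dt}\Delta(t) \le -\kappa\,\Delta(t) + \varepsilon_n
\;\Longrightarrow\;
\Delta(L) \le \varepsilon_n\,\frac{1 - e^{-\kappa L}}{\kappa} \le \frac{\varepsilon_n}{\kappa}
\quad \text{for all } L \ge 0 .
```

The first line shows the exponential and linear growth in $L$ that the comment warns about; the second shows how a contraction of strength $\kappa$ would remove the dependence on depth. Which mechanism the paper actually uses to obtain uniformity is exactly what the comment asks the authors to state.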

minor comments (2)

[abstract and introduction] Notation for the contextual measure and its empirical counterpart should be introduced once and used consistently; the current alternation between “contextual measure” and “empirical measure” in the abstract and early sections is slightly confusing.
[abstract] The restricted class achieving the n^{-1/2} rate is described as “including transformers as a special case,” but the precise finite-dimensional embedding or parametric assumption that yields the faster rate is not spelled out in the abstract; a one-sentence clarification would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight opportunities to strengthen the presentation of our uniformity claims and assumptions. We respond to each major comment below and indicate the revisions we will make.

Referee: [forward bound derivation] Forward bound (abstract and the section deriving the uniform-in-depth PoC estimate): the claimed uniformity in depth L is not obviously inherited from standard Gronwall-based PoC estimates, which produce a factor C(L) that grows at least linearly (and typically exponentially) with the time horizon. The manuscript invokes stability estimates on the forward-adjoint system to obtain uniformity, but does not state an explicit dissipativity, contractivity, or non-expansiveness condition on the CFM vector field or attention kernel that would prevent accumulation of deviations across blocks. Without such a condition, the n^{-1/d} rate may hold only for fixed L rather than uniformly in L.

Authors: We appreciate the referee's careful scrutiny of the depth uniformity. The stability estimates for the forward-adjoint system (derived via the new Eulerian formulation) are designed precisely to yield a non-expansive bound in the Wasserstein metric that is independent of depth L. These estimates exploit the specific structure of the contextual flow map dynamics to obtain a contraction that counters the usual Gronwall growth. Nevertheless, we agree that an explicit statement of the underlying dissipativity condition would make the argument more transparent. In the revision we will add a clear statement of the one-sided Lipschitz condition on the CFM vector field (with a negative constant) that guarantees the required non-expansiveness and prevents accumulation of deviations across blocks. This condition is satisfied by the attention kernels arising in transformers under standard boundedness assumptions on queries and keys. revision: yes
Referee: [McKean-Vlasov structure paragraph] Assumptions on the attention kernel (paragraph invoking McKean-Vlasov structure): the reduction to a McKean-Vlasov process is asserted for general CFMs, yet the precise Lipschitz or boundedness requirements on the interaction kernel that justify application of the classical theorems are not listed explicitly. This makes it difficult to verify that the stated rates apply to the transformer case without additional post-hoc restrictions.

Authors: We agree that the precise requirements should be stated explicitly rather than left implicit. In the revised manuscript we will insert a dedicated paragraph that enumerates the Lipschitz continuity, boundedness, and measurability conditions on the interaction kernel needed to invoke the classical McKean-Vlasov propagation-of-chaos theorems. We will also verify that these conditions are satisfied by the standard softmax attention kernel used in transformers, confirming that the stated rates apply directly without additional restrictions. revision: yes

Circularity Check

0 steps flagged

No circularity; bounds derived from classical external PoC machinery

The paper applies the McKean-Vlasov structure of attention dynamics and classical propagation-of-chaos results to obtain forward and backward deviation bounds that achieve standard Wasserstein rates n^{-1/d} and n^{-1/2}. These rates are not fitted to the paper's outputs or defined in terms of the same quantities being bounded. The new Eulerian adjoint formulation and stability estimates are presented as independent contributions that do not reduce to self-definition, self-citation chains, or renaming of known results. The derivation remains self-contained against external benchmarks with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the McKean-Vlasov structure of attention dynamics and the replacement of empirical context by its population measure; CFMs are introduced as a modeling abstraction without independent empirical validation.

axioms (2)

domain assumption Attention-block dynamics admit a McKean-Vlasov mean-field limit when context length tends to infinity
Invoked to apply propagation-of-chaos machinery to the stack of attention blocks.
domain assumption The loss gradient admits an Eulerian adjoint formulation that remains stable under the forward-adjoint system
Required for the backward bound on training trajectories.

invented entities (1)

Contextual Flow Map (CFM) no independent evidence
purpose: Abstraction that represents a transformer layer as a dynamical system evolving a distinguished token under a contextual measure
New modeling device introduced to enable the mean-field analysis; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5743 in / 1495 out tokens · 43775 ms · 2026-05-19T20:15:44.139730+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Exploiting the McKean–Vlasov structure of the dynamics and the classical machinery of propagation of chaos... stability estimates for the resulting forward–adjoint system

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

