Semantic Concurrency Limits in Large Language Models

Karl Svozil

arxiv: 2504.13824 · v5 · pith:CEB7ISLBnew · submitted 2025-04-18 · 🪐 quant-ph

Pith reviewed 2026-05-22 18:37 UTC · model grok-4.3

classification 🪐 quant-ph

keywords semantic embeddings · concurrency limits · interference scaling · epistemic accessibility · embedding dimension · kinetic capacity · polysemy

The pith

Dimension limits simultaneous semantic access in embeddings through accumulating interference rather than acting only as storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that high-dimensional spaces allow many semantic directions to coexist when their overlaps remain small, yet these non-zero overlaps produce interference that grows with the number of active directions and restricts what a finite readout channel can recover. It separates geometric hosting capacity from recoverable information under concurrency, showing dimension functions as bandwidth for simultaneous processing. A sympathetic reader would care because this frames scaling limits in language models as a geometric constraint on handling multiple concepts at once, rather than a pure data or parameter issue.

Core claim

High-dimensional embedding spaces can host N < exp(c d_eff ε²) semantic directions with pairwise overlap at most ε. When k directions activate together, residual interference yields readout noise with standard deviation σ_int ∼ √(k/d_eff). Dimension therefore serves as semantic concurrency bandwidth, distinguishing what the geometry can contain (kinetic capacity) from what readout can access (epistemic accessibility).
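
A sketch of where the √(k/d_eff) scaling comes from, assuming the simplest random-vector model: k independent directions u_1, …, u_k drawn uniformly from the unit sphere in d dimensions, each active with unit coefficient, and a linear readout along u_1. These modeling choices are assumptions made here for illustration; the source states only the resulting scaling.

```latex
\begin{align}
x &= \sum_{j=1}^{k} u_j, \qquad \|u_j\| = 1, \\
r_1 &= \langle u_1, x \rangle = 1 + \sum_{j=2}^{k} \langle u_1, u_j \rangle, \\
\mathbb{E}\,\langle u_1, u_j \rangle &= 0, \qquad
\mathbb{E}\,\langle u_1, u_j \rangle^{2} = \frac{1}{d} \quad (j \neq 1), \\
\sigma_{\mathrm{int}}^{2} &= \operatorname{Var}(r_1) = \frac{k-1}{d}
\quad\Longrightarrow\quad
\sigma_{\mathrm{int}} = \sqrt{\frac{k-1}{d}} \approx \sqrt{\frac{k}{d}} \quad (k \gg 1).
\end{align}
```

The cross terms are uncorrelated because, conditional on u_1, the overlaps with the other directions are independent and have zero mean. Replacing d by d_eff for anisotropic embeddings is the paper's step and is not derived in this sketch.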

What carries the argument

The kinetic capacity versus epistemic accessibility distinction, expressed through the coexistence bound N < exp(c d_eff ε²) and the interference scaling σ_int ∼ √(k/d_eff)
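
The coexistence side can be checked numerically under a random-vector assumption. The sketch below draws n random unit vectors in d dimensions and compares their largest pairwise overlap with the level ε at which the union bound n² exp(−d ε²/2) drops to a chosen failure probability. The spherical-cap bound, the implied constant, and all function names are assumptions for illustration; the source leaves the constant c unspecified.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_pairwise_overlap(n, d, seed=0):
    """Largest |<u_i, u_j>| over all pairs among n random unit vectors in R^d."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(d, rng) for _ in range(n)]
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            overlap = abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            worst = max(worst, overlap)
    return worst

def union_bound_epsilon(n, d, failure_prob=0.01):
    """Overlap level eps at which n^2 * exp(-d * eps^2 / 2) equals failure_prob."""
    return math.sqrt(2.0 * math.log(n * n / failure_prob) / d)

if __name__ == "__main__":
    n = 40
    for d in (64, 256, 1024):
        # Columns: dimension, observed max overlap, union-bound threshold.
        print(d, round(max_pairwise_overlap(n, d), 3), round(union_bound_epsilon(n, d), 3))
```

Inverting the same bound at fixed ε allows roughly exp(d ε²/4) directions, which has the exponential form of the coexistence bound.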

If this is right

Higher embedding dimensions increase the number of semantic directions that can be processed simultaneously before interference dominates.
Readout accuracy in language models faces a hard geometric limit tied to effective dimension when many concepts are jointly active.
Polysemous token resolution may require explicit low-dimensional subspaces orthogonal to stable hinge directions to remain accessible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding training procedures that explicitly minimize effective overlaps could raise concurrency limits beyond what dimension scaling alone achieves.
Task-specific models might benefit from dimension choices matched to expected numbers of concurrent concepts rather than uniform scaling.
The same interference mechanism suggests testable predictions for how model performance degrades on inputs requiring many simultaneous distinctions.

Load-bearing premise

Residual interference from small non-zero overlaps between semantic directions accumulates without mitigation by additional structure in the embeddings.

What would settle it

Fix the effective dimension d, activate an increasing number k of directions, and measure the spread of the recovery error. The claim predicts a standard deviation that grows as √(k/d) (equivalently, a variance that grows as k/d); it fails if the error turns out to be independent of d.
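
A minimal simulation of this test on synthetic data, assuming independent uniformly random unit directions, unit activation coefficients, and a linear dot-product readout. None of these choices is specified by the source, and a real test would use directions extracted from a trained model. The function names are hypothetical.

```python
import math
import random
import statistics

def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def interference_std(d, k, trials=300, seed=0):
    """Empirical std of the cross-talk when reading out one of k active directions.

    Each trial superposes k random unit directions with unit coefficients and
    projects the sum onto the first one. The readout minus the true
    coefficient (1.0) is the interference term.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        dirs = [random_unit_vector(d, rng) for _ in range(k)]
        target = dirs[0]
        total = [sum(column) for column in zip(*dirs)]
        readout = sum(t * s for t, s in zip(target, total))
        errors.append(readout - 1.0)
    return statistics.pstdev(errors)

if __name__ == "__main__":
    for d in (32, 128):
        for k in (2, 8, 32):
            measured = interference_std(d, k)
            predicted = math.sqrt((k - 1) / d)
            print(f"d={d:4d} k={k:3d} measured={measured:.3f} sqrt((k-1)/d)={predicted:.3f}")
```

In this toy model the measured spread should track √((k−1)/d). A trained embedding whose spread stays well below that baseline would count as evidence of mitigating structure.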

Figures

Figures reproduced from arXiv: 2504.13824 by Karl Svozil.

Original abstract

High-dimensional embedding spaces can host many semantic directions with small mutual overlap. But small overlaps are not zero: when many directions are jointly active, their residual interference accumulates and limits what a finite readout channel can recover. We formulate this as a distinction between kinetic capacity (what the geometry can host) and epistemic accessibility (what readout can recover). The two sides are summarized by N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for simultaneous readout. Thus dimension acts not merely as storage capacity but as semantic concurrency bandwidth. On this geometric foundation we propose a separate hypothesis: some polysemous tokens may be organized around stable token-associated hinge directions, with sense information carried by low-dimensional subspaces in the hinge-perpendicular carrier. The capacity/accessibility distinction is the main claim; the hinge hypothesis is a stronger, separately falsifiable empirical proposal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies random-vector geometry to frame embedding dimension as a semantic concurrency limit in LLMs, with a separate hinge hypothesis for polysemy, but the random assumption may miss training-induced structure.

The paper's core idea is that embedding dimension in LLMs limits semantic concurrency: small overlaps between directions cause interference that builds up when many are active at once. This leads to a split between what the geometry can support and what a finite readout can actually recover. The paper takes standard results from high-dimensional geometry and concentration of measure and applies them to semantic embeddings. The bound on the number of coexisting directions and the scaling of interference noise are presented as direct consequences.

What stands out as new is the explicit distinction between kinetic capacity and epistemic accessibility, along with the separate hypothesis that polysemous tokens might be structured around stable hinge directions, with different senses carried in low-dimensional subspaces perpendicular to the hinge. The paper keeps the argument focused and connects the geometry to questions about LLM scaling and the handling of ambiguous words. It avoids overclaiming by flagging the hinge part as a stronger empirical proposal.

The soft spots are in the modeling assumptions. The interference calculation assumes that directions are sufficiently random, with overlaps around 1/√d_eff. But as the stress test points out, training could create mitigating structure, such as near-orthogonality in certain subspaces, or nonlinear readouts could change the picture. The paper does not appear to check actual embedding statistics from models to see whether the random baseline holds or whether real interference is lower. Without that, the claim that dimension fundamentally limits concurrency rests on how well the random-vector model fits trained embeddings.

This is a paper for people working on theoretical foundations of LLMs or geometric approaches to semantics. It has a clear line of thinking and enough novelty in the framing to make it worth sending out for peer review, even if the random assumption needs more scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that high-dimensional embedding spaces in LLMs host many semantic directions with small but non-zero overlaps, leading to accumulating residual interference that limits simultaneous recovery by a finite readout. It distinguishes kinetic capacity (N < exp(c d_eff ε²) for coexistence of directions) from epistemic accessibility (σ_int ∼ √(k/d_eff) for readout limits), arguing that dimension functions as semantic concurrency bandwidth rather than mere storage. A separate, stronger hypothesis proposes that polysemous tokens are organized around stable token-associated hinge directions, with sense information carried in low-dimensional subspaces perpendicular to the hinge.

Significance. If the geometric bounds are rigorously derived and shown to apply to trained LLM embeddings without substantial mitigating structure, the capacity-accessibility distinction could provide a useful conceptual tool for analyzing limits on parallel semantic processing. The hinge hypothesis is framed as separately falsifiable, which is a positive feature for empirical follow-up. However, the significance hinges on whether the random-vector interference model captures LLM realities or if training-induced correlations reduce the claimed limits.

major comments (2)

[Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.
[Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.

minor comments (2)

[Notation] The effective dimension d_eff and overlap parameter ε are used throughout without explicit operational definitions or estimation procedures from embedding data.
[Hinge Hypothesis] The hinge hypothesis is introduced as a stronger empirical proposal but lacks a concrete falsification plan or reference to existing polysemy literature in embedding spaces.
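
One possible way to make d_eff operational is the participation ratio (tr C)² / tr(C²) of the embedding covariance C. This is offered only as a candidate estimator; the source does not say which definition the paper intends, and the function names and synthetic data below are hypothetical.

```python
import random

def participation_ratio(vectors):
    """Participation ratio (tr C)^2 / tr(C^2) of the sample covariance C."""
    n = len(vectors)
    d = len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(d)]
    centered = [[v[i] - means[i] for i in range(d)] for v in vectors]
    cov = [
        [sum(row[i] * row[j] for row in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # C is symmetric, so tr(C^2) equals the sum of its squared entries.
    trace_sq = sum(c * c for row in cov for c in row)
    return trace * trace / trace_sq

def gaussian_cloud(n, scales, seed=0):
    """n samples with independent zero-mean Gaussian coordinates, one std per axis."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]

if __name__ == "__main__":
    isotropic = gaussian_cloud(1000, [1.0] * 16)
    squashed = gaussian_cloud(1000, [1.0] * 4 + [0.01] * 12)
    print(round(participation_ratio(isotropic), 2))  # close to the ambient dimension 16
    print(round(participation_ratio(squashed), 2))   # close to the 4 dominant axes
```

The ratio never exceeds the ambient dimension and shrinks when variance concentrates in a few directions, which is the behavior an effective dimension should have.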

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments correctly identify areas where the geometric claims require more explicit support. We address each major comment below and indicate the revisions made to the manuscript.

Referee: [Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.

Authors: We agree that the derivations and supporting analysis are essential for the load-bearing claims. In the revised manuscript we have added a new subsection (Section 2.2) that derives N < exp(c d_eff ε²) from the standard concentration-of-measure argument for the maximum number of vectors with pairwise inner products bounded by ε, including explicit error terms from the union bound and a sensitivity analysis showing how the exponent scales with small changes in d_eff and ε. The readout noise formula σ_int ∼ √(k/d_eff) is likewise derived from the variance of the sum of approximately independent inner products under the random-vector model, with a short discussion of the conditions under which this variance bound remains valid.

revision: yes

Referee: [Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.

Authors: This is a substantive point. Our model treats the random-vector baseline as the generic case in the absence of specially engineered cancellation; we do not claim it is the only possible regime. In the revision we have expanded the discussion (new paragraph in Section 3) to argue that training-induced near-orthogonality still leaves average overlaps of order 1/√d_eff and that subspace clustering effectively lowers d_eff, which tightens rather than relaxes the concurrency bound. We acknowledge that a full empirical counterexample on trained embeddings lies beyond the present theoretical scope and would constitute valuable follow-up work; the current claim is therefore framed as a baseline limit under the stated modeling assumptions.

revision: partial

Circularity Check

0 steps flagged

No circularity; bounds follow from standard high-dimensional geometry

The paper derives N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for readout limits directly from the geometry of high-dimensional spaces with small but non-zero overlaps between directions, treating embeddings as behaving like random vectors. These relations are standard consequences of concentration and packing arguments in random vector models and are presented as external mathematical facts rather than fitted quantities or self-referential definitions. The kinetic-capacity versus epistemic-accessibility distinction is a conceptual framing built on these geometric premises, while the hinge hypothesis is explicitly separated as an independent empirical proposal. No equation or claim reduces by construction to a prior self-citation, a fitted parameter renamed as a prediction, or an ansatz smuggled through internal reference; the derivation remains self-contained against external benchmarks in high-dimensional geometry.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests primarily on standard high-dimensional geometry; the hinge directions constitute an additional postulated entity without independent evidence supplied in the abstract.

axioms (2)

domain assumption: High-dimensional embedding spaces host many semantic directions with small but non-zero mutual overlaps.
Stated explicitly in the abstract as the foundation for interference accumulation.
domain assumption: A finite readout channel must recover information from the joint activation of multiple directions.
Implicit in the definition of epistemic accessibility.

invented entities (1)

Stable token-associated hinge directions (no independent evidence)
purpose: Organize sense information for polysemous tokens in low-dimensional subspaces perpendicular to the hinge.
Introduced as a stronger, separately falsifiable empirical proposal.
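
To make the postulated decomposition concrete, the sketch below splits a contextual vector into a component along a hinge direction and a hinge-perpendicular remainder. Estimating the hinge as the normalized mean of a token's contextual vectors is an assumption made here for illustration, as are the toy data and function names; the source does not specify how a hinge would be identified.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def estimate_hinge(vectors):
    """Unit vector along the mean of the given vectors (assumed estimator).

    Requires the mean to be non-zero.
    """
    d = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]
    norm = math.sqrt(dot(mean, mean))
    return [m / norm for m in mean]

def split_along_hinge(vector, hinge):
    """Return (coefficient along the unit hinge, component perpendicular to it)."""
    coeff = dot(vector, hinge)
    perp = [v - coeff * h for v, h in zip(vector, hinge)]
    return coeff, perp

if __name__ == "__main__":
    # Toy contextual vectors for one token: a shared component plus
    # sense-specific offsets (hypothetical data).
    contexts = [
        [1.0, 0.3, 0.0],
        [1.0, -0.3, 0.0],
        [1.0, 0.0, 0.4],
        [1.0, 0.0, -0.4],
    ]
    hinge = estimate_hinge(contexts)
    for c in contexts:
        coeff, perp = split_along_hinge(c, hinge)
        print(round(coeff, 3), [round(p, 3) for p in perp])
```

Under the hypothesis, the perpendicular components for different senses would concentrate in a low-dimensional subspace; testing that on real model activations is the empirical step the ledger flags as missing.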

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

N_ε(d) satisfies exp(c ε² d) ≤ N_ε(d) ≤ exp(C ε² d) … d-dimensional space can accommodate … e^{ε² O(d)} property directions

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

unclear: relation between the paper passage and the cited Recognition theorem.

dimension acts not merely as storage capacity but as semantic concurrency bandwidth

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

(1,−1,0) ⊺,f 2 = (1/ √

work page

[2] [2]

strategy

(1,1,0) ⊺, and f3 = (0,0,1) ⊺, obtained by rotating the first and sec- ond vectors by an angleπ/4 clockwise around the z-axis, which is aligned withe 3. In this construction both bases share the same vectore 3 =f 3, but the context or max- imal observables involved are very different: in one case the maximal observable, in its spectral decomposition with ...

work page

[3] [3]

meaning-space

Encoding tokens as quantum statesψin a high- dimensional Hilbert space, thereby embedding them into the “meaning-space” of the model

work page

[4] [4]

Evolving these states by a suitably chosen unitary transformationU, producing a candidate output state ϕ=U ψ.(8)

[5] Performing a terminal measurement (Process 1), which yields a single token according to the Born rule, with probabilities proportional to $|\phi^\dagger\psi|^2$ for pure states.

In this scheme, nonlinearities and irreversibility are absent everywhere except at the initial state preparation (if stochastic) and at the final, irreducibly random measurement. Everything i...
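The three steps (encode, evolve unitarily, measure) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the two-word vocabulary, the real-valued two-dimensional states, the choice of a rotation as the unitary, and the reading of the Born rule as the squared overlap between the evolved state and each token's basis state.

```python
import math
import random

def apply(U, psi):
    """Matrix-vector product phi = U psi for real-valued toy states."""
    return [sum(u * p for u, p in zip(row, psi)) for row in U]

def born_probabilities(phi, token_states):
    """Probability of each token: squared overlap between its state and phi."""
    return [sum(t * p for t, p in zip(tok, phi)) ** 2 for tok in token_states]

# Hypothetical two-token vocabulary encoded as an orthonormal basis.
tokens = ["river", "money"]
token_states = [[1.0, 0.0], [0.0, 1.0]]

# Step 1: encode the input token as a state.
psi = [1.0, 0.0]

# Step 2: evolve with a unitary (here a real rotation by pi/6).
theta = math.pi / 6
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
phi = apply(U, psi)

# Step 3: terminal measurement; one token is drawn with Born-rule probabilities.
probs = born_probabilities(phi, token_states)
sample = random.choices(tokens, weights=probs, k=1)[0]
```

Only the final draw is random in this sketch; the evolution before it is linear and reversible, matching the description above.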

[6] A. Karpathy, Deep dive into LLMs like ChatGPT (2025), YouTube video, accessed September 05, 2025.

[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17 (Curran Associates Inc., Red Hook, NY, USA, 2017) pp. 6000–6010, arXiv:1706.03762.

[8] J. Kruger and D. Dunning, Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments, Journal of Personality and Social Psychology 77, 1121 (1999).

[9] A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, Why language models hallucinate (2025), arXiv:2509.04664.

[10] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 550, 354 (2017).

[11] D. Schuurmans, H. Dai, and F. Zanini, Autoregressive large language models are computationally universal (2024), arXiv:2410.03170 [cs.CL].

[12] W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, in Conference on Modern Analysis and Probability, Contemporary Mathematics, Vol. 26 (American Mathematical Society, 1984) pp. 189–206.

[13] N. Alon and B. Klartag, Optimal compression of approximate inner products and dimension reduction (2016), arXiv:1610.00239 [math.MG].

[14] K. G. Larsen and J. Nelson, Optimality of the Johnson-Lindenstrauss lemma, in 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (2017) pp. 633–638, arXiv:1609.02094.

[15] A. M. Gleason, Measures on the closed subspaces of a Hilbert space, Journal of Mathematics and Mechanics (now Indiana University Mathematics Journal) 6, 885 (1957).

[16] E. Specker, Die Logik nicht gleichzeitig entscheidbarer Aussagen, Dialectica 14, 239 (1960), English translation at https://arxiv.org/abs/1103.4537, arXiv:1103.4537.

[17] S. Kochen and E. P. Specker, The problem of hidden variables in quantum mechanics, Journal of Mathematics and Mechanics (now Indiana University Mathematics Journal) 17, 59 (1967).

[18] A. Jadhav, The art of sampling: Controlling randomness in LLMs (2025), The AI Engineering Brief, accessed September 7, 2025.

[19] H. He and Thinking Machines Lab, Defeating nondeterminism in LLM inference, Thinking Machines Lab: Connectionism, doi:10.64434/tml.20250910 (2025), https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

[20] G. J. Chaitin, Algorithmic Information Theory, revised ed., Cambridge Tracts in Theoretical Computer Science, Volume 1 (Cambridge University Press, Cambridge, 1987, 2003).

[21] C. S. Calude, Information and Randomness—An Algorithmic Perspective, 2nd ed. (Springer, Berlin, 2002).

[22] G. J. Chaitin, Information-theoretic limitations of formal systems, Journal of the Association of Computing Machinery (JACM) 21, 403 (1974).

[23] G. J. Chaitin, Information-theoretic incompleteness, Applied Mathematics and Computation 52, 83 (1992).

[24] G. J. Chaitin, Information, Randomness and Incompleteness. Papers on Algorithmic Information Theory (World Scientific Series in Computer Science: Volume 8), 2nd ed. (World Scientific, Singapore, 1990); this is a collection of G. Chaitin’s early publications.

[25] J. Frankle and M. Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks, in 7th International Conference on Learning Representations (ICLR) (The International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA,

[26] see also URL https://openreview.net/forum?id=rJl-b3RcF7

[27] A. Zeilinger, The message of the quantum, Nature 438, 743 (2005).

[28] J. von Neumann, Mathematical Foundations of Quantum Mechanics, Princeton Landmarks in Mathematics and Physics (Princeton University Press, Princeton, NJ, USA, 2018), translated by Robert T. Beyer, edited by Nicholas A. Wheeler.

[29] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, Complex Adaptive Systems (MIT Press, Cambridge, MA, 1992).

[30] T. Toffoli, The role of the observer in uniform systems, in Applied General Systems Research: Recent Developments and Trends, edited by G. J. Klir (Plenum Press, Springer US, New York, London, and Boston, MA, USA, 1978) pp. 395–400.

[31] G. Li, X. Zhao, and X. Wang, Quantum self-attention neural networks for text classification, Science China Information Sciences 67, doi:10.1007/s11432-023-3879-7 (2024).

[32] H. Zhang, Q. Zhao, M. Zhou, L. Feng, D. Niyato, S. Zheng, and L. Chen, A survey of quantum transformers: Architectures, challenges and outlooks (2025), arXiv:2504.03192 [quant-ph].

[33] S. Dasgupta and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss, Random Structures & Algorithms 22, 60 (2003).

Appendix A: Mathematical Foundations
This appendix provides mathematical background for key concepts in neural network training and the Transformer architecture. For a general-audience introduction, these technical det...

[34] Probability and Logits
For a probability $p \in (0,1)$, the logit of $p$ is

$$\operatorname{logit}(p) = \log\frac{p}{1-p}. \tag{A1}$$

The sigmoid function,

$$\sigma(z) = \frac{1}{1+e^{-z}}, \tag{A2}$$

maps a real-valued logit $z$ to a probability $p$. They are exact inverses:

$$\sigma(\operatorname{logit}(p)) = p \quad\text{and}\quad \operatorname{logit}(\sigma(z)) = z. \tag{A3}$$

The derivative of the sigmoid function can be obtained with the chain rule:

$$\sigma'(z) = \sigma(z)\,(1-\sigma(z)). \tag{A4}$$

In the context...
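Equations (A1) to (A4) can be checked numerically. The sketch below uses illustrative values of $p$ and $z$ and compares the closed-form derivative (A4) with a central finite difference; the function names are chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1), Eq. (A1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Map a real-valued logit to a probability, Eq. (A2)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, Eq. (A4)."""
    s = sigmoid(z)
    return s * (1.0 - s)

p, z, h = 0.3, 1.7, 1e-6
roundtrip_p = sigmoid(logit(p))      # should return p, Eq. (A3)
roundtrip_z = logit(sigmoid(z))      # should return z, Eq. (A3)
numeric_slope = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```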

[35] Softmax for Multi-Class Problems
Let $C \ge 2$ be the number of classes, and let $z \in \mathbb{R}^C$ be a score (logit) vector. The softmax function,

$$\operatorname{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}, \tag{A5}$$

for $i = 1, \ldots, C$, maps $z$ to a probability vector $\operatorname{softmax}(z) \in (0,1)^C$ that lies on the probability simplex:

$$\operatorname{softmax}(z) \in \Big\{ p \in \mathbb{R}^C : p_i \ge 0,\ \sum_{i=1}^{C} p_i = 1 \Big\}. \tag{A6}$$

Equivalently, if we define $u =$ ...

[36] Positivity: $\operatorname{softmax}(z_i) \ge 0$.

[37] Normalization: $\sum_i s(z_i) = 1$.

[38] Shift-invariance: for any scalar $a$, $\operatorname{softmax}(z + a\mathbf{1}) = \operatorname{softmax}(z)$.

In the context of language model inference, the output of the softmax function is modified by a sampling hyperparameter called temperature, $T > 0$. The temperature is used to control the randomness of the model’s predictions by scaling the logits before the softmax is applied: $p_i = \exp(z_i/T) \,/\, \sum^{C}$...
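The three properties and the effect of temperature can be seen in a few lines. This is a minimal sketch with illustrative logits; subtracting the maximum before exponentiating is a common numerical-stability step that relies on shift-invariance and is an addition here, not something stated in the text.

```python
import math

def softmax(z, T=1.0):
    """Softmax of the logits z at temperature T > 0."""
    scaled = [zi / T for zi in z]
    m = max(scaled)                       # allowed by shift-invariance
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]
p = softmax(z)                                  # T = 1
p_shifted = softmax([zi + 100.0 for zi in z])   # same distribution as p
p_cold = softmax(z, T=0.5)                      # sharper: favors the top logit
p_hot = softmax(z, T=5.0)                       # flatter: closer to uniform
```

Lower temperatures concentrate probability on the largest logit, while higher temperatures spread it more evenly across the classes.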

[39] Model, notation, and shapes
Backpropagation elegantly leverages the chain rule of calculus to compute the gradient of the loss function with respect to every parameter in a neural network. By first calculating an “error signal” at the output layer and then propagating it backward, it determines the precise adjustment needed for each weight, bias, and, c...

[40] Model, Notation, and Shapes
The network architecture and its associated mathematical objects are defined as follows:

1. Vocabulary size: $V$; Embedding dimension: $d$; Hidden layer width: $h$.
2. Input: The input for a vocabulary token $k$ is a one-hot column vector $x \in \mathbb{R}^V$, where the $k$-th element is 1 and all other elements are zero. We can denote this as $x = e_k$.
3. Embeddi...

[41] Forward Pass
The forward computation proceeds through the network layers as follows:

$$z_2 = W_1 v_k + b_1, \tag{B2}$$

$$a_2 = \sigma(z_2), \tag{B3}$$

$$z_3 = W_2 a_2 + b_2, \tag{B4}$$

$$\hat{y} = \sigma(z_3). \tag{B5}$$

Here, $z_2 \in \mathbb{R}^h$ is the pre-activation of the hidden layer, and $a_2 \in \mathbb{R}^h$ is its activation. The function $\sigma$ is applied component-wise in Eq. (B3). The scalar $z_3 \in \mathbb{R}$ is the final pre-activation, or logit, and ...
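Equations (B2) to (B5) translate directly into code. The sketch below assumes that $v_k \in \mathbb{R}^d$ is the embedding of token $k$ and that $W_2$ is a single row of length $h$ (the output is a scalar); all numerical values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(v_k, W1, b1, W2, b2):
    """Forward pass (B2)-(B5) for one embedded token v_k."""
    z2 = [s + b for s, b in zip(matvec(W1, v_k), b1)]   # (B2), length h
    a2 = [sigmoid(z) for z in z2]                        # (B3), component-wise
    z3 = sum(w * a for w, a in zip(W2, a2)) + b2         # (B4), scalar logit
    y_hat = sigmoid(z3)                                  # (B5)
    return z2, a2, z3, y_hat

# Illustrative values only: d = 2, h = 3.
v_k = [0.5, -1.0]
W1 = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
W2 = [0.7, -0.8, 0.9]
b2 = 0.05
z2, a2, z3, y_hat = forward(v_k, W1, b1, W2, b2)
```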

[42] Backward Pass
We use a mean squared error loss for a single training example with target $y \in \{0,1\}$:

$$L = \tfrac{1}{2}(\hat{y} - y)^2. \tag{B6}$$

Our goal is to compute the gradient of $L$ with respect to each parameter. We do this by defining layer-wise “error signals”, which are derivatives of the loss with respect to the pre-activations ($z$).

a. Output Layer
The error signal for the ou...
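The error signals can be sketched and checked against finite differences. The derivation in the text is cut off here, so the formulas below are the standard chain-rule results for a sigmoid network with the loss (B6), namely $\delta_3 = (\hat{y} - y)\,\hat{y}(1-\hat{y})$ and $\delta_2 = W_2^\top \delta_3 \odot a_2(1-a_2)$, and are not copied from the paper. Shapes and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(v, W1, b1, W2, b2, y):
    """Forward pass followed by L = (y_hat - y)^2 / 2; returns (L, a2, y_hat)."""
    z2 = [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    y_hat = sigmoid(sum(w * a for w, a in zip(W2, a2)) + b2)
    return 0.5 * (y_hat - y) ** 2, a2, y_hat

def gradients(v, W1, b1, W2, b2, y):
    _, a2, y_hat = loss(v, W1, b1, W2, b2, y)
    delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)                    # dL/dz3
    delta2 = [w * delta3 * a * (1.0 - a) for w, a in zip(W2, a2)]   # dL/dz2
    grad_W2 = [delta3 * a for a in a2]
    grad_W1 = [[d * x for x in v] for d in delta2]
    # The bias gradients equal the error signals themselves.
    return grad_W1, delta2, grad_W2, delta3

v, y = [0.5, -1.0], 1.0
W1 = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
W2, b2 = [0.7, -0.8, 0.9], 0.05
grad_W1, grad_b1, grad_W2, grad_b2 = gradients(v, W1, b1, W2, b2, y)

# Central finite-difference check on a single weight, W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(v, W1_plus, b1, W2, b2, y)[0]
           - loss(v, W1_minus, b1, W2, b2, y)[0]) / (2 * eps)
```

Agreement between `numeric` and `grad_W1[0][1]` is a quick sanity check that the backward pass matches the forward pass it differentiates.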

[43] Projection to Query, Key, and Value
First, the input matrix $X$ is linearly projected into three distinct matrices: Query ($Q$), Key ($K$), and Value ($V$). This is accomplished using three learned weight matrices: $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$. Typically, for the single-head attention block, the dimensions are set such that $d = d_k = d_v$ [2] (for multi...

[44] The Query vector can be seen as a representation of what information the current token is seeking from the rest of the sequence.

[45] The Key vector represents what kind of information the token itself contains or offers.

[46] The Value vector is the actual content or representation of the token that will be passed on if other tokens attend to it.
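The three projections amount to three matrix products with the same input. The sketch below uses toy sizes ($n = 3$ tokens, $d = d_k = d_v = 2$) and hand-picked weight matrices that are purely illustrative; in a trained model these weights are learned.

```python
def matmul(A, B):
    """Product of an (n x m) and an (m x p) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# One row of X per token.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_Q)   # what each token is looking for
K = matmul(X, W_K)   # what each token offers
V = matmul(X, W_V)   # the content passed on when a token is attended to
```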

[47] Attention Score Calculation
The relationship, or compatibility, between each pair of tokens is computed by taking the dot product of their respective Query and Key vectors. This is performed for all tokens simultaneously via a matrix multiplication between $Q$ and the transpose of $K$. The resulting matrix of raw attention scores is:

$$\text{Scores} = QK^T \in \mathbb{R}^{n \times n}. \tag{C4}$$

The...
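The score matrix can be computed row by row. This sketch also applies the division by $\sqrt{d_k}$ that appears in the attention-weight formula (C6); the toy $Q$ and $K$ are illustrative.

```python
import math

def scaled_scores(Q, K):
    """Scores Q K^T divided by sqrt(d_k); entry [i][j] pairs query i with key j."""
    d_k = len(K[0])
    return [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
             for k_row in K] for q_row in Q]

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
S = scaled_scores(Q, K)   # n x n, here 3 x 3
```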

[48] Applying Softmax and Causal Masking
A softmax function is applied row-wise to the scaled scores matrix. This converts the raw, unnormalized scores into a probability distribution, yielding the attention weights matrix $A$.

$$A = \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{n \times n}. \tag{C6}$$

Each element $A_{ij}$ is the weight assigned to the Value vector of token $j$ when computing the output for toke...
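A row-wise softmax with a causal mask can be sketched as below. The masking details are cut off in the text here, so this assumes the usual rule that token $i$ may attend only to positions $j \le i$, which is equivalent to setting the masked scores to $-\infty$ before the softmax. The score matrix is illustrative.

```python
import math

def causal_attention_weights(S):
    """Row-wise softmax of S with future positions (j > i) masked out."""
    A = []
    for i, row in enumerate(S):
        visible = row[: i + 1]                  # token i sees tokens 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        A.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return A

S = [[0.0, 0.7, 0.7],
     [0.7, 0.0, 0.7],
     [0.7, 0.7, 1.4]]
A = causal_attention_weights(S)
```

Each row of `A` is a probability distribution over the current and earlier tokens, so the first token can only attend to itself.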

[49] Output Computation
The final output of the self-attention layer is a weighted sum of all Value vectors, where the weights are given by the matrix $A$. This computation is performed with a single matrix multiplication:

$$Y = AV \in \mathbb{R}^{n \times d_v}. \tag{C7}$$

The $i$-th row of the output, $y_i$, is therefore a contextualized vector for the $i$-th token, formed by aggregating information...
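Equation (C7) is a single matrix product. In the sketch below the attention weights and Value vectors are hand-picked illustrative numbers; each output row is the weighted average of the Value rows.

```python
def matmul(A, B):
    """Product of an (n x m) and an (m x p) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Rows of A sum to 1; the zeros above the diagonal reflect a causal mask.
A = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.25, 0.25, 0.5]]
V = [[2.0, 0.0],
     [0.0, 2.0],
     [2.0, 2.0]]
Y = matmul(A, V)   # n x d_v; row i mixes the Value vectors visible to token i
```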

[50] Multi-Head Attention
A single attention mechanism might be forced to learn an “average” of several types of relationships between tokens. To allow the model to capture a richer set of relationships, the Transformer employs multi-head attention. This mechanism does not run a single attention calculation, but rather multiple attention calculations in paral...

“heads.” $\text{head}_i = \text{Attention}(XW_{Q,i}, XW_{K,i}, XW_{V,i})$. (D1)

3. Concatenation and Final Projection: The outputs of these $h$ heads are “horizontally
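The parallel heads, their horizontal concatenation, and a final projection can be sketched as follows. The output projection matrix is called `W_O` here as an assumption (its name is not visible in the text), the attention is unmasked for brevity, and the toy weights are illustrative.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Unmasked scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples; W_O is the output projection."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Horizontal concatenation: row t collects token t's output from every head.
    concat = [sum((out[t] for out in outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

# Toy setup: n = 2 tokens, d = 2, h = 2 heads of width 1.
X = [[1.0, 0.0], [0.0, 1.0]]
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
W_O = [[1.0, 0.0], [0.0, 1.0]]
Y = multi_head(X, [head_a, head_b], W_O)
```

Because each head has its own projections, the two heads here attend to different coordinates of the input, which is the kind of specialization that a single averaged head could not express.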
