Semantic Concurrency Limits in Large Language Models
Pith reviewed 2026-05-22 18:37 UTC · model grok-4.3
The pith
Embedding dimension limits how many semantic directions can be read out at once: interference between jointly active directions accumulates, so dimension is not only storage capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-dimensional embedding spaces can host N < exp(c d_eff ε²) semantic directions with pairwise overlap at most ε. When k directions activate together, residual interference adds readout noise with standard deviation σ_int ∼ √(k/d_eff). Dimension therefore serves as semantic concurrency bandwidth, distinguishing what the geometry can contain (kinetic capacity) from what readout can access (epistemic accessibility).
What carries the argument
The kinetic capacity versus epistemic accessibility distinction, expressed through the coexistence bound N < exp(c d_eff ε²) and the interference scaling σ_int ∼ √(k/d_eff)
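The coexistence side can be illustrated numerically. The following is a minimal sketch, assuming semantic directions behave like independent random unit vectors (the modeling assumption the review attributes to the paper, not a property verified on trained embeddings). The function name `max_pairwise_overlap` and the reference scale √(4 ln N / d) are illustrative choices; the constant c in the paper's bound is not specified here.

```python
import numpy as np

def max_pairwise_overlap(n, d, seed=0):
    """Largest |cosine| between any two of n random unit vectors in R^d.

    Assumption: directions are i.i.d. uniform on the sphere, a stand-in
    for 'semantic directions' under the random-vector model.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n, d))
    u = v / np.linalg.norm(v, axis=1, keepdims=True)
    gram = np.abs(u @ u.T)
    np.fill_diagonal(gram, 0.0)  # ignore self-overlap
    return gram.max()

if __name__ == "__main__":
    n = 500
    for d in (32, 128, 512, 2048):
        eps = max_pairwise_overlap(n, d)
        scale = np.sqrt(4 * np.log(n) / d)  # union-bound scale, for comparison only
        print(f"d={d:5d}  max overlap={eps:.3f}  sqrt(4 ln n / d)={scale:.3f}")
```

For a fixed number of directions, the worst-case overlap should shrink as d grows, which is the qualitative content of N < exp(c d_eff ε²) read in reverse.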
If this is right
- Higher embedding dimensions increase the number of semantic directions that can be processed simultaneously before interference dominates.
- Readout accuracy in language models faces a hard geometric limit tied to effective dimension when many concepts are jointly active.
- Polysemous token resolution may require explicit low-dimensional subspaces orthogonal to stable hinge directions to remain accessible.
Where Pith is reading between the lines
- Embedding training procedures that explicitly minimize effective overlaps could raise concurrency limits beyond what dimension scaling alone achieves.
- Task-specific models might benefit from dimension choices matched to expected numbers of concurrent concepts rather than uniform scaling.
- The same interference mechanism suggests testable predictions for how model performance degrades on inputs requiring many simultaneous distinctions.
Load-bearing premise
Residual interference from small non-zero overlaps between semantic directions accumulates without mitigation by additional structure in the embeddings.
What would settle it
Fix the effective dimension d_eff, activate an increasing number k of directions, and measure the readout error. The claim predicts that the error standard deviation grows as √(k/d_eff) (equivalently, the variance grows as k/d_eff); it fails if the error is independent of d_eff.
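A synthetic version of this test can be sketched as follows. It assumes random unit directions with unit activations and a linear readout along a probe direction; these are illustrative assumptions, and `interference_std` is a hypothetical helper, not code from the paper. A real test would replace the random vectors with directions extracted from a trained model.

```python
import numpy as np

def random_unit_vectors(n, d, rng):
    """Draw n directions uniformly from the unit sphere in R^d."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def interference_std(k, d, trials=2000, seed=0):
    """Empirical std of the cross-talk a linear readout along a probe
    direction picks up when k other random directions are active."""
    rng = np.random.default_rng(seed)
    samples = np.empty(trials)
    for t in range(trials):
        probe = random_unit_vectors(1, d, rng)[0]
        active = random_unit_vectors(k, d, rng)
        samples[t] = (active @ probe).sum()  # summed residual overlaps
    return samples.std()

if __name__ == "__main__":
    for d in (64, 256):
        for k in (4, 16, 64):
            measured = interference_std(k, d)
            predicted = np.sqrt(k / d)
            print(f"d={d:4d} k={k:3d} measured={measured:.3f} sqrt(k/d)={predicted:.3f}")
```

Under these assumptions the measured and predicted columns should agree closely; a systematic gap on real embeddings would indicate structure that the random baseline does not capture.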
Original abstract
High-dimensional embedding spaces can host many semantic directions with small mutual overlap. But small overlaps are not zero: when many directions are jointly active, their residual interference accumulates and limits what a finite readout channel can recover. We formulate this as a distinction between kinetic capacity (what the geometry can host) and epistemic accessibility (what readout can recover). The two sides are summarized by N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for simultaneous readout. Thus dimension acts not merely as storage capacity but as semantic concurrency bandwidth. On this geometric foundation we propose a separate hypothesis: some polysemous tokens may be organized around stable token-associated hinge directions, with sense information carried by low-dimensional subspaces in the hinge-perpendicular carrier. The capacity/accessibility distinction is the main claim; the hinge hypothesis is a stronger, separately falsifiable empirical proposal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that high-dimensional embedding spaces in LLMs host many semantic directions with small but non-zero overlaps, leading to accumulating residual interference that limits simultaneous recovery by a finite readout. It distinguishes kinetic capacity (N < exp(c d_eff ε²) for coexistence of directions) from epistemic accessibility (σ_int ∼ √(k/d_eff) for readout limits), arguing that dimension functions as semantic concurrency bandwidth rather than mere storage. A separate, stronger hypothesis proposes that polysemous tokens are organized around stable token-associated hinge directions, with sense information carried in low-dimensional subspaces perpendicular to the hinge.
Significance. If the geometric bounds are rigorously derived and shown to apply to trained LLM embeddings without substantial mitigating structure, the capacity-accessibility distinction could provide a useful conceptual tool for analyzing limits on parallel semantic processing. The hinge hypothesis is framed as separately falsifiable, which is a positive feature for empirical follow-up. However, the significance hinges on whether the random-vector interference model captures how trained LLM embeddings actually behave, or whether training-induced correlations reduce the claimed limits.
major comments (2)
- [Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.
- [Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.
minor comments (2)
- [Notation] The effective dimension d_eff and overlap parameter ε are used throughout without explicit operational definitions or estimation procedures from embedding data.
- [Hinge Hypothesis] The hinge hypothesis is introduced as a stronger empirical proposal but lacks a concrete falsification plan or reference to existing polysemy literature in embedding spaces.
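One way the missing operational definitions could be supplied is sketched below. This is an assumption for illustration only: the participation ratio of the covariance spectrum is one common proxy for effective dimension, and a high quantile of absolute pairwise cosine similarity is one possible estimate of ε. Neither is stated to be the definition used in the manuscript, and both function names are hypothetical.

```python
import numpy as np

def participation_ratio(X):
    """Effective-dimension proxy: (sum λ)^2 / sum λ^2 over the eigenvalues λ
    of the covariance of the rows of X. Always between 1 and X.shape[1]."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    return lam.sum() ** 2 / np.square(lam).sum()

def overlap_quantile(X, q=0.99):
    """Overlap proxy: q-quantile of |cosine similarity| over distinct row pairs.
    Assumes no row of X is the zero vector."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.abs(U @ U.T)
    iu = np.triu_indices(len(U), k=1)
    return np.quantile(G[iu], q)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    iso = rng.standard_normal((5000, 32))                              # full-rank cloud
    low = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 32))  # rank-4 cloud
    print("d_eff (isotropic):", participation_ratio(iso))
    print("d_eff (rank 4):   ", participation_ratio(low))
    print("eps (isotropic):  ", overlap_quantile(iso[:300]))
```

On the rank-4 cloud the proxy cannot exceed 4 even though the ambient dimension is 32, which is the kind of gap between nominal and effective dimension the comment asks the authors to quantify.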
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments correctly identify areas where the geometric claims require more explicit support. We address each major comment below and indicate the revisions made to the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.
Authors: We agree that the derivations and supporting analysis are essential for the load-bearing claims. In the revised manuscript we have added a new subsection (Section 2.2) that derives N < exp(c d_eff ε²) from the standard concentration-of-measure argument for the maximum number of vectors with pairwise inner products bounded by ε, including explicit error terms from the union bound and a sensitivity analysis showing how the exponent scales with small changes in d_eff and ε. The readout noise formula σ_int ∼ √(k/d_eff) is likewise derived from the variance of the sum of approximately independent inner products under the random-vector model, with a short discussion of the conditions under which this variance bound remains valid.
Revision: yes
-
Referee: [Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.
Authors: This is a substantive point. Our model treats the random-vector baseline as the generic case in the absence of specially engineered cancellation; we do not claim it is the only possible regime. In the revision we have expanded the discussion (new paragraph in Section 3) to argue that training-induced near-orthogonality still leaves average overlaps of order 1/√d_eff and that subspace clustering effectively lowers d_eff, which tightens rather than relaxes the concurrency bound. We acknowledge that a full empirical counterexample on trained embeddings lies beyond the present theoretical scope and would constitute valuable follow-up work; the current claim is therefore framed as a baseline limit under the stated modeling assumptions.
Revision: partial
Circularity Check
No circularity; bounds follow from standard high-dimensional geometry
The paper derives N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for readout limits directly from the geometry of high-dimensional spaces with small but non-zero overlaps between directions, treating embeddings as behaving like random vectors. These relations are standard consequences of concentration and packing arguments in random vector models and are presented as external mathematical facts rather than fitted quantities or self-referential definitions. The kinetic-capacity versus epistemic-accessibility distinction is a conceptual framing built on these geometric premises, while the hinge hypothesis is explicitly separated as an independent empirical proposal. No equation or claim reduces by construction to a prior self-citation, a fitted parameter renamed as a prediction, or an ansatz smuggled through internal reference; the derivation remains self-contained against external benchmarks in high-dimensional geometry.
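For orientation, the two relations can be recovered in a few lines under the random-vector model. This is a sketch under stated assumptions, not the manuscript's derivation: the directions are taken to be independent and uniform on the unit sphere, d stands for d_eff, and the constant 1/4 in the exponent is specific to this union-bound argument rather than the paper's constant c.

```latex
% Assumption: u_0, u_1, \dots are independent and uniform on the unit sphere S^{d-1}, d = d_eff.
\begin{align*}
% Simultaneous readout: k active directions seen through a probe u_0
\mathbb{E}\,\langle u_0, u_j\rangle &= 0, \qquad
\mathbb{E}\,\langle u_0, u_j\rangle^{2} = \frac{1}{d}, \\
I_k = \sum_{j=1}^{k} \langle u_0, u_j\rangle
&\quad\Longrightarrow\quad
\operatorname{Var}(I_k) = \frac{k}{d}, \qquad
\sigma_{\mathrm{int}} = \sqrt{k/d}. \\[6pt]
% Coexistence: N directions with all pairwise overlaps below \varepsilon
\Pr\bigl(|\langle u_i, u_j\rangle| \ge \varepsilon\bigr)
&\le 2\,e^{-d\varepsilon^{2}/2}, \\
\Pr\bigl(\exists\, i<j:\ |\langle u_i, u_j\rangle| \ge \varepsilon\bigr)
&\le \binom{N}{2}\,2\,e^{-d\varepsilon^{2}/2}
< N^{2} e^{-d\varepsilon^{2}/2}.
\end{align*}
% The last bound is at most 1 whenever N \le \exp(d\varepsilon^{2}/4), so random
% directions achieve exponentially many \varepsilon-separated directions.
```

The union bound shows only that exponentially many directions can coexist; a matching upper limit of the form exp(C d ε²) requires a separate packing argument that is not reproduced here.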
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: High-dimensional embedding spaces host many semantic directions with small but non-zero mutual overlaps.
- domain assumption: A finite readout channel must recover information from the joint activation of multiple directions.
invented entities (1)
- Stable token-associated hinge directions: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
N_ε(d) satisfies exp(c ε² d) ≤ N_ε(d) ≤ exp(C ε² d) … d-dimensional space can accommodate … e^{ε² O(d)} property directions
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
dimension acts not merely as storage capacity but as semantic concurrency bandwidth
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.