arxiv: 2504.13824 · v5 · pith:CEB7ISLBnew · submitted 2025-04-18 · 🪐 quant-ph

Semantic Concurrency Limits in Large Language Models

Pith reviewed 2026-05-22 18:37 UTC · model grok-4.3

classification 🪐 quant-ph
keywords semantic embeddings · concurrency limits · interference scaling · epistemic accessibility · embedding dimension · kinetic capacity · polysemy

The pith

Dimension limits simultaneous semantic access in embeddings through accumulating interference rather than acting only as storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that high-dimensional spaces allow many semantic directions to coexist when their overlaps remain small, yet these non-zero overlaps produce interference that grows with the number of active directions and restricts what a finite readout channel can recover. It separates geometric hosting capacity from recoverable information under concurrency, showing that dimension acts as bandwidth for simultaneous processing. A sympathetic reader would care because this frames scaling limits in language models as a geometric constraint on handling multiple concepts at once, rather than a pure data or parameter issue.

Core claim

High-dimensional embedding spaces can host N < exp(c d_eff ε²) semantic directions with pairwise overlap bounded by ε. When k directions activate together, residual interference produces readout noise with standard deviation σ_int ∼ √(k/d_eff), so the noise variance grows as k/d_eff. Dimension therefore serves as semantic concurrency bandwidth, distinguishing what the geometry can contain (kinetic capacity) from what readout can access (epistemic accessibility).
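The coexistence side of the claim can be illustrated numerically. The sketch below is not taken from the paper: it assumes the directions are independent random unit vectors and treats d_eff as the ambient dimension. The function name `max_overlap` and the comparison scale √(4 ln N / d), which is what a union bound with c = 1/4 would suggest in this model, are illustrative choices.

```python
import numpy as np

def max_overlap(n_dirs, dim, seed=0):
    """Largest absolute pairwise overlap among n_dirs random unit vectors in R^dim."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_dirs, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row to unit length
    gram = v @ v.T                                 # all pairwise inner products
    np.fill_diagonal(gram, 0.0)                    # ignore self-overlaps
    return float(np.abs(gram).max())

if __name__ == "__main__":
    n_dirs = 500
    for dim in (64, 256, 1024):
        eps = max_overlap(n_dirs, dim)
        # Assumed reference scale: solving N = exp(d * eps^2 / 4) for eps.
        scale = np.sqrt(4.0 * np.log(n_dirs) / dim)
        print(f"d={dim:5d}  max overlap={eps:.3f}  sqrt(4 ln N / d)={scale:.3f}")
```

Under these assumptions the largest overlap shrinks as the dimension grows while the number of directions stays fixed, which is the sense in which higher dimension hosts more nearly orthogonal directions.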

What carries the argument

The kinetic capacity versus epistemic accessibility distinction, expressed through the coexistence bound N < exp(c d_eff ε²) and the interference scaling σ_int ∼ √(k/d_eff)

If this is right

  • Higher embedding dimensions increase the number of semantic directions that can be processed simultaneously before interference dominates.
  • Readout accuracy in language models faces a hard geometric limit tied to effective dimension when many concepts are jointly active.
  • Polysemous token resolution may require explicit low-dimensional subspaces orthogonal to stable hinge directions to remain accessible.
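One way to picture the hinge-plus-carrier decomposition mentioned in the last bullet is sketched below. Everything here is an assumption made for illustration: the function `split_hinge_carrier` is hypothetical, the hinge is estimated as the normalized mean of a token's contextual embeddings, and the sense subspace is taken from the leading singular vectors of the hinge-perpendicular residuals. The paper may operationalize these notions differently.

```python
import numpy as np

def split_hinge_carrier(contextual, n_sense_dims=2):
    """Split embeddings of one token into a hinge direction and perpendicular sense directions.

    contextual: array of shape (n_contexts, dim), one row per occurrence of the token.
    Returns (hinge, sense_dirs, singular_values).
    """
    hinge = contextual.mean(axis=0)
    hinge /= np.linalg.norm(hinge)                 # assumed hinge estimate: normalized mean
    along = contextual @ hinge                     # component of each row along the hinge
    carrier = contextual - np.outer(along, hinge)  # hinge-perpendicular remainder
    # Rows of `carrier` are orthogonal to the hinge, so the leading right
    # singular vectors span a low-dimensional subspace inside that carrier.
    _, s, vt = np.linalg.svd(carrier, full_matrices=False)
    return hinge, vt[:n_sense_dims], s

if __name__ == "__main__":
    # Synthetic stand-in for a two-sense token: a shared component on axis 0,
    # a sense-dependent component of opposite sign on axis 1, plus small noise.
    rng = np.random.default_rng(0)
    x = 0.05 * rng.standard_normal((200, 16))
    x[:, 0] += 5.0
    x[:100, 1] += 1.0
    x[100:, 1] -= 1.0
    hinge, sense_dirs, s = split_hinge_carrier(x, n_sense_dims=1)
    print("hinge . sense direction:", float(hinge @ sense_dirs[0]))
    print("leading singular values:", np.round(s[:3], 2))
```

On real embeddings the interesting question is whether a few singular values dominate the carrier, which this synthetic example builds in by construction and therefore cannot test.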

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding training procedures that explicitly minimize effective overlaps could raise concurrency limits beyond what dimension scaling alone achieves.
  • Task-specific models might benefit from dimension choices matched to expected numbers of concurrent concepts rather than uniform scaling.
  • The same interference mechanism suggests testable predictions for how model performance degrades on inputs requiring many simultaneous distinctions.

Load-bearing premise

Residual interference from small non-zero overlaps between semantic directions accumulates without mitigation by additional structure in the embeddings.

What would settle it

Activate an increasing number k of directions at fixed effective dimension d_eff, measure the recovery error, and test whether its standard deviation grows as √(k/d_eff) (equivalently, whether its variance grows as k/d_eff) rather than remaining independent of d_eff.
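A toy version of this test can be run on synthetic vectors before attempting it on trained embeddings. The sketch below assumes independent random unit directions, unit activation strengths, a linear inner-product readout, and d_eff equal to the ambient dimension; the function name `interference_std` is hypothetical. It shows only what the random baseline predicts, not how a trained model behaves.

```python
import numpy as np

def interference_std(k, dim, trials=400, seed=0):
    """Std of the crosstalk that k active random directions leave on an inactive probe."""
    rng = np.random.default_rng(seed)
    # Per trial: one probe direction plus k active directions, all unit length.
    v = rng.standard_normal((trials, k + 1, dim))
    v /= np.linalg.norm(v, axis=2, keepdims=True)
    probe = v[:, 0, :]                    # direction being read out (not active)
    active_sum = v[:, 1:, :].sum(axis=1)  # superposition of the k active directions
    crosstalk = np.einsum("td,td->t", probe, active_sum)  # linear readout per trial
    return float(crosstalk.std())

if __name__ == "__main__":
    for dim in (64, 256):
        for k in (4, 16, 32):
            measured = interference_std(k, dim)
            predicted = np.sqrt(k / dim)
            print(f"d={dim:4d} k={k:3d}  measured={measured:.3f}  sqrt(k/d)={predicted:.3f}")
```

Replacing the random vectors with directions extracted from a trained model, and d with an estimate of d_eff, would turn this into the falsification test described above. A measured curve that stays well below √(k/d_eff) would indicate structure that mitigates interference.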

Figures

Figures reproduced from arXiv: 2504.13824 by Karl Svozil.

Figure 1. Hypergraph visualization of three intertwining con...

Original abstract

High-dimensional embedding spaces can host many semantic directions with small mutual overlap. But small overlaps are not zero: when many directions are jointly active, their residual interference accumulates and limits what a finite readout channel can recover. We formulate this as a distinction between kinetic capacity (what the geometry can host) and epistemic accessibility (what readout can recover). The two sides are summarized by N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for simultaneous readout. Thus dimension acts not merely as storage capacity but as semantic concurrency bandwidth. On this geometric foundation we propose a separate hypothesis: some polysemous tokens may be organized around stable token-associated hinge directions, with sense information carried by low-dimensional subspaces in the hinge-perpendicular carrier. The capacity/accessibility distinction is the main claim; the hinge hypothesis is a stronger, separately falsifiable empirical proposal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that high-dimensional embedding spaces in LLMs host many semantic directions with small but non-zero overlaps, leading to accumulating residual interference that limits simultaneous recovery by a finite readout. It distinguishes kinetic capacity (N < exp(c d_eff ε²) for coexistence of directions) from epistemic accessibility (σ_int ∼ √(k/d_eff) for readout limits), arguing that dimension functions as semantic concurrency bandwidth rather than mere storage. A separate, stronger hypothesis proposes that polysemous tokens are organized around stable token-associated hinge directions, with sense information carried in low-dimensional subspaces perpendicular to the hinge.

Significance. If the geometric bounds are rigorously derived and shown to apply to trained LLM embeddings without substantial mitigating structure, the capacity-accessibility distinction could provide a useful conceptual tool for analyzing limits on parallel semantic processing. The hinge hypothesis is framed as separately falsifiable, which is a positive feature for empirical follow-up. However, the significance hinges on whether the random-vector interference model captures LLM realities or if training-induced correlations reduce the claimed limits.

major comments (2)
  1. [Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.
  2. [Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.
minor comments (2)
  1. [Notation] The effective dimension d_eff and overlap parameter ε are used throughout without explicit operational definitions or estimation procedures from embedding data.
  2. [Hinge Hypothesis] The hinge hypothesis is introduced as a stronger empirical proposal but lacks a concrete falsification plan or reference to existing polysemy literature in embedding spaces.
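To make the first minor comment concrete, here is one possible way to estimate both quantities from an embedding matrix. These are assumptions, not the paper's definitions: d_eff is approximated by the participation ratio of the covariance spectrum, and ε by a high quantile of the absolute pairwise cosine similarities. The function names are hypothetical.

```python
import numpy as np

def participation_ratio(embeddings):
    """Assumed proxy for d_eff: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    x = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # clip tiny negative round-off
    return float(eig.sum() ** 2 / (eig ** 2).sum())

def overlap_quantile(embeddings, q=0.99):
    """Assumed proxy for epsilon: q-quantile of |cosine similarity| over distinct pairs."""
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = v @ v.T
    iu = np.triu_indices(gram.shape[0], k=1)  # each unordered pair once
    return float(np.quantile(np.abs(gram[iu]), q))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    iso = rng.standard_normal((2000, 32))         # isotropic cloud in 32 dimensions
    low = np.zeros((2000, 32))
    low[:, :4] = rng.standard_normal((2000, 4))   # cloud confined to 4 dimensions
    print("PR isotropic:", round(participation_ratio(iso), 2))
    print("PR confined :", round(participation_ratio(low), 2))
    print("overlap q99 :", round(overlap_quantile(iso[:200]), 3))
```

The participation ratio never exceeds the rank of the covariance, so data clustered in a few directions yields a small value even when the ambient dimension is large. Other estimators would give different numbers, so any reported d_eff should name the estimator used.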

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments correctly identify areas where the geometric claims require more explicit support. We address each major comment below and indicate the revisions made to the manuscript.

  1. Referee: [Abstract] The central formulas N < exp(c d_eff ε²) and σ_int ∼ √(k/d_eff) are stated in the abstract and appear to rest on standard high-dimensional random vector geometry, but no derivation steps, error bounds, or sensitivity analysis to the modeling choices of d_eff and ε are provided. This is load-bearing because the epistemic-accessibility claim requires that residual interference cannot be mitigated below the random baseline.

    Authors: We agree that the derivations and supporting analysis are essential for the load-bearing claims. In the revised manuscript we have added a new subsection (Section 2.2) that derives N < exp(c d_eff ε²) from the standard concentration-of-measure argument for the maximum number of vectors with pairwise inner products bounded by ε, including explicit error terms from the union bound and a sensitivity analysis showing how the exponent scales with small changes in d_eff and ε. The readout noise formula σ_int ∼ √(k/d_eff) is likewise derived from the variance of the sum of approximately independent inner products under the random-vector model, with a short discussion of the conditions under which this variance bound remains valid. revision: yes

  2. Referee: [Geometric Foundation] The argument that residual interference fundamentally limits recovery assumes embeddings behave sufficiently like uncorrelated random vectors (pairwise overlaps ~1/√d_eff with no higher-order dependencies). No analysis or counterexample is given to address whether training-induced structure (near-orthogonality, clustered subspaces, or nonlinear readouts) could reduce effective interference, which directly undermines the concurrency-bandwidth interpretation if such structure exists.

    Authors: This is a substantive point. Our model treats the random-vector baseline as the generic case in the absence of specially engineered cancellation; we do not claim it is the only possible regime. In the revision we have expanded the discussion (new paragraph in Section 3) to argue that training-induced near-orthogonality still leaves average overlaps of order 1/√d_eff and that subspace clustering effectively lowers d_eff, which tightens rather than relaxes the concurrency bound. We acknowledge that a full empirical counterexample on trained embeddings lies beyond the present theoretical scope and would constitute valuable follow-up work; the current claim is therefore framed as a baseline limit under the stated modeling assumptions. revision: partial
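For orientation, the kind of argument the first response refers to can be sketched as follows. This is a reconstruction under an assumed model, not the paper's Section 2.2: the directions u, v_1, ..., v_N are taken to be independent and uniformly distributed on the unit sphere in R^d, and d_eff is identified with d. The constant c = 1/4 is what this particular union bound yields; the paper's constant may differ.

```latex
% Assumed model: u, v_1, ..., v_N independent and uniform on S^{d-1}, with d_eff = d.

% Coexistence. A standard spherical-cap tail bound, followed by a union bound
% over the \binom{N}{2} pairs:
\begin{align}
  \Pr\bigl(|\langle v_i, v_j\rangle| \ge \epsilon\bigr)
    &\le 2\exp\!\bigl(-d\,\epsilon^{2}/2\bigr), \\
  \Pr\bigl(\exists\, i<j:\ |\langle v_i, v_j\rangle| \ge \epsilon\bigr)
    &\le \binom{N}{2}\cdot 2\exp\!\bigl(-d\,\epsilon^{2}/2\bigr)
     < N^{2}\exp\!\bigl(-d\,\epsilon^{2}/2\bigr).
\end{align}
% The right-hand side is at most 1 whenever N \le \exp(d\,\epsilon^{2}/4), so a
% configuration with all overlaps below \epsilon exists for such N (c = 1/4 here).

% Simultaneous readout. For a probe u and k active directions v_1, ..., v_k:
\begin{align}
  \mathbb{E}\,\langle u, v_j\rangle &= 0, \qquad
  \mathbb{E}\,\langle u, v_j\rangle^{2} = \frac{1}{d}, \\
  \operatorname{Var}\Bigl(\sum_{j=1}^{k}\langle u, v_j\rangle\Bigr)
    &= \sum_{j=1}^{k}\mathbb{E}\,\langle u, v_j\rangle^{2} = \frac{k}{d},
  \qquad\text{so}\qquad
  \sigma_{\mathrm{int}} = \sqrt{k/d}.
\end{align}
% Cross terms vanish because, conditional on u, the overlaps \langle u, v_j\rangle
% are independent with mean zero.
```

Both steps depend on independence of the directions, which is exactly the assumption the second referee comment questions.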

Circularity Check

0 steps flagged

No circularity; bounds follow from standard high-dimensional geometry

The paper derives N < exp(c d_eff ε²) for coexistence and σ_int ∼ √(k/d_eff) for readout limits directly from the geometry of high-dimensional spaces with small but non-zero overlaps between directions, treating embeddings as behaving like random vectors. These relations are standard consequences of concentration and packing arguments in random vector models and are presented as external mathematical facts rather than fitted quantities or self-referential definitions. The kinetic-capacity versus epistemic-accessibility distinction is a conceptual framing built on these geometric premises, while the hinge hypothesis is explicitly separated as an independent empirical proposal. No equation or claim reduces by construction to a prior self-citation, a fitted parameter renamed as a prediction, or an ansatz smuggled through internal reference; the derivation remains self-contained against external benchmarks in high-dimensional geometry.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests primarily on standard high-dimensional geometry; the hinge directions constitute an additional postulated entity without independent evidence supplied in the abstract.

axioms (2)
  • domain assumption: High-dimensional embedding spaces host many semantic directions with small but non-zero mutual overlaps.
    Stated explicitly as the foundation for interference accumulation in the abstract.
  • domain assumption: A finite readout channel must recover information from the joint activation of multiple directions.
    Implicit in the definition of epistemic accessibility.
invented entities (1)
  • Stable token-associated hinge directions (no independent evidence)
    purpose: Organize sense information for polysemous tokens in low-dimensional subspaces perpendicular to the hinge.
    Introduced as a stronger, separately falsifiable empirical proposal.

pith-pipeline@v0.9.0 · 5685 in / 1427 out tokens · 92357 ms · 2026-05-22T18:37:16.592154+00:00 · methodology
