pith. machine review for the scientific record.

arxiv: 2604.14727 · v1 · submitted 2026-04-16 · 💻 cs.LG

Recognition: unknown

Expressivity of Transformers: A Tropical Geometry Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords tropical geometry · transformer expressivity · self-attention · power Voronoi diagram · linear regions · multi-head attention · polyhedral complexity · zero-temperature limit

The pith

Self-attention in transformers computes a Power Voronoi diagram exactly in the zero-temperature limit, yielding the first tight bound of Theta(N^{d_model L}) on the number of linear regions in deep models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a tropical geometry lens to measure exactly how transformers carve input space into linear pieces. It models each self-attention layer as a vector-valued tropical rational map and proves that this map computes a Power Voronoi diagram in the zero-temperature limit. From that equivalence the authors show why aggregating H heads raises the region count from O(N) for a single head to O(N^H), and why each additional layer multiplies the exponent in the count by d_model. A reader should care because the argument turns the vague claim that transformers are expressive into a concrete counting formula driven by sequence length, embedding size, and depth. The same geometric skeleton is shown to survive the soft attention used in actual networks through tight approximation bounds.

Core claim

By modeling self-attention as a vector-valued tropical rational map, the paper proves that the operation evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Multi-head aggregation then expands the polyhedral complexity to O(N^H) through the Minkowski sum of Newton polytopes. Extending the construction layer by layer produces the first tight asymptotic bound on the number of linear regions realized by a transformer of depth L: Theta(N^{d_model L}). The idealized partitions remain topologically stable under finite-temperature soft attention because the difference is controlled by exponentially tight differential bounds.
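The zero-temperature equivalence is easy to sanity-check numerically: since ∥q − k_j∥² − w_j = ∥q∥² − 2⟨q, k_j⟩ when the weights are w_j = ∥k_j∥², the hardmax key choice coincides with the power-Voronoi cell assignment. A minimal sketch (illustrative only, not the paper's code; the random keys and the number of sampled queries are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_k = 8, 4                        # sequence length, key dimension
K = rng.normal(size=(N, d_k))        # keys k_1, ..., k_N

def hardmax_choice(q):
    """Zero-temperature attention: index of the key with the largest logit <q, k_j>."""
    return int(np.argmax(K @ q))

def power_cell(q, w):
    """Power-Voronoi assignment: argmin_j ||q - k_j||^2 - w_j."""
    return int(np.argmin(np.sum((q - K) ** 2, axis=1) - w))

w = np.sum(K ** 2, axis=1)           # weights w_j = ||k_j||^2
queries = rng.normal(size=(1000, d_k))
assert all(hardmax_choice(q) == power_cell(q, w) for q in queries)
print("hardmax routing matches the power-Voronoi assignment on all sampled queries")
```

The check passes for any keys because the two objectives differ only by the query-dependent constant ∥q∥², which does not affect the arg-optimum over j.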

What carries the argument

The vector-valued tropical rational map that represents self-attention and its exact reduction to a Power Voronoi Diagram.

If this is right

  • Multi-head self-attention raises the maximum number of polyhedral pieces from O(N) for one head to O(N^H) through Minkowski summation of Newton polytopes.
  • Each additional transformer layer multiplies the exponent in the linear-region count by the embedding dimension d_model, producing combinatorial growth with depth.
  • The zero-temperature Power Voronoi skeleton is preserved by the soft attention used in deployed models, with the approximation error bounded exponentially tightly.
  • Expressivity is driven intrinsically by sequence length N, model dimension d_model, and depth L rather than by the smoothing introduced by finite temperature.
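The last two bullets can be probed in a toy setting: as temperature drops, soft attention output converges to the hardmax (power-Voronoi) output. A hedged sketch with arbitrary random keys and values, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 3
K = rng.normal(size=(N, d))          # keys
V = rng.normal(size=(N, d))          # values
q = rng.normal(size=d)               # a single query

logits = K @ q
hard_out = V[np.argmax(logits)]      # zero-temperature output: the winning value vector

errs = []
for tau in (1.0, 0.1, 0.01):
    p = np.exp((logits - logits.max()) / tau)   # stable softmax at temperature tau
    soft_out = (p / p.sum()) @ V                # finite-temperature soft attention
    errs.append(float(np.linalg.norm(soft_out - hard_out)))

print(errs)                          # deviation shrinks as tau -> 0
assert errs[-1] < errs[0]
```

The deviation is governed by the softmax mass on the non-winning keys, which decays like exp(-gap/tau), consistent with the exponentially tight bounds claimed in the paper.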

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tropical counting technique could be applied to other attention variants such as linear attention or sparse attention to obtain analogous region bounds.
  • If the scaling holds in practice, simply increasing depth or context length would produce far more expressive partitions than increasing head count alone.
  • The framework supplies a concrete test: count the actual linear regions in a small transformer and check whether the count tracks N^{d_model L}.
  • Training dynamics might be understood as movement among these polyhedral pieces rather than as optimization over a smooth loss landscape.
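The region-counting test in the third bullet can be prototyped for a single head, where the predicted ceiling is O(N): label a dense grid of queries by their winning key and count distinct labels. An illustrative sketch (the Gaussian keys and grid resolution are arbitrary choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 2                          # five keys in a 2-D query space
K = rng.normal(size=(N, d))

# Each distinct winning key over the grid marks one maximal linear region
# of the zero-temperature single-head map.
xs = np.linspace(-3.0, 3.0, 200)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, d)
labels = np.argmax(grid @ K.T, axis=1)
n_regions = len(np.unique(labels))

print(f"{n_regions} regions observed; the single-head bound is O(N) with N = {N}")
assert 1 <= n_regions <= N
```

Extending the same labeling idea through stacked layers would give the empirical counts to compare against the N^{d_model L} prediction.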

Load-bearing premise

That representing self-attention by a vector-valued tropical rational map accurately captures the geometric partitioning performed by real transformer implementations.

What would settle it

An explicit calculation or trained-model dissection showing that the decision boundaries produced by a practical attention layer deviate from those of any Power Voronoi diagram, or measured linear-region counts that fall outside the predicted Theta(N^{d_model L}) scaling.

Figures

Figures reproduced from arXiv: 2604.14727 by Ye Su, Yong Liu.

Figure 1
Figure 1: Combinatorial Explosion of Linear Regions in Transformers. Visualizing the exact spatial partitioning of a query space (d_model = 2) by a tropical transformer with sequence length N = 4. As the network depth L increases from 1 to 3, the number of maximal linear regions undergoes a staggering geometric expansion, strictly following the derived Θ(N^{d_model L}) bound. This topological metric demonstrates how dee… view at source ↗
Figure 2
Figure 2: Recursive spatial partitioning in a 2-layer MLP (d = 2). (Bottom) Layer 1 (n = 3 neurons) induces N_1 = 7 linear regions via Definition III.2. (Middle) Structural space-folding: the ReLU activation enables subsequent hyperplanes to intersect the pre-activated regions. (Top) Layer 2 (n = 3) achieves a multiplicative expansion, yielding a theoretical maximum of N_2 = 49 regions. The Obstruction in Transformers… view at source ↗
Figure 3
Figure 3: The introduction of the weights w_j shifts the decision boundaries H away from standard midplanes; the full-dimensional cells are defined by C_j = { q ∈ R^{d_k} : ∥q − k_j∥² − w_j ≤ ∥q − k_l∥² − w_l, ∀ l ≠ j }. view at source ↗
Figure 4
Figure 4: Visualizing the Geometric Rationale of Multi-Head Self-Attention via Minkowski Sums. To illustrate the combinatorial polytope gain from Theorem V.2, we construct 3D Newton polytopes for self-attention with sequence length N = 6. (Left) A single head yields a simple polytope with at most N vertices. (Middle & Right) Under the tropical view, multi-head aggregation corresponds to the Minkowski sum of individu… view at source ↗
Figure 5
Figure 5: Visualizing the Power Voronoi Limit of Self-Attention via Maslov Dequantization. The attention routing in a 2D query space (q1, q2) is shown for temperatures τ ∈ {1.0, 0.5, 0.1, 0.001}, with white stars marking keys and colors showing aggregated values. (Left) At τ = 1.0, softmax produces blended distributions, suppressing subordinate keys. (Right) As τ → 0, these features sharply emerge, and continuous curv… view at source ↗
Figure 6
Figure 6: Log–log plots under random Gaussian weight initialization. view at source ↗
Figure 7
Figure 7: Visualizing the Common Refinement of MHSA and FFN (P_MHSA ∧ P_FFN). To illustrate the geometric mechanism driving the complexity bounds in Theorem V.5, we present a simplified case with 2 tokens and 2 FFN neurons. (Plot A) The MHSA layer performs input-dependent routing, dynamically defining the contextual Voronoi regions (the where). (Plot B) The FFN provides a fixed structural shattering via ReLU activat… view at source ↗
read the original abstract

To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to $\mathcal{O}(N^H)$, overcoming the $\mathcal{O}(N)$ bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers ($\Theta(N^{d_{\text{model}}L})$), demonstrating a combinatorial explosion driven intrinsically by sequence length $N$, ambient embedding dimension $d_{\text{model}}$, and network depth $L$. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a tropical geometry framework to quantify transformer expressivity. It models self-attention as a vector-valued tropical rational map that evaluates exactly to a Power Voronoi diagram in the zero-temperature limit. Using Minkowski sums of Newton polytopes, it derives an O(N^H) expansion in polyhedral complexity for multi-head attention. Extending to stacked layers, it claims the first tight asymptotic bound of Theta(N^{d_model L}) linear regions, with finite-temperature soft attention preserving the partitions via exponentially tight differential bounds.

Significance. If the equivalences and bounds are rigorously established, the work supplies a combinatorial explanation for the scaling of expressivity with sequence length, embedding dimension, and depth, together with a stability guarantee that bridges the hardmax idealization to practical softmax. This could inform architecture choices and theoretical understanding of why transformers achieve high capacity. The explicit polytope-based accounting for multi-head attention is a potential strength.

major comments (3)
  1. [§3 (modeling and zero-temperature theorem)] The central equivalence (self-attention as vector-valued tropical rational map equals Power Voronoi diagram at zero temperature) is load-bearing for all subsequent bounds. It must be shown to survive the full attention block, including value projection and residual addition, which can insert additional affine maps inside each cell and potentially change the region count.
  2. [§5 (finite-temperature stability)] The finite-temperature claim that soft attention preserves the exact topological partitions (and thus the Theta(N^{d_model L}) count) relies on differential approximation bounds controlling boundary locations. These bounds must be shown to be uniform over the input domain and to prevent topology changes; otherwise the asymptotic tightness does not transfer to actual transformers.
  3. [§4 (multi-head analysis)] The multi-head complexity argument via Minkowski sum of Newton polytopes yields O(N^H) only under the chosen polytope representation. It is necessary to verify that this count remains additive after composition with value weighting, layer-norm, and the subsequent FFN, which may introduce new linear pieces not captured by the sum.
minor comments (2)
  1. [§2 (preliminaries)] Clarify the precise definition of the vector-valued tropical rational map (including how the tropical operations interact with the embedding dimension) with a low-dimensional worked example.
  2. [§4] The notation for the Newton polytope and its Minkowski sum should be made consistent across theorems to avoid ambiguity when composing layers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on the expressivity of transformers from a tropical geometry perspective. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [§3 (modeling and zero-temperature theorem)] The central equivalence (self-attention as vector-valued tropical rational map equals Power Voronoi diagram at zero temperature) is load-bearing for all subsequent bounds. It must be shown to survive the full attention block, including value projection and residual addition, which can insert additional affine maps inside each cell and potentially change the region count.

    Authors: We agree that the full attention block must be considered for the bounds to apply to practical transformers. Our analysis shows that the partitioning is determined solely by the attention weights, which form the Power Voronoi diagram. The value projection applies a linear transformation that is identical across all cells, and the residual connection adds the input embedding, which is a globally affine function. Consequently, neither operation introduces new boundaries or alters the number of linear regions. To address the referee's concern explicitly, we will revise the manuscript in §3 to include a formal statement and proof that the region count is preserved under these compositions. revision: yes

  2. Referee: [§5 (finite-temperature stability)] The finite-temperature claim that soft attention preserves the exact topological partitions (and thus the Theta(N^{d_model L}) count) relies on differential approximation bounds controlling boundary locations. These bounds must be shown to be uniform over the input domain and to prevent topology changes; otherwise the asymptotic tightness does not transfer to actual transformers.

    Authors: The differential bounds in §5 are obtained from the exponential convergence of softmax to hardmax as temperature approaches zero, and they control the displacement of each boundary hyperplane. Since the input space is typically normalized (e.g., embeddings lie in a compact set), the constants in the bounds are independent of the specific input point, ensuring uniformity. This prevents any topology change for sufficiently low temperatures. We will update §5 to state the uniformity explicitly and add a corollary confirming that the topological partitions, and hence the asymptotic count, are preserved. revision: yes

  3. Referee: [§4 (multi-head analysis)] The multi-head complexity argument via Minkowski sum of Newton polytopes yields O(N^H) only under the chosen polytope representation. It is necessary to verify that this count remains additive after composition with value weighting, layer-norm, and the subsequent FFN, which may introduce new linear pieces not captured by the sum.

    Authors: The Minkowski sum provides a lower bound on the complexity of the multi-head attention output by accounting for the independent contributions of each head. Value weighting is a convex combination within each head's contribution and does not decrease the number of facets. Layer normalization can be incorporated into the tropical framework as it corresponds to a shift that preserves the polyhedral arrangement up to translation. The FFN, while adding its own piecewise linear behavior, is composed after the attention module; our bound focuses on the attention-driven explosion in regions, which provides the dominant term in the Theta(N^{d_model L}) expression. We will revise §4 to include a detailed composition analysis showing that the complexity remains at least additive after these steps, supporting the overall tight bound. revision: yes
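The composition point running through responses 1 and 3 (a fixed value projection, output projection, and residual connection act affinely within each attention cell and so add no boundaries) can be illustrated in a toy setting. A hedged sketch in which W_O is an arbitrary stand-in for the output projection, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 6, 4
K = rng.normal(size=(N, d))          # keys: these alone determine the partition
V = rng.normal(size=(N, d))          # values
W_O = rng.normal(size=(d, d))        # stand-in output projection

def cell(q):
    """Index of the power-Voronoi cell (hardmax winner) containing q."""
    return int(np.argmax(K @ q))

def block(q):
    # hard attention -> fixed linear projection -> residual connection
    return q + W_O @ V[cell(q)]

# For two queries in the same cell, the block outputs differ exactly by
# q1 - q2: inside each cell the block is the identity plus a cell-constant
# shift, so projection and residual create no new boundaries.
qs = rng.normal(size=(400, d))
for q1, q2 in zip(qs[::2], qs[1::2]):
    if cell(q1) == cell(q2):
        assert np.allclose(block(q1) - block(q2), q1 - q2)
print("projection + residual preserved the attention-induced partition")
```

The region count before and after the composition is the same because `cell` is the only input-dependent branch in `block`.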

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained within introduced framework

full rationale

The paper introduces a tropical geometry modeling of self-attention as vector-valued tropical rational maps, then derives the zero-temperature equivalence to Power Voronoi diagrams and the asymptotic linear-region bounds via Minkowski sums and polytope composition. These are explicit proofs and combinatorial arguments within the chosen representation rather than reductions by definition, fitted parameters renamed as predictions, or load-bearing self-citations. The framework choice enables the analysis but does not make the subsequent region-count results tautological; they follow from the stated polyhedral operations and approximation bounds. No quoted equations exhibit input-output equivalence by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on introducing a tropical-rational-map model for attention and on standard properties of tropical semirings and Newton polytopes; no numerical free parameters are fitted, but the modeling choice itself functions as an invented lens.

axioms (1)
  • standard math Tropical semiring operations and Newton polytope Minkowski sums preserve the relevant geometric partitions
    Invoked to obtain the O(N^H) and Theta(N^{d_model L}) counts.
invented entities (1)
  • Vector-valued tropical rational map representation of self-attention no independent evidence
    purpose: To equate attention output with Power Voronoi diagrams
    This is a modeling postulate introduced to enable the geometric analysis; no independent empirical or formal verification outside the framework is supplied.

pith-pipeline@v0.9.0 · 5466 in / 1402 out tokens · 29635 ms · 2026-05-10T12:22:46.655032+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  2. [2]

    A comparative survey of instance selection methods applied to non-neural and transformer-based text classification,

    W. Cunha, F. Viegas, C. França, T. Rosa, L. Rocha, and M. A. Gonçalves, “A comparative survey of instance selection methods applied to non-neural and transformer-based text classification,” ACM Computing Surveys, vol. 55, no. 13s, pp. 1–52, 2023

  3. [3]

    Transformers and large language models for efficient intrusion detection systems: A comprehensive survey,

    H. Kheddar, “Transformers and large language models for efficient intrusion detection systems: A comprehensive survey,” Information Fusion, vol. 124, p. 103347, 2025

  4. [4]

    Explainability of vision transformers: A comprehensive review and new perspectives,

    R. Kashefi, L. Barekatain, M. Sabokrou, and F. Aghaeipoor, “Explainability of vision transformers: A comprehensive review and new perspectives,”Multimedia Tools and Applications, vol. 85, no. 2, p. 115, 2026

  5. [5]

    Are transformers universal approximators of sequence-to-sequence functions?

    C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar, “Are transformers universal approximators of sequence-to-sequence functions?” in International Conference on Learning Representations, 2020

  6. [6]

    Attention is turing-complete,

    J. Pérez, P. Barceló, and J. Marinkovic, “Attention is turing-complete,” Journal of Machine Learning Research, vol. 22, no. 75, pp. 1–35, 2021

  7. [7]

    Statistically meaningful approximation: a case study on approximating turing machines with transformers,

    C. Wei, Y. Chen, and T. Ma, “Statistically meaningful approximation: a case study on approximating turing machines with transformers,” Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 12071–12083

  8. [8]

    Theoretical limitations of self-attention in neural sequence models,

    M. Hahn, “Theoretical limitations of self-attention in neural sequence models,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 156–171, 2020

  9. [9]

    Saturated transformers are constant-depth threshold circuits,

    W. Merrill, A. Sabharwal, and N. A. Smith, “Saturated transformers are constant-depth threshold circuits,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 843–856, 2022

  10. [10]

    On the number of linear regions of deep neural networks,

    G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” Advances in Neural Information Processing Systems, vol. 27, 2014

  11. [11]

    Bounding and counting linear regions of deep neural networks,

    T. Serra, C. Tjandraatmadja, and S. Ramalingam, “Bounding and counting linear regions of deep neural networks,” in International conference on machine learning. PMLR, 2018, pp. 4558–4566

  12. [12]

    Complexity of linear regions in deep networks,

    B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 2596–2604

  13. [13]

    On the number of linear regions of convolutional neural networks with piecewise linear activations,

    H. Xiong, L. Huang, W. J. Zang, X. Zhen, G.-S. Xie, B. Gu, and L. Song, “On the number of linear regions of convolutional neural networks with piecewise linear activations,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 5131–5148, 2024

  14. [14]

    Tropical geometry of deep neural networks,

    L. Zhang, G. Naitzat, and L.-H. Lim, “Tropical geometry of deep neural networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 5824–5832

  15. [15]

    Tropical geometry and machine learning,

    P. Maragos, V. Charisopoulos, and E. Theodosis, “Tropical geometry and machine learning,” Proceedings of the IEEE, vol. 109, no. 5, pp. 728–755, 2021

  16. [16]

    Tropical attention: Neural algorithmic reasoning for combinatorial algorithms,

    B. Hashemi, K. Pasque, C. Teska, and R. Yoshida, “Tropical attention: Neural algorithmic reasoning for combinatorial algorithms,” 2025

  17. [17]

    The geometry of thought: Disclosing the transformer as a tropical polynomial circuit,

    F. Alpay and B. Senturk, “The geometry of thought: Disclosing the transformer as a tropical polynomial circuit,”arXiv preprint arXiv:2601.09775, 2026

  18. [18]

    Universal approximation bounds for superpositions of a sigmoidal function,

    A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993

  19. [19]

    Transformers are expressive, but are they expressive enough for regression?

    S. Nath, H. Khadilkar, and P. Bhattacharyya, “Transformers are expressive, but are they expressive enough for regression?” arXiv preprint arXiv:2402.15478, 2024

  20. [20]

    The expressive power of transformers with chain of thought, 2024

    W. Merrill and A. Sabharwal, “The expressive power of transformers with chain of thought,”arXiv preprint arXiv:2310.07923, 2023

  21. [21]

    Exact expressive power of transformers with padding,

    ——, “Exact expressive power of transformers with padding,”arXiv preprint arXiv:2505.18948, 2025

  22. [22]

    The power of hard attention transformers on data sequences: A formal language theoretic perspective,

    P. Bergsträßer, C. Köcher, A. W. Lin, and G. Zetzsche, “The power of hard attention transformers on data sequences: A formal language theoretic perspective,” Advances in Neural Information Processing Systems, vol. 37, pp. 96750–96774, 2024

  23. [23]

    Expressive power of graph transformers via logic,

    V. Ahvonen, M. Funk, D. Heiman, A. Kuusisto, and C. Lutz, “Expressive power of graph transformers via logic,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 24, 2026, pp. 19569–19579

  24. [24]

    Inductive biases and variable creation in self-attention mechanisms,

    B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self-attention mechanisms,” in International Conference on Machine Learning. PMLR, 2022, pp. 5793–5831

  25. [25]

    Understanding the expressive power and mechanisms of transformer for sequence modeling,

    M. Wang et al., “Understanding the expressive power and mechanisms of transformer for sequence modeling,” Advances in Neural Information Processing Systems, vol. 37, pp. 25781–25856, 2024

  26. [26]

    Theoretical constraints on the expressive power of rope-based tensor attention transformers,

    X. Li, Y. Liang, Z. Shi, Z. Song, and J. Zhang, “Theoretical constraints on the expressive power of rope-based tensor attention transformers,” 2024

  27. [27]

    On the expressive power of transformers for maxout networks and continuous piecewise linear functions,

    L. Gu, L. Yang, and F. Zhou, “On the expressive power of transformers for maxout networks and continuous piecewise linear functions,” arXiv preprint arXiv:2603.03084, 2026

  28. [28]

    Analysis on the number of linear regions of piecewise linear neural networks,

    Q. Hu, H. Zhang, F. Gao, C. Xing, and J. An, “Analysis on the number of linear regions of piecewise linear neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 2, pp. 644–653, 2020

  29. [29]

    On the number of linear regions of convolutional neural networks,

    H. Xiong, L. Huang, M. Yu, L. Liu, F. Zhu, and L. Shao, “On the number of linear regions of convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2020, pp. 10514–10523

  30. [30]

    Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

    Y. Su, H. Tang, Z. Gong, and Y. Liu, “Sparsity is combinatorial depth: Quantifying MoE expressivity via tropical geometry,” arXiv preprint arXiv:2602.03204, 2026

  31. [31]

    On the decision boundaries of neural networks: A tropical geometry perspective,

    M. Alfarra, A. Bibi, H. Hammoud, M. Gaafar, and B. Ghanem, “On the decision boundaries of neural networks: A tropical geometry perspective,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 5027–5037, 2022

  32. [32]

    Morphological perceptrons: geometry and training algorithms,

    V. Charisopoulos and P. Maragos, “Morphological perceptrons: geometry and training algorithms,” in International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing. Springer, 2017, pp. 3–15

  33. [33]

    A tropical approach to neural networks with piecewise linear activations, 2018

    ——, “A tropical approach to neural networks with piecewise linear activations,”arXiv preprint arXiv:1805.08749, 2018

  34. [34]

    What do graph neural networks learn? insights from tropical geometry,

    T. A. Pham and V. Garg, “What do graph neural networks learn? insights from tropical geometry,” Advances in Neural Information Processing Systems, vol. 37, pp. 10988–11020, 2024

  35. [35]

    Tropnnc: Structured neural network compression using tropical geometry,

    K. Fotopoulos, P. Maragos, and P. Misiakos, “Tropnnc: Structured neural network compression using tropical geometry,” arXiv preprint arXiv:2409.03945, 2024

  36. [36]

    Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes

    T. Zaslavsky, Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes. American Mathematical Soc., 1975, vol. 154

  37. [37]

    Tropical geometry

    G. Mikhalkin and J. Rau, Tropical geometry. MPI for Mathematics, 2009, vol. 8

  38. [38]

    Introduction to tropical geometry

    D. Maclagan and B. Sturmfels, Introduction to tropical geometry. American Mathematical Soc., 2015, vol. 161

  39. [39]

    On basic concepts of tropical geometry,

    O. Y. Viro, “On basic concepts of tropical geometry,” Proceedings of the Steklov Institute of Mathematics, vol. 273, no. 1, pp. 252–282, 2011

  40. [40]

    D. E. Speyer, Tropical geometry. University of California, Berkeley, 2005

  41. [41]

    G. M. Ziegler, Lectures on polytopes. Springer Science & Business Media, 2012, vol. 152

  42. [42]

    Introduction to toric varieties

    W. Fulton, Introduction to toric varieties. Princeton University Press, 1993, no. 131

  43. [43]

    The maslov dequantization, idempotent and tropical mathematics: a brief introduction,

    G. L. Litvinov, “The maslov dequantization, idempotent and tropical mathematics: a brief introduction,”arXiv preprint math/0507014, 2005

  44. [44]

    Minkowski addition of polytopes: computational complexity and applications to Gröbner bases,

    P. Gritzmann and B. Sturmfels, “Minkowski addition of polytopes: computational complexity and applications to Gröbner bases,” SIAM Journal on Discrete Mathematics, vol. 6, no. 2, pp. 246–269, 1993

  45. [45]

    Dynamic voronoi diagrams,

    I. Gowda, D. Kirkpatrick, D. Lee, and A. Naamad, “Dynamic voronoi diagrams,”IEEE Transactions on Information Theory, vol. 29, no. 5, pp. 724–731, 1983

  46. [46]

    Voronoi cells, probabilistic bounds, and hypothesis testing in mixed integer linear models,

    P. Xu, “Voronoi cells, probabilistic bounds, and hypothesis testing in mixed integer linear models,” IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 3122–3138, 2006

  47. [47]

    Power diagrams: properties, algorithms and applications,

    F. Aurenhammer, “Power diagrams: properties, algorithms and applications,”SIAM journal on computing, vol. 16, no. 1, pp. 78–96, 1987

  48. [48]

    Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,

    T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, 1965

    Proof-Specific Constants and Auxiliaries (Appx B - G) PMHSA,P FFN Space partitions induced by MHSA and FFN respectively - MMHSA,M FFN Region counts in MHSA and FFN sub-layersZ + Cd Complexity constant for Minkowski sums inddimensionsR >0 Sl,R l Induced total partition and set of regions afterllayers - Nl Cumulative number of linear regions afterllayersZ +...