pith. machine review for the scientific record.

arxiv: 2604.14727 · v1 · submitted 2026-04-16 · 💻 cs.LG

Recognition: unknown

Expressivity of Transformers: A Tropical Geometry Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords tropical geometry · transformer expressivity · self-attention · power Voronoi diagram · linear regions · multi-head attention · polyhedral complexity · zero-temperature limit

The pith

Self-attention in transformers computes a Power Voronoi diagram exactly in the zero-temperature limit, yielding the first tight bound of Theta(N^{d_model L}) on the number of linear regions in deep models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a tropical geometry lens to measure exactly how transformers carve input space into linear pieces. It models each self-attention layer as a vector-valued tropical rational map and proves that this map computes a Power Voronoi diagram in the zero-temperature limit. From that equivalence the authors show why aggregating H heads raises the region count from O(N) for a single head to O(N^H), and why each additional layer multiplies the exponent in the count by d_model. A reader should care because the argument turns the vague claim that transformers are expressive into a concrete counting formula driven by sequence length, embedding size, and depth. The same geometric skeleton is shown to survive the soft attention used in actual networks through tight approximation bounds.

Core claim

By modeling self-attention as a vector-valued tropical rational map, the paper proves that the operation evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Multi-head aggregation then expands the polyhedral complexity to O(N^H) through the Minkowski sum of Newton polytopes. Extending the construction layer by layer produces the first tight asymptotic bound on the number of linear regions realized by a transformer of depth L: Theta(N^{d_model L}). The idealized partitions remain topologically stable under finite-temperature soft attention because the difference is controlled by exponentially tight differential bounds.
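The zero-temperature equivalence is easy to sanity-check numerically: since ∥q − k_j∥² − w_j = ∥q∥² − 2⟨q, k_j⟩ when the weights are w_j = ∥k_j∥², the hardmax key choice coincides with the power-Voronoi cell assignment. A minimal sketch (illustrative only, not the paper's code; the random keys and the number of sampled queries are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_k = 8, 4                        # sequence length, key dimension
K = rng.normal(size=(N, d_k))        # keys k_1, ..., k_N

def hardmax_choice(q):
    """Zero-temperature attention: index of the key with the largest logit <q, k_j>."""
    return int(np.argmax(K @ q))

def power_cell(q, w):
    """Power-Voronoi assignment: argmin_j ||q - k_j||^2 - w_j."""
    return int(np.argmin(np.sum((q - K) ** 2, axis=1) - w))

w = np.sum(K ** 2, axis=1)           # weights w_j = ||k_j||^2
queries = rng.normal(size=(1000, d_k))
assert all(hardmax_choice(q) == power_cell(q, w) for q in queries)
print("hardmax routing matches the power-Voronoi assignment on all sampled queries")
```

The check passes for any keys because the two objectives differ only by the query-dependent constant ∥q∥², which does not affect the arg-optimum over j.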

What carries the argument

The vector-valued tropical rational map that represents self-attention and its exact reduction to a Power Voronoi Diagram.

If this is right

  • Multi-head self-attention raises the maximum number of polyhedral pieces from O(N) for one head to O(N^H) through Minkowski summation of Newton polytopes.
  • Each additional transformer layer multiplies the exponent in the linear-region count by the embedding dimension d_model, producing combinatorial growth with depth.
  • The zero-temperature Power Voronoi skeleton is preserved by the soft attention used in deployed models, with the approximation error bounded exponentially tightly.
  • Expressivity is driven intrinsically by sequence length N, model dimension d_model, and depth L rather than by the smoothing introduced by finite temperature.
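The last two bullets can be probed in a toy setting: as temperature drops, soft attention output converges to the hardmax (power-Voronoi) output. A hedged sketch with arbitrary random keys and values, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 3
K = rng.normal(size=(N, d))          # keys
V = rng.normal(size=(N, d))          # values
q = rng.normal(size=d)               # a single query

logits = K @ q
hard_out = V[np.argmax(logits)]      # zero-temperature output: the winning value vector

errs = []
for tau in (1.0, 0.1, 0.01):
    p = np.exp((logits - logits.max()) / tau)   # stable softmax at temperature tau
    soft_out = (p / p.sum()) @ V                # finite-temperature soft attention
    errs.append(float(np.linalg.norm(soft_out - hard_out)))

print(errs)                          # deviation shrinks as tau -> 0
assert errs[-1] < errs[0]
```

The deviation is governed by the softmax mass on the non-winning keys, which decays like exp(-gap/tau), consistent with the exponentially tight bounds claimed in the paper.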

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tropical counting technique could be applied to other attention variants such as linear attention or sparse attention to obtain analogous region bounds.
  • If the scaling holds in practice, simply increasing depth or context length would produce far more expressive partitions than increasing head count alone.
  • The framework supplies a concrete test: count the actual linear regions in a small transformer and check whether the count tracks N^{d_model L}.
  • Training dynamics might be understood as movement among these polyhedral pieces rather than as optimization over a smooth loss landscape.
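The region-counting test in the third bullet can be prototyped for a single head, where the predicted ceiling is O(N): label a dense grid of queries by their winning key and count distinct labels. An illustrative sketch (the Gaussian keys and grid resolution are arbitrary choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 2                          # five keys in a 2-D query space
K = rng.normal(size=(N, d))

# Each distinct winning key over the grid marks one maximal linear region
# of the zero-temperature single-head map.
xs = np.linspace(-3.0, 3.0, 200)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, d)
labels = np.argmax(grid @ K.T, axis=1)
n_regions = len(np.unique(labels))

print(f"{n_regions} regions observed; the single-head bound is O(N) with N = {N}")
assert 1 <= n_regions <= N
```

Extending the same labeling idea through stacked layers would give the empirical counts to compare against the N^{d_model L} prediction.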

Load-bearing premise

That representing self-attention by a vector-valued tropical rational map accurately captures the geometric partitioning performed by real transformer implementations.

What would settle it

An explicit calculation or trained-model dissection showing that the decision boundaries produced by a practical attention layer deviate from those of any Power Voronoi diagram, or measured linear-region counts that fall outside the predicted Theta(N^{d_model L}) scaling.

Figures

Figures reproduced from arXiv: 2604.14727 by Ye Su, Yong Liu.

Figure 1
Figure 1: Combinatorial Explosion of Linear Regions in Transformers. Visualizing the exact spatial partitioning of a query space (d_model = 2) by a tropical transformer with sequence length N = 4. As the network depth L increases from 1 to 3, the number of maximal linear regions undergoes a staggering geometric expansion, strictly following the derived Θ(N^{d_model L}) bound. This topological metric demonstrates how dee… view at source ↗
Figure 2
Figure 2: Recursive spatial partitioning in a 2-layer MLP (d = 2). (Bottom) Layer 1 (n = 3 neurons) induces N_1 = 7 linear regions via Definition III.2. (Middle) Structural space-folding: the ReLU activation enables subsequent hyperplanes to intersect the pre-activated regions. (Top) Layer 2 (n = 3) achieves a multiplicative expansion, yielding a theoretical maximum of N_2 = 49 regions. The Obstruction in Transformers… view at source ↗
Figure 3
Figure 3: The introduction of the weights w_j shifts the decision boundaries H away from standard midplanes; the full-dimensional cells are defined by C_j = { q ∈ R^{d_k} : ∥q − k_j∥² − w_j ≤ ∥q − k_l∥² − w_l, ∀ l ≠ j }. view at source ↗
Figure 4
Figure 4: Visualizing the Geometric Rationale of Multi-Head Self-Attention via Minkowski Sums. To illustrate the combinatorial polytope gain from Theorem V.2, we construct 3D Newton polytopes for self-attention with sequence length N = 6. (Left) A single head yields a simple polytope with at most N vertices. (Middle & Right) Under the tropical view, multi-head aggregation corresponds to the Minkowski sum of individu… view at source ↗
Figure 5
Figure 5: Visualizing the Power Voronoi Limit of Self-Attention via Maslov Dequantization. The attention routing in a 2D query space (q1, q2) is shown for temperatures τ ∈ {1.0, 0.5, 0.1, 0.001}, with white stars marking keys and colors showing aggregated values. (Left) At τ = 1.0, softmax produces blended distributions, suppressing subordinate keys. (Right) As τ → 0, these features sharply emerge, and continuous curv… view at source ↗
Figure 6
Figure 6: Log–log plots under random Gaussian weight initialization. view at source ↗
Figure 7
Figure 7: Visualizing the Common Refinement of MHSA and FFN (P_MHSA ∧ P_FFN). To illustrate the geometric mechanism driving the complexity bounds in Theorem V.5, we present a simplified case with 2 tokens and 2 FFN neurons. (Plot A) The MHSA layer performs input-dependent routing, dynamically defining the contextual Voronoi regions (the where). (Plot B) The FFN provides a fixed structural shattering via ReLU activat… view at source ↗
read the original abstract

To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to $\mathcal{O}(N^H)$, overcoming the $\mathcal{O}(N)$ bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers ($\Theta(N^{d_{\text{model}}L})$), demonstrating a combinatorial explosion driven intrinsically by sequence length $N$, ambient embedding dimension $d_{\text{model}}$, and network depth $L$. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a tropical geometry framework to quantify transformer expressivity. It models self-attention as a vector-valued tropical rational map that evaluates exactly to a Power Voronoi diagram in the zero-temperature limit. Using Minkowski sums of Newton polytopes, it derives an O(N^H) expansion in polyhedral complexity for multi-head attention. Extending to stacked layers, it claims the first tight asymptotic bound of Theta(N^{d_model L}) linear regions, with finite-temperature soft attention preserving the partitions via exponentially tight differential bounds.

Significance. If the equivalences and bounds are rigorously established, the work supplies a combinatorial explanation for the scaling of expressivity with sequence length, embedding dimension, and depth, together with a stability guarantee that bridges the hardmax idealization to practical softmax. This could inform architecture choices and theoretical understanding of why transformers achieve high capacity. The explicit polytope-based accounting for multi-head attention is a potential strength.

major comments (3)
  1. [§3 (modeling and zero-temperature theorem)] The central equivalence (self-attention as vector-valued tropical rational map equals Power Voronoi diagram at zero temperature) is load-bearing for all subsequent bounds. It must be shown to survive the full attention block, including value projection and residual addition, which can insert additional affine maps inside each cell and potentially change the region count.
  2. [§5 (finite-temperature stability)] The finite-temperature claim that soft attention preserves the exact topological partitions (and thus the Theta(N^{d_model L}) count) relies on differential approximation bounds controlling boundary locations. These bounds must be shown to be uniform over the input domain and to prevent topology changes; otherwise the asymptotic tightness does not transfer to actual transformers.
  3. [§4 (multi-head analysis)] The multi-head complexity argument via Minkowski sum of Newton polytopes yields O(N^H) only under the chosen polytope representation. It is necessary to verify that this count remains additive after composition with value weighting, layer-norm, and the subsequent FFN, which may introduce new linear pieces not captured by the sum.
minor comments (2)
  1. [§2 (preliminaries)] Clarify the precise definition of the vector-valued tropical rational map (including how the tropical operations interact with the embedding dimension) with a low-dimensional worked example.
  2. [§4] The notation for the Newton polytope and its Minkowski sum should be made consistent across theorems to avoid ambiguity when composing layers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on the expressivity of transformers from a tropical geometry perspective. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [§3 (modeling and zero-temperature theorem)] The central equivalence (self-attention as vector-valued tropical rational map equals Power Voronoi diagram at zero temperature) is load-bearing for all subsequent bounds. It must be shown to survive the full attention block, including value projection and residual addition, which can insert additional affine maps inside each cell and potentially change the region count.

    Authors: We agree that the full attention block must be considered for the bounds to apply to practical transformers. Our analysis shows that the partitioning is determined solely by the attention weights, which form the Power Voronoi diagram. The value projection applies a linear transformation that is identical across all cells, and the residual connection adds the input embedding, which is a globally affine function. Consequently, neither operation introduces new boundaries or alters the number of linear regions. To address the referee's concern explicitly, we will revise the manuscript in §3 to include a formal statement and proof that the region count is preserved under these compositions. revision: yes

  2. Referee: [§5 (finite-temperature stability)] The finite-temperature claim that soft attention preserves the exact topological partitions (and thus the Theta(N^{d_model L}) count) relies on differential approximation bounds controlling boundary locations. These bounds must be shown to be uniform over the input domain and to prevent topology changes; otherwise the asymptotic tightness does not transfer to actual transformers.

    Authors: The differential bounds in §5 are obtained from the exponential convergence of softmax to hardmax as temperature approaches zero, and they control the displacement of each boundary hyperplane. Since the input space is typically normalized (e.g., embeddings lie in a compact set), the constants in the bounds are independent of the specific input point, ensuring uniformity. This prevents any topology change for sufficiently low temperatures. We will update §5 to state the uniformity explicitly and add a corollary confirming that the topological partitions, and hence the asymptotic count, are preserved. revision: yes

  3. Referee: [§4 (multi-head analysis)] The multi-head complexity argument via Minkowski sum of Newton polytopes yields O(N^H) only under the chosen polytope representation. It is necessary to verify that this count remains additive after composition with value weighting, layer-norm, and the subsequent FFN, which may introduce new linear pieces not captured by the sum.

    Authors: The Minkowski sum provides a lower bound on the complexity of the multi-head attention output by accounting for the independent contributions of each head. Value weighting is a convex combination within each head's contribution and does not decrease the number of facets. Layer normalization can be incorporated into the tropical framework as it corresponds to a shift that preserves the polyhedral arrangement up to translation. The FFN, while adding its own piecewise linear behavior, is composed after the attention module; our bound focuses on the attention-driven explosion in regions, which provides the dominant term in the Theta(N^{d_model L}) expression. We will revise §4 to include a detailed composition analysis showing that the complexity remains at least additive after these steps, supporting the overall tight bound. revision: yes
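The composition point running through responses 1 and 3 (a fixed value projection, output projection, and residual connection act affinely within each attention cell and so add no boundaries) can be illustrated in a toy setting. A hedged sketch in which W_O is an arbitrary stand-in for the output projection, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 6, 4
K = rng.normal(size=(N, d))          # keys: these alone determine the partition
V = rng.normal(size=(N, d))          # values
W_O = rng.normal(size=(d, d))        # stand-in output projection

def cell(q):
    """Index of the power-Voronoi cell (hardmax winner) containing q."""
    return int(np.argmax(K @ q))

def block(q):
    # hard attention -> fixed linear projection -> residual connection
    return q + W_O @ V[cell(q)]

# For two queries in the same cell, the block outputs differ exactly by
# q1 - q2: inside each cell the block is the identity plus a cell-constant
# shift, so projection and residual create no new boundaries.
qs = rng.normal(size=(400, d))
for q1, q2 in zip(qs[::2], qs[1::2]):
    if cell(q1) == cell(q2):
        assert np.allclose(block(q1) - block(q2), q1 - q2)
print("projection + residual preserved the attention-induced partition")
```

The region count before and after the composition is the same because `cell` is the only input-dependent branch in `block`.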

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained within introduced framework

full rationale

The paper introduces a tropical geometry modeling of self-attention as vector-valued tropical rational maps, then derives the zero-temperature equivalence to Power Voronoi diagrams and the asymptotic linear-region bounds via Minkowski sums and polytope composition. These are explicit proofs and combinatorial arguments within the chosen representation rather than reductions by definition, fitted parameters renamed as predictions, or load-bearing self-citations. The framework choice enables the analysis but does not make the subsequent region-count results tautological; they follow from the stated polyhedral operations and approximation bounds. No quoted equations exhibit input-output equivalence by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on introducing a tropical-rational-map model for attention and on standard properties of tropical semirings and Newton polytopes; no numerical free parameters are fitted, but the modeling choice itself functions as an invented lens.

axioms (1)
  • standard math Tropical semiring operations and Newton polytope Minkowski sums preserve the relevant geometric partitions
    Invoked to obtain the O(N^H) and Theta(N^{d_model L}) counts.
invented entities (1)
  • Vector-valued tropical rational map representation of self-attention no independent evidence
    purpose: To equate attention output with Power Voronoi diagrams
    This is a modeling postulate introduced to enable the geometric analysis; no independent empirical or formal verification outside the framework is supplied.

pith-pipeline@v0.9.0 · 5466 in / 1402 out tokens · 29635 ms · 2026-05-10T12:22:46.655032+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  2. [2]

    A comparative survey of instance selection methods applied to non-neural and transformer-based text classification,

    W. Cunha, F. Viegas, C. França, T. Rosa, L. Rocha, and M. A. Gonçalves, “A comparative survey of instance selection methods applied to non-neural and transformer-based text classification,” ACM Computing Surveys, vol. 55, no. 13s, pp. 1–52, 2023

  3. [3]

    Transformers and large language models for efficient intrusion detection systems: A comprehensive survey,

    H. Kheddar, “Transformers and large language models for efficient intrusion detection systems: A comprehensive survey,” Information Fusion, vol. 124, p. 103347, 2025

  4. [4]

    Explainability of vision transformers: A comprehensive review and new perspectives,

    R. Kashefi, L. Barekatain, M. Sabokrou, and F. Aghaeipoor, “Explainability of vision transformers: A comprehensive review and new perspectives,”Multimedia Tools and Applications, vol. 85, no. 2, p. 115, 2026

  5. [5]

    Are transformers universal approximators of sequence-to-sequence functions?

    C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar, “Are transformers universal approximators of sequence-to-sequence functions?” in International Conference on Learning Representations, 2020

  6. [6]

    Attention is turing-complete,

    J. Pérez, P. Barceló, and J. Marinkovic, “Attention is turing-complete,” Journal of Machine Learning Research, vol. 22, no. 75, pp. 1–35, 2021

  7. [7]

    Statistically meaningful approximation: a case study on approximating turing machines with transformers,

    C. Wei, Y. Chen, and T. Ma, “Statistically meaningful approximation: a case study on approximating turing machines with transformers,” Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 12071–12083

  8. [8]

    Theoretical limitations of self-attention in neural sequence models,

    M. Hahn, “Theoretical limitations of self-attention in neural sequence models,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 156–171, 2020

  9. [9]

    Saturated transformers are constant-depth threshold circuits,

    W. Merrill, A. Sabharwal, and N. A. Smith, “Saturated transformers are constant-depth threshold circuits,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 843–856, 2022

  10. [10]

    On the number of linear regions of deep neural networks,

    G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” Advances in Neural Information Processing Systems, vol. 27, 2014

  11. [11]

    Bounding and counting linear regions of deep neural networks,

    T. Serra, C. Tjandraatmadja, and S. Ramalingam, “Bounding and counting linear regions of deep neural networks,” in International conference on machine learning. PMLR, 2018, pp. 4558–4566

  12. [12]

    Complexity of linear regions in deep networks,

    B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 2596–2604

  13. [13]

    On the number of linear regions of convolutional neural networks with piecewise linear activations,

    H. Xiong, L. Huang, W. J. Zang, X. Zhen, G.-S. Xie, B. Gu, and L. Song, “On the number of linear regions of convolutional neural networks with piecewise linear activations,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 5131–5148, 2024

  14. [14]

    Tropical geometry of deep neural networks,

    L. Zhang, G. Naitzat, and L.-H. Lim, “Tropical geometry of deep neural networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 5824–5832

  15. [15]

    Tropical geometry and machine learning,

    P. Maragos, V. Charisopoulos, and E. Theodosis, “Tropical geometry and machine learning,” Proceedings of the IEEE, vol. 109, no. 5, pp. 728–755, 2021

  16. [16]

    Tropical attention: Neural algorithmic reasoning for combinatorial algorithms,

    B. Hashemi, K. Pasque, C. Teska, and R. Yoshida, “Tropical attention: Neural algorithmic reasoning for combinatorial algorithms,” 2025

  17. [17]

    The geometry of thought: Disclosing the transformer as a tropical polynomial circuit,

    F. Alpay and B. Senturk, “The geometry of thought: Disclosing the transformer as a tropical polynomial circuit,”arXiv preprint arXiv:2601.09775, 2026

  18. [18]

    Universal approximation bounds for superpositions of a sigmoidal function,

    A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993

  19. [19]

    Transformers are expressive, but are they expressive enough for regression?

    S. Nath, H. Khadilkar, and P. Bhattacharyya, “Transformers are expressive, but are they expressive enough for regression?” arXiv preprint arXiv:2402.15478, 2024

  20. [20]

    The expressive power of transformers with chain of thought, 2024

    W. Merrill and A. Sabharwal, “The expressive power of transformers with chain of thought,”arXiv preprint arXiv:2310.07923, 2023

  21. [21]

    Exact expressive power of transformers with padding,

    ——, “Exact expressive power of transformers with padding,”arXiv preprint arXiv:2505.18948, 2025

  22. [22]

    The power of hard attention transformers on data sequences: A formal language theoretic perspective,

    P. Bergsträßer, C. Köcher, A. W. Lin, and G. Zetzsche, “The power of hard attention transformers on data sequences: A formal language theoretic perspective,” Advances in Neural Information Processing Systems, vol. 37, pp. 96750–96774, 2024

  23. [23]

    Expressive power of graph transformers via logic,

    V. Ahvonen, M. Funk, D. Heiman, A. Kuusisto, and C. Lutz, “Expressive power of graph transformers via logic,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 24, 2026, pp. 19569–19579

  24. [24]

    Inductive biases and variable creation in self-attention mechanisms,

    B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self-attention mechanisms,” in International Conference on Machine Learning. PMLR, 2022, pp. 5793–5831

  25. [25]

    Understanding the expressive power and mechanisms of transformer for sequence modeling,

    M. Wang et al., “Understanding the expressive power and mechanisms of transformer for sequence modeling,” Advances in Neural Information Processing Systems, vol. 37, pp. 25781–25856, 2024

  26. [26]

    Theoretical constraints on the expressive power of rope-based tensor attention transformers,

    X. Li, Y. Liang, Z. Shi, Z. Song, and J. Zhang, “Theoretical constraints on the expressive power of rope-based tensor attention transformers,” 2024

  27. [27]

    On the expressive power of transformers for maxout networks and continuous piecewise linear functions,

    L. Gu, L. Yang, and F. Zhou, “On the expressive power of transformers for maxout networks and continuous piecewise linear functions,” arXiv preprint arXiv:2603.03084, 2026

  28. [28]

    Analysis on the number of linear regions of piecewise linear neural networks,

    Q. Hu, H. Zhang, F. Gao, C. Xing, and J. An, “Analysis on the number of linear regions of piecewise linear neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 2, pp. 644–653, 2020

  29. [29]

    On the number of linear regions of convolutional neural networks,

    H. Xiong, L. Huang, M. Yu, L. Liu, F. Zhu, and L. Shao, “On the number of linear regions of convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2020, pp. 10514–10523

  30. [30]

    Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

    Y. Su, H. Tang, Z. Gong, and Y. Liu, “Sparsity is combinatorial depth: Quantifying MoE expressivity via tropical geometry,” arXiv preprint arXiv:2602.03204, 2026

  31. [31]

    On the decision boundaries of neural networks: A tropical geometry perspective,

    M. Alfarra, A. Bibi, H. Hammoud, M. Gaafar, and B. Ghanem, “On the decision boundaries of neural networks: A tropical geometry perspective,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 5027–5037, 2022

  32. [32]

    Morphological perceptrons: geometry and training algorithms,

    V. Charisopoulos and P. Maragos, “Morphological perceptrons: geometry and training algorithms,” in International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing. Springer, 2017, pp. 3–15

  33. [33]

    A tropical approach to neural networks with piecewise linear activations, 2018

    ——, “A tropical approach to neural networks with piecewise linear activations,”arXiv preprint arXiv:1805.08749, 2018

  34. [34]

    What do graph neural networks learn? insights from tropical geometry,

    T. A. Pham and V. Garg, “What do graph neural networks learn? insights from tropical geometry,” Advances in Neural Information Processing Systems, vol. 37, pp. 10988–11020, 2024

  35. [35]

    Tropnnc: Structured neural network compression using tropical geometry,

    K. Fotopoulos, P. Maragos, and P. Misiakos, “Tropnnc: Structured neural network compression using tropical geometry,” arXiv preprint arXiv:2409.03945, 2024

  36. [36]

    Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes

    T. Zaslavsky, Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes. American Mathematical Soc., 1975, vol. 154

  37. [37]

    Tropical geometry

    G. Mikhalkin and J. Rau, Tropical geometry. MPI for Mathematics, 2009, vol. 8

  38. [38]

    Introduction to tropical geometry

    D. Maclagan and B. Sturmfels, Introduction to tropical geometry. American Mathematical Soc., 2015, vol. 161

  39. [39]

    On basic concepts of tropical geometry,

    O. Y. Viro, “On basic concepts of tropical geometry,” Proceedings of the Steklov Institute of Mathematics, vol. 273, no. 1, pp. 252–282, 2011

  40. [40]

    D. E. Speyer, Tropical geometry. University of California, Berkeley, 2005

  41. [41]

    G. M. Ziegler, Lectures on polytopes. Springer Science & Business Media, 2012, vol. 152

  42. [42]

    Introduction to toric varieties

    W. Fulton, Introduction to toric varieties. Princeton University Press, 1993, no. 131

  43. [43]

    The maslov dequantization, idempotent and tropical mathematics: a brief introduction,

    G. L. Litvinov, “The maslov dequantization, idempotent and tropical mathematics: a brief introduction,”arXiv preprint math/0507014, 2005

  44. [44]

    Minkowski addition of polytopes: computational complexity and applications to Gröbner bases,

    P. Gritzmann and B. Sturmfels, “Minkowski addition of polytopes: computational complexity and applications to Gröbner bases,” SIAM Journal on Discrete Mathematics, vol. 6, no. 2, pp. 246–269, 1993

  45. [45]

    Dynamic voronoi diagrams,

    I. Gowda, D. Kirkpatrick, D. Lee, and A. Naamad, “Dynamic voronoi diagrams,”IEEE Transactions on Information Theory, vol. 29, no. 5, pp. 724–731, 1983

  46. [46]

    Voronoi cells, probabilistic bounds, and hypothesis testing in mixed integer linear models,

    P. Xu, “Voronoi cells, probabilistic bounds, and hypothesis testing in mixed integer linear models,” IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 3122–3138, 2006

  47. [47]

    Power diagrams: properties, algorithms and applications,

    F. Aurenhammer, “Power diagrams: properties, algorithms and applications,”SIAM journal on computing, vol. 16, no. 1, pp. 78–96, 1987

  48. [48]

    Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,

    T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, 1965

    Proof-Specific Constants and Auxiliaries (Appx B - G) PMHSA,P FFN Space partitions induced by MHSA and FFN respectively - MMHSA,M FFN Region counts in MHSA and FFN sub-layersZ + Cd Complexity constant for Minkowski sums inddimensionsR >0 Sl,R l Induced total partition and set of regions afterllayers - Nl Cumulative number of linear regions afterllayersZ +...