Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Arash Gholami Davoodi; Navid Rezazadeh

arxiv: 2605.10974 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Navid Rezazadeh , Arash Gholami Davoodi This is my paper

Pith reviewed 2026-05-13 06:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords transformer verificationsoftmax boundscertified robustnessinterval constraintsvertex optimizationattention mechanismsCROWN

0 comments

The pith

The exact optimum of bounding softmax over score intervals occurs at a vertex of the box, enabling a log-linear tight bound for transformer verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proves that when bounding the softmax function for transformer verification using only interval constraints on the pre-softmax scores, the exact worst-case value is always achieved at one of the corners of the interval box. A structure theorem shows that sorting the coefficients reduces the number of candidates to check to a linear number in the sequence length, enabling an efficient algorithm called Vertex-Softmax. This bound is the tightest possible using only the intervals, and when used in a verifier it raises the number of certifiably robust inputs on standard image datasets while running faster than alternatives. A reader would care because looser bounds currently limit how much safety can be guaranteed for attention models in practice.

Core claim

We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates, yielding the Vertex-Softmax primitive with log-linear complexity in the sequence length. We further prove a formal optimality result showing that Vertex-Softmax is the tightest sound bound obtainable from score intervals alone.

What carries the argument

Vertex-Softmax, the algorithm that finds the exact softmax bound by checking a linear number of sorted vertex candidates in the score interval box.

If this is right

Integrates into CROWN-style verifiers while preserving formal soundness.
Improves certified robustness rates on MNIST, Fashion-MNIST, and CIFAR-10 models.
Tightens lower bounds on worst-case outputs substantially.
Matches or exceeds performance of alpha-CROWN and branch-and-bound at lower cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that further tightening would require exploiting correlations between scores or coupling with output values.
The log-linear scaling supports application to longer sequence lengths in transformers.
Similar vertex-based exact optimization may apply to bounding other activation functions in neural network verifiers.

Load-bearing premise

The exact optimality and tightness results rely on having no additional information beyond independent interval constraints on each score.

What would settle it

Finding a set of score intervals and objective coefficients where the maximizing point for the softmax is not a vertex of the box or is missed by the linear candidate list after sorting.

Figures

Figures reproduced from arXiv: 2605.10974 by Arash Gholami Davoodi, Navid Rezazadeh.

**Figure 2.** Figure 2: Scalable attention sweep at ϵ = 0.02. Vertex-Softmax gives the highest mean certified lower bound and the smallest gap to the best attack across the tested sequence lengths. GaLileo-style implementation baseline results for K ≤ 16 are in Appendix [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison against alpha-CROWN and fixed-budget ABCrown-BaB on selected small [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Runtime scaling for the exact Vertex-Softmax threshold solver. Exhaustive enumeration is [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax ndependently of the downstream objective, leaving avoidable slack. We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates, yielding the Vertex-Softmax primitive with log-linear complexity in the sequence length. We further prove a formal optimality result showing that Vertex-Softmax is the tightest sound bound obtainable from score intervals alone, characterizing precisely what additional structure (score correlations, score-value coupling) is needed for further improvement. Integrated into a CROWN Convex Relaxation based Optimization for Worst-case Neurons)-style verifier with a formal soundness guarantee, Vertex-Softmax significantly improves certified rates and substantially tightens lower bounds across MNIST, Fashion-MNIST, and CIFAR-10 attention models, while consistently matching or outperforming alpha-CROWN and branch-and-bound baselines at a fraction of their cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Vertex-Softmax gives an exact, log-linear bound on softmax under score intervals by proving vertex attainment and a threshold structure that reduces candidates to linear in sequence length.

read the letter

The main contribution is a clean optimization result for the score-box problem: the maximum of a linear functional over softmax(s) with s in an interval box is always attained at a vertex, and after sorting the coefficients it lies among only O(n) threshold points (prefix upper, suffix lower). This yields the Vertex-Softmax primitive with log-linear complexity and, by the optimality theorem, the tightest sound bound possible from intervals alone. They correctly flag that correlations or score-value coupling would be needed for anything tighter. That is the new piece; prior relaxations in the verification literature treated softmax independently of the objective and left slack that this removes by construction. The integration into a CROWN-style verifier carries a formal soundness guarantee, and the experiments report higher certified rates plus tighter lower bounds on MNIST, Fashion-MNIST, and CIFAR-10 attention models while running faster than alpha-CROWN or branch-and-bound. The proofs appear internally consistent from the abstract and stress-test note, with no visible circularity or hidden assumptions about independence. The main soft spot is that everything remains interval-only, so real-world score dependence is still ignored; that is not a flaw but a stated boundary. Experiments stay on modest models, which is normal for this stage but leaves scaling questions open. This is for people working on certified robustness for transformers. It deserves a serious referee because it supplies a new, formally grounded primitive with measurable gains rather than another heuristic relaxation.

Referee Report

0 major / 2 minor

Summary. The manuscript presents Vertex-Softmax, an exact optimization method for bounding the softmax function in transformer attention layers under interval constraints on the pre-softmax scores. The central contributions are proofs that the optimum is attained at a vertex of the box constraint and that a threshold structure allows checking only O(n) candidates after sorting the coefficients, resulting in a log-linear time algorithm. It also establishes that this bound is the tightest possible using only score intervals and integrates the method into a CROWN-style verifier with soundness guarantees, reporting improved certified accuracies on standard image classification datasets.

Significance. This result is significant for the field of certified verification of neural networks, particularly transformers, as it eliminates avoidable slack in softmax relaxations without requiring additional assumptions like score correlations. The formal optimality result helps delineate the boundaries of interval-based verification. If the proofs are correct, the efficiency and tightness could lead to broader adoption in safety-critical applications of attention-based models. The empirical results suggest practical benefits over existing methods like alpha-CROWN.

minor comments (2)

Abstract: The phrase 'relax softmax ndependently of the downstream objective' contains a likely typo ('ndependently' instead of 'independently').
Abstract: The parenthetical in 'CROWN Convex Relaxation based Optimization for Worst-case Neurons)-style verifier' appears malformed; it should probably read 'CROWN (Convex Relaxation based Optimization for Worst-case Neurons)-style verifier'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance of the optimality and efficiency results for interval-based softmax bounding in transformer verification, and the recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; claims rest on self-contained optimization proofs

full rationale

The paper's core results consist of mathematical proofs establishing that the optimum of a linear functional over softmax(s) subject to independent interval constraints on s is attained at a vertex of the box, together with a threshold structure theorem that reduces candidates to O(n) after sorting coefficients. These follow from standard arguments in linear programming and monotonicity properties of the softmax, without reducing to self-referential definitions or fitted parameters. The formal optimality characterization explicitly carves out the 'score intervals alone' setting and states what additional structure would be needed for tightening, which is transparent rather than circular. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the derivation chain. The Vertex-Softmax primitive is presented as an algorithmic consequence of the proven structure, not as a tautology. This is the expected honest non-finding for a paper whose central contribution is a new exact optimization result rather than empirical fitting or external uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard properties of linear optimization over polytopes and basic sorting; no free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)

standard math The optimum of a linear objective over a polytope is attained at a vertex.
Fundamental result from linear programming invoked to establish vertex attainment.

pith-pipeline@v0.9.0 · 5492 in / 1186 out tokens · 90785 ms · 2026-05-13T06:07:15.895511+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Vertex-Softmax is the tightest sound bound obtainable from score intervals alone

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Advances in Neural Information Processing Systems , year=

Efficient Neural Network Robustness Certification with General Activation Functions , author=. Advances in Neural Information Processing Systems , year=

work page
[2]

Advances in Neural Information Processing Systems , year=

Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond , author=. Advances in Neural Information Processing Systems , year=

work page
[3]

Advances in Neural Information Processing Systems , year=

Beta-CROWN: Efficient Bound Propagation with Per-Neuron Split Constraints for Neural Network Robustness Verification , author=. Advances in Neural Information Processing Systems , year=

work page
[4]

International Conference on Learning Representations , year=

Robustness Verification for Transformers , author=. International Conference on Learning Representations , year=

work page
[5]

Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation , pages=

Fast and Precise Certification of Transformers , author=. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation , pages=. 2021 , doi=

work page 2021
[6]

Computer Safety, Reliability, and Security , pages=

Are Transformers More Robust? Towards Exact Robustness Verification for Transformers , author=. Computer Safety, Reliability, and Security , pages=. 2023 , publisher=

work page 2023
[7]

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics , series=

Convex Bounds on the Softmax Function with Applications to Robustness Verification , author=. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics , series=. 2023 , publisher=

work page 2023
[8]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[9]

International Conference on Learning Representations , year=

Evaluating Robustness of Neural Networks with Mixed Integer Programming , author=. International Conference on Learning Representations , year=

work page
[10]

Computer Aided Verification , pages=

The Marabou Framework for Verification and Analysis of Deep Neural Networks , author=. Computer Aided Verification , pages=. 2019 , publisher=

work page 2019
[11]

2020 , publisher=

Tran, Hoang-Dung and others , booktitle=. 2020 , publisher=

work page 2020
[12]

Tools and Algorithms for the Construction and Analysis of Systems , pages=

Neural Network Verification with Branch-and-Bound for General Nonlinearities , author=. Tools and Algorithms for the Construction and Analysis of Systems , pages=. 2025 , publisher=

work page 2025
[13]

Certifying

Kumar, Aounon and Agarwal, Chirag and Srinivas, Suraj and Li, Aaron Jiaxun and Feizi, Soheil and Lakkaraju, Himabindu , booktitle=. Certifying

work page
[14]

Certifying Counterfactual Bias in

Chaudhary, Isha and Hu, Qian and Kumar, Manoj and Ziyadi, Morteza and Gupta, Rahul and Singh, Gagandeep , booktitle=. Certifying Counterfactual Bias in

work page
[15]

arXiv preprint arXiv:2406.09714 , year=

Large Language Model Validity via Enhanced Conformal Prediction Methods , author=. arXiv preprint arXiv:2406.09714 , year=

work page arXiv
[16]

Zhu, Kaijie and others , journal=

work page
[17]

Zhang, Zhexin and others , journal=

work page
[18]

Wang, Haoyu and others , journal=

work page
[19]

Zhang, Yedi and Sun, Yi Emma and Lee, Annabelle Jia En and Dong, Jin Song , journal=

work page
[20]

and Wei, Jiali and Sun, Jun , journal=

Wang, Haoyu and Poskitt, Christopher M. and Wei, Jiali and Sun, Jun , journal=

work page
[21]

Management Science , volume=

On Nonlinear Fractional Programming , author=. Management Science , volume=. 1967 , publisher=

work page 1967

[1] [1]

Advances in Neural Information Processing Systems , year=

Efficient Neural Network Robustness Certification with General Activation Functions , author=. Advances in Neural Information Processing Systems , year=

work page

[2] [2]

Advances in Neural Information Processing Systems , year=

Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond , author=. Advances in Neural Information Processing Systems , year=

work page

[3] [3]

Advances in Neural Information Processing Systems , year=

Beta-CROWN: Efficient Bound Propagation with Per-Neuron Split Constraints for Neural Network Robustness Verification , author=. Advances in Neural Information Processing Systems , year=

work page

[4] [4]

International Conference on Learning Representations , year=

Robustness Verification for Transformers , author=. International Conference on Learning Representations , year=

work page

[5] [5]

Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation , pages=

Fast and Precise Certification of Transformers , author=. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation , pages=. 2021 , doi=

work page 2021

[6] [6]

Computer Safety, Reliability, and Security , pages=

Are Transformers More Robust? Towards Exact Robustness Verification for Transformers , author=. Computer Safety, Reliability, and Security , pages=. 2023 , publisher=

work page 2023

[7] [7]

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics , series=

Convex Bounds on the Softmax Function with Applications to Robustness Verification , author=. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics , series=. 2023 , publisher=

work page 2023

[8] [8]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[9] [9]

International Conference on Learning Representations , year=

Evaluating Robustness of Neural Networks with Mixed Integer Programming , author=. International Conference on Learning Representations , year=

work page

[10] [10]

Computer Aided Verification , pages=

The Marabou Framework for Verification and Analysis of Deep Neural Networks , author=. Computer Aided Verification , pages=. 2019 , publisher=

work page 2019

[11] [11]

2020 , publisher=

Tran, Hoang-Dung and others , booktitle=. 2020 , publisher=

work page 2020

[12] [12]

Tools and Algorithms for the Construction and Analysis of Systems , pages=

Neural Network Verification with Branch-and-Bound for General Nonlinearities , author=. Tools and Algorithms for the Construction and Analysis of Systems , pages=. 2025 , publisher=

work page 2025

[13] [13]

Certifying

Kumar, Aounon and Agarwal, Chirag and Srinivas, Suraj and Li, Aaron Jiaxun and Feizi, Soheil and Lakkaraju, Himabindu , booktitle=. Certifying

work page

[14] [14]

Certifying Counterfactual Bias in

Chaudhary, Isha and Hu, Qian and Kumar, Manoj and Ziyadi, Morteza and Gupta, Rahul and Singh, Gagandeep , booktitle=. Certifying Counterfactual Bias in

work page

[15] [15]

arXiv preprint arXiv:2406.09714 , year=

Large Language Model Validity via Enhanced Conformal Prediction Methods , author=. arXiv preprint arXiv:2406.09714 , year=

work page arXiv

[16] [16]

Zhu, Kaijie and others , journal=

work page

[17] [17]

Zhang, Zhexin and others , journal=

work page

[18] [18]

Wang, Haoyu and others , journal=

work page

[19] [19]

Zhang, Yedi and Sun, Yi Emma and Lee, Annabelle Jia En and Dong, Jin Song , journal=

work page

[20] [20]

and Wei, Jiali and Sun, Jun , journal=

Wang, Haoyu and Poskitt, Christopher M. and Wei, Jiali and Sun, Jun , journal=

work page

[21] [21]

Management Science , volume=

On Nonlinear Fractional Programming , author=. Management Science , volume=. 1967 , publisher=

work page 1967