Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings

Rodrigo Mendoza-Smith

arxiv: 2606.28572 · v1 · pith:I47QEGCLnew · submitted 2026-06-26 · 💻 cs.LG · cs.AI· cs.LO· math.LO

Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings

Rodrigo Mendoza-Smith This is my paper

Pith reviewed 2026-06-30 00:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LOmath.LO

keywords axiom of choiceneural embeddingsconstructive proofstheorem provinganomaly detectionLean Mathlibself-supervised learning

0 comments

The pith

A neural encoder trained solely on constructive proofs measures classical proofs by their distance from the axiom of choice in the dependency graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper partitions Mathlib theorems by whether they depend on the axiom of choice according to Lean's kernel. It trains a self-supervised encoder on the constructive subset using their tactic sequences. When this encoder processes classical theorems, three measures of anomaly decrease as the theorems depend on the axiom only indirectly through longer chains. This geometric pattern holds after controlling for length and other factors and links to differences in how easily tactics can solve the theorems.

Core claim

When a self-supervised proof encoder trained only on constructive theorems is applied to classical theorems, anomaly score, reconstruction loss, and density-superlevel containment all decline with the proof's distance from the axiom of choice in the dependency graph, from AUC 0.847 at distance 2 to indistinguishability at distance 9+.

What carries the argument

The self-supervised proof encoder that embeds sequences of tactic invocations and computes anomaly via reconstruction loss and density-superlevel containment.

If this is right

The geometric anomaly score predicts aesop tactic failure beyond proof length.
A neural-guided hybrid using ReProver compresses the gap in solving rate between constructive and classical theorems from 13x to 5x.
The signature is robust under controls for length, file, author, and topic.
Full-source encoders trained on normalised proof source replicate the decline pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar geometric signatures might appear for other foundational axioms such as the law of excluded middle.
The anomaly score could serve as a practical heuristic to prioritize theorems for proof search in mixed constructive-classical libraries.
Training encoders on proofs that avoid specific axioms might improve generalization in automated theorem proving systems.

Load-bearing premise

The filtering and transitive closure in Lean produces a partition where the only systematic difference between constructive and classical theorems is the presence or absence of the axiom of choice itself.

What would settle it

If the anomaly metrics fail to show a decline with increasing dependency distance when the same encoder is tested on the dataset with randomized dependency labels.

Figures

Figures reproduced from arXiv: 2606.28572 by Rodrigo Mendoza-Smith.

**Figure 1.** Figure 1: The depth law in three views. (a) k-NN distance distribution by depth: histograms of mean Euclidean distance to the 5 nearest constructive-training proofs (constructive-standardised encoder embeddings; Section G). (b) Per-token saliency: one representative classical proof per depth bucket; cell colour intensity is the single-token reconstruction loss under the constructive encoder, and Σ is the summed per-… view at source ↗

**Figure 2.** Figure 2: Proof representation and denoising pipeline. Five stages, left to right, illustrated on a real Mathlib proof of GrothendieckTopology.arrow trans (<- and >> in the body stand for the Lean unicode ← and ≫). (1) A tactic-mode proof body, with each line a tactic invocation and the leading identifier in bold blue. (2) The tactic head of each invocation is its first identifier, e.g. rw for rw [<- Sieve.pullback … view at source ↗

**Figure 3.** Figure 3: Mean reconstruction loss by depth from Classical.choice. Green bar: held-out constructive test proofs. Red bars: classical proofs stratified by depth; the first bar pools the 27 maskable depth-1 proofs with the 3,680 depth-2 proofs. Error bars show SEM. 2.2. Measurement #3: Density-Superlevel Containment The third measurement stops asking for a ranking score and instead asks whether a proof lies inside the… view at source ↗

**Figure 4.** Figure 4: Three measurements, one gradient. Each metric is rescaled from the constructive baseline (0) to its shallow endpoint (1): strict depth 2 for AUC and containment, and d ≤ 2 for reconstruction. Intermediate buckets are not fit; the solid curve is the pointwise mean and the band its range. The separate λd fit is reported in Section F. the constructive mean masked-token loss for reconstruction, and the 10% out… view at source ↗

**Figure 5.** Figure 5: Depth-stratified ablation. Grey percentages mark the fraction of proofs in each bucket containing a classical tactic marker. Depth 2 loses 0.09 AUC (from 0.847 to 0.754), confirming some marker-keyword signal there. All deeper buckets lose at most 0.008. The length-residualised column of [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: The classical anomaly gap survives author and topic controls. (a) Depth-stratified anomaly gap (classical − constructive, in standard deviations) under three length-matched pairings: same file (the file-matched control of Section C.5), same author, and same statement (topic). All three remain positive and reproduce the depth gradient. (b) Within-area k-NN AUC across the 52 level-2 Mathlib subdomains with a… view at source ↗

read the original abstract

The axiom of choice has divided the foundations of mathematics for over a century, but the distinction between classical and constructive proofs has remained a philosophical and methodological one. We use Lean 4's kernel-level tracking of axiom dependence to show that the axiom of choice has a measurable geometric correlate in proof space that obeys a one-parameter mixture law and has operational consequences for neural theorem provers. To do this, we partition $471{,}260$ declarations of Mathlib by transitive dependence on the axiom of choice and represent a filtered population of $42{,}355$ traced theorems by their sequences of tactic invocations. We use the constructive proofs in this dataset to train a self-supervised proof encoder and show that when using it to measure classical proofs, three complementary measurements (anomaly score, reconstruction loss, and density-superlevel containment) exhibit a common decline with the proof's distance from the axiom in the dependency graph, from sharp separation at the shallow boundary (AUC $0.847$ at distance $2$) to indistinguishability at distance~$9{+}$. Robustness controls show that the signature survives length, file, author, and topic controls, and replicates under full-source encoders trained on normalised proof source. Operationally, we show that on an evaluation sample of $251$ Mathlib theorems, Lean's \texttt{aesop} tactic solves constructive theorems at $13\times$ the rate of classical ones, and a neural-guided hybrid using the ReProver tactic generator compresses the gap to $5\times$. The geometric anomaly score predicts \texttt{aesop} failure beyond proof length, providing an operational link between the geometric signature and prover performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extracts a geometric signature from proof embeddings that tracks distance to the axiom of choice and correlates with tactic success rates.

read the letter

The main takeaway is that a self-supervised encoder trained only on constructive Lean theorems produces anomaly, reconstruction, and density metrics on classical theorems that fall off with graph distance to AC, from AUC 0.847 at distance 2 down to flat at distance 9+. The work also ties this signal to prover behavior, showing aesop runs 13x faster on constructive cases and a neural hybrid closes it to 5x.

What is actually new is the combination of kernel-level transitive closure partitioning on 42k theorems with three complementary geometric measurements on tactic sequences, plus the explicit link to operational prover metrics. The controls for length, file, author, and topic, plus the full-source encoder replication, are solid steps that make the AC dependence claim more believable than a raw embedding difference would be.

The soft spots are in the quantitative layer. The abstract states the one-parameter mixture law and the specific AUC and speedup numbers but gives no derivation, no error bars, and no detail on how the law was fitted or tested against the three metrics. Without those, it is difficult to judge whether the monotonic decline is as clean as reported or whether the mixture parameter absorbs other variation. The partition assumption looks reasonable given the controls, but any residual topic or style leakage would weaken the geometric interpretation.

This is for researchers building neural proof assistants who need better ways to route or prioritize across classical and constructive libraries. The core idea is worth a serious referee even if the stats section needs tightening; the experimental design is grounded enough to merit external review.

Referee Report

3 major / 2 minor

Summary. The paper partitions 471,260 Mathlib declarations by transitive AC dependence in Lean's kernel, filters to 42,355 traced theorems, trains a self-supervised encoder solely on constructive proofs, and reports that anomaly score, reconstruction loss, and density-superlevel containment on classical proofs decline monotonically with graph distance to AC (AUC 0.847 at distance 2, indistinguishable at 9+). The signature survives explicit controls for length/file/author/topic and replicates with full-source encoders; operationally, aesop solves constructive theorems 13× faster than classical ones and a ReProver hybrid reduces the gap to 5×, with the geometric score predicting aesop failure beyond length.

Significance. If the quantitative claims and controls hold, the work supplies the first large-scale, kernel-grounded geometric signature of a foundational axiom in proof embeddings and demonstrates an operational link to neural prover performance, strengthening the case that self-supervised representations can surface non-syntactic distinctions in formal mathematics.

major comments (3)

[Abstract, §4] Abstract and §4 (results on mixture law): the claim that the three metrics 'obey a one-parameter mixture law' is load-bearing for the geometric-correlate thesis, yet the manuscript provides no equation for the law, no value or fitting procedure for the free parameter, and no cross-validation or residual analysis; without these the reported monotonic decline cannot be distinguished from a post-hoc descriptive fit.
[Abstract, §5] Abstract and §5 (AUC and speedup numbers): AUC 0.847, 13× and 5× speedups, and robustness to length/file/author/topic are reported without error bars, bootstrap intervals, or statistical tests; because these quantities are the primary evidence for both the geometric effect and its operational relevance, their uncertainty must be quantified before the central claim can be evaluated.
[§3] §3 (dataset construction): the filtering to 42,355 theorems and the transitive-closure partition are presented as isolating the AC effect, but no table or appendix quantifies residual differences in proof length, tactic vocabulary, or author distribution between the constructive and classical subsets after the stated controls; if these covariates remain imbalanced the distance-to-AC correlation could be confounded.

minor comments (2)

[§2] Notation for the three metrics (anomaly score, reconstruction loss, density-superlevel containment) is introduced without a single consolidated definition or pseudocode; a short methods subsection or appendix table would improve reproducibility.
[§5] The 251-theorem evaluation sample for aesop/ReProver is described only by size; a breakdown by distance-to-AC bin or by the same controls used in the embedding experiments would strengthen the operational link.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We address each major point below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (results on mixture law): the claim that the three metrics 'obey a one-parameter mixture law' is load-bearing for the geometric-correlate thesis, yet the manuscript provides no equation for the law, no value or fitting procedure for the free parameter, and no cross-validation or residual analysis; without these the reported monotonic decline cannot be distinguished from a post-hoc descriptive fit.

Authors: We agree that the mixture law claim requires explicit formalization. In the revision we will add the precise equation for the one-parameter mixture law, the fitted parameter value and its estimation procedure, cross-validation results, and residual analysis to confirm that the observed monotonic decline is not a post-hoc descriptive fit. revision: yes
Referee: [Abstract, §5] Abstract and §5 (AUC and speedup numbers): AUC 0.847, 13× and 5× speedups, and robustness to length/file/author/topic are reported without error bars, bootstrap intervals, or statistical tests; because these quantities are the primary evidence for both the geometric effect and its operational relevance, their uncertainty must be quantified before the central claim can be evaluated.

Authors: We accept that uncertainty quantification is necessary for the primary metrics. The revised version will include bootstrap confidence intervals, error bars on the reported AUC and speedup figures, and statistical tests (e.g., permutation tests for AUC and appropriate nonparametric tests for speedups) in the abstract, §5, and associated tables/figures. revision: yes
Referee: [§3] §3 (dataset construction): the filtering to 42,355 theorems and the transitive-closure partition are presented as isolating the AC effect, but no table or appendix quantifies residual differences in proof length, tactic vocabulary, or author distribution between the constructive and classical subsets after the stated controls; if these covariates remain imbalanced the distance-to-AC correlation could be confounded.

Authors: We will add a supplementary table (or appendix section) that reports the residual differences in proof length, tactic vocabulary size, and author distribution between the constructive and classical subsets after the length/file/author/topic controls have been applied, allowing direct assessment of any remaining imbalance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper partitions theorems using Lean's external kernel-level transitive closure on axiom dependence, trains a self-supervised encoder solely on the constructive subset, and evaluates anomaly/reconstruction/density metrics on the classical subset. The reported monotonic decline with graph distance, AUC values, and robustness to length/author/topic controls are empirical measurements, not reductions of a fitted parameter or self-citation to the target result. The one-parameter mixture law is presented as an observed pattern rather than a definitional input. No load-bearing step equates a prediction to its own training data by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on Lean's kernel correctly computing transitive axiom dependence, on the tactic sequences being a faithful representation of proof structure, and on the self-supervised encoder learning a representation whose only relevant axis is constructive vs. classical. No explicit free parameters or invented entities are named in the abstract beyond the one-parameter mixture law.

free parameters (1)

one-parameter mixture law parameter
The abstract states that the geometric correlate 'obeys a one-parameter mixture law'; this scalar is introduced to describe the decline and is therefore a free parameter unless derived from first principles elsewhere in the paper.

axioms (1)

domain assumption Lean's kernel-level tracking of axiom dependence produces an accurate transitive closure for every declaration in Mathlib.
The entire partitioning into constructive and classical sets depends on this property of the Lean 4 kernel.

pith-pipeline@v0.9.1-grok · 5834 in / 1489 out tokens · 36000 ms · 2026-06-30T00:43:09.839967+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 8 canonical work pages · 3 internal anchors

[1]

BERT: Pre-training of deep bidirectional transformers for lan- guage understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 confer- ence of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186,

2019
[2]

M., Rute, J., Wu, Y ., Ayers, E

Han, J. M., Rute, J., Wu, Y ., Ayers, E. W., and Polu, S. Proof artifact co-training for theorem proving with language models.arXiv preprint arXiv:2102.06203,

work page arXiv
[3]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

GamePad: A Learning Environment for Theorem Proving

Huang, D., Dhariwal, P., Song, D., and Sutskever, I. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Enhancing the reliability of out-of-distribution image detection in neural networks

Liang, S., Li, Y ., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690,

work page arXiv
[6]

Magnushammer: A transformer-based ap- proach to premise selection

Mikuła, M., Tworkowski, S., Antoniak, S., Piotrowski, B., Jiang, Q., Zhou, J., Szegedy, C., Kuci´nski, Ł., Miło´s, P., and Wu, Y . Magnushammer: A transformer-based ap- proach to premise selection. InInternational Conference on Learning Representations, volume 2024, pp. 39326– 39350,

2024
[7]

Generative Language Modeling for Automated Theorem Proving

Polu, S. and Sutskever, I. Generative language model- ing for automated theorem proving.arXiv preprint arXiv:2009.03393,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[8]

M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I

Polu, S., Han, J. M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I. Formal mathematics statement curricu- lum learning.arXiv preprint arXiv:2202.01344,

work page arXiv
[9]

The Lean Mathematical Library

The mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January

2020
[10]

2020 , pages =

1145/3372885.3373824. URL https://doi.org/ 10.1145/3372885.3373824. Yang, K. and Deng, J. Learning to prove theorems via inter- acting with proof assistants. InInternational Conference on Machine Learning, pp. 6984–6994. PMLR,

work page doi:10.1145/3372885.3373824
[11]

Frac. with marker

9 Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings A. Reproducibility Code, theorem identifiers, dependency-depth labels, data splits, the251-theorem operational sample, and commands to repro- duce every table and figure are available at https://github.com/rodrgo/geometric-axiom-of-choice . All randomness is seeded, and the reposit...

work page arXiv
[12]

Linear probes.All probes are ℓ2-regularised scikit-learn models with C= 1.0 , L-BFGS solver, 2000 iterations, evaluated by 5-fold cross-validation

For every proof (train and test) we then subtract the fitted prediction ˜ei =e i − ˆα− ˆβℓ i, and pass the residual ˜ei to downstream analyses. Linear probes.All probes are ℓ2-regularised scikit-learn models with C= 1.0 , L-BFGS solver, 2000 iterations, evaluated by 5-fold cross-validation. The domain probe is multinomial LogisticRegression predicting Mat...

2000
[13]

If the depth law identifies a frontier in proof methodology, theorems on the far side of that frontier should be harder foraesopto solve

is the standard general-purpose automation in Lean 4, a best-first search over a rule set fixed at compile time. If the depth law identifies a frontier in proof methodology, theorems on the far side of that frontier should be harder foraesopto solve. Sampling and pipeline.We draw up to 60 theorems per bucket (constructive, depth 2, depth 3−4, depth 5−6, d...

2023

[1] [1]

BERT: Pre-training of deep bidirectional transformers for lan- guage understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 confer- ence of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186,

2019

[2] [2]

M., Rute, J., Wu, Y ., Ayers, E

Han, J. M., Rute, J., Wu, Y ., Ayers, E. W., and Polu, S. Proof artifact co-training for theorem proving with language models.arXiv preprint arXiv:2102.06203,

work page arXiv

[3] [3]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

GamePad: A Learning Environment for Theorem Proving

Huang, D., Dhariwal, P., Song, D., and Sutskever, I. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Enhancing the reliability of out-of-distribution image detection in neural networks

Liang, S., Li, Y ., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690,

work page arXiv

[6] [6]

Magnushammer: A transformer-based ap- proach to premise selection

Mikuła, M., Tworkowski, S., Antoniak, S., Piotrowski, B., Jiang, Q., Zhou, J., Szegedy, C., Kuci´nski, Ł., Miło´s, P., and Wu, Y . Magnushammer: A transformer-based ap- proach to premise selection. InInternational Conference on Learning Representations, volume 2024, pp. 39326– 39350,

2024

[7] [7]

Generative Language Modeling for Automated Theorem Proving

Polu, S. and Sutskever, I. Generative language model- ing for automated theorem proving.arXiv preprint arXiv:2009.03393,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[8] [8]

M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I

Polu, S., Han, J. M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I. Formal mathematics statement curricu- lum learning.arXiv preprint arXiv:2202.01344,

work page arXiv

[9] [9]

The Lean Mathematical Library

The mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January

2020

[10] [10]

2020 , pages =

1145/3372885.3373824. URL https://doi.org/ 10.1145/3372885.3373824. Yang, K. and Deng, J. Learning to prove theorems via inter- acting with proof assistants. InInternational Conference on Machine Learning, pp. 6984–6994. PMLR,

work page doi:10.1145/3372885.3373824

[11] [11]

Frac. with marker

9 Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings A. Reproducibility Code, theorem identifiers, dependency-depth labels, data splits, the251-theorem operational sample, and commands to repro- duce every table and figure are available at https://github.com/rodrgo/geometric-axiom-of-choice . All randomness is seeded, and the reposit...

work page arXiv

[12] [12]

Linear probes.All probes are ℓ2-regularised scikit-learn models with C= 1.0 , L-BFGS solver, 2000 iterations, evaluated by 5-fold cross-validation

For every proof (train and test) we then subtract the fitted prediction ˜ei =e i − ˆα− ˆβℓ i, and pass the residual ˜ei to downstream analyses. Linear probes.All probes are ℓ2-regularised scikit-learn models with C= 1.0 , L-BFGS solver, 2000 iterations, evaluated by 5-fold cross-validation. The domain probe is multinomial LogisticRegression predicting Mat...

2000

[13] [13]

If the depth law identifies a frontier in proof methodology, theorems on the far side of that frontier should be harder foraesopto solve

is the standard general-purpose automation in Lean 4, a best-first search over a rule set fixed at compile time. If the depth law identifies a frontier in proof methodology, theorems on the far side of that frontier should be harder foraesopto solve. Sampling and pipeline.We draw up to 60 theorems per bucket (constructive, depth 2, depth 3−4, depth 5−6, d...

2023