FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Pith reviewed 2026-06-29 13:25 UTC · model grok-4.3
The pith
Semantic errors in math benchmark paraphrases reorder frontier model rankings and allow any desired ordering via family weighting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FormInv supplies an audit protocol that uses cross-model unanimity (>=3/4 models on MathCheck, >=6/9 on the primary set) to flag semantically incorrect paraphrases at low cost, along with SCR and per-theorem Cochran's Q as invariance measures. Applied to Lean4-verified theorems, it finds 3.1% incorrect paraphrases whose removal changes model ranks and 47% incorrect connective-variation paraphrases in an auto-generated set. The central formal result is the No-Free-Benchmark corollary: because no model Pareto-dominates all paraphrase families, for any target ranking over the nine models there exists a weighting over families that realizes it.
What carries the argument
Cross-model unanimity threshold applied to paraphrase families to detect semantic errors, paired with Semantic Consistency Rate (SCR) as the invariance metric.
If this is right
- Removing the flagged incorrect paraphrases produces ranking shifts that aggregate accuracy alone does not reveal.
- SCR spans a 32-point range across models while accuracy spans only 10 points, exposing robustness differences hidden by standard metrics.
- Any desired ordering of the nine models can be realized by selecting and weighting appropriate paraphrase families.
- Benchmark construction that fixes one set of paraphrase families implicitly determines which models appear strongest.
Where Pith is reading between the lines
- Benchmarks could be required to demonstrate invariance across several paraphrase families before rankings are reported.
- Model training objectives might add explicit penalties for inconsistency under semantic restatements.
- The unanimity audit could be tested on non-math domains to check whether the same error-detection pattern holds.
- Developers might report both accuracy and SCR when claiming benchmark leadership.
Load-bearing premise
That agreement among multiple frontier models reliably identifies paraphrases that are truly semantically incorrect rather than simply reflecting shared model limitations.
What would settle it
A human or formal verification of a set of paraphrases unanimously flagged as incorrect by the protocol that shows they are actually semantically equivalent.
Figures
read the original abstract
A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FormInv, a cross-model unanimity protocol to audit semantic invariance in mathematical reasoning benchmarks. It reports detecting 3.1% semantically incorrect paraphrases in MathCheck (causing ranking shifts among 9 frontier models invisible to single-model eval), 47% error rate in auto-generated paraphrases, wide SCR variation (50-82%) despite narrow accuracy range (86-96%), and the No-Free-Benchmark corollary that any target ranking over the 9 models is realizable by some weighting over paraphrase families since no model Pareto-dominates all families. It supplies SCR, Cochran's Q, and FormInvSelector as tools, with replication at 100% recall on external benchmarks.
Significance. If the protocol and corollary hold, the work identifies a previously invisible measurement gap in benchmark robustness and shows that paraphrase-family selection implicitly determines model rankings, with direct implications for how mathematical reasoning benchmarks are constructed and interpreted. The low-cost audit protocol and per-theorem invariance metrics are potentially useful additions if validated.
major comments (3)
- [audit protocol description] The audit protocol section: cross-model unanimity (>=3/4 for MathCheck, >=6/9 for primary eval) is used to label paraphrases as semantically incorrect and to compute the 3.1% error rate, 47% auto-paraphrase error rate, and the performance vectors underlying the No-Free-Benchmark corollary, yet no validation against human semantic equivalence judgments or independent ground truth is reported; correlated model errors could produce unanimous disagreement on truly equivalent paraphrases, rendering the reported percentages and corollary dependent on an untested assumption.
- [No-Free-Benchmark corollary] The No-Free-Benchmark corollary (abstract and final section): the claim that any target ranking is realizable by weighting over paraphrase families is presented as following directly from the observation that no model Pareto-dominates all families, but no derivation, proof sketch, or formal argument is supplied showing how the (potentially mislabeled) performance vectors imply existence of such weightings for arbitrary targets.
- [results on ranking changes and SCR] Results on ranking shifts and SCR spread: the reported drops (GPT-4o from rank 2 to 4) and 32-point SCR gap are load-bearing for the central measurement-gap claim, yet no error-bar analysis, sensitivity to the exact unanimity threshold, or ablation on the 366-811 items is provided to establish robustness of these quantities.
minor comments (2)
- [abstract] The abstract states 'replicated on external benchmarks at 100% recall' without specifying which external benchmarks or the exact replication procedure.
- [methods] Notation for SCR and Cochran's Q is introduced without an explicit equation or definition in the provided abstract, making it hard to assess how per-theorem invariance is quantified.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [audit protocol description] The audit protocol section: cross-model unanimity (>=3/4 for MathCheck, >=6/9 for primary eval) is used to label paraphrases as semantically incorrect and to compute the 3.1% error rate, 47% auto-paraphrase error rate, and the performance vectors underlying the No-Free-Benchmark corollary, yet no validation against human semantic equivalence judgments or independent ground truth is reported; correlated model errors could produce unanimous disagreement on truly equivalent paraphrases, rendering the reported percentages and corollary dependent on an untested assumption.
Authors: The FormInv protocol employs cross-model unanimity as an automated, low-cost proxy for detecting semantic non-invariance, with the manuscript explicitly reporting 100% recall replication on external benchmarks as empirical support. We acknowledge that this does not include direct human semantic equivalence judgments for the primary MathCheck audit and that correlated model errors remain a theoretical possibility. We will revise the manuscript to add an explicit limitations subsection discussing this assumption and outlining future human validation work. revision: partial
-
Referee: [No-Free-Benchmark corollary] The No-Free-Benchmark corollary (abstract and final section): the claim that any target ranking is realizable by weighting over paraphrase families is presented as following directly from the observation that no model Pareto-dominates all families, but no derivation, proof sketch, or formal argument is supplied showing how the (potentially mislabeled) performance vectors imply existence of such weightings for arbitrary targets.
Authors: We agree a formal derivation is needed for clarity. The claim rests on the observation that the models' performance vectors across paraphrase families have no single model dominating all families; this implies that suitable non-negative weightings over families can realize arbitrary target rankings via convex combination. We will add a concise proof sketch deriving the existence result from the no-Pareto-dominance condition in the revised manuscript. revision: yes
-
Referee: [results on ranking changes and SCR] Results on ranking shifts and SCR spread: the reported drops (GPT-4o from rank 2 to 4) and 32-point SCR gap are load-bearing for the central measurement-gap claim, yet no error-bar analysis, sensitivity to the exact unanimity threshold, or ablation on the 366-811 items is provided to establish robustness of these quantities.
Authors: We agree that robustness analyses would strengthen the presentation of the ranking shifts and SCR spread. In the revision we will add bootstrapped error bars on the reported quantities, sensitivity analysis across unanimity thresholds (e.g., 5/9 and 7/9), and ablations on item subsets within the 366-811 range to confirm stability of the GPT-4o rank change and the 32-point SCR gap. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's No-Free-Benchmark corollary is stated as following directly from the empirical observation that no model Pareto-dominates all paraphrase families in the collected data. No quoted step reduces any claimed result to its inputs by definition, renames a fitted parameter as a prediction, or relies on a load-bearing self-citation chain. The invariance measures (SCR, Cochran's Q), the audit protocol, and the ranking changes are defined and computed from the performance vectors on the identified families without circular reduction. The derivation is self-contained relative to the 9-model evaluation on 366-811 items.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Cross-model unanimity (>=3/4 or >=6/9 models) reliably flags semantically incorrect paraphrases
Reference graph
Works this paper leans on
-
[1]
de Zarz `a, I., de Curt `o, J., Cabot, J., Manzoni, P., and Calafate, C
doi: 10.2307/2346806. de Zarz `a, I., de Curt `o, J., Cabot, J., Manzoni, P., and Calafate, C. T. Semantic invariance in agentic AI. InIn- ternational Conference on Agents and Multi-Agent Sys- tems: Technologies and Applications (AMSTA),
-
[2]
doi: 10.18653/v1/2020.acl-main
-
[3]
doi: 10.1177/109442810031002. Yang, Y . and Wang, W. Y . The benchmark illusion: On the incoherence of machine learning evaluation,
-
[4]
Zhou, Y ., Zhu, Y ., Antognini, D., Kim, Y ., and Zhang, Y
ICLR 2026 Poster. Zhou, Y ., Zhu, Y ., Antognini, D., Kim, Y ., and Zhang, Y . Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2024a. Zhou, Z., Liu, S., Ning, M., Liu, W., Wang, J., Wong, D. F., H...
2026
-
[5]
adividesm+nexactly when it divides each sum- mand
PU:30 minutes/day, models compute10×30×7 = 2100without unit conversion. All 4 models wrong. Group 27 (sub-question redirection, severity: moderate-high).Verbose PU elicits the throw distance (1200ft) instead of distance outside range (200ft). 3/4 models wrong (GPT-4o correct by format). 12 FormInv: Semantic Invariance in Mathematical Reasoning Benchmarks ...
1979
-
[6]
For independent evaluators with individual accuracya:P(truth=v|all9votev) =a 9/(a9 + (1−a) 9)≈99.99%ata= 0.85
generalizes this to heterogeneous, correlated evaluators via EM inference on competence parameters. For independent evaluators with individual accuracya:P(truth=v|all9votev) =a 9/(a9 + (1−a) 9)≈99.99%ata= 0.85. The independence assumption is the key caveat: LLM errors are significantly correlated across models (Kim et al., 2025). We use mod- els from 5 pr...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.