pith. sign in

arxiv: 2605.29001 · v1 · pith:WRP4JNDNnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Pith reviewed 2026-06-29 13:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords semantic invariancemathematical reasoning benchmarksparaphrase auditsemantic consistency ratemodel ranking sensitivitybenchmark designfrontier model evaluation
0
0 comments X

The pith

Semantic errors in math benchmark paraphrases reorder frontier model rankings and allow any desired ordering via family weighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conventional accuracy scores on mathematical reasoning benchmarks miss large differences in how models handle semantically equivalent restatements of the same problem. It introduces FormInv to audit paraphrase groups for semantic correctness by checking agreement across multiple models and to measure invariance through Semantic Consistency Rate. The audit detects errors that shift rankings, such as moving GPT-4o from second to fourth place, changes invisible when evaluating one model at a time. It also establishes that no model outperforms all paraphrase families, so any target ranking over nine models can be realized by weighting the families appropriately. This matters because benchmark designers are effectively choosing winners through their selection of restatement styles.

Core claim

FormInv supplies an audit protocol that uses cross-model unanimity (>=3/4 models on MathCheck, >=6/9 on the primary set) to flag semantically incorrect paraphrases at low cost, along with SCR and per-theorem Cochran's Q as invariance measures. Applied to Lean4-verified theorems, it finds 3.1% incorrect paraphrases whose removal changes model ranks and 47% incorrect connective-variation paraphrases in an auto-generated set. The central formal result is the No-Free-Benchmark corollary: because no model Pareto-dominates all paraphrase families, for any target ranking over the nine models there exists a weighting over families that realizes it.

What carries the argument

Cross-model unanimity threshold applied to paraphrase families to detect semantic errors, paired with Semantic Consistency Rate (SCR) as the invariance metric.

If this is right

  • Removing the flagged incorrect paraphrases produces ranking shifts that aggregate accuracy alone does not reveal.
  • SCR spans a 32-point range across models while accuracy spans only 10 points, exposing robustness differences hidden by standard metrics.
  • Any desired ordering of the nine models can be realized by selecting and weighting appropriate paraphrase families.
  • Benchmark construction that fixes one set of paraphrase families implicitly determines which models appear strongest.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks could be required to demonstrate invariance across several paraphrase families before rankings are reported.
  • Model training objectives might add explicit penalties for inconsistency under semantic restatements.
  • The unanimity audit could be tested on non-math domains to check whether the same error-detection pattern holds.
  • Developers might report both accuracy and SCR when claiming benchmark leadership.

Load-bearing premise

That agreement among multiple frontier models reliably identifies paraphrases that are truly semantically incorrect rather than simply reflecting shared model limitations.

What would settle it

A human or formal verification of a set of paraphrases unanimously flagged as incorrect by the protocol that shows they are actually semantically equivalent.

Figures

Figures reproduced from arXiv: 2605.29001 by Nishal Thomas, Noel Thomas.

Figure 1
Figure 1. Figure 1: Per-family failure rates across three representative mod￾els (3-model shared evaluation). F5 (connective variation) is the highest-failure family for all models; F6 (comparison order) shows the strongest complementarity (Claude: 16.7%, GPT-4o: 0.0%) [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Bootstrap confidence intervals for Mean-IG as a function of sample size N (subsampled from 103 theorems, 1000 bootstrap iterations). The 95% CI at N = 50 is (0.042, 0.140); at N = 103 it narrows to (0.056, 0.124). 17 [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-theorem IG distribution across all 9 models and 103 theorems. The distribution is strongly bimodal: 79% of theorems have IG = 0 (consistent responses across all paraphrase families) and 21% have IG ≈ 0.42 (fail on 1–3 families). 18 [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FormInv, a cross-model unanimity protocol to audit semantic invariance in mathematical reasoning benchmarks. It reports detecting 3.1% semantically incorrect paraphrases in MathCheck (causing ranking shifts among 9 frontier models invisible to single-model eval), 47% error rate in auto-generated paraphrases, wide SCR variation (50-82%) despite narrow accuracy range (86-96%), and the No-Free-Benchmark corollary that any target ranking over the 9 models is realizable by some weighting over paraphrase families since no model Pareto-dominates all families. It supplies SCR, Cochran's Q, and FormInvSelector as tools, with replication at 100% recall on external benchmarks.

Significance. If the protocol and corollary hold, the work identifies a previously invisible measurement gap in benchmark robustness and shows that paraphrase-family selection implicitly determines model rankings, with direct implications for how mathematical reasoning benchmarks are constructed and interpreted. The low-cost audit protocol and per-theorem invariance metrics are potentially useful additions if validated.

major comments (3)
  1. [audit protocol description] The audit protocol section: cross-model unanimity (>=3/4 for MathCheck, >=6/9 for primary eval) is used to label paraphrases as semantically incorrect and to compute the 3.1% error rate, 47% auto-paraphrase error rate, and the performance vectors underlying the No-Free-Benchmark corollary, yet no validation against human semantic equivalence judgments or independent ground truth is reported; correlated model errors could produce unanimous disagreement on truly equivalent paraphrases, rendering the reported percentages and corollary dependent on an untested assumption.
  2. [No-Free-Benchmark corollary] The No-Free-Benchmark corollary (abstract and final section): the claim that any target ranking is realizable by weighting over paraphrase families is presented as following directly from the observation that no model Pareto-dominates all families, but no derivation, proof sketch, or formal argument is supplied showing how the (potentially mislabeled) performance vectors imply existence of such weightings for arbitrary targets.
  3. [results on ranking changes and SCR] Results on ranking shifts and SCR spread: the reported drops (GPT-4o from rank 2 to 4) and 32-point SCR gap are load-bearing for the central measurement-gap claim, yet no error-bar analysis, sensitivity to the exact unanimity threshold, or ablation on the 366-811 items is provided to establish robustness of these quantities.
minor comments (2)
  1. [abstract] The abstract states 'replicated on external benchmarks at 100% recall' without specifying which external benchmarks or the exact replication procedure.
  2. [methods] Notation for SCR and Cochran's Q is introduced without an explicit equation or definition in the provided abstract, making it hard to assess how per-theorem invariance is quantified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [audit protocol description] The audit protocol section: cross-model unanimity (>=3/4 for MathCheck, >=6/9 for primary eval) is used to label paraphrases as semantically incorrect and to compute the 3.1% error rate, 47% auto-paraphrase error rate, and the performance vectors underlying the No-Free-Benchmark corollary, yet no validation against human semantic equivalence judgments or independent ground truth is reported; correlated model errors could produce unanimous disagreement on truly equivalent paraphrases, rendering the reported percentages and corollary dependent on an untested assumption.

    Authors: The FormInv protocol employs cross-model unanimity as an automated, low-cost proxy for detecting semantic non-invariance, with the manuscript explicitly reporting 100% recall replication on external benchmarks as empirical support. We acknowledge that this does not include direct human semantic equivalence judgments for the primary MathCheck audit and that correlated model errors remain a theoretical possibility. We will revise the manuscript to add an explicit limitations subsection discussing this assumption and outlining future human validation work. revision: partial

  2. Referee: [No-Free-Benchmark corollary] The No-Free-Benchmark corollary (abstract and final section): the claim that any target ranking is realizable by weighting over paraphrase families is presented as following directly from the observation that no model Pareto-dominates all families, but no derivation, proof sketch, or formal argument is supplied showing how the (potentially mislabeled) performance vectors imply existence of such weightings for arbitrary targets.

    Authors: We agree a formal derivation is needed for clarity. The claim rests on the observation that the models' performance vectors across paraphrase families have no single model dominating all families; this implies that suitable non-negative weightings over families can realize arbitrary target rankings via convex combination. We will add a concise proof sketch deriving the existence result from the no-Pareto-dominance condition in the revised manuscript. revision: yes

  3. Referee: [results on ranking changes and SCR] Results on ranking shifts and SCR spread: the reported drops (GPT-4o from rank 2 to 4) and 32-point SCR gap are load-bearing for the central measurement-gap claim, yet no error-bar analysis, sensitivity to the exact unanimity threshold, or ablation on the 366-811 items is provided to establish robustness of these quantities.

    Authors: We agree that robustness analyses would strengthen the presentation of the ranking shifts and SCR spread. In the revision we will add bootstrapped error bars on the reported quantities, sensitivity analysis across unanimity thresholds (e.g., 5/9 and 7/9), and ablations on item subsets within the 366-811 range to confirm stability of the GPT-4o rank change and the 32-point SCR gap. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's No-Free-Benchmark corollary is stated as following directly from the empirical observation that no model Pareto-dominates all paraphrase families in the collected data. No quoted step reduces any claimed result to its inputs by definition, renames a fitted parameter as a prediction, or relies on a load-bearing self-citation chain. The invariance measures (SCR, Cochran's Q), the audit protocol, and the ranking changes are defined and computed from the performance vectors on the identified families without circular reduction. The derivation is self-contained relative to the 9-model evaluation on 366-811 items.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that cross-model unanimity detects semantic errors and on the unstated premise that the sampled paraphrase families are representative enough to support the corollary.

axioms (1)
  • domain assumption Cross-model unanimity (>=3/4 or >=6/9 models) reliably flags semantically incorrect paraphrases
    Used to detect the 4 incorrect paraphrases in MathCheck and the 47 percent error rate in the authors' dataset

pith-pipeline@v0.9.1-grok · 5815 in / 1320 out tokens · 37393 ms · 2026-06-29T13:25:18.268677+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages

  1. [1]

    de Zarz `a, I., de Curt `o, J., Cabot, J., Manzoni, P., and Calafate, C

    doi: 10.2307/2346806. de Zarz `a, I., de Curt `o, J., Cabot, J., Manzoni, P., and Calafate, C. T. Semantic invariance in agentic AI. InIn- ternational Conference on Agents and Multi-Agent Sys- tems: Technologies and Applications (AMSTA),

  2. [2]

    doi: 10.18653/v1/2020.acl-main

  3. [3]

    doi: 10.1177/109442810031002. Yang, Y . and Wang, W. Y . The benchmark illusion: On the incoherence of machine learning evaluation,

  4. [4]

    Zhou, Y ., Zhu, Y ., Antognini, D., Kim, Y ., and Zhang, Y

    ICLR 2026 Poster. Zhou, Y ., Zhu, Y ., Antognini, D., Kim, Y ., and Zhang, Y . Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2024a. Zhou, Z., Liu, S., Ning, M., Liu, W., Wang, J., Wong, D. F., H...

  5. [5]

    adividesm+nexactly when it divides each sum- mand

    PU:30 minutes/day, models compute10×30×7 = 2100without unit conversion. All 4 models wrong. Group 27 (sub-question redirection, severity: moderate-high).Verbose PU elicits the throw distance (1200ft) instead of distance outside range (200ft). 3/4 models wrong (GPT-4o correct by format). 12 FormInv: Semantic Invariance in Mathematical Reasoning Benchmarks ...

  6. [6]

    For independent evaluators with individual accuracya:P(truth=v|all9votev) =a 9/(a9 + (1−a) 9)≈99.99%ata= 0.85

    generalizes this to heterogeneous, correlated evaluators via EM inference on competence parameters. For independent evaluators with individual accuracya:P(truth=v|all9votev) =a 9/(a9 + (1−a) 9)≈99.99%ata= 0.85. The independence assumption is the key caveat: LLM errors are significantly correlated across models (Kim et al., 2025). We use mod- els from 5 pr...