pith. machine review for the scientific record.

arxiv: 2604.16421 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Measuring Representation Robustness in Large Language Models for Geometry

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords representation robustness · large language models · geometric reasoning · Invariance@3 · GeoRepEval · mathematical benchmarks · prompting interventions · vector representations

The pith

LLMs show accuracy gaps of up to 14 points on identical geometry problems when only the representation changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models perform abstract geometric reasoning or instead depend on the specific format in which a problem is presented. It creates 158 high-school geometry problems, each rewritten in Euclidean, coordinate, and vector forms, then measures how often eleven models solve each problem correctly across all three versions. The evaluation uses strict matching, statistical tests, and a new Invariance@3 metric that separates robust performance from fragile performance tied to one representation. Vector versions prove consistently hardest, and a prompting step that asks the model to convert the problem first recovers accuracy for stronger models but not weaker ones. These patterns indicate that models exploit format-specific cues rather than operating on the underlying geometry.
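
To make the measurement concrete, here is a minimal sketch of how per-representation accuracy, an Invariance@3-style score, and the flip patterns shown later in Figure 3 could be tallied from problem-level correctness records. The record layout and the exact definition of Invariance@3 (correct under all three representations) are assumptions inferred from the description above, not the paper's released scripts.

```python
from collections import Counter

# Hypothetical record layout: one correctness flag per representation per problem.
REPS = ("euclidean", "coordinate", "vector")  # pattern order E, C, V

def evaluate(records):
    """Tally per-representation accuracy, Invariance@3, and flip patterns (assumed definitions)."""
    n = len(records)
    acc = {r: sum(rec[r] for rec in records) / n for r in REPS}
    # Assumed Invariance@3: fraction of problems solved under all three representations.
    inv_at_3 = sum(all(rec[r] for r in REPS) for rec in records) / n
    # Flip patterns such as "CWC" = correct on Euclidean, wrong on coordinate, correct on vector.
    patterns = Counter("".join("C" if rec[r] else "W" for r in REPS) for rec in records)
    return acc, inv_at_3, patterns

# Three toy problems, just to show the shapes of the outputs.
toy = [
    {"euclidean": True,  "coordinate": True,  "vector": True},
    {"euclidean": True,  "coordinate": True,  "vector": False},
    {"euclidean": False, "coordinate": False, "vector": False},
]
acc, inv, pats = evaluate(toy)
print(acc, inv, dict(pats))  # Invariance@3 never exceeds the weakest representation's accuracy
```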

Core claim

Current large language models rely on representation-specific heuristics rather than abstract geometric reasoning. Accuracy differences of up to 14 percentage points arise solely from switching between Euclidean, coordinate, and vector formulations of the same 158 problems. Vector representations produce the lowest Invariance@3 scores even after regression controls for length and symbolic complexity. A convert-then-solve prompt raises vector accuracy by as much as 52 points in high-capacity models, while low-capacity models show no gain.
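
For illustration, a convert-then-solve prompt might look like the sketch below. This is a hypothetical template consistent with the paper's description of the intervention, not the released prompt; the placeholder names, wording, and example problem are assumptions.

```python
# Hypothetical convert-then-solve (CTS) template; the released prompts may differ
# in wording and in the strict answer-format constraints used for matching.
CTS_PROMPT = """You are given a geometry problem stated in {source_rep} form.

Step 1: Rewrite the problem in {target_rep} form, preserving every given quantity.
Step 2: Solve the rewritten problem, showing your reasoning.
Step 3: On the last line, output only the final numeric answer.

Problem:
{problem}
"""

prompt = CTS_PROMPT.format(
    source_rep="vector",
    target_rep="coordinate",
    problem="Vectors a and b satisfy |a| = 3, |b| = 4, and a . b = 0. Find |a - b|.",
)
print(prompt)
```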

What carries the argument

The Invariance@3 metric within the GeoRepEval framework, which decomposes accuracy into robust and fragile components and is bounded by the weakest representation across the three parallel formulations.
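
One plausible formalisation of that decomposition, written here with assumed notation rather than the paper's own:

```latex
% Let c_{i,r} \in \{0,1\} be correctness of problem i under representation r \in \{E, C, V\}.
\mathrm{Acc}_r = \frac{1}{N}\sum_{i=1}^{N} c_{i,r},
\qquad
\mathrm{Inv@3} = \frac{1}{N}\sum_{i=1}^{N} \prod_{r \in \{E,C,V\}} c_{i,r}.
% Each representation's accuracy then splits into a robust part (shared by all three
% formulations) and a fragile part specific to r:
\mathrm{Acc}_r
  = \mathrm{Inv@3}
  + \underbrace{\frac{1}{N}\sum_{i=1}^{N} c_{i,r}\Bigl(1 - \prod_{s \in \{E,C,V\}} c_{i,s}\Bigr)}_{\text{fragile component of } r},
% and because the fragile term is non-negative,
\mathrm{Inv@3} \le \min_{r} \mathrm{Acc}_r .
```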

If this is right

  • Vector formulations remain the consistent failure point even after statistical controls for surface features.
  • A convert-then-solve prompt recovers substantial accuracy for high-capacity models but leaves low-capacity models unchanged.
  • Representation choice alone can create measurable performance gaps independent of intrinsic problem difficulty.
  • Benchmarks that fix one representation will systematically overestimate reasoning robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training regimes that expose models to multiple representations of the same concept could reduce format dependence.
  • Current evaluation practices that test only one format risk overstating true geometric capability.
  • The gap between high- and low-capacity models after the conversion prompt suggests a threshold effect in representation handling.

Load-bearing premise

The 158 problems remain mathematically equivalent across Euclidean, coordinate, and vector representations, and the regression controls fully isolate representation effects from length and symbolic complexity.
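
A minimal sketch, on simulated data, of the kind of control this premise presupposes: correctness regressed on representation dummies plus length and symbolic-complexity covariates. The column names, simulated values, and logistic specification are illustrative assumptions, not the paper's released analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 474  # 158 problems x 3 representations, matching the paper's instance count

# Simulated long-format data: one row per (problem, representation) instance.
rep = np.tile(["euclid", "coord", "vector"], n // 3)
length = rng.normal(140, 30, n)   # prompt-length proxy
symbols = rng.poisson(8, n)       # symbolic-complexity proxy
# Toy outcome with a vector-representation penalty on top of surface effects.
lin = 1.5 - 0.9 * (rep == "vector") - 0.01 * (length - 140) - 0.05 * (symbols - 8)
correct = rng.binomial(1, 1 / (1 + np.exp(-lin)))

df = pd.DataFrame({"correct": correct, "representation": rep,
                   "length": length, "symbol_count": symbols})

# Does representation still predict correctness once surface covariates are controlled for?
model = smf.logit(
    "correct ~ C(representation, Treatment(reference='euclid')) + length + symbol_count",
    data=df,
).fit(disp=False)
print(model.summary())
```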

What would settle it

An experiment in which at least one model achieves identical accuracy on every problem across all three representations while maintaining high overall performance would falsify the claim that failures reflect representation sensitivity.

Figures

Figures reproduced from arXiv: 2604.16421 by Ankan Pal, Dhruv Kumar, Murari Mandal, Vedant Jawandhia, Yash Sinha.

Figure 1
Figure 1: GeoRepEval pipeline overview. The framework natively constructs and tracks mathematically equivalent variants (Euclidean, Coordinate, and Vector) through parallel LLM inference to isolate true reasoning capacity from representation sensitivity. view at source ↗
Figure 2
Figure 2: Accuracy by geometry representation across models. Each group shows performance under Euclidean, Coordinate, and Vector formulations of the same problems. view at source ↗
Figure 3
Figure 3: Representation-flip patterns. Stacked bars show problem-level correctness patterns (E, C, V) for each model. view at source ↗
Figure 4
Figure 4: Pairwise transfer rates. Heatmap of how often two representations succeed when the third fails. view at source ↗
Figure 5
Figure 5: Geometry sensitivity across models. Representation-wise accuracy variation for each evaluated model under Euclidean, Coordinate, and Vector formulations, illustrating differential robustness to problem representation. view at source ↗
Figure 6
Figure 6: Convert-then-solve accuracy by representation. Accuracy under the CTS intervention for each model across Euclidean, Coordinate, and Vector formulations. view at source ↗
Figure 7
Figure 7: Representation-flip patterns under convert-then-solve. Stacked bars show problem-level correctness patterns (E, C, V) for each model after CTS prompting. view at source ↗
read the original abstract

Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
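
As a rough illustration of the statistical machinery the abstract names, the sketch below shows how a paired McNemar test between two representations and a percentile-bootstrap confidence interval for Invariance@3 could be computed. It reuses the hypothetical record layout from the earlier sketch; the replicate count and the exact-test setting are assumptions, since the paper's choices are not restated here.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
REPS = ("euclidean", "coordinate", "vector")

def mcnemar_between(records, rep_a, rep_b):
    """Paired McNemar test on per-problem correctness for two representations."""
    both = sum(r[rep_a] and r[rep_b] for r in records)
    only_a = sum(r[rep_a] and not r[rep_b] for r in records)
    only_b = sum(r[rep_b] and not r[rep_a] for r in records)
    neither = sum(not r[rep_a] and not r[rep_b] for r in records)
    # The test depends only on the discordant cells (only_a, only_b).
    return mcnemar([[both, only_a], [only_b, neither]], exact=True)

def bootstrap_inv3_ci(records, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for Invariance@3; the replicate count here is an assumption."""
    flags = np.array([all(r[k] for k in REPS) for r in records], dtype=float)
    stats = [rng.choice(flags, size=len(flags), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return flags.mean(), (lo, hi)
```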

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeoRepEval, a representation-aware evaluation framework for LLMs on geometry problems. It curates 158 high-school geometry problems expressed in three parallel forms (Euclidean, coordinate, vector; 474 instances total) and evaluates eleven LLMs using strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for length and symbolic complexity. The central results are accuracy gaps of up to 14 percentage points attributable to representation choice alone, with vector formulations as a consistent failure mode (Invariance@3 as low as 0.044), and a convert-then-solve prompting intervention that improves vector accuracy by up to 52pp for high-capacity models but not low-capacity ones. The authors conclude that current models rely on representation-specific heuristics rather than abstract geometric reasoning and release all data, prompts, and code.

Significance. If the equivalence of the 158 problems and the adequacy of the regression controls are confirmed, the work provides a concrete, falsifiable demonstration that representation sensitivity is a load-bearing limitation in LLM mathematical reasoning. The decomposition of accuracy into robust and fragile components via the Invariance@3 metric, the use of paired statistical tests, and the release of reproducible artifacts are strengths that would make the findings useful for benchmark design and for distinguishing surface-form sensitivity from deeper reasoning deficits.

major comments (2)
  1. [§4.3] §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.
  2. [§3.2] §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.
minor comments (2)
  1. [Table 2] Table 2: The bootstrap confidence intervals for Invariance@3 are reported but the number of bootstrap replicates and the exact resampling procedure are not stated, making it difficult to assess precision.
  2. [Figure 3] Figure 3: The convert-then-solve results would benefit from an additional baseline that applies the same prompt template without the conversion step, to isolate the effect of the intervention from prompt length.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our controls and curation process that warrant additional verification. We address each major comment below and have revised the manuscript to incorporate the suggested checks.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.

    Authors: We agree that the original regression may leave room for unmeasured confounds. In the revised manuscript we have added residual plots and diagnostics for the main regression models, plus two new predictors: (i) estimated operation count derived from canonical solution traces and (ii) a term-rarity proxy computed against a large mathematics corpus. After these controls the representation coefficient remains statistically significant (p < 0.01) and the accuracy gaps stay in the 12–14 pp range. Updated results, residual figures, and the expanded regression table appear in §4.3 and the appendix. revision: yes

  2. Referee: [§3.2] §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.

    Authors: We accept that explicit quantitative verification is needed. Problems were generated by systematic, structure-preserving transformations from the Euclidean source statements. In the revision we now report: average canonical solution lengths of 4.2 / 4.1 / 4.3 steps (Euclidean / coordinate / vector) with no significant difference (paired t-test p = 0.72); and independent equivalence ratings by two geometry experts on a random sample of 30 problems (Cohen’s κ = 0.93). These statistics and the rating protocol are added to §3.2 together with a supplementary table. The observed gaps therefore cannot be explained by differential intrinsic difficulty. revision: yes
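
The checks described in this second response are straightforward to reproduce in outline. The sketch below shows how a paired t-test on canonical step counts and an inter-rater Cohen's kappa might be computed; the step counts and ratings are placeholder values, not the authors' data.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

# Placeholder per-problem canonical solution step counts for two representations.
euclid_steps = np.array([4, 5, 3, 4, 5, 4])
vector_steps = np.array([4, 5, 4, 4, 5, 4])
t_stat, p_val = ttest_rel(euclid_steps, vector_steps)  # paired: same problems, two forms
print(f"paired t = {t_stat:.2f}, p = {p_val:.3f}")

# Placeholder expert equivalence ratings on a sampled subset (1 = equivalent, 0 = not).
rater_a = [1, 1, 1, 0, 1, 1, 1, 1]
rater_b = [1, 1, 1, 0, 1, 1, 1, 0]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))
```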

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation with independent metrics and released data

full rationale

The paper is a purely empirical evaluation that introduces a new benchmark (GeoRepEval) with 158 problems across three representations, applies standard statistical tools (bootstrap CIs, McNemar tests, regression controls for length/symbolic complexity), and defines Invariance@3 as a decomposition metric whose properties are stated as proven. No equations reduce reported accuracies or invariance scores to quantities fitted from the same data; no self-citations are load-bearing for the central claim; and the study releases all datasets, prompts, and scripts. The derivation chain is therefore self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the parallel problem formulations are mathematically equivalent and that statistical controls isolate representation effects; the new Invariance@3 metric is introduced without external validation.

axioms (1)
  • domain assumption The 158 high-school geometry problems have identical solutions across Euclidean, coordinate, and vector representations.
    Required for the invariance metric and accuracy-gap claims to be interpretable.
invented entities (1)
  • Invariance@3 metric no independent evidence
    purpose: Decomposes model accuracy into robust and fragile components and is bounded by the weakest representation.
    Newly defined in the paper to quantify representation robustness.

pith-pipeline@v0.9.0 · 5569 in / 1322 out tokens · 53156 ms · 2026-05-13T19:32:31.916260+00:00 · methodology

discussion (0)

