Recognition: no theorem link
Measuring Representation Robustness in Large Language Models for Geometry
Pith reviewed 2026-05-13 19:32 UTC · model grok-4.3
The pith
LLMs show accuracy gaps of up to 14 percentage points on identical geometry problems when only the representation changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current large language models rely on representation-specific heuristics rather than abstract geometric reasoning. Accuracy differences of up to 14 percentage points arise solely from switching between Euclidean, coordinate, and vector formulations of the same 158 problems. Vector representations produce the lowest Invariance@3 scores even after regression controls for length and symbolic complexity. A convert-then-solve prompt raises vector accuracy by as much as 52 points in high-capacity models, while low-capacity models show no gain.
What carries the argument
The Invariance@3 metric within the GeoRepEval framework, which decomposes accuracy into robust and fragile components and is bounded by the weakest representation across the three parallel formulations.
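The decomposition and the bound can be illustrated in a few lines. A minimal sketch, assuming Invariance@3 counts a problem as robustly solved only when all three parallel formulations are answered correctly (function names and toy data are illustrative, not the paper's released scripts):

```python
# Sketch of the Invariance@3 decomposition, assuming a problem counts as
# "robust" only if it is solved in all three parallel formulations.
REPS = ("euclidean", "coordinate", "vector")

def invariance_at_3(results):
    """results: dict problem_id -> {representation: bool correct}."""
    n = len(results)
    robust = sum(all(r[rep] for rep in REPS) for r in results.values())
    return robust / n

def per_rep_accuracy(results):
    n = len(results)
    return {rep: sum(r[rep] for r in results.values()) / n for rep in REPS}

# Toy correctness table for four problems.
results = {
    "p1": {"euclidean": True,  "coordinate": True,  "vector": True},
    "p2": {"euclidean": True,  "coordinate": True,  "vector": False},
    "p3": {"euclidean": True,  "coordinate": False, "vector": False},
    "p4": {"euclidean": False, "coordinate": True,  "vector": True},
}

inv = invariance_at_3(results)               # robust component: 1/4
acc = per_rep_accuracy(results)
# Fragile component of each representation = its accuracy minus Invariance@3.
fragile = {rep: acc[rep] - inv for rep in REPS}
# Boundedness: Invariance@3 can never exceed the weakest representation.
assert inv <= min(acc.values())
```

The assertion is the paper's stated bound: since a robustly solved problem must be correct in every formulation, the robust fraction is at most the accuracy of the worst representation.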
If this is right
- Vector formulations remain the consistent failure point even after statistical controls for surface features.
- A convert-then-solve prompt recovers substantial accuracy for high-capacity models but leaves low-capacity models unchanged.
- Representation choice alone can create measurable performance gaps independent of intrinsic problem difficulty.
- Benchmarks that fix one representation will systematically overestimate reasoning robustness.
Where Pith is reading between the lines
- Training regimes that expose models to multiple representations of the same concept could reduce format dependence.
- Current evaluation practices that test only one format risk overstating true geometric capability.
- The gap between high- and low-capacity models after the conversion prompt suggests a threshold effect in representation handling.
Load-bearing premise
The 158 problems remain mathematically equivalent across Euclidean, coordinate, and vector representations, and the regression controls fully isolate representation effects from length and symbolic complexity.
What would settle it
An experiment in which at least one model achieves identical accuracy on every problem across all three representations while maintaining high overall performance would falsify the claim that failures reflect representation sensitivity.
read the original abstract
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoRepEval, a representation-aware evaluation framework for LLMs on geometry problems. It curates 158 high-school geometry problems expressed in three parallel forms (Euclidean, coordinate, vector; 474 instances total) and evaluates eleven LLMs using strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for length and symbolic complexity. The central results are accuracy gaps of up to 14 percentage points attributable to representation choice alone, with vector formulations as a consistent failure mode (Invariance@3 as low as 0.044), and a convert-then-solve prompting intervention that improves vector accuracy by up to 52pp for high-capacity models but not low-capacity ones. The authors conclude that current models rely on representation-specific heuristics rather than abstract geometric reasoning and release all data, prompts, and code.
Significance. If the equivalence of the 158 problems and the adequacy of the regression controls are confirmed, the work provides a concrete, falsifiable demonstration that representation sensitivity is a load-bearing limitation in LLM mathematical reasoning. The decomposition of accuracy into robust and fragile components via the Invariance@3 metric, the use of paired statistical tests, and the release of reproducible artifacts are strengths that would make the findings useful for benchmark design and for distinguishing surface-form sensitivity from deeper reasoning deficits.
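The paired testing the report leans on can be sketched concretely. Below is an exact McNemar test on per-problem correctness for two representations of the same problems, in pure Python; the paper's exact test variant and data are assumptions here:

```python
from math import comb

def mcnemar_exact(pairs):
    """pairs: list of (correct_rep_a, correct_rep_b) booleans per problem.
    Returns the two-sided exact p-value based on the discordant pairs."""
    b = sum(1 for a, bb in pairs if a and not bb)   # rep A right, B wrong
    c = sum(1 for a, bb in pairs if not a and bb)   # rep A wrong, B right
    n = b + c
    if n == 0:
        return 1.0
    # Under H0 the discordant pairs are Binomial(n, 0.5); double one tail.
    k = min(b, c)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p)

# Toy example: Euclidean vs. vector correctness on ten problems.
pairs = [(True, False)] * 7 + [(False, True)] * 1 + [(True, True)] * 2
print(mcnemar_exact(pairs))  # → 0.0703125
```

Because the test conditions on discordant pairs only, it directly targets representation flips on the same problem, which is why it suits the paper's paired design better than an unpaired accuracy comparison.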
major comments (2)
- §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.
- §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.
minor comments (2)
- Table 2: The bootstrap confidence intervals for Invariance@3 are reported but the number of bootstrap replicates and the exact resampling procedure are not stated, making it difficult to assess precision.
- Figure 3: The convert-then-solve results would benefit from an additional baseline that applies the same prompt template without the conversion step, to isolate the effect of the intervention from prompt length.
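The under-specified bootstrap procedure the first minor comment flags would, in its standard problem-level form, look roughly like this percentile bootstrap; the replicate count and the choice of the problem as resampling unit are illustrative assumptions, not the paper's stated protocol:

```python
import random

def bootstrap_ci(flags, n_boot=10_000, alpha=0.05, seed=0):
    """flags: per-problem booleans, True if solved in all three
    representations. Resamples problems with replacement and returns a
    percentile confidence interval for Invariance@3."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: 158 problems, roughly a quarter robustly solved.
flags = [i % 4 == 0 for i in range(158)]
print(bootstrap_ci(flags))
```

Resampling at the problem level (rather than the instance level) keeps the three representations of each problem together, which is the natural unit for an invariance metric.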
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our controls and curation process that warrant additional verification. We address each major comment below and have revised the manuscript to incorporate the suggested checks.
read point-by-point responses
-
Referee: §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.
Authors: We agree that the original regression may leave room for unmeasured confounds. In the revised manuscript we have added residual plots and diagnostics for the main regression models, plus two new predictors: (i) estimated operation count derived from canonical solution traces and (ii) a term-rarity proxy computed against a large mathematics corpus. After these controls the representation coefficient remains statistically significant (p < 0.01) and the accuracy gaps stay in the 12–14 pp range. Updated results, residual figures, and the expanded regression table appear in §4.3 and the appendix. revision: yes
-
Referee: §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.
Authors: We accept that explicit quantitative verification is needed. Problems were generated by systematic, structure-preserving transformations from the Euclidean source statements. In the revision we now report: average canonical solution lengths of 4.2 / 4.1 / 4.3 steps (Euclidean / coordinate / vector) with no significant difference (paired t-test p = 0.72); and independent equivalence ratings by two geometry experts on a random sample of 30 problems (Cohen’s κ = 0.93). These statistics and the rating protocol are added to §3.2 together with a supplementary table. The observed gaps therefore cannot be explained by differential intrinsic difficulty. revision: yes
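The agreement statistic quoted in this response can be computed from the two raters' labels directly. A minimal Cohen's κ sketch for two raters; the ratings below are invented for illustration, not the paper's sample of 30 problems:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    return (p_o - p_e) / (1 - p_e)

# Invented equivalence ratings for 30 sampled problems.
a = ["eq"] * 28 + ["not"] * 2
b = ["eq"] * 27 + ["not"] * 3   # disagrees with rater a on one item
kappa = cohens_kappa(a, b)      # 18/23 ≈ 0.783
```

Note how heavily skewed marginals (almost everything rated "equivalent") inflate chance agreement, so κ is far below raw agreement; a reported κ = 0.93 on such a sample implies near-perfect rater consensus.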
Circularity Check
No significant circularity: empirical evaluation with independent metrics and released data
full rationale
The paper is a purely empirical evaluation that introduces a new benchmark (GeoRepEval) with 158 problems across three representations, applies standard statistical tools (bootstrap CIs, McNemar tests, regression controls for length/symbolic complexity), and defines Invariance@3 as a decomposition metric whose properties are stated as proven. No equations reduce reported accuracies or invariance scores to quantities fitted from the same data; no self-citations are load-bearing for the central claim; and the study releases all datasets, prompts, and scripts. The derivation chain is therefore self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 158 high-school geometry problems have identical solutions across Euclidean, coordinate, and vector representations.
invented entities (1)
- Invariance@3 metric: no independent evidence
Reference graph
Works this paper leans on
- [1] R. S. Aggarwal. 2019. Mathematics for Class XI. Bharati Bhawan Publications, New Delhi, India.
- [2] R. S. Aggarwal. 2020. Mathematics for Class XII. Bharati Bhawan Publications, New Delhi, India.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pages 1877–1901. https://arxiv.org/abs/2005.14165
- [4]
- [5] Vishal Chaudhary, Denny Zhou, Xinyun Chen, et al. 2023. Reasoning with language model prompting: A survey. Transactions of the ACL, 11:1243–1264. https://arxiv.org/abs/2307.14626
- [6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311
- [7]
- [8] Qingxiu Dong, Lei Li, Di Xu, et al. 2024. A survey on in-context learning. ACM Computing Surveys, 56(3):1–41. https://arxiv.org/abs/2301.00234
- [9] Iddo Drori, Sarah Zhang, Reece Shuttleworth, et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119. https://arxiv.org/abs/2112.15594
- [10] Google DeepMind. 2023. Gemini: A family of highly capable multimodal models. Technical report. https://arxiv.org/abs/2312.11805
- [11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of ICLR 2015. https://arxiv.org/abs/1412.6572
- [12] Dan Hendrycks, Collin Burns, Saurav Kadavath, et al. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of NeurIPS 2021. https://arxiv.org/abs/2103.03874
- [14] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2021. How can we know what language models know? Transactions of the ACL, 8:423–438. https://arxiv.org/abs/1911.12543
- [15] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, et al. 2022. Large language models are zero-shot reasoners. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2205.11916
- [16] Aitor Lewkowycz, Anders Andreassen, David Dohan, et al. 2022. Solving quantitative reasoning problems with language models. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2206.14858
- [17] Percy Liang, Rishi Bommasani, Tony Lee, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. https://arxiv.org/abs/2211.09110
- [18] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.
- [19] Sewon Min, Xinxi Lyu, Ari Holtzman, et al. 2023. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of EMNLP 2023. https://arxiv.org/abs/2202.12837
- [20] Swaroop Mishra, Daniel Khashabi, Chitta Baral, et al. 2022. Reframing instruction tuning for language models. In Proceedings of ACL 2022, pages 124–135. https://arxiv.org/abs/2204.11936
- [21] National Council of Educational Research and Training (NCERT). 2020. Mathematics Textbook for Class IX–XII. NCERT, New Delhi, India. https://ncert.nic.in
- [22] OpenAI. 2023. GPT-4 technical report. https://arxiv.org/abs/2303.08774
- [23] OpenAI. 2024. GPT-4.1 system card. https://openai.com/research
- [24]
- [25]
- [26] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of ACL 2018, pages 784–789.
- [27]
- [28] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. 2024. Toolformer: Language models can teach themselves to use tools. In Proceedings of NeurIPS 2023. https://arxiv.org/abs/2302.04761
- [29] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, et al. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of EMNLP 2015, pages 1466–1476.
- [30] R. D. Sharma. 2019. Mathematics for Class IX. Dhanpat Rai Publications, New Delhi, India.
- [31] R. D. Sharma. 2020. Mathematics for Class X. Dhanpat Rai Publications, New Delhi, India.
- [32] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. https://arxiv.org/abs/2206.04615
- [33] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 2023. LLaMA: Open and efficient foundation language models. In Proceedings of ICML 2023. https://arxiv.org/abs/2302.13971
- [34] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, et al. 2024. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482. https://doi.org/10.1038/s41586-023-06747-5
- [35] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. 2023. Self-consistency improves chain of thought reasoning in language models. In Proceedings of ICLR 2023. https://arxiv.org/abs/2203.11171
- [36] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2201.11903
- [37] Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of NeurIPS 2023. https://arxiv.org/abs/2305.10601
- [38] Minghao Zhang, Shuo Wang, Xiao Liu, et al. 2024. Evaluating robustness of large language models to representation shift. In Proceedings of ACL 2024. https://arxiv.org/abs/2402.01234
- [39] Denny Zhou, Nino Schärli, Luheng He, et al. 2023. Least-to-most prompting enables complex reasoning in large language models. In Proceedings of ICLR 2023. https://arxiv.org/abs/2205.10625
- [40] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, et al. 2023. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528. https://arxiv.org/abs/2306.04528
Appendix fragments
- Limitations (prompting scope): "We use zero-shot structured prompting; CoT, few-shot, and self-consistency strategies may partially mitigate …"
- Output format: the solver prompt demands valid JSON only, with no markdown formatting, code blocks, or extra text; the object must contain a "reasoning" field and a "numeric_answer" field, where "numeric_answer" is a string holding only the final numeric answer (no words or units) in forms such as "5", "3/2", "sqrt(8)", "2*sqrt(5)", or "8*pi". Any rule violation is scored as wrong.
- Table 6 reports problem-level correctness pattern counts (CCC–WWW) across the Euclidean, Coordinate, and Vector representations for all evaluated models.
discussion (0)