Recognition: no theorem link
Measuring Representation Robustness in Large Language Models for Geometry
Pith reviewed 2026-05-13 19:32 UTC · model grok-4.3
The pith
LLMs show accuracy gaps of up to 14 percentage points on identical geometry problems when only the representation changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current large language models rely on representation-specific heuristics rather than abstract geometric reasoning. Accuracy differences of up to 14 percentage points arise solely from switching between Euclidean, coordinate, and vector formulations of the same 158 problems. Vector representations produce the lowest Invariance@3 scores even after regression controls for length and symbolic complexity. A convert-then-solve prompt raises vector accuracy by as much as 52 points in high-capacity models, while low-capacity models show no gain.
What carries the argument
The Invariance@3 metric within the GeoRepEval framework, which decomposes accuracy into robust and fragile components and is bounded by the weakest representation across the three parallel formulations.
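The decomposition and the bound can be illustrated in a few lines. A minimal sketch, assuming Invariance@3 counts a problem as robustly solved only when all three parallel formulations are answered correctly (function names and toy data are illustrative, not the paper's released scripts):

```python
# Sketch of the Invariance@3 decomposition, assuming a problem counts as
# "robust" only if it is solved in all three parallel formulations.
REPS = ("euclidean", "coordinate", "vector")

def invariance_at_3(results):
    """results: dict problem_id -> {representation: bool correct}."""
    n = len(results)
    robust = sum(all(r[rep] for rep in REPS) for r in results.values())
    return robust / n

def per_rep_accuracy(results):
    n = len(results)
    return {rep: sum(r[rep] for r in results.values()) / n for rep in REPS}

# Toy correctness table for four problems.
results = {
    "p1": {"euclidean": True,  "coordinate": True,  "vector": True},
    "p2": {"euclidean": True,  "coordinate": True,  "vector": False},
    "p3": {"euclidean": True,  "coordinate": False, "vector": False},
    "p4": {"euclidean": False, "coordinate": True,  "vector": True},
}

inv = invariance_at_3(results)               # robust component: 1/4
acc = per_rep_accuracy(results)
# Fragile component of each representation = its accuracy minus Invariance@3.
fragile = {rep: acc[rep] - inv for rep in REPS}
# Boundedness: Invariance@3 can never exceed the weakest representation.
assert inv <= min(acc.values())
```

The assertion is the paper's stated bound: since a robustly solved problem must be correct in every formulation, the robust fraction is at most the accuracy of the worst representation.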
If this is right
- Vector formulations remain the consistent failure point even after statistical controls for surface features.
- A convert-then-solve prompt recovers substantial accuracy for high-capacity models but leaves low-capacity models unchanged.
- Representation choice alone can create measurable performance gaps independent of intrinsic problem difficulty.
- Benchmarks that fix one representation will systematically overestimate reasoning robustness.
Where Pith is reading between the lines
- Training regimes that expose models to multiple representations of the same concept could reduce format dependence.
- Current evaluation practices that test only one format risk overstating true geometric capability.
- The gap between high- and low-capacity models after the conversion prompt suggests a threshold effect in representation handling.
Load-bearing premise
The 158 problems remain mathematically equivalent across Euclidean, coordinate, and vector representations, and the regression controls fully isolate representation effects from length and symbolic complexity.
What would settle it
An experiment in which at least one model achieves identical accuracy on every problem across all three representations while maintaining high overall performance would falsify the claim that failures reflect representation sensitivity.
read the original abstract
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoRepEval, a representation-aware evaluation framework for LLMs on geometry problems. It curates 158 high-school geometry problems expressed in three parallel forms (Euclidean, coordinate, vector; 474 instances total) and evaluates eleven LLMs using strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for length and symbolic complexity. The central results are accuracy gaps of up to 14 percentage points attributable to representation choice alone, with vector formulations as a consistent failure mode (Invariance@3 as low as 0.044), and a convert-then-solve prompting intervention that improves vector accuracy by up to 52pp for high-capacity models but not low-capacity ones. The authors conclude that current models rely on representation-specific heuristics rather than abstract geometric reasoning and release all data, prompts, and code.
Significance. If the equivalence of the 158 problems and the adequacy of the regression controls are confirmed, the work provides a concrete, falsifiable demonstration that representation sensitivity is a load-bearing limitation in LLM mathematical reasoning. The decomposition of accuracy into robust and fragile components via the Invariance@3 metric, the use of paired statistical tests, and the release of reproducible artifacts are strengths that would make the findings useful for benchmark design and for distinguishing surface-form sensitivity from deeper reasoning deficits.
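The paired testing the report leans on can be sketched concretely. Below is an exact McNemar test on per-problem correctness for two representations of the same problems, in pure Python; the paper's exact test variant and data are assumptions here:

```python
from math import comb

def mcnemar_exact(pairs):
    """pairs: list of (correct_rep_a, correct_rep_b) booleans per problem.
    Returns the two-sided exact p-value based on the discordant pairs."""
    b = sum(1 for a, bb in pairs if a and not bb)   # rep A right, B wrong
    c = sum(1 for a, bb in pairs if not a and bb)   # rep A wrong, B right
    n = b + c
    if n == 0:
        return 1.0
    # Under H0 the discordant pairs are Binomial(n, 0.5); double one tail.
    k = min(b, c)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p)

# Toy example: Euclidean vs. vector correctness on ten problems.
pairs = [(True, False)] * 7 + [(False, True)] * 1 + [(True, True)] * 2
print(mcnemar_exact(pairs))  # → 0.0703125
```

Because the test conditions on discordant pairs only, it directly targets representation flips on the same problem, which is why it suits the paper's paired design better than an unpaired accuracy comparison.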
major comments (2)
- §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.
- §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.
minor comments (2)
- Table 2: The bootstrap confidence intervals for Invariance@3 are reported but the number of bootstrap replicates and the exact resampling procedure are not stated, making it difficult to assess precision.
- Figure 3: The convert-then-solve results would benefit from an additional baseline that applies the same prompt template without the conversion step, to isolate the effect of the intervention from prompt length.
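The under-specified bootstrap procedure the first minor comment flags would, in its standard problem-level form, look roughly like this percentile bootstrap; the replicate count and the choice of the problem as resampling unit are illustrative assumptions, not the paper's stated protocol:

```python
import random

def bootstrap_ci(flags, n_boot=10_000, alpha=0.05, seed=0):
    """flags: per-problem booleans, True if solved in all three
    representations. Resamples problems with replacement and returns a
    percentile confidence interval for Invariance@3."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: 158 problems, roughly a quarter robustly solved.
flags = [i % 4 == 0 for i in range(158)]
print(bootstrap_ci(flags))
```

Resampling at the problem level (rather than the instance level) keeps the three representations of each problem together, which is the natural unit for an invariance metric.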
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our controls and curation process that warrant additional verification. We address each major comment below and have revised the manuscript to incorporate the suggested checks.
read point-by-point responses
-
Referee: §4.3 (Regression Analysis): The claim that representation gaps reflect heuristic reliance rather than abstract reasoning requires that the controls for length and symbolic complexity fully isolate representation effects. Vector formulations may still differ in unmeasured dimensions (implicit coordinate conversions, operation count, or training-data rarity) that are not captured by the reported covariates; without an explicit check (e.g., residual analysis or additional predictors), the 14pp gaps and Invariance@3=0.044 cannot be unambiguously attributed to representation sensitivity.
Authors: We agree that the original regression may leave room for unmeasured confounds. In the revised manuscript we have added residual plots and diagnostics for the main regression models, plus two new predictors: (i) estimated operation count derived from canonical solution traces and (ii) a term-rarity proxy computed against a large mathematics corpus. After these controls the representation coefficient remains statistically significant (p < 0.01) and the accuracy gaps stay in the 12–14 pp range. Updated results, residual figures, and the expanded regression table appear in §4.3 and the appendix. revision: yes
-
Referee: §3.2 (Problem Curation): The manuscript asserts that the 158 problems remain mathematically equivalent across representations, yet provides no quantitative verification (e.g., step-count equivalence, canonical solution length, or expert rating of equivalence). If vector versions systematically require more implicit steps, the observed performance differences could arise from genuine difficulty rather than heuristic dependence, undermining the central interpretation.
Authors: We accept that explicit quantitative verification is needed. Problems were generated by systematic, structure-preserving transformations from the Euclidean source statements. In the revision we now report: average canonical solution lengths of 4.2 / 4.1 / 4.3 steps (Euclidean / coordinate / vector) with no significant difference (paired t-test p = 0.72); and independent equivalence ratings by two geometry experts on a random sample of 30 problems (Cohen’s κ = 0.93). These statistics and the rating protocol are added to §3.2 together with a supplementary table. The observed gaps therefore cannot be explained by differential intrinsic difficulty. revision: yes
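The agreement statistic quoted in this response can be computed from the two raters' labels directly. A minimal Cohen's κ sketch for two raters; the ratings below are invented for illustration, not the paper's sample of 30 problems:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    return (p_o - p_e) / (1 - p_e)

# Invented equivalence ratings for 30 sampled problems.
a = ["eq"] * 28 + ["not"] * 2
b = ["eq"] * 27 + ["not"] * 3   # disagrees with rater a on one item
kappa = cohens_kappa(a, b)      # 18/23 ≈ 0.783
```

Note how heavily skewed marginals (almost everything rated "equivalent") inflate chance agreement, so κ is far below raw agreement; a reported κ = 0.93 on such a sample implies near-perfect rater consensus.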
Circularity Check
No significant circularity: empirical evaluation with independent metrics and released data
full rationale
The paper is a purely empirical evaluation that introduces a new benchmark (GeoRepEval) with 158 problems across three representations, applies standard statistical tools (bootstrap CIs, McNemar tests, regression controls for length/symbolic complexity), and defines Invariance@3 as a decomposition metric whose properties are stated as proven. No equations reduce reported accuracies or invariance scores to quantities fitted from the same data; no self-citations are load-bearing for the central claim; and the study releases all datasets, prompts, and scripts. The derivation chain is therefore self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 158 high-school geometry problems have identical solutions across Euclidean, coordinate, and vector representations.
invented entities (1)
- Invariance@3 metric: no independent evidence
Reference graph
Works this paper leans on
- [1] R. S. Aggarwal. 2019. Mathematics for Class XI. Bharati Bhawan Publications, New Delhi, India.
- [2] R. S. Aggarwal. 2020. Mathematics for Class XII. Bharati Bhawan Publications, New Delhi, India.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), pages 1877–1901. https://arxiv.org/abs/2005.14165
- [4]
- [5] Vishal Chaudhary, Denny Zhou, Xinyun Chen, et al. 2023. Reasoning with language model prompting: A survey. Transactions of the ACL, 11:1243–1264. https://arxiv.org/abs/2307.14626
- [6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311
- [7]
- [8] Qingxiu Dong, Lei Li, Di Xu, et al. 2024. A survey on in-context learning. ACM Computing Surveys, 56(3):1–41. https://arxiv.org/abs/2301.00234
- [9] Iddo Drori, Sarah Zhang, Reece Shuttleworth, et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119. https://arxiv.org/abs/2112.15594
- [10] Google DeepMind. 2023. Gemini: A family of highly capable multimodal models. Technical report. https://arxiv.org/abs/2312.11805
- [11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of ICLR 2015. https://arxiv.org/abs/1412.6572
- [12] Dan Hendrycks, Collin Burns, Saurav Kadavath, et al. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of NeurIPS 2021. https://arxiv.org/abs/2103.03874
- [14] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2021. How can we know what language models know? Transactions of the ACL, 8:423–438. https://arxiv.org/abs/1911.12543
- [15] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, et al. 2022. Large language models are zero-shot reasoners. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2205.11916
- [16] Aitor Lewkowycz, Anders Andreassen, David Dohan, et al. 2022. Solving quantitative reasoning problems with language models. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2206.14858
- [17] Percy Liang, Rishi Bommasani, Tony Lee, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. https://arxiv.org/abs/2211.09110
- [18] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.
- [19] Sewon Min, Xinxi Lyu, Ari Holtzman, et al. 2023. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of EMNLP 2023. https://arxiv.org/abs/2202.12837
- [20] Swaroop Mishra, Daniel Khashabi, Chitta Baral, et al. 2022. Reframing instruction tuning for language models. In Proceedings of ACL 2022, pages 124–135. https://arxiv.org/abs/2204.11936
- [21] National Council of Educational Research and Training (NCERT). 2020. Mathematics Textbook for Class IX–XII. NCERT, New Delhi, India. https://ncert.nic.in
- [22] OpenAI. 2023. GPT-4 technical report. https://arxiv.org/abs/2303.08774
- [23] OpenAI. 2024. GPT-4.1 system card. https://openai.com/research
- [24]
- [25]
- [26] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of ACL 2018, pages 784–789.
- [27]
- [28] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. 2024. Toolformer: Language models can teach themselves to use tools. In Proceedings of NeurIPS 2023. https://arxiv.org/abs/2302.04761
- [29] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, et al. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of EMNLP 2015, pages 1466–1476.
- [30] R. D. Sharma. 2019. Mathematics for Class IX. Dhanpat Rai Publications, New Delhi, India.
- [31] R. D. Sharma. 2020. Mathematics for Class X. Dhanpat Rai Publications, New Delhi, India.
- [32] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. https://arxiv.org/abs/2206.04615
- [33] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 2023. LLaMA: Open and efficient foundation language models. In Proceedings of ICML 2023. https://arxiv.org/abs/2302.13971
- [34] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, et al. 2024. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482. https://doi.org/10.1038/s41586-023-06747-5
- [35] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. 2023. Self-consistency improves chain of thought reasoning in language models. In Proceedings of ICLR 2023. https://arxiv.org/abs/2203.11171
- [36] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS 2022. https://arxiv.org/abs/2201.11903
- [37] Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of NeurIPS 2023. https://arxiv.org/abs/2305.10601
- [38] Minghao Zhang, Shuo Wang, Xiao Liu, et al. 2024. Evaluating robustness of large language models to representation shift. In Proceedings of ACL 2024. https://arxiv.org/abs/2402.01234
- [39] Denny Zhou, Nino Schärli, Luheng He, et al. 2023. Least-to-most prompting enables complex reasoning in large language models. In Proceedings of ICLR 2023. https://arxiv.org/abs/2205.10625
- [40] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, et al. 2023. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528. https://arxiv.org/abs/2306.04528
Appendix fragments
- Limitations (prompting scope): "We use zero-shot structured prompting; CoT, few-shot, and self-consistency strategies may partially mitigate …"
- Output format: the solver prompt demands valid JSON only, with no markdown formatting, code blocks, or extra text; the object must contain a "reasoning" field and a "numeric_answer" field, where "numeric_answer" is a string holding only the final numeric answer (no words or units) in forms such as "5", "3/2", "sqrt(8)", "2*sqrt(5)", or "8*pi". Any rule violation is scored as wrong.
- Table 6 reports problem-level correctness pattern counts (CCC–WWW) across the Euclidean, Coordinate, and Vector representations for all evaluated models.
discussion (0)