pith. machine review for the scientific record.

arxiv: 2605.07053 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords GSM8K · mathematical reasoning · LLM robustness · semantic perturbation · benchmark augmentation · data generation · reasoning evaluation

The pith

GSM-SEM creates fresh math problem variants by changing facts and entities while preserving the original answers and calculations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GSM-SEM, a stochastic framework that generates new benchmark versions by perturbing entities, attributes, and relationships in problem statements. These changes often alter the underlying facts and force recomputation under different conditions, yet the framework constrains outputs to keep the same calculations, final answer, and roughly the same difficulty. Applied to GSM8K, GSM-Symbolic, and GSM-Plus, the resulting datasets produce consistent performance drops across 14 state-of-the-art language models, with larger declines when semantic changes combine with symbolic or plus-style variations. The approach runs fresh each time without new human annotation, lowering the risk that models simply memorize fixed public test sets. The same generation method is shown to work on non-math benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

Core claim

GSM-SEM is a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. It perturbs problem statements by modifying entities, attributes, and relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations, answer, and approximate problem difficulty. When applied to GSM8K and existing variation suites, the resulting GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets reveal consistent performance drops in 14 SOTA LLMs, with an average drop rate of 28% in the maximum-strictness configuration.

What carries the argument

The GSM-SEM framework, a stochastic generator that applies constrained modifications to entities, attributes, and relationships in problem statements to produce new variants on each run.

If this is right

  • Models achieving high scores on static GSM8K versions may exhibit lower accuracy when facts change but the underlying math remains identical.
  • Fresh variants generated on each run reduce the long-term value of memorizing any single public test set.
  • Combining semantic perturbations with symbolic or plus-style changes produces larger performance declines than either type alone.
  • The same generation process can be reused on other reasoning benchmarks without requiring new human annotation for each release.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating stochastic variant generation into model training loops could encourage learning of reasoning patterns that generalize across changed contexts rather than surface forms.
  • If the preserved difficulty claim holds, the observed drops point to limits in how current models handle recomputation under altered problem conditions.
  • Extending the approach to domains with different reasoning structures, such as code or planning tasks, would test whether similar memorization vulnerabilities exist outside math word problems.

Load-bearing premise

Modifications to entities, attributes, and relationships can alter underlying facts and require recomputation while still preserving the original calculations, answer, and approximate problem difficulty.

What would settle it

A set of generated variants where human validators confirm that the required calculations or final answer have changed, or where the 14 evaluated LLMs show no measurable accuracy drop relative to the original problems.

Figures

Figures reproduced from arXiv: 2605.07053 by Amit Agarwal, Aziza Mirzadova, Dan Roth, Fang Tu, Graham Horwood, Hitesh Laxmichand Patel, Jyotika Singh, Miguel Ballesteros, Sandip Ghoshal, Sujith Ravi, Weiyi Sun, Yassine Benajiba.

Figure 1
Figure 1: Example perturbations and per-run accuracy. Top: original GSM8K problem. Middle: GSM-Symbolic and GSM-Plus rewrites. Bottom: GSM-SEM variants (orange highlights indicate edited spans). For each panel, accuracy across five independent runs is shown (✓ correct, ✗ incorrect), illustrating higher failures on SEM variants. L3.1 = Llama-3.1-405B-Ins; GPT-5 uses medium/default reasoning effort.
Figure 2
Figure 2: GSM-SEM semantic variant generation pipeline.
Figure 3
Figure 3: Cosine similarity distribution of GSM8K variants with respect to GSM8K; GSM8K-SEM shows higher semantic divergence than other variants.
Figure 4
Figure 4: Effect of the Strictness Filter (Section 3) on PDR% (relative to GSM8K) and statistical significance. Filter settings: none (all samples kept; [α, β] = 0–1), min ([α, β] = 0.30–0.70), min-med (0.35–0.65), med (0.40–0.60), med-max (0.45–0.55), and max (all such samples filtered out).
Figure 6
Figure 6: Average accuracy across models for GSM8K…
Figure 7
Figure 7: Delta in performance for GSM-variants…
Figure 8
Figure 8: Strictness filter configurations. Each setting…
Figure 9
Figure 9: Cosine similarity distribution of GSM8K variants…
Figure 10
Figure 10: Cosine similarity distribution using all…
read the original abstract

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GSM-SEM, a reusable stochastic framework for generating semantically variant augmentations of math reasoning benchmarks (GSM8K, GSM-Symbolic, GSM-Plus) by perturbing entities/attributes/relationships to increase semantic variance while constraining generation to preserve original calculations, answers, and approximate difficulty. It produces three new human-validated datasets, evaluates 14 SOTA LLMs showing consistent performance drops (larger when semantic perturbations are combined with symbolic/plus variations, averaging 28% in the maximum-strictness configuration), and demonstrates extension to other benchmarks such as BigBenchHard, LogicBench, and NLR-BIRD.

Significance. If the preservation constraints hold, GSM-SEM provides a practical, on-demand method for creating dynamic benchmarks that reduce memorization bias and better isolate true reasoning generalization; the reported drops when semantic changes are layered on symbolic variations would constitute useful evidence of current model limitations. The public release of validated datasets and the framework's applicability beyond GSM-style problems are concrete strengths.

major comments (2)
  1. [§3] GSM-SEM framework description: the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
  2. [Results] Evaluation on 14 LLMs and the 28% figure: without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human-validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
minor comments (2)
  1. [Abstract] The human-validation claim would be strengthened by a brief statement of validation criteria or inter-annotator statistics.
  2. [Throughout] The notation for the three released variants (GSM8K-SEM, etc.) should be introduced once and used consistently; a small number of figure captions could be expanded to clarify what 'maximum strictness configuration' entails.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of the GSM-SEM framework and results. We address each major comment below and will incorporate the requested details into the revised manuscript.

read point-by-point responses
  1. Referee: [§3] GSM-SEM framework description: the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).

    Authors: We agree that Section 3 would benefit from a more explicit account of the enforcement mechanisms. In the revision we will expand the framework description to detail the template rules governing entity/attribute/relationship perturbations, the symbolic equivalence checks that verify answer preservation, the post-generation filtering criteria, and the verification steps used to maintain approximate problem difficulty. These additions will directly support the interpretation of performance drops as arising from increased semantic variance. revision: yes

  2. Referee: [Results] Evaluation on 14 LLMs and the 28% figure: without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human-validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.

    Authors: We acknowledge that the Results section should report the validation procedures more explicitly. We will add a dedicated subsection describing the automated answer-equivalence checks (exact numerical match after recomputation), the human validation protocol (including inter-annotator agreement on answer correctness and difficulty), and how these steps were applied across the GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets. This will allow readers to attribute the observed drops unambiguously to semantic variance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in GSM-SEM framework or empirical evaluation

full rationale

The paper introduces a stochastic generation framework that perturbs entities/attributes/relationships in existing benchmarks while enforcing preservation of answers and difficulty, applies it to produce new variant sets (GSM8K-SEM etc.), and reports empirical LLM performance drops on those sets. This chain relies on external model evaluations and human validation rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The central claims are observational results from applying the defined process to independent test items, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that semantic perturbations can be generated to change facts while preserving calculations and difficulty; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Semantic perturbations can be applied to alter underlying facts while preserving original calculations, answers, and approximate difficulty
    This constraint is invoked as the core mechanism of GSM-SEM in the abstract.

pith-pipeline@v0.9.0 · 5630 in / 1347 out tokens · 63201 ms · 2026-05-11T00:50:17.185012+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 30 canonical work pages · 7 internal anchors

  1. [1]

    Gautam Balakrishnan and Anupam Purwar. 2024. https://doi.org/10.13140/RG.2.2.24790.25926 Evaluating the efficacy of open-source llms in enterprise-specific rag systems: A comparative study of performance and scalability

  2. [2]

    BIG bench authors. 2023. https://openreview.net/forum?id=uyTL5Bvosj Beyond the imitation game: Quantifying and extrapolating the capabilities of language models . Transactions on Machine Learning Research

  3. [5]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  4. [6]

    Giancarlo Crocetti. 2015. https://arxiv.org/abs/1505.03934 Textual spatial cosine similarity . Preprint, arXiv:1505.03934

  5. [8]

    Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, and Abdul Sattar. 2025. https://arxiv.org/abs/2503.18018 Lost in cultural translation: Do llms struggle with math across cultural contexts? Preprint, arXiv:2503.18018

  6. [9]

    Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2961--2984. Association for Computational Linguistics

  7. [10]

    Jikai Long, Zijian Hu, Xiaodong Yu, Jianwen Xie, and Zhaozhuo Xu. 2025. https://arxiv.org/abs/2506.17264 Oat-rephrase: Optimization-aware training data rephrasing for zeroth-order llm fine-tuning . Preprint, arXiv:2506.17264

  8. [11]

    Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, and Kevin Roitero. 2025. https://arxiv.org/abs/2509.04013 On robustness and reliability of benchmark-based evaluation of llms . Preprint, arXiv:2509.04013

  9. [14]

    Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations

  10. [17]

    Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. 2023. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138--154

  11. [18]

    Harsh Raj, Domenic Rosati, and Subhabrata Majumdar. 2022. Measuring reliability of large language models through semantic consistency. In NeurIPS ML Safety Workshop

  12. [19]

    Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Guido Zuccon. 2017. https://doi.org/10.1007/978-3-319-56608-5_31 Exploration of a threshold for similarity based on uncertainty in word embedding . In Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017 , pages 396--409. Springer

  13. [20]

    Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, and Dan Roth. 2025. https://arxiv.org/abs/2509.04657 Evaluating nl2sql via sql2nl . Preprint, arXiv:2509.04657

  14. [21]

    Viktor Schlegel, Goran Nenadic, and Riza Batista-Navarro. 2021. https://doi.org/10.1609/aaai.v35i15.17622 Semantics altering modifications for evaluating comprehension in machine reading . Proceedings of the AAAI Conference on Artificial Intelligence, 35:13762--13770

  15. [22]

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations

  16. [23]

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023a. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

  17. [24]

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023b. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210--31227. PMLR

  18. [26]

    Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, and Dan Roth. 2026. https://arxiv.org/abs/2604.08782 Mt-osc: Path for llms that get lost in multi-turn conversation . Preprint, arXiv:2604.08782

  19. [29]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models . Preprint, arXiv:2302.13971

  20. [30]

    Yuqing Wang and Yun Zhao. 2024. https://arxiv.org/abs/2406.11020 Rupbench: Benchmarking reasoning under perturbations for robustness evaluation in large language models . Preprint, arXiv:2406.11020

  21. [33]

    Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. 2025. On memorization of large language models in logical reasoning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computati...

  22. [34]

    Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. 2023. https://arxiv.org/abs/2311.04850 Rethinking benchmark and contamination for language models with rephrased samples . Preprint, arXiv:2311.04850

  23. [35]

    Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. 2025. https://openreview.net/forum?id=Tn5B6Udq3E Physics of language models: Part 2.1, grade-school math and the hidden reasoning process . In The Thirteenth International Conference on Learning Representations

  24. [36]

    Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, and Igor Kviatkovsky. 2026. https://arxiv.org/abs/2604.22597 Rethinking math reasoning evaluation: A robust llm-as-a-judge framework beyond symbolic rigidity . Preprint, arXiv:2604.22597

  25. [37]

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024 a . https://openreview.net/forum?id=N8N0hgNDRt Metamath: Bootstrap your own mathematical questions for large language models . In The Twelfth International Conference on Learning Representations

  26. [38]

    Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, and Ding Chen. xFinder: Large language models as automated evaluators for reliable evaluation

  27. [40]

    Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia. 2025. https://openreview.net/forum?id=br4H61LOoI MR-GSM8K: A meta-reasoning benchmark for large language model evaluation. In The Thirteenth International Conference on Learning Representations

  28. [41]

    Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, and 1 others. 2024. A careful examination of large language model performance on grade school arithmetic. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 46819--46836

  29. [43]

    Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, and Kaizhu Huang. 2025. https://openreview.net/forum?id=nDvgHIBRxQ Is your model really a good math reasoner? evaluating mathematical reasoning with checklist . In The Thirteenth International Conference on Learning Representations

  41. [55]

    Gonen, Hila and Iyer, Srini and Blevins, Terra and Smith, Noah and Zettlemoyer, Luke. Demystifying Prompts in Language Models via Perplexity Estimation. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.679

  44. [58]

    C. L. Hamblin. Foundations of Language.

  45. [59]

    Wu, Zhaofeng and Qiu, Linlu and Ross, Alexis and Aky… Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024.

  46. [60]

    Zhou, Yue and Zhu, Yada and Antognini, Diego and Kim, Yoon and Zhang, Yang. Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024.

  47. [61]

    Srivatsa, Kv Aditya and Kochmar, Ekaterina. What Makes Math Word Problems Challenging for LLMs? Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.72

  48. [62]

    Bhuiya, Neeladri and Schlegel, Viktor and Winkler, Stefan. Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.147

  50. [64]

    OpenAI; Hurst, Aaron; Lerer, Adam; et al. GPT-4o System Card.

  53. [67]

    Wang, Huazheng and Sun, Haifeng and Wang, Jingyu and Qi, Qi and Xia, Zixuan and Zhang, Menghao and Liao, Jianxin. SSS: Editing Factual Knowledge in Language Models towards Semantic Sparse Space. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.331

  54. [68]

    Ross, Alexis and Wu, Tongshuang and Peng, Hao and Peters, Matthew and Gardner, Matt. Tailor: Generating and Perturbing Text with Semantic Controls. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.228

  55. [69]

    Maini, Pratyush and Seto, Skyler and Bai, Richard and Grangier, David and Zhang, Yizhe and Jaitly, Navdeep. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.757

  56. [70]

    Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

    Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist. The Thirteenth International Conference on Learning Representations. 2025.

  57. [71]

    Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

    Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts? arXiv preprint. 2025.

  58. [72]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Yu, Longhui and others. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. The Twelfth International Conference on Learning Representations. 2024.

  59. [73]

    Zeng, Zhongshen and Chen, Pengguang and Liu, Shu and Jiang, Haiyun and Jia, Jiaya

    Zeng, Zhongshen and Chen, Pengguang and Liu, Shu and Jiang, Haiyun and Jia, Jiaya. 2025.

  60. [74]

    ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

    ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning. CoRR, arXiv:2410.19056. 2024.

  61. [75]

    On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

    On Robustness and Reliability of Benchmark-Based Evaluation of LLMs. arXiv preprint. 2025.

  62. [76]

    OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning

    OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning. arXiv preprint. 2025.

  63. [77]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Srivastava, Aarohi and others. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. 2023.

  64. [78]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Suzgun, Mirac and others. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261. 2022.

  65. [79]

    Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral

    Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

  66. [80]

    Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

    Singh, Jyotika and Sun, Weiyi and Agarwal, Amit and Krishnamurthy, Viji and Benajiba, Yassine and Ravi, Sujith and Roth, Dan. Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025.

  67. [81]

    Xiao, Chuan and Wang, Wei and Lin, Xuemin and Yu, Jeffrey Xu and Wang, Guoren

    Xiao, Chuan and Wang, Wei and Lin, Xuemin and Yu, Jeffrey Xu and Wang, Guoren. ACM Trans. Database Syst. August 2011. doi:10.1145/2000824.2000825

  68. [82]

    Bilenko, Mikhail and Mooney, Raymond J.

    Bilenko, Mikhail and Mooney, Raymond J. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003. doi:10.1145/956750.956759

  69. [83]

    Rekabsaz, Navid and Lupu, Mihai and Hanbury, Allan and Zuccon, Guido

    Rekabsaz, Navid and Lupu, Mihai and Hanbury, Allan and Zuccon, Guido. Advances in Information Retrieval: 39th European Conference on IR Research. 2017.

  70. [84]

    Textual Spatial Cosine Similarity

    Textual Spatial Cosine Similarity. arXiv preprint. 2015.

  71. [85]

    Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability

    Balakrishnan, Gautam and Purwar, Anupam. Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability.

  72. [86]

    A Fast Method to Filter Noisy Parallel Data WMT 2023 Shared Task on Parallel Data Curation

    Minh-Cong, Nguyen-Hoang and Vinh, Nguyen Van and Le-Minh, Nguyen. A Fast Method to Filter Noisy Parallel Data WMT 2023 Shared Task on Parallel Data Curation. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.37

  73. [87]

    xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

  74. [88]

    MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation. arXiv preprint. 2026.

  75. [89]

    Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity. arXiv preprint. 2026.