Recognition: 2 theorem links
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3
The pith
GSM-SEM creates fresh math problem variants by changing facts and entities while preserving the original answers and calculations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSM-SEM is a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. It perturbs problem statements by modifying entities, attributes, and relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations, answer, and approximate problem difficulty. When applied to GSM8K and existing variation suites, the resulting GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets reveal consistent performance drops in 14 SOTA LLMs, with an average drop rate of 28 percent in the maximum-strictness configuration.
What carries the argument
The GSM-SEM framework, a stochastic generator that applies constrained modifications to entities, attributes, and relationships in problem statements to produce new variants on each run.
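The generator's contract can be sketched in a few lines. Everything below is illustrative: the substitution table, the function name `perturb`, and the 0.8 perturbation probability are assumptions, not the paper's implementation; the only property carried over from the text is that surface entities change stochastically while the numbers, and hence the calculation chain and final answer, stay fixed.

```python
import random

# Illustrative sketch only: the substitution table and probability are
# assumptions, not the paper's actual generator.
ENTITY_SWAPS = {"apples": "notebooks", "Sara": "Ravi", "baskets": "boxes"}

def perturb(problem: str, seed: int) -> str:
    """Stochastically rewrite surface entities while leaving every number,
    and hence the calculation chain and final answer, untouched."""
    rng = random.Random(seed)
    out = problem
    for old, new in ENTITY_SWAPS.items():
        # Stochastic choice: a different seed yields a fresh variant.
        if old in out and rng.random() < 0.8:
            out = out.replace(old, new)
    return out

original = "Sara puts 3 apples in each of 4 baskets. How many apples in total?"
variant = perturb(original, seed=7)
# The numbers survive every swap, so the original solution 3 * 4 = 12 still applies.
assert "3" in variant and "4" in variant
```

Because only word-level entities are swapped, running the generator with fresh seeds produces new surface forms on each release without re-annotation, which is the property the framework claims.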
If this is right
- Models achieving high scores on static GSM8K versions may exhibit lower accuracy when facts change but the underlying math remains identical.
- Fresh variants generated on each run reduce the long-term value of memorizing any single public test set.
- Combining semantic perturbations with symbolic or plus-style changes produces larger performance declines than either type alone.
- The same generation process can be reused on other reasoning benchmarks without requiring new human annotation for each release.
Where Pith is reading between the lines
- Integrating stochastic variant generation into model training loops could encourage learning of reasoning patterns that generalize across changed contexts rather than surface forms.
- If the preserved difficulty claim holds, the observed drops point to limits in how current models handle recomputation under altered problem conditions.
- Extending the approach to domains with different reasoning structures, such as code or planning tasks, would test whether similar memorization vulnerabilities exist outside math word problems.
Load-bearing premise
Modifications to entities, attributes, and relationships can alter underlying facts and require recomputation while still preserving the original calculations, answer, and approximate problem difficulty.
What would settle it
A set of generated variants where human validators confirm that the required calculations or final answer have changed, or where the 14 evaluated LLMs show no measurable accuracy drop relative to the original problems.
Figures
Original abstract
Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GSM-SEM, a reusable stochastic framework for generating semantically variant augmentations of math reasoning benchmarks (GSM8K, GSM-Symbolic, GSM-Plus) by perturbing entities/attributes/relationships to increase semantic variance while constraining generation to preserve original calculations, answers, and approximate difficulty. It produces three new human-validated datasets, evaluates 14 SOTA LLMs showing consistent performance drops (larger when semantic perturbations are combined with symbolic/plus variations, averaging 28% in the maximum-strictness configuration), and demonstrates extension to other benchmarks such as BigBenchHard, LogicBench, and NLR-BIRD.
Significance. If the preservation constraints hold, GSM-SEM provides a practical, on-demand method for creating dynamic benchmarks that reduce memorization bias and better isolate true reasoning generalization; the reported drops when semantic changes are layered on symbolic variations would constitute useful evidence of current model limitations. The public release of validated datasets and the framework's applicability beyond GSM-style problems are concrete strengths.
major comments (2)
- [§3] §3 (GSM-SEM framework description): the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
- [Results] Results (evaluation on 14 LLMs and the 28% figure): without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
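For concreteness, the "symbolic equivalence checks" and "post-generation filtering" the first major comment asks about could take a shape like the sketch below. The expression format and function names are assumptions for illustration, not a description of the manuscript's actual pipeline.

```python
import ast
import operator

# Hypothetical post-generation filter: recompute the solution expression
# attached to a candidate variant and keep it only if the value matches
# the original answer. The expression representation is an assumption.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Safely evaluate a small +-*/ arithmetic expression."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def preserves_answer(original_expr: str, variant_expr: str) -> bool:
    return evaluate(original_expr) == evaluate(variant_expr)

# A variant may re-order the computation but must keep the answer.
assert preserves_answer("3 * 4", "4 * 3")      # kept by the filter
assert not preserves_answer("3 * 4", "3 + 4")  # rejected
```

A filter of this kind would address the referee's concern only if it were actually reported in §3; the sketch simply shows that such a check is cheap to implement and verify.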
minor comments (2)
- [Abstract] Abstract: the human-validation claim would be strengthened by a brief statement of validation criteria or inter-annotator statistics.
- [Throughout] Throughout: notation for the three released variants (GSM8K-SEM, etc.) should be introduced once and used consistently; a small number of figure captions could be expanded to clarify what 'maximum strictness configuration' entails.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of the GSM-SEM framework and results. We address each major comment below and will incorporate the requested details into the revised manuscript.
Point-by-point responses
-
Referee: [§3] §3 (GSM-SEM framework description): the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
Authors: We agree that Section 3 would benefit from a more explicit account of the enforcement mechanisms. In the revision we will expand the framework description to detail the template rules governing entity/attribute/relationship perturbations, the symbolic equivalence checks that verify answer preservation, the post-generation filtering criteria, and the verification steps used to maintain approximate problem difficulty. These additions will directly support the interpretation of performance drops as arising from increased semantic variance.
Revision: yes
-
Referee: [Results] Results (evaluation on 14 LLMs and the 28% figure): without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
Authors: We acknowledge that the Results section should report the validation procedures more explicitly. We will add a dedicated subsection describing the automated answer-equivalence checks (exact numerical match after recomputation), the human validation protocol (including inter-annotator agreement on answer correctness and difficulty), and how these steps were applied across the GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets. This will allow readers to attribute the observed drops unambiguously to semantic variance.
Revision: yes
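For context, the reported 28 percent figure admits a simple reading as an average relative accuracy drop across models. The sketch below assumes that definition and uses invented accuracies; the paper's exact metric definition is not reproduced here.

```python
# Illustrative reading of the 28% figure as an average relative accuracy
# drop across models; the accuracy pairs below are invented numbers.
def drop_rate(orig_acc: float, variant_acc: float) -> float:
    """Relative decline of one model's accuracy on the SEM variant."""
    return (orig_acc - variant_acc) / orig_acc

def average_drop(pairs: list[tuple[float, float]]) -> float:
    """Mean relative drop over (original, variant) accuracy pairs."""
    return sum(drop_rate(o, v) for o, v in pairs) / len(pairs)

# Hypothetical accuracies for three models:
pairs = [(0.90, 0.63), (0.80, 0.60), (0.70, 0.49)]
print(round(average_drop(pairs), 2))  # prints 0.28 for these made-up numbers
```

Whether the paper computes a relative drop (as here) or an absolute accuracy difference is exactly the kind of detail the proposed subsection should pin down.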
Circularity Check
No significant circularity in GSM-SEM framework or empirical evaluation
Full rationale
The paper introduces a stochastic generation framework that perturbs entities/attributes/relationships in existing benchmarks while enforcing preservation of answers and difficulty, applies it to produce new variant sets (GSM8K-SEM etc.), and reports empirical LLM performance drops on those sets. This chain relies on external model evaluations and human validation rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The central claims are observational results from applying the defined process to independent test items, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: semantic perturbations can be applied to alter underlying facts while preserving original calculations, answers, and approximate difficulty.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage below and the cited Recognition theorem.
GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage below and the cited Recognition theorem.
We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gautam Balakrishnan and Anupam Purwar. 2024. https://doi.org/10.13140/RG.2.2.24790.25926 Evaluating the efficacy of open-source llms in enterprise-specific rag systems: A comparative study of performance and scalability
-
[2]
BIG bench authors. 2023. https://openreview.net/forum?id=uyTL5Bvosj Beyond the imitation game: Quantifying and extrapolating the capabilities of language models . Transactions on Machine Learning Research
-
[5]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168
-
[8]
Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, and Abdul Sattar. 2025. https://arxiv.org/abs/2503.18018 Lost in cultural translation: Do llms struggle with math across cultural contexts? Preprint, arXiv:2503.18018
-
[9]
Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2961--2984. Association for Computational Linguistics
-
[14]
Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations
-
[17]
Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. 2023. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138--154
-
[18]
Harsh Raj, Domenic Rosati, and Subhabrata Majumdar. 2022. Measuring reliability of large language models through semantic consistency. In NeurIPS ML Safety Workshop
-
[19]
Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Guido Zuccon. 2017. https://doi.org/10.1007/978-3-319-56608-5_31 Exploration of a threshold for similarity based on uncertainty in word embedding . In Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017 , pages 396--409. Springer
-
[21]
Viktor Schlegel, Goran Nenadic, and Riza Batista-Navarro. 2021. https://doi.org/10.1609/aaai.v35i15.17622 Semantics altering modifications for evaluating comprehension in machine reading . Proceedings of the AAAI Conference on Artificial Intelligence, 35:13762--13770
-
[22]
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations
-
[23]
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023a. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org
-
[24]
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023b. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210--31227. PMLR
-
[26]
Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, and Dan Roth. 2026. https://arxiv.org/abs/2604.08782 Mt-osc: Path for llms that get lost in multi-turn conversation . Preprint, arXiv:2604.08782
-
[29]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models . Preprint, arXiv:2302.13971
-
[33]
Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. 2025. On memorization of large language models in logical reasoning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
-
[34]
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. 2023. https://arxiv.org/abs/2311.04850 Rethinking benchmark and contamination for language models with rephrased samples . Preprint, arXiv:2311.04850
-
[35]
Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. 2025. https://openreview.net/forum?id=Tn5B6Udq3E Physics of language models: Part 2.1, grade-school math and the hidden reasoning process . In The Thirteenth International Conference on Learning Representations
-
[36]
Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, and Igor Kviatkovsky. 2026. https://arxiv.org/abs/2604.22597 Rethinking math reasoning evaluation: A robust llm-as-a-judge framework beyond symbolic rigidity . Preprint, arXiv:2604.22597
-
[37]
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024a. https://openreview.net/forum?id=N8N0hgNDRt Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations
-
[38]
Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, and Ding Chen. xFinder: Large language models as automated evaluators for reliable evaluation
-
[40]
Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia. 2025. https://openreview.net/forum?id=br4H61LOoI MR-GSM8K: A meta-reasoning benchmark for large language model evaluation. In The Thirteenth International Conference on Learning Representations
-
[41]
Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, and others. 2024. A careful examination of large language model performance on grade school arithmetic. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 46819--46836
-
[43]
Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, and Kaizhu Huang. 2025. https://openreview.net/forum?id=nDvgHIBRxQ Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist. In The Thirteenth International Conference on Learning Representations
-
[51]
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models. 2024. Preprint.
-
[53]
Evaluating NL2SQL via SQL2NL. 2025. Preprint.
-
[55]
Hila Gonen, Srini Iyer, Terra Blevins, Noah Smith, and Luke Zettlemoyer. 2023. Demystifying Prompts in Language Models via Perplexity Estimation. In Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.679
-
[58]
C. L. Hamblin. Foundations of Language.
-
[59]
Zhaofeng Wu, Linlu Qiu, Alexis Ross, and others. 2024. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
-
[60]
Yue Zhou, Yada Zhu, Diego Antognini, Yoon Kim, and Yang Zhang. 2024. Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
-
[61]
Kv Aditya Srivatsa and Ekaterina Kochmar. 2024. What Makes Math Word Problems Challenging for LLMs? In Findings of the Association for Computational Linguistics: NAACL 2024. doi:10.18653/v1/2024.findings-naacl.72
-
[62]
Neeladri Bhuiya, Viktor Schlegel, and Stefan Winkler. 2024. Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2024.emnlp-main.147
-
[64]
OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, and others. Preprint, arXiv.
-
[67]
Huazheng Wang, Haifeng Sun, Jingyu Wang, Qi Qi, Zixuan Xia, Menghao Zhang, and Jianxin Liao. 2024. SSS: Editing Factual Knowledge in Language Models towards Semantic Sparse Space. In Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.331
-
[68]
Alexis Ross, Tongshuang Wu, Hao Peng, Matthew Peters, and Matt Gardner. 2022. Tailor: Generating and Perturbing Text with Semantic Controls. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.228
-
[69]
Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. 2024. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.757
-
[74]
ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning. CoRR, arXiv:2410.19056.
-
[75]
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs. 2025. Preprint.
-
[76]
OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning. 2025. Preprint.
-
[78]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261.
-
[79]
Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
-
[80]
Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, and Dan Roth. 2025. Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track.
-
[81]
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. 2011. ACM Transactions on Database Systems. doi:10.1145/2000824.2000825
-
[82]
Mikhail Bilenko and Raymond J. Mooney. 2003. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. doi:10.1145/956750.956759
-
[84]
Textual Spatial Cosine Similarity. 2015. Preprint.
-
[86]
Nguyen-Hoang Minh-Cong, Nguyen Van Vinh, and Nguyen Le-Minh. 2023. A Fast Method to Filter Noisy Parallel Data WMT 2023 Shared Task on Parallel Data Curation. In Proceedings of the Eighth Conference on Machine Translation. doi:10.18653/v1/2023.wmt-1.37