pith. sign in

arxiv: 2605.27709 · v1 · pith:GIIVZH3Wnew · submitted 2026-05-26 · 💻 cs.CL

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Pith reviewed 2026-06-29 17:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords mathematical problem generationanswer inversionLLM evaluationmemorization detectiondata augmentationmathematical reasoningreinforcement learning
0
0 comments X

The pith

Answer inversion generates scalable verifiable math problems by reversing original input-output relations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReverseMath to address static benchmarks and the high cost of creating new math problems with reliable answers. It works by masking a numerical value in an existing problem, treating the original answer as a known condition, and using an LLM to rewrite the statement so the masked value becomes the new answer to find. This reversal ensures the new answer is known by construction without manual checking. The generated pairs help reveal memorization when models behave differently on reversed versions, and the new problems serve as automatically labeled data to improve model performance through reinforcement learning on multiple benchmarks.

Core claim

Given a math problem and its answer, ReverseMath masks one numerical value, conditions the rewrite on the original answer, and produces a new problem whose solution is exactly the masked value, thereby creating fresh problems whose correctness is guaranteed by the inversion process itself.

What carries the argument

Answer inversion: the process of masking a numerical value in the original problem and rewriting it via LLM so that the masked value is now the target answer, reversing the original input-output relation.

If this is right

  • Paired original and reversed problems produce measurable behavioral shifts, including cases where models output the original answer on the reversed version.
  • Augmenting reinforcement learning with the automatically labeled reversed problems raises accuracy on several mathematical reasoning benchmarks.
  • The method supplies new problems at low cost while keeping answers known by construction, bypassing manual answer verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Iterating the inversion on already-reversed problems could produce sequences of related instances for studying how formulation changes affect reasoning.
  • The same masking-and-rewrite pattern might extend to non-numerical domains if other verifiable quantities can be isolated and reversed.
  • Gains from the added data imply that varying problem wording helps models move beyond surface-level pattern matching.

Load-bearing premise

The LLM rewriting step produces new problems whose difficulty and logical validity match the originals closely enough that observed differences stem mainly from memorization rather than artifacts of the inversion.

What would settle it

Finding that models achieve the same accuracy on original and reversed problem pairs, or that a substantial fraction of reversed problems have inconsistent or incorrect answers upon manual check.

Figures

Figures reproduced from arXiv: 2605.27709 by Hinrich Sch\"utze, Michael A. Hedderich, Raoyuan Zhao, Yihong Liu, Yupei Du.

Figure 1
Figure 1. Figure 1: REVERSEMATH turns existing math problems into verified reversed problems with deterministic an￾swers. These generated problems then support two uses in our study: controlled evaluation with paired orig￾inal/reversed problems for analyzing model behavior, and verifiable data augmentation for RL training. static and are extensively exposed through public evaluation, online discussion, and even training pipel… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of REVERSEMATH. Starting from an original math problem and its answer, REVERSEMATH masks a numerical value and conditions on the original answer to construct an intermediate inverse problem. A generator LLM rewrites it into a natural reversed problem, and a verifier checks whether the problem is uniquely solvable, free of answer leakage, and has an answer matching the masked value. See details in … view at source ↗
Figure 3
Figure 3. Figure 3: Behavioral transition patterns induced by [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Additional Failures revelaved by rule-based reversal and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overlap of TF cases on MATH-500 across models. Stronger within-family (dashed blocks) overlap indicates that answer-inversion failures are more consis￾tent within model families than across them. unsolved ones. Consequently, models may show similar average@10 before and after while relying on very different sets of correctly solved examples. The transition patterns indicate that REVERSE￾MATH does not unifo… view at source ↗
Figure 6
Figure 6. Figure 6: Behavioral transition patterns induced by R [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Behavioral transition patterns induced by R [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Behavioral transition patterns induced by [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Complete TF-overlap heatmaps across all evaluated datasets. From left to right, top to bottom: GSM8K, [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
read the original abstract

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReverseMath, a method to generate new math problems via answer inversion: mask a numerical value in an original problem, treat the original answer as a condition, and use an LLM to rewrite the statement so the masked value becomes the new answer. This produces automatically verifiable problems that reverse the original input-output mapping. The work evaluates the method for detecting memorization-like behavior through behavioral shifts on paired original/reversed problems and for training by augmenting RL data, reporting improved performance on multiple mathematical reasoning benchmarks.

Significance. If the generated problems maintain validity and comparable reasoning depth, ReverseMath offers a scalable source of verifiable training data and an analysis tool for distinguishing reasoning from memorization on contaminated benchmarks. The automatic labeling by construction is a clear strength, as is the paired-problem design for evaluation.

major comments (2)
  1. [§3] §3 (Method): The LLM rewriting procedure is described at a high level with no reported controls, prompt templates, or post-generation validation metrics (e.g., human verification of mathematical equivalence or difficulty ratings) to confirm that reversed problems preserve the original reasoning depth and skill distribution. Without this, performance gains in the RL experiments cannot be confidently attributed to the inversion mechanism rather than uncontrolled shifts in problem characteristics.
  2. [§5] §5 (Experiments): The training results claim that including ReverseMath data improves benchmark performance, yet there are no ablations isolating the inversion from confounding factors such as increased data volume, LLM-induced stylistic artifacts, or changes in problem length/difficulty. This undermines the central training claim.
minor comments (2)
  1. The abstract and introduction would benefit from a clearer statement of the exact masking criterion and how multiple numerical values are handled when present.
  2. Figure captions for the behavioral-shift plots should include the exact model sizes and benchmark names for immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the method and experiments.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The LLM rewriting procedure is described at a high level with no reported controls, prompt templates, or post-generation validation metrics (e.g., human verification of mathematical equivalence or difficulty ratings) to confirm that reversed problems preserve the original reasoning depth and skill distribution. Without this, performance gains in the RL experiments cannot be confidently attributed to the inversion mechanism rather than uncontrolled shifts in problem characteristics.

    Authors: We agree that the method section would benefit from greater transparency. In the revised manuscript we will add the complete prompt templates to an appendix. We will also report aggregate statistics comparing original and reversed problems (length, number of numerical values, and answer magnitude distributions) as basic controls. Additionally, we will include results from a small-scale human study on a random sample of 100 reversed problems, with raters assessing mathematical equivalence to the original and relative difficulty; inter-rater agreement will be reported. These additions will allow readers to better evaluate whether gains stem from the inversion itself. revision: yes

  2. Referee: [§5] §5 (Experiments): The training results claim that including ReverseMath data improves benchmark performance, yet there are no ablations isolating the inversion from confounding factors such as increased data volume, LLM-induced stylistic artifacts, or changes in problem length/difficulty. This undermines the central training claim.

    Authors: The referee correctly notes the absence of targeted ablations. We will add them in the revision: (i) a data-volume control using an equal number of non-inverted problems drawn from the same original sources, (ii) a stylistic-artifact control using problems rewritten by the LLM without the answer-inversion conditioning, and (iii) length- and difficulty-matched subsets. Results will be presented in an expanded experimental section with a new table. These controls should clarify the contribution of the inversion mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical generation procedure that starts from external source problems, applies LLM-based rewriting to produce reversed problems whose answers are known by construction, and then measures downstream effects on independent external benchmarks. No step reduces a claimed result to a fitted parameter, self-citation chain, or definitional equivalence; the performance gains are reported as measured outcomes rather than derived by construction from the inputs. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5743 in / 1014 out tokens · 23475 ms · 2026-06-29T17:58:03.043032+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 29 canonical work pages · 14 internal anchors

  1. [1]

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. https://doi.org/10.18653/v1/2024.eacl-srw.17 Large language models for mathematical reasoning: Progresses and challenges . In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225--237, S...

  2. [2]

    AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

    Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, and Marek Rei. 2026. https://arxiv.org/abs/2508.19988 Agentcoma: A compositional benchmark mixing commonsense and mathematical reasoning in real-world scenarios . Preprint, arXiv:2508.19988

  3. [3]

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.511 Benchmarking large language models under data contamination: A survey from static to dynamic evaluation . In Proceedings of the 2025 Conference on Empirical Methods in N...

  4. [4]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  5. [5]

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2024. https://doi.org/10.18653/v1/2024.naacl-long.482 Investigating data contamination in modern benchmarks for large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

  6. [6]

    Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, and Barbara Plank. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.437 Reason to rote: Rethinking memorization in reasoning . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8659--8679, Suzhou, China. Association for Computational Linguistics

  7. [7]

    Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.1978...

  8. [8]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  9. [9]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 others. 2025. https://doi.org/10.1038/s41586-025-09422-z Deepseek-r1 incentivizes reasoning in llms through reinforcement lear...

  10. [10]

    Pei Guo, WangJie You, Juntao Li, Yan Bowen, and Min Zhang. 2024. https://doi.org/10.18653/v1/2024.findings-acl.811 Exploring reversal mathematical reasoning ability for large language models . In Findings of the Association for Computational Linguistics: ACL 2024, pages 13671--13685, Bangkok, Thailand. Association for Computational Linguistics

  11. [11]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html Measuring mathematical problem solving with the MATH dataset . In Proceedings of the Neural Information Processin...

  12. [12]

    Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. 2025. https://proceedings.mlr.press/v267/huang25k.html Math-perturb: Benchmarking llms' math reasoning abilities against hard perturbati...

  13. [13]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

  14. [14]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, and 7 others. 2024. ht...

  15. [15]

    Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024. https://doi.org/10.18653/v1/2024.acl-long.163 GSM -plus: A comprehensive benchmark for evaluating the robustness of LLM s as mathematical problem solvers . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2961--2...

  16. [16]

    Inbal Magar and Roy Schwartz. 2022. https://doi.org/10.18653/v1/2022.acl-short.18 Data contamination: From memorization to exploitation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 157--165, Dublin, Ireland. Association for Computational Linguistics

  17. [17]

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. https://openreview.net/forum?id=AjXkRZIvjB Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models . In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025...

  18. [18]

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. 2025. https://arxiv.org/abs/2504.16891 Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset . Preprint, arXiv:2504.16891

  19. [19]

    Qwen Team , An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, and 24 others. 2025. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report . Preprint, arXiv:2412.15115

  20. [20]

    Oscar Sainz, Jon Campos, Iker Garc \'i a-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.722 NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776--10787, Singa...

  21. [21]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

  22. [22]

    Chi, Nathanael Sch \" a rli, and Denny Zhou

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Sch \" a rli, and Denny Zhou. 2023 a . https://proceedings.mlr.press/v202/shi23a.html Large language models can be easily distracted by irrelevant context . In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , Proceeding...

  23. [23]

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023 b . https://openreview.net/forum?id=fR3wGCk-IXp Language models are multilingual chain-of-thought reasoners . In The Eleventh International Conference on Learning Representations, IC...

  24. [24]

    Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, and Dayiheng Liu. 2026. https://arxiv.org/abs/2605.08738 Slimqwen: Exploring the pruning and distillation in large moe model pre-training . Preprint, arXiv:2605.08738

  25. [25]

    Peng - Yuan Wang, Tian - Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu - Hui Liu, Xinwei Chen, Jiacheng Xu, and Yang Yu. 2026. https://doi.org/10.1145/3786333 A survey on large language models for mathematical reasoning . ACM Comput. Surv. , 58(8):209:1--209:35

  26. [26]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. http://papers.nips.cc/paper\_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems...

  27. [27]

    Cheng Xu, Shuhao Guan, Derek Greene, and M-Tahar Kechadi. 2024. https://arxiv.org/abs/2406.04244 Benchmark data contamination of large language models: A survey . Preprint, arXiv:2406.04244

  28. [28]

    Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman, Ifeoma Okoh, Ganiyat Afolabi, Ayodele Awokoya, and David Ifeoluwa Adelani. 2026. https://arxiv.org/abs/2601.21225 Mgsm-pro: A simple strategy for robust multilingual mathematical reasoning evaluation . Preprint, arXiv:2601.21225

  29. [29]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025 a . https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  30. [30]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. 2024. https://arxiv.org/abs/2409.12122 Qwen2.5-math technical report: Toward mathematical expert model via self-improvement . Preprint, arXiv:2409.12122

  31. [31]

    Yuli Yang, Hiroaki Yamada, and Takenobu Tokunaga. 2025 b . https://doi.org/10.18653/v1/2025.insights-1.16 Evaluating robustness of LLM s to numerical variations in mathematical reasoning . In The Sixth Workshop on Insights from Negative Results in NLP, pages 171--180, Albuquerque, New Mexico. Association for Computational Linguistics

  32. [32]

    Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024. https://openreview.net/forum?id=N8N0hgNDRt Metamath: Bootstrap your own mathematical questions for large language models . In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria,...

  33. [33]

    Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, and Summer Yue. 2024. http://papers.nips.cc/paper\_files/paper/2024/hash/53384f2090c6a5cac952c598fd67992f-Abstract-Datasets\_and\_Benchmarks\_Track.html A careful exami...

  34. [34]

    Raoyuan Zhao, Abdullatif K \"o ksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, and Hinrich Schuetze. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.412 S ynth E val: Hybrid behavioral testing of NLP models with synthetic C heck L ists . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7017--7034, Miami, Florida, ...

  35. [35]

    Hedderich, and Hinrich Schuetze

    Raoyuan Zhao, Abdullatif K \"o ksal, Ali Modarressi, Michael A. Hedderich, and Hinrich Schuetze. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1263 Do we know what LLM s don ' t know? a study of consistency in knowledge probing . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23254--23280, Suzhou, China. Associatio...

  36. [36]

    Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

    Raoyuan Zhao, Yihong Liu, Lena Altinger, Hinrich Schütze, and Michael A. Hedderich. 2026. https://arxiv.org/abs/2510.09536 Evaluating robustness of large language models against multilingual typographical errors . Preprint, arXiv:2510.09536

  37. [37]

    Yue Zhou, Yada Zhu, Diego Antognini, Yoon Kim, and Yang Zhang. 2024. https://doi.org/10.18653/v1/2024.naacl-long.153 Paraphrase and solve: Exploring and exploiting the impact of surface form on mathematical reasoning in large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguist...

  38. [38]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  39. [39]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...