Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Pith reviewed 2026-05-10 04:08 UTC · model grok-4.3
The pith
On the SAIR equational implication task, single-prompt accuracy for gpt-oss-120b plateaus at roughly 60-79%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across more than 40 prompt variants, and despite substantial engineering effort, balanced hard accuracy for gpt-oss-120b plateaus in an empirical saturation region of approximately 60-79%, against a 59.75% no-cheatsheet baseline. The best single prompt reaches 79.25% accuracy on the hard3 split. The authors attribute the plateau to three mechanisms: the undecidability of the TRUE case, which limits what any finite prompt can encode; complex rule systems that reduce performance on weaker models; and prompt ordering effects that interact with model attention in fragile ways.
What carries the argument
The single-prompt ceiling, an observed saturation region where further increases in prompt length or complexity stop yielding accuracy gains on the equational implication task.
If this is right
- Complex prompts exceeding 2 KB cause TRUE recall to collapse to 0% on Llama 3.3 70B.
- A 2,252-byte prompt (AN45c) delivers a 19.5 percentage-point gain over the no-cheatsheet baseline on the hard3 split (79.25% versus 59.75%).
- The undecidability of the TRUE case means no finite prompt can fully encode the required logic for all instances.
- Prompt ordering produces non-monotonic performance changes that interact with each model's attention mechanism.
Where Pith is reading between the lines
- The ceiling may be less pronounced in settings that allow multiple turns or external verification tools rather than a single forward pass.
- Similar saturation effects could appear in other domains involving undecidable or partially decidable formal problems.
- The interaction between prompt length and model size suggests a practical trade-off where simpler prompts may be preferable for smaller models.
Load-bearing premise
That the chosen evaluation splits, baseline, and three specific models are representative enough to support general claims about prompt engineering limits for mathematical reasoning.
What would settle it
Running the same set of prompt variants on a different formal reasoning benchmark, or with a substantially stronger model, and observing whether overall accuracy exceeds 80% while TRUE and FALSE recall both stay high.
Original abstract
We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas, a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60-79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering
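The abstract's remark that FALSE answers can be certified by finite model search can be made concrete with a small sketch. The code below enumerates every binary operation on sets of bounded size and looks for one that satisfies the premise law but violates the conclusion law. The function names (`magmas`, `holds`, `find_countermodel`) and the two example laws are illustrative assumptions, not taken from the paper or the competition judge.

```python
from itertools import product

def magmas(n):
    """Yield every binary operation on {0, ..., n-1} as an n-by-n lookup table."""
    for flat in product(range(n), repeat=n * n):
        yield [flat[i * n:(i + 1) * n] for i in range(n)]

def holds(law, table):
    """True if the law holds for every assignment of x, y, z in the magma."""
    n = len(table)
    op = lambda a, b: table[a][b]
    return all(law(op, x, y, z) for x, y, z in product(range(n), repeat=3))

def find_countermodel(premise, conclusion, max_size=2):
    """Return a table satisfying `premise` but violating `conclusion`, or None."""
    for n in range(1, max_size + 1):
        for table in magmas(n):
            if holds(premise, table) and not holds(conclusion, table):
                return table
    return None

# Two illustrative laws (hypothetical examples, not competition data).
def commutative(op, x, y, z):
    return op(x, y) == op(y, x)

def associative(op, x, y, z):
    return op(op(x, y), z) == op(x, op(y, z))

witness = find_countermodel(commutative, associative)
print(witness)  # a 2-element commutative magma that is not associative
```

A returned table proves the implication is FALSE. A `None` result proves nothing about TRUE, because the search is bounded by `max_size`; that asymmetry is the one the abstract points to when it calls the general problem undecidable.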
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a systematic empirical study of more than 40 prompt variants (0 to 4,878 bytes) for three LLMs on the SAIR Equational Theories Stage 1 task of deciding whether one equational law implies another over magmas. It identifies a single-prompt ceiling with balanced hard accuracy plateauing at approximately 60-79% for gpt-oss-120b (versus 59.75% no-cheatsheet baseline), attributes this to undecidability of TRUE cases, prompt complexity overload on weaker models, and ordering effects, and reports a best prompt (AN45c) achieving 79.25% accuracy on the hard3 split (n=400, 95% CI [75.0%, 82.9%]) with +19.5pp gain; all prompts, scripts, and results are released on GitHub.
Significance. If the empirical plateau and mechanisms hold, the work supplies concrete, quantitative evidence (with CIs and open code) of practical limits to single-prompt engineering in a partially decidable formal reasoning task, highlights cognitive-load and attention interactions, and provides reusable resources that can inform prompt design in similar settings. The release of >40 variants and evaluation scripts is a clear strength for reproducibility.
major comments (2)
- [Abstract] Abstract and title: The central claim of a 'single-prompt ceiling in LLM Mathematical Reasoning' is presented as a general finding, yet all experiments, splits, baselines, and mechanisms (including the role of undecidability for TRUE cases and finite-model search for FALSE) are confined to the SAIR equational implication task. No cross-benchmark results on other mathematical reasoning settings (e.g., GSM8K or MATH) are provided, so the observed saturation could be an artifact of this specific problem structure rather than a broad prompt-engineering limit; the manuscript should either narrow the claims or add targeted validation experiments.
- [Results] Results section (hard3 split reporting): The 79.25% accuracy and +19.5pp improvement are given with a 95% CI, but the manuscript lacks explicit details on data-split construction, exact baseline prompting, and any correction for multiple comparisons across the >40 variants; these omissions make it difficult to verify that the plateau is not influenced by split-specific properties or selection effects.
minor comments (3)
- [Abstract] The term 'balanced hard accuracy' is used without an explicit formula or definition in the provided abstract; adding a clear definition (e.g., in §3 or a table) would improve clarity.
- A summary table listing all 40+ prompt variants by length, key features, and per-split accuracies would make the engineering effort and saturation pattern easier to inspect.
- Model names (e.g., 'gpt-oss-120b') should be standardized and any non-standard nomenclature explained.
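The first minor comment asks for an explicit definition of 'balanced hard accuracy'. One common definition of balanced accuracy is the mean of per-class recalls, and the sketch below contrasts it with plain accuracy. The source does not say which definition the paper uses, so this is an assumption. The per-class counts (187 of 195 TRUE, 130 of 205 FALSE) are also not reported anywhere in the source; they are one reconstruction that happens to reproduce the reported 95.9% and 63.4% recalls and 79.25% accuracy at n=400.

```python
def recall(correct, total):
    """Fraction of items of one class that were labelled correctly."""
    return correct / total

# ASSUMPTION: inferred counts, chosen only because they match the reported
# percentages; the paper summary does not give the class breakdown.
true_correct, true_total = 187, 195
false_correct, false_total = 130, 205

true_recall = recall(true_correct, true_total)
false_recall = recall(false_correct, false_total)

# Plain accuracy weights each problem equally.
accuracy = (true_correct + false_correct) / (true_total + false_total)

# Balanced accuracy (one common definition) weights each class equally.
balanced_accuracy = (true_recall + false_recall) / 2

print(f"TRUE recall       {true_recall:.1%}")
print(f"FALSE recall      {false_recall:.1%}")
print(f"accuracy          {accuracy:.2%}")
print(f"balanced accuracy {balanced_accuracy:.2%}")
```

Using the reported recalls directly, the mean is (95.9 + 63.4) / 2 = 79.65%, which is close to but not the same as the reported 79.25% accuracy. That gap is why an explicit formula in the manuscript would help.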
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below, proposing specific revisions to the abstract, title, and results section to improve clarity and transparency.
Point-by-point responses
- Referee: [Abstract] Abstract and title: The central claim of a 'single-prompt ceiling in LLM Mathematical Reasoning' is presented as a general finding, yet all experiments, splits, baselines, and mechanisms (including the role of undecidability for TRUE cases and finite-model search for FALSE) are confined to the SAIR equational implication task. No cross-benchmark results on other mathematical reasoning settings (e.g., GSM8K or MATH) are provided, so the observed saturation could be an artifact of this specific problem structure rather than a broad prompt-engineering limit; the manuscript should either narrow the claims or add targeted validation experiments.
  Authors: We agree that the title and abstract frame the single-prompt ceiling as a general phenomenon in LLM mathematical reasoning, whereas the empirical evidence is derived exclusively from the SAIR Equational Theories Stage 1 task. As we lack cross-benchmark results on datasets such as GSM8K or MATH, we will narrow the claims accordingly. Specifically, we will revise the title to 'Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Equational Reasoning' and update the abstract to emphasize that the findings pertain to the SAIR equational implication task, while noting that the identified mechanisms (undecidability of TRUE cases and cognitive load effects) may have broader implications. This revision avoids overgeneralization without requiring new experiments.
  Revision: yes
- Referee: [Results] Results section (hard3 split reporting): The 79.25% accuracy and +19.5pp improvement are given with a 95% CI, but the manuscript lacks explicit details on data-split construction, exact baseline prompting, and any correction for multiple comparisons across the >40 variants; these omissions make it difficult to verify that the plateau is not influenced by split-specific properties or selection effects.
  Authors: We acknowledge that the manuscript would benefit from greater explicitness on these methodological details. We will expand the Results section to include: (1) a precise description of how the hard3 split was constructed from the SAIR dataset, (2) the verbatim text of the no-cheatsheet baseline prompt, and (3) a discussion of multiple comparisons, noting that our primary goal was to characterize the performance plateau across variants rather than to identify a single superior prompt; consequently, we did not apply formal corrections such as Bonferroni, but all individual accuracies and the full evaluation code are publicly available in the GitHub repository for independent verification. These additions will make the reporting self-contained while leveraging the released artifacts.
  Revision: yes
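The multiple-comparisons point in this exchange can be checked roughly with a Bonferroni bound. The sketch below computes the exact binomial tail probability of seeing at least 317 correct answers out of 400 if a variant's true accuracy were only the baseline rate, then multiplies by the number of variants tried. Two assumptions are labelled in the code: the baseline rate is treated as a known constant rather than an estimate, and the variant count is set to 40 although the source says "more than 40". This is not an analysis from the paper.

```python
from math import comb

def binom_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n = 400                 # hard3 size reported in the abstract
best_correct = 317      # 79.25% of 400
baseline_rate = 0.5975  # ASSUMPTION: baseline treated as a fixed null rate
m = 40                  # ASSUMPTION: lower bound on the number of variants

# Chance that one variant at baseline skill scores this well by luck alone.
p_single = binom_tail(best_correct, n, baseline_rate)

# Bonferroni (union) bound over m variants; valid even though the variants
# share one test set and are therefore not independent.
p_bonferroni = min(1.0, m * p_single)

print(f"single-variant tail: {p_single:.3g}")
print(f"Bonferroni bound:    {p_bonferroni:.3g}")
```

Under these assumptions the bound stays far below conventional significance thresholds, so the gain of the best prompt over the baseline is unlikely to be a pure selection artifact. The same calculation says nothing about whether the upper edge of the plateau is specific to the hard3 split, which is the other half of the referee's concern.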
Circularity Check
No significant circularity; empirical measurements of prompt variants
Full rationale
The paper reports an empirical study that directly measures balanced hard accuracy across more than 40 prompt variants, four evaluation splits, and three models, establishing the single-prompt ceiling (60-79% plateau versus 59.75% baseline) from those observed results rather than any derivation. No equations, fitted parameters, or first-principles claims are present that could reduce to the inputs by construction. The identified mechanisms (undecidability of TRUE cases, length-induced collapse, ordering effects) are post-hoc interpretations of the data, not load-bearing steps. No self-citations, ansatzes, or renamings of known results appear in the load-bearing claims. The work is self-contained as a direct empirical investigation.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard assumptions for binomial confidence intervals on accuracy metrics
- domain assumption: The SAIR equational implication task is a valid proxy for broader mathematical reasoning challenges
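The binomial-interval axiom above can be illustrated with the reported hard3 result. The sketch below computes a Wilson score interval for 317 correct out of 400 (79.25%). The source does not state which interval method the authors used, so the choice of Wilson here is an assumption; the function name `wilson_interval` is likewise illustrative.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 79.25% of n=400, as reported for AN45c on hard3.
lo, hi = wilson_interval(317, 400)
print(f"[{lo:.1%}, {hi:.1%}]")
```

Rounded to one decimal place this gives [75.0%, 82.9%], matching the interval quoted in the abstract, which suggests (but does not confirm) that a Wilson-type interval was used.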