Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Amit Sheth; Dhaval Patel; Prateek Biswas; Shuxin Lin; Vedant Khandelwal

arxiv: 2605.18827 · v1 · pith:4T77SKZRnew · submitted 2026-05-12 · 💻 cs.IR · cs.LG · cs.PL

Pith reviewed 2026-05-20 21:03 UTC · model grok-4.3

classification 💻 cs.IR · cs.LG · cs.PL

keywords code-guided reasoning · small language models · multiple-choice QA · executable scaffolds · reasoning assistance · accuracy evaluation · program generation

The pith

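In the paper's non-zero-baseline partition, executable code scaffolds raise small language model accuracy on multiple-choice questions by about 28 percentage points, an estimate the authors present as descriptive.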

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code-Guided Reasoning (CGR), an evaluation protocol for testing whether small language models answer multiple-choice questions more accurately when a generated Python program is executed as a reasoning scaffold than when they answer directly. It defines six standardized components, including prompts for direct solving and code generation, a Python execution scaffold, and result recording, so that measurements are consistent across models and questions. On 20,498 retained result rows from a locally prepared MCQA bundle and six solver models, macro assisted accuracy in the non-zero-baseline partition reached 66.21 percent versus 38.11 percent direct, a 28.10-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter gate requiring more than 30 percent direct accuracy, the macro difference was still +14.11 points. The work supplies the full trace package of programs, answers, and metadata so others can examine the sources of the observed improvement.

Core claim

The authors claim that an executable reasoning scaffold in which the model first writes a Python program to analyze the question and then runs it for answer extraction produces substantially higher accuracy than direct answering on MCQA tasks for small language models, with the macro difference measured at +28.10 percentage points across the non-zero-baseline partition of results.
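The "write a program, then run it" step can be illustrated with a minimal sketch. It assumes the generated program is available as a Python source string and prints its result to standard output; the function name `run_scaffold` and the timeout value are hypothetical and are not taken from the paper. A fresh interpreter process is not a security sandbox, so this is only a sketch of the control flow.

```python
import subprocess
import sys


def run_scaffold(program: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run generated Python source in a fresh interpreter.

    Returns (ok, stdout). `ok` is False on a non-zero exit code or a timeout.
    Hypothetical helper: the paper's actual scaffold runner is not shown in the source.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A program that hangs counts as a failed run with no output.
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


if __name__ == "__main__":
    ok, out = run_scaffold("print('ANSWER: B')")
    print(ok, out)
```

Keeping the success flag separate from the captured output lets a caller distinguish "the program crashed" from "the program ran but printed nothing usable", which matters when extraction failures are audited later.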

What carries the argument

Code-Guided Reasoning (CGR), the standardized protocol with its six components: normalized item interface, direct solver prompt, generator prompt, Python scaffold, solver-call and extraction helpers, and three-channel result record.
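Two of the six components, the normalized item interface and the three-channel result record, are data shapes, so a small sketch may help fix ideas. Every class and field name below is an assumption made for illustration; the paper's actual schema is not reproduced in this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MCQAItem:
    """Normalized item interface (hypothetical field names)."""
    item_id: str
    dataset: str
    question: str
    options: dict[str, str]  # option label -> option text
    gold: str                # gold option label


@dataclass
class ResultRecord:
    """Three-channel record: direct, assisted, and generator-side answers."""
    item_id: str
    dataset: str
    solver: str
    gold: str
    direct_answer: Optional[str] = None
    assisted_answer: Optional[str] = None
    generator_answer: Optional[str] = None

    def correct(self, channel: str) -> bool:
        """True if the named channel ('direct', 'assisted', 'generator') matches gold."""
        answer = getattr(self, f"{channel}_answer")
        return answer is not None and answer == self.gold


if __name__ == "__main__":
    rec = ResultRecord("q1", "OpenBookQA", "solver-a", gold="B",
                       direct_answer="C", assisted_answer="B")
    print(rec.correct("direct"), rec.correct("assisted"), rec.correct("generator"))
```

Storing a missing answer as `None` rather than as an empty string keeps extraction failures distinguishable from wrong answers in each channel.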

If this is right

Small language models can offload complex reasoning steps to executable code and thereby reach higher accuracy on multiple-choice tasks.
Standardized scaffolds allow direct comparison of assisted versus direct performance across different solver models.
The protocol records generator-side answers and full program traces, enabling diagnosis of when code helps versus when it fails.
Some generated programs violate the no-hard-coding rule, showing that instruction adherence remains an issue even when accuracy rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scaffold approach could be applied to open-ended reasoning benchmarks to test whether the accuracy lift generalizes beyond fixed choices.
Replacing the brittle extraction step with more robust parsing or verification could increase the reliability of the assisted path.
Combining CGR with different programming languages or with retrieval of verified code snippets might further isolate the contribution of execution itself.

Load-bearing premise

That answer extraction from the executed program output is reliable and that the generated programs supply genuine reasoning assistance rather than incidental hard-coding or leakage.
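To make the extraction concern concrete, here is a minimal sketch of a pattern-based extractor. It assumes the program prints a line such as `ANSWER: B`; that output convention, the regular expression, and the function name `extract_answer` are all hypothetical and are not the paper's helpers. The sketch shows how a correct answer phrased differently is silently lost.

```python
import re
from typing import Optional

# Assumed convention: "ANSWER: X" or "answer = (x)", with X an option label A-J.
_ANSWER_RE = re.compile(r"\bANSWER\s*[:=]\s*\(?([A-J])\)?\b", re.IGNORECASE)


def extract_answer(output: str, valid_labels: str = "ABCDEFGHIJ") -> Optional[str]:
    """Return the last option label announced in `output`, or None if none is found."""
    matches = _ANSWER_RE.findall(output)
    if not matches:
        return None
    label = matches[-1].upper()
    return label if label in valid_labels else None


if __name__ == "__main__":
    for text in ["ANSWER: B", "answer = (c)", "The answer is B", "ANSWER: Because"]:
        print(repr(text), "->", extract_answer(text))
```

Under this convention the third input yields no extraction even though a human reader sees the answer, which is the kind of failure an extraction-success count would need to report.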

What would settle it

A replication on the same questions and models in which direct accuracy equals or exceeds assisted accuracy or in which a large share of program outputs cannot be extracted reliably.

Figures

Figures reproduced from arXiv: 2605.18827 by Amit Sheth, Dhaval Patel, Prateek Biswas, Shuxin Lin, Vedant Khandelwal.

**Figure 2.** CGR evaluation flow with a concrete retained-row example. The same MCQA item is …

**Figure 3.** Excerpt from a generated OpenBookQA scaffold. The full artifact keeps the generated …

**Figure 4.** Direct baseline, assisted solver, and generator-side answer accuracy. The x-axis ticks are …

**Figure 5.** Assisted-minus-direct accuracy by solver and dataset. Columns are datasets and rows are …

**Figure 6.** Partitioned result matrices. The left panel reports assisted-minus-direct accuracy for all …

**Figure 7.** Assisted-minus-direct accuracy for observed non-zero-baseline dataset–solver pairs. Nega…

**Figure 8.** Dataset-level and solver-level macro accuracy profiles. Each polar axis reorganizes the …

**Figure 9.** Accuracy by generator-estimated difficulty with representative scaffold text. The examples …

**Figure 10.** Representative examples from AIME, CorrectBenchQA, and FailureSensorIQ.

**Figure 11.** Representative examples from MMLU-Pro, OpenBookQA, and SuperGPQA.

**Figure 12.** Representative examples from Time-MQA, MedQA, and PhysicsQA.

**Figure 13.** Representative pilot example from the HLE evaluation artifacts.

**Figure 14.** Strong-LLM harness examples for AIME and CorrectBenchQA. The cards show local …

**Figure 15.** Strong-LLM harness examples for FailureSensorIQ and MMLU-Pro. Both use domain …

**Figure 16.** Strong-LLM harness examples for OpenBookQA and SuperGPQA. The first is analogy …

**Figure 17.** Strong-LLM harness examples for Time-MQA and MedQA. These rows show repeated …

**Figure 18.** Strong-LLM harness examples for PhysicsQA and the HLE pilot. These rows show …

Original abstract

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
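The headline figure is a macro difference, 66.21 − 38.11 = 28.10 percentage points, reported with a pair-bootstrap interval. The sketch below shows one common way such an interval can be computed. It assumes the resampling unit is a dataset-solver pair and uses a plain percentile interval; both choices, along with the function name and the toy numbers, are assumptions and not the paper's documented procedure.

```python
import random
from statistics import mean


def pair_bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for the macro assisted-minus-direct difference.

    pairs: list of (direct_pct, assisted_pct), one entry per dataset-solver pair.
    Returns (point_estimate, lower, upper) in percentage points.
    """
    rng = random.Random(seed)
    diffs = [assisted - direct for direct, assisted in pairs]
    point = mean(diffs)
    boots = []
    for _ in range(n_boot):
        # Resample whole pairs with replacement, keeping direct/assisted together.
        sample = [diffs[rng.randrange(len(diffs))] for _ in diffs]
        boots.append(mean(sample))
    boots.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return point, boots[lo_idx], boots[hi_idx]


if __name__ == "__main__":
    # Toy numbers for illustration only; these are not the paper's data.
    toy = [(20.0, 55.0), (45.0, 70.0), (35.0, 30.0), (50.0, 80.0)]
    print(pair_bootstrap_ci(toy))
```

Resampling the paired difference, rather than the two accuracy columns independently, preserves the fact that both modes were measured on the same pair.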

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The CGR protocol standardizes code scaffolds for SLM MCQA evaluation and ships a reusable trace package, but the reported accuracy lift rests on unverified extraction and hard-coding assumptions.

The main point is that this paper supplies a standardized six-component protocol for testing executable code scaffolds on multiple-choice QA with small language models, along with a packaged trace resource that includes generated programs, audits, and result records. That framing and the accompanying bundle are the concrete new pieces here. On 20,498 retained result rows they report a +28.10-point macro accuracy difference between assisted and direct modes in the non-zero-baseline partition, with a pair-bootstrap interval of [20.32, 36.43] that stays above zero; under a stricter direct-signal gate the difference is +14.11 points. They are explicit that the assisted path uses more solver calls, that extraction can be brittle, and that some programs violate the no-hard-coding rule. The work stays descriptive and does not overclaim causation.

The trace package itself is useful because it lets others inspect the actual programs and partitions rather than just the headline numbers. The citation choices are straightforward for an empirical methods piece and do not lean on self-reference in a circular way. The central comparison is a direct head-to-head on the same items, so there is no obvious fitting artifact.

The soft spot is the one the stress-test note raises. If extraction helpers fail on a non-trivial share of outputs, or if many generated programs simply embed the answer instead of performing step-by-step reasoning, the observed lift becomes an artifact of the scaffold mechanics rather than evidence that the code improves reasoning. The abstract flags both issues, yet the reported averages treat the extracted answers as valid. Without clearer counts of extraction successes and hard-coding violations, it is difficult to judge how much the +28-point gap would shrink under tighter controls.

This paper is aimed at researchers working on tool use and scaffolding for smaller models who want a reproducible starting point for MCQA experiments. A reader interested in efficient deployment or benchmark construction can pull the trace package and run their own checks. It deserves a serious referee because the protocol and resource are concrete enough to evaluate and extend, even if the current numbers need tighter quantification of the extraction and violation problems before the accuracy claims can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The paper introduces Code-Guided Reasoning (CGR), a standardized evaluation protocol and generated-program resource for assessing when executable code scaffolds improve small language model (SLM) performance on multiple-choice QA (MCQA) tasks. It reports descriptive results on 20,498 retained rows from a locally prepared MCQA bundle and six solver models: in the non-zero-baseline partition, macro assisted accuracy reaches 66.21% versus 38.11% direct accuracy (+28.10 pp, pair-bootstrap interval [20.32, 36.43]); a stricter Ab > 30% direct-signal gate yields +14.11 pp. The authors explicitly characterize the estimates as descriptive, noting larger solver budgets for assisted inference, brittle answer extraction, regressions in Time-MQA, and some generated programs violating the no-hard-coding instruction, while supplying a full trace package for interpretation.

Significance. The primary contribution is the CGR protocol itself together with the accompanying trace package (direct/assisted/generator answers, partition definitions, generated programs, metadata, and audits). This resource enables reproducible examination of scaffold effects and could support future work on tool-augmented SLM inference. Because the manuscript already flags the key methodological caveats and presents the numbers as descriptive rather than causal, the significance lies more in the standardized evaluation framework than in any strong claim of reasoning improvement.

major comments (2)

[Abstract] Abstract: The reported +28.10 pp macro difference and its bootstrap interval treat extracted answers as valid, yet the abstract itself flags brittle extraction and instruction violations without providing extraction success rates, failure-mode counts, or the fraction of programs that hard-code answers across the 20,498 rows. A quantitative audit of these issues is needed to establish whether the observed lift is robust to extraction mechanics.
[Results] Results (non-zero-baseline and Ab > 30% partitions): The central descriptive comparison rests on the definition and application of the non-zero-baseline partition and the stricter direct-signal gate. Without explicit formulas or pseudocode showing how baseline accuracy is computed per item and how the gate filters rows, it is difficult to verify that the +28.10 pp and +14.11 pp differences are not sensitive to partition construction choices.

minor comments (2)

[Abstract] Abstract: The phrase 'Time-MQA contains the observed regressions' is undefined; a brief parenthetical or footnote clarifying what Time-MQA refers to and its relation to the MCQA bundle would improve readability.
The manuscript would benefit from a small summary table listing the six metadata-registered solver models, their parameter counts, and any key differences in prompt handling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below, indicating where revisions have been made to improve clarity and transparency while preserving the descriptive framing of the results.

Referee: [Abstract] Abstract: The reported +28.10 pp macro difference and its bootstrap interval treat extracted answers as valid, yet the abstract itself flags brittle extraction and instruction violations without providing extraction success rates, failure-mode counts, or the fraction of programs that hard-code answers across the 20,498 rows. A quantitative audit of these issues is needed to establish whether the observed lift is robust to extraction mechanics.

Authors: We agree that quantifying extraction success, failure modes, and hard-coding violations would strengthen the presentation. The manuscript already characterizes the estimates as descriptive and notes these caveats, while the released trace package supplies the complete set of generated programs, solver outputs, extraction logs, and metadata for all 20,498 rows. In the revised manuscript we have added a concise quantitative audit summary to the abstract and a dedicated paragraph in the Results section reporting aggregate extraction success rates, primary failure-mode counts, and the fraction of programs flagged for hard-coding violations. These additions are derived directly from the trace package and support interpretation of the reported differences without changing their descriptive status.

revision: yes

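A hard-coding audit of the kind requested can be approximated with a static check over the generated programs. The sketch below is a crude heuristic, not the authors' audit: it flags any program that returns or assigns a bare single-letter option label. The function name and the rule itself are assumptions, and the rule will produce both false positives and false negatives.

```python
import ast


def flags_hard_coded_answer(source: str, labels: str = "ABCDEFGHIJ") -> bool:
    """Heuristic: True if the program returns or assigns a literal option label.

    Raises SyntaxError if `source` is not valid Python.
    """
    def is_label(node) -> bool:
        return (
            isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and len(node.value) == 1
            and node.value in labels
        )

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Return) and node.value is not None and is_label(node.value):
            return True
        if isinstance(node, ast.Assign) and is_label(node.value):
            return True
    return False


if __name__ == "__main__":
    print(flags_hard_coded_answer("def solve():\n    return 'B'\n"))
    print(flags_hard_coded_answer("def solve(s):\n    return max(s, key=s.get)\n"))
```

A flagged program is only a candidate violation; a manual look at the trace is still needed to decide whether the literal really encodes the answer.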
Referee: [Results] Results (non-zero-baseline and Ab > 30% partitions): The central descriptive comparison rests on the definition and application of the non-zero-baseline partition and the stricter direct-signal gate. Without explicit formulas or pseudocode showing how baseline accuracy is computed per item and how the gate filters rows, it is difficult to verify that the +28.10 pp and +14.11 pp differences are not sensitive to partition construction choices.

Authors: We appreciate the request for explicit definitions. The revised Methods section now includes formal definitions together with pseudocode for (i) the non-zero-baseline partition, which retains items for which at least one of the six direct solvers produces a correct answer, and (ii) the stricter Ab > 30% direct-signal gate. We have also added a short sensitivity table showing that the reported macro differences remain stable under modest changes to the threshold. These additions enable readers to reproduce the partitions exactly and to assess sensitivity.

revision: yes
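One way to make the partition question checkable is to write the gate as a filter on per-pair direct accuracy. The sketch below follows the figure caption's wording ("non-zero-baseline dataset–solver pairs") and treats both partitions as thresholds on direct accuracy per dataset-solver pair; this is one reading and differs from the item-level wording in the response above. The function name, row layout, and threshold semantics are assumptions.

```python
from collections import defaultdict


def macro_difference(rows, min_direct_pct=0.0):
    """Macro assisted-minus-direct difference over gated dataset-solver pairs.

    rows: iterable of (dataset, solver, direct_correct, assisted_correct).
    A pair is kept only if its direct accuracy is strictly above `min_direct_pct`
    (0.0 for a non-zero-baseline reading, 30.0 for a stricter gate).
    Returns None when no pair passes the gate.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # rows, direct hits, assisted hits
    for dataset, solver, direct_ok, assisted_ok in rows:
        c = counts[(dataset, solver)]
        c[0] += 1
        c[1] += int(direct_ok)
        c[2] += int(assisted_ok)

    diffs = []
    for n, direct_hits, assisted_hits in counts.values():
        direct_pct = 100.0 * direct_hits / n
        assisted_pct = 100.0 * assisted_hits / n
        if direct_pct > min_direct_pct:
            diffs.append(assisted_pct - direct_pct)
    return sum(diffs) / len(diffs) if diffs else None


if __name__ == "__main__":
    # Toy rows for illustration only; these are not the paper's data.
    toy = [("d1", "s1", True, True), ("d1", "s1", False, True),
           ("d2", "s1", False, True), ("d2", "s1", False, False)]
    print(macro_difference(toy), macro_difference(toy, min_direct_pct=30.0))
```

Running the same rows through several thresholds gives a direct sensitivity check on how much the macro difference depends on the gate.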

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of inference modes

The paper presents an evaluation protocol (CGR) and reports descriptive statistics from running direct versus assisted inference, with 20,498 retained result rows across six solver models. The central claim is an observed +28.10 pp macro accuracy lift on the non-zero-baseline partition, accompanied by a pair-bootstrap interval. No mathematical derivation, fitted parameter, or self-referential quantity is defined; the result is a direct measurement of two inference procedures on the same items. The abstract itself flags brittle extraction and instruction violations as limitations rather than claiming a closed-form or self-justifying result. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study; no mathematical axioms, free parameters, or invented entities are introduced beyond the definition of the evaluation protocol itself.
