Teaching Language Models to Think in Code
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:21 UTC · model grok-4.3
The pith
Small language models outperform much larger ones on math benchmarks by reasoning exclusively through code blocks connected by execution outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThinC makes code the reasoner: after an initial natural language planning step, all further reasoning consists of code blocks connected solely by their execution outputs. Distilling 12.2k such trajectories and training ThinC-4B on them yields a model that surpasses every TIR baseline on five math benchmarks and exceeds the much larger Qwen3-235B-A22B-Thinking. Moreover, 99.2 percent of its final answers remain grounded in interpreter results, and the model recovers from execution failures without reverting to natural language reasoning.
What carries the argument
ThinC trajectories in which an initial natural language plan is followed by code blocks linked only by interpreter execution outputs.
If this is right
- ThinC-4B outperforms every TIR baseline on five competition-level math benchmarks.
- The 4B model surpasses the much larger Qwen3-235B-A22B-Thinking.
- 99.2 percent of ThinC final answers are grounded in interpreter output.
- Models recover reliably from code execution failures without using intermediate natural language reasoning.
- Code serves as the primary reasoner rather than a post-hoc verifier.
Where Pith is reading between the lines
- Shifting reasoning load to code execution could reduce unverified calculations in other precise domains.
- The distillation-plus-RL pipeline may allow smaller models to match or exceed larger ones when the reasoning medium is restricted to executable code.
- High interpreter grounding suggests deployment advantages when models must interface directly with code runtimes.
Load-bearing premise
Distilled trajectories from a teacher will train small models to perform all intermediate reasoning exclusively through code execution outputs instead of error-prone natural language steps.
What would settle it
A trained ThinC model that produces a final answer unsupported by the code execution trace or inserts natural language computations between code blocks.
Figures
Original abstract
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
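The trajectory format the abstract describes can be sketched as a driver loop. The following is an illustrative reconstruction, not the authors' released code: the `<python>`/`<result>`/`<answer>` tag convention follows the prompt excerpts quoted on this page, and running each block in a fresh subprocess matches the stated constraint that variables do not persist between turns.

```python
import re
import subprocess
import sys

def run_block(code: str, timeout: int = 10) -> str:
    """Execute one code block in a fresh Python process and return its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    # Surface stderr as well, so a model can recover from execution failures.
    return proc.stdout + proc.stderr

def rollout(model_step, max_blocks: int = 8) -> str:
    """Drive a ThinC-style trajectory: code blocks connected only by their
    execution outputs, until the model emits an <answer> block."""
    transcript = ""
    for _ in range(max_blocks):
        turn = model_step(transcript)  # model sees only prior blocks + results
        transcript += turn
        answer = re.search(r"<answer>(.*?)</answer>", turn, re.S)
        if answer:
            return answer.group(1).strip()
        block = re.search(r"<python>(.*?)</python>", turn, re.S)
        if block:
            transcript += f"<result>{run_block(block.group(1))}</result>"
    return ""
```

Here `model_step` stands in for a call to the language model; the harness itself never injects natural language between blocks, which is the structural constraint the paper's claim rests on.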
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ThinC (Thinking in Code), a new framework for mathematical reasoning in which code serves as the primary reasoner rather than a tool invoked by natural language. After a short initial NL planning step, all subsequent reasoning occurs exclusively through sequences of code blocks whose logic is connected solely by interpreter execution outputs. The authors distill 12.2k code-centric trajectories from a teacher model, then train ThinC-1.7B and ThinC-4B via SFT followed by RL. They report that ThinC-4B outperforms all TIR baselines on five competition-level math benchmarks, surpasses the much larger Qwen3-235B-A22B-Thinking, achieves 99.2% grounding of final answers in interpreter output, and recovers reliably from code execution failures without reverting to intermediate NL reasoning.
Significance. If the empirical results and the claim of exclusive code-based reasoning hold under closer scrutiny, the work would demonstrate a viable path for smaller models to achieve reliable, verifiable intermediate reasoning on hard math problems by shifting the reasoning substrate to code execution. This could reduce reliance on error-prone NL steps and provide a clearer separation of roles between planning and execution, with potential implications for interpretability and robustness in tool-augmented LLMs.
major comments (3)
- [Further analysis (as described in the abstract)] The central claim that ThinC 'reasons through code' with no reversion to NL computations rests on the 99.2% final-answer grounding statistic and the recovery-from-failure analysis. However, these metrics do not directly measure whether intermediate reasoning steps during successful trajectories remain free of hidden NL computations or whether the generated code merely encodes logic that was already derived in NL before the code block is emitted.
- [Experiments and results (implied by benchmark comparisons in the abstract)] The performance advantage of ThinC-4B over TIR baselines and the larger Qwen3 model is presented as evidence for the paradigm shift, yet the manuscript provides no details on whether the TIR baselines were trained on comparable data volumes, used the same teacher for distillation, or employed equivalent RL reward shaping. Without these controls, gains could be attributable to trajectory quality or optimization rather than the code-as-reasoner design.
- [Method and analysis sections] The brief initial NL planning step is acknowledged, but the paper does not report any ablation or inspection (e.g., via activation patching or forced code-only generation) to test whether the model can be induced to fall back to NL reasoning when the planning step is removed or when code execution fails in ways not covered by the recovery analysis.
minor comments (2)
- [Abstract] The abstract states that 'our code and models will be released soon' but provides no link, repository, or timeline; this should be clarified or a placeholder added for reproducibility.
- [Experimental setup] Exact data splits, number of training epochs, RL hyperparameters, and the precise definition of 'grounded in interpreter output' are not detailed in the provided abstract; these should be expanded in the main text for verifiability.
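The precise definition of "grounded in interpreter output" is not given in the abstract. One plausible operationalization, sketched here purely as an illustration (the authors' actual criterion may be stricter), is to require that the final boxed answer appear verbatim in at least one interpreter `<result>` block:

```python
import re

def is_grounded(trajectory: str) -> bool:
    """Heuristic grounding check: the boxed final answer must literally appear
    in at least one <result> block. This is an assumed definition for
    illustration, not the paper's stated metric."""
    answer = re.search(r"\\boxed\{([^}]*)\}", trajectory)
    if not answer:
        return False
    results = re.findall(r"<result>(.*?)</result>", trajectory, re.S)
    return any(answer.group(1).strip() in r for r in results)
```

A string-match check of this kind would miss answers that are algebraically equivalent but formatted differently, which is one reason the exact definition matters for interpreting the 99.2% figure.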
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where we agree and what revisions we will make to the manuscript.
Point-by-point responses
Referee: [Further analysis (as described in the abstract)] The central claim that ThinC 'reasons through code' with no reversion to NL computations rests on the 99.2% final-answer grounding statistic and the recovery-from-failure analysis. However, these metrics do not directly measure whether intermediate reasoning steps during successful trajectories remain free of hidden NL computations or whether the generated code merely encodes logic that was already derived in NL before the code block is emitted.
Authors: We agree that the 99.2% grounding and recovery metrics provide indirect rather than direct evidence of purely code-based intermediate reasoning. The ThinC design explicitly constrains outputs after the planning step to code blocks whose logic is advanced solely via interpreter feedback, with no NL tokens permitted in the trajectory. This structural constraint, combined with distillation from code-centric teacher trajectories, makes reversion to NL unlikely. To strengthen the claim, we will add a qualitative analysis of 50 sampled successful trajectories in the revised manuscript, explicitly verifying that each code block contains the complete next reasoning step without presupposing unstated NL derivations. We will also report results from a forced code-only generation test (removing the planning step at inference) to measure any performance drop. revision: partial
Referee: [Experiments and results (implied by benchmark comparisons in the abstract)] The performance advantage of ThinC-4B over TIR baselines and the larger Qwen3 model is presented as evidence for the paradigm shift, yet the manuscript provides no details on whether the TIR baselines were trained on comparable data volumes, used the same teacher for distillation, or employed equivalent RL reward shaping. Without these controls, gains could be attributable to trajectory quality or optimization rather than the code-as-reasoner design.
Authors: This is a valid concern. The TIR baselines were reproduced or taken from their original publications using the setups described in those works. To allow direct comparison, the revised manuscript will include an expanded experimental setup section with a table detailing data volume, teacher model, and RL configuration for each baseline where such information is available in the source papers. Where exact matches were not possible, we note the differences and argue that the consistent outperformance across multiple benchmarks still supports the contribution of the code-centric paradigm, as ThinC uses a single unified training pipeline rather than ad-hoc tool calls. revision: yes
Referee: [Method and analysis sections] The brief initial NL planning step is acknowledged, but the paper does not report any ablation or inspection (e.g., via activation patching or forced code-only generation) to test whether the model can be induced to fall back to NL reasoning when the planning step is removed or when code execution fails in ways not covered by the recovery analysis.
Authors: The planning step is intentionally limited to high-level strategy and is followed exclusively by code in both training and inference. The existing recovery analysis already covers a range of execution failures without NL reversion. In the revision we will add a forced code-only ablation (prompting without the planning prefix) and report the resulting accuracy drop on the benchmarks, and we will expand the failure recovery section with additional failure modes. Activation patching and similar internal interpretability probes for detecting hidden NL computations lie outside the current experimental scope, as they require specialized tooling and compute not available within the project timeline; we therefore do not plan to include them. revision: partial
Circularity Check
No circularity: purely empirical training and benchmark evaluation
Full rationale
The paper presents an empirical method: distill 12.2k code-centric trajectories from a teacher, apply SFT then RL to 1.7B/4B models, and evaluate on five external math benchmarks. Claims such as outperformance and '99.2% of final answers grounded in interpreter output' are observational results from that training and analysis pipeline, not quantities derived from self-referential equations or fitted parameters renamed as predictions. No uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing mathematical justifications. The derivation chain consists of standard distillation + RL steps whose outputs are compared against independent baselines; nothing reduces to the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- 12.2k distilled trajectories
axioms (1)
- domain assumption Code execution outputs provide sufficient and deterministic signals to continue multi-step reasoning without intermediate natural language.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "99.2% of its final answers are grounded in interpreter output"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. OpenCodeReasoning: Advancing data distillation for competitive coding, 2025. URL https://arxiv.org/abs/2504.01943
- [2] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd
- [3] Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-Star: Empowering LLM-brained multi-tool reasoner via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.16410
- [4] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs, 2025. URL https://arxiv.org/abs/2504.11536
- [5] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models, 2023. URL https://arxiv.org/abs/2211.10435
- [6] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ep0TtjVoap
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
- [8] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025
- [9] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024
- [10] Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, and Dayiheng Liu. Teaching language models to reason with tools, 2025. URL https://arxiv.org/abs/2510.20342
- [11] Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891, 2025
- [12] OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925
- [13] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98
- [14]
- [15] Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang. rStar2-Agent: Agentic reasoning technical report, 2025. URL https://arxiv.org/abs/2508.20722
- [16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
- [17] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024
- [18] Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. MathCoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=z8TW0ttBPp
- [19] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?...
- [20] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... arXiv, 2025
- [21] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... DAPO: An open-source LLM reinforcement learning system at scale. arXiv, 2025
- [22] Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, and Mengdi Wang. Demystifying reinforcement learning in agentic reasoning, 2025. URL https://arxiv.org/abs/2510.11701
- [23] Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He, and Dong Li. ASTER: Agentic scaling with tool-integrated extended reasoning, 2026. URL https://arxiv.org/abs/2602.01204
- [24] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics
- [25] Excerpt: "No arithmetic in NL. The model never evaluates an expression or enumerates a case manually. The think channel is used purely for symbolic restructuring of the problem."
- [26] Excerpt: "Search-space specification. It fixes the exact constraints (m, n′ ≥ 2, m ≠ n′, m·n′ ≤ 101) before any code is written, providing a clean starting point for the code blocks that follow."
- [27] Excerpt: "Single transition to code. The block ends with one sentence ('Let me write code to compute this directly') and from this point onward, NL does not return: every subsequent reasoning step is carried out inside a <python> block." Appendix A.3 (Stage 2: Code-Centric Reasoning) adds that the remainder of the rollout consists of five <python>/<result> exchanges, each code block build...
- [28] Excerpt: "Write Python code inside <python> </python> tags. You will receive execution results inside <result> </result> tags."
- [29] Excerpt: "Each code block runs in a separate process. Variables do NOT persist between turns. Re-define or hardcode values from previous results."
- [30] Excerpt: "You may think freely before the first code block, but your final output to the user must be ONLY <python> blocks and one <answer> block."
- [31] Excerpt: "Embed your reasoning as concise comments inside the code. Each code block should contain short comments explaining: what you are computing and why, based on previous results."
- [32] Excerpt: "Solve problems step by step: compute intermediate results in each code block, observe the output, then decide what to compute next based on those results. Each block should advance the solution by one logical step."
- [33] Excerpt: "When you have the final answer, provide it inside <answer> The final answer is \boxed{answer} </answer>. CRITICAL: Each code block is executed in a completely fresh Python process. You MUST re-import libraries and re-define all variables in every code block. Use hardcoded values from previous <result> outputs, not variable names. Here are some examples: Q..."
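The fresh-process constraint quoted above ([29], [33]) implies a distinctive coding pattern: every block re-imports its libraries and hardcodes numeric values read from earlier <result> outputs rather than referencing variables. A minimal illustrative pair of blocks, invented for this review rather than taken from the paper:

```python
# Block 1: compute an intermediate quantity. Imports live inside the block,
# since nothing persists into the next one.
import math
print(math.comb(10, 3))  # the interpreter returns this as <result>120</result>

# --- a completely fresh process runs Block 2 ---

# Block 2: the previous value is hardcoded, not referenced by variable name,
# because each block starts from an empty namespace.
import math
intermediate = 120  # hardcoded from the previous <result> output
print(intermediate * 2)
```

This pattern is what makes the interpreter trace self-contained: every number a later block uses is traceable to an earlier printed result.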