Teaching Language Models to Think in Code
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:21 UTC · model grok-4.3
The pith
Small language models outperform much larger ones on math benchmarks by reasoning exclusively through code blocks connected by execution outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThinC makes code the reasoner: after an initial natural language planning step, all further reasoning consists of code blocks connected solely by their execution outputs. Distilling 12.2k such trajectories and training ThinC-4B on them yields a model that surpasses every TIR baseline on five math benchmarks and exceeds the much larger Qwen3-235B-A22B-Thinking. Moreover, 99.2 percent of its final answers remain grounded in interpreter results, and the model recovers from execution failures without reverting to natural language reasoning.
What carries the argument
ThinC trajectories in which an initial natural language plan is followed by code blocks linked only by interpreter execution outputs.
If this is right
- ThinC-4B outperforms every TIR baseline on five competition-level math benchmarks.
- The 4B model surpasses the much larger Qwen3-235B-A22B-Thinking.
- 99.2 percent of ThinC final answers are grounded in interpreter output.
- Models recover reliably from code execution failures without using intermediate natural language reasoning.
- Code serves as the primary reasoner rather than a post-hoc verifier.
Where Pith is reading between the lines
- Shifting reasoning load to code execution could reduce unverified calculations in other precise domains.
- The distillation-plus-RL pipeline may allow smaller models to match or exceed larger ones when the reasoning medium is restricted to executable code.
- High interpreter grounding suggests deployment advantages when models must interface directly with code runtimes.
Load-bearing premise
Distilled trajectories from a teacher will train small models to perform all intermediate reasoning exclusively through code execution outputs instead of error-prone natural language steps.
What would settle it
A trained ThinC model that produces a final answer unsupported by the code execution trace or inserts natural language computations between code blocks.
Figures
Original abstract
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
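The trajectory format the abstract describes can be sketched as a driver loop. The following is an illustrative reconstruction, not the authors' released code: the `<python>`/`<result>`/`<answer>` tag convention follows the prompt excerpts quoted on this page, and running each block in a fresh subprocess matches the stated constraint that variables do not persist between turns.

```python
import re
import subprocess
import sys

def run_block(code: str, timeout: int = 10) -> str:
    """Execute one code block in a fresh Python process and return its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    # Surface stderr as well, so a model can recover from execution failures.
    return proc.stdout + proc.stderr

def rollout(model_step, max_blocks: int = 8) -> str:
    """Drive a ThinC-style trajectory: code blocks connected only by their
    execution outputs, until the model emits an <answer> block."""
    transcript = ""
    for _ in range(max_blocks):
        turn = model_step(transcript)  # model sees only prior blocks + results
        transcript += turn
        answer = re.search(r"<answer>(.*?)</answer>", turn, re.S)
        if answer:
            return answer.group(1).strip()
        block = re.search(r"<python>(.*?)</python>", turn, re.S)
        if block:
            transcript += f"<result>{run_block(block.group(1))}</result>"
    return ""
```

Here `model_step` stands in for a call to the language model; the harness itself never injects natural language between blocks, which is the structural constraint the paper's claim rests on.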
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ThinC (Thinking in Code), a new framework for mathematical reasoning in which code serves as the primary reasoner rather than a tool invoked by natural language. After a short initial NL planning step, all subsequent reasoning occurs exclusively through sequences of code blocks whose logic is connected solely by interpreter execution outputs. The authors distill 12.2k code-centric trajectories from a teacher model, then train ThinC-1.7B and ThinC-4B via SFT followed by RL. They report that ThinC-4B outperforms all TIR baselines on five competition-level math benchmarks, surpasses the much larger Qwen3-235B-A22B-Thinking, achieves 99.2% grounding of final answers in interpreter output, and recovers reliably from code execution failures without reverting to intermediate NL reasoning.
Significance. If the empirical results and the claim of exclusive code-based reasoning hold under closer scrutiny, the work would demonstrate a viable path for smaller models to achieve reliable, verifiable intermediate reasoning on hard math problems by shifting the reasoning substrate to code execution. This could reduce reliance on error-prone NL steps and provide a clearer separation of roles between planning and execution, with potential implications for interpretability and robustness in tool-augmented LLMs.
major comments (3)
- [Further analysis (as described in the abstract)] The central claim that ThinC 'reasons through code' with no reversion to NL computations rests on the 99.2% final-answer grounding statistic and the recovery-from-failure analysis. However, these metrics do not directly measure whether intermediate reasoning steps during successful trajectories remain free of hidden NL computations or whether the generated code merely encodes logic that was already derived in NL before the code block is emitted.
- [Experiments and results (implied by benchmark comparisons in the abstract)] The performance advantage of ThinC-4B over TIR baselines and the larger Qwen3 model is presented as evidence for the paradigm shift, yet the manuscript provides no details on whether the TIR baselines were trained on comparable data volumes, used the same teacher for distillation, or employed equivalent RL reward shaping. Without these controls, gains could be attributable to trajectory quality or optimization rather than the code-as-reasoner design.
- [Method and analysis sections] The brief initial NL planning step is acknowledged, but the paper does not report any ablation or inspection (e.g., via activation patching or forced code-only generation) to test whether the model can be induced to fall back to NL reasoning when the planning step is removed or when code execution fails in ways not covered by the recovery analysis.
minor comments (2)
- [Abstract] The abstract states that 'our code and models will be released soon' but provides no link, repository, or timeline; this should be clarified or a placeholder added for reproducibility.
- [Experimental setup] Exact data splits, number of training epochs, RL hyperparameters, and the precise definition of 'grounded in interpreter output' are not detailed in the provided abstract; these should be expanded in the main text for verifiability.
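The precise definition of "grounded in interpreter output" is not given in the abstract. One plausible operationalization, sketched here purely as an illustration (the authors' actual criterion may be stricter), is to require that the final boxed answer appear verbatim in at least one interpreter `<result>` block:

```python
import re

def is_grounded(trajectory: str) -> bool:
    """Heuristic grounding check: the boxed final answer must literally appear
    in at least one <result> block. This is an assumed definition for
    illustration, not the paper's stated metric."""
    answer = re.search(r"\\boxed\{([^}]*)\}", trajectory)
    if not answer:
        return False
    results = re.findall(r"<result>(.*?)</result>", trajectory, re.S)
    return any(answer.group(1).strip() in r for r in results)
```

A string-match check of this kind would miss answers that are algebraically equivalent but formatted differently, which is one reason the exact definition matters for interpreting the 99.2% figure.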
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where we agree and what revisions we will make to the manuscript.
Point-by-point responses
Referee: [Further analysis (as described in the abstract)] The central claim that ThinC 'reasons through code' with no reversion to NL computations rests on the 99.2% final-answer grounding statistic and the recovery-from-failure analysis. However, these metrics do not directly measure whether intermediate reasoning steps during successful trajectories remain free of hidden NL computations or whether the generated code merely encodes logic that was already derived in NL before the code block is emitted.
Authors: We agree that the 99.2% grounding and recovery metrics provide indirect rather than direct evidence of purely code-based intermediate reasoning. The ThinC design explicitly constrains outputs after the planning step to code blocks whose logic is advanced solely via interpreter feedback, with no NL tokens permitted in the trajectory. This structural constraint, combined with distillation from code-centric teacher trajectories, makes reversion to NL unlikely. To strengthen the claim, we will add a qualitative analysis of 50 sampled successful trajectories in the revised manuscript, explicitly verifying that each code block contains the complete next reasoning step without presupposing unstated NL derivations. We will also report results from a forced code-only generation test (removing the planning step at inference) to measure any performance drop. revision: partial
Referee: [Experiments and results (implied by benchmark comparisons in the abstract)] The performance advantage of ThinC-4B over TIR baselines and the larger Qwen3 model is presented as evidence for the paradigm shift, yet the manuscript provides no details on whether the TIR baselines were trained on comparable data volumes, used the same teacher for distillation, or employed equivalent RL reward shaping. Without these controls, gains could be attributable to trajectory quality or optimization rather than the code-as-reasoner design.
Authors: This is a valid concern. The TIR baselines were reproduced or taken from their original publications using the setups described in those works. To allow direct comparison, the revised manuscript will include an expanded experimental setup section with a table detailing data volume, teacher model, and RL configuration for each baseline where such information is available in the source papers. Where exact matches were not possible, we note the differences and argue that the consistent outperformance across multiple benchmarks still supports the contribution of the code-centric paradigm, as ThinC uses a single unified training pipeline rather than ad-hoc tool calls. revision: yes
Referee: [Method and analysis sections] The brief initial NL planning step is acknowledged, but the paper does not report any ablation or inspection (e.g., via activation patching or forced code-only generation) to test whether the model can be induced to fall back to NL reasoning when the planning step is removed or when code execution fails in ways not covered by the recovery analysis.
Authors: The planning step is intentionally limited to high-level strategy and is followed exclusively by code in both training and inference. The existing recovery analysis already covers a range of execution failures without NL reversion. In the revision we will add a forced code-only ablation (prompting without the planning prefix) and report the resulting accuracy drop on the benchmarks, and we will expand the failure recovery section with additional failure modes. Activation patching and similar internal interpretability probes for detecting hidden NL computations lie outside the current experimental scope, as they require specialized tooling and compute not available within the project timeline; we therefore do not plan to include them. revision: partial
Circularity Check
No circularity: purely empirical training and benchmark evaluation
Full rationale
The paper presents an empirical method: distill 12.2k code-centric trajectories from a teacher, apply SFT then RL to 1.7B/4B models, and evaluate on five external math benchmarks. Claims such as outperformance and '99.2% of final answers grounded in interpreter output' are observational results from that training and analysis pipeline, not quantities derived from self-referential equations or fitted parameters renamed as predictions. No uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing mathematical justifications. The derivation chain consists of standard distillation + RL steps whose outputs are compared against independent baselines; nothing reduces to the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- 12.2k distilled trajectories
axioms (1)
- domain assumption Code execution outputs provide sufficient and deterministic signals to continue multi-step reasoning without intermediate natural language.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "99.2% of its final answers are grounded in interpreter output"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. OpenCodeReasoning: Advancing data distillation for competitive coding, 2025. URL https://arxiv.org/abs/2504.01943
- [2] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd
- [3] Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-Star: Empowering LLM-brained multi-tool reasoner via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.16410
- [4] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs, 2025. URL https://arxiv.org/abs/2504.11536
- [5] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models, 2023. URL https://arxiv.org/abs/2211.10435
- [6] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ep0TtjVoap
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
- [8] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025
- [9] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024
- [10] Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, and Dayiheng Liu. Teaching language models to reason with tools, 2025. URL https://arxiv.org/abs/2510.20342
- [11] Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891, 2025
- [12] OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925
- [13] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98
- [14]
- [15] Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang. rStar2-Agent: Agentic reasoning technical report, 2025. URL https://arxiv.org/abs/2508.20722
- [16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
- [17] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024
- [18] Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. MathCoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=z8TW0ttBPp
- [19] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?...
- [20] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... arXiv, 2025
- [21] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... DAPO: An open-source LLM reinforcement learning system at scale. arXiv, 2025
- [22] Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, and Mengdi Wang. Demystifying reinforcement learning in agentic reasoning, 2025. URL https://arxiv.org/abs/2510.11701
- [23] Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He, and Dong Li. ASTER: Agentic scaling with tool-integrated extended reasoning, 2026. URL https://arxiv.org/abs/2602.01204
- [24] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics
- [25] Excerpt: "No arithmetic in NL. The model never evaluates an expression or enumerates a case manually. The think channel is used purely for symbolic restructuring of the problem."
- [26] Excerpt: "Search-space specification. It fixes the exact constraints (m, n′ ≥ 2, m ≠ n′, m·n′ ≤ 101) before any code is written, providing a clean starting point for the code blocks that follow."
- [27] Excerpt: "Single transition to code. The block ends with one sentence ('Let me write code to compute this directly') and from this point onward, NL does not return: every subsequent reasoning step is carried out inside a <python> block." Appendix A.3 (Stage 2: Code-Centric Reasoning) adds that the remainder of the rollout consists of five <python>/<result> exchanges, each code block build...
- [28] Excerpt: "Write Python code inside <python> </python> tags. You will receive execution results inside <result> </result> tags."
- [29] Excerpt: "Each code block runs in a separate process. Variables do NOT persist between turns. Re-define or hardcode values from previous results."
- [30] Excerpt: "You may think freely before the first code block, but your final output to the user must be ONLY <python> blocks and one <answer> block."
- [31] Excerpt: "Embed your reasoning as concise comments inside the code. Each code block should contain short comments explaining: what you are computing and why, based on previous results."
- [32] Excerpt: "Solve problems step by step: compute intermediate results in each code block, observe the output, then decide what to compute next based on those results. Each block should advance the solution by one logical step."
- [33] Excerpt: "When you have the final answer, provide it inside <answer> The final answer is \boxed{answer} </answer>. CRITICAL: Each code block is executed in a completely fresh Python process. You MUST re-import libraries and re-define all variables in every code block. Use hardcoded values from previous <result> outputs, not variable names. Here are some examples: Q..."
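The fresh-process constraint quoted above ([29], [33]) implies a distinctive coding pattern: every block re-imports its libraries and hardcodes numeric values read from earlier <result> outputs rather than referencing variables. A minimal illustrative pair of blocks, invented for this review rather than taken from the paper:

```python
# Block 1: compute an intermediate quantity. Imports live inside the block,
# since nothing persists into the next one.
import math
print(math.comb(10, 3))  # the interpreter returns this as <result>120</result>

# --- a completely fresh process runs Block 2 ---

# Block 2: the previous value is hardcoded, not referenced by variable name,
# because each block starts from an empty namespace.
import math
intermediate = 120  # hardcoded from the previous <result> output
print(intermediate * 2)
```

This pattern is what makes the interpreter trace self-contained: every number a later block uses is traceable to an earlier printed result.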