What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Enhong Chen; Junpeng Fang; Jun Zhou; Kai Zhang; Lu Yu; Qi Liu; Qing Cui; Yuze Zhao; Zhenya Huang

arxiv: 2605.19762 · v1 · pith:YMTT3LIJnew · submitted 2026-05-19 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-20 05:03 UTC · model grok-4.3

keywords: mathematical reasoning, code pretraining, structured reasoning traces, domain competition, LLM data composition, cross-domain transfer, expert routing

The pith

Pure executable code improves programming but competes with complex mathematical reasoning instead of acting as a general enhancer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs controlled pretraining experiments on a 10T-token corpus that separates pure standalone code from cross-domain mixtures such as code-text and math-text. It finds that when Code-NL data are held constant and code is restricted to executable programs, the main benefit is stronger programming ability, while performance on knowledge-intensive tasks, especially hard math problems, declines. Gains previously credited to code turn out to come from the structured traces that mix domains rather than from code in isolation. Within a fixed math data budget, raising the share of structured math samples lifts difficult reasoning scores and largely keeps programming performance intact. Internal routing patterns in the trained models track these competitive and synergistic domain interactions.

Core claim

The central claim is that reasoning improvements long attributed to code in language-model training are explained by structured cross-domain traces rather than by executable code itself. When the training data isolate standalone executable programs and control for mixed Code-NL content, code raises programming skill but trades off against complex mathematical reasoning. Raising the density of structured math-domain samples inside a fixed math budget produces large gains on hard math problems while preserving most programming capability, and expert-activation routing confirms that these data-composition effects appear in the model's internal specialization patterns.

What carries the argument

Fine-grained domain separation of pretraining tokens into pure executable code, Code-NL mixtures, and math-text traces, together with post-training expert-activation routing to observe domain competition and synergy.

If this is right

Code data should be allocated primarily when the target capability is programming rather than general reasoning.
Structured mixtures that cross domains supply the transfer mechanism previously credited to code alone.
Within any fixed math budget, higher density of structured math samples improves hard reasoning without broad loss of other skills.
Model routing patterns will continue to reflect the same competitive and synergistic interactions when data composition changes.
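
The third point above, raising structured-sample density inside a fixed math budget, can be sketched as a simple allocation rule. This is an illustration under assumptions, not the paper's pipeline: the names `MathMix` and `split_math_budget` and the single `structured_fraction` knob are hypothetical, and the paper's actual sampling procedure is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathMix:
    structured_tokens: int    # tokens drawn from structured math-text traces
    unstructured_tokens: int  # tokens drawn from the remaining math data


def split_math_budget(math_budget_tokens: int, structured_fraction: float) -> MathMix:
    """Split a fixed math token budget between structured and other math data.

    The total is held constant; only the share of structured samples changes.
    (Hypothetical helper: the paper's real sampling scheme may differ.)
    """
    if math_budget_tokens < 0:
        raise ValueError("math_budget_tokens must be non-negative")
    if not 0.0 <= structured_fraction <= 1.0:
        raise ValueError("structured_fraction must lie in [0, 1]")
    structured = round(math_budget_tokens * structured_fraction)
    return MathMix(structured, math_budget_tokens - structured)
```

Because the second component is computed as the remainder, the two parts always sum to the original budget, which is the property the "fixed math budget" comparison depends on.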

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could prioritize explicit cognitive scaffolds over raw volume of any single domain.
Similar controlled separation experiments could be run for science or logic traces to check whether the pattern generalizes beyond math and code.
Data budgets might be re-optimized by treating structured cross-domain density as a tunable hyperparameter rather than treating all tokens within a domain as interchangeable.

Load-bearing premise

The fine-grained separation of domains inside the 10T-token corpus successfully removes confounding differences in data quality, collection method, or unmeasured variables so that observed effects can be attributed to code versus structured traces.

What would settle it

Retraining an identical model on a corpus that replaces all structured math-text mixtures with an equal volume of pure executable code and then measuring whether math-reasoning scores fall while programming scores rise would test the central distinction.
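
The proposed test can be written down as a token-preserving swap plus a directional check. The sketch below is only an illustration: the domain keys (`"pure_code"`, `"math_text"`) and both function names are hypothetical, and any scores passed in would have to come from real retraining runs.

```python
from typing import Dict


def swap_structured_math_for_code(mix: Dict[str, int]) -> Dict[str, int]:
    """Move every structured math-text token into the pure-code bucket.

    `mix` maps a domain name to its token count. The corpus total is
    unchanged, so only the composition differs between the two runs.
    """
    swapped = dict(mix)
    moved = swapped.get("math_text", 0)
    swapped["math_text"] = 0
    swapped["pure_code"] = swapped.get("pure_code", 0) + moved
    return swapped


def matches_prediction(baseline: Dict[str, float], swapped: Dict[str, float]) -> bool:
    """True if math scores fall and programming scores rise after the swap."""
    return swapped["math"] < baseline["math"] and swapped["code"] > baseline["code"]
```

If `matches_prediction` came back false on such a run, that would count against the distinction between pure code and structured traces.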

Figures

Figures reproduced from arXiv: 2605.19762 by Enhong Chen, Junpeng Fang, Jun Zhou, Kai Zhang, Lu Yu, Qi Liu, Qing Cui, Yuze Zhao, Zhenya Huang.

**Figure 1.** The impact of three data compositions on model performance across capability dimensions. Starting from the 10T-token corpus, we ablate either the code corpus (w/o code) or the math corpus (w/o math), and then evaluate the resulting models along five dimensions: general knowledge, coding ability, mathematical ability, comprehensive reasoning, and professional knowledge. The results suggest clear trade-off…

**Figure 2.** Code data exhibits a competitive relationship with knowledge-intensive tasks. When code is ablated from the full corpus, performance declines substantially across all programming benchmarks, as expected. Beyond programming, code data also competes with comprehensive reasoning tasks such as PIQA and HellaSwag. For mathematical reasoning, the impact is more task-dependent: code data significantly hinders per…

**Figure 3.** Math data exhibits a competitive effect on comprehensive reasoning tasks. As with code, ablating math data causes a pronounced decline on mathematical benchmarks. Unlike code, however, math data exerts limited competitive influence on programming ability. Its competitive effect is more apparent in comprehensive reasoning tasks, including code reasoning benchmarks such as CruxEval and MBPP and commonsense r…

**Figure 4.** Incorporating structured reasoning data affects tasks differently. For challenging datasets such as College Math and MATH, cognitive scaffolds improve model performance. By contrast, for tasks that can often be solved without explicit structured reasoning, such as GSM8K and CMath, adding such data introduces competition and may hinder performance.

**Figure 5.** MoE expert-routing probability deviations and JS divergence in the Math, Code, and QA domains under different data configurations. The upper row shows the 20 experts with the largest absolute deviations relative to the full-data model within each domain; the lower row reports pairwise JS divergence between complete expert-routing distributions.

**Figure 6.** Complete 64-expert analysis of routing-probability deviations in the Math, Code, and QA domains. Experts are sorted within each domain by their maximum absolute deviation relative to the full-data model. This figure complements the top-20 expert-level summary in …

**Figure 7.** Baseline training-loss trajectories across aggregate, Web, Math, and Code domains.

**Figure 8.** Math- and Web-domain training-loss comparisons for ablated data configurations.

**Figure 9.** Code-domain training-loss comparisons for ablated data configurations.
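
The Figure 5 caption names two routing statistics: per-expert probability deviations relative to the full-data model, and pairwise JS divergence between complete expert-routing distributions. A minimal sketch of both is given below. It assumes each model's routing in a domain has already been reduced to one probability vector over experts; how the paper aggregates router outputs and which logarithm base it uses are not stated in this summary, so base 2 and the names `js_divergence` and `top_deviating_experts` are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions.

    Uses log base 2 (an assumption), so the value lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def top_deviating_experts(
    full: Sequence[float], ablated: Sequence[float], k: int = 20
) -> List[Tuple[int, float]]:
    """Return (expert index, signed deviation) for the k largest |deviation|."""
    if len(full) != len(ablated):
        raise ValueError("routing vectors must have the same length")
    deviations = [(i, a - f) for i, (f, a) in enumerate(zip(full, ablated))]
    deviations.sort(key=lambda item: abs(item[1]), reverse=True)
    return deviations[:k]
```

With a 64-expert model, calling `top_deviating_experts(full, ablated, k=20)` would correspond to the top-20 view and `k=64` to the complete view described for Figure 6.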

Original abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Pure executable code boosts programming but competes with complex math reasoning, while structured cross-domain traces explain the gains, from controlled 10T-scale runs.

Hi, the main thing to know is that this paper finds pure standalone executable code improves coding tasks but actually hurts performance on hard math reasoning, while the benefits people saw before come more from structured mixtures like code-text or math-text data. They also show that packing more dense structured math samples into a fixed budget helps math without as much damage to programming, and they back some of this with routing patterns in expert activations. That's the core empirical distinction they draw from the large corpus experiments.

What stands out is the scale and the attempt to separate domains more carefully than most prior work, plus the mechanism-level routing analysis that ties data composition to internal model behavior. That part feels like a useful addition for thinking about pretraining mixtures.

The soft spot is the potential for residual confounds in how the subsets were built. The stress-test concern about missing checks on perplexity distributions, source metadata, or length matching between pure-code and structured math data is reasonable, and if those aren't addressed in the full methods it weakens the causal claims about structure versus quality or collection artifacts. The abstract is internally consistent but leaves the statistical details and exclusion criteria a bit thin for full verification.

This paper is aimed at people working on data-centric pretraining for LLMs, especially those balancing math, code, and knowledge tasks at scale. A reader who cares about practical ablations and routing evidence would get something out of the specific comparisons. I'd send it to peer review because the questions are actionable and the experiments are large enough that referees can pressure-test the controls and reporting.

Referee Report

1 major / 1 minor

Summary. The paper conducts large-scale controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation to revisit the role of code in improving mathematical reasoning. It reports three findings: pure standalone executable code improves programming but competes with knowledge-intensive tasks such as complex mathematical reasoning; prior reasoning gains attributed to code are better explained by cross-domain structured traces (e.g., code-text and math-text mixtures); and increasing the density of structured math-domain samples within a fixed math budget produces substantial gains on difficult mathematical reasoning while largely preserving programming performance. Routing analyses link these data-composition effects to expert-activation patterns.

Significance. If the central empirical claims hold after addressing the noted methodological gaps, the work provides a valuable clarification of data characteristics that transfer (or compete) across capability dimensions. It distinguishes pure executable code from structured reasoning signals and offers mechanism-level evidence via routing, which could inform more precise data-centric training strategies. The scale of the experiments and the focus on held-out task measurements are strengths.

major comments (1)

[Section 3] Section 3 and experimental setup: the central claim that fine-grained domain separation isolates causal effects of pure executable code versus cross-domain structured traces without residual confounding is load-bearing, yet the manuscript provides no explicit verification such as perplexity distributions across subsets, source metadata balance, or length/difficulty matching between pure-code and math-text data. Without these checks, observed trade-offs and routing patterns could arise from collection artifacts or unmeasured quality differences rather than the intended domain signals.

minor comments (1)

[Abstract] The abstract summarizes the threefold findings clearly but would benefit from briefly noting the evaluation metrics or task suites used to measure programming versus mathematical reasoning performance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our controlled experiments on data composition effects. We address the methodological concern below and will strengthen the manuscript with additional verification analyses.

Referee: [Section 3] Section 3 and experimental setup: the central claim that fine-grained domain separation isolates causal effects of pure executable code versus cross-domain structured traces without residual confounding is load-bearing, yet the manuscript provides no explicit verification such as perplexity distributions across subsets, source metadata balance, or length/difficulty matching between pure-code and math-text data. Without these checks, observed trade-offs and routing patterns could arise from collection artifacts or unmeasured quality differences rather than the intended domain signals.

Authors: We agree that explicit verification would further support the causal interpretation of our domain separation. Our fine-grained separation relies on source metadata and content-based filtering to isolate pure executable code from mixtures such as code-text and math-text within the 10T-token corpus. To address potential residual confounding, the revised manuscript will include: (1) perplexity distributions across subsets on a held-out set to assess comparable data quality; (2) source metadata balance statistics; and (3) length and difficulty matching via token-length histograms and proxy metrics derived from problem sources. These will be added to an expanded Section 3. We believe this will confirm that the reported trade-offs and routing patterns stem from the intended domain signals.

Revision: yes
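
Two of the checks promised in the response, token-length histograms and source metadata balance, are simple to sketch. The code below is a hedged illustration and not the authors' analysis: the bin width, the use of total variation distance as the comparison statistic, and all function names are assumptions. The perplexity check is omitted because it needs a reference model.

```python
from collections import Counter
from typing import Dict, Iterable, List


def length_histogram(token_lengths: List[int], bin_width: int = 512) -> Dict[int, float]:
    """Normalized histogram of document lengths, keyed by each bin's lower edge."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    counts = Counter(n // bin_width for n in token_lengths)
    total = len(token_lengths)
    return {b * bin_width: c / total for b, c in sorted(counts.items())}


def total_variation(h1: Dict[int, float], h2: Dict[int, float]) -> float:
    """Total variation distance between two normalized histograms, in [0, 1]."""
    keys = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)


def source_shares(sources: Iterable[str]) -> Dict[str, float]:
    """Fraction of documents contributed by each source label."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sources must not be empty")
    return {s: c / total for s, c in counts.items()}
```

A small total variation distance between the pure-code and math-text length histograms would suggest the subsets are length-matched; a large one would flag document length as a possible confound.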

Circularity Check

0 steps flagged

No circularity: empirical results from controlled pretraining experiments

The paper reports findings from direct empirical measurements on held-out tasks after controlled pretraining on a 10T-token corpus with fine-grained domain separation. No equations, derivations, or first-principles claims are presented that reduce to fitted parameters, self-referential definitions, or self-citation chains. The central claims about code's effects on programming versus mathematical reasoning derive from experimental comparisons rather than any construction that equates outcomes to inputs by definition. Self-citations, if present, are not load-bearing for the main results, which rest on new experimental data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical assumptions about data-domain isolation rather than new mathematical axioms or postulated entities; no free parameters are introduced because the work is purely experimental.

axioms (1)

Domain assumption: Fine-grained domain separation in the 10T-token corpus accurately isolates effects of executable code versus structured reasoning traces without significant confounding.
Invoked throughout the controlled pretraining experiments to attribute performance differences to specific data characteristics.

pith-pipeline@v0.9.0 · 5752 in / 1340 out tokens · 58413 ms · 2026-05-20T05:03:31.270417+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

work page 2021

[2] [2]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

work page 2022

[3] [3]

Evaluating Large Language Models in Theory of Mind Tasks.arXiv preprint arXiv:2302.02083,

Theory of mind may have spontaneously emerged in large language models , author=. arXiv preprint arXiv:2302.02083 , volume=

work page arXiv

[4] [4]

Os-copilot: Towards generalist computer agents with self-improvement.arXiv preprint arXiv:2402.07456, 2024

Os-copilot: Towards generalist computer agents with self-improvement , author=. arXiv preprint arXiv:2402.07456 , year=

work page arXiv

[5] [5]

Reinforcement learning for long-horizon interactive llm agents.arXiv preprint arXiv:2502.01600, 2025

Reinforcement learning for long-horizon interactive llm agents , author=. arXiv preprint arXiv:2502.01600 , year=

work page arXiv

[6] [6]

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , volume=

work page

[7] [7]

The Thirteenth International Conference on Learning Representations , year=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[8] [8]

WebGPT: Browser-assisted question-answering with human feedback

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Transactions on Machine Learning Research , year=

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks , author=. Transactions on Machine Learning Research , year=

work page

[10] [10]

International Conference on Machine Learning , pages=

Pal: Program-aided language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023

[11] [11]

The Thirteenth International Conference on Learning Representations , year=

Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[12] [12]

arXiv preprint arXiv:2309.16298 , year=

At which training stage does code data help llms reasoning? , author=. arXiv preprint arXiv:2309.16298 , year=

work page arXiv

[13] [13]

The Thirteenth International Conference on Learning Representations , year=

To Code or Not To Code? Exploring Impact of Code in Pre-training , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[14] [14]

Introducing Structured Outputs in the API , author =

work page

[15] [15]

Introducing the Model Context Protocol , author =

work page

[16] [16]

Journal of Machine Learning Research , volume=

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=. Journal of Machine Learning Research , volume=

work page

[17] [17]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Gshard: Scaling giant models with conditional computation and automatic sharding , author=. arXiv preprint arXiv:2006.16668 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006

[18] [18]

Mixtral of Experts

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv preprint arXiv:2401.06066.

[20] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.

[21] Enriching Word Vectors with Subword Information. 2017.

[22] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023.

[23] Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems.

[24] Impact of Pretraining Term Frequencies on Few-Shot Reasoning. 2022.

[25] M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827.

[26] When less is more: Investigating data pruning for pretraining LLMs at scale. arXiv preprint arXiv:2309.04564.

[27] D4: Improving LLM pretraining via document de-duplication and diversification. Advances in Neural Information Processing Systems.

[28] Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems.

[29] Programming every example: Lifting pre-training data quality like experts at scale. arXiv preprint arXiv:2409.17115.

[30] Data Selection via Optimal Control for Language Models. The Thirteenth International Conference on Learning Representations.

[31] Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws. The Thirteenth International Conference on Learning Representations.

[32] DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems.

[33] RegMix: Data Mixture as Regression for Language Model Pre-training. The Thirteenth International Conference on Learning Representations.

[34] Introduction to Common Crawl datasets. In Getting structured data from the internet: running web crawlers/scrapers on a big data production scale, 2020.

[35] The growing cost of deep learning for source code. Communications of the ACM, 2021.

[36] On the Use of ArXiv as a Dataset. 2019.

[37] Literary freedom: Project Gutenberg. XRDS: Crossroads, The ACM Magazine for Students, 2003.

[38] Chloé Braud, Amir Zeldes, Laura Rivière, Yang Janet Liu, Philippe Muller, Damien Sileo, and Tatsuya Aoyama.

[39] The Jensen-Shannon divergence. Journal of the Franklin Institute, 1997.

[40] Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? 2023.

[41] Yuze Zhao, Zhenya Huang, Yixiao Ma, Rui Li, Kai Zhang, Hao Jiang, Qi Liu, Linbo Zhu, and Yu Su. Findings of the Association for Computational Linguistics: ACL 2024, August 2024. doi:10.18653/v1/2024.findings-acl.973.

[42] Yuze Zhao, Zhenya Huang, Kai Zhang, Weibo Gao, Qi Liu, Xukai Liu, Fangzhou Yao, and Enhong Chen. Semantic-Aligned Code Summarization: Bridging the Gap Between Code and Natural Language Through Data Flow Analysis.

[43] Yuxuan Sun, Yuze Zhao, Yufeng Wang, Yao Du, Zhiyuan Ma, Jinbo Wang, Mengdi Zhang, Kai Zhang, and Zhenya Huang. 2026.

[44] What Makes In-context Learning Effective for Mathematical Reasoning. Proceedings of the 42nd International Conference on Machine Learning, 2025.

[45] Jiayu Liu, Zhenya Huang, Tong Xiao, Jing Sha, Jinze Wu, Qi Liu, Shijin Wang, and Enhong Chen. 2024.

[46] John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. 2025.