Fine-Tuning Small Reasoning Models for Quantum Field Theory
Pith reviewed 2026-05-10 03:19 UTC · model grok-4.3
The pith
Fine-tuning 7B reasoning models on synthetic and human-adapted Quantum Field Theory problems improves performance on QFT tasks and shows some generalization to other physics domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present the first academic fine-tuning of 7B-parameter reasoning models dedicated to Quantum Field Theory. Training on over 2,500 synthetic problems generated by their pipeline, plus curated human-authored problems, produces measurable gains in problem-solving accuracy under both reinforcement learning and supervised fine-tuning, with partial generalization to other physics domains and a visible shift in the pattern of reasoning errors.
What carries the argument
The data generation pipeline that both creates synthetic verifiable QFT problems and converts existing human problems into training-ready format.
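The summary does not specify how the pipeline's problems are made "verifiable." One common and plausible mechanism, sketched here purely as an illustration (the function name and check are hypothetical, not the paper's actual verifier), is to test a model's closed-form answer against a reference expression by random numeric sampling:

```python
import math
import random

def numerically_equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Hypothetical verifier sketch: treat two closed-form answers as
    equivalent if they agree numerically at many random sample points.
    Only a small whitelist of math functions is exposed to eval()."""
    env = {"sin": math.sin, "cos": math.cos, "exp": math.exp,
           "log": math.log, "sqrt": math.sqrt, "pi": math.pi}
    for _ in range(trials):
        x = random.uniform(0.1, 2.0)
        a = eval(expr_a, {"__builtins__": {}}, {**env, var: x})
        b = eval(expr_b, {"__builtins__": {}}, {**env, var: x})
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True

# e.g. a model's answer vs. a reference form of the same quantity
print(numerically_equivalent("sin(x)**2 + cos(x)**2", "1.0"))  # True
```

Numeric sampling is cheaper than full symbolic simplification and catches most non-equivalences, at the cost of occasional false positives at unlucky sample points.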
If this is right
- Accuracy on QFT problems rises after both RL and SFT stages.
- Some of the acquired capability transfers to reasoning tasks in other physics domains.
- Analysis of model chains-of-thought shows how specific types of reasoning errors decrease during training.
- The released pipeline, dataset, and 200 million tokens of QFT reasoning traces enable further experiments by others.
Where Pith is reading between the lines
- The same pipeline could be reused to create training data for other advanced physics topics such as general relativity or particle phenomenology.
- Models trained this way might eventually be tested on open-ended research-style questions rather than closed textbook problems.
- Public release of the verifiable problems and traces provides a benchmark resource that future studies of LLM physics reasoning can build upon.
Load-bearing premise
That performance gains reflect the development of transferable QFT reasoning rather than the models learning to match patterns in the generated training set.
What would settle it
If a fresh set of QFT problems that require derivations outside the style and content of the training examples shows no accuracy improvement after fine-tuning, the claim that genuine reasoning ability was acquired would be undermined.
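One hedged way to operationalize that test: score the model on the fresh held-out set before and after fine-tuning, and ask whether the accuracy gap clears a simple two-proportion significance bar. The counts below are invented for illustration; only the statistic itself is standard.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-proportion z-statistic for accuracy before (k1/n1)
    vs. after (k2/n2) fine-tuning, using the pooled rate."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                      # pooled accuracy
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# invented numbers: 42/200 correct before, 61/200 correct after
z = two_proportion_z(42, 200, 61, 200)
print(round(z, 2))  # |z| > 1.96 would indicate a gain beyond chance at ~5%
```

If the gain on out-of-distribution derivations fails this kind of bar while in-distribution gains pass it, the pattern-matching explanation gains support.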
Original abstract
Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and ~200M tokens of QFT reasoning traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to conduct the first academic fine-tuning study of 7B-parameter reasoning models for Quantum Field Theory (QFT). It introduces a data generation pipeline producing over 2,500 synthetic verifiable QFT problems plus curated human-adapted problems from arXiv and textbooks, performs both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) experiments, benchmarks performance gains and generalization to other physics domains, analyzes evolution of chain-of-thought reasoning errors, and publicly releases the pipeline, training data, and ~200M tokens of reasoning traces.
Significance. If the reported gains and generalization hold under rigorous evaluation, the work would be significant for establishing a reproducible foundation for domain-specific fine-tuning in theoretical physics. The public release of the data pipeline, verifiable QFT problems, and reasoning traces is a clear strength that enables follow-on research and addresses the scarcity of open training resources in this area.
major comments (2)
- §5 (Experiments and Results): The benchmarking of performance gains and generalization to other physics domains is described at a high level but lacks concrete quantitative metrics (e.g., accuracy, pass rates, or error reductions with standard deviations), baseline comparisons, or statistical tests. This makes it impossible to assess whether observed improvements exceed what would be expected from pattern matching on the synthetic data.
- §6 (Chain-of-Thought Analysis): The analysis of how reasoning errors evolve during SFT and RL is primarily qualitative. Without quantitative categorization of error types (e.g., algebraic mistakes vs. conceptual misunderstandings) or before/after success rates on held-out problems, it is difficult to substantiate the claim that fine-tuning produces transferable QFT understanding rather than superficial adaptations to the training distribution.
minor comments (2)
- The abstract would be strengthened by including one or two key quantitative results (e.g., percentage improvement on QFT benchmarks) to allow readers to immediately gauge the magnitude of the reported gains.
- [Data Release] Ensure that the released dataset documentation explicitly describes the verification procedure for synthetic problems and any human review steps applied to arXiv-sourced material.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight opportunities to strengthen the quantitative rigor of our evaluation. We address each major comment below and commit to revisions that provide the requested metrics and analyses without altering the core claims of the work.
Point-by-point responses
- Referee, §5 (Experiments and Results): The benchmarking of performance gains and generalization to other physics domains is described at a high level but lacks concrete quantitative metrics (e.g., accuracy, pass rates, or error reductions with standard deviations), baseline comparisons, or statistical tests. This makes it impossible to assess whether observed improvements exceed what would be expected from pattern matching on the synthetic data.
Authors: We acknowledge that the presentation in §5 is currently summarized at a high level. In the revised manuscript we will add explicit quantitative results, including accuracy and pass rates on held-out QFT problems, generalization metrics to other physics domains, comparisons against the base 7B model and additional baselines, standard deviations from repeated runs, and statistical significance tests. These additions will enable direct assessment of whether gains exceed pattern matching on the synthetic distribution. Revision: yes.
- Referee, §6 (Chain-of-Thought Analysis): The analysis of how reasoning errors evolve during SFT and RL is primarily qualitative. Without quantitative categorization of error types (e.g., algebraic mistakes vs. conceptual misunderstandings) or before/after success rates on held-out problems, it is difficult to substantiate the claim that fine-tuning produces transferable QFT understanding rather than superficial adaptations to the training distribution.
Authors: We agree that the CoT analysis would be strengthened by quantitative support. We will expand §6 to include a categorized breakdown of error types with counts and percentages before and after fine-tuning, together with success rates on held-out problems. This will provide clearer evidence regarding the nature of the observed improvements. While some interpretive aspects of reasoning evolution remain qualitative, the added metrics will make the section substantially more rigorous. Revision: partial.
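The categorized breakdown the authors promise could be as simple as a per-trace tally of error labels. A minimal sketch with entirely hypothetical annotations (the labels and traces below are invented, not from the paper):

```python
from collections import Counter

# hypothetical error labels, one per chain-of-thought trace,
# e.g. from manual annotation before and after fine-tuning
before = ["algebra", "concept", "algebra", "sign", "concept", "none"]
after  = ["none", "algebra", "none", "none", "concept", "none"]

def error_rates(labels):
    """Fraction of traces exhibiting each error type ('none' = clean)."""
    counts = Counter(labels)
    return {k: counts[k] / len(labels) for k in counts if k != "none"}

print(error_rates(before))  # per-category rates before fine-tuning
print(error_rates(after))   # per-category rates after fine-tuning
```

Reporting these rates side by side, with counts, would turn the qualitative narrative into the before/after evidence the referee asks for.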
Circularity Check
No significant circularity in empirical fine-tuning study
Full rationale
This is a purely empirical machine-learning paper describing a data generation pipeline for synthetic and human-adapted QFT problems, followed by SFT and RL fine-tuning experiments on 7B models, with benchmarking of performance gains, generalization to other domains, and CoT analysis. No mathematical derivations, equations, or first-principles results are claimed whose outputs are defined in terms of their own inputs or fitted parameters. Central claims rest on measured experimental outcomes and a public data release rather than internal self-definitions, self-citation chains, or renamed known results. The study's claims are checked against external benchmarks (model performance before and after fine-tuning), with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- Fine-tuning hyperparameters (learning rate, batch size, RL reward scaling, etc.)
axioms (1)
- (domain assumption) Synthetic data generated by the pipeline can train transferable QFT reasoning capabilities.