pith. machine review for the scientific record.

arxiv: 2604.18936 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI · hep-ph · hep-th

Recognition: unknown

Fine-Tuning Small Reasoning Models for Quantum Field Theory

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · hep-ph · hep-th

keywords fine-tuning · reasoning models · quantum field theory · large language models · reinforcement learning · supervised fine-tuning · data generation pipeline · physics reasoning

The pith

Fine-tuning 7B reasoning models on synthetic and human-adapted Quantum Field Theory problems improves performance on QFT tasks and shows some generalization to other physics domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that small reasoning models can develop domain-specific capabilities in theoretical physics through dedicated fine-tuning when suitable training data is created. The authors built a pipeline to generate more than 2,500 synthetic verifiable QFT problems and adapted additional problems from arXiv papers and textbooks so they could be used for training. They then ran both supervised fine-tuning and reinforcement learning experiments, tracked accuracy improvements, tested transfer to other physics areas, and examined how the models' step-by-step reasoning changed. A reader would care because this work supplies both a concrete method for creating scarce physics training data and the first public release of such a dataset and reasoning traces for 7B-scale models.

Core claim

The authors establish that the first academic fine-tuning of 7B-parameter reasoning models specifically for Quantum Field Theory, using a combination of over 2,500 synthetic problems generated by their pipeline and curated human-authored problems, produces measurable gains in problem-solving accuracy under both reinforcement learning and supervised fine-tuning, with partial generalization to other physics domains and visible evolution in the patterns of reasoning errors.

What carries the argument

The data generation pipeline that both creates synthetic verifiable QFT problems and converts existing human problems into training-ready format.
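The review treats this pipeline as the load-bearing component, and its essential contract is simple: every synthetic problem ships with a machine-checkable answer. A minimal sketch of that contract in Python; the problem format, function names, and hidden-coefficient task are hypothetical illustrations, not details taken from the paper:

```python
import random


def make_problem(seed: int) -> dict:
    """Generate one verifiable problem with a hidden coefficient.

    Hypothetical task format: the solver must recover the constant a in
    f(p) = a*p^2 / (p^2 + m^2) from its large-p behavior; an automatic
    grader can check the answer numerically.
    """
    rng = random.Random(seed)
    a = rng.randint(2, 9)   # hidden coefficient the model must recover
    m = rng.randint(1, 5)   # mass parameter, shown in the statement
    statement = (
        f"For f(p) = a*p^2/(p^2 + {m}^2) with unknown integer a, "
        f"f(10**6) = {a * 10**12 / (10**12 + m**2):.12f}. "
        "Report lim p->infinity f(p)."
    )
    return {"statement": statement, "answer": float(a)}


def grade(problem: dict, submitted: float, tol: float = 1e-6) -> bool:
    """Binary verifiable check: correct iff the answer matches within tol."""
    return abs(submitted - problem["answer"]) < tol


prob = make_problem(seed=0)
print(grade(prob, prob["answer"]))  # a correct submission grades True
```

The actual pipeline is far richer (Figure 4 shows two tracks), but any variant has to expose this generate-then-grade interface for verifiable-reward training to work.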

If this is right

  • Accuracy on QFT problems rises after both RL and SFT stages.
  • Some of the acquired capability transfers to reasoning tasks in other physics domains.
  • Analysis of model chains-of-thought shows how specific types of reasoning errors decrease during training.
  • The released pipeline, dataset, and 200 million tokens of QFT reasoning traces enable further experiments by others.
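Given verifiable problems, the RL stage needs little more than a binary reward on the model's final answer. A sketch, assuming (hypothetically) that completions end with a \boxed{...} number as in common math-RL setups; the paper's actual answer format is not specified in this review:

```python
import re


def verifiable_reward(completion: str, target: float, tol: float = 1e-6) -> float:
    """Return 1.0 iff the completion's last \\boxed{...} number matches target.

    The \\boxed{} convention is an assumption borrowed from common math-RL
    pipelines, not a detail confirmed by the paper.
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer -> zero reward
    try:
        value = float(matches[-1])
    except ValueError:
        return 0.0  # non-numeric content inside the box
    return 1.0 if abs(value - target) < tol else 0.0


print(verifiable_reward(r"... so the limit is \boxed{4}", 4.0))  # 1.0
```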

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be reused to create training data for other advanced physics topics such as general relativity or particle phenomenology.
  • Models trained this way might eventually be tested on open-ended research-style questions rather than closed textbook problems.
  • Public release of the verifiable problems and traces provides a benchmark resource that future studies of LLM physics reasoning can build upon.

Load-bearing premise

That performance gains reflect the development of transferable QFT reasoning rather than the models learning to match patterns in the generated training set.

What would settle it

If a fresh set of QFT problems that require derivations outside the style and content of the training examples shows no accuracy improvement after fine-tuning, the claim that genuine reasoning ability was acquired would be undermined.
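That falsification test is cheap to state in code: grade the base and fine-tuned models on a held-out, out-of-style problem set and compare accuracies. A toy sketch with invented per-problem correctness flags:

```python
def accuracy(flags: list) -> float:
    """Fraction of held-out problems graded correct (1) by the verifier."""
    return sum(flags) / len(flags)


# Hypothetical 0/1 correctness flags on an out-of-distribution QFT set.
base_flags = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
tuned_flags = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# A gain near zero on out-of-style problems would point to pattern matching
# on the training distribution rather than acquired reasoning ability.
gain = accuracy(tuned_flags) - accuracy(base_flags)
print(f"OOD accuracy gain: {gain:+.2f}")
```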

Figures

Figures reproduced from arXiv: 2604.18936 by Frederic Sala, Kendrick M. Smith, Moritz Münchmeyer, Nathaniel S. Woodward, Yurii Kvasiuk, Zhiqi Gao.

Figure 1
Figure 1. Performance gains after fine-tuning DeepSeek-R1-Distill-Qwen-7B with RL on QFT Easy, compared to SFT on Qwen3-30B-A3B correct reasoning traces on QFT Easy.
Figure 2
Figure 2. A characteristic example of the Python code skeleton provided in a problem statement.
Figure 3
Figure 3. Task distributions across the three synthetic training datasets.
Figure 4
Figure 4. Schematic overview of the data curation pipeline. The workflow is divided into two tracks.
Figure 5
Figure 5. Easy QFT validation performance throughout RL training.
Figure 6
Figure 6. SFT on Qwen3-30B yields performance gains in validation accuracy; validation accuracy continues to grow even as the model begins to overtrain during SFT.
Figure 7
Figure 7. Performance of select open models compared on semi-synthetic (top) and synthetic (bottom) …
Figure 9
Figure 9. Comparison between the base model and the finetuned model.
Figure 10
Figure 10. The three-stage error analysis pipeline.
Figure 11
Figure 11. Representative examples of the four error categories identified by the pipeline.
Figure 12
Figure 12. Major error count and frequency (average number of major errors per incorrect rollout) …
Figure 13
Figure 13. Major error count and frequency (average number of major errors per incorrect rollout) …
Figure 14
Figure 14. Three-way major error comparison on the 10 problems appearing in both the RL and SFT top-20 …
Figure 15
Figure 15. Per-problem accuracy change (percentage points) for RL-finetuned vs. SFT, relative to the base model.
Figure 16
Figure 16. Trace length and backtracking frequency for the base and RL-finetuned models (80 problems, …)
Figure 17
Figure 17. Mean accuracy by domain level (defined by pedagogical level) for Easy, Medium, and Hard …
Figure 18
Figure 18. Qualitative comparison of problem difficulty: representative samples from the Easy (top) and Hard (bottom) subsets of the QFT dataset. Note the increase in complexity and requisite symbolic manipulations in the Hard sample.
Figure 19
Figure 19. Qualitative comparison of problem contextualization: the original textbook problem (top) relies heavily on external context, such as results from previous exercises, while the expanded synthetic version (bottom) formulates the same physical task as a rigorous, self-contained prompt. The problem received a Seed Correspondence Score of 95 (excellent) in the Gemini-3-pro quality grading.
Figure 20
Figure 20. Performance of select open models compared on human-adapted (top) and synthetic (bottom) …
Figure 21
Figure 21. Selected training metrics during RL on the Easy QFT dataset.
Figure 22
Figure 22. Response length (tokens) variation of correct CoT across the synthetic training datasets.
read the original abstract

Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and $\sim$200M tokens of QFT reasoning traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to conduct the first academic fine-tuning study of 7B-parameter reasoning models for Quantum Field Theory (QFT). It introduces a data generation pipeline producing over 2,500 synthetic verifiable QFT problems plus curated human-adapted problems from arXiv and textbooks, performs both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) experiments, benchmarks performance gains and generalization to other physics domains, analyzes evolution of chain-of-thought reasoning errors, and publicly releases the pipeline, training data, and ~200M tokens of reasoning traces.

Significance. If the reported gains and generalization hold under rigorous evaluation, the work would be significant for establishing a reproducible foundation for domain-specific fine-tuning in theoretical physics. The public release of the data pipeline, verifiable QFT problems, and reasoning traces is a clear strength that enables follow-on research and addresses the scarcity of open training resources in this area.

major comments (2)
  1. [§5] §5 (Experiments and Results): The benchmarking of performance gains and generalization to other physics domains is described at a high level but lacks concrete quantitative metrics (e.g., accuracy, pass rates, or error reductions with standard deviations), baseline comparisons, or statistical tests. This makes it impossible to assess whether observed improvements exceed what would be expected from pattern matching on the synthetic data.
  2. [§6] §6 (Chain-of-Thought Analysis): The analysis of how reasoning errors evolve during SFT and RL is primarily qualitative. Without quantitative categorization of error types (e.g., algebraic mistakes vs. conceptual misunderstandings) or before/after success rates on held-out problems, it is difficult to substantiate the claim that fine-tuning produces transferable QFT understanding rather than superficial adaptations to the training distribution.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., percentage improvement on QFT benchmarks) to allow readers to immediately gauge the magnitude of the reported gains.
  2. [Data Release] Ensure that the released dataset documentation explicitly describes the verification procedure for synthetic problems and any human review steps applied to arXiv-sourced material.
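The standard deviations and significance tests requested in major comment 1 need no new training runs; resampling per-problem correctness flags suffices. A percentile-bootstrap sketch using only the standard library (the 62/100 figure is invented for illustration):

```python
import random


def bootstrap_ci(flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy
    over per-problem 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(flags)
    means = sorted(
        sum(rng.choice(flags) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical: 62 of 100 held-out problems solved after fine-tuning.
flags = [1] * 62 + [0] * 38
lo, hi = bootstrap_ci(flags)
print(f"accuracy 0.62, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Non-overlapping intervals for the base and fine-tuned models would answer the referee's concern directly.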

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight opportunities to strengthen the quantitative rigor of our evaluation. We address each major comment below and commit to revisions that provide the requested metrics and analyses without altering the core claims of the work.

read point-by-point responses
  1. Referee: [§5] §5 (Experiments and Results): The benchmarking of performance gains and generalization to other physics domains is described at a high level but lacks concrete quantitative metrics (e.g., accuracy, pass rates, or error reductions with standard deviations), baseline comparisons, or statistical tests. This makes it impossible to assess whether observed improvements exceed what would be expected from pattern matching on the synthetic data.

    Authors: We acknowledge that the presentation in §5 is currently summarized at a high level. In the revised manuscript we will add explicit quantitative results, including accuracy and pass rates on held-out QFT problems, generalization metrics to other physics domains, comparisons against the base 7B model and additional baselines, standard deviations from repeated runs, and statistical significance tests. These additions will enable direct assessment of whether gains exceed pattern matching on the synthetic distribution. revision: yes

  2. Referee: [§6] §6 (Chain-of-Thought Analysis): The analysis of how reasoning errors evolve during SFT and RL is primarily qualitative. Without quantitative categorization of error types (e.g., algebraic mistakes vs. conceptual misunderstandings) or before/after success rates on held-out problems, it is difficult to substantiate the claim that fine-tuning produces transferable QFT understanding rather than superficial adaptations to the training distribution.

    Authors: We agree that the CoT analysis would be strengthened by quantitative support. We will expand §6 to include a categorized breakdown of error types with counts and percentages before and after fine-tuning, together with success rates on held-out problems. This will provide clearer evidence regarding the nature of the observed improvements. While some interpretive aspects of reasoning evolution remain qualitative, the added metrics will make the section substantially more rigorous. revision: partial
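The categorized error breakdown promised in this response reduces to a tally over annotated rollouts. A sketch; the category labels and rollout records are hypothetical, not taken from the paper's taxonomy:

```python
from collections import Counter


def major_error_frequency(rollouts: list) -> dict:
    """Average count of each major error category per incorrect rollout,
    matching the per-rollout frequency metric used in error analyses."""
    incorrect = [r for r in rollouts if not r["correct"]]
    counts = Counter(err for r in incorrect for err in r["errors"])
    n = max(len(incorrect), 1)  # avoid division by zero if all are correct
    return {category: c / n for category, c in counts.items()}


# Hypothetical annotated rollouts (labels invented for illustration).
rollouts = [
    {"correct": False, "errors": ["algebraic", "algebraic", "conceptual"]},
    {"correct": False, "errors": ["conceptual"]},
    {"correct": True, "errors": []},
]
print(major_error_frequency(rollouts))  # {'algebraic': 1.0, 'conceptual': 1.0}
```

Running the same tally before and after fine-tuning gives the quantitative before/after comparison the referee asks for.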

Circularity Check

0 steps flagged

No significant circularity in empirical fine-tuning study

full rationale

This is a purely empirical machine-learning paper describing a data generation pipeline for synthetic and human-adapted QFT problems, followed by SFT and RL fine-tuning experiments on 7B models, with benchmarking of performance gains, generalization to other domains, and CoT analysis. No mathematical derivations, equations, or first-principles results are claimed whose outputs are defined in terms of their own inputs or fitted parameters. Central claims rest on measured experimental outcomes and a public data release rather than internal self-definitions, self-citation chains, or renamed known results. The study is self-contained against external benchmarks (model performance before/after fine-tuning) with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the assumption that synthetic and curated QFT problems supply sufficient signal for genuine reasoning improvement rather than memorization, and that RL/SFT on this data produces measurable, generalizable gains; these are domain assumptions not independently verified in the provided abstract.

free parameters (1)
  • Fine-tuning hyperparameters (learning rate, batch size, RL reward scaling, etc.)
    Standard training choices that must be selected to achieve the reported performance; not numerically specified in the abstract.
axioms (1)
  • domain assumption Synthetic data generated by the pipeline can train transferable QFT reasoning capabilities
    The pipeline is presented as the key enabler, yet the abstract provides no external validation that the generated problems match the distribution of real QFT reasoning tasks.

pith-pipeline@v0.9.0 · 5522 in / 1544 out tokens · 55322 ms · 2026-05-10T03:19:52.354575+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

231 extracted references · 196 canonical work pages · 27 internal anchors
