pith · machine review for the scientific record

arxiv: 2504.21318 · v1 · submitted 2025-04-30 · 💻 cs.AI · cs.CL

Recognition: 2 Lean theorem links

Phi-4-reasoning Technical Report

Authors on Pith no claims yet

Pith reviewed 2026-05-17 03:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reasoning models · supervised fine-tuning · reinforcement learning · teachable prompts · benchmark evaluation · Phi-4 · o3-mini demonstrations

The pith

A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phi-4-reasoning as a 14B model obtained by supervised fine-tuning on a carefully chosen set of teachable prompts and reasoning traces generated by o3-mini. A second variant adds a short phase of outcome-based reinforcement learning to produce longer traces and higher scores. Both versions outperform several substantially larger open-weight models on math, coding, planning, algorithmic, and spatial tasks while showing some gains on general benchmarks as well. The work demonstrates that deliberate data selection for supervised fine-tuning can be extended to reasoning models and further strengthened by reinforcement learning.
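The supervised stage described above can be made concrete with a small sketch. This is not the paper's code: the toy tokenizer, the -100 ignore label, and the helper names are illustrative conventions, assuming the common practice of masking prompt tokens so that the loss is computed only on the teacher demonstration.

```python
# Hypothetical sketch of assembling SFT examples from teacher traces,
# with labels masked so the loss covers only the demonstration tokens.
# tokenize() and the token IDs are stand-ins, not the paper's pipeline.

IGNORE_INDEX = -100  # conventional "no loss" label in many trainers

def tokenize(text):
    # toy whitespace tokenizer standing in for a real one
    return [hash(tok) % 50000 for tok in text.split()]

def build_sft_example(prompt, teacher_trace):
    """Concatenate prompt + trace; supervise only the trace tokens."""
    prompt_ids = tokenize(prompt)
    trace_ids = tokenize(teacher_trace)
    input_ids = prompt_ids + trace_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + trace_ids
    return {"input_ids": input_ids, "labels": labels}

ex = build_sft_example(
    "Solve: what is 7 * 8?",
    "<think> 7 * 8 = 56 </think> The answer is 56.",
)
assert len(ex["input_ids"]) == len(ex["labels"])
```

The masking choice matters here: without it, the model would also be trained to reproduce the prompts, diluting the signal from the o3-mini demonstrations.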

Core claim

Phi-4-reasoning is produced by supervised fine-tuning of the Phi-4 base model on a curated collection of teachable prompts chosen for appropriate complexity and diversity together with detailed reasoning demonstrations generated by o3-mini. The plus variant receives an additional short phase of outcome-based reinforcement learning that encourages longer reasoning traces. Across benchmarks in mathematics, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, both models surpass significantly larger open-weight systems such as DeepSeek-R1-Distill-Llama-70B and approach the results of the full DeepSeek-R1 model, with observable transfer to general-purpose benchmarks as well.

What carries the argument

Curated teachable prompts of balanced complexity and diversity paired with o3-mini reasoning demonstrations for supervised fine-tuning, optionally followed by outcome-based reinforcement learning to lengthen traces.
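One plausible reading of "teachable prompts of balanced complexity" is a solve-rate band filter. The sketch below is an editorial illustration with invented thresholds and solve-rate estimates, not the paper's selection procedure.

```python
# Hedged illustration of "teachable prompt" selection: keep prompts whose
# estimated solve rate falls in a middle band (neither trivial nor hopeless).
# The band edges and the example solve rates are invented for this sketch.

def select_teachable(prompts_with_solve_rate, low=0.2, high=0.8):
    """Return prompts the base model sometimes solves: room to teach."""
    return [p for p, rate in prompts_with_solve_rate if low <= rate <= high]

candidates = [
    ("2 + 2 = ?", 0.99),                    # too easy: nothing to teach
    ("prove Fermat's Last Theorem", 0.0),   # too hard: traces won't help
    ("count lattice paths in a 6x6 grid", 0.45),  # teachable
]
print(select_teachable(candidates))  # → ['count lattice paths in a 6x6 grid']
```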

If this is right

  • Careful data curation for supervised fine-tuning extends to reasoning language models and produces measurable gains.
  • A short reinforcement-learning phase on top of the fine-tuned model further increases performance by lengthening reasoning traces.
  • Reasoning improvements transfer in a non-trivial way to general-purpose benchmarks.
  • Current evaluation practices leave room for improvement in measuring robustness of reasoning models.
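As a rough illustration of the reinforcement-learning point above: an outcome-based reward scores only the final answer, not the intermediate steps. The length bonus below is an invented shaping term for the sketch, not the scheme the report actually uses.

```python
# Sketch of an outcome-based reward of the kind the report describes:
# the signal depends on final-answer correctness, not on how the trace
# got there. The mild length bonus is a hypothetical shaping term.

def outcome_reward(final_answer, reference, trace_tokens, max_len=4096):
    correct = final_answer.strip() == reference.strip()
    base = 1.0 if correct else -1.0
    # invented shaping: small bonus for using more of the token budget
    # when correct, so the policy is not pushed toward curt answers
    bonus = 0.1 * min(trace_tokens / max_len, 1.0) if correct else 0.0
    return base + bonus

assert outcome_reward("56", "56", 1024) > outcome_reward("55", "56", 1024)
```

Because the reward is computed only at the end of a rollout, longer traces are never directly penalized, which is consistent with the plus variant producing longer reasoning.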

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted data curation may allow smaller models to close performance gaps with larger ones without relying primarily on scale.
  • Similar curation strategies could be tested on other base models or applied to domains beyond the current benchmarks.
  • Longer traces encouraged by reinforcement learning may require new evaluation protocols that account for trace length and consistency.

Load-bearing premise

Performance gains arise chiefly from the selection of teachable prompts and the o3-mini demonstrations rather than from hidden properties of the base model or from evaluation overlap.
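The evaluation-overlap half of this premise can at least be probed mechanically. The sketch below uses n-gram overlap between training prompts and a benchmark item; it is a generic contamination heuristic, not the paper's decontamination protocol.

```python
# Crude contamination probe: fraction of a benchmark item's n-grams that
# appear anywhere in the training prompts. Illustrative only; real
# decontamination pipelines normalize text and use larger corpora.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(train_texts, eval_text, n=8):
    """Fraction of eval n-grams that appear anywhere in the train set."""
    train = set().union(*(ngrams(t, n) for t in train_texts)) if train_texts else set()
    evals = ngrams(eval_text, n)
    if not evals:
        return 0.0
    return len(evals & train) / len(evals)
```

A high overlap fraction on a benchmark item would undercut the premise; a low one leaves the curation explanation standing.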

What would settle it

Retraining the same base model on randomly selected prompts of comparable length and then re-evaluating on the identical benchmark suite yields scores within a few points of the curated version.
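That settling experiment reduces to a paired per-benchmark comparison against a tolerance. A minimal sketch, with made-up scores and an assumed three-point tolerance:

```python
# Shape of the proposed check: per-benchmark scores for the curated-data
# model vs. a random-prompt control, judged against a tolerance. All
# numbers below are invented placeholders, not reported results.

def within_tolerance(curated, control, tol=3.0):
    """True iff every shared benchmark differs by at most `tol` points."""
    return all(abs(curated[b] - control[b]) <= tol for b in curated if b in control)

curated = {"math": 78.0, "coding": 64.0}
control = {"math": 61.0, "coding": 60.5}
print(within_tolerance(curated, control))  # → False: curation mattered
```

If the check came back True, the load-bearing premise above would be in trouble; False would support the curation story.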

read the original abstract

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of "teachable" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Phi-4-reasoning, a 14-billion parameter reasoning model obtained by supervised fine-tuning of the Phi-4 base model on a curated set of 'teachable' prompts and reasoning demonstrations generated by o3-mini. It also presents Phi-4-reasoning-plus, which adds a short phase of outcome-based reinforcement learning to produce longer reasoning traces. The central claim is that both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model across benchmarks in math, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, with non-trivial transfer to general-purpose benchmarks. The report supplies insights into the training data, methodologies, and evaluations, arguing that careful data curation for SFT can be amplified by RL.

Significance. If the reported results hold after addressing the points below, the work would be significant for showing that high-quality, curated data for SFT (plus limited RL) can produce competitive reasoning performance in a 14B model, reducing reliance on scale alone. The multi-domain evaluation and observed transfer effects are valuable, and the explicit discussion of training insights aids reproducibility and understanding of reasoning model development.

major comments (2)
  1. [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.
  2. [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.
minor comments (2)
  1. [Introduction] The term 'teachable' prompts is introduced without a precise operational definition or concrete examples in the main text; adding these would improve clarity.
  2. [Evaluations] Ensure all tables and figures include self-contained captions and axis labels so that results can be interpreted without cross-referencing the surrounding prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodology.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative details. In the revised manuscript we will update the abstract to report key numerical results (e.g., average scores on math, coding, and algorithmic benchmarks) showing the 14B models outperforming the 70B baseline and approaching the full DeepSeek-R1 model, while explicitly referencing the full evaluation tables and contamination controls described in the Evaluations section. revision: yes

  2. Referee: [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.

    Authors: We acknowledge the importance of this ablation for isolating the contribution of our data curation and training. We have evaluated the unmodified Phi-4 base model on the identical reasoning benchmarks. The base model exhibits substantially lower performance than both Phi-4-reasoning and Phi-4-reasoning-plus. In the revised manuscript we will add these results to the Evaluations section, together with a brief discussion of how the curated SFT data and RL phase drive the observed improvements beyond base-model capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training report with external benchmark comparisons

full rationale

The paper describes supervised fine-tuning of the base Phi-4 model on curated teachable prompts paired with o3-mini demonstrations, followed by optional outcome-based RL for the plus variant. Performance claims rest on direct evaluations across standard math, coding, planning, and general benchmarks, with explicit comparisons to independent external models (DeepSeek-R1-Distill-Llama-70B and full DeepSeek-R1). No equations, fitted parameters, or first-principles derivations appear that could reduce reported gains to self-definitions or tautologies by construction. Self-references to the prior Phi-4 base model are present but not load-bearing; the central results are measured outcomes on external benchmarks rather than any renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claim rests on the unstated assumption that the selected training prompts and o3-mini traces constitute an effective and generalizable teaching signal; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5646 in / 1111 out tokens · 38953 ms · 2026-05-17T03:35:32.502533+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.

  2. When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    Pre-trained models are added late in projects, accumulate rather than get replaced, and change three times less often than libraries, with distinct documentation driven by capability needs and testing uncertainty.

  3. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  4. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  5. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  6. ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

    cs.SE 2026-04 unverdicted novelty 6.0

    ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.

  7. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  8. Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    cs.LG 2026-02 unverdicted novelty 6.0

    Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

  9. Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

    cs.CL 2026-01 unverdicted novelty 6.0

    RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.

  10. The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

    cs.LG 2026-01 unverdicted novelty 6.0

    TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

  11. Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

    cs.CL 2026-01 conditional novelty 6.0

    Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.

  12. TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning

    cs.IR 2026-05 unverdicted novelty 5.0

    TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.

  13. Chain-of-Thought Reasoning Enhances In-Context Learning for LLM-Based Mobile Traffic Prediction

    cs.NI 2026-05 unverdicted novelty 5.0

    Chain-of-thought reasoning with plan-based demonstrations and similarity retrieval improves LLM mobile traffic prediction accuracy by up to 15% over standard in-context learning on real 5G data.

  14. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    MathArena is a maintained platform evaluating LLMs across olympiad problems, proofs, research questions, and formal proofs, with GPT-5.5 reaching 98% on 2026 USAMO and 74% on research-level tasks.

  15. Ranking Reasoning LLMs under Test-Time Scaling

    cs.LG 2026-03 accept novelty 5.0

    Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best met...

  16. Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

    cs.CL 2026-04 accept novelty 4.0

    Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.

  17. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 17 Pith papers · 20 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

  3. [3]

    KITAB: evaluating llms on constraint satisfaction for information retrieval

    Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yüksekgönül, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. KITAB: evaluating llms on constraint satisfaction for information retrieval. In International Conference on Learning Representations, 2024

  4. [4]

    Aime 83-24

    AIME. Aime 83-24. https://huggingface.co/datasets/lchen001/AIME1983_2024, 2024. Accessed: 2025-03-17

  5. [5]

    Aime 2025

    AIME. Aime 2025. https://huggingface.co/datasets/lchen001/AIME2025, 2025. Accessed: 2025-03-17

  6. [6]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

  7. [7]

    Claude 3.7 Sonnet

    Anthropic. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2025-03-17

  8. [8]

    Chain-of-Thought Reasoning in the Wild Is Not Always Faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint arXiv:2503.08679, 2025

  9. [9]

    Eureka: Evaluating and Understanding Large Foundation Models

    Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models.arXiv preprint arXiv:2409.10566, 2024

  10. [10]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, and Safoora Yousefi. Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025. URLhttps://arxiv.org/abs/2504.00294

  11. [11]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URLhttps://matharena.ai/

  12. [12]

    Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs

    Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W Duncan Wadsworth, and Hanna Wallach. Designing disaggregated evaluations of ai systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 368–378, 2021

  13. [13]

    BenchAgents: Automated Benchmark Creation with Agent Interaction

    Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. Benchagents: Automated benchmark creation with agent interaction. arXiv preprint arXiv:2410.22584, 2024

  14. [14]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

  15. [15]

    Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

    Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: what works and what doesn't. URL https://arxiv.org/abs/2503.16219

  17. [17]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. ICLR, 2025

  18. [18]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  19. [19]

    Gemini flash thinking

    Google. Gemini flash thinking. https://deepmind.google/technologies/gemini/flash/, 2025. Accessed: 2025-03-17

  20. [20]

    rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519, 2025

  21. [21]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644, 2023

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  23. [23]

    Computers and Intractability: A Guide to the Theory of NP-Completeness (Michael R. Garey and David S. Johnson)

    Juris Hartmanis. Computers and intractability: a guide to the theory of np-completeness (michael r. garey and david s. johnson). Siam Review, 24(1):90, 1982

  24. [24]

    ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326. Association for Computational L...

  25. [25]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

  26. [26]

    A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility

    Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086, 2025

  27. [27]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  29. [29]

    Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel ...

  30. [30]

    Safechain: Safety of language models with long chain-of-thought reasoning capabilities

    Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025, 2025

  31. [31]

    Same task, more tokens: the impact of input length on the reasoning performance of large language models

    Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. InACL, 2024

  32. [32]

    EXAONE Deep: Reasoning Enhanced Language Models

    LG AI Research. Exaone deep: Reasoning enhanced language models. arXiv preprint arXiv:2503.12524, 2025

  33. [33]

    Functional interpolation for relative positions improves long context transformers

    Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. arXiv preprint arXiv:2310.04418, 2023

  34. [34]

    From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024

  35. [35]

    Limr: Less is more for rl scaling, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025

  36. [36]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023. URL https://arxiv.org/abs/2305.01210

  37. [37]

    DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

  38. [38]

    A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

    Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, and Mei Chen. A framework for automated measurement of responsible ai harms in generative ai applications. URL https://arxiv.org/abs/2310.17750

  40. [40]

    Orca 2: Teaching small language models how to reason

    Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023

  41. [41]

    Agentinstruct: Toward generative teaching with agentic flows

    Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, et al. Agentinstruct: Toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502, 2024

  42. [42]

    Unearthing skill-level insights for understanding trade-offs of foundation models

    Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, and Vibhav Vineet. Unearthing skill-level insights for understanding trade-offs of foundation models. ICLR, 2025

  43. [43]

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

  44. [44]

    Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure

    Besmira Nushi, Ece Kamar, and Eric Horvitz. Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 6, pages 126–135, 2018

  45. [45]

    Openai o3-mini system card

    OpenAI. Openai o3-mini system card. https://openai.com/index/o3-mini-system-card/, 2025. Accessed: 2025-03-17

  46. [46]

    Computational complexity

    Christos H Papadimitriou. Computational complexity. In Encyclopedia of computer science, pages 260–265. John Wiley and Sons Ltd., 2003

  47. [47]

    Overreliance on AI literature review

    Samir Passi and Mihaela Vorvoreanu. Overreliance on AI literature review. Microsoft Research, 339:340, 2022

  48. [48]

    Proof or bluff? evaluating llms on 2025 usa math olympiad

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? evaluating llms on 2025 usa math olympiad. arXiv preprint arXiv:2503.21934, 2025

  49. [49]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  50. [50]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  51. [51]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

  52. [52]

    Language Models are Multilingual Chain-of-Thought Reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022. URL https://arxiv.org/abs/2210.03057

  53. [53]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

  54. [54]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  55. [55]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  56. [56]

    QwQ-32B: Embracing the Power of Reinforcement Learning

    Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  57. [57]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388

  58. [58]

    Is a picture worth a thousand words? Delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models. Advances in Neural Information Processing Systems, 37:75392–75421, 2024

  59. [59]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574

  60. [60]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  61. [61]

    On the emergence of thinking in llms i: Searching for the right intuition

    Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, and Huseyin A Inan. On the emergence of thinking in llms i: Searching for the right intuition. arXiv preprint arXiv:2502.06773, 2025

  62. [62]

    Demystifying long chain-of-thought reasoning in llms

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025

  63. [63]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  64. [64]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

A Benchmarking Details

    Model                  temp.  max token  reasoning
    Phi-4 [2]              0.8 †  4,096      n
    Phi-4-reasoning        0.8    32,768 ∗   y
    Phi-4-reasoning-plus   0.8    32,768 ∗   y
    Dee...