pith · machine review for the scientific record

arxiv: 2504.21318 · v1 · submitted 2025-04-30 · 💻 cs.AI · cs.CL

Recognition: 2 Lean theorem links

Phi-4-reasoning Technical Report

Authors on Pith no claims yet

Pith reviewed 2026-05-17 03:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reasoning models · supervised fine-tuning · reinforcement learning · teachable prompts · benchmark evaluation · Phi-4 · o3-mini demonstrations

The pith

A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phi-4-reasoning as a 14B model obtained by supervised fine-tuning on a carefully chosen set of teachable prompts and reasoning traces generated by o3-mini. A second variant adds a short phase of outcome-based reinforcement learning to produce longer traces and higher scores. Both versions outperform several substantially larger open-weight models on math, coding, planning, algorithmic, and spatial tasks while showing some gains on general benchmarks as well. The work demonstrates that deliberate data selection for supervised fine-tuning can be extended to reasoning models and further strengthened by reinforcement learning.
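The supervised stage described above can be made concrete with a small sketch. This is not the paper's code: the toy tokenizer, the -100 ignore label, and the helper names are illustrative conventions, assuming the common practice of masking prompt tokens so that the loss is computed only on the teacher demonstration.

```python
# Hypothetical sketch of assembling SFT examples from teacher traces,
# with labels masked so the loss covers only the demonstration tokens.
# tokenize() and the token IDs are stand-ins, not the paper's pipeline.

IGNORE_INDEX = -100  # conventional "no loss" label in many trainers

def tokenize(text):
    # toy whitespace tokenizer standing in for a real one
    return [hash(tok) % 50000 for tok in text.split()]

def build_sft_example(prompt, teacher_trace):
    """Concatenate prompt + trace; supervise only the trace tokens."""
    prompt_ids = tokenize(prompt)
    trace_ids = tokenize(teacher_trace)
    input_ids = prompt_ids + trace_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + trace_ids
    return {"input_ids": input_ids, "labels": labels}

ex = build_sft_example(
    "Solve: what is 7 * 8?",
    "<think> 7 * 8 = 56 </think> The answer is 56.",
)
assert len(ex["input_ids"]) == len(ex["labels"])
```

The masking choice matters here: without it, the model would also be trained to reproduce the prompts, diluting the signal from the o3-mini demonstrations.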

Core claim

Phi-4-reasoning is produced by supervised fine-tuning of the Phi-4 base model on a curated collection of teachable prompts chosen for appropriate complexity and diversity together with detailed reasoning demonstrations generated by o3-mini. The plus variant receives an additional short phase of outcome-based reinforcement learning that encourages longer reasoning traces. Across benchmarks in mathematics, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, both models surpass significantly larger open-weight systems such as DeepSeek-R1-Distill-Llama-70B and approach the results of the full DeepSeek-R1 model, with observable transfer to general-purpose benchmarks as well.

What carries the argument

Curated teachable prompts of balanced complexity and diversity paired with o3-mini reasoning demonstrations for supervised fine-tuning, optionally followed by outcome-based reinforcement learning to lengthen traces.
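One plausible reading of "teachable prompts of balanced complexity" is a solve-rate band filter. The sketch below is an editorial illustration with invented thresholds and solve-rate estimates, not the paper's selection procedure.

```python
# Hedged illustration of "teachable prompt" selection: keep prompts whose
# estimated solve rate falls in a middle band (neither trivial nor hopeless).
# The band edges and the example solve rates are invented for this sketch.

def select_teachable(prompts_with_solve_rate, low=0.2, high=0.8):
    """Return prompts the base model sometimes solves: room to teach."""
    return [p for p, rate in prompts_with_solve_rate if low <= rate <= high]

candidates = [
    ("2 + 2 = ?", 0.99),                    # too easy: nothing to teach
    ("prove Fermat's Last Theorem", 0.0),   # too hard: traces won't help
    ("count lattice paths in a 6x6 grid", 0.45),  # teachable
]
print(select_teachable(candidates))  # → ['count lattice paths in a 6x6 grid']
```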

If this is right

  • Careful data curation for supervised fine-tuning extends to reasoning language models and produces measurable gains.
  • A short reinforcement-learning phase on top of the fine-tuned model further increases performance by lengthening reasoning traces.
  • Reasoning improvements transfer in a non-trivial way to general-purpose benchmarks.
  • Current evaluation practices leave room for improvement in measuring robustness of reasoning models.
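As a rough illustration of the reinforcement-learning point above: an outcome-based reward scores only the final answer, not the intermediate steps. The length bonus below is an invented shaping term for the sketch, not the scheme the report actually uses.

```python
# Sketch of an outcome-based reward of the kind the report describes:
# the signal depends on final-answer correctness, not on how the trace
# got there. The mild length bonus is a hypothetical shaping term.

def outcome_reward(final_answer, reference, trace_tokens, max_len=4096):
    correct = final_answer.strip() == reference.strip()
    base = 1.0 if correct else -1.0
    # invented shaping: small bonus for using more of the token budget
    # when correct, so the policy is not pushed toward curt answers
    bonus = 0.1 * min(trace_tokens / max_len, 1.0) if correct else 0.0
    return base + bonus

assert outcome_reward("56", "56", 1024) > outcome_reward("55", "56", 1024)
```

Because the reward is computed only at the end of a rollout, longer traces are never directly penalized, which is consistent with the plus variant producing longer reasoning.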

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted data curation may allow smaller models to close performance gaps with larger ones without relying primarily on scale.
  • Similar curation strategies could be tested on other base models or applied to domains beyond the current benchmarks.
  • Longer traces encouraged by reinforcement learning may require new evaluation protocols that account for trace length and consistency.

Load-bearing premise

Performance gains arise chiefly from the selection of teachable prompts and the o3-mini demonstrations rather than from hidden properties of the base model or from evaluation overlap.
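The evaluation-overlap half of this premise can at least be probed mechanically. The sketch below uses n-gram overlap between training prompts and a benchmark item; it is a generic contamination heuristic, not the paper's decontamination protocol.

```python
# Crude contamination probe: fraction of a benchmark item's n-grams that
# appear anywhere in the training prompts. Illustrative only; real
# decontamination pipelines normalize text and use larger corpora.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(train_texts, eval_text, n=8):
    """Fraction of eval n-grams that appear anywhere in the train set."""
    train = set().union(*(ngrams(t, n) for t in train_texts)) if train_texts else set()
    evals = ngrams(eval_text, n)
    if not evals:
        return 0.0
    return len(evals & train) / len(evals)
```

A high overlap fraction on a benchmark item would undercut the premise; a low one leaves the curation explanation standing.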

What would settle it

Retraining the same base model on randomly selected prompts of comparable length and then re-evaluating on the identical benchmark suite yields scores within a few points of the curated version.
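That settling experiment reduces to a paired per-benchmark comparison against a tolerance. A minimal sketch, with made-up scores and an assumed three-point tolerance:

```python
# Shape of the proposed check: per-benchmark scores for the curated-data
# model vs. a random-prompt control, judged against a tolerance. All
# numbers below are invented placeholders, not reported results.

def within_tolerance(curated, control, tol=3.0):
    """True iff every shared benchmark differs by at most `tol` points."""
    return all(abs(curated[b] - control[b]) <= tol for b in curated if b in control)

curated = {"math": 78.0, "coding": 64.0}
control = {"math": 61.0, "coding": 60.5}
print(within_tolerance(curated, control))  # → False: curation mattered
```

If the check came back True, the load-bearing premise above would be in trouble; False would support the curation story.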

read the original abstract

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of "teachable" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Phi-4-reasoning, a 14-billion parameter reasoning model obtained by supervised fine-tuning of the Phi-4 base model on a curated set of 'teachable' prompts and reasoning demonstrations generated by o3-mini. It also presents Phi-4-reasoning-plus, which adds a short phase of outcome-based reinforcement learning to produce longer reasoning traces. The central claim is that both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model across benchmarks in math, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, with non-trivial transfer to general-purpose benchmarks. The report supplies insights into the training data, methodologies, and evaluations, arguing that careful data curation for SFT can be amplified by RL.

Significance. If the reported results hold after addressing the points below, the work would be significant for showing that high-quality, curated data for SFT (plus limited RL) can produce competitive reasoning performance in a 14B model, reducing reliance on scale alone. The multi-domain evaluation and observed transfer effects are valuable, and the explicit discussion of training insights aids reproducibility and understanding of reasoning model development.

major comments (2)
  1. [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.
  2. [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.
minor comments (2)
  1. [Introduction] The term 'teachable' prompts is introduced without a precise operational definition or concrete examples in the main text; adding these would improve clarity.
  2. [Evaluations] Ensure all tables and figures include self-contained captions and axis labels so that results can be interpreted without cross-referencing the surrounding prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodology.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative details. In the revised manuscript we will update the abstract to report key numerical results (e.g., average scores on math, coding, and algorithmic benchmarks) showing the 14B models outperforming the 70B baseline and approaching the full DeepSeek-R1 model, while explicitly referencing the full evaluation tables and contamination controls described in the Evaluations section. revision: yes

  2. Referee: [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.

    Authors: We acknowledge the importance of this ablation for isolating the contribution of our data curation and training. We have evaluated the unmodified Phi-4 base model on the identical reasoning benchmarks. The base model exhibits substantially lower performance than both Phi-4-reasoning and Phi-4-reasoning-plus. In the revised manuscript we will add these results to the Evaluations section, together with a brief discussion of how the curated SFT data and RL phase drive the observed improvements beyond base-model capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training report with external benchmark comparisons

full rationale

The paper describes supervised fine-tuning of the base Phi-4 model on curated teachable prompts paired with o3-mini demonstrations, followed by optional outcome-based RL for the plus variant. Performance claims rest on direct evaluations across standard math, coding, planning, and general benchmarks, with explicit comparisons to independent external models (DeepSeek-R1-Distill-Llama-70B and full DeepSeek-R1). No equations, fitted parameters, or first-principles derivations appear that could reduce reported gains to self-definitions or tautologies by construction. Self-references to the prior Phi-4 base model are present but not load-bearing; the central results are measured outcomes on external benchmarks rather than any renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claim rests on the unstated assumption that the selected training prompts and o3-mini traces constitute an effective and generalizable teaching signal; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5646 in / 1111 out tokens · 38953 ms · 2026-05-17T03:35:32.502533+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.

  2. When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    Pre-trained models are added late in projects, accumulate rather than get replaced, and change three times less often than libraries, with distinct documentation driven by capability needs and testing uncertainty.

  3. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  4. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  5. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  6. ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

    cs.SE 2026-04 unverdicted novelty 6.0

    ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.

  7. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  8. Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    cs.LG 2026-02 unverdicted novelty 6.0

    Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

  9. Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

    cs.CL 2026-01 unverdicted novelty 6.0

    RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.

  10. The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

    cs.LG 2026-01 unverdicted novelty 6.0

    TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

  11. Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

    cs.CL 2026-01 conditional novelty 6.0

    Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.

  12. TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning

    cs.IR 2026-05 unverdicted novelty 5.0

    TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.

  13. Chain-of-Thought Reasoning Enhances In-Context Learning for LLM-Based Mobile Traffic Prediction

    cs.NI 2026-05 unverdicted novelty 5.0

    Chain-of-thought reasoning with plan-based demonstrations and similarity retrieval improves LLM mobile traffic prediction accuracy by up to 15% over standard in-context learning on real 5G data.

  14. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    MathArena is a maintained platform evaluating LLMs across olympiad problems, proofs, research questions, and formal proofs, with GPT-5.5 reaching 98% on 2026 USAMO and 74% on research-level tasks.

  15. Ranking Reasoning LLMs under Test-Time Scaling

    cs.LG 2026-03 accept novelty 5.0

    Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best met...

  16. Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

    cs.CL 2026-04 accept novelty 4.0

    Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.

  17. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 17 Pith papers · 20 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

  3. [3]

    KITAB: evaluating llms on constraint satisfaction for information retrieval

    Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yüksekgönül, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. KITAB: evaluating llms on constraint satisfaction for information retrieval. In International Conference on Learning Representations, 2024

  4. [4]

    Aime 83-24

    AIME. Aime 83-24. https://huggingface.co/datasets/lchen001/AIME1983_2024, 2024. Accessed: 2025-03-17

  5. [5]

    Aime 2025

    AIME. Aime 2025. https://huggingface.co/datasets/lchen001/AIME2025, 2025. Accessed: 2025-03-17

  6. [6]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

  7. [7]

    Claude 3.7 Sonnet

    Anthropic. Claude 3.7 sonnet. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2025-03-17

  8. [8]

    Chain-of-Thought Reasoning in the Wild Is Not Always Faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint arXiv:2503.08679, 2025

  9. [9]

    Eureka: Evaluating and Understanding Large Foundation Models

    Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models.arXiv preprint arXiv:2409.10566, 2024

  10. [10]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, and Safoora Yousefi. Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025. URLhttps://arxiv.org/abs/2504.00294

  11. [11]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URLhttps://matharena.ai/

  12. [12]

    Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs

    Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W Duncan Wadsworth, and Hanna Wallach. Designing disaggregated evaluations of ai systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 368–378, 2021

  13. [13]

    BenchAgents: Automated Benchmark Creation with Agent Interaction

    Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. Benchagents: Automated benchmark creation with agent interaction. arXiv preprint arXiv:2410.22584, 2024

  14. [14]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

  15. [15]

    Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

    Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: what works and what doesn't. URL https://arxiv.org/abs/2503.16219

  17. [17]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. ICLR, 2025

  18. [18]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  19. [19]

    Gemini flash thinking

    Google. Gemini flash thinking. https://deepmind.google/technologies/gemini/flash/, 2025. Accessed: 2025-03-17

  20. [20]

    rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519, 2025

  21. [21]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644, 2023

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  23. [23]

    Computers and Intractability: A Guide to the Theory of NP-Completeness (Michael R. Garey and David S. Johnson)

    Juris Hartmanis. Computers and intractability: a guide to the theory of np-completeness (michael r. garey and david s. johnson). Siam Review, 24(1):90, 1982

  24. [24]

    ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326. Association for Computational L...

  25. [25]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

  26. [26]

    A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility

    Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086, 2025

  27. [27]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  29. [29]

    Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel ...

  30. [30]

    Safechain: Safety of language models with long chain-of-thought reasoning capabilities

    Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025, 2025

  31. [31]

    Same task, more tokens: the impact of input length on the reasoning performance of large language models

    Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. InACL, 2024

  32. [32]

    EXAONE Deep: Reasoning Enhanced Language Models

    LG AI Research. Exaone deep: Reasoning enhanced language models. arXiv preprint arXiv:2503.12524, 2025

  33. [33]

    Functional interpolation for relative positions improves long context transformers

    Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. arXiv preprint arXiv:2310.04418, 2023

  34. [34]

    From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024

  35. [35]

    Limr: Less is more for rl scaling, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025

  36. [36]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023. URL https://arxiv.org/abs/2305.01210

  37. [37]

    DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

  38. [38]

    A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

    Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, and Mei Chen. A framework for automated measurement of responsible ai harms in generative ai applications. URL https://arxiv.org/abs/2310.17750

  40. [40]

    Orca 2: Teaching small language models how to reason

    Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023

  41. [41]

    Agentinstruct: Toward generative teaching with agentic flows

    Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, et al. Agentinstruct: Toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502, 2024

  42. [42]

    Unearthing skill-level insights for understanding trade-offs of foundation models

    Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, and Vibhav Vineet. Unearthing skill-level insights for understanding trade-offs of foundation models. ICLR, 2025

  43. [43]

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

  44. [44]

    Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure

    Besmira Nushi, Ece Kamar, and Eric Horvitz. Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 6, pages 126–135, 2018

  45. [45]

    Openai o3-mini system card

    OpenAI. Openai o3-mini system card. https://openai.com/index/o3-mini-system-card/, 2025. Accessed: 2025-03-17

  46. [46]

    Computational complexity

    Christos H Papadimitriou. Computational complexity. In Encyclopedia of computer science, pages 260–265. John Wiley and Sons Ltd., 2003

  47. [47]

    Overreliance on AI literature review

    Samir Passi and Mihaela Vorvoreanu. Overreliance on AI literature review. Microsoft Research, 339:340, 2022

  48. [48]

    Proof or bluff? evaluating llms on 2025 usa math olympiad

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? evaluating llms on 2025 usa math olympiad. arXiv preprint arXiv:2503.21934, 2025

  49. [49]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  50. [50]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  51. [51]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

  52. [52]

    Language Models are Multilingual Chain-of-Thought Reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022. URL https://arxiv.org/abs/2210.03057

  53. [53]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

  54. [54]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  55. [55]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  56. [56]

    QwQ-32B: Embracing the Power of Reinforcement Learning

    Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  57. [57]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388

  58. [58]

    Is a picture worth a thousand words? Delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models. Advances in Neural Information Processing Systems, 37:75392–75421, 2024

  59. [59]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574

  60. [60]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  61. [61]

    On the emergence of thinking in llms i: Searching for the right intuition

    Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, and Huseyin A Inan. On the emergence of thinking in llms i: Searching for the right intuition. arXiv preprint arXiv:2502.06773, 2025

  62. [62]

    Demystifying long chain-of-thought reasoning in llms

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025

  63. [63]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  64. [64]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

A Benchmarking Details

    Model                  temp.  max token  reasoning
    Phi-4 [2]              0.8 †  4,096      n
    Phi-4-reasoning        0.8    32,768 ∗   y
    Phi-4-reasoning-plus   0.8    32,768 ∗   y
    Dee...