Recognition: 2 theorem links · Lean Theorem
Phi-4-reasoning Technical Report
Pith reviewed 2026-05-17 03:35 UTC · model grok-4.3
The pith
A 14-billion parameter model trained on curated teachable prompts and o3-mini demonstrations reaches performance levels of much larger reasoning systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phi-4-reasoning is produced by supervised fine-tuning of the Phi-4 base model on a curated collection of teachable prompts, chosen for appropriate complexity and diversity, together with detailed reasoning demonstrations generated by o3-mini. The plus variant receives an additional short phase of outcome-based reinforcement learning that encourages longer reasoning traces. Across benchmarks in mathematics, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, both models surpass significantly larger open-weight systems such as DeepSeek-R1-Distill-Llama-70B and approach the results of the full DeepSeek-R1 model, with observable transfer to general-purpose benchmarks.
What carries the argument
Curated teachable prompts of balanced complexity and diversity paired with o3-mini reasoning demonstrations for supervised fine-tuning, optionally followed by outcome-based reinforcement learning to lengthen traces.
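As a rough illustration of the selection idea, here is a minimal sketch in which "teachable" is proxied by the base model's pass rate on each prompt. The proxy, the thresholds, and the example prompts are assumptions for illustration only, not the paper's actual curation criteria.

```python
# Minimal sketch of "teachable prompt" selection as described above.
# The difficulty proxy (base-model pass rate) and the thresholds are
# illustrative assumptions, not Phi-4-reasoning's actual pipeline.

def is_teachable(base_pass_rate, lo=0.2, hi=0.8):
    """A prompt is treated as teachable when the base model sometimes,
    but not always, solves it: trivial prompts teach nothing, and
    hopeless ones provide no usable learning signal."""
    return lo <= base_pass_rate <= hi

def curate(prompts_with_rates):
    """Keep only prompts whose difficulty falls in the teachable band."""
    return [p for p, rate in prompts_with_rates if is_teachable(rate)]

candidates = [
    ("trivial arithmetic", 0.99),    # too easy: filtered out
    ("AIME-style geometry", 0.45),   # in the teachable band: kept
    ("open research problem", 0.01), # too hard: filtered out
]
print(curate(candidates))  # → ['AIME-style geometry']
```

The point of the sketch is only that curation is a filter over a difficulty signal; how the paper actually estimates complexity and diversity is described in its training-data section.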
If this is right
- Careful data curation for supervised fine-tuning extends to reasoning language models and produces measurable gains.
- A short reinforcement-learning phase on top of the fine-tuned model further increases performance by lengthening reasoning traces.
- Reasoning improvements transfer in a non-trivial way to general-purpose benchmarks.
- Current evaluation practices leave room for improvement in measuring robustness of reasoning models.
Where Pith is reading between the lines
- Targeted data curation may allow smaller models to close performance gaps with larger ones without relying primarily on scale.
- Similar curation strategies could be tested on other base models or applied to domains beyond the current benchmarks.
- Longer traces encouraged by reinforcement learning may require new evaluation protocols that account for trace length and consistency.
Load-bearing premise
Performance gains arise chiefly from the selection of teachable prompts and the o3-mini demonstrations rather than from hidden properties of the base model or from evaluation overlap.
What would settle it
Retraining the same base model on randomly selected prompts of comparable length and then re-evaluating on the identical benchmark suite yields scores within a few points of the curated version.
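A sketch of how that comparison might be scored, with made-up per-benchmark numbers standing in for the curated-prompt and random-prompt runs; a bootstrap confidence interval on the per-benchmark gap indicates whether the difference plausibly exceeds "a few points".

```python
# Sketch of the settling experiment above: retrain on random prompts,
# re-evaluate on the same suite, and check the curated-vs-random gap.
# All scores below are made-up placeholders, not reported results.
import random

def mean_gap(curated, rand):
    """Mean per-benchmark score difference (curated minus random)."""
    return sum(c - r for c, r in zip(curated, rand)) / len(curated)

def bootstrap_gap_ci(curated, rand, n_boot=10000, seed=0):
    """95% bootstrap confidence interval on the mean score gap."""
    rng = random.Random(seed)
    pairs = list(zip(curated, rand))
    gaps = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        gaps.append(sum(c - r for c, r in sample) / len(sample))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]

curated_scores = [82.5, 71.0, 65.3, 90.1]  # hypothetical per-benchmark scores
random_scores  = [75.2, 66.4, 60.9, 85.0]
lo, hi = bootstrap_gap_ci(curated_scores, random_scores)
print(mean_gap(curated_scores, random_scores), (lo, hi))
```

If the interval sat within a few points of zero, the curation hypothesis would be weakened; an interval well above zero would support it.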
read the original abstract
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of "teachable" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Phi-4-reasoning, a 14-billion parameter reasoning model obtained by supervised fine-tuning of the Phi-4 base model on a curated set of 'teachable' prompts and reasoning demonstrations generated by o3-mini. It also presents Phi-4-reasoning-plus, which adds a short phase of outcome-based reinforcement learning to produce longer reasoning traces. The central claim is that both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model across benchmarks in math, scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding, with non-trivial transfer to general-purpose benchmarks. The report supplies insights into the training data, methodologies, and evaluations, arguing that careful data curation for SFT can be amplified by RL.
Significance. If the reported results hold after addressing the points below, the work would be significant for showing that high-quality, curated data for SFT (plus limited RL) can produce competitive reasoning performance in a 14B model, reducing reliance on scale alone. The multi-domain evaluation and observed transfer effects are valuable, and the explicit discussion of training insights aids reproducibility and understanding of reasoning model development.
major comments (2)
- [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.
- [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.
minor comments (2)
- [Introduction] The term 'teachable' prompts is introduced without a precise operational definition or concrete examples in the main text; adding these would improve clarity.
- [Evaluations] Ensure all tables and figures include self-contained captions and axis labels so that results can be interpreted without cross-referencing the surrounding prose.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methodology.
read point-by-point responses
-
Referee: [Abstract] The headline claim that the models 'outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model' is stated without any numerical scores, error bars, specific benchmark lists, or controls for data contamination. This leaves the central comparative result unverified from the abstract alone and requires the evaluations section to supply the missing quantitative details.
Authors: We agree that the abstract would be strengthened by including concrete quantitative details. In the revised manuscript we will update the abstract to report key numerical results (e.g., average scores on math, coding, and algorithmic benchmarks) showing the 14B models outperforming the 70B baseline and approaching the full DeepSeek-R1 model, while explicitly referencing the full evaluation tables and contamination controls described in the Evaluations section. revision: yes
-
Referee: [Training Methodology and Evaluations sections] The performance gains are attributed primarily to the curated teachable prompts and o3-mini demonstrations (plus the RL phase). However, the manuscript does not report an ablation of the unmodified base Phi-4 model on the same reasoning benchmarks. This omission is load-bearing for the causal claim, as it leaves open whether gains derive from the base model properties, evaluation protocol, or overlap with the teacher data rather than the described curation.
Authors: We acknowledge the importance of this ablation for isolating the contribution of our data curation and training. We have evaluated the unmodified Phi-4 base model on the identical reasoning benchmarks. The base model exhibits substantially lower performance than both Phi-4-reasoning and Phi-4-reasoning-plus. In the revised manuscript we will add these results to the Evaluations section, together with a brief discussion of how the curated SFT data and RL phase drive the observed improvements beyond base-model capabilities. revision: yes
Circularity Check
No circularity: empirical training report with external benchmark comparisons
full rationale
The paper describes supervised fine-tuning of the base Phi-4 model on curated teachable prompts paired with o3-mini demonstrations, followed by optional outcome-based RL for the plus variant. Performance claims rest on direct evaluations across standard math, coding, planning, and general benchmarks, with explicit comparisons to independent external models (DeepSeek-R1-Distill-Llama-70B and full DeepSeek-R1). No equations, fitted parameters, or first-principles derivations appear that could reduce reported gains to self-definitions or tautologies by construction. Self-references to the prior Phi-4 base model are present but not load-bearing; the central results are measured outcomes on external benchmarks rather than any renaming, ansatz smuggling, or uniqueness theorem imported from the same authors.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
Trained via supervised fine-tuning of Phi-4 on carefully curated set of 'teachable' prompts... and reasoning demonstrations generated using o3-mini... short phase of outcome-based reinforcement learning... GRPO algorithm
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
We report standard deviations across multiple runs... average-of-5 evaluations can differ significantly (by up to 5-10 percentage points on AIME)
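The run-to-run variance mentioned in the quoted passage is consistent with simple binomial noise. The sketch below simulates repeated "average of 5 runs" evaluations on a 30-problem AIME-style set; the true per-problem solve probability (0.6) and the set size are assumed placeholders, not values from the paper.

```python
# Sketch: how much repeated "average of 5 runs" AIME-style scores can
# differ purely from sampling noise. The solve probability (0.6) and
# the 30-problem set size are illustrative assumptions.
import random

def avg_of_k_accuracy(p_solve, n_problems=30, k=5, rng=random):
    """Mean accuracy (in percentage points) over k independent runs,
    where each run answers n_problems questions, each solved
    independently with probability p_solve."""
    runs = []
    for _ in range(k):
        correct = sum(rng.random() < p_solve for _ in range(n_problems))
        runs.append(100.0 * correct / n_problems)
    return sum(runs) / k

rng = random.Random(0)
estimates = [avg_of_k_accuracy(0.6, rng=rng) for _ in range(200)]
spread = max(estimates) - min(estimates)
print(round(spread, 1))  # repeated average-of-5 scores still scatter widely
```

With these assumptions a single run has a standard deviation of roughly 9 points, and averaging 5 runs only brings it to about 4, so gaps of 5-10 points between average-of-5 evaluations are unsurprising.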
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.
-
When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems
Pre-trained models are added late in projects, accumulate rather than get replaced, and change three times less often than libraries, with distinct documentation driven by capability needs and testing uncertainty.
-
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
-
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
-
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
-
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
-
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.
-
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
-
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.
-
TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.
-
Chain-of-Thought Reasoning Enhances In-Context Learning for LLM-Based Mobile Traffic Prediction
Chain-of-thought reasoning with plan-based demonstrations and similarity retrieval improves LLM mobile traffic prediction accuracy by up to 15% over standard in-context learning on real 5G data.
-
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
MathArena is a maintained platform evaluating LLMs across olympiad problems, proofs, research questions, and formal proofs, with GPT-5.5 reaching 98% on 2026 USAMO and 74% on research-level tasks.
-
Ranking Reasoning LLMs under Test-Time Scaling
Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best met...
-
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
KITAB: evaluating llms on constraint satisfaction for information retrieval
Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yüksekgönül, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. KITAB: evaluating llms on constraint satisfaction for information retrieval. In International Conference on Learning Representations, 2024
work page 2024
-
[4]
AIME. Aime 83-24. https://huggingface.co/datasets/lchen001/AIME1983_2024, 2024. Accessed: 2025-03-17
work page 2024
- [5]
-
[6]
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[7]
Claude 3.7 Sonnet
Anthropic. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2025-03-17
work page 2025
-
[8]
Chain-of-Thought Reasoning in the Wild Is Not Always Faithful
Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025
-
[9]
Eureka: Evaluating and Understanding Large Foundation Models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models. arXiv preprint arXiv:2409.10566, 2024
-
[10]
Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025
Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, and Safoora Yousefi. Inference-time scaling for complex tasks: Where we stand and what lies ahead, 2025. URL https://arxiv.org/abs/2504.00294
-
[11]
Matharena: Evaluating llms on uncontaminated math competitions, February 2025
Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/
work page 2025
-
[12]
Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W Duncan Wadsworth, and Hanna Wallach. Designing disaggregated evaluations of ai systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 368–378, 2021
work page 2021
-
[13]
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. Benchagents: Automated benchmark creation with agent interaction. arXiv preprint arXiv:2410.22584, 2024
-
[14]
Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Reinforcement learning for reasoning in small llms: What works and what doesn’t,
Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t,
- [16]
-
[17]
Omni-math: A universal olympiad level mathematic benchmark for large language models.ICLR, 2025
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.ICLR, 2025
work page 2025
-
[18]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023
work page 2023
-
[19]
Google. Gemini flash thinking. https://deepmind.google/technologies/gemini/flash/, 2025. Accessed: 2025-03-17
work page 2025
-
[20]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025
work page internal anchor Pith review arXiv 2025
-
[21]
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Computers and Intractability: A Guide to the Theory of NP-Completeness (Michael R. Garey and David S. Johnson)
Juris Hartmanis. Computers and intractability: a guide to the theory of np-completeness (michael r. garey and david s. johnson). Siam Review, 24(1):90, 1982
work page 1982
-
[24]
ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large- scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326. Association for Computational L...
work page 2022
-
[25]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[26]
A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086, 2025
-
[27]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Phi-2: The Surprising Power of Small Language Models. Microsoft Research Blog, 2023
Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel ...
work page 2023
-
[30]
Safechain: Safety of language models with long chain-of-thought reasoning capabilities
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Pooven- dran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025, 2025
-
[31]
Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. In ACL, 2024
work page 2024
-
[32]
EXAONE Deep: Reasoning Enhanced Language Models
LG AI Research. Exaone deep: Reasoning enhanced language models. arXiv preprint arXiv:2503.12524, 2025
-
[33]
Functional interpolation for relative positions improves long context transformers
Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. arXiv preprint arXiv:2310.04418, 2023
-
[34]
From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024
-
[35]
Limr: Less is more for rl scaling, 2025
Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025
work page 2025
-
[36]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023. URL https://arxiv.org/abs/2305.01210
work page 2023
-
[37]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
-
[38]
A framework for automated measurement of responsible ai harms in generative ai applications,
Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, and Mei Chen. A framework for automated measurement of responsible ai harms in generative ai applications,
- [39]
-
[40]
Orca 2: Teaching small language models how to reason
Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023
-
[41]
Agentinstruct: Toward generative teaching with agentic flows
Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, et al. Agentinstruct: Toward generative teaching with agentic flows. arXiv preprint arXiv:2407.03502, 2024
-
[42]
Unearthing skill-level insights for understanding trade-offs of foundation models
Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, and Vibhav Vineet. Unearthing skill-level insights for understanding trade-offs of foundation models. ICLR, 2025
work page 2025
-
[43]
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
Besmira Nushi, Ece Kamar, and Eric Horvitz. Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 6, pages 126–135, 2018
work page 2018
-
[45]
OpenAI. Openai o3-mini system card. https://openai.com/index/o3-mini-system-card/, 2025. Accessed: 2025-03-17
work page 2025
-
[46]
Christos H Papadimitriou. Computational complexity. In Encyclopedia of computer science, pages 260–265. John Wiley and Sons Ltd., 2003
work page 2003
-
[47]
Overreliance on AI Literature Review
Samir Passi and Mihaela Vorvoreanu. Overreliance on ai literature review. Microsoft Research, 339:340, 2022
work page 2022
-
[48]
Proof or bluff? evaluating llms on 2025 usa math olympiad
Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? evaluating llms on 2025 usa math olympiad. arXiv preprint arXiv:2503.21934, 2025
-
[49]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024
work page 2024
-
[50]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Language Models are Multilingual Chain-of-Thought Reasoners
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022. URL https://arxiv.org/abs/2210.03057
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[53]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[54]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025
work page 2025
-
[56]
Qwq-32b: Embracing the power of reinforcement learning, March 2025
Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/
work page 2025
-
[57]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. Advances in Neural Information Processing Systems, 37:75392–75421, 2024
work page 2024
-
[59]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[60]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
On the emergence of thinking in llms i: Searching for the right intuition
Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, and Huseyin A Inan. On the emergence of thinking in llms i: Searching for the right intuition. arXiv preprint arXiv:2502.06773, 2025
-
[62]
Demystifying long chain-of-thought reasoning in llms
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025
-
[63]
Dapo: An open-source llm reinforcement learning system at scale, 2025
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
work page 2025
-
[64]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)