arxiv: 2504.01943 · v2 · pith:CGTO6CK7new · submitted 2025-04-02 · 💻 cs.CL

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg

Pith reviewed 2026-05-17 19:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords supervised fine-tuning, data distillation, reasoning models, competitive coding, LiveCodeBench, CodeContests, instruction diversity, open-source datasets

The pith

Curating a diverse dataset for supervised fine-tuning lets coding models outperform alternatives trained with reinforcement learning on competitive coding benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs an open SFT dataset to distill reasoning into coding models and demonstrates that this approach alone produces strong results. Models trained on the dataset reach 61.8% on LiveCodeBench and 24.6% on CodeContests, exceeding alternatives that rely on reinforcement learning. The authors' ablation on code execution filtering finds that filtering reduced benchmark accuracy, so they prioritize instruction diversity over solution correctness. The authors state that they will open-source the dataset and models so that others can use and extend the method.

Core claim

The central claim is that an SFT dataset built by prioritizing instruction and solution diversity, while deliberately avoiding code execution filtering, enables distilled models to achieve state-of-the-art coding performance using only supervised fine-tuning. This yields 61.8% on LiveCodeBench and 24.6% on CodeContests while surpassing RL-trained models. The authors support the claim through comparisons across model sizes, analysis of filtering effects, and examination of token use and reasoning patterns.

What carries the argument

The SFT dataset curation process that selects for instruction and solution diversity from multiple sources while omitting execution-based correctness filters.
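For readers unfamiliar with the term, an execution-based correctness filter of the kind this curation process omits might look like the sketch below. It is not the authors' pipeline: the sample layout (`solution`, `tests`), the function names, and the toy problems are all hypothetical, and a real filter would run untrusted code inside a sandbox rather than a bare subprocess.

```python
import subprocess
import sys


def passes_tests(solution_code, tests, timeout_s=5.0):
    """Return True if the solution prints the expected output for every test."""
    for stdin_text, expected_stdout in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
        if result.stdout.strip() != expected_stdout.strip():
            return False
    return True


def execution_filter(samples):
    """Keep only the samples whose solution passes all of its tests."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]


# Hypothetical toy samples: the task is "read an integer and print its double".
samples = [
    {"solution": "print(int(input()) * 2)", "tests": [("3\n", "6")]},
    {"solution": "print(int(input()) + 2)", "tests": [("3\n", "6")]},
]
kept = execution_filter(samples)
```

A filter like this discards every sample with a failing solution, which is why it also shrinks the training set and can remove whole instructions from the pool.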

If this is right

Models of different sizes reach competitive coding results through SFT alone without reinforcement learning.
Open-sourcing the dataset and models gives the community direct access to the training resources that produced the reported scores.
Execution filtering should be avoided or de-emphasized when curating data for reasoning distillation.
Analysis of token efficiency and reasoning patterns in the distilled models offers guidance for further training refinements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same diversity-first curation strategy could be tested on distillation for non-coding reasoning tasks such as mathematics.
Applying the dataset to base models larger than those reported might produce additional gains on the same benchmarks.
Combining the SFT dataset with a small amount of subsequent reinforcement learning could create hybrid models with even higher performance.
Re-running the filtering analysis on other coding benchmarks would show whether the negative effect of execution filtering holds more broadly.

Load-bearing premise

That prioritizing diversity in instructions and solutions over filtering for code-execution correctness is what produces the higher benchmark scores.

What would settle it

Retraining the same base models on a version of the dataset that includes only solutions verified through code execution and finding that the scores on LiveCodeBench and CodeContests do not drop below the reported levels.

Original abstract

Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's useful part is releasing an open coding SFT dataset that hits 61.8% on LiveCodeBench with plain fine-tuning, though the filtering analysis may mix diversity with raw data volume.

The paper releases an open SFT dataset for coding that gets models to 61.8% on LiveCodeBench and 24.6% on CodeContests with supervised fine-tuning alone. These scores top some RL-trained alternatives on the cited tasks. The authors put real effort into data curation: the work compares data sources, tests what happens when code execution filtering is applied, and shows that keeping more diverse instructions and solutions worked better than filtering for correctness. They also examine token efficiency and the reasoning patterns in the trained models. Releasing the data and models openly stands out as practical help for the field.

The main soft spot is the filtering ablation. The observation that filtering hurt performance is interesting, but without numbers on how much data was retained or whether total tokens were equalized between the filtered and unfiltered sets, it is difficult to separate the effect of diversity from the effect of dataset size. Typical filtering removes a lot of examples, so the accuracy drop could simply reflect less training data. The paper would be stronger with that control or at least the retention rates reported.

Readers working on LLM coding tools or data distillation will get the most from this. It supplies concrete numbers, an open resource, and some analysis that others can extend. The results look plausible and the release makes verification possible, so it deserves a serious referee. I would send it for peer review, with the expectation that the filtering experiments get clarified.

Referee Report

1 major / 2 minor

Summary. The manuscript presents OpenCodeReasoning, a curated SFT dataset for distilling reasoning into coding LLMs from multiple sources. Models trained solely with SFT on this dataset achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, outperforming some RL-trained alternatives. The authors analyze data sources, report that code-execution filtering reduced benchmark accuracy (prompting a shift to prioritize instruction/solution diversity), examine token efficiency and reasoning patterns, and commit to open-sourcing the datasets and models.

Significance. If the benchmark results and ablation findings hold, the work shows that targeted data curation emphasizing diversity can deliver competitive coding performance via straightforward SFT, potentially lowering barriers compared to RL pipelines. The explicit plan to open-source both datasets and models is a clear strength that supports reproducibility and community follow-up.

major comments (1)

[Analysis of the impact of code execution filtering] In the analysis of the impact of code execution filtering: the observation that filtering lowered benchmark accuracy is used to justify prioritizing instruction and solution diversity over correctness. However, the manuscript does not report the retention rate after filtering or confirm that total examples or token counts were matched between filtered and unfiltered conditions. Without these controls, the performance difference could be driven by reduced training volume rather than the presence of incorrect solutions, weakening the central methodological conclusion.
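The control requested in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the record fields (`text`, `passed`), the whitespace token count standing in for a real tokenizer, and the toy data are all hypothetical. The sketch matches only the number of examples and then reports token totals so that any remaining mismatch is visible.

```python
import random


def retention_rate(n_unfiltered, n_filtered):
    """Fraction of examples that survive filtering."""
    return n_filtered / n_unfiltered


def size_matched_subsample(unfiltered, n_target, seed=0):
    """Randomly subsample the unfiltered set down to n_target examples."""
    rng = random.Random(seed)
    return rng.sample(unfiltered, n_target)


def total_tokens(samples):
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    return sum(len(s["text"].split()) for s in samples)


# Hypothetical toy data: every third example "passes" execution filtering.
unfiltered = [
    {"id": i, "text": "step " * (i + 1), "passed": i % 3 == 0}
    for i in range(12)
]
filtered = [s for s in unfiltered if s["passed"]]

# Control condition: same number of examples as the filtered set,
# drawn without regard to correctness.
control = size_matched_subsample(unfiltered, len(filtered))
rate = retention_rate(len(unfiltered), len(filtered))

print(f"retention rate: {rate:.2%}")
print(f"filtered tokens: {total_tokens(filtered)}")
print(f"control tokens:  {total_tokens(control)}")
```

Training on `filtered` and on `control` holds the example count fixed, so a remaining accuracy gap is less likely to be a pure data-volume effect. Matching total tokens as well would need an additional constraint on the subsample.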

minor comments (2)

[Abstract and results] Abstract and results sections: the reported scores (61.8% and 24.6%) are given without error bars, standard deviations, or information on the number of evaluation runs or seeds, limiting assessment of result stability.
[Data construction] Data construction section: additional statistics on the sizes, token counts, and exact composition of each data source used in the final dataset would improve transparency and allow readers to better contextualize the diversity claims.
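The stability information requested in the first minor comment could be reported along the lines of the sketch below. The scores are placeholder values chosen for illustration only; they are not results from the paper, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder accuracies from four hypothetical seeds (not real results).
placeholder_scores = [50.0, 52.0, 51.0, 53.0]
avg, spread = summarize_runs(placeholder_scores)
print(f"{avg:.1f} +/- {spread:.1f} over {len(placeholder_scores)} runs")
```

Reporting the number of runs next to the mean and spread lets readers judge whether a gap between two systems exceeds run-to-run noise.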

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and commit to revising the manuscript to strengthen the analysis.

Referee: [Analysis of the impact of code execution filtering] In the analysis of the impact of code execution filtering: the observation that filtering lowered benchmark accuracy is used to justify prioritizing instruction and solution diversity over correctness. However, the manuscript does not report the retention rate after filtering or confirm that total examples or token counts were matched between filtered and unfiltered conditions. Without these controls, the performance difference could be driven by reduced training volume rather than the presence of incorrect solutions, weakening the central methodological conclusion.

Authors: We agree that reporting the retention rate and confirming matched training conditions is necessary to isolate the effect of incorrect solutions. In the revised manuscript, we will explicitly state the retention rate after code execution filtering and clarify that the unfiltered condition was subsampled to match both the number of examples and total token count of the filtered set. This additional control will confirm that the observed drop in benchmark accuracy is attributable to the removal of incorrect solutions rather than differences in data volume, thereby reinforcing our conclusion to prioritize instruction and solution diversity.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical data curation and SFT pipeline

The paper reports an empirical workflow: construction of an SFT dataset from public sources, ablation on execution filtering versus diversity, and direct benchmark evaluation of distilled models. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central observation that filtering lowered accuracy is presented as an experimental result rather than a derivation that reduces to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The reported scores (61.8% LiveCodeBench, 24.6% CodeContests) are independent benchmark measurements after training, not outputs forced by the methodology itself. Potential confounds such as unmatched dataset sizes after filtering are experimental-design issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the empirical effectiveness of the curated dataset and the observation that execution filtering reduced accuracy; no new theoretical entities or axioms are introduced beyond standard LLM training assumptions.

axioms (1)

domain assumption Supervised fine-tuning on curated reasoning traces transfers coding capability from larger to smaller models.
Invoked implicitly when claiming SFT alone suffices to surpass RL methods.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
cs.LG 2026-05 conditional novelty 7.0
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
cs.LG 2026-05 unverdicted novelty 7.0
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...

Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 7.0
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
cs.LG 2026-04 unverdicted novelty 7.0
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
cs.LG 2026-04 unverdicted novelty 7.0
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
cs.AI 2026-04 unverdicted novelty 7.0
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
cs.AI 2026-04 unverdicted novelty 7.0
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

Think Anywhere in Code Generation
cs.SE 2026-03 unverdicted novelty 7.0
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
cs.CL 2026-03 conditional novelty 7.0
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
cs.SE 2025-10 conditional novelty 7.0
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...

Scalable Token-Level Hallucination Detection in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 6.0
ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG 2026-04 unverdicted novelty 6.0
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

InCoder-32B-Thinking: Industrial Code World Model for Thinking
cs.AR 2026-04 unverdicted novelty 6.0
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

How Robustly do LLMs Understand Execution Semantics?
cs.SE 2026-02 unverdicted novelty 6.0
Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI 2025-10 unverdicted novelty 6.0
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
cs.LG 2025-03 unverdicted novelty 6.0
A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.

NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Venue: published as a conference paper at COLM 2025.