CODEBLOCK: Learning to Supervise Code at the Right Granularity

Jiaheng Wei; Jinlong Pang; Kaiqin Hu; Ling Li; Qi Xuan; Zhaowei Zhu; Zhijie Deng

arxiv: 2606.18286 · v1 · pith:3TGVZ3BOnew · submitted 2026-06-10 · 💻 cs.LG

CODEBLOCK: Learning to Supervise Code at the Right Granularity

Zhijie Deng , Ling Li , Jinlong Pang , Kaiqin Hu , Qi Xuan , Zhaowei Zhu , Jiaheng Wei This is my paper

Pith reviewed 2026-06-27 10:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords supervised fine-tuningcode generationsparse supervisionstructure-aware selectiondata-flow analysisLLM trainingtoken selectioncode LLMs

0 comments

The pith

Supervising only selected complete code blocks improves pass rates over full-token training while using 1.9 percent of response tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard supervised fine-tuning for code LLMs applies cross-entropy loss uniformly to every token in the response. The paper claims this wastes signal because code depends on complete syntactic units and definition-use relations that isolated tokens can break. CodeBlock instead partitions responses into syntactically coherent items, scores them by aggregating generalized cross-entropy on core logic tokens, and reranks them using data-flow reach and bridge signals. Loss is then applied only to the selected items plus informative natural-language tokens, while the full response stays available as context. Experiments on six benchmarks show this yields higher average pass@1 than full supervision or other selection methods.

Core claim

CodeBlock is a structure-aware sparse supervision framework that first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training the full response remains available as context while loss is applied only to selected code items and informative natural-language tokens, producing stronger average pass@1 than full-token SFT on six code-generation benchmarks while supervising only 1.9 percent of re

What carries the argument

The CodeBlock selection process that partitions responses into syntactically coherent coding items, aggregates generalized cross-entropy on core logic tokens, and reranks with data-flow reach and bridge signals.

If this is right

Models trained with CodeBlock achieve higher average pass@1 than full-token SFT on code generation benchmarks.
Only 1.9 percent of response tokens require supervision under the method.
Performance remains competitive with other token-level selection baselines.
The full response is retained as context while loss is restricted to selected blocks and natural-language tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The emphasis on syntactic completeness and data-flow connections suggests the same selection logic could be tested on other structured outputs that rely on dependency relations.
Lower supervised-token counts may allow fine-tuning runs on larger code datasets without proportional increases in compute.
Replacing uniform loss with block-level selection could be combined with existing data-filtering pipelines to further reduce training cost.

Load-bearing premise

That selecting syntactically complete code items via aggregated generalized cross-entropy on core logic tokens plus data-flow reach and bridge reranking identifies blocks that deliver superior learning signal compared with uniform token supervision.

What would settle it

Train identical models on the same data once with CodeBlock block selection and once with uniform token loss, then compare their pass@1 scores on the six benchmarks; equal or lower scores for the CodeBlock version would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.18286 by Jiaheng Wei, Jinlong Pang, Kaiqin Hu, Ling Li, Qi Xuan, Zhaowei Zhu, Zhijie Deng.

**Figure 2.** Figure 2: Comparison between previous token-level selection and CODEBLOCK. While prior methods select isolated high-score tokens, CODEBLOCK preserves complete coding items with coherent syntax and data dependencies. ments, and variable definition–use relations (Allamanis et al., 2017; Guo et al., 2020). An isolated variable name or operator may not carry complete semantics on its own; only when combined into an as… view at source ↗

**Figure 3.** Figure 3: Motivating analysis of token-level sparse su [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of CODEBLOCK, which combines sample-level selection, GCE-based coding item scoring, data-flow guided item selection, and sparse fine-tuning for efficient code LLM supervised fine-tuning. Algorithm 1 CODEBLOCK Coding-Item Construction with Local Closure Require: Response yi, response-token indices Ti with character offsets, code regions Fi, protected syntax set ΩP Ensure: Coding items Ui = {ui,k}… view at source ↗

**Figure 5.** Figure 5: Sensitivity analysis of NL and code supervi [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Sensitivity analysis of token scoring sources [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Sensitivity analysis of λ on the average pass@1 score across six code-generation benchmarks. best average score. Compared with λ = 0, adding the data-flow bonus improves avg06 from 53.84 to 54.58, suggesting that reach and bridge signals provide complementary information beyond tokenlevel GCE utility. However, the trend is not monotonic: further increasing λ does not consistently improve performance and … view at source ↗

read the original abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CodeBlock adapts token selection to code via syntactic blocks and data-flow reranking, but gains may trace to pair filtering rather than the block step.

read the letter

The main thing here is a pipeline that first picks high-quality instruction-response pairs, splits code responses into syntactically complete items, scores them by aggregating generalized cross-entropy on core logic tokens, and reranks with data-flow reach and bridge signals before applying loss only to the selected blocks. It keeps the full response as context and claims stronger average pass@1 than full-token SFT while supervising just 1.9% of response tokens.

This is a direct extension of token-level selection work to code, and the syntactic partitioning plus data-flow reranking is a sensible response to the fact that isolated token masking can break program units. The approach is concrete and the efficiency angle is practical.

The soft spot the stress-test flags is real: the pair selection step comes first, yet the abstract does not state that the full-token SFT baseline uses the same filtered pairs. Without an ablation that holds the pair filter fixed and varies only the block mechanism, it is hard to credit the block granularity itself for the reported lift. The soundness numbers in the reader's report are low for the same reason—no protocol, baselines, or error bars appear in the abstract, so the numbers cannot be checked yet.

This is for people working on efficient fine-tuning of code models. A reader who cares about structure-aware supervision would get a usable method description and a clear motivation.

It deserves peer review because the problem is well-posed and the pipeline is testable, even if the current evidence leaves the pair-filter alternative open.

Referee Report

2 major / 1 minor

Summary. The paper proposes CodeBlock, a structure-aware sparse supervision method for fine-tuning code LLMs. It first filters high-quality instruction-response pairs, partitions code responses into syntactically complete blocks, estimates block utility via aggregated generalized cross-entropy on core logic tokens combined with data-flow reach and bridge reranking, and applies loss only to selected blocks plus informative NL tokens while keeping the full response as context. On six code-generation benchmarks, it reports stronger average pass@1 than full-token SFT and competitive selection baselines while supervising only 1.9% of response tokens.

Significance. If the performance gains are shown to arise from the block-level mechanism rather than upstream pair filtering, the work would provide a principled way to respect syntactic and data-flow structure in code supervision, potentially improving sample efficiency over uniform or token-level masking approaches.

major comments (2)

[Experimental Results] Experimental Results section: the full-token SFT baseline is not described as using the same high-quality pair filter as CodeBlock, so the reported pass@1 gains cannot be attributed to block granularity versus pair selection. An ablation that applies full-token SFT to the filtered pairs (and block supervision to unfiltered pairs) is required to support the central claim.
[Experimental protocol] § on experimental protocol (and Abstract): no description of the training protocol, exact baseline implementations, number of runs, statistical tests, or error bars is supplied, preventing verification that the average pass@1 improvements are reliable rather than artifacts of a single run or implementation detail.

minor comments (1)

[Method] Notation for 'generalized cross-entropy' and 'data-flow reach/bridge' should be defined with an equation or pseudocode in the method section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the experimental claims.

read point-by-point responses

Referee: [Experimental Results] Experimental Results section: the full-token SFT baseline is not described as using the same high-quality pair filter as CodeBlock, so the reported pass@1 gains cannot be attributed to block granularity versus pair selection. An ablation that applies full-token SFT to the filtered pairs (and block supervision to unfiltered pairs) is required to support the central claim.

Authors: We agree that the manuscript does not explicitly clarify whether the full-token SFT baseline applies the same pair filter used by CodeBlock. To isolate the contribution of block-level selection from pair filtering, we will add the requested ablations (full SFT on filtered pairs; block supervision on unfiltered pairs) to the Experimental Results section of the revised manuscript. revision: yes
Referee: [Experimental protocol] § on experimental protocol (and Abstract): no description of the training protocol, exact baseline implementations, number of runs, statistical tests, or error bars is supplied, preventing verification that the average pass@1 improvements are reliable rather than artifacts of a single run or implementation detail.

Authors: We acknowledge that the current manuscript lacks sufficient detail on the experimental protocol. In the revision we will expand the relevant section (and update the abstract if needed) to specify the full training protocol, exact baseline implementations, number of runs with different random seeds, statistical tests performed, and error bars on all reported metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent benchmarks

full rationale

The paper describes an empirical framework (pair selection, syntactic partitioning, utility estimation via aggregated GCE plus data-flow reranking, then masked loss) and reports pass@1 results on six benchmarks. No equations, derivations, or first-principles claims appear that reduce an output quantity to a fitted parameter or self-citation by construction. The central claim rests on comparative experiments rather than any self-definitional or load-bearing self-citation step. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach builds on standard cross-entropy and static analysis concepts without new postulates.

pith-pipeline@v0.9.1-grok · 5747 in / 1028 out tokens · 28922 ms · 2026-06-27T10:49:37.550465+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 20 canonical work pages · 6 internal anchors

[1]

Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

work page arXiv
[2]

Learning to Represent Programs with Graphs

Learning to repre- sent programs with graphs.arXiv preprint arXiv:1711.00740. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Program Synthesis with Large Language Models

Program synthesis with large language models.arXiv preprint arXiv:2108.07732. Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srini- vasan, Tianyi Zhou, Heng Huang, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2307.08701

Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others

work page arXiv
[5]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374. Yicheng Chen, Yining Li, Kai Hu, Ma Zerun, HaochenYe HaochenYe, and Kai Chen

work page internal anchor Pith review Pith/arXiv arXiv
[6]

In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915

Mig: Automatic data selection for instruction tuning by maximizing information gain in semantic space. In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915. Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, and Jiaheng Wei

2025
[7]

Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

Lm-mixup: Text data augmentation via language model based mixup.arXiv preprint arXiv:2510.20449. Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

work page arXiv
[8]

GraphCodeBERT: Pre-training Code Representations with Data Flow

Graph- codebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yi- fan Wu, YK Li, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv 2009
[9]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Deepseek- coder: when the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196. Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186. Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao

work page internal anchor Pith review Pith/arXiv arXiv
[11]

From quantity to quality: Boost- ing llm performance with self-guided data selection for instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7602–7635. Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, and Jun Zhang

2024
[12]

Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

Token-level data selection for safe llm fine-tuning.arXiv preprint arXiv:2603.01185. Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

work page arXiv
[13]

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others

Tokentune: Dual-level utility estimation for scalable data selection in instruction tuning. Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others. 2024a. Rho-1: Not all tokens are what you need.arXiv preprint arXiv:2404.07965. Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wa...

work page arXiv
[14]

9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685. 9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

work page arXiv
[15]

Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

Efficient code llm training via distribution-consistent and diversity-aware data selection.arXiv preprint arXiv:2507.02378. Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

work page arXiv
[16]

Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

Improving data efficiency via curating llm-driven rating systems.arXiv preprint arXiv:2410.10877. Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

work page arXiv
[17]

ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

sstoken: Self- modulated and semantic-aware token selection for llm fine-tuning.arXiv preprint arXiv:2510.18250. ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

work page arXiv
[18]

arXiv preprint arXiv:2506.03524

Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524. Yun-Da Tsai, Mingjie Liu, and Haoxing Ren

work page arXiv
[19]

Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040. Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

work page arXiv
[20]

InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043

How do your code llms perform? empowering code in- struction tuning with really good data. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro V on Werra, Arjun Guha, and Lingming Zhang. 202...

work page arXiv 2024
[21]

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

Rose: A reward-oriented data selection framework for llm task-specific instruction tuning.arXiv preprint arXiv:2412.00631. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

work page arXiv
[22]

arXiv preprint arXiv:2402.04333

Less: Se- lecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333. Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, and Raymond Lee

work page arXiv
[23]

InInternational Conference on Learning Representations, volume 2025, pages 66602–66656

Bigcodebench: Benchmarking code generation with diverse function calls and complex in- structions. InInternational Conference on Learning Representations, volume 2025, pages 66602–66656. 10 A Experimental Details A.1 Sample-Level Selection Settings We perform sample-level selection from the 300K OpenCodeInstruct (OCI) source pool before ap- plying item-le...

2025
[24]

For full-token SFT, we train the model for 30,000 steps to provide a strong dense-supervision base- line. For CodeBlcok and all sparse-selection base- lines, we use 3,000 training steps under the same optimization setup, so that the comparison focuses on the quality of selected supervision signals un- der a limited training budget. All experiments are con...

2048

[1] [1]

Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

work page arXiv

[2] [2]

Learning to Represent Programs with Graphs

Learning to repre- sent programs with graphs.arXiv preprint arXiv:1711.00740. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Program Synthesis with Large Language Models

Program synthesis with large language models.arXiv preprint arXiv:2108.07732. Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srini- vasan, Tianyi Zhou, Heng Huang, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2307.08701

Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others

work page arXiv

[5] [5]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374. Yicheng Chen, Yining Li, Kai Hu, Ma Zerun, HaochenYe HaochenYe, and Kai Chen

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915

Mig: Automatic data selection for instruction tuning by maximizing information gain in semantic space. In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915. Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, and Jiaheng Wei

2025

[7] [7]

Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

Lm-mixup: Text data augmentation via language model based mixup.arXiv preprint arXiv:2510.20449. Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

work page arXiv

[8] [8]

GraphCodeBERT: Pre-training Code Representations with Data Flow

Graph- codebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yi- fan Wu, YK Li, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv 2009

[9] [9]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Deepseek- coder: when the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196. Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186. Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

From quantity to quality: Boost- ing llm performance with self-guided data selection for instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7602–7635. Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, and Jun Zhang

2024

[12] [12]

Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

Token-level data selection for safe llm fine-tuning.arXiv preprint arXiv:2603.01185. Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

work page arXiv

[13] [13]

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others

Tokentune: Dual-level utility estimation for scalable data selection in instruction tuning. Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others. 2024a. Rho-1: Not all tokens are what you need.arXiv preprint arXiv:2404.07965. Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wa...

work page arXiv

[14] [14]

9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685. 9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

work page arXiv

[15] [15]

Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

Efficient code llm training via distribution-consistent and diversity-aware data selection.arXiv preprint arXiv:2507.02378. Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

work page arXiv

[16] [16]

Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

Improving data efficiency via curating llm-driven rating systems.arXiv preprint arXiv:2410.10877. Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

work page arXiv

[17] [17]

ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

sstoken: Self- modulated and semantic-aware token selection for llm fine-tuning.arXiv preprint arXiv:2510.18250. ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

work page arXiv

[18] [18]

arXiv preprint arXiv:2506.03524

Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524. Yun-Da Tsai, Mingjie Liu, and Haoxing Ren

work page arXiv

[19] [19]

Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040. Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

work page arXiv

[20] [20]

InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043

How do your code llms perform? empowering code in- struction tuning with really good data. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro V on Werra, Arjun Guha, and Lingming Zhang. 202...

work page arXiv 2024

[21] [21]

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

Rose: A reward-oriented data selection framework for llm task-specific instruction tuning.arXiv preprint arXiv:2412.00631. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

work page arXiv

[22] [22]

arXiv preprint arXiv:2402.04333

Less: Se- lecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333. Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, and Raymond Lee

work page arXiv

[23] [23]

InInternational Conference on Learning Representations, volume 2025, pages 66602–66656

Bigcodebench: Benchmarking code generation with diverse function calls and complex in- structions. InInternational Conference on Learning Representations, volume 2025, pages 66602–66656. 10 A Experimental Details A.1 Sample-Level Selection Settings We perform sample-level selection from the 300K OpenCodeInstruct (OCI) source pool before ap- plying item-le...

2025

[24] [24]

For full-token SFT, we train the model for 30,000 steps to provide a strong dense-supervision base- line. For CodeBlcok and all sparse-selection base- lines, we use 3,000 training steps under the same optimization setup, so that the comparison focuses on the quality of selected supervision signals un- der a limited training budget. All experiments are con...

2048