pith. sign in

arxiv: 2606.18286 · v1 · pith:3TGVZ3BOnew · submitted 2026-06-10 · 💻 cs.LG

CODEBLOCK: Learning to Supervise Code at the Right Granularity

Pith reviewed 2026-06-27 10:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords supervised fine-tuningcode generationsparse supervisionstructure-aware selectiondata-flow analysisLLM trainingtoken selectioncode LLMs
0
0 comments X

The pith

Supervising only selected complete code blocks improves pass rates over full-token training while using 1.9 percent of response tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard supervised fine-tuning for code LLMs applies cross-entropy loss uniformly to every token in the response. The paper claims this wastes signal because code depends on complete syntactic units and definition-use relations that isolated tokens can break. CodeBlock instead partitions responses into syntactically coherent items, scores them by aggregating generalized cross-entropy on core logic tokens, and reranks them using data-flow reach and bridge signals. Loss is then applied only to the selected items plus informative natural-language tokens, while the full response stays available as context. Experiments on six benchmarks show this yields higher average pass@1 than full supervision or other selection methods.

Core claim

CodeBlock is a structure-aware sparse supervision framework that first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training the full response remains available as context while loss is applied only to selected code items and informative natural-language tokens, producing stronger average pass@1 than full-token SFT on six code-generation benchmarks while supervising only 1.9 percent of re

What carries the argument

The CodeBlock selection process that partitions responses into syntactically coherent coding items, aggregates generalized cross-entropy on core logic tokens, and reranks with data-flow reach and bridge signals.

If this is right

  • Models trained with CodeBlock achieve higher average pass@1 than full-token SFT on code generation benchmarks.
  • Only 1.9 percent of response tokens require supervision under the method.
  • Performance remains competitive with other token-level selection baselines.
  • The full response is retained as context while loss is restricted to selected blocks and natural-language tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on syntactic completeness and data-flow connections suggests the same selection logic could be tested on other structured outputs that rely on dependency relations.
  • Lower supervised-token counts may allow fine-tuning runs on larger code datasets without proportional increases in compute.
  • Replacing uniform loss with block-level selection could be combined with existing data-filtering pipelines to further reduce training cost.

Load-bearing premise

That selecting syntactically complete code items via aggregated generalized cross-entropy on core logic tokens plus data-flow reach and bridge reranking identifies blocks that deliver superior learning signal compared with uniform token supervision.

What would settle it

Train identical models on the same data once with CodeBlock block selection and once with uniform token loss, then compare their pass@1 scores on the six benchmarks; equal or lower scores for the CodeBlock version would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.18286 by Jiaheng Wei, Jinlong Pang, Kaiqin Hu, Ling Li, Qi Xuan, Zhaowei Zhu, Zhijie Deng.

Figure 1
Figure 1. Figure 1: Performance–efficiency trade-off on Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between previous token-level selection and CODEBLOCK. While prior methods se￾lect isolated high-score tokens, CODEBLOCK preserves complete coding items with coherent syntax and data dependencies. ments, and variable definition–use relations (Alla￾manis et al., 2017; Guo et al., 2020). An isolated variable name or operator may not carry complete semantics on its own; only when combined into an as… view at source ↗
Figure 3
Figure 3. Figure 3: Motivating analysis of token-level sparse su [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of CODEBLOCK, which combines sample-level selection, GCE-based coding item scoring, data-flow guided item selection, and sparse fine-tuning for efficient code LLM supervised fine-tuning. Algorithm 1 CODEBLOCK Coding-Item Construc￾tion with Local Closure Require: Response yi, response-token indices Ti with char￾acter offsets, code regions Fi, protected syntax set ΩP Ensure: Coding items Ui = {ui,k}… view at source ↗
Figure 5
Figure 5. Figure 5: Sensitivity analysis of NL and code supervi [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sensitivity analysis of token scoring sources [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Sensitivity analysis of λ on the average pass@1 score across six code-generation benchmarks. best average score. Compared with λ = 0, adding the data-flow bonus improves avg06 from 53.84 to 54.58, suggesting that reach and bridge signals provide complementary information beyond token￾level GCE utility. However, the trend is not monotonic: further increasing λ does not consistently improve perfor￾mance and … view at source ↗
read the original abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CodeBlock, a structure-aware sparse supervision method for fine-tuning code LLMs. It first filters high-quality instruction-response pairs, partitions code responses into syntactically complete blocks, estimates block utility via aggregated generalized cross-entropy on core logic tokens combined with data-flow reach and bridge reranking, and applies loss only to selected blocks plus informative NL tokens while keeping the full response as context. On six code-generation benchmarks, it reports stronger average pass@1 than full-token SFT and competitive selection baselines while supervising only 1.9% of response tokens.

Significance. If the performance gains are shown to arise from the block-level mechanism rather than upstream pair filtering, the work would provide a principled way to respect syntactic and data-flow structure in code supervision, potentially improving sample efficiency over uniform or token-level masking approaches.

major comments (2)
  1. [Experimental Results] Experimental Results section: the full-token SFT baseline is not described as using the same high-quality pair filter as CodeBlock, so the reported pass@1 gains cannot be attributed to block granularity versus pair selection. An ablation that applies full-token SFT to the filtered pairs (and block supervision to unfiltered pairs) is required to support the central claim.
  2. [Experimental protocol] § on experimental protocol (and Abstract): no description of the training protocol, exact baseline implementations, number of runs, statistical tests, or error bars is supplied, preventing verification that the average pass@1 improvements are reliable rather than artifacts of a single run or implementation detail.
minor comments (1)
  1. [Method] Notation for 'generalized cross-entropy' and 'data-flow reach/bridge' should be defined with an equation or pseudocode in the method section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the experimental claims.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the full-token SFT baseline is not described as using the same high-quality pair filter as CodeBlock, so the reported pass@1 gains cannot be attributed to block granularity versus pair selection. An ablation that applies full-token SFT to the filtered pairs (and block supervision to unfiltered pairs) is required to support the central claim.

    Authors: We agree that the manuscript does not explicitly clarify whether the full-token SFT baseline applies the same pair filter used by CodeBlock. To isolate the contribution of block-level selection from pair filtering, we will add the requested ablations (full SFT on filtered pairs; block supervision on unfiltered pairs) to the Experimental Results section of the revised manuscript. revision: yes

  2. Referee: [Experimental protocol] § on experimental protocol (and Abstract): no description of the training protocol, exact baseline implementations, number of runs, statistical tests, or error bars is supplied, preventing verification that the average pass@1 improvements are reliable rather than artifacts of a single run or implementation detail.

    Authors: We acknowledge that the current manuscript lacks sufficient detail on the experimental protocol. In the revision we will expand the relevant section (and update the abstract if needed) to specify the full training protocol, exact baseline implementations, number of runs with different random seeds, statistical tests performed, and error bars on all reported metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent benchmarks

full rationale

The paper describes an empirical framework (pair selection, syntactic partitioning, utility estimation via aggregated GCE plus data-flow reranking, then masked loss) and reports pass@1 results on six benchmarks. No equations, derivations, or first-principles claims appear that reduce an output quantity to a fitted parameter or self-citation by construction. The central claim rests on comparative experiments rather than any self-definitional or load-bearing self-citation step. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach builds on standard cross-entropy and static analysis concepts without new postulates.

pith-pipeline@v0.9.1-grok · 5747 in / 1028 out tokens · 28922 ms · 2026-06-27T10:49:37.550465+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 20 canonical work pages · 6 internal anchors

  1. [1]

    Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

    Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Miltiadis Allamanis, Marc Brockschmidt, and Mah- moud Khademi

  2. [2]

    Learning to Represent Programs with Graphs

    Learning to repre- sent programs with graphs.arXiv preprint arXiv:1711.00740. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others

  3. [3]

    Program Synthesis with Large Language Models

    Program synthesis with large language models.arXiv preprint arXiv:2108.07732. Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srini- vasan, Tianyi Zhou, Heng Huang, and 1 others

  4. [4]

    arXiv preprint arXiv:2307.08701

    Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others

  5. [5]

    Evaluating Large Language Models Trained on Code

    Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374. Yicheng Chen, Yining Li, Kai Hu, Ma Zerun, HaochenYe HaochenYe, and Kai Chen

  6. [6]

    In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915

    Mig: Automatic data selection for instruction tuning by maximizing information gain in semantic space. In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 9902–9915. Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, and Jiaheng Wei

  7. [7]

    Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

    Lm-mixup: Text data augmentation via language model based mixup.arXiv preprint arXiv:2510.20449. Yanjun Fu, Faisal Hamman, and Sanghamitra Dutta

  8. [8]

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    Graph- codebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yi- fan Wu, YK Li, and 1 others

  9. [9]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Deepseek- coder: when the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196. Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, and 1 others

  10. [10]

    Qwen2.5-Coder Technical Report

    Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186. Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao

  11. [11]

    From quantity to quality: Boost- ing llm performance with self-guided data selection for instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7602–7635. Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, and Jun Zhang

  12. [12]

    Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

    Token-level data selection for safe llm fine-tuning.arXiv preprint arXiv:2603.01185. Xiaotian Lin, Yanlin Qi, Yuxiang Luo, Themis Palpanas, and Yuyu Luo

  13. [13]

    Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others

    Tokentune: Dual-level utility estimation for scalable data selection in instruction tuning. Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and 1 others. 2024a. Rho-1: Not all tokens are what you need.arXiv preprint arXiv:2404.07965. Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wa...

  14. [14]

    9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

    What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685. 9 Weijie Lyu, Sheng-Jun Huang, and Xuan Xia

  15. [15]

    Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

    Efficient code llm training via distribution-consistent and diversity-aware data selection.arXiv preprint arXiv:2507.02378. Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, and Yang Liu

  16. [16]

    Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

    Improving data efficiency via curating llm-driven rating systems.arXiv preprint arXiv:2410.10877. Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, and Junchi Yan

  17. [17]

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

    sstoken: Self- modulated and semantic-aware token selection for llm fine-tuning.arXiv preprint arXiv:2510.18250. ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, and 1 others

  18. [18]

    arXiv preprint arXiv:2506.03524

    Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524. Yun-Da Tsai, Mingjie Liu, and Haoxing Ren

  19. [19]

    Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

    Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040. Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, and 1 others

  20. [20]

    InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043

    How do your code llms perform? empowering code in- struction tuning with really good data. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14027–14043. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro V on Werra, Arjun Guha, and Lingming Zhang. 202...

  21. [21]

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

    Rose: A reward-oriented data selection framework for llm task-specific instruction tuning.arXiv preprint arXiv:2412.00631. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen

  22. [22]

    arXiv preprint arXiv:2402.04333

    Less: Se- lecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333. Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, and Raymond Lee

  23. [23]

    InInternational Conference on Learning Representations, volume 2025, pages 66602–66656

    Bigcodebench: Benchmarking code generation with diverse function calls and complex in- structions. InInternational Conference on Learning Representations, volume 2025, pages 66602–66656. 10 A Experimental Details A.1 Sample-Level Selection Settings We perform sample-level selection from the 300K OpenCodeInstruct (OCI) source pool before ap- plying item-le...

  24. [24]

    For full-token SFT, we train the model for 30,000 steps to provide a strong dense-supervision base- line. For CodeBlcok and all sparse-selection base- lines, we use 3,000 training steps under the same optimization setup, so that the comparison focuses on the quality of selected supervision signals un- der a limited training budget. All experiments are con...