
arxiv: 2502.14925 · v2 · submitted 2025-02-19 · 💻 cs.SE

CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs

Pith reviewed 2026-05-23 01:55 UTC · model grok-4.3

classification 💻 cs.SE
keywords prompt compression · retrieval-augmented generation · code compression · program analysis · token prioritization · copy mechanism · RAG for coding tasks · language model compression

The pith

CodePromptZip compresses code prompts for RAG by ranking token types through program analysis and training a copy-augmented small LM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a code-specific compression method for retrieval-augmented generation in software engineering tasks where prompts often exceed practical context limits. It identifies token types such as identifiers via program analysis, ranks their removal order by ablation impact on downstream accuracy, and uses those rankings to build training data for a small language model. The model learns to produce compressed versions at chosen ratios while a built-in copy mechanism lets it retain critical tokens from the original snippet. This targets the gap that general natural-language compressors leave unaddressed in code. A reader would care because shorter yet faithful code contexts would allow more retrieved examples or lower inference cost without retraining the main coding model.
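The ablation step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `rank_removal_priority`, the `evaluate` callback, and the type labels `T1` to `T3` are hypothetical, and the sketch assumes that the token type whose removal hurts the downstream score least should be removed first.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (text, token_type); hypothetical representation


def rank_removal_priority(
    examples: List[List[Token]],
    token_types: List[str],
    evaluate: Callable[[List[List[Token]]], float],
) -> List[Tuple[str, float]]:
    """Rank token types by the score drop observed when that type is ablated.

    Returns (type, drop) pairs in ascending order of drop, so the first
    entry is the type that is cheapest to remove.
    """
    baseline = evaluate(examples)
    drops: Dict[str, float] = {}
    for ttype in token_types:
        # Remove every token of this type from every example, then re-score.
        ablated = [[tok for tok in ex if tok[1] != ttype] for ex in examples]
        drops[ttype] = baseline - evaluate(ablated)
    return sorted(drops.items(), key=lambda kv: kv[1])


# Toy stand-in for a downstream metric. The weights are arbitrary and do
# NOT reflect any finding of the paper.
TOY_WEIGHTS = {"T1": 3.0, "T2": 0.5, "T3": 2.0}


def toy_evaluate(examples: List[List[Token]]) -> float:
    return sum(TOY_WEIGHTS[ttype] for ex in examples for _, ttype in ex)


toy_examples = [[("a", "T1"), ("b", "T2"), ("c", "T3")]]
ranking = rank_removal_priority(toy_examples, ["T1", "T2", "T3"], toy_evaluate)
```

In a real setting `evaluate` would run the downstream coding model on prompts built from the ablated examples, which is far more expensive than the toy scorer used here.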

Core claim

CodePromptZip uses a type-aware, priority-driven strategy to construct training samples for a code compression model: program analysis identifies token types, and ablation ranks them by their impact on task performance. A small LM augmented with a copy mechanism is then trained on these samples so that it can compress at a specified ratio while minimizing performance degradation. The paper reports that the method surpasses entropy-based and distillation-based baselines, improving over the best baseline by 23.4 percent, 28.7 percent, and 8.7 percent on Assertion Generation, Bugs2Fix, and Code Suggestion, respectively.

What carries the argument

Type-aware priority-driven token removal ranking obtained from ablation analysis on program-analysis-identified token types, used to supervise training of a small LM compressor equipped with a copy mechanism.
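A priority-driven removal of this kind might look like the sketch below. It is an illustration only: the function name `compress`, the `type_rank` mapping, and the removal budget `floor(ratio * len(tokens))` are assumptions made for the example, and the type ranks in the usage lines are arbitrary rather than the paper's ranking. Within a type, more frequent token texts are dropped first.

```python
import heapq
import math
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (text, token_type); hypothetical representation


def compress(tokens: List[Token], ratio: float, type_rank: Dict[str, int]) -> List[str]:
    """Drop about `ratio` of the tokens, lowest-ranked types first.

    type_rank maps a token type to its removal rank (0 = remove first).
    Types missing from type_rank are never removed.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    budget = math.floor(ratio * len(tokens))  # assumed removal budget
    freq = Counter(text for text, _ in tokens)

    heap: List[Tuple[int, int, int]] = []
    for idx, (text, ttype) in enumerate(tokens):
        if ttype in type_rank:
            # Smaller tuples pop first; -freq puts frequent texts ahead,
            # and idx breaks remaining ties by position.
            heapq.heappush(heap, (type_rank[ttype], -freq[text], idx))

    removed = set()
    while heap and len(removed) < budget:
        _, _, idx = heapq.heappop(heap)
        removed.add(idx)

    # Preserve the original order of the surviving tokens.
    return [text for i, (text, _) in enumerate(tokens) if i not in removed]


snippet = [("x", "Identifier"), ("=", "Symbol"), ("x", "Identifier"),
           ("+", "Symbol"), ("y", "Identifier"), (";", "Symbol")]
arbitrary_rank = {"Symbol": 0, "Identifier": 1}  # illustrative only
print(compress(snippet, 0.5, arbitrary_rank))  # prints ['x', 'x', 'y']
```

Pairs of original and compressed snippets produced this way could then serve as supervision for a learned compressor, which is the role the source assigns to the ranking.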

If this is right

  • More retrieved code examples fit inside fixed context windows for RAG coding workflows.
  • Prompt processing cost drops, while task accuracy on assertion generation, bug repair, and code suggestion stays above that of the baseline compressors.
  • Compression ratio becomes a controllable input rather than a fixed hyper-parameter.
  • The same priority ranking supports multiple downstream coding models without retraining the compressor.
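The first implication is simple arithmetic, sketched below. The helper `examples_that_fit` and the numbers in the usage lines are hypothetical and not taken from the paper; the sketch assumes equally sized examples and that a ratio `r` removes `floor(r * n)` of an example's `n` tokens.

```python
import math


def examples_that_fit(budget_tokens: int, example_tokens: int, ratio: float = 0.0) -> int:
    """Count how many equally sized examples fit in a token budget.

    `ratio` is the fraction of each example's tokens removed by compression.
    """
    if example_tokens <= 0:
        raise ValueError("example_tokens must be positive")
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    kept = example_tokens - math.floor(ratio * example_tokens)
    return budget_tokens // kept


print(examples_that_fit(4096, 512))       # prints 8
print(examples_that_fit(4096, 512, 0.3))  # prints 11
```

In this made-up setting, removing 30 percent of each 512-token example raises the number of demonstrations that fit in a 4096-token budget from 8 to 11.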

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the learned priorities transfer across languages, the approach could serve as a lightweight pre-processing step for multilingual codebases.
  • Combining the copy-augmented compressor with existing entropy-based filters might yield further length reductions.
  • Testing the compressor on larger main models would show whether the relative gains persist when the downstream LM itself has greater capacity.

Load-bearing premise

The token-type removal priorities obtained from ablation analysis on the training tasks will remain effective when the compressor is applied to new tasks, different programming languages, or different downstream models.

What would settle it

The central claim would be falsified if applying the trained compressor to a new coding task or programming language yielded no gain, or a clear degradation, relative to the strongest baseline.

Figures

Figures reproduced from arXiv: 2502.14925 by Pengfei He, Shaowei Wang, Tse-Hsun Chen.

Figure 1: Removal priority of code token types: e.g., …
Figure 2: Framework of CODEPROMPTZIP.

Algorithm 1: Priority-driven Greedy Algorithm for Dataset Construction
Input: x_i^code = {x_j}_{j=1}^{L}, τ_code, type priorities of T.
Output: x̃_i^code.
1: Initialize a priority queue pq.
2: for each token x_j ∈ x_i^code do
3:   Assign priority to x_j (prioritize the drop of high-frequency tokens in prioritized type).
4:   Insert x_j into pq.
5: end for
6: removedTokens ← ∅.
7: L_rm ← ⌊τc…
Figure 3: Illustration of copy mechanism on CodeT5.
Figure 4: Trade-off between keeping more tokens in a …
Figure 5: Compression ratio control.
[Plot panels recovered from the extraction: (a) Assertion Generation (Exact Match), (b) Bugs2Fix (CodeBleu), (c) Code Suggestion (CodeBleu), each for CodeLlama-13B and Gemini-1.0-pro; methods compared: LLMLingua, LongLLMLingua, LLMLingua-2, RECOMP, CodePromptZip.]
Figure 6: Performance of the proposed CODEPROMPTZIP across different BLMs.
[Adjacent text, Section 6.4, RQ4: Transferability with Different BLM: CODEPROMPTZIP consistently outperforms baselines across studied base LMs CodeLlama-13B and Gemini-1.0 …]
Figure 7: The illustration of different RAG coding tasks …
Figure 8: Original Code Examples of Assertion Genera…
Figure 9: Compressed Code Examples of Assertion Generation (55 tokens, τcode: 0.1)
    ### FOCAL_METHOD
    getProduction(java.lang.String) return productionsByName;
    ### UNIT_TEST
    testJustifications() ; org.jsoar.kernel.Production j = agent; "<AssertPlaceHolder>";
Figure 10: Compressed Code Examples of Assertion Generation (39 tokens, τcode: 0.4)
    ### BUGGY_CODE
    public static TYPE_1 init(java.lang.String name, java.util.Date date) { TYPE_1 VAR_1 = new TYPE_1(); VAR_1.METHOD_1(name); java.util.Calendar VAR_2 = java.util.Calendar.getInstance(); VAR_2.METHOD_2(date); VAR_1.METHOD_3(VAR_2); return VAR_1; }
    ### FIXED_CODE
    public static TYPE_1 init(java.lang.String name, java.util.D…
Figure 13: Original Code Examples of Code Suggestion
Figure 14: Compressed Code Examples of Code Suggestion (121 tokens, τcode: 0.3)

Original abstract

Retrieval-Augmented Generation (RAG) enhances coding tasks by incorporating retrieved code examples into prompts. However, lengthy prompts, often exceeding tens of thousands of tokens, introduce challenges related to limited context windows of language models (LMs) and high computational costs. Existing prompt compression techniques focus on natural language, lacking tailored solutions for code. To address the gap, we propose CodePromptZip, a framework that compresses code examples before integrating them into RAG workflows. Our framework employs a type-aware, priority-driven strategy to construct training samples for the code compression model. By using program analysis, we identify token types (e.g., Identifier) and perform ablation analysis to rank their removal priorities based on their impact on task performance. We then train a small LM as the compressor on these samples, enabling flexible compression conditioned on specified ratios while minimizing performance degradation. Specifically, the compressor is augmented with a copy mechanism, allowing tokens to be directly copied from the original code snippets. Evaluation results show that CodePromptZip surpasses SOTA entropy-based and distillation-based baselines, improving by 23.4%, 28.7%, and 8.7% over the best baseline for Assertion Generation, Bugs2Fix, and Code Suggestion, respectively.
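The abstract does not spell out how the copy mechanism is formulated. One common way to let a decoder copy source tokens is a pointer-generator-style mixture, sketched below as an assumption rather than as the paper's design: the output distribution blends the decoder's vocabulary distribution with the attention mass placed on source positions. The function `mix_copy_distribution` and all numbers are hypothetical.

```python
from typing import Dict, List


def mix_copy_distribution(
    p_vocab: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
    p_gen: float,
) -> Dict[str, float]:
    """Blend generation and copy probabilities for one decoding step.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source
    positions whose token is w). Source tokens that are out of vocabulary
    still receive probability through the copy term.
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must align")
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must be in [0, 1]")

    final = {word: p_gen * prob for word, prob in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Hypothetical decoding step: the identifier is absent from the vocabulary
# distribution but can still be produced by copying it from the source.
step = mix_copy_distribution(
    p_vocab={"return": 0.6, "x": 0.4},
    attention=[0.5, 0.2, 0.3],
    source_tokens=["productionsByName", "return", "productionsByName"],
    p_gen=0.25,
)
```

For a compressor this matters because code identifiers are often rare or out of vocabulary; a copy path lets the model reproduce them exactly instead of regenerating them token by token.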

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CodePromptZip, a framework for compressing code prompts in RAG for coding tasks. It uses program analysis to identify token types, ablation analysis to rank their removal priorities by impact on task performance, and the resulting ranking to construct training samples for a small LM with a copy mechanism that supports flexible compression. The reported evaluation shows it surpassing SOTA baselines, with improvements of 23.4%, 28.7%, and 8.7% on the Assertion Generation, Bugs2Fix, and Code Suggestion tasks, respectively.

Significance. Should the experimental results prove robust and the compression method generalize across tasks, languages, and models, this could represent a meaningful advance in handling lengthy code contexts for retrieval-augmented generation in software engineering applications, potentially lowering costs and enabling better use of LMs in coding workflows.

major comments (2)
  1. [Abstract] The method's reliance on ablation-derived token removal priorities is load-bearing for the performance claims, yet the description provides no information on whether these ablations were conducted using held-out data or cross-validation separate from the final evaluation tasks. This leaves open the possibility that the reported gains arise from task-specific tuning of the priority list rather than a general code-specific strategy.
  2. [Evaluation results] No details are supplied regarding experimental controls, statistical significance tests, exact baseline implementations, or sensitivity to variations in model size or task distributions, which are necessary to substantiate the central claim of outperforming SOTA methods.
minor comments (1)
  1. Consider adding a dedicated section on limitations, particularly regarding the generalizability of the token-type priorities to new programming languages or downstream models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] The method's reliance on ablation-derived token removal priorities is load-bearing for the performance claims, yet the description provides no information on whether these ablations were conducted using held-out data or cross-validation separate from the final evaluation tasks. This leaves open the possibility that the reported gains arise from task-specific tuning of the priority list rather than a general code-specific strategy.

    Authors: We agree the manuscript does not explicitly describe the data split used for ablation analysis. In the revision we will add a dedicated paragraph in the method section stating that ablation studies to derive token-type removal priorities were performed on a held-out validation subset drawn from the training data and kept separate from all test sets used in the final evaluation. We will also report the cross-validation procedure employed to ensure the priority ranking generalizes beyond any single task split. revision: yes

  2. Referee: [Evaluation results] No details are supplied regarding experimental controls, statistical significance tests, exact baseline implementations, or sensitivity to variations in model size or task distributions, which are necessary to substantiate the central claim of outperforming SOTA methods.

    Authors: We acknowledge that the current evaluation section lacks these details. The revised manuscript will include: (i) explicit descriptions of experimental controls and the precise re-implementations of all baselines, (ii) statistical significance results (paired t-tests or Wilcoxon signed-rank tests with p-values) across the three tasks, and (iii) sensitivity analyses varying compressor model size and task distribution. These additions will appear in the main evaluation section and an expanded appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: method constructs compressor from program analysis and ablation on training samples, evaluated on held-out tasks.

full rationale

The paper describes a pipeline that first uses program analysis to identify token types, performs ablation to rank removal priorities based on task performance impact, constructs training samples from those priorities, and trains a compressor LM (with copy mechanism) on the resulting samples. The reported gains are measured on separate evaluation tasks (Assertion Generation, Bugs2Fix, Code Suggestion). No equations, fitted parameters, or self-citations are presented that reduce the final performance numbers to quantities defined by construction inside the paper itself. The derivation chain is therefore self-contained against external benchmarks and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework depends on the assumption that ablation-derived token priorities transfer across tasks and that a small LM can learn to compress at arbitrary ratios while preserving task utility; no new physical entities are postulated.

free parameters (1)
  • token removal priority ranking
    Derived from ablation analysis measuring impact on downstream task performance; used to label training samples.
axioms (1)
  • domain assumption: Program analysis correctly identifies token types and ablation on those types produces stable removal priorities for the target coding tasks.
    Invoked when constructing the training samples for the compressor model.

pith-pipeline@v0.9.0 · 5757 in / 1356 out tokens · 62395 ms · 2026-05-23T01:55:05.557782+00:00 · methodology

