arxiv: 2509.18337 · v3 · submitted 2025-09-22 · 💻 cs.SE

CoRaCMG: Contextual Retrieval-Augmented Framework for Commit Message Generation

Pith reviewed 2026-05-18 13:50 UTC · model grok-4.3

classification 💻 cs.SE
keywords commit message generation · retrieval augmented generation · large language models · code diffs · software documentation · natural language generation

The pith

Retrieving similar historical diff-message pairs lets LLMs generate more precise commit messages by learning project terminology and style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that adding a small number of retrieved similar past diff and message pairs to an LLM prompt substantially raises the quality of generated commit messages. A sympathetic reader would care because commit messages are frequently vague or incomplete, and better automation could reduce manual documentation effort while improving change traceability. The approach works by retrieving matching pairs from project history, structuring them into the input, and letting the model observe human examples before generating for the current diff. Gains appear across standard metrics and hold for multiple models, with most benefit from the first one to three examples and little added value beyond that. The mechanism relies on the model extracting relevant terminology and conventions directly from the provided pairs.

Core claim

The paper claims that retrieving similar historical diff-message pairs and incorporating them into a structured prompt enables large language models to capture project-specific terminologies and writing styles from human-written examples, producing commit messages that score higher on BLEU, Rouge-L, METEOR, and CIDEr.

What carries the argument

The three-phase retrieval-augmented process: fetch similar past diff-message pairs, combine them with the query diff in a prompt, and generate the new message via LLM.
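
The three phases can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the bag-of-tokens cosine similarity stands in for whatever retriever the authors use, the prompt wording is invented, and `llm` is a hypothetical callable supplied by the caller. The function names `retrieve`, `augment`, and `generate` simply mirror the phase names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a diff or message into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_diff, history, k=1):
    """Phase 1: return the k (diff, message) pairs most similar to the query diff."""
    q = Counter(tokenize(query_diff))
    ranked = sorted(
        history,
        key=lambda pair: cosine(q, Counter(tokenize(pair[0]))),
        reverse=True,
    )
    return ranked[:k]


def augment(query_diff, examples):
    """Phase 2: place the retrieved pairs ahead of the query diff in one prompt."""
    # The instruction text below is an assumption, not the paper's template.
    parts = ["Write a commit message for the final diff, following the style of the examples."]
    for i, (diff, message) in enumerate(examples, start=1):
        parts.append(f"Example {i} diff:\n{diff}\nExample {i} message:\n{message}")
    parts.append(f"Diff:\n{query_diff}\nMessage:")
    return "\n\n".join(parts)


def generate(prompt, llm):
    """Phase 3: delegate to any callable that maps a prompt string to text."""
    return llm(prompt).strip()


# Toy history, invented for illustration only.
history = [
    ("- timeout = 30\n+ timeout = 60", "Increase default timeout to 60s"),
    ("- import os\n+ import os, sys", "Add sys import for CLI entry point"),
]
query = "- timeout = 60\n+ timeout = 120"
examples = retrieve(query, history, k=1)
prompt = augment(query, examples)
```

Passing `k=3` to `retrieve` would correspond to the three-example setting discussed on this page; any client function that takes a prompt string and returns text can be passed as `llm`.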

If this is right

  • Adding one or three retrieved pairs produces large relative gains on automatic metrics.
  • Further pairs beyond three add little additional improvement.
  • The gains occur because the model adopts terminology and conventions visible in the examples.
  • The method applies across different LLMs without model-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval step could support other style-sensitive software tasks such as generating pull-request summaries or review comments.
  • Retrieval may serve as a low-cost substitute for fine-tuning when adapting models to a single project's conventions.
  • Testing retrieval quality with stricter similarity thresholds or cross-project examples would clarify how close a match is required for reliable gains.

Load-bearing premise

That the retrieved historical pairs will be close enough to supply useful terminology and style cues rather than noise or mismatched patterns.

What would settle it

Generating messages with randomly chosen historical pairs instead of similarity-retrieved ones and checking whether the metric gains largely disappear.
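
A sketch of such a control experiment, under stated assumptions: `retrieve`, `generate`, and `score` are hypothetical callables supplied by the caller (a similarity retriever, an LLM wrapper, and a metric such as BLEU), and the harness itself is not taken from the paper.

```python
import random
from statistics import mean


def pick_random(history, k, rng):
    """Control arm: k pairs drawn uniformly, ignoring similarity to the query."""
    return rng.sample(history, k)


def run_arm(queries, history, select, generate, score):
    """Average metric score for one example-selection strategy."""
    results = []
    for query_diff, reference in queries:
        examples = select(query_diff, history)
        results.append(score(generate(query_diff, examples), reference))
    return mean(results)


def ablation(queries, history, retrieve, generate, score, k=1, seed=0):
    """Compare similarity-retrieved examples against random ones on the same queries."""
    rng = random.Random(seed)  # fixed seed so the control arm is repeatable
    retrieved = run_arm(queries, history, lambda q, h: retrieve(q, h, k), generate, score)
    control = run_arm(queries, history, lambda q, h: pick_random(h, k, rng), generate, score)
    return {"retrieved": retrieved, "random": control, "gap": retrieved - control}
```

If the `gap` stays close to zero, the gains would be better explained by generic few-shot prompting than by retrieval quality.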

Figures

Figures reproduced from arXiv: 2509.18337 by Bo Xiong, Chong Wang, Linghao Zhang, Peng Liang, Zongen Ren.

Figure 1: Commit Filtering Process and Results
  • Date: The timestamp when the commit was formally recorded in the version control system.
  • LoC: The total number of code lines modified in the commit, calculated as the sum of added and deleted lines.
  3.3. Data Filtering. To ensure quality, usability, and adaptability for downstream tasks, the ApacheCM dataset was created with six filtering rules to exclude low-qualit…

Figure 2: Overview of the CoRaCMG Framework

Figure 3: Direct prompt template and CoRaCMG prompt template
  Finally, following the normalization of the scores to a common scale, we combine the scores obtained from these two methods with equal weights (1:1) in our experiment, and then set it as the hybrid score. After calculating the hybrid scores between the query diff and all diffs in the ApacheCM-10K, the example pair with the highest score is retrieved as the…

Figure 4: Performance of CoRaCMG by Using GPT-4o Across Different Numbers of Example Pairs
  BLEU score grows from 18.76 to 21.52, and the CIDEr score increases from 13.76 to 15.75. However, the increasing trend of scores seems to reach a plateau when more than three retrieved example pairs are fed to GPT-4o in CoRaCMG.
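
The score combination described in the Figure 3 excerpt can be sketched as follows. Min-max scaling is an assumption here, since the excerpt only says the scores are normalized to a common scale, and the two input lists are placeholders for the two scoring methods, which the excerpt does not name.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(scores_a, scores_b, weight_a=0.5, weight_b=0.5):
    """Combine two per-candidate score lists; equal weights give the 1:1 mix."""
    norm_a, norm_b = min_max(scores_a), min_max(scores_b)
    return [weight_a * a + weight_b * b for a, b in zip(norm_a, norm_b)]


def best_index(scores_a, scores_b):
    """Index of the candidate with the highest hybrid score."""
    combined = hybrid_scores(scores_a, scores_b)
    return max(range(len(combined)), key=combined.__getitem__)


# Made-up scores for three candidate diffs, on deliberately different scales.
method_a = [2.0, 8.0, 5.0]
method_b = [0.9, 0.1, 0.8]
winner = best_index(method_a, method_b)
```

Normalizing before mixing matters because the two methods may score on different scales; without it, the larger-scale method would dominate the sum regardless of the weights.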

Original abstract

Commit messages play a key role in documenting the intent behind code changes. However, they are often low-quality, vague, or incomplete, limiting their usefulness. Commit Message Generation (CMG) aims to automatically generate descriptive commit messages from code diffs to reduce developers' effort and improve message quality. Although recent advances in LLMs have shown promise in automating CMG, their performance remains limited. This paper aims to enhance CMG performance by retrieving similar diff-message pairs to guide LLMs to generate commit messages that are more precise and informative. We proposed CoRaCMG, a Contextual Retrieval-augmented framework for Commit Message Generation, structured in three phases: (1) Retrieve: retrieving the similar diff-message pairs; (2) Augment: combining them with the query diff into a structured prompt; and (3) Generate: generating commit messages corresponding to the query diff via LLMs. CoRaCMG enables LLMs to learn project-specific terminologies and writing styles from the retrieved diff-message pairs. We evaluated CoRaCMG across multiple LLMs (e.g., GPT, DeepSeek, and Qwen) and compared its performance against SOTA baselines. Experimental results show that CoRaCMG significantly boosts LLM performance across four metrics (BLEU, Rouge-L, METEOR, and CIDEr). Specifically, DeepSeek-R1 achieves relative improvements of 76% in BLEU and 71% in CIDEr when augmented with a single retrieved example pair. After incorporating the single example pair, GPT-4o achieves the highest improvement rate, with BLEU increasing by 89%. Moreover, performance gains plateau after more than three examples are used, indicating diminishing returns. Further analysis shows that the improvements are attributed to the model's ability to capture the terminologies and writing styles of human-written commit messages from the retrieved example pairs.
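
Relative improvement is conventionally the gain divided by the baseline score. Assuming the paper follows that convention, a small helper makes the arithmetic explicit. The inputs are the BLEU and CIDEr values quoted in the Figure 4 excerpt on this page; the helper name is illustrative, and the excerpt does not say which example counts the two endpoints correspond to.

```python
def relative_gain(baseline, improved):
    """Percentage change from baseline to improved, relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Endpoints quoted in the Figure 4 excerpt (GPT-4o).
bleu_gain = relative_gain(18.76, 21.52)
cider_gain = relative_gain(13.76, 15.75)
print(f"BLEU: +{bleu_gain:.1f}%  CIDEr: +{cider_gain:.1f}%")
```

By the same formula, an 89% relative gain means the augmented score is 1.89 times the baseline score.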

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CoRaCMG, a three-phase retrieval-augmented framework (Retrieve similar diff-message pairs, Augment the prompt, Generate via LLM) for commit message generation. It claims that retrieving contextually similar historical pairs enables LLMs to learn project-specific terminology and writing style, yielding large relative gains across BLEU, Rouge-L, METEOR, and CIDEr for models including GPT-4o (89% BLEU lift with one pair) and DeepSeek-R1 (76% BLEU, 71% CIDEr), with performance plateauing after three examples.

Significance. If the results are robust, the work offers a practical, low-overhead way to improve LLM-based commit message generation by exploiting existing project history, which could reduce developer documentation burden in real repositories. The diminishing-returns observation is a useful empirical finding for prompt design in this domain.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central claim attributes the reported metric lifts (e.g., 76% BLEU for DeepSeek-R1, 89% for GPT-4o with a single pair) to the retrieval of similar pairs teaching terminology and style. No ablation replacing retrieved pairs with random or dissimilar pairs from the same corpus is described; without this control it is impossible to rule out that the gains arise from generic few-shot prompting rather than the contextual retrieval mechanism.
  2. [Abstract] The similarity function, embedding model, and any similarity threshold used in the Retrieve phase are not specified. This detail is load-bearing because the framework's novelty rests on the quality of the retrieved context.
  3. [§4] No statistical significance tests, confidence intervals, or variance across multiple runs are reported for the relative improvements, weakening the strength of the performance claims.
minor comments (2)
  1. [§4] Provide the full list of baselines, dataset statistics, and exact prompt templates in the main text or appendix to support reproducibility.
  2. [§3] Clarify how the retrieval corpus is constructed and whether it is project-specific or cross-project.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have carefully considered each point and outline our responses below, indicating the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract and §4 (Experiments)] The central claim attributes the reported metric lifts (e.g., 76% BLEU for DeepSeek-R1, 89% for GPT-4o with a single pair) to the retrieval of similar pairs teaching terminology and style. No ablation replacing retrieved pairs with random or dissimilar pairs from the same corpus is described; without this control it is impossible to rule out that the gains arise from generic few-shot prompting rather than the contextual retrieval mechanism.

    Authors: We agree that the absence of an ablation with random or dissimilar pairs leaves open the possibility that some gains stem from few-shot prompting in general rather than the contextual nature of the retrieval. While the manuscript includes further analysis attributing improvements to captured terminology and style, a controlled ablation would provide stronger evidence. We will add this ablation study in the revised §4, comparing performance with randomly selected pairs from the same corpus against the retrieved similar pairs.
    revision: yes

  2. Referee: [Abstract] The similarity function, embedding model, and any similarity threshold used in the Retrieve phase are not specified. This detail is load-bearing because the framework's novelty rests on the quality of the retrieved context.

    Authors: We appreciate this observation regarding reproducibility. The Retrieve phase uses cosine similarity over embeddings from a code-specific model, with top-k selection and no explicit threshold beyond k. We will revise the abstract and add a detailed description of the embedding model, similarity function, and selection process in the methodology section of the revised manuscript.
    revision: yes

  3. Referee: [§4] No statistical significance tests, confidence intervals, or variance across multiple runs are reported for the relative improvements, weakening the strength of the performance claims.

    Authors: We acknowledge that reporting statistical significance and variance would increase the robustness of the claims. In the revised version of §4, we will include paired statistical tests (e.g., t-tests) on the metric improvements, along with standard deviations or confidence intervals computed over multiple runs with different random seeds where applicable.
    revision: yes
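
The paired analysis promised in the third response could look roughly like this. It is a standard-library sketch, not the authors' procedure: it computes a paired t statistic and a percentile bootstrap interval over per-item score differences, and the score lists are made-up placeholders rather than results from the paper.

```python
import math
import random
from statistics import mean, stdev


def paired_t_statistic(baseline, augmented):
    """t statistic for the mean per-item difference (augmented minus baseline)."""
    diffs = [a - b for a, b in zip(augmented, baseline)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def bootstrap_ci(baseline, augmented, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(augmented, baseline)]
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-item metric scores, invented for illustration.
baseline_scores = [0.10, 0.20, 0.15, 0.30, 0.25]
augmented_scores = [0.20, 0.25, 0.30, 0.35, 0.45]
t_stat = paired_t_statistic(baseline_scores, augmented_scores)
ci_low, ci_high = bootstrap_ci(baseline_scores, augmented_scores)
```

Pairing by item matters because both systems are scored on the same diffs; an interval on the mean difference that excludes zero would support the claimed improvement more directly than a relative percentage alone.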

Circularity Check

0 steps flagged

No circularity: empirical pipeline evaluated on external metrics

The paper presents CoRaCMG as a three-phase procedural framework (retrieve similar diff-message pairs, augment the prompt, generate via LLM) whose performance is measured against standard external metrics (BLEU, Rouge-L, METEOR, CIDEr) and baselines. No equations, fitted parameters, or self-referential definitions appear in the abstract or described claims; reported gains are empirical outcomes rather than quantities derived by construction from the method itself. The attribution of improvements to learning project-specific style is an interpretive claim supported by experimental results, not a tautological reduction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that retrieved examples improve LLM output quality; no new mathematical entities or fitted constants are introduced beyond standard LLM prompting practices.

axioms (1)
  • domain assumption: Retrieved historical diff-message pairs contain transferable project-specific terminology and writing conventions that LLMs can internalize from few-shot examples.
    This premise is invoked when the paper states that CoRaCMG enables LLMs to learn from the retrieved pairs.


