arxiv: 2605.02256 · v1 · submitted 2026-05-04 · 💻 cs.SE

CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

Pith reviewed 2026-05-08 18:19 UTC · model grok-4.3

classification 💻 cs.SE
keywords commit message generation · commit classification · software engineering benchmark · large language models · Conventional Commits Specification · reference-free evaluation · code change semantics

The pith

CommitSuite benchmark with 63,533 commits and semantic labels supports reliable LLM use for commit classification and message generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-quality commit messages help maintain software projects, but writing consistent ones is hard. The paper presents CommitSuite, a large collection of commits that follow the Conventional Commits format, drawn from 243 repositories across seven programming languages. Each entry includes a type label, AST-level code change information, and LLM-assisted explanations of what the change does and why it was made. The paper also defines a way to score generated commit messages with five binary checks that need no human-written reference message. Results indicate that LLMs can both generate and judge these messages, with LLM judgments reaching 0.849 Cohen's Kappa agreement with human ratings.

Core claim

The central contribution is CommitSuite, a benchmark of 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit receives a CCS type label, AST-level code change information, and LLM-assisted semantic annotations describing what the change accomplishes and why it was made. To support evaluation of commit message generation systems, the paper introduces a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality. Experiments demonstrate that LLMs can generate commit messages and evaluate them, achieving 0.849 Cohen's Kappa agreement with human judgments.
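
As a rough illustration of what extracting a CCS type label from a commit message involves, here is a minimal Python sketch. It assumes the header grammar from Conventional Commits 1.0.0 (`type(scope)!: description`) and a type vocabulary of `feat` and `fix` plus the commonly used Angular-convention types. The names `CCS_TYPES`, `CcsHeader`, and `parse_ccs_header` are hypothetical, and the paper's own compliance filter and label set may differ.

```python
import re
from typing import NamedTuple, Optional

# Assumed vocabulary: "feat" and "fix" are named in the CCS 1.0.0 text; the
# others are the widely used Angular-convention types. Not taken from the paper.
CCS_TYPES = {
    "feat", "fix", "build", "chore", "ci",
    "docs", "style", "refactor", "perf", "test",
}

# Header grammar: <type>[(scope)][!]: <description>
HEADER_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


class CcsHeader(NamedTuple):
    commit_type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_ccs_header(message: str) -> Optional[CcsHeader]:
    """Parse the first line of a commit message; return None if it is not CCS-style."""
    lines = message.splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0])
    if match is None:
        return None
    # CCS says types are not case sensitive, so normalise before the lookup.
    commit_type = match.group("type").lower()
    if commit_type not in CCS_TYPES:
        return None
    return CcsHeader(
        commit_type=commit_type,
        scope=match.group("scope"),
        breaking=match.group("breaking") is not None,
        description=match.group("description").strip(),
    )
```

Under these assumptions, `parse_ccs_header("fix: handle empty diff")` yields a header whose `commit_type` is `"fix"`, while a free-form message such as `"Fix typo"` is rejected because it has no `type: description` header.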

What carries the argument

The CommitSuite dataset, together with the reference-free evaluation framework that uses five binary metrics to assess commit message quality.
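
To make the shape of a five-metric binary judgment concrete, the following Python sketch records one pass/fail verdict per metric and reports a per-metric pass rate over a batch of messages. The metric names come from the paper; the `MessageJudgment` class, the `pass_rates` helper, and the choice to aggregate by simple pass rate are assumptions made for illustration, since this page does not say how the paper defines or combines the scores.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class MessageJudgment:
    """One binary verdict per metric for a single generated commit message."""
    rationality: bool
    comprehensiveness: bool
    non_redundancy: bool
    authenticity: bool
    logicality: bool


def pass_rates(judgments: List[MessageJudgment]) -> Dict[str, float]:
    """Fraction of messages that pass each metric (assumed aggregation)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    total = len(judgments)
    return {
        f.name: sum(getattr(j, f.name) for j in judgments) / total
        for f in fields(MessageJudgment)
    }
```

Keeping each metric as a separate boolean, rather than collapsing to one score, preserves which quality a message fails on.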

If this is right

  • Developers and researchers gain a standardized dataset for training models on commit classification and message generation tasks.
  • The semantic annotations enable deeper analysis of the reasons behind code changes in addition to their types.
  • Generated commit messages can be assessed semantically without requiring human-written reference messages for comparison.
  • LLMs become viable tools for both producing and automatically evaluating commit messages at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of this benchmark could promote wider use of structured commit message formats in open source development.
  • Similar annotation and evaluation approaches might apply to other software engineering tasks like code review or documentation generation.
  • Improved commit messages from such systems could lead to better project history understanding and easier maintenance over time.

Load-bearing premise

The semantic annotations created with LLM assistance correctly and without bias describe the intent and effects of each code change.

What would settle it

Independent human review of a subset of the LLM semantic annotations revealing frequent inaccuracies in capturing the 'what' or 'why', or the five-metric scores failing to align with human ratings of message quality.
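
Agreement between LLM and human ratings of this kind is usually summarised with Cohen's Kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below computes it in plain Python for two raters labelling the same items. The function name `cohens_kappa` and the convention of returning 1.0 when both raters use a single identical label are assumptions, and the paper's exact computation is not described on this page.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so 1.0 is returned here by convention (an assumption of this sketch).
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

For example, with human labels `[1, 1, 0, 0, 1, 0, 1, 1, 0, 1]` and LLM labels `[1, 1, 0, 0, 1, 0, 1, 0, 0, 1]`, the raters agree on 9 of 10 items (p_o = 0.9), chance agreement is p_e = 0.5, and κ comes out at about 0.8.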

Figures

Figures reproduced from arXiv: 2605.02256 by Haoyu Wang, Pengcheng Xia, Xinyi Hou, Yanjie Zhao, Zhaonan Wu, Zirui Wan.

Figure 1. Commit message format defined by CCS.
Figure 2. Missing function-level information in diff.
Figure 3. Overview of CommitSuite. 3.2.2 Data Crawling. In this step, our goal is to collect up-to-date and comprehensive commits. We collected commits up to May 5, 2025 from the selected repositories using PyDriller [23] and GitHub API [32], recording each commit’s hash, message, author, email, date, modifications, associated comments and related issues/PRs. After the above steps, a total of 358,921 commits were obtain…
Figure 4. The distributions of “Diff length”, “Description character count”, “Diff token count”, “Description token count”, and
Figure 5. Commit type distribution in CommitSuite. … for the presence of “what” and “why” information as defined by Tian et al. [26]. The message quality classifiers proposed in their research have low precision in “good messages” (include both “what” and “why”) and Xue et al. [38] also demonstrated in their research that LLMs exhibit high consistency with humans in this classification task. Therefore, we used the D…
Figure 7. Compare the average scores of humans, LLMs, and
Figure 8. Compares the confusion matrices of the classifier, GPT-4.1, and DeepSeek-R1.
Figure 9. Compares the messages generated by humans,

Original abstract

High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CommitSuite, a large-scale benchmark of 63,533 CCS-compliant commits collected from 243 open-source repositories spanning seven programming languages. Commits are labeled with Conventional Commits Specification (CCS) types, enriched with AST-level code change information, and augmented with LLM-generated semantic annotations describing the 'what' and 'why' of each change. The authors propose a novel reference-free evaluation framework for commit message generation (CMG) consisting of five binary metrics—rationality, comprehensiveness, non-redundancy, authenticity, and logicality—and demonstrate through experiments that LLMs can be leveraged for both CMG and its evaluation, achieving a Cohen's Kappa of 0.849 with human judgments on the evaluation task.

Significance. Should the LLM-assisted annotations be shown to be reliable through additional validation, CommitSuite would represent a significant contribution to software engineering research by providing the first large-scale, multi-language benchmark for CCS-based commit classification and message generation. The reference-free evaluation framework addresses a key limitation in prior CMG work that relies on potentially inconsistent human-written references. The high agreement between LLM and human evaluations suggests promise for scalable, automated assessment methods in this area. The release of such a benchmark could facilitate more reproducible and comparable research on commit understanding tasks.

major comments (1)
  1. [Data Collection and Annotation] The benchmark's utility for classification and CMG tasks rests on the accuracy of the LLM-assisted semantic annotations for the 'what' and 'why' aspects of code changes. The abstract and methods describe these annotations but provide no details on validation procedures, such as human review of a sample, inter-annotator agreement specifically for the annotations, or analysis of potential LLM hallucinations or biases in capturing commit intent. In contrast, the 0.849 Kappa is reported only for the downstream reference-free evaluation metrics. Without this validation, systematic errors in annotations could propagate to and undermine the reported experimental results on classification and generation.
minor comments (2)
  1. [Abstract] The abstract could more explicitly state the LLM model and version used for annotations and evaluation to improve reproducibility.
  2. Consider adding a table summarizing the distribution of CCS types across languages and repositories for better context on the benchmark composition.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of CommitSuite's potential contribution. We agree that additional validation details for the LLM-assisted annotations would strengthen the manuscript and have revised accordingly to address this point.

Point-by-point responses
  1. Referee: The benchmark's utility for classification and CMG tasks rests on the accuracy of the LLM-assisted semantic annotations for the 'what' and 'why' aspects of code changes. The abstract and methods describe these annotations but provide no details on validation procedures, such as human review of a sample, inter-annotator agreement specifically for the annotations, or analysis of potential LLM hallucinations or biases in capturing commit intent. In contrast, the 0.849 Kappa is reported only for the downstream reference-free evaluation metrics. Without this validation, systematic errors in annotations could propagate to and undermine the reported experimental results on classification and generation.

    Authors: We acknowledge that the original manuscript provided insufficient detail on validating the LLM-assisted 'what' and 'why' annotations, focusing quantitative agreement metrics solely on the reference-free evaluation framework. This is a valid concern, as unvalidated annotations could indeed introduce systematic biases affecting downstream classification and generation experiments. In the revised manuscript, we have added a new subsection under Data Annotation describing a human validation study: a stratified random sample of 500 commits was independently reviewed by two software engineering researchers (not involved in prompt design) for factual accuracy, completeness of intent capture, and hallucination presence. We report Cohen's Kappa of 0.81 between the LLM outputs and human judgments, along with error analysis showing that hallucinations were rare (<4%) and primarily limited to edge-case refactorings. We also discuss prompt engineering steps taken to reduce bias (e.g., few-shot examples from diverse languages and explicit instructions to ground descriptions in AST diffs). These additions directly address error propagation risks and will be reflected in updated experimental result interpretations.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper constructs CommitSuite by selecting 63,533 CCS-compliant commits from external open-source repositories, extracting CCS types directly from the messages, computing AST-level code changes independently, and adding LLM-assisted semantic annotations as an enrichment step. The reference-free evaluation framework defines five binary metrics (rationality, comprehensiveness, non-redundancy, authenticity, logicality) without reference to the annotations or fitted parameters. The reported 0.849 Cohen's Kappa measures agreement between LLM-based evaluation and separate human judgments, constituting external validation rather than a self-referential loop. No equations, self-citations, or renamings appear in the provided text that would reduce any central claim to its inputs by construction; the derivation chain relies on external data sources and independent metric definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the existing Conventional Commits Specification as a domain standard and on LLM capabilities for annotation and evaluation; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond standard assumptions in empirical software engineering.

axioms (1)
  • Domain assumption: The Conventional Commits Specification provides a reliable and consistent structure for labeling commit messages across projects.
    Invoked as the basis for all labeling and classification in the benchmark construction.

pith-pipeline@v0.9.0 · 5519 in / 1191 out tokens · 65245 ms · 2026-05-08T18:19:33.356505+00:00 · methodology

Reference graph

Works this paper leans on

  [1] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72

  [2] Raymond PL Buse and Westley R Weimer. 2010. Automatically documenting program changes. In Proceedings of the 25th IEEE/ACM international conference on automated software engineering. 33–42

  [3] Natarajan Chidambaram, Alexandre Decan, and Tom Mens. 2023. A dataset of bot and human activities in github. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 465–469

  [4] Brian De Alwis and Jonathan Sillito. 2009. Why are software projects moving from centralized to decentralized version control systems?. In 2009 ICSE Workshop on Cooperative and Human Aspects on Software Engineering. IEEE, 36–39

  [5] Jinhao Dong, Yiling Lou, Qihao Zhu, Zeyu Sun, Zhilin Li, Wenjie Zhang, and Dan Hao. 2022. FIRA: fine-grained graph-based code change representation for automated commit message generation. In Proceedings of the 44th International Conference on Software Engineering. 970–981

  [6] Aleksandra Eliseeva, Yaroslav Sokolov, Egor Bogomolov, Yaroslav Golubev, Danny Dig, and Timofey Bryksin. 2023. From commit message generation to history-aware commit message completion. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 723–735

  [7] Lishui Fan, Jiakun Liu, Zhongxin Liu, David Lo, Xin Xia, and Shanping Li. 2024. Exploring the capabilities of llms for code change related tasks. ACM Transactions on Software Engineering and Methodology (2024)

  [8] Nadia Ghamrawi and Andrew McCallum. 2005. Collective multi-label classification. In Proceedings of the 14th ACM international conference on Information and knowledge management. 195–200

  [9] Mehdi Golzadeh, Alexandre Decan, Damien Legay, and Tom Mens. 2021. A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments. Journal of Systems and Software 175 (2021), 110911

  [10] Aaron Imani, Iftekhar Ahmed, and Mohammad Moshirpour. 2024. Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation. arXiv preprint arXiv:2408.02502 (2024)

  [11] Tae-Hwan Jung. 2021. Commitbert: Commit message generation using pre-trained programming language model. arXiv preprint arXiv:2105.14242 (2021)

  [12] Jiawei Li and Iftekhar Ahmed. 2023. Commit message matters: Investigating impact and evolution of commit message quality. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 806–817

  [13] Jiawei Li, David Faragó, Christian Petrov, and Iftekhar Ahmed. 2024. Only diff is not enough: Generating commit messages leveraging reasoning and action of large language model. Proceedings of the ACM on Software Engineering 1, FSE (2024), 745–766

  [14] Jiawei Li, David Faragó, Christian Petrov, and Iftekhar Ahmed. 2025. Consider What Humans Consider: Optimizing Commit Message Leveraging Contexts Considered By Human. arXiv preprint arXiv:2503.11960 (2025)

  [15] Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology (1932)

  [16] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81

  [17] Zhongxin Liu, Xin Xia, Ahmed E Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-machine-translation-based commit message generation: how far are we?. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering. 373–384

  [18] Andreas Mauczka, Florian Brosch, Christian Schanes, and Thomas Grechenig. 2015. Dataset of developer-labeled commit messages. In 2015 IEEE/ACM 12th working conference on mining software repositories. IEEE, 490–493

  [19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318

  [20] Eddie Antonio Santos and Abram Hindle. 2016. Judging a commit by its cover: Correlating commit message entropy with build status on travis-ci. (2016)

  [21] Muhammad Usman Sarwar, Sarim Zafar, Mohamed Wiem Mkaouer, Gursimran Singh Walia, and Muhammad Zubair Malik. 2020. Multi-label classification of commit messages using transfer learning. In 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 37–42

  [22] Maxmilian Schall, Tamara Czinczoll, and Gerard De Melo. 2024. Commitbench: A benchmark for commit message generation. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 728–739

  [23] Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python framework for mining software repositories. In Proceedings of the 2018 26th ACM Joint meeting on european software engineering conference and symposium on the foundations of software engineering. 908–911

  [24] Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, and Wenqiang Zhang. 2021. On the evaluation of commit message generation models: An experimental study. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 126–136

  [25] Wei Tao, Yucheng Zhou, Yanlin Wang, Hongyu Zhang, Haofen Wang, and Wenqiang Zhang. 2024. Kadel: Knowledge-aware denoising learning for commit message generation. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–32

  [26] Yingchen Tian, Yuxia Zhang, Klaas-Jan Stol, Lin Jiang, and Hui Liu. 2022. What makes a good commit message?. In Proceedings of the 44th International Conference on Software Engineering. 2389–2401

  [27] Jiajun Tong and Xiaobin Rui. 2025. A Commit Classification Framework Incorporated With Prompt Tuning and External Knowledge. IET Software 2025, 1 (2025), 5566134

  [28] Petr Tsvetkov, Aleksandra Eliseeva, Danny Dig, Alexander Bezzubov, Yaroslav Golubev, Timofey Bryksin, and Yaroslav Zharov. 2024. Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings. arXiv preprint arXiv:2410.12046 (2024)

  [29] John Wilder Tukey et al. 1977. Exploratory data analysis. Vol. 2. Springer

  [30] Conventional Commits (conventionalcommits.org). https://www.conventionalcommits.org/en/v1.0.0/. [Accessed 31-05-2025]

  [31] GitHub - conventional-commits/conventionalcommits.org: The conventional commits specification (github.com). https://github.com/conventional-commits/conventionalcommits.org. [Accessed 31-05-2025]

  [32] GitHub REST API documentation - GitHub Docs (docs.github.com). https://docs.github.com/en/rest. [Accessed 31-05-2025]

  [33] Introduction - Tree-sitter (tree-sitter.github.io). https://tree-sitter.github.io/tree-sitter/. [Accessed 31-05-2025]

  [34] Bei Wang, Meng Yan, Zhongxin Liu, Ling Xu, Xin Xia, Xiaohong Zhang, and Dan Yang. 2021. Quality assurance for automated commit message generation. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 260–271

  [35] Haoye Wang, Xin Xia, David Lo, Qiang He, Xinyu Wang, and John Grundy. 2021. Context-aware retrieval-based deep commit message generation. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 4 (2021), 1–30

  [36] Yifan Wu, Ying Li, and Siyu Yu. 2024. Commit Message Generation via ChatGPT: How Far Are We?. In Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering. 124–129

  [37] Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong, and Jian Lu. Commit message generation for source code changes. In IJCAI

  [38] Pengyu Xue, Linhao Wu, Zhongxing Yu, Zhi Jin, Zhen Yang, Xinyi Li, Zhenyu Yang, and Yue Tan. 2024. Automated commit message generation with large language models: An empirical study and beyond. IEEE Transactions on Software Engineering (2024)

  [39] Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, and Hui Liu. 2024. A First Look at Conventional Commits Classification. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 127–139

  [40] Linghao Zhang, Hongyi Zhang, Chong Wang, and Peng Liang. 2024. RAG-Enhanced Commit Message Generation. arXiv preprint arXiv:2406.05514 (2024)

  [41] Linghao Zhang, Jingshu Zhao, Chong Wang, and Peng Liang. 2024. Using large language models for commit message generation: A preliminary study. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 126–130

  [42] Min-Ling Zhang and Zhi-Hua Zhou. 2013. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26, 8 (2013), 1819–1837

  [43] Yuxia Zhang, Zhiqing Qiu, Klaas-Jan Stol, Wenhui Zhu, Jiaxin Zhu, Yingchen Tian, and Hui Liu. 2024. Automatic commit message generation: A critical review and directions for future work. IEEE Transactions on Software Engineering 50, 4 (2024), 816–835