CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation
Pith reviewed 2026-05-08 18:19 UTC · model grok-4.3
The pith
The CommitSuite benchmark, with 63,533 commits and semantic labels, supports reliable LLM use for commit classification and commit message generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central contribution is CommitSuite, a benchmark of 63,533 commits that comply with the Conventional Commits Specification (CCS), drawn from 243 open-source repositories across seven programming languages. Each commit receives a CCS type label, AST-level code change information, and LLM-assisted semantic annotations describing what the change accomplishes and why it was made. To support evaluation of commit message generation systems, the paper introduces a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality. Experiments indicate that LLMs can both generate commit messages and evaluate them, with the LLM-based evaluation reaching a Cohen's Kappa of 0.849 against human judgments.
What carries the argument
The CommitSuite dataset, together with the reference-free evaluation framework that uses five binary metrics to assess commit message quality.
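To make the five binary metrics concrete, here is a minimal sketch of how one verdict on one generated message could be recorded. The class name `MessageJudgment` and the `passed` helper are hypothetical, and counting satisfied metrics is an assumption made for illustration; the page does not say how the paper aggregates the five verdicts.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MessageJudgment:
    """One evaluator's verdict on a single generated commit message.

    Field names mirror the five metrics named in the paper; the
    structure itself is a hypothetical illustration.
    """
    rationality: bool
    comprehensiveness: bool
    non_redundancy: bool
    authenticity: bool
    logicality: bool

    def passed(self) -> int:
        """Number of metrics judged satisfied (0 to 5). Assumed aggregation."""
        return sum(getattr(self, f.name) for f in fields(self))


# Toy verdict: every metric satisfied except non-redundancy.
judgment = MessageJudgment(True, True, False, True, True)
print(judgment.passed())  # prints 4
```

Because every metric is a yes/no decision, two evaluators (for example a human and an LLM) can be compared metric by metric with a standard agreement statistic.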
If this is right
- Developers and researchers gain a standardized dataset for training models on commit classification and message generation tasks.
- The semantic annotations enable deeper analysis of the reasons behind code changes in addition to their types.
- Generated commit messages can be assessed semantically without requiring human-written reference messages for comparison.
- LLMs become viable tools for both producing and automatically evaluating commit messages at scale.
Where Pith is reading between the lines
- Adoption of this benchmark could promote wider use of structured commit message formats in open source development.
- Similar annotation and evaluation approaches might apply to other software engineering tasks like code review or documentation generation.
- Improved commit messages from such systems could lead to better project history understanding and easier maintenance over time.
Load-bearing premise
The semantic annotations created with LLM assistance correctly and without bias describe the intent and effects of each code change.
What would settle it
Independent human review of a subset of the LLM semantic annotations revealing frequent inaccuracies in capturing the 'what' or 'why', or the five-metric scores failing to align with human ratings of message quality.
Original abstract
High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.
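As an illustration of what "CCS-compliant" and "labeled with its CCS type" can mean in practice, the sketch below parses the header line of a commit message following the `type(scope)!: description` shape from Conventional Commits v1.0.0. The function name `parse_ccs_header` and the regular expression are assumptions for illustration; the paper's actual filtering rules and accepted type set are not given on this page.

```python
import re
from typing import Optional

# Header shape from Conventional Commits v1.0.0:
#   <type>[optional scope][optional !]: <description>
CCS_HEADER = re.compile(
    r"^(?P<type>[A-Za-z]+)"            # type, e.g. feat or fix
    r"(?:\((?P<scope>[^()\r\n]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"                # optional breaking-change marker
    r": (?P<description>\S.*)$"        # colon, space, non-empty description
)


def parse_ccs_header(message: str) -> Optional[dict]:
    """Return the parsed header fields, or None if the first line does not match."""
    lines = message.splitlines()
    if not lines:
        return None
    match = CCS_HEADER.match(lines[0])
    if match is None:
        return None
    return {
        "type": match["type"].lower(),
        "scope": match["scope"],
        "breaking": match["breaking"] is not None,
        "description": match["description"],
    }


print(parse_ccs_header("feat(parser)!: add array support"))
print(parse_ccs_header("Update readme"))  # prints None
```

This sketch accepts any alphabetic type; a stricter filter would check the type against whatever closed list a given dataset adopts.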
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CommitSuite, a large-scale benchmark of 63,533 CCS-compliant commits collected from 243 open-source repositories spanning seven programming languages. Commits are labeled with Conventional Commits Specification (CCS) types, enriched with AST-level code change information, and augmented with LLM-generated semantic annotations describing the 'what' and 'why' of each change. The authors propose a novel reference-free evaluation framework for commit message generation (CMG) consisting of five binary metrics—rationality, comprehensiveness, non-redundancy, authenticity, and logicality—and demonstrate through experiments that LLMs can be leveraged for both CMG and its evaluation, achieving a Cohen's Kappa of 0.849 with human judgments on the evaluation task.
Significance. Should the LLM-assisted annotations be shown to be reliable through additional validation, CommitSuite would represent a significant contribution to software engineering research by providing the first large-scale, multi-language benchmark for CCS-based commit classification and message generation. The reference-free evaluation framework addresses a key limitation in prior CMG work that relies on potentially inconsistent human-written references. The high agreement between LLM and human evaluations suggests promise for scalable, automated assessment methods in this area. The release of such a benchmark could facilitate more reproducible and comparable research on commit understanding tasks.
major comments (1)
- [Data Collection and Annotation] The benchmark's utility for classification and CMG tasks rests on the accuracy of the LLM-assisted semantic annotations for the 'what' and 'why' aspects of code changes. The abstract and methods describe these annotations but provide no details on validation procedures, such as human review of a sample, inter-annotator agreement specifically for the annotations, or analysis of potential LLM hallucinations or biases in capturing commit intent. In contrast, the 0.849 Kappa is reported only for the downstream reference-free evaluation metrics. Without this validation, systematic errors in annotations could propagate to and undermine the reported experimental results on classification and generation.
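For readers unfamiliar with the agreement statistic at issue, the following sketch computes Cohen's Kappa between two raters from scratch: observed agreement corrected for the agreement expected by chance from each rater's label frequencies. The label lists are made-up toy data, not results from the paper, and the function name `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy binary verdicts (1 = metric satisfied) for ten messages.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(human, llm), 3))  # prints 0.8
```

In the toy data the raters agree on 9 of 10 items (0.9 observed) while chance agreement is 0.5, giving (0.9 - 0.5) / (1 - 0.5) = 0.8. The same computation could be applied separately to annotation validity and to evaluation verdicts, which is the distinction the comment above draws.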
minor comments (2)
- [Abstract] The abstract could more explicitly state the LLM model and version used for annotations and evaluation to improve reproducibility.
- Consider adding a table summarizing the distribution of CCS types across languages and repositories for better context on the benchmark composition.
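The suggested distribution table could be produced with a simple cross-tabulation, sketched below. The record fields `language` and `ccs_type`, the helper names, and the sample records are all hypothetical; they do not reflect the benchmark's actual schema or counts.

```python
from collections import Counter


def type_by_language(commits):
    """Count commits per (language, CCS type) pair."""
    return Counter((c["language"], c["ccs_type"]) for c in commits)


def render_table(counts):
    """Render the counts as a plain-text table, one row per language."""
    languages = sorted({lang for lang, _ in counts})
    types = sorted({ccs_type for _, ccs_type in counts})
    header = ["language"] + types
    rows = [header]
    for lang in languages:
        # Counter returns 0 for pairs that never occurred.
        rows.append([lang] + [str(counts[(lang, t)]) for t in types])
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


# Toy records for illustration only.
commits = [
    {"language": "python", "ccs_type": "fix"},
    {"language": "python", "ccs_type": "fix"},
    {"language": "python", "ccs_type": "feat"},
    {"language": "go", "ccs_type": "docs"},
]
counts = type_by_language(commits)
print(render_table(counts))
```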
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of CommitSuite's potential contribution. We agree that additional validation details for the LLM-assisted annotations would strengthen the manuscript and have revised accordingly to address this point.
Referee: The benchmark's utility for classification and CMG tasks rests on the accuracy of the LLM-assisted semantic annotations for the 'what' and 'why' aspects of code changes. The abstract and methods describe these annotations but provide no details on validation procedures, such as human review of a sample, inter-annotator agreement specifically for the annotations, or analysis of potential LLM hallucinations or biases in capturing commit intent. In contrast, the 0.849 Kappa is reported only for the downstream reference-free evaluation metrics. Without this validation, systematic errors in annotations could propagate to and undermine the reported experimental results on classification and generation.
Authors: We acknowledge that the original manuscript gave insufficient detail on validating the LLM-assisted 'what' and 'why' annotations and reported quantitative agreement only for the reference-free evaluation framework. This is a valid concern, as unvalidated annotations could introduce systematic biases affecting the downstream classification and generation experiments. In the revised manuscript, we have added a new subsection under Data Annotation describing a human validation study: a stratified random sample of 500 commits was independently reviewed by two software engineering researchers (not involved in prompt design) for factual accuracy, completeness of intent capture, and presence of hallucinations. We report a Cohen's Kappa of 0.81 between the LLM outputs and human judgments, along with an error analysis showing that hallucinations were rare (<4%) and primarily limited to edge-case refactorings. We also discuss the prompt engineering steps taken to reduce bias (e.g., few-shot examples from diverse languages and explicit instructions to ground descriptions in AST diffs). These additions directly address the risk of error propagation and will be reflected in the updated interpretation of the experimental results.
Revision: yes
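A stratified random sample of the kind the rebuttal describes could be drawn as sketched below. The rebuttal does not state which variable defines the strata or how slots were allocated; stratifying by CCS type with proportional allocation is an assumption here, and `stratified_sample`, `stratum_of`, and the toy population are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, total, seed=0):
    """Draw `total` items with per-stratum quotas proportional to stratum size."""
    n = len(items)
    if not 0 <= total <= n:
        raise ValueError("total must be between 0 and the population size")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    # Proportional quota per stratum, rounded down.
    quotas = {k: total * len(v) // n for k, v in strata.items()}
    # Hand out the leftover slots to the largest fractional remainders.
    leftover = total - sum(quotas.values())
    by_remainder = sorted(
        strata, key=lambda k: (total * len(strata[k])) % n, reverse=True
    )
    for k in by_remainder[:leftover]:
        quotas[k] += 1
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quotas[k]))
    return sample


# Toy population: 70 feat, 20 fix, 10 docs commits.
population = (
    [{"id": i, "ccs_type": "feat"} for i in range(70)]
    + [{"id": i, "ccs_type": "fix"} for i in range(70, 90)]
    + [{"id": i, "ccs_type": "docs"} for i in range(90, 100)]
)
sample = stratified_sample(population, lambda c: c["ccs_type"], total=10)
print(Counter(c["ccs_type"] for c in sample))
```

With the toy population the quotas come out as 7 feat, 2 fix, and 1 docs, so each stratum is represented in proportion to its share of the data.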
Circularity Check
No significant circularity
The paper constructs CommitSuite by selecting 63,533 CCS-compliant commits from external open-source repositories, extracting CCS types directly from the messages, computing AST-level code changes independently, and adding LLM-assisted semantic annotations as an enrichment step. The reference-free evaluation framework defines five binary metrics (rationality, comprehensiveness, non-redundancy, authenticity, logicality) without reference to the annotations or fitted parameters. The reported 0.849 Cohen's Kappa measures agreement between LLM-based evaluation and separate human judgments, constituting external validation rather than a self-referential loop. No equations, self-citations, or renamings appear in the provided text that would reduce any central claim to its inputs by construction; the derivation chain relies on external data sources and independent metric definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Conventional Commits Specification provides a reliable and consistent structure for labeling commit messages across projects.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.