Recognition: no theorem link
SACS: A Code Smell Dataset using Semi-automatic Generation Approach
Pith reviewed 2026-05-15 22:15 UTC · model grok-4.3
The pith
A semi-automatic approach creates an open-source dataset with over 10,000 labeled samples for each of three code smells.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces a semi-automatic generation approach for code smell datasets. Candidate smelly samples are produced using automatic generation rules. Multiple metrics then group the samples into an automatically accepted group and a manually reviewed group. Structured review guidelines and an annotation tool support the manual validation. This process yielded the SACS dataset, with over 10,000 labeled samples for each of Long Method, Large Class, and Feature Envy.
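To make the routing step concrete, here is a minimal sketch of how rule-generated candidates might be split into the two groups; the metric names, thresholds, and example values are assumptions for illustration, not the paper's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A rule-generated smelly-code candidate with a few size/complexity metrics.

    The metric set here (LOC, cyclomatic complexity) is a placeholder; the
    paper's actual metrics and thresholds are not reproduced here.
    """
    name: str
    loc: int          # lines of code in the generated method
    cyclomatic: int   # cyclomatic complexity

AUTO_ACCEPTED = "auto_accepted"
MANUAL_REVIEW = "manual_review"

def route(c: Candidate, loc_hi: int = 120, cc_hi: int = 15) -> str:
    """Assign a candidate to the auto-accepted or manually reviewed group.

    Candidates that clearly exceed every smell threshold are accepted without
    review; anything borderline is queued for a human annotator.
    """
    if c.loc >= loc_hi and c.cyclomatic >= cc_hi:
        return AUTO_ACCEPTED   # unambiguously smelly on every metric
    return MANUAL_REVIEW       # ambiguous: let a reviewer decide

candidates = [
    Candidate("parseConfig", loc=140, cyclomatic=22),
    Candidate("renderRow", loc=70, cyclomatic=8),
]
for c in candidates:
    print(f"{c.name}: {route(c)}")
```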
What carries the argument
A semi-automatic generation approach that applies automatic rules, then uses metrics to separate samples for targeted manual review.
Load-bearing premise
Automatic generation rules combined with metric-based grouping produce candidate samples whose labels can be reliably validated through structured manual review.
What would settle it
An audit in which independent experts, following the same guidelines, assign conflicting labels to more than a few percent of a random sample of dataset instances.
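A hedged sketch of such an audit, assuming two independent annotators relabel a random sample of instances under the same guidelines; the sample size, labels, and simulated annotations below are illustrative placeholders.

```python
import random

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same instances."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = sorted(set(a) | set(b))
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

# Draw a random audit sample of dataset instances (size is illustrative).
dataset_ids = list(range(10_000))
audit_ids = random.sample(dataset_ids, k=200)

# In a real audit these labels would come from two independent experts using
# the annotation tool; here they are simulated so the script runs standalone.
annotator_1 = ["smelly" if i % 7 else "clean" for i in audit_ids]
annotator_2 = ["smelly" if i % 7 or i % 3 == 0 else "clean" for i in audit_ids]

disagreement = sum(x != y for x, y in zip(annotator_1, annotator_2)) / len(audit_ids)
print(f"disagreement rate: {disagreement:.1%}")
print(f"Cohen's kappa:     {cohen_kappa(annotator_1, annotator_2):.2f}")
# A disagreement rate above a few percent (or a low kappa) on this audit
# would undercut the dataset's label-reliability claim.
```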
Original abstract
Code smell is a great challenge in software refactoring; it indicates latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smell has received extensive attention. In particular, research applying machine learning techniques has become a popular topic in recent studies. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, in this study we explore a semi-automatic approach to generate a code smell dataset with high-quality data samples. Specifically, we first applied a set of automatic generation rules to produce candidate smelly samples. We then employed multiple metrics to group the data samples into an automatically accepted group and a manually reviewed group, enabling reviewers to concentrate their efforts on ambiguous samples. Furthermore, we established structured review guidelines and developed an annotation tool to support the manual validation process. Based on the proposed semi-automatic generation approach, we created an open-source code smell dataset, SACS, covering three widely studied code smells: Long Method, Large Class, and Feature Envy. Each code smell category includes over 10,000 labeled samples. This dataset could provide a large-scale and publicly available benchmark to facilitate future studies on code smell detection and automated refactoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a semi-automatic method to construct the SACS dataset for three code smells (Long Method, Large Class, Feature Envy). Candidate samples are generated via automatic rules, partitioned by metrics into an auto-accepted group and a manually reviewed group, and the latter is validated using structured guidelines and a custom annotation tool. The resulting open-source dataset contains over 10,000 labeled samples per smell category and is positioned as a large-scale benchmark for machine-learning-based smell detection and refactoring research.
Significance. If the quality claims are substantiated, SACS would be a useful contribution by supplying a publicly available, large-scale resource that attempts to combine the scalability of rule-based generation with targeted human oversight. This could support more reliable training and benchmarking of ML models for code smell detection, directly addressing the scarcity of high-quality labeled data noted in the introduction.
major comments (1)
- [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.
minor comments (1)
- [Dataset description] The exact counts, class distributions, and source-project statistics for the >10,000 samples per category should be stated explicitly (rather than the rounded 'over 10,000' figure) to allow readers to judge balance and coverage.
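A minimal sketch of the kind of summary this comment asks for, assuming the released dataset exposes one CSV per smell category with sample_id, project, and label columns (a hypothetical layout; the actual SACS release format is not specified here):

```python
import csv
from collections import Counter

def summarize(path: str) -> None:
    """Print exact sample counts, class balance, and source-project coverage
    for one smell category, assuming a CSV with sample_id, project, label."""
    labels: Counter[str] = Counter()
    projects: Counter[str] = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row["label"]] += 1
            projects[row["project"]] += 1
    total = sum(labels.values())
    print(f"total samples: {total}")
    for label, count in labels.most_common():
        print(f"  {label}: {count} ({count / total:.1%})")
    print(f"source projects covered: {len(projects)}")

# Hypothetical per-category file names; the real SACS release may differ.
# summarize("sacs_long_method.csv")
# summarize("sacs_large_class.csv")
# summarize("sacs_feature_envy.csv")
```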
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address the major comment below and indicate the changes planned for the revised version.
Point-by-point responses
- Referee: [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.
Authors: We agree that quantitative details on the manual review step would strengthen the quality claims. The review was conducted by a single experienced annotator using the structured guidelines and annotation tool to maintain consistency, which is why inter-rater agreement statistics such as Cohen's kappa were not applicable or reported. In the revised manuscript we will add: (1) the exact fraction of candidate samples routed to the manual review group, (2) a description of the disagreement-resolution procedure (re-examination of borderline cases by the same annotator against the guidelines), and (3) a brief error analysis of samples initially flagged by metrics but rejected after review. These additions will allow readers to evaluate the improvement over purely automatic labeling.
revision: partial
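As a rough illustration of the numbers promised above, the routed fraction and post-review rejection rate could be derived from the annotation tool's exports; the record format and field names below are hypothetical, not taken from the paper.

```python
from collections import Counter

# Hypothetical per-candidate records as they might be exported from the
# annotation tool: how each sample was routed and how its review ended.
records = [
    {"id": 1, "routing": "auto_accepted", "final": "accepted"},
    {"id": 2, "routing": "manual_review", "final": "accepted"},
    {"id": 3, "routing": "manual_review", "final": "rejected"},
]

routing = Counter(r["routing"] for r in records)
reviewed = [r for r in records if r["routing"] == "manual_review"]
rejected = [r for r in reviewed if r["final"] == "rejected"]

fraction_reviewed = routing["manual_review"] / len(records)
rejection_rate = len(rejected) / len(reviewed) if reviewed else 0.0

print(f"fraction routed to manual review: {fraction_reviewed:.1%}")
print(f"rejected after manual review:     {rejection_rate:.1%}")
# The rejected samples are the natural starting point for the promised error
# analysis: which generation rules or metric ranges produced them.
```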
Circularity Check
No circularity: dataset labels derive from external rules and review, not self-referential inputs
full rationale
The paper describes generating candidates via a fixed set of automatic rules, partitioning them by independent metrics into auto-accepted versus manual-review buckets, then applying external structured guidelines and an annotation tool for validation. No step defines the final labels in terms of the dataset itself, fits parameters to the output data and renames them as predictions, or relies on self-citations whose content reduces to the present claims. The process is anchored in checks external to the dataset itself (generation rules, metrics, human review) and produces SACS as an output rather than presupposing it.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A set of automatic generation rules can produce candidate smelly code samples (a toy sketch of one such rule follows below)
- domain assumption: Multiple metrics can group samples into automatically accepted and manually reviewed categories
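The first axiom leans on rules that synthesize smelly code. One pattern the paper describes merges a callee's statements into its caller at the call site, substituting the callee's parameter with the caller's argument, to create a Long Method candidate. A toy, line-based sketch of that idea follows; real generation rules would operate on parsed Java rather than raw strings.

```python
def inline_callee(caller_body: list[str], callee_body: list[str],
                  call_stmt: str, param: str, argument: str) -> list[str]:
    """Replace a call site in the caller with the callee's statements,
    substituting the callee's parameter with the caller's argument.

    This mimics, on plain source lines, the caller/callee merge pattern used
    to synthesize Long Method candidates; a real rule would work on an AST.
    """
    merged: list[str] = []
    for line in caller_body:
        if line.strip() == call_stmt:
            merged.extend(stmt.replace(param, argument) for stmt in callee_body)
        else:
            merged.append(line)
    return merged

caller = [
    "int[] result = compute();",
    "print_ary(result);",   # call site to be inlined
    "return 0;",
]
callee = [
    "for (int x : a) {",
    "    System.out.println(x);",
    "}",
]
for line in inline_callee(caller, callee, "print_ary(result);", "a", "result"):
    print(line)
```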
Reference graph
Works this paper leans on
- [8] M. Fowler, K. Beck, J. Brant, and W. Opdyke, Refactoring: Improving the Design of Existing Code. Reading, USA: Addison-Wesley Professional, 1999.
- [9] A. B. Baqais and M. Alshayeb, "Automatic software refactoring: a systematic literature review," Software Quality Journal, vol. 28, no. 2, pp. 459-502, 2020.
- [10] T. Sharma and D. Spinellis, "A survey on software smells," Journal of Systems and Software, vol. 138, pp. 158-173, 2018.
- [11] L. Madeyski and T. Lewowski, "MLCQ: Industry-Relevant Code Smell Data Set," in Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE '20), pp. 342-347, 2020.
- [12] H. Liu, J. Jin, Z. Xu, Y. Zou, Y. Bu, and L. Zhang, "Deep Learning Based Code Smell Detection," IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1811-1837, 2021.
- [13] H. Y. Zhang and T. Kishi, "Long Method Detection Using Graph Convolutional Networks," Journal of Information Processing, vol. 31, pp. 469-477, 2023.
- [14] H. Y. Zhang and T. Kishi, "Large Class Detection Using GNNs: A Graph Based Deep Learning Approach Utilizing Three Typical GNN Model Architectures," IEICE Transactions on Information and Systems, vol. E107.D, no. 9, pp. 1140-1150, 2024.
- [15] F. A. Fontana, M. V. Mäntylä, M. Zanoni, and A. Marino, "Comparing and experimenting machine learning techniques for code smell detection," Empirical Software Engineering, vol. 21, pp. 1143-1191, 2016.
- [32] OpenRefine. https://github.com/OpenRefine