pith. machine review for the scientific record.

arxiv: 2602.15342 · v2 · submitted 2026-02-17 · 💻 cs.SE

Recognition: no theorem link

SACS: A Code Smell Dataset using Semi-automatic Generation Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code smell · dataset · semi-automatic generation · Long Method · Large Class · Feature Envy · software refactoring · machine learning

The pith

A semi-automatic approach creates an open-source dataset with over 10,000 labeled samples for each of three code smells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a semi-automatic method to build code smell datasets that balances scale and label reliability. It applies automatic rules to generate candidate samples, uses metrics to group them for automatic acceptance or manual review, and provides structured guidelines and a tool for the review process. The result is the SACS dataset covering Long Method, Large Class, and Feature Envy, each with more than 10,000 samples. Such a dataset addresses the shortage of reliable data for machine learning research on code smell detection and refactoring. A sympathetic reader would care because better datasets can lead to more accurate tools for improving software maintainability.

Core claim

The paper introduces a semi-automatic generation approach for code smell datasets. Candidate smelly samples are produced using automatic generation rules. Multiple metrics then group the samples into an automatically accepted group and a manually reviewed group. Structured review guidelines and an annotation tool support the manual validation. This process yielded the SACS dataset with over 10,000 labeled samples for Long Method, Large Class, and Feature Envy each.
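The pipeline described above (rule-based generation, then metric-based triage) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the metric names and thresholds below are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One rule-generated candidate sample (metrics are illustrative)."""
    name: str
    loc: int         # lines of code
    complexity: int  # e.g. cyclomatic complexity

def partition(candidates, loc_hi=120, cc_hi=15, loc_lo=60, cc_lo=8):
    """Split candidates into auto-accepted vs. manually reviewed groups.

    Samples whose metrics clearly exceed the smell thresholds are accepted
    automatically; borderline samples go to a human reviewer. The threshold
    values here are placeholders, not the paper's actual settings.
    """
    auto_accept, manual_review = [], []
    for c in candidates:
        if c.loc >= loc_hi and c.complexity >= cc_hi:
            auto_accept.append(c)       # unambiguously smelly
        elif c.loc >= loc_lo or c.complexity >= cc_lo:
            manual_review.append(c)     # ambiguous: needs a reviewer
        # otherwise: discarded as a failed generation
    return auto_accept, manual_review
```

The point of the triage is that reviewer effort is spent only on the ambiguous middle band, which is what lets the approach scale past 10,000 samples per smell.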

What carries the argument

semi-automatic generation approach that applies automatic rules then uses metrics to separate samples for targeted manual review

Load-bearing premise

Automatic generation rules combined with metric-based grouping produce candidate samples whose labels can be reliably validated through structured manual review.

What would settle it

A relabeling study in which independent experts, following the same guidelines, assign labels that conflict with the published ones on more than a few percent of a random sample of dataset instances.

Figures

Figures reproduced from arXiv: 2602.15342 by Hanyu Zhang, Tomoji Kishi.

Figure 4: Feature envy data sample generation. In the first pattern, we determine whether the class of the target method has a parent class. If so, we examine whether any unique fields of the current class are accessed by the target method. If no such fields are used, the method is considered a candidate for relocation to the parent class. In the second pattern, we identify related classes based on property usage. Fo…
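The first pattern in the caption reduces to a simple predicate. The sketch below is an illustration only; the paper's generator works on real Java code, and the function and its parameters are hypothetical.

```python
def pull_up_candidate(fields_used_by_method, fields_unique_to_class, has_parent_class):
    """Pattern 1 from Figure 4: a method whose class has a parent and which
    touches no field unique to its own class is a candidate for relocation
    to the parent, yielding a synthetic Feature Envy sample after the move.
    """
    if not has_parent_class:
        return False
    # Relocation is only possible if the method uses no class-unique field.
    return not (set(fields_used_by_method) & set(fields_unique_to_class))
```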
Original abstract

Code smell is a great challenge in software refactoring: it indicates latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smell has received extensive attention; in particular, studies applying machine learning techniques have become a popular topic. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, we explore a semi-automatic approach to generate a code smell dataset with high-quality data samples. Specifically, we first applied a set of automatic generation rules to produce candidate smelly samples. We then employed multiple metrics to group the data samples into an automatically accepted group and a manually reviewed group, enabling reviewers to concentrate their efforts on ambiguous samples. Furthermore, we established structured review guidelines and developed an annotation tool to support the manual validation process. Based on the proposed semi-automatic generation approach, we created an open-source code smell dataset, SACS, covering three widely studied code smells: Long Method, Large Class, and Feature Envy. Each code smell category includes over 10,000 labeled samples. This dataset provides a large-scale, publicly available benchmark to facilitate future studies on code smell detection and automated refactoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a semi-automatic method to construct the SACS dataset for three code smells (Long Method, Large Class, Feature Envy). Candidate samples are generated via automatic rules, partitioned by metrics into an auto-accepted group and a manually reviewed group, and the latter is validated using structured guidelines and a custom annotation tool. The resulting open-source dataset contains over 10,000 labeled samples per smell category and is positioned as a large-scale benchmark for machine-learning-based smell detection and refactoring research.

Significance. If the quality claims are substantiated, SACS would be a useful contribution by supplying a publicly available, large-scale resource that attempts to combine the scalability of rule-based generation with targeted human oversight. This could support more reliable training and benchmarking of ML models for code smell detection, directly addressing the scarcity of high-quality labeled data noted in the introduction.

major comments (1)
  1. [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.
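The agreement statistic the referee asks for is straightforward to compute once a second annotator relabels a sample. A minimal sketch of Cohen's kappa, assuming two annotators labeling the same instances:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    Returns 1.0 for perfect agreement and 0.0 for chance-level agreement.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and aligned")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled at random according to
    # their own marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: only one label class occurs
    return (observed - expected) / (1 - expected)
```

By convention, kappa above roughly 0.6 to 0.8 is read as substantial agreement; reporting it for a relabeled slice of the manually reviewed group would directly address this comment.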
minor comments (1)
  1. [Dataset description] The exact counts, class distributions, and source-project statistics for the >10,000 samples per category should be stated explicitly (rather than the rounded 'over 10,000' figure) to allow readers to judge balance and coverage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment below and indicate the changes planned for the revised version.

read point-by-point responses
  1. Referee: [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.

    Authors: We agree that quantitative details on the manual review step would strengthen the quality claims. The review was conducted by a single experienced annotator using the structured guidelines and annotation tool to maintain consistency, which is why inter-rater agreement statistics such as Cohen's kappa were not applicable or reported. In the revised manuscript we will add: (1) the exact fraction of candidate samples routed to the manual review group, (2) a description of the disagreement-resolution procedure (re-examination of borderline cases by the same annotator against the guidelines), and (3) a brief error analysis of samples initially flagged by metrics but rejected after review. These additions will allow readers to evaluate the improvement over purely automatic labeling.

    revision: partial

Circularity Check

0 steps flagged

No circularity: dataset labels derive from external rules and review, not self-referential inputs

full rationale

The paper describes generating candidates via a fixed set of automatic rules, partitioning them by independent metrics into auto-accepted versus manual-review buckets, then applying external structured guidelines and an annotation tool for validation. No step defines the final labels in terms of the dataset itself, fits parameters to the output data and renames them as predictions, or relies on self-citations whose content reduces to the present claims. The process is anchored to checks external to the dataset (rules, metrics, human review) and produces SACS as an output rather than presupposing it.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about the reliability of automatic rules and metrics for code smell candidate generation; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption A set of automatic generation rules can produce candidate smelly code samples
    Invoked when describing the first step of the semi-automatic approach.
  • domain assumption Multiple metrics can group samples into automatically accepted and manually reviewed categories
    Used to justify focusing reviewer effort on ambiguous samples.

pith-pipeline@v0.9.0 · 5564 in / 1251 out tokens · 27329 ms · 2026-05-15T22:15:47.904914+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    It refers to optimizing the software internal structure without changing its external behavior

    INTRODUCTION Software refactoring has been playing an important role in the software lifecycle, helping to improve the maintainability and readability of software. It refers to optimizing the software's internal structure without changing its external behavior. During the software refactoring process, Code Smell has always been an important topic, which indic...

  2. [2]

    The most commonly used dataset construction approach is the manual approach

    RELATED DATASET In this section, we will present several code smell datasets mainly focusing on the dataset generation approach proposed in previous code smell-related studies. The most commonly used dataset construction approach is the manual approach. A representative example is the MLCQ dataset proposed by Madeyski [4]. It contains approximately 15,000...

  3. [3]

    Overview The overall workflow of our approach is illustrated in Figure

    SEMI-AUTOMATIC GENERATION 3.1. Overview The overall workflow of our approach is illustrated in Figure

  4. [4]

    Next, in the first step, smelly software entities are intentionally created from the code corpus by pre-defined generation rules

    Prior to the dataset generation, we collect several open-source projects as code corpus. Next, in the first step, smelly software entities are intentionally created from the code corpus by pre-defined generation rules. Subsequently, a set of rules is applied to group both the automatically generated samples and the original code samples into two groups: ...

  5. [5]

    In this case, the two methods exhibit a caller-callee relationship and can be merged by copying all statements from the callee method print_ary into the caller method main

    In Pattern 1, the method print_ary is invoked at line S7 of the main method. In this case, the two methods exhibit a caller-callee relationship and can be merged by copying all statements from the callee method print_ary into the caller method main. The parameter a in print_ary is replaced with the corresponding variable result after the merge operation. ...
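The caller-callee merge quoted above can be sketched as a naive, purely textual inlining step. The actual generation presumably operates on Java ASTs; this function and its parameters are illustrative only.

```python
def inline_call(caller_stmts, call_idx, callee_stmts, param, argument):
    """Replace the call statement at call_idx with the callee's body,
    renaming the callee's parameter to the caller's argument.

    Note: plain string replacement is deliberately naive; a real tool would
    rename identifiers on the AST to avoid accidental substring matches.
    """
    body = [s.replace(param, argument) for s in callee_stmts]
    return caller_stmts[:call_idx] + body + caller_stmts[call_idx + 1:]
```

Merging the callee into the caller lengthens the caller, which is how this pattern manufactures a synthetic Long Method sample.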

  6. [6]

    OPEN-SOURCE DATASET: SACS Following the above dataset generation approach, we created a large-scale open-source code smell dataset. We collected 16 open-source java projects as the code corpus including: JEdit [10], RxJava [11], Junit4 [12], Mybatis3 [13], Netty [14], Gephi [15], Plantuml [16], Groot [17], MusicBot [18], Traccar [19], Jgrapht [20], L...

  7. [7]

    The approach first applies several patterns of rules to automatically generate candidate positive samples

    CONCLUSION This study proposed a novel semi-automatic approach for generating a large-scale, high-quality code smell dataset. The approach first applies several patterns of rules to automatically generate candidate positive samples. Subsequently, the samples are divided into automatically accepted and manually reviewed groups based on predefined metric...

  8. [8]

    and Opdyke, W.: Refactoring: Improving the Design of Existing Code, Reading, USA: Addison Wesley Professional (1999)

    Fowler, M., Beck, K., Brant, J. and Opdyke, W.: Refactoring: Improving the Design of Existing Code, Reading, USA: Addison-Wesley Professional (1999)

  9. [9]

    Automatic software refactoring: a systematic literature review,

    A. B. Baqais Abdulrahman and A. Mohammad, "Automatic software refactoring: a systematic literature review," Software Quality Journal, vol. 28, (2), pp. 459-502, 2020

  10. [10]

    A survey on software smells,

    T. Sharma and D. Spinellis, “A survey on software smells,” Journal of Systems and Software, vol.138, pp.158–173, 2018

  11. [11]

    MLCQ: Industry-Relevant Code Smell Data Set,

    L. Madeyski and T. Lewowski, "MLCQ: Industry-Relevant Code Smell Data Set," In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE '20), pp. 342-347, 2020

  12. [12]

    Deep Learning Based Code Smell Detection,

    H. Liu, J. Jin, Z. Xu, Y. Zou, Y. Bu and L. Zhang, “Deep Learning Based Code Smell Detection,” in IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1811-1837, 1 Sept. 2021

  13. [13]

    Long Method Detection Using Graph Convolutional Networks,

    H.Y. Zhang, T. Kishi, “Long Method Detection Using Graph Convolutional Networks,” Journal of Information Processing, vol. 31, pp. 469-477, 2023

  14. [14]

    Large Class Detection Using GNNs: A Graph Based Deep Learning Approach Utilizing Three Typical GNN Model Architectures,

    H.Y. Zhang, T. Kishi, “Large Class Detection Using GNNs: A Graph Based Deep Learning Approach Utilizing Three Typical GNN Model Architectures,” IEICE Transactions on Information and Systems, vol. E107.D, Issue 9, pp. 1140-1150, 2024

  15. [15]

    Comparing and experimenting machine learning techniques for code smell detection,

    F.A. Fontana, M.V. Mäntylä, M. Zanoni, A. Marino. “Comparing and experimenting machine learning techniques for code smell detection,” Empirical Software Engineering, vol. 21, pp. 1143-1191, 2016

  16. [16]

    Lanza, R

    M. Lanza, R. Marinescu, and S. Ducasse. Object-Oriented Metrics in Practice. Springer-Verlag, Berlin, Heidelberg (2005)

  17. [17]

    http://jedit.org/

    JEdit. http://jedit.org/

  18. [18]

    https://github.com/ReactiveX/RxJava

    RxJava. https://github.com/ReactiveX/RxJava

  19. [19]

    https://github.com/junit-team/junit4

    Junit4. https://github.com/junit-team/junit4

  20. [20]

    https://github.com/mybatis/mybatis-3

    Mybatis3. https://github.com/mybatis/mybatis-3

  21. [21]

    https://github.com/netty/netty

    Netty. https://github.com/netty/netty

  22. [22]

    https://github.com/gephi/gephi

    Gephi. https://github.com/gephi/gephi

  23. [23]

    https://github.com/plantuml/plantuml

    Plantuml. https://github.com/plantuml/plantuml

  24. [24]

    https://github.com/gavalian/groot

    Groot. https://github.com/gavalian/groot

  25. [25]

    https://github.com/jagrosh/MusicBot

    MusicBot. https://github.com/jagrosh/MusicBot

  26. [26]

    https://github.com/traccar/traccar

    Traccar. https://github.com/traccar/traccar

  27. [27]

    https://jgrapht.org

    Jgrapht. https://jgrapht.org

  28. [28]

    https://github.com/libgdx/libgdx

    Libgdx. https://github.com/libgdx/libgdx

  29. [29]

    https://github.com/freeplane/freeplane

    Freeplane. https://github.com/freeplane/freeplane

  30. [30]

    https://github.com/graphhopper/jsprit

    Jsprit. https://github.com/graphhopper/jsprit

  31. [31]

    https://github.com/informatici/

    OpenHospital. https://github.com/informatici/

  32. [32]

    OpenRefine. https://github.com/OpenRefine