pith. machine review for the scientific record.

arxiv: 2602.15342 · v2 · submitted 2026-02-17 · 💻 cs.SE

Recognition: no theorem link

SACS: A Code Smell Dataset using Semi-automatic Generation Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code smell · dataset · semi-automatic generation · Long Method · Large Class · Feature Envy · software refactoring · machine learning

The pith

A semi-automatic approach creates an open-source dataset with over 10,000 labeled samples for each of three code smells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a semi-automatic method to build code smell datasets that balances scale and label reliability. It applies automatic rules to generate candidate samples, uses metrics to group them for automatic acceptance or manual review, and provides structured guidelines and a tool for the review process. The result is the SACS dataset covering Long Method, Large Class, and Feature Envy, each with more than 10,000 samples. Such a dataset addresses the shortage of reliable data for machine learning research on code smell detection and refactoring. A sympathetic reader would care because better datasets can lead to more accurate tools for improving software maintainability.

Core claim

The paper introduces a semi-automatic generation approach for code smell datasets. Candidate smelly samples are produced using automatic generation rules. Multiple metrics then group the samples into an automatically accepted group and a manually reviewed group. Structured review guidelines and an annotation tool support the manual validation. This process yielded the SACS dataset with over 10,000 labeled samples for Long Method, Large Class, and Feature Envy each.
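The pipeline described above (rule-based generation, then metric-based triage) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the metric names and thresholds below are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One rule-generated candidate sample (metrics are illustrative)."""
    name: str
    loc: int         # lines of code
    complexity: int  # e.g. cyclomatic complexity

def partition(candidates, loc_hi=120, cc_hi=15, loc_lo=60, cc_lo=8):
    """Split candidates into auto-accepted vs. manually reviewed groups.

    Samples whose metrics clearly exceed the smell thresholds are accepted
    automatically; borderline samples go to a human reviewer. The threshold
    values here are placeholders, not the paper's actual settings.
    """
    auto_accept, manual_review = [], []
    for c in candidates:
        if c.loc >= loc_hi and c.complexity >= cc_hi:
            auto_accept.append(c)       # unambiguously smelly
        elif c.loc >= loc_lo or c.complexity >= cc_lo:
            manual_review.append(c)     # ambiguous: needs a reviewer
        # otherwise: discarded as a failed generation
    return auto_accept, manual_review
```

The point of the triage is that reviewer effort is spent only on the ambiguous middle band, which is what lets the approach scale past 10,000 samples per smell.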

What carries the argument

semi-automatic generation approach that applies automatic rules then uses metrics to separate samples for targeted manual review

Load-bearing premise

Automatic generation rules combined with metric-based grouping produce candidate samples whose labels can be reliably validated through structured manual review.

What would settle it

A relabeling study in which independent experts, following the same guidelines, assign labels that conflict with the published ones on more than a few percent of a random sample of dataset instances.

Figures

Figures reproduced from arXiv: 2602.15342 by Hanyu Zhang, Tomoji Kishi.

Figure 4: Feature envy data sample generation. In the first pattern, we determine whether the class of the target method has a parent class. If so, we examine whether any unique fields of the current class are accessed by the target method. If no such fields are used, the method is considered a candidate for relocation to the parent class. In the second pattern, we identify related classes based on property usage. Fo…
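The first pattern in the caption reduces to a simple predicate. The sketch below is an illustration only; the paper's generator works on real Java code, and the function and its parameters are hypothetical.

```python
def pull_up_candidate(fields_used_by_method, fields_unique_to_class, has_parent_class):
    """Pattern 1 from Figure 4: a method whose class has a parent and which
    touches no field unique to its own class is a candidate for relocation
    to the parent, yielding a synthetic Feature Envy sample after the move.
    """
    if not has_parent_class:
        return False
    # Relocation is only possible if the method uses no class-unique field.
    return not (set(fields_used_by_method) & set(fields_unique_to_class))
```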
Original abstract

Code smell is a great challenge in software refactoring: it indicates latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smell has received extensive attention; in particular, studies applying machine learning techniques have become a popular topic. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, we explore a semi-automatic approach to generate a code smell dataset with high-quality data samples. Specifically, we first applied a set of automatic generation rules to produce candidate smelly samples. We then employed multiple metrics to group the data samples into an automatically accepted group and a manually reviewed group, enabling reviewers to concentrate their efforts on ambiguous samples. Furthermore, we established structured review guidelines and developed an annotation tool to support the manual validation process. Based on the proposed semi-automatic generation approach, we created an open-source code smell dataset, SACS, covering three widely studied code smells: Long Method, Large Class, and Feature Envy. Each code smell category includes over 10,000 labeled samples. This dataset provides a large-scale, publicly available benchmark to facilitate future studies on code smell detection and automated refactoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a semi-automatic method to construct the SACS dataset for three code smells (Long Method, Large Class, Feature Envy). Candidate samples are generated via automatic rules, partitioned by metrics into an auto-accepted group and a manually reviewed group, and the latter is validated using structured guidelines and a custom annotation tool. The resulting open-source dataset contains over 10,000 labeled samples per smell category and is positioned as a large-scale benchmark for machine-learning-based smell detection and refactoring research.

Significance. If the quality claims are substantiated, SACS would be a useful contribution by supplying a publicly available, large-scale resource that attempts to combine the scalability of rule-based generation with targeted human oversight. This could support more reliable training and benchmarking of ML models for code smell detection, directly addressing the scarcity of high-quality labeled data noted in the introduction.

major comments (1)
  1. [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.
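The agreement statistic the referee asks for is straightforward to compute once a second annotator relabels a sample. A minimal sketch of Cohen's kappa, assuming two annotators labeling the same instances:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    Returns 1.0 for perfect agreement and 0.0 for chance-level agreement.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and aligned")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled at random according to
    # their own marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: only one label class occurs
    return (observed - expected) / (1 - expected)
```

By convention, kappa above roughly 0.6 to 0.8 is read as substantial agreement; reporting it for a relabeled slice of the manually reviewed group would directly address this comment.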
minor comments (1)
  1. [Dataset description] The exact counts, class distributions, and source-project statistics for the >10,000 samples per category should be stated explicitly (rather than the rounded 'over 10,000' figure) to allow readers to judge balance and coverage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment below and indicate the changes planned for the revised version.

read point-by-point responses
  1. Referee: [Semi-automatic generation approach and dataset construction] The manuscript asserts that the semi-automatic approach produces 'high-quality data samples' and a 'high-quality code smell dataset,' yet supplies no quantitative validation of the manual review step. No inter-rater agreement statistics (e.g., Cohen's kappa), fraction of samples routed to manual review, disagreement-resolution procedure, or error analysis are reported. This information is required to assess whether the metric-based grouping plus structured review materially improves label reliability over purely automatic labeling; without it the central quality claim remains unsupported.

    Authors: We agree that quantitative details on the manual review step would strengthen the quality claims. The review was conducted by a single experienced annotator using the structured guidelines and annotation tool to maintain consistency, which is why inter-rater agreement statistics such as Cohen's kappa were not applicable or reported. In the revised manuscript we will add: (1) the exact fraction of candidate samples routed to the manual review group, (2) a description of the disagreement-resolution procedure (re-examination of borderline cases by the same annotator against the guidelines), and (3) a brief error analysis of samples initially flagged by metrics but rejected after review. These additions will allow readers to evaluate the improvement over purely automatic labeling.

    revision: partial

Circularity Check

0 steps flagged

No circularity: dataset labels derive from external rules and review, not self-referential inputs

full rationale

The paper describes generating candidates via a fixed set of automatic rules, partitioning them by independent metrics into auto-accepted versus manual-review buckets, then applying external structured guidelines and an annotation tool for validation. No step defines the final labels in terms of the dataset itself, fits parameters to the output data and renames them as predictions, or relies on self-citations whose content reduces to the present claims. The process is anchored to checks external to the dataset (rules, metrics, human review) and produces SACS as an output rather than presupposing it.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about the reliability of automatic rules and metrics for code smell candidate generation; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption A set of automatic generation rules can produce candidate smelly code samples
    Invoked when describing the first step of the semi-automatic approach.
  • domain assumption Multiple metrics can group samples into automatically accepted and manually reviewed categories
    Used to justify focusing reviewer effort on ambiguous samples.

pith-pipeline@v0.9.0 · 5564 in / 1251 out tokens · 27329 ms · 2026-05-15T22:15:47.904914+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    It refers to optimizing the software internal structure without changing its external behavior

    INTRODUCTION Software refactoring has been playing an important role in the software lifecycle, helping to improve the maintainability and readability of software. It refers to optimizing the software's internal structure without changing its external behavior. During the software refactoring process, Code Smell has always been an important topic, which indic...

  2. [2]

    The most commonly used dataset construction approach is the manual approach

    RELATED DATASET In this section, we will present several code smell datasets mainly focusing on the dataset generation approach proposed in previous code smell-related studies. The most commonly used dataset construction approach is the manual approach. A representative example is the MLCQ dataset proposed by Madeyski [4]. It contains approximately 15,000...

  3. [3]

    Overview The overall workflow of our approach is illustrated in Figure

    SEMI-AUTOMATIC GENERATION 3.1. Overview The overall workflow of our approach is illustrated in Figure

  4. [4]

    Next, in the first step, smelly software entities are intentionally created from the code corpus by pre-defined generation rules

    Prior to the dataset generation, we collect several open-source projects as code corpus. Next, in the first step, smelly software entities are intentionally created from the code corpus by pre-defined generation rules. Subsequently, a set of rules is applied to group both the automatically generated samples and the original code samples into two groups: ...

  5. [5]

    In this case, the two methods exhibit a caller-callee relationship and can be merged by copying all statements from the callee method print_ary into the caller method main

    In Pattern 1, the method print_ary is invoked at line S7 of the main method. In this case, the two methods exhibit a caller-callee relationship and can be merged by copying all statements from the callee method print_ary into the caller method main. The parameter a in print_ary is replaced with the corresponding variable result after the merge operation. ...
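The caller-callee merge quoted above can be sketched as a naive, purely textual inlining step. The actual generation presumably operates on Java ASTs; this function and its parameters are illustrative only.

```python
def inline_call(caller_stmts, call_idx, callee_stmts, param, argument):
    """Replace the call statement at call_idx with the callee's body,
    renaming the callee's parameter to the caller's argument.

    Note: plain string replacement is deliberately naive; a real tool would
    rename identifiers on the AST to avoid accidental substring matches.
    """
    body = [s.replace(param, argument) for s in callee_stmts]
    return caller_stmts[:call_idx] + body + caller_stmts[call_idx + 1:]
```

Merging the callee into the caller lengthens the caller, which is how this pattern manufactures a synthetic Long Method sample.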

  6. [6]

    OPEN-SOURCE DATASET: SACS Following the above dataset generation approach, we created a large-scale open-source code smell dataset. We collected 16 open-source java projects as the code corpus including: JEdit [10], RxJava [11], Junit4 [12], Mybatis3 [13], Netty [14], Gephi [15], Plantuml [16], Groot [17], MusicBot [18], Traccar [19], Jgrapht [20], L...

  7. [7]

    The approach first applies several patterns of rules to automatically generate candidate positive samples

    CONCLUSION This study proposed a novel semi-automatic approach for generating a large-scale, high-quality code smell dataset. The approach first applies several patterns of rules to automatically generate candidate positive samples. Subsequently, the samples are divided into automatically accepted and manually reviewed groups based on predefined metric...

  8. [8]

    and Opdyke, W.: Refactoring: Improving the Design of Existing Code, Reading, USA: Addison Wesley Professional (1999)

    Fowler, M., Beck, K., Brant, J. and Opdyke, W.: Refactoring: Improving the Design of Existing Code, Reading, USA: Addison-Wesley Professional (1999)

  9. [9]

    Automatic software refactoring: a systematic literature review,

    A. B. Baqais Abdulrahman and A. Mohammad, "Automatic software refactoring: a systematic literature review," Software Quality Journal, vol. 28, (2), pp. 459-502, 2020

  10. [10]

    A survey on software smells,

    T. Sharma and D. Spinellis, “A survey on software smells,” Journal of Systems and Software, vol.138, pp.158–173, 2018

  11. [11]

    MLCQ: Industry-Relevant Code Smell Data Set,

    L. Madeyski and T. Lewowski, "MLCQ: Industry-Relevant Code Smell Data Set," In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE '20), pp. 342-347, 2020

  12. [12]

    Deep Learning Based Code Smell Detection,

    H. Liu, J. Jin, Z. Xu, Y. Zou, Y. Bu and L. Zhang, “Deep Learning Based Code Smell Detection,” in IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1811-1837, 1 Sept. 2021

  13. [13]

    Long Method Detection Using Graph Convolutional Networks,

    H.Y. Zhang, T. Kishi, “Long Method Detection Using Graph Convolutional Networks,” Journal of Information Processing, vol. 31, pp. 469-477, 2023

  14. [14]

    Large Class Detection Using GNNs: A Graph Based Deep Learning Approach Utilizing Three Typical GNN Model Architectures,

    H.Y. Zhang, T. Kishi, “Large Class Detection Using GNNs: A Graph Based Deep Learning Approach Utilizing Three Typical GNN Model Architectures,” IEICE Transactions on Information and Systems, vol. E107.D, Issue 9, pp. 1140-1150, 2024

  15. [15]

    Comparing and experimenting machine learning techniques for code smell detection,

    F.A. Fontana, M.V. Mäntylä, M. Zanoni, A. Marino. “Comparing and experimenting machine learning techniques for code smell detection,” Empirical Software Engineering, vol. 21, pp. 1143-1191, 2016

  16. [16]

    Lanza, R

    M. Lanza, R. Marinescu, and S. Ducasse. Object-Oriented Metrics in Practice. Springer-Verlag, Berlin, Heidelberg (2005)

  17. [17]

    http://jedit.org/

    JEdit. http://jedit.org/

  18. [18]

    https://github.com/ReactiveX/RxJava

    RxJava. https://github.com/ReactiveX/RxJava

  19. [19]

    https://github.com/junit-team/junit4

    Junit4. https://github.com/junit-team/junit4

  20. [20]

    https://github.com/mybatis/mybatis-3

    Mybatis3. https://github.com/mybatis/mybatis-3

  21. [21]

    https://github.com/netty/netty

    Netty. https://github.com/netty/netty

  22. [22]

    https://github.com/gephi/gephi

    Gephi. https://github.com/gephi/gephi

  23. [23]

    https://github.com/plantuml/plantuml

    Plantuml. https://github.com/plantuml/plantuml

  24. [24]

    https://github.com/gavalian/groot

    Groot. https://github.com/gavalian/groot

  25. [25]

    https://github.com/jagrosh/MusicBot

    MusicBot. https://github.com/jagrosh/MusicBot

  26. [26]

    https://github.com/traccar/traccar

    Traccar. https://github.com/traccar/traccar

  27. [27]

    https://jgrapht.org

    Jgrapht. https://jgrapht.org

  28. [28]

    https://github.com/libgdx/libgdx

    Libgdx. https://github.com/libgdx/libgdx

  29. [29]

    https://github.com/freeplane/freeplane

    Freeplane. https://github.com/freeplane/freeplane

  30. [30]

    https://github.com/graphhopper/jsprit

    Jsprit. https://github.com/graphhopper/jsprit

  31. [31]

    https://github.com/informatici/

    OpenHospital. https://github.com/informatici/

  32. [32]

    OpenRefine. https://github.com/OpenRefine