pith. sign in

arxiv: 2606.05499 · v1 · pith:4AHAS24Anew · submitted 2026-06-03 · 💻 cs.SE

STMutants: A Mutation Testing Dataset for Structured Text Programs in Industrial Automation

Pith reviewed 2026-06-28 04:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords mutation testingstructured textPLC softwareindustrial automationtest benchmarkIEC 61131-3software fault injectionLLM evaluation
0
0 comments X

The pith

STMutants supplies the first public collection of 108 mutants from 11 Structured Text programs to support reproducible testing research for PLC software.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STMutants to close the absence of any shared mutation testing benchmark for Structured Text programs written for Programmable Logic Controllers. These programs control real-time equipment in safety-critical settings where undetected faults risk damage, lost production, or unsafe states. The dataset draws 11 programs from the OSCAT library and similar sources, applies seven adapted mutation operator categories, and retains 108 mutants after compilability checks and equivalence screening that reached kappa agreement of 0.87. The authors illustrate the dataset's value by running three large language models through test-suite generation followed by mutant kill prediction, yielding accuracies of 86.1 percent, 94.4 percent, and 86.1 percent with statistically significant differences. The resulting resource is positioned to make future experiments on automated test generation, mutation analysis, fault localization, and AI-supported quality assurance for industrial ST code directly comparable.

Core claim

STMutants contains 110 first-order mutants generated from 11 ST programs, of which 108 survive observability and equivalence screening. The mutants are produced by seven operator categories adapted for the PLC domain: value, relational, arithmetic, logical, negation, operation insertion/omission, and initialization faults. Construction proceeds through four phases of fault-type profiling, syntactic transformation, compilability verification, and manual equivalence screening. When the dataset is used to evaluate three large language models on test generation and mutation kill prediction, the models reach detection accuracies of 86.1 percent, 94.4 percent, and 86.1 percent.

What carries the argument

The STMutants dataset of 108 screened first-order mutants derived from 11 industrially relevant ST programs via a four-phase methodology that includes inter-rater equivalence screening.

If this is right

  • Researchers gain a shared starting point for experiments that measure how well test suites detect faults in ST programs.
  • Direct comparisons become possible across different automated test generation methods applied to the same set of mutants.
  • The dataset supplies concrete examples for studying fault localization techniques on PLC code.
  • Evaluations of large language models or other AI tools for quality assurance can use the same mutants and reach reproducible conclusions.
  • Mutation scores derived from the dataset can serve as a standardized effectiveness metric for industrial automation testing tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be paired with hardware-in-the-loop simulators to check whether killed mutants correspond to observable control failures on actual PLC hardware.
  • Extending the set with programs that include timing and real-time constraints would test whether the current operators cover faults unique to cyclic execution.
  • The high LLM prediction accuracies open the possibility of using similar models to prioritize which mutants to retain when scaling the dataset to hundreds of programs.
  • Adoption by standards bodies could turn mutation adequacy into an auditable criterion for safety-critical control software.

Load-bearing premise

The 11 chosen programs together with the seven adapted mutation operator categories produce a representative sample of the faults that actually occur in deployed industrial ST code.

What would settle it

A field study that logs real faults in production PLC systems and finds that the mutants in STMutants rarely match those observed faults would show the dataset does not capture representative errors.

Figures

Figures reproduced from arXiv: 2606.05499 by Helen H. Lou, Md Humaun Kabir, Md Rakibul Islam.

Figure 1
Figure 1. Figure 1: Annotated first-order mutants applied to the [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Standardized natural-language prompt used in Phase A (test suite gen [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Mutation testing is widely used to evaluate test-suite effectiveness, yet IEC 61131-3 Structured Text (ST) programs still lack a publicly available benchmark that supports reproducible mutation-based research. This gap is especially important because ST is extensively used in Programmable Logic Controllers (PLCs) that operate in real-time, safety-critical industrial environments, where software faults may cause equipment damage, production loss, or unsafe system behavior. To address this need, we present STMutants, a curated mutation testing dataset for industrial automation software. STMutants contains 110 generated first-order mutants derived from 11 ST programs collected from the OSCAT basic library and industrially relevant sources, of which 108 are retained after observability and equivalence screening. The dataset covers seven mutation operator categories adapted from classical taxonomies for the PLC domain, including value, relational, arithmetic, logical, negation, operation insertion/omission, and initialization faults. Each mutant is constructed through a four-phase methodology: fault-type profiling and operator selection, syntactic transformation, compilability verification, and manual equivalence screening with strong inter-rater agreement (kappa = 0.87). To demonstrate the usefulness of the dataset, we evaluate three large language models (LLMs) in a two-phase setting: test-suite generation followed by mutation kill/survive prediction. Across 108 retained mutants, the models achieve mutation detection accuracies of 86.1%, 94.4%, and 86.1%, respectively, with statistical analysis confirming significant performance differences. By providing the first publicly available mutation benchmark for ST programs, STMutants enables reproducible research on automated test generation, mutation analysis, fault localization, and AI-assisted quality assurance for PLC software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces STMutants, the first publicly available mutation testing dataset for IEC 61131-3 Structured Text (ST) programs in industrial automation. It describes a four-phase methodology to generate and curate 108 first-order mutants (from 110 initially generated) across 11 programs sourced from the OSCAT library and other industrially relevant sources, using seven adapted mutation operator categories (value, relational, arithmetic, logical, negation, insertion/omission, initialization). The curation includes compilability verification and manual equivalence screening with reported Cohen's kappa of 0.87. Utility is demonstrated via experiments applying three LLMs to test-suite generation followed by mutation kill/survive prediction, yielding accuracies of 86.1%, 94.4%, and 86.1% with statistical confirmation of performance differences.

Significance. If the mutants are representative of industrial ST faults, the dataset would provide a valuable, reproducible benchmark that fills a clear gap in mutation testing for safety-critical PLC software. Strengths include the explicit four-phase methodology, the high inter-rater agreement (kappa = 0.87), the statistical analysis of LLM results, and the public release of the dataset, all of which directly support the goal of enabling reproducible research on test generation, mutation analysis, fault localization, and AI-assisted QA.

major comments (1)
  1. [Abstract] Abstract (and the fault-type profiling step described therein): the claim that the 11 programs and seven operator categories yield a representative sample of faults occurring in real industrial ST code rests on 'fault-type profiling' from 'industrially relevant sources,' yet no quantitative comparison to a broader corpus, no fault-frequency statistics from deployed PLC systems, and no external validation of the profiling step are reported. This directly affects the generalizability of the LLM kill-rate results and the assertion that STMutants enables research applicable to safety-critical industrial environments.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the dataset's potential value and for the constructive feedback on the abstract. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and the fault-type profiling step described therein): the claim that the 11 programs and seven operator categories yield a representative sample of faults occurring in real industrial ST code rests on 'fault-type profiling' from 'industrially relevant sources,' yet no quantitative comparison to a broader corpus, no fault-frequency statistics from deployed PLC systems, and no external validation of the profiling step are reported. This directly affects the generalizability of the LLM kill-rate results and the assertion that STMutants enables research applicable to safety-critical industrial environments.

    Authors: We agree that the manuscript does not include quantitative comparisons to a broader corpus, fault-frequency statistics from deployed PLC systems, or external validation of the profiling step. The 11 programs were drawn from the OSCAT library (widely adopted in industrial automation) and other relevant sources, with the seven operator categories adapted from classical mutation taxonomies to the PLC domain; the four-phase methodology is described in Section 3. However, this selection process does not constitute a statistically representative sample of all industrial ST faults. We will revise the abstract and introduction to qualify the language, removing any implication of broad representativeness and instead describing STMutants as a curated benchmark from industrially relevant sources. This revision will also clarify the scope of the reported LLM accuracies (86.1–94.4%). revision: yes

Circularity Check

0 steps flagged

Dataset curation paper exhibits no circularity in its construction methodology

full rationale

The paper presents STMutants as a curated collection of 108 mutants from 11 externally sourced ST programs, constructed via an explicit four-phase process of fault-type profiling, syntactic transformation, compilability verification, and manual equivalence screening (with reported kappa=0.87). No mathematical derivations, fitted parameters, predictions of related quantities, or self-citations appear in the load-bearing steps. The LLM experiments function only as a usage demonstration, not as justification for the dataset itself. The central claim of providing the first public benchmark therefore rests on external sourcing and manual curation rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the standard domain assumption that mutation testing is a useful proxy for test-suite effectiveness and that the chosen programs and operators are representative.

axioms (1)
  • domain assumption Mutation testing is a valid method to evaluate test-suite effectiveness
    The entire contribution presupposes this established premise in software engineering.

pith-pipeline@v0.9.1-grok · 5849 in / 1148 out tokens · 43416 ms · 2026-06-28T04:48:14.111681+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 1 canonical work pages

  1. [1]

    An investigation of the therac-25 accidents,

    N. Leveson and C. Turner, “An investigation of the therac-25 accidents,” Computer, vol. 26, no. 7, pp. 18–41, 1993

  2. [2]

    Hints on test data selection: Help for the practicing programmer,

    R. DeMillo, R. Lipton, and F. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

  3. [3]

    DBSecQA: A curated dataset of developer discussions on database security from stack exchange,

    M. Islam, F. Kamal, M. Kabir, and M. Sharif, “DBSecQA: A curated dataset of developer discussions on database security from stack exchange,” inMSR, 2026, pp. 613–617

  4. [4]

    Testing programs with the aid of a compiler,

    R. Hamlet, “Testing programs with the aid of a compiler,”IEEE Transactions on Software Engineering, vol. 3, no. 4, pp. 279–290, 1977

  5. [5]

    Theoretical and empirical studies on using program mutation to test the functional correctness of programs,

    T. Budd, R. DeMillo, R. Lipton, and F. Sayward, “Theoretical and empirical studies on using program mutation to test the functional correctness of programs,” inSPPL, 1980, pp. 220–233

  6. [6]

    Investigations of the software testing coupling effect,

    A. J. Offutt, “Investigations of the software testing coupling effect,”ACM Transactions on Software Engineering and Methodology, vol. 1, no. 1, pp. 5–20, 1992

  7. [7]

    Is mutation an appropriate tool for testing experiments?

    J. H. Andrews, L. C. Briand, and Y . Labiche, “Is mutation an appropriate tool for testing experiments?” inICSE, 2005, pp. 402–411

  8. [8]

    Are mutants a valid substitute for real faults in software testing?

    R. Just, D. Jalali, L. Inozemtseva, M. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” in FSE, 2014, pp. 654–665

  9. [9]

    Pit: A practical mutation testing tool for java,

    H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: A practical mutation testing tool for java,” inISSTA, 2016, pp. 449–452

  10. [10]

    Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,

    M. R. Islam, M. H. Kabir, and A. I. Sifat, “Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,” in TPS-ISA, 2025, pp. 226–236

  11. [11]

    STMutants: A dataset of mutants of Structured Text Programs,

    Dataset, “STMutants: A dataset of mutants of Structured Text Programs,” https://doi.org/10.6084/m9.figshare.31724761, 2025, accessed: March 2026

  12. [12]

    An experimental determi- nation of sufficient mutant operators,

    A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental determi- nation of sufficient mutant operators,”ACM Transactions on Software Engineering and Methodology, vol. 5, no. 2, pp. 99–118, 1996

  13. [13]

    Design of mutant operators for the c programming language,

    H. Agrawal, R. A. DeMillo, B. Hathaway, W. Hsu, W. Hsu, E. W. Krauser, and E. Spafford, “Design of mutant operators for the c programming language,” Software Engineering Research Center, Tech. Rep. SERC- TR-41-P, 1989

  14. [14]

    Automated control logic test case generation using large language models,

    H. Koziolek, V . Ashiwal, S. Bandyopadhyay, and C. K. R., “Automated control logic test case generation using large language models,” arXiv preprint arXiv:2405.01874, 2024

  15. [15]

    Mujava: An automated class mutation system,

    Y . Ma, J. Offutt, and Y . Kwon, “Mujava: An automated class mutation system,”Software Testing, Verification and Reliability, vol. 15, no. 2, pp. 97–133, 2005

  16. [16]

    An analysis and survey of the development of mutation testing,

    Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

  17. [17]

    Analyzing software requirements errors in safety-critical, embedded systems,

    R. Lutz, “Analyzing software requirements errors in safety-critical, embedded systems,” inRE, 1993, pp. 126–133

  18. [18]

    Fault injection for dependability validation: A methodology and some applications,

    J. Arlat, M. Aguera, L. Amat, Y . Crouzet, J. C. Fabre, J. C. Laprie, and D. Powell, “Fault injection for dependability validation: A methodology and some applications,”IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 166–182, 1990

  19. [19]

    Formal methods in plc programming,

    G. Frey and L. Litz, “Formal methods in plc programming,” inIEEE SMC, vol. 4, 2000, pp. 2431–2436

  20. [20]

    [Online]

    (2025) Matiec. [Online]. Available: https://github.com/nucleron/matiec

  21. [21]

    The measurement of observer agreement for categorical data,

    J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,”Biometrics, vol. 33, no. 1, pp. 159–174, 1977

  22. [22]

    Robust or overfitted? investigating the generalization of pretrained models in requirement classification,

    F. Kamal and M. R. Islam, “Robust or overfitted? investigating the generalization of pretrained models in requirement classification,” in ESEM, 2025, pp. 414–420

  23. [23]

    GPT-5.2: Technical report,

    OpenAI, “GPT-5.2: Technical report,” https://openai.com/index/ introducing-gpt-5-2/, 2025, accessed: March 2026

  24. [24]

    Gemini 2.5 Flash: A multimodal model for fast and efficient inference,

    Google DeepMind, “Gemini 2.5 Flash: A multimodal model for fast and efficient inference,” https://ai.google.dev/gemini-api/docs/models/ gemini-2.5-pro, 2025, accessed: March 2026

  25. [25]

    Claude Sonnet 4.5: Model card and technical overview,

    Anthropic, “Claude Sonnet 4.5: Model card and technical overview,” https: //www.anthropic.com/news/claude-sonnet-4-5, 2025, accessed: March 2026

  26. [26]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplanet al., “Language models are few-shot learners,”NeurIPS, 2020

  27. [27]

    The comparison of percentages in matched samples,

    W. G. Cochran, “The comparison of percentages in matched samples,” Biometrika, vol. 37, no. 3–4, pp. 256–266, 1950

  28. [28]

    Individual comparisons by ranking methods,

    F. Wilcoxon, “Individual comparisons by ranking methods,”Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945

  29. [29]

    A simple sequentially rejective multiple test procedure,

    S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979

  30. [30]

    MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,

    R. Just, F. Schweiggert, and G. M. Kapfhammer, “MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,” inASE, 2011, pp. 612–615