STMutants: A Mutation Testing Dataset for Structured Text Programs in Industrial Automation

Helen H. Lou; Md Humaun Kabir; Md Rakibul Islam

arxiv: 2606.05499 · v1 · pith:4AHAS24Anew · submitted 2026-06-03 · 💻 cs.SE

STMutants: A Mutation Testing Dataset for Structured Text Programs in Industrial Automation

Md Humaun Kabir , Md Rakibul Islam , Helen H. Lou This is my paper

Pith reviewed 2026-06-28 04:48 UTC · model grok-4.3

classification 💻 cs.SE

keywords mutation testingstructured textPLC softwareindustrial automationtest benchmarkIEC 61131-3software fault injectionLLM evaluation

0 comments

The pith

STMutants supplies the first public collection of 108 mutants from 11 Structured Text programs to support reproducible testing research for PLC software.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STMutants to close the absence of any shared mutation testing benchmark for Structured Text programs written for Programmable Logic Controllers. These programs control real-time equipment in safety-critical settings where undetected faults risk damage, lost production, or unsafe states. The dataset draws 11 programs from the OSCAT library and similar sources, applies seven adapted mutation operator categories, and retains 108 mutants after compilability checks and equivalence screening that reached kappa agreement of 0.87. The authors illustrate the dataset's value by running three large language models through test-suite generation followed by mutant kill prediction, yielding accuracies of 86.1 percent, 94.4 percent, and 86.1 percent with statistically significant differences. The resulting resource is positioned to make future experiments on automated test generation, mutation analysis, fault localization, and AI-supported quality assurance for industrial ST code directly comparable.

Core claim

STMutants contains 110 first-order mutants generated from 11 ST programs, of which 108 survive observability and equivalence screening. The mutants are produced by seven operator categories adapted for the PLC domain: value, relational, arithmetic, logical, negation, operation insertion/omission, and initialization faults. Construction proceeds through four phases of fault-type profiling, syntactic transformation, compilability verification, and manual equivalence screening. When the dataset is used to evaluate three large language models on test generation and mutation kill prediction, the models reach detection accuracies of 86.1 percent, 94.4 percent, and 86.1 percent.

What carries the argument

The STMutants dataset of 108 screened first-order mutants derived from 11 industrially relevant ST programs via a four-phase methodology that includes inter-rater equivalence screening.

If this is right

Researchers gain a shared starting point for experiments that measure how well test suites detect faults in ST programs.
Direct comparisons become possible across different automated test generation methods applied to the same set of mutants.
The dataset supplies concrete examples for studying fault localization techniques on PLC code.
Evaluations of large language models or other AI tools for quality assurance can use the same mutants and reach reproducible conclusions.
Mutation scores derived from the dataset can serve as a standardized effectiveness metric for industrial automation testing tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be paired with hardware-in-the-loop simulators to check whether killed mutants correspond to observable control failures on actual PLC hardware.
Extending the set with programs that include timing and real-time constraints would test whether the current operators cover faults unique to cyclic execution.
The high LLM prediction accuracies open the possibility of using similar models to prioritize which mutants to retain when scaling the dataset to hundreds of programs.
Adoption by standards bodies could turn mutation adequacy into an auditable criterion for safety-critical control software.

Load-bearing premise

The 11 chosen programs together with the seven adapted mutation operator categories produce a representative sample of the faults that actually occur in deployed industrial ST code.

What would settle it

A field study that logs real faults in production PLC systems and finds that the mutants in STMutants rarely match those observed faults would show the dataset does not capture representative errors.

Figures

Figures reproduced from arXiv: 2606.05499 by Helen H. Lou, Md Humaun Kabir, Md Rakibul Islam.

**Figure 2.** Figure 2: Standardized natural-language prompt used in Phase A (test suite gen [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Mutation testing is widely used to evaluate test-suite effectiveness, yet IEC 61131-3 Structured Text (ST) programs still lack a publicly available benchmark that supports reproducible mutation-based research. This gap is especially important because ST is extensively used in Programmable Logic Controllers (PLCs) that operate in real-time, safety-critical industrial environments, where software faults may cause equipment damage, production loss, or unsafe system behavior. To address this need, we present STMutants, a curated mutation testing dataset for industrial automation software. STMutants contains 110 generated first-order mutants derived from 11 ST programs collected from the OSCAT basic library and industrially relevant sources, of which 108 are retained after observability and equivalence screening. The dataset covers seven mutation operator categories adapted from classical taxonomies for the PLC domain, including value, relational, arithmetic, logical, negation, operation insertion/omission, and initialization faults. Each mutant is constructed through a four-phase methodology: fault-type profiling and operator selection, syntactic transformation, compilability verification, and manual equivalence screening with strong inter-rater agreement (kappa = 0.87). To demonstrate the usefulness of the dataset, we evaluate three large language models (LLMs) in a two-phase setting: test-suite generation followed by mutation kill/survive prediction. Across 108 retained mutants, the models achieve mutation detection accuracies of 86.1%, 94.4%, and 86.1%, respectively, with statistical analysis confirming significant performance differences. By providing the first publicly available mutation benchmark for ST programs, STMutants enables reproducible research on automated test generation, mutation analysis, fault localization, and AI-assisted quality assurance for PLC software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STMutants is the first public mutation dataset for Structured Text PLC code, with solid curation details but unverified representativeness for real industrial faults.

read the letter

This paper's main contribution is releasing STMutants, the first public mutation testing dataset for Structured Text programs used in industrial PLCs. They started with 11 programs from the OSCAT library and similar sources, applied seven adapted operator categories, generated 110 first-order mutants, and kept 108 after observability and equivalence checks.

The curation process is handled well. The four-phase method is described step by step, including fault-type profiling, syntactic transformation, compilability verification, and manual screening that reached kappa 0.87. That agreement score and the domain-adapted operators give the work a practical foundation. The LLM experiments add a quick demonstration: three models reached 86 to 94 percent accuracy on kill/survive prediction, with statistical tests showing differences. This shows the dataset can support immediate follow-up work on test generation and AI-assisted verification.

The soft spot is the claim that these mutants represent faults that actually occur in deployed industrial ST code. The programs are drawn from a library and "industrially relevant sources," and the operators were selected after profiling, but the abstract gives no quantitative comparison to fault distributions from real systems or larger corpora. Timing, I/O, or state-machine errors could be under- or over-represented. If that is the case, results on LLM performance or test effectiveness will not generalize beyond this sample.

The paper is aimed at researchers working on automated testing, mutation analysis, or verification for industrial automation software. Anyone needing a starting benchmark for PLC code or LLM applications in that domain will get direct value from the released mutants and the documented construction process. It deserves a serious referee because the resource itself is concrete and usable, even if the selection criteria need more external validation.

I would send it to peer review. Reviewers can press on the representativeness question and help turn this into a stronger shared benchmark.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces STMutants, the first publicly available mutation testing dataset for IEC 61131-3 Structured Text (ST) programs in industrial automation. It describes a four-phase methodology to generate and curate 108 first-order mutants (from 110 initially generated) across 11 programs sourced from the OSCAT library and other industrially relevant sources, using seven adapted mutation operator categories (value, relational, arithmetic, logical, negation, insertion/omission, initialization). The curation includes compilability verification and manual equivalence screening with reported Cohen's kappa of 0.87. Utility is demonstrated via experiments applying three LLMs to test-suite generation followed by mutation kill/survive prediction, yielding accuracies of 86.1%, 94.4%, and 86.1% with statistical confirmation of performance differences.

Significance. If the mutants are representative of industrial ST faults, the dataset would provide a valuable, reproducible benchmark that fills a clear gap in mutation testing for safety-critical PLC software. Strengths include the explicit four-phase methodology, the high inter-rater agreement (kappa = 0.87), the statistical analysis of LLM results, and the public release of the dataset, all of which directly support the goal of enabling reproducible research on test generation, mutation analysis, fault localization, and AI-assisted QA.

major comments (1)

[Abstract] Abstract (and the fault-type profiling step described therein): the claim that the 11 programs and seven operator categories yield a representative sample of faults occurring in real industrial ST code rests on 'fault-type profiling' from 'industrially relevant sources,' yet no quantitative comparison to a broader corpus, no fault-frequency statistics from deployed PLC systems, and no external validation of the profiling step are reported. This directly affects the generalizability of the LLM kill-rate results and the assertion that STMutants enables research applicable to safety-critical industrial environments.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the dataset's potential value and for the constructive feedback on the abstract. We address the major comment point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract (and the fault-type profiling step described therein): the claim that the 11 programs and seven operator categories yield a representative sample of faults occurring in real industrial ST code rests on 'fault-type profiling' from 'industrially relevant sources,' yet no quantitative comparison to a broader corpus, no fault-frequency statistics from deployed PLC systems, and no external validation of the profiling step are reported. This directly affects the generalizability of the LLM kill-rate results and the assertion that STMutants enables research applicable to safety-critical industrial environments.

Authors: We agree that the manuscript does not include quantitative comparisons to a broader corpus, fault-frequency statistics from deployed PLC systems, or external validation of the profiling step. The 11 programs were drawn from the OSCAT library (widely adopted in industrial automation) and other relevant sources, with the seven operator categories adapted from classical mutation taxonomies to the PLC domain; the four-phase methodology is described in Section 3. However, this selection process does not constitute a statistically representative sample of all industrial ST faults. We will revise the abstract and introduction to qualify the language, removing any implication of broad representativeness and instead describing STMutants as a curated benchmark from industrially relevant sources. This revision will also clarify the scope of the reported LLM accuracies (86.1–94.4%). revision: yes

Circularity Check

0 steps flagged

Dataset curation paper exhibits no circularity in its construction methodology

full rationale

The paper presents STMutants as a curated collection of 108 mutants from 11 externally sourced ST programs, constructed via an explicit four-phase process of fault-type profiling, syntactic transformation, compilability verification, and manual equivalence screening (with reported kappa=0.87). No mathematical derivations, fitted parameters, predictions of related quantities, or self-citations appear in the load-bearing steps. The LLM experiments function only as a usage demonstration, not as justification for the dataset itself. The central claim of providing the first public benchmark therefore rests on external sourcing and manual curation rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the standard domain assumption that mutation testing is a useful proxy for test-suite effectiveness and that the chosen programs and operators are representative.

axioms (1)

domain assumption Mutation testing is a valid method to evaluate test-suite effectiveness
The entire contribution presupposes this established premise in software engineering.

pith-pipeline@v0.9.1-grok · 5849 in / 1148 out tokens · 43416 ms · 2026-06-28T04:48:14.111681+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 1 canonical work pages

[1]

An investigation of the therac-25 accidents,

N. Leveson and C. Turner, “An investigation of the therac-25 accidents,” Computer, vol. 26, no. 7, pp. 18–41, 1993

1993
[2]

Hints on test data selection: Help for the practicing programmer,

R. DeMillo, R. Lipton, and F. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

1978
[3]

DBSecQA: A curated dataset of developer discussions on database security from stack exchange,

M. Islam, F. Kamal, M. Kabir, and M. Sharif, “DBSecQA: A curated dataset of developer discussions on database security from stack exchange,” inMSR, 2026, pp. 613–617

2026
[4]

Testing programs with the aid of a compiler,

R. Hamlet, “Testing programs with the aid of a compiler,”IEEE Transactions on Software Engineering, vol. 3, no. 4, pp. 279–290, 1977

1977
[5]

Theoretical and empirical studies on using program mutation to test the functional correctness of programs,

T. Budd, R. DeMillo, R. Lipton, and F. Sayward, “Theoretical and empirical studies on using program mutation to test the functional correctness of programs,” inSPPL, 1980, pp. 220–233

1980
[6]

Investigations of the software testing coupling effect,

A. J. Offutt, “Investigations of the software testing coupling effect,”ACM Transactions on Software Engineering and Methodology, vol. 1, no. 1, pp. 5–20, 1992

1992
[7]

Is mutation an appropriate tool for testing experiments?

J. H. Andrews, L. C. Briand, and Y . Labiche, “Is mutation an appropriate tool for testing experiments?” inICSE, 2005, pp. 402–411

2005
[8]

Are mutants a valid substitute for real faults in software testing?

R. Just, D. Jalali, L. Inozemtseva, M. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” in FSE, 2014, pp. 654–665

2014
[9]

Pit: A practical mutation testing tool for java,

H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: A practical mutation testing tool for java,” inISSTA, 2016, pp. 449–452

2016
[10]

Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,

M. R. Islam, M. H. Kabir, and A. I. Sifat, “Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,” in TPS-ISA, 2025, pp. 226–236

2025
[11]

STMutants: A dataset of mutants of Structured Text Programs,

Dataset, “STMutants: A dataset of mutants of Structured Text Programs,” https://doi.org/10.6084/m9.figshare.31724761, 2025, accessed: March 2026

work page doi:10.6084/m9.figshare.31724761 2025
[12]

An experimental determi- nation of sufficient mutant operators,

A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental determi- nation of sufficient mutant operators,”ACM Transactions on Software Engineering and Methodology, vol. 5, no. 2, pp. 99–118, 1996

1996
[13]

Design of mutant operators for the c programming language,

H. Agrawal, R. A. DeMillo, B. Hathaway, W. Hsu, W. Hsu, E. W. Krauser, and E. Spafford, “Design of mutant operators for the c programming language,” Software Engineering Research Center, Tech. Rep. SERC- TR-41-P, 1989

1989
[14]

Automated control logic test case generation using large language models,

H. Koziolek, V . Ashiwal, S. Bandyopadhyay, and C. K. R., “Automated control logic test case generation using large language models,” arXiv preprint arXiv:2405.01874, 2024

arXiv 2024
[15]

Mujava: An automated class mutation system,

Y . Ma, J. Offutt, and Y . Kwon, “Mujava: An automated class mutation system,”Software Testing, Verification and Reliability, vol. 15, no. 2, pp. 97–133, 2005

2005
[16]

An analysis and survey of the development of mutation testing,

Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

2011
[17]

Analyzing software requirements errors in safety-critical, embedded systems,

R. Lutz, “Analyzing software requirements errors in safety-critical, embedded systems,” inRE, 1993, pp. 126–133

1993
[18]

Fault injection for dependability validation: A methodology and some applications,

J. Arlat, M. Aguera, L. Amat, Y . Crouzet, J. C. Fabre, J. C. Laprie, and D. Powell, “Fault injection for dependability validation: A methodology and some applications,”IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 166–182, 1990

1990
[19]

Formal methods in plc programming,

G. Frey and L. Litz, “Formal methods in plc programming,” inIEEE SMC, vol. 4, 2000, pp. 2431–2436

2000
[20]

[Online]

(2025) Matiec. [Online]. Available: https://github.com/nucleron/matiec

2025
[21]

The measurement of observer agreement for categorical data,

J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,”Biometrics, vol. 33, no. 1, pp. 159–174, 1977

1977
[22]

Robust or overfitted? investigating the generalization of pretrained models in requirement classification,

F. Kamal and M. R. Islam, “Robust or overfitted? investigating the generalization of pretrained models in requirement classification,” in ESEM, 2025, pp. 414–420

2025
[23]

GPT-5.2: Technical report,

OpenAI, “GPT-5.2: Technical report,” https://openai.com/index/ introducing-gpt-5-2/, 2025, accessed: March 2026

2025
[24]

Gemini 2.5 Flash: A multimodal model for fast and efficient inference,

Google DeepMind, “Gemini 2.5 Flash: A multimodal model for fast and efficient inference,” https://ai.google.dev/gemini-api/docs/models/ gemini-2.5-pro, 2025, accessed: March 2026

2025
[25]

Claude Sonnet 4.5: Model card and technical overview,

Anthropic, “Claude Sonnet 4.5: Model card and technical overview,” https: //www.anthropic.com/news/claude-sonnet-4-5, 2025, accessed: March 2026

2025
[26]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplanet al., “Language models are few-shot learners,”NeurIPS, 2020

2020
[27]

The comparison of percentages in matched samples,

W. G. Cochran, “The comparison of percentages in matched samples,” Biometrika, vol. 37, no. 3–4, pp. 256–266, 1950

1950
[28]

Individual comparisons by ranking methods,

F. Wilcoxon, “Individual comparisons by ranking methods,”Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945

1945
[29]

A simple sequentially rejective multiple test procedure,

S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979

1979
[30]

MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,

R. Just, F. Schweiggert, and G. M. Kapfhammer, “MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,” inASE, 2011, pp. 612–615

2011

[1] [1]

An investigation of the therac-25 accidents,

N. Leveson and C. Turner, “An investigation of the therac-25 accidents,” Computer, vol. 26, no. 7, pp. 18–41, 1993

1993

[2] [2]

Hints on test data selection: Help for the practicing programmer,

R. DeMillo, R. Lipton, and F. Sayward, “Hints on test data selection: Help for the practicing programmer,”Computer, vol. 11, no. 4, pp. 34–41, 1978

1978

[3] [3]

DBSecQA: A curated dataset of developer discussions on database security from stack exchange,

M. Islam, F. Kamal, M. Kabir, and M. Sharif, “DBSecQA: A curated dataset of developer discussions on database security from stack exchange,” inMSR, 2026, pp. 613–617

2026

[4] [4]

Testing programs with the aid of a compiler,

R. Hamlet, “Testing programs with the aid of a compiler,”IEEE Transactions on Software Engineering, vol. 3, no. 4, pp. 279–290, 1977

1977

[5] [5]

Theoretical and empirical studies on using program mutation to test the functional correctness of programs,

T. Budd, R. DeMillo, R. Lipton, and F. Sayward, “Theoretical and empirical studies on using program mutation to test the functional correctness of programs,” inSPPL, 1980, pp. 220–233

1980

[6] [6]

Investigations of the software testing coupling effect,

A. J. Offutt, “Investigations of the software testing coupling effect,”ACM Transactions on Software Engineering and Methodology, vol. 1, no. 1, pp. 5–20, 1992

1992

[7] [7]

Is mutation an appropriate tool for testing experiments?

J. H. Andrews, L. C. Briand, and Y . Labiche, “Is mutation an appropriate tool for testing experiments?” inICSE, 2005, pp. 402–411

2005

[8] [8]

Are mutants a valid substitute for real faults in software testing?

R. Just, D. Jalali, L. Inozemtseva, M. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” in FSE, 2014, pp. 654–665

2014

[9] [9]

Pit: A practical mutation testing tool for java,

H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: A practical mutation testing tool for java,” inISSTA, 2016, pp. 449–452

2016

[10] [10]

Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,

M. R. Islam, M. H. Kabir, and A. I. Sifat, “Decoding the decoders: An empirical study of reverse engineering questions on stack exchange,” in TPS-ISA, 2025, pp. 226–236

2025

[11] [11]

STMutants: A dataset of mutants of Structured Text Programs,

Dataset, “STMutants: A dataset of mutants of Structured Text Programs,” https://doi.org/10.6084/m9.figshare.31724761, 2025, accessed: March 2026

work page doi:10.6084/m9.figshare.31724761 2025

[12] [12]

An experimental determi- nation of sufficient mutant operators,

A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental determi- nation of sufficient mutant operators,”ACM Transactions on Software Engineering and Methodology, vol. 5, no. 2, pp. 99–118, 1996

1996

[13] [13]

Design of mutant operators for the c programming language,

H. Agrawal, R. A. DeMillo, B. Hathaway, W. Hsu, W. Hsu, E. W. Krauser, and E. Spafford, “Design of mutant operators for the c programming language,” Software Engineering Research Center, Tech. Rep. SERC- TR-41-P, 1989

1989

[14] [14]

Automated control logic test case generation using large language models,

H. Koziolek, V . Ashiwal, S. Bandyopadhyay, and C. K. R., “Automated control logic test case generation using large language models,” arXiv preprint arXiv:2405.01874, 2024

arXiv 2024

[15] [15]

Mujava: An automated class mutation system,

Y . Ma, J. Offutt, and Y . Kwon, “Mujava: An automated class mutation system,”Software Testing, Verification and Reliability, vol. 15, no. 2, pp. 97–133, 2005

2005

[16] [16]

An analysis and survey of the development of mutation testing,

Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011

2011

[17] [17]

Analyzing software requirements errors in safety-critical, embedded systems,

R. Lutz, “Analyzing software requirements errors in safety-critical, embedded systems,” inRE, 1993, pp. 126–133

1993

[18] [18]

Fault injection for dependability validation: A methodology and some applications,

J. Arlat, M. Aguera, L. Amat, Y . Crouzet, J. C. Fabre, J. C. Laprie, and D. Powell, “Fault injection for dependability validation: A methodology and some applications,”IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 166–182, 1990

1990

[19] [19]

Formal methods in plc programming,

G. Frey and L. Litz, “Formal methods in plc programming,” inIEEE SMC, vol. 4, 2000, pp. 2431–2436

2000

[20] [20]

[Online]

(2025) Matiec. [Online]. Available: https://github.com/nucleron/matiec

2025

[21] [21]

The measurement of observer agreement for categorical data,

J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,”Biometrics, vol. 33, no. 1, pp. 159–174, 1977

1977

[22] [22]

Robust or overfitted? investigating the generalization of pretrained models in requirement classification,

F. Kamal and M. R. Islam, “Robust or overfitted? investigating the generalization of pretrained models in requirement classification,” in ESEM, 2025, pp. 414–420

2025

[23] [23]

GPT-5.2: Technical report,

OpenAI, “GPT-5.2: Technical report,” https://openai.com/index/ introducing-gpt-5-2/, 2025, accessed: March 2026

2025

[24] [24]

Gemini 2.5 Flash: A multimodal model for fast and efficient inference,

Google DeepMind, “Gemini 2.5 Flash: A multimodal model for fast and efficient inference,” https://ai.google.dev/gemini-api/docs/models/ gemini-2.5-pro, 2025, accessed: March 2026

2025

[25] [25]

Claude Sonnet 4.5: Model card and technical overview,

Anthropic, “Claude Sonnet 4.5: Model card and technical overview,” https: //www.anthropic.com/news/claude-sonnet-4-5, 2025, accessed: March 2026

2025

[26] [26]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplanet al., “Language models are few-shot learners,”NeurIPS, 2020

2020

[27] [27]

The comparison of percentages in matched samples,

W. G. Cochran, “The comparison of percentages in matched samples,” Biometrika, vol. 37, no. 3–4, pp. 256–266, 1950

1950

[28] [28]

Individual comparisons by ranking methods,

F. Wilcoxon, “Individual comparisons by ranking methods,”Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945

1945

[29] [29]

A simple sequentially rejective multiple test procedure,

S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979

1979

[30] [30]

MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,

R. Just, F. Schweiggert, and G. M. Kapfhammer, “MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler,” inASE, 2011, pp. 612–615

2011