PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

Soneya Binta Hossain; Tasfia Tasnim

arxiv: 2605.21930 · v1 · pith:QBOA5EG5new · submitted 2026-05-21 · 💻 cs.SE


Pith reviewed 2026-05-22 05:19 UTC · model grok-4.3

classification 💻 cs.SE

keywords: mutation testing · bug dataset generation · source-level reconstruction · Java · PIT · bug repair · software engineering · LLM training data


The pith

PITMuS reconstructs source-level edits from PIT bytecode mutants to automatically generate structured bug datasets with paired code, documentation, and metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PITMuS to solve the problem of creating fresh, context-rich bug artifacts needed for training and evaluating LLM-based software engineering tools. It starts from PIT's fast bytecode-level mutations and uses debug information in compiled Java class files to map each mutant back to the exact source code change. The tool then outputs structured records that include the buggy version, the fixed version, the method under test, documentation, and other metadata. These datasets can be produced from any Java system that integrates with PIT and help avoid the contamination risks that affect static benchmarks. A sympathetic reader would care because such on-demand datasets support more reliable development of bug localization, repair, and test generation methods.

Core claim

PITMuS combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although evaluated on eight open-source Java systems, the approach can be applied to any Java system where PIT can be integrated.
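The first input named above is the PIT XML report. As a minimal sketch of reading it, the snippet below assumes the report uses the element names commonly seen in PIT output (`mutation`, `sourceFile`, `mutatedClass`, `mutatedMethod`, `lineNumber`, `mutator`, `description`). The sample report, the function name `read_mutations`, and the output keys are illustrative assumptions, not PITMuS's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, written in the layout PIT's XML report is assumed to use.
SAMPLE_REPORT = """
<mutations>
  <mutation detected="true" status="KILLED">
    <sourceFile>Calculator.java</sourceFile>
    <mutatedClass>com.example.Calculator</mutatedClass>
    <mutatedMethod>add</mutatedMethod>
    <lineNumber>12</lineNumber>
    <mutator>org.pitest.mutationtest.engine.gregor.mutators.MathMutator</mutator>
    <description>Replaced integer addition with subtraction</description>
  </mutation>
</mutations>
"""


def read_mutations(xml_text):
    """Return one dict per <mutation> element found in the report text."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for node in root.iter("mutation"):
        records.append({
            "source_file": node.findtext("sourceFile"),
            "class": node.findtext("mutatedClass"),
            "method": node.findtext("mutatedMethod"),
            "line": int(node.findtext("lineNumber")),
            # Keep only the short operator name, e.g. "MathMutator".
            "mutator": node.findtext("mutator").rsplit(".", 1)[-1],
            "description": node.findtext("description"),
            "status": node.get("status"),
        })
    return records


if __name__ == "__main__":
    for record in read_mutations(SAMPLE_REPORT):
        print(record)
```

Each resulting dict carries only what the report states (file, class, method, line, operator); the source edit itself still has to be recovered separately.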

What carries the argument

Source-level mutant reconstruction that maps each bytecode mutant from PIT back to its corresponding source edit using debug information in Java class files.
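To make the idea of mapping a mutant back to a source edit concrete, here is a deliberately simplified sketch for a single operator family. It is not the paper's algorithm, which this page does not spell out. It assumes the mutant is identified by a source line number and a description string, and it rewrites the line only when exactly one candidate operator appears on it. The `SWAPS` table and the function name `reconstruct` are hypothetical.

```python
import re

# Hypothetical mapping from a mutant description to a token swap.
SWAPS = {
    "Replaced integer addition with subtraction": ("+", "-"),
    "Replaced integer subtraction with addition": ("-", "+"),
}


def reconstruct(source, line_number, description):
    """Return the mutated source text, or None when the edit is ambiguous."""
    swap = SWAPS.get(description)
    if swap is None:
        return None  # operator not covered by this sketch
    old, new = swap
    lines = source.splitlines()
    target = lines[line_number - 1]  # line numbers are 1-based
    # Skip compound tokens such as ++, --, +=, -= and ->.
    pattern = re.compile(r"(?<![+\-])" + re.escape(old) + r"(?![+\-=>])")
    if len(pattern.findall(target)) != 1:
        return None  # zero or several candidates: line info alone is not enough
    lines[line_number - 1] = pattern.sub(new, target, count=1)
    return "\n".join(lines)


if __name__ == "__main__":
    original = "int add(int a, int b) {\n    return a + b;\n}"
    print(reconstruct(original, 2, "Replaced integer addition with subtraction"))
```

Returning `None` for lines with several candidate operators shows why a line number alone can underdetermine the edit, which is where finer-grained debug information would be needed.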

If this is right

Fresh datasets can be generated from current versions of Java systems rather than relying on fixed historical benchmarks.
The produced records directly support training and evaluation of automated bug localization and repair techniques.
Each dataset entry includes documentation context and metadata that enable documentation-driven automation.
The method scales to any Java system compatible with PIT integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analogous reconstruction pipelines could be built for mutation tools in languages other than Java.
Repeated application over successive system versions could yield evolving benchmarks that stay current with code changes.
The generated mutant metadata might be reused to improve test oracle generation or mutation-based test adequacy assessment.

Load-bearing premise

Debug information present in compiled Java class files is always sufficient and accurate enough to localize and reconstruct the precise source-level edit corresponding to each bytecode mutant without introducing mapping errors or omissions.
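Whether debug information is present can be checked before any reconstruction is attempted. The sketch below parses the text that `javap -l` prints for a class and collects each method's `LineNumberTable` entries, so that methods with no line information can be flagged. The sample output is hand-written for illustration, and the function name `line_tables` is an assumption, not part of PITMuS.

```python
import re

# Hand-written sample resembling `javap -l` output for a class compiled with debug info.
SAMPLE_JAVAP = """Compiled from "Calculator.java"
public class com.example.Calculator {
  public com.example.Calculator();
    LineNumberTable:
      line 3: 0

  public int add(int, int);
    LineNumberTable:
      line 12: 0
      line 13: 4
}
"""

LINE_ENTRY = re.compile(r"^\s*line (\d+): (\d+)\s*$")
MEMBER = re.compile(r"^ {2}\S.*\(.*\).*;$")  # method or constructor signature


def line_tables(javap_output):
    """Map each method signature to its (source line, bytecode offset) pairs."""
    tables = {}
    current = None
    for raw in javap_output.splitlines():
        if MEMBER.match(raw):
            current = raw.strip().rstrip(";")
            tables[current] = []
            continue
        entry = LINE_ENTRY.match(raw)
        if entry and current is not None:
            tables[current].append((int(entry.group(1)), int(entry.group(2))))
    return tables


if __name__ == "__main__":
    tables = line_tables(SAMPLE_JAVAP)
    missing = [name for name, rows in tables.items() if not rows]
    print(tables)
    print("methods without line info:", missing)
```

A method that maps to an empty list has no line table in the class file, so any mutant reported inside it could not be localized by this route.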

What would settle it

A concrete case in which the reconstructed source edit for a PIT mutant produces a different program behavior or fails to match the original bytecode change when the code is recompiled and re-mutated.
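The full experiment described above requires recompiling and re-mutating the code. A much cheaper pre-check, sketched here under the assumption that each mutant touches a single source line, is to confirm that the reconstructed pair differs only at the line the PIT report names. It is weaker than the behavioral test and cannot replace it. The function names are hypothetical.

```python
def changed_lines(orig, mut):
    """Return the 1-based numbers of lines that differ, or None if line counts differ."""
    a, b = orig.splitlines(), mut.splitlines()
    if len(a) != len(b):
        return None
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


def matches_report(orig, mut, reported_line):
    """True when the only changed line is the one named in the report."""
    return changed_lines(orig, mut) == [reported_line]


if __name__ == "__main__":
    original = "int add(int a, int b) {\n    return a + b;\n}"
    mutated = "int add(int a, int b) {\n    return a - b;\n}"
    print(matches_report(original, mutated, 2))
```

Pairs that fail this check are candidates for manual inspection; pairs that pass it may still differ from the bytecode mutant in behavior.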

Figures

Figures reproduced from arXiv: 2605.21930 by Soneya Binta Hossain, Tasfia Tasnim.

**Figure 2.** Overview of PITMuS workflow.

Excerpt from the surrounding paper text: "... and a Java JDK that provides javap. The target system must provide a PIT XML report, compiled .class files with debug metadata, and the original Java source files. 2.1 Automated Dataset Generation. Dataset generation turns each PIT report entry into a method-level record. For a Java system P with PIT report R_P, PITMuS defines the dataset as D_P = { ⟨orig_i, mut_i, doc_i, meta_i⟩ | …"
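The record tuple ⟨orig, mut, doc, meta⟩ mentioned in the figure text can be pictured as a small data structure. The sketch below is one possible shape, assuming string fields for the two code versions and the documentation, and a free-form dictionary for metadata. The class name `MutantRecord`, the metadata keys, and the JSON Lines serialization are illustrative choices, not the tool's documented output format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MutantRecord:
    orig: str   # original (fixed) method source
    mut: str    # mutated (buggy) method source
    doc: str    # documentation attached to the method, e.g. its Javadoc
    meta: dict = field(default_factory=dict)  # operator, line, status, ...


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


if __name__ == "__main__":
    record = MutantRecord(
        orig="return a + b;",
        mut="return a - b;",
        doc="Returns the sum of a and b.",
        meta={"mutator": "MathMutator", "line": 12, "status": "KILLED"},
    )
    print(to_jsonl([record]))
```

One record per line keeps the dataset easy to stream and to filter by metadata fields such as the operator or the kill status.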

Original abstract

LLM-based software engineering increasingly depends on executable, context-rich bug artifacts: paired correct and buggy code, methods under test (MUTs), documentation, and metadata. These artifacts support the training and evaluation of automated bug localization and repair techniques, testing and test oracle generation methods, and documentation-driven automation. Although curated benchmarks (e.g., Defects4J) remain valuable, they are static and increasingly vulnerable to contamination as code models are trained on large public corpora. A complementary strategy is to generate fresh, cutoff-aware datasets by selecting real system versions and injecting controlled bugs at the source level. Mutation testing is a natural basis for this strategy: it applies predefined mutation operators to programs and records whether the existing test suite detects each injected change. PIT is a state-of-the-practice mutation testing tool for Java that performs mutation at the bytecode level. This design makes mutation testing fast and practical, but PIT reports mutants primarily through XML, making them difficult to inspect, replay, or reuse as structured source-level dataset records. To address this gap, we present PITMuS, which combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although we evaluate PITMuS on eight open-source Java systems, it can be applied to any Java system where PITMuS can be integrated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PITMuS turns PIT bytecode mutants into source-level bug/fix pairs for Java dataset generation, but the reconstruction step lacks any reported accuracy numbers.


Hi, the core thing here is a tool that takes PIT's bytecode mutation output and maps it back to source edits using debug info from class files, then bundles the results into structured buggy/fixed pairs with metadata and docs. This directly targets the need for fresh, cutoff-aware bug datasets in Java to reduce contamination risks when training or evaluating LLM-based repair and localization models.

They ran the pipeline on eight open-source systems, which at least shows it integrates with real codebases without obvious breakage. The engineering step of combining PIT XML with line and variable debug tables is a targeted extension rather than a new theory, and it produces reusable records that could plug into downstream training workflows.

The soft spot is exactly what the stress test flags: no success rates, error counts, or manual checks on whether the source reconstruction actually matches the intended mutant edit. Debug info can be incomplete or affected by compiler choices, so without those numbers it's unclear how clean the output pairs really are for model training.

This is a tool paper for people who build or maintain bug datasets in software engineering research, especially anyone working on automated repair or testing who needs scalable fresh data. A reader focused on practical dataset construction would get the most out of the pipeline description. I would send it to peer review because the idea is grounded in a real workflow gap and the evaluation on multiple systems provides a starting point, even though reviewers will almost certainly ask for quantitative validation of the mapping accuracy.

Referee Report

2 major / 2 minor

Summary. The paper presents PITMuS, a tool that reconstructs source-level mutants from PIT's bytecode-level mutation testing outputs for Java by combining PIT XML metadata with debug information (line numbers, variable tables) from compiled class files. This enables automatic generation of structured datasets containing buggy/fixed code pairs, method-under-test context, documentation, and metadata, with evaluation reported on eight open-source Java systems; the approach is positioned as applicable to any Java project integrable with PIT.

Significance. If the reconstruction step is shown to be reliable, PITMuS would provide a practical, extensible method for producing fresh, version-specific bug datasets that mitigate contamination risks in static benchmarks such as Defects4J. This directly supports training and evaluation of automated repair, localization, testing, and documentation-driven techniques in software engineering.

major comments (2)

[Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.
[Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.

minor comments (2)

[Abstract] Abstract: The final sentence reads 'where PITMuS can be integrated' but the surrounding context indicates this should refer to the PIT mutation testing tool rather than the new tool itself.
[Evaluation] The manuscript would benefit from a small table or figure in the evaluation section showing example reconstructed source edits alongside the original PIT XML entries to illustrate the mapping process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions to strengthen the paper.


Referee: [Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.

Authors: We agree that the absence of quantitative fidelity metrics limits the strength of the central claim. The current evaluation demonstrates applicability across eight systems and the structure of the output datasets but does not report mapping success rates, failure counts, or manual verification results. In the revised manuscript we will add a dedicated subsection to the Evaluation section that includes these metrics, computed over the generated datasets, together with a description of the verification procedure. revision: yes
Referee: [Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.

Authors: The manuscript presents the reconstruction process at a conceptual level to emphasize the overall tool workflow. We acknowledge that greater detail is needed for reproducibility and to address determinism. In the revision we will augment the Approach section with pseudocode for the localization and inversion steps and add a discussion of edge-case handling, including incomplete debug information and compiler-induced bytecode changes, for each supported PIT operator. revision: yes

Circularity Check

0 steps flagged

Tool implementation paper with no derivation chain or circular elements


This is a software engineering tool paper describing PITMuS, which combines PIT XML output with standard Java class-file debug information (line numbers, variable tables) to reconstruct source-level mutants. The manuscript contains no equations, no fitted parameters, no predictions of quantities from subsets of data, and no load-bearing self-citations or uniqueness theorems. The central output is produced by an external pipeline operating on independent artifacts (PIT mutants and compiled class files), making the work self-contained against external benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions about mutation testing validity and the presence of debug information in Java class files; no free parameters or new invented entities are introduced.

axioms (1)

domain assumption PIT produces valid bytecode-level mutants that correspond to meaningful source changes when debug information is available.
Invoked when the reconstruction step is presented as feasible for any Java system where PIT can run.

pith-pipeline@v0.9.0 · 5807 in / 1216 out tokens · 46365 ms · 2026-05-22T05:19:54.366665+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] AssertLab. [n.d.]. PITMuS: PIT Mutations In the Source Code. YouTube demo, https://youtu.be/zgHkXnsgciw

[2] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. 1978. Hints on Test Data Selection: Help for the Practicing Programmer. Computer 11, 4 (1978), 34–41

[3] Soneya Binta Hossain and Matthew B. Dwyer. 2025. TOGLL: Correct and Strong Test Oracle Generation with LLMs. In 47th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 1475–1487. doi:10.1109/ICSE55347.2025.00098

[4] Soneya Binta Hossain, Matthew B. Dwyer, Sebastian G. Elbaum, and Anh Nguyen-Tuong. 2023. Measuring and Mitigating Gaps in Structural Testing. In 45th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 1712–1723. doi:10.1109/ICSE48619.2023.00147

[5] Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian G. Elbaum, and Willem Visser. 2023. Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 120–132. doi:1...

[6] Soneya Binta Hossain and Tasfia Tasnim. 2026. PITMuS: PIT Mutations In the Source Code. https://github.com/assert-lab/PITMuS

[7] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.ne...

[8] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testi...

[9] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2018. Mutation Testing Advances: An Analysis and Survey. doi:10.1016/bs.adcom.2018.03.015

[10] PIT Project. 2026. Real World Mutation Testing. Retrieved April 16, 2026 from https://pitest.org/

[11] C. Thunes. [n.d.]. javalang. PyPI project page, https://pypi.org/project/javalang/. Accessed April 16, 2026

[12] Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. 2024. Benchmarking Benchmark Leakage in Large Language Models. arXiv preprint arXiv:2404.18824 (2024). https://arxiv.org/abs/2404.18824

[13] Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. 2025. SWE-bench Goes Live! arXiv preprint arXiv:2505.23419 (2025). https://arxiv.org/abs/2505.23419
