PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction
Pith reviewed 2026-05-22 05:19 UTC · model grok-4.3
The pith
PITMuS reconstructs source-level edits from PIT bytecode mutants to automatically generate structured bug datasets with paired code, documentation, and metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PITMuS combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although evaluated on eight open-source Java systems, the approach can be applied to any Java system where PIT can be integrated.
What carries the argument
Source-level mutant reconstruction that maps each bytecode mutant from PIT back to its corresponding source edit using debug information in Java class files.
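As a rough illustration of this mapping, the sketch below parses one mutation entry and applies the matching edit to the reported source line. The element names (`sourceFile`, `lineNumber`, `mutator`, `description`) follow the layout commonly seen in PIT's XML report and should be treated as an assumption here. The sample entry, the `Calc.java` source, the one-rule `RULES` table, and the `reconstruct` helper are all hypothetical and are not PITMuS's actual algorithm.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical PIT-style report with a single mutant (element names assumed).
PIT_XML = """<mutations>
  <mutation detected='true' status='KILLED'>
    <sourceFile>Calc.java</sourceFile>
    <mutatedClass>demo.Calc</mutatedClass>
    <mutatedMethod>add</mutatedMethod>
    <lineNumber>3</lineNumber>
    <mutator>org.pitest.mutationtest.engine.gregor.mutators.MathMutator</mutator>
    <description>Replaced integer addition with subtraction</description>
  </mutation>
</mutations>"""

SOURCE = """public class Calc {
    public int add(int a, int b) {
        return a + b;
    }
}"""

# Illustrative rule table: mutant description -> (original token, mutated token).
RULES = {
    "Replaced integer addition with subtraction": (" + ", " - "),
}


def reconstruct(source: str, mutation: ET.Element) -> Optional[str]:
    """Return the mutated source, or None if the edit cannot be localized."""
    line_no = int(mutation.findtext("lineNumber"))
    rule = RULES.get(mutation.findtext("description"))
    lines = source.splitlines()
    if rule is None or not 1 <= line_no <= len(lines):
        return None
    original, mutated = rule
    line = lines[line_no - 1]
    # Refuse ambiguous lines: with several candidates, the line number
    # alone does not say which operator the bytecode mutant changed.
    if line.count(original) != 1:
        return None
    lines[line_no - 1] = line.replace(original, mutated)
    return "\n".join(lines)


mutation = ET.fromstring(PIT_XML).find("mutation")
buggy = reconstruct(SOURCE, mutation)
print(buggy.splitlines()[2].strip())
```

Returning `None` for unknown operators and ambiguous lines, rather than guessing, keeps mapping failures visible so they can be counted instead of silently producing wrong pairs.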
If this is right
- Fresh datasets can be generated from current versions of Java systems rather than relying on fixed historical benchmarks.
- The produced records directly support training and evaluation of automated bug localization and repair techniques.
- Each dataset entry includes documentation context and metadata that enable documentation-driven automation.
- The method applies to any Java system into which PIT can be integrated.
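To make the shape of such a dataset entry concrete, here is a minimal sketch of one record serialized as a JSON line. Every field name and value is hypothetical; the paper's actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MutantRecord:
    # All field names are hypothetical; the paper's schema may differ.
    project: str
    source_file: str
    method: str
    line_number: int
    mutator: str
    status: str         # e.g. KILLED or SURVIVED, as reported by the mutation run
    fixed_code: str     # original (correct) code
    buggy_code: str     # reconstructed source-level mutant
    documentation: str  # e.g. the method's Javadoc


record = MutantRecord(
    project="demo",
    source_file="Calc.java",
    method="add",
    line_number=3,
    mutator="MathMutator",
    status="KILLED",
    fixed_code="return a + b;",
    buggy_code="return a - b;",
    documentation="Returns the sum of a and b.",
)

# One JSON object per line (JSON Lines) keeps large datasets streamable.
line = json.dumps(asdict(record), sort_keys=True)
restored = MutantRecord(**json.loads(line))
```

Keeping the fixed code, buggy code, documentation, and mutant metadata in one flat record means a downstream repair or localization task can consume each entry without joining against the PIT report again.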
Where Pith is reading between the lines
- Analogous reconstruction pipelines could be built for mutation tools in languages other than Java.
- Repeated application over successive system versions could yield evolving benchmarks that stay current with code changes.
- The generated mutant metadata might be reused to improve test oracle generation or mutation-based test adequacy assessment.
Load-bearing premise
Debug information present in compiled Java class files is always sufficient and accurate enough to localize and reconstruct the precise source-level edit corresponding to each bytecode mutant without introducing mapping errors or omissions.
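The premise can be probed with a small sketch of how a line number table resolves a bytecode offset to a source line. In the class-file format, each `LineNumberTable` entry pairs a starting bytecode offset with a source line; the table contents and the `line_for_pc` helper below are hypothetical, and the sketch assumes the entries are already sorted by offset.

```python
from bisect import bisect_right

# Hypothetical LineNumberTable for one method: (start_pc, source_line),
# assumed sorted by start_pc, as a class-file parser might expose it.
LINE_TABLE = [(0, 10), (4, 11), (15, 13)]


def line_for_pc(table, pc):
    """Map a bytecode offset to the source line whose range contains it."""
    starts = [start for start, _ in table]
    idx = bisect_right(starts, pc) - 1
    if idx < 0:
        return None  # offset precedes every recorded entry
    return table[idx][1]
```

In this table every offset from 4 through 14 resolves to line 11, so if that line holds two operators of the same kind, the table alone cannot say which one a mutant touched. That is the kind of case where the premise would need additional evidence beyond line-level debug information.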
What would settle it
A concrete case in which the reconstructed source edit for a PIT mutant produces a different program behavior or fails to match the original bytecode change when the code is recompiled and re-mutated.
Original abstract
LLM-based software engineering increasingly depends on executable, context-rich bug artifacts: paired correct and buggy code, methods under test (MUTs), documentation, and metadata. These artifacts support the training and evaluation of automated bug localization and repair techniques, testing and test oracle generation methods, and documentation-driven automation. Although curated benchmarks (e.g., Defects4J) remain valuable, they are static and increasingly vulnerable to contamination as code models are trained on large public corpora. A complementary strategy is to generate fresh, cutoff-aware datasets by selecting real system versions and injecting controlled bugs at the source level. Mutation testing is a natural basis for this strategy: it applies predefined mutation operators to programs and records whether the existing test suite detects each injected change. PIT is a state-of-the-practice mutation testing tool for Java that performs mutation at the bytecode level. This design makes mutation testing fast and practical, but PIT reports mutants primarily through XML, making them difficult to inspect, replay, or reuse as structured source-level dataset records. To address this gap, we present PITMuS, which combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although we evaluate PITMuS on eight open-source Java systems, it can be applied to any Java system where PITMuS can be integrated.
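The mutation-testing loop the abstract describes can be shown in miniature. The sketch below uses plain Python functions as stand-ins for a Java method and its mutant; the function names, the boundary mutant, and the two test suites are hypothetical, and real PIT operates on bytecode rather than on source functions like these.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # Boundary mutant: ">=" replaced by ">".
    return age > 18


def run_suite(impl, cases):
    """Return True if every (input, expected) case passes."""
    return all(impl(arg) == expected for arg, expected in cases)


def mutant_status(original, mutant, cases):
    """Classify a mutant as KILLED or SURVIVED under a test suite."""
    if not run_suite(original, cases):
        raise ValueError("suite must pass on the original program")
    return "SURVIVED" if run_suite(mutant, cases) else "KILLED"


weak_suite = [(20, True), (10, False)]
strong_suite = weak_suite + [(18, True)]

print(mutant_status(is_adult, is_adult_mutant, weak_suite))
print(mutant_status(is_adult, is_adult_mutant, strong_suite))
```

The weak suite never exercises the boundary value, so the mutant survives; adding the case for 18 kills it. This killed-or-survived status is the kind of per-mutant metadata a dataset record can carry alongside the code pair.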
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PITMuS, a tool that reconstructs source-level mutants from PIT's bytecode-level mutation testing outputs for Java by combining PIT XML metadata with debug information (line numbers, variable tables) from compiled class files. This enables automatic generation of structured datasets containing buggy/fixed code pairs, method-under-test context, documentation, and metadata, with evaluation reported on eight open-source Java systems; the approach is positioned as applicable to any Java project integrable with PIT.
Significance. If the reconstruction step is shown to be reliable, PITMuS would provide a practical, extensible method for producing fresh, version-specific bug datasets that mitigate contamination risks in static benchmarks such as Defects4J. This directly supports training and evaluation of automated repair, localization, testing, and documentation-driven techniques in software engineering.
major comments (2)
- [Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.
- [Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.
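The fidelity metrics requested in the Evaluation comment could be tallied along the following lines. The outcome labels (`exact`, `mismatch`, `unmapped`) and the sample values are invented purely for illustration and are not results from the paper.

```python
from collections import Counter

# Hypothetical per-mutant outcomes of a reconstruction run (not real data).
outcomes = ["exact", "exact", "unmapped", "exact", "mismatch", "exact"]


def fidelity_summary(outcomes):
    """Count reconstruction outcomes and compute the exact-match rate."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "total": total,
        "exact": counts["exact"],
        "mismatch": counts["mismatch"],
        "unmapped": counts["unmapped"],
        "success_rate": counts["exact"] / total if total else 0.0,
    }


summary = fidelity_summary(outcomes)
```

Reporting `mismatch` and `unmapped` separately distinguishes edits that were reconstructed incorrectly from mutants that could not be localized at all, which are different failure modes for downstream users of the dataset.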
minor comments (2)
- [Abstract] Abstract: The final sentence reads 'where PITMuS can be integrated' but the surrounding context indicates this should refer to the PIT mutation testing tool rather than the new tool itself.
- [Evaluation] The manuscript would benefit from a small table or figure in the evaluation section showing example reconstructed source edits alongside the original PIT XML entries to illustrate the mapping process.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions to strengthen the paper.
- Referee: [Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.
  Authors: We agree that the absence of quantitative fidelity metrics limits the strength of the central claim. The current evaluation demonstrates applicability across eight systems and the structure of the output datasets but does not report mapping success rates, failure counts, or manual verification results. In the revised manuscript we will add a dedicated subsection to the Evaluation section that includes these metrics, computed over the generated datasets, together with a description of the verification procedure. Revision: yes
- Referee: [Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.
  Authors: The manuscript presents the reconstruction process at a conceptual level to emphasize the overall tool workflow. We acknowledge that greater detail is needed for reproducibility and to address determinism. In the revision we will augment the Approach section with pseudocode for the localization and inversion steps and add a discussion of edge-case handling, including incomplete debug information and compiler-induced bytecode changes, for each supported PIT operator. Revision: yes
Circularity Check
Tool implementation paper with no derivation chain or circular elements
This is a software engineering tool paper describing PITMuS, which combines PIT XML output with standard Java class-file debug information (line numbers, variable tables) to reconstruct source-level mutants. The manuscript contains no equations, no fitted parameters, no predictions of quantities from subsets of data, and no load-bearing self-citations or uniqueness theorems. The central output is produced by an external pipeline operating on independent artifacts (PIT mutants and compiled class files), making the work self-contained against external benchmarks with no reduction of claims to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: PIT produces valid bytecode-level mutants that correspond to meaningful source changes when debug information is available.