
arxiv: 2605.21930 · v1 · pith:QBOA5EG5new · submitted 2026-05-21 · 💻 cs.SE

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

Pith reviewed 2026-05-22 05:19 UTC · model grok-4.3

classification 💻 cs.SE
keywords mutation testing · bug dataset generation · source-level reconstruction · Java · PIT · bug repair · software engineering · LLM training data

The pith

PITMuS reconstructs source-level edits from PIT bytecode mutants to automatically generate structured bug datasets with paired code, documentation, and metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PITMuS to solve the problem of creating fresh, context-rich bug artifacts needed for training and evaluating LLM-based software engineering tools. It starts from PIT's fast bytecode-level mutations and uses debug information in compiled Java class files to map each mutant back to the exact source code change. The tool then outputs structured records that include the buggy version, the fixed version, the method under test, documentation, and other metadata. These datasets can be produced from any Java system that integrates with PIT and help avoid the contamination risks that affect static benchmarks. A sympathetic reader would care because such on-demand datasets support more reliable development of bug localization, repair, and test generation methods.

Core claim

PITMuS combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although evaluated on eight open-source Java systems, the approach can be applied to any Java system where PIT can be integrated.
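As an illustration of the first input named above, the following is a minimal sketch of reading mutant entries from a PIT XML report. The element names follow the layout PIT commonly emits in `mutations.xml`; the sample report, the `parse_pit_report` function, and the dictionary keys are hypothetical and are not taken from the PITMuS implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry report in the usual shape of PIT's mutations.xml.
SAMPLE_REPORT = """<mutations>
  <mutation detected="true" status="KILLED">
    <sourceFile>Calculator.java</sourceFile>
    <mutatedClass>com.example.Calculator</mutatedClass>
    <mutatedMethod>add</mutatedMethod>
    <methodDescription>(II)I</methodDescription>
    <lineNumber>7</lineNumber>
    <mutator>org.pitest.mutationtest.engine.gregor.mutators.MathMutator</mutator>
    <description>Replaced integer addition with subtraction</description>
  </mutation>
  <mutation detected="false" status="SURVIVED">
    <sourceFile>Calculator.java</sourceFile>
    <mutatedClass>com.example.Calculator</mutatedClass>
    <mutatedMethod>sub</mutatedMethod>
    <methodDescription>(II)I</methodDescription>
    <lineNumber>11</lineNumber>
    <mutator>org.pitest.mutationtest.engine.gregor.mutators.MathMutator</mutator>
    <description>Replaced integer subtraction with addition</description>
  </mutation>
</mutations>"""


def parse_pit_report(xml_text):
    """Return one plain dict per <mutation> element in the report."""
    root = ET.fromstring(xml_text)
    records = []
    for m in root.iter("mutation"):
        records.append({
            "status": m.get("status"),
            "detected": m.get("detected") == "true",
            "source_file": m.findtext("sourceFile"),
            "class": m.findtext("mutatedClass"),
            "method": m.findtext("mutatedMethod"),
            "line": int(m.findtext("lineNumber")),
            # Keep only the short mutator name, e.g. "MathMutator".
            "mutator": m.findtext("mutator").rsplit(".", 1)[-1],
            "description": m.findtext("description"),
        })
    return records


records = parse_pit_report(SAMPLE_REPORT)
survived = [r for r in records if not r["detected"]]
```

Each dictionary carries only what the report states (file, class, method, line, operator). The source edit itself is absent, which is the gap the reconstruction step is meant to close.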

What carries the argument

Source-level mutant reconstruction that maps each bytecode mutant from PIT back to its corresponding source edit using debug information in Java class files.
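To make the idea of mapping a mutant back to a source edit concrete, here is a deliberately naive sketch that rewrites the reported source line according to the mutation description. It is an assumption-laden illustration, not the paper's algorithm: the `REWRITES` table, the regular expressions, and `candidate_edits` are hypothetical, and only two arithmetic descriptions are covered.

```python
import re

# Hypothetical mapping from a PIT description string to (pattern, replacement).
# The lookarounds skip ++, --, +=, -= and the -> arrow so only binary operators match.
REWRITES = {
    "Replaced integer addition with subtraction": (r"(?<!\+)\+(?![+=])", "-"),
    "Replaced integer subtraction with addition": (r"(?<!-)-(?![-=>])", "+"),
}


def candidate_edits(source_lines, line_number, description):
    """Return every single-operator rewrite of the reported (1-based) line."""
    pattern, replacement = REWRITES[description]
    original = source_lines[line_number - 1]
    candidates = []
    for match in re.finditer(pattern, original):
        mutated = original[:match.start()] + replacement + original[match.end():]
        candidates.append(mutated)
    return candidates


source = [
    "int add(int a, int b) {",
    "    return a + b;",
    "}",
    "int sum3(int a, int b, int c) {",
    "    return a + b + c;",
    "}",
]
single = candidate_edits(source, 2, "Replaced integer addition with subtraction")
ambiguous = candidate_edits(source, 5, "Replaced integer addition with subtraction")
```

Line 2 yields exactly one candidate, while line 5 yields two. The line number and operator alone do not identify the edit in the second case, which suggests why additional information from the class file is needed to pick the right occurrence.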

If this is right

  • Fresh datasets can be generated from current versions of Java systems rather than relying on fixed historical benchmarks.
  • The produced records directly support training and evaluation of automated bug localization and repair techniques.
  • Each dataset entry includes documentation context and metadata that enable documentation-driven automation.
  • The method scales to any Java system compatible with PIT integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analogous reconstruction pipelines could be built for mutation tools in languages other than Java.
  • Repeated application over successive system versions could yield evolving benchmarks that stay current with code changes.
  • The generated mutant metadata might be reused to improve test oracle generation or mutation-based test adequacy assessment.

Load-bearing premise

Debug information present in compiled Java class files is always sufficient and accurate enough to localize and reconstruct the precise source-level edit corresponding to each bytecode mutant without introducing mapping errors or omissions.
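The debug information in question includes the `LineNumberTable` attribute, which `javap -l` prints as `line N: offset` pairs. The sketch below parses one method's table from captured text and looks up the source line for a bytecode offset. The sample text and both function names are hypothetical, and the code assumes it is given the table of a single method rather than a whole class listing.

```python
import bisect
import re

# Hypothetical excerpt of `javap -l` output for a single method.
SAMPLE_JAVAP = """\
  public int clamp(int);
    LineNumberTable:
      line 7: 0
      line 8: 4
      line 10: 9
"""

LINE_ENTRY = re.compile(r"^\s*line (\d+): (\d+)\s*$", re.MULTILINE)


def parse_line_table(javap_text):
    """Return sorted (bytecode_offset, source_line) pairs."""
    pairs = [(int(offset), int(line)) for line, offset in LINE_ENTRY.findall(javap_text)]
    return sorted(pairs)


def line_for_offset(table, offset):
    """Source line whose entry covers `offset`, or None if it cannot be resolved."""
    if not table:
        return None  # e.g. the class was compiled without line-number debug info
    offsets = [o for o, _ in table]
    i = bisect.bisect_right(offsets, offset) - 1
    return table[i][1] if i >= 0 else None


table = parse_line_table(SAMPLE_JAVAP)
```

The `None` branch marks where the premise can fail: a class compiled with `-g:none` carries no `LineNumberTable`, so no offset resolves to a source line.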

What would settle it

A concrete case in which the reconstructed source edit for a PIT mutant produces a different program behavior or fails to match the original bytecode change when the code is recompiled and re-mutated.

Figures

Figures reproduced from arXiv: 2605.21930 by Soneya Binta Hossain, Tasfia Tasnim.

Figure 1: Example of a source line, a mutation applied to it, and the …
Figure 2: Overview of PITMuS workflow.
… and a Java JDK that provides javap. The target system must provide a PIT XML report, compiled .class files with debug metadata, and the original Java source files. 2.1 Automated Dataset Generation: Dataset generation turns each PIT report entry into a method-level record. For a Java system $P$ with PIT report $R_P$, PITMuS defines the dataset as $\mathcal{D}_P = \{ \langle \mathit{orig}_i, \mathit{mut}_i, \mathit{doc}_i, \mathit{meta}_i \rangle \mid \ldots$
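The four-part record ⟨orig, mut, doc, meta⟩ in the dataset definition above could be represented as follows. This is a sketch under assumptions: the field names mirror the tuple, but the `MutantRecord` class, the metadata keys, and the JSON Lines serialization are hypothetical and do not describe the tool's actual output format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MutantRecord:
    orig: str  # original (fixed) method source
    mut: str   # mutated (buggy) method source
    doc: str   # documentation context, e.g. the method's Javadoc
    meta: dict = field(default_factory=dict)  # operator, line, test status, ...


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


record = MutantRecord(
    orig="int add(int a, int b) { return a + b; }",
    mut="int add(int a, int b) { return a - b; }",
    doc="/** Returns the sum of a and b. */",
    meta={"mutator": "MathMutator", "line": 7, "status": "KILLED"},
)
jsonl = to_jsonl([record])
```

One record per line keeps the dataset streamable and lets downstream tools filter on `meta` without loading every code pair.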

LLM-based software engineering increasingly depends on executable, context-rich bug artifacts: paired correct and buggy code, methods under test (MUTs), documentation, and metadata. These artifacts support the training and evaluation of automated bug localization and repair techniques, testing and test oracle generation methods, and documentation-driven automation. Although curated benchmarks (e.g., Defects4J) remain valuable, they are static and increasingly vulnerable to contamination as code models are trained on large public corpora. A complementary strategy is to generate fresh, cutoff-aware datasets by selecting real system versions and injecting controlled bugs at the source level. Mutation testing is a natural basis for this strategy: it applies predefined mutation operators to programs and records whether the existing test suite detects each injected change. PIT is a state-of-the-practice mutation testing tool for Java that performs mutation at the bytecode level. This design makes mutation testing fast and practical, but PIT reports mutants primarily through XML, making them difficult to inspect, replay, or reuse as structured source-level dataset records. To address this gap, we present PITMuS, which combines PIT XML metadata with debug information from compiled Java class files to localize and reconstruct the source edit corresponding to each mutant. PITMuS then automatically produces structured datasets containing source-level buggy and fixed code pairs, documentation context, and metadata for downstream training and evaluation. Although we evaluate PITMuS on eight open-source Java systems, it can be applied to any Java system where PITMuS can be integrated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PITMuS, a tool that reconstructs source-level mutants from PIT's bytecode-level mutation testing outputs for Java by combining PIT XML metadata with debug information (line numbers, variable tables) from compiled class files. This enables automatic generation of structured datasets containing buggy/fixed code pairs, method-under-test context, documentation, and metadata, with evaluation reported on eight open-source Java systems; the approach is positioned as applicable to any Java project integrable with PIT.

Significance. If the reconstruction step is shown to be reliable, PITMuS would provide a practical, extensible method for producing fresh, version-specific bug datasets that mitigate contamination risks in static benchmarks such as Defects4J. This directly supports training and evaluation of automated repair, localization, testing, and documentation-driven techniques in software engineering.

major comments (2)
  1. [Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.
  2. [Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.
minor comments (2)
  1. [Abstract] Abstract: The final sentence reads 'where PITMuS can be integrated' but the surrounding context indicates this should refer to the PIT mutation testing tool rather than the new tool itself.
  2. [Evaluation] The manuscript would benefit from a small table or figure in the evaluation section showing example reconstructed source edits alongside the original PIT XML entries to illustrate the mapping process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions to strengthen the paper.

  1. Referee: [Evaluation] Evaluation section: The manuscript states that PITMuS was applied to eight open-source Java systems and produces usable source-level datasets, but reports no quantitative metrics on reconstruction fidelity (e.g., success rate of mapping bytecode mutants to exact source edits, number of mapping failures or omissions, or results from manual verification against ground truth). This directly bears on the central claim that the generated pairs are accurate and usable for downstream tasks.

    Authors: We agree that the absence of quantitative fidelity metrics limits the strength of the central claim. The current evaluation demonstrates applicability across eight systems and the structure of the output datasets but does not report mapping success rates, failure counts, or manual verification results. In the revised manuscript we will add a dedicated subsection to the Evaluation section that includes these metrics, computed over the generated datasets, together with a description of the verification procedure. revision: yes

  2. Referee: [Approach] Approach / Implementation: The description of how debug information is used to localize and invert each mutant operator (e.g., for multi-line statements or optimized bytecode) is high-level; without an explicit algorithm, pseudocode, or handling of edge cases such as incomplete debug info or compiler optimizations, it is difficult to assess whether the reconstruction is deterministic and complete for all PIT operators.

    Authors: The manuscript presents the reconstruction process at a conceptual level to emphasize the overall tool workflow. We acknowledge that greater detail is needed for reproducibility and to address determinism. In the revision we will augment the Approach section with pseudocode for the localization and inversion steps and add a discussion of edge-case handling, including incomplete debug information and compiler-induced bytecode changes, for each supported PIT operator. revision: yes
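The fidelity metrics requested in the first major comment could be tabulated along the following lines. This is only a sketch: the outcome labels (`exact`, `ambiguous`, `failed`) and the `fidelity_summary` function are hypothetical, and the sample outcomes are illustrative placeholders rather than results from the paper.

```python
from collections import Counter


def fidelity_summary(outcomes):
    """Summarize per-mutant reconstruction outcomes.

    `outcomes` is an iterable of labels, one per mutant, such as
    'exact', 'ambiguous', or 'failed' (labels are assumptions).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    exact_rate = counts["exact"] / total if total else 0.0
    return {"total": total, "counts": dict(counts), "exact_rate": exact_rate}


# Illustrative placeholder labels, not measured data.
summary = fidelity_summary(["exact", "exact", "ambiguous", "failed"])
```

Reporting the ambiguous and failed counts separately from the exact rate would show both how often the mapping succeeds and how it fails.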

Circularity Check

0 steps flagged

Tool implementation paper with no derivation chain or circular elements


This is a software engineering tool paper describing PITMuS, which combines PIT XML output with standard Java class-file debug information (line numbers, variable tables) to reconstruct source-level mutants. The manuscript contains no equations, no fitted parameters, no predictions of quantities from subsets of data, and no load-bearing self-citations or uniqueness theorems. The central output is produced by an external pipeline operating on independent artifacts (PIT mutants and compiled class files), so none of the paper's claims reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about mutation testing validity and the presence of debug information in Java class files; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption: PIT produces valid bytecode-level mutants that correspond to meaningful source changes when debug information is available.
    Invoked when the reconstruction step is presented as feasible for any Java system where PIT can run.

pith-pipeline@v0.9.0 · 5807 in / 1216 out tokens · 46365 ms · 2026-05-22T05:19:54.366665+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    AssertLab. [n.d.]. PITMuS: PIT Mutations In the Source Code. YouTube demo, https://youtu.be/zgHkXnsgciw

  2. [2]

    R. A. DeMillo, R. J. Lipton, and F. G. Sayward. 1978. Hints on Test Data Selection: Help for the Practicing Programmer. Computer 11, 4 (1978), 34–41

  3. [3]

    Soneya Binta Hossain and Matthew B. Dwyer. 2025. TOGLL: Correct and Strong Test Oracle Generation with LLMs. In 47th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 1475–1487. doi:10.1109/ICSE55347.2025.00098

  4. [4]

    Soneya Binta Hossain, Matthew B. Dwyer, Sebastian G. Elbaum, and Anh Nguyen-Tuong. 2023. Measuring and Mitigating Gaps in Structural Testing. In 45th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 1712–1723. doi:10.1109/ICSE48619.2023.00147

  5. [5]

    Soneya Binta Hossain, Antonio Filieri, Matthew B. Dwyer, Sebastian G. Elbaum, and Willem Visser. 2023. Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 120–132. doi:1...

  6. [6]

    Soneya Binta Hossain and Tasfia Tasnim. 2026. PITMuS: PIT Mutations In the Source Code. https://github.com/assert-lab/PITMuS

  7. [7]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.ne...

  8. [8]

    René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testi...

  9. [9]

    Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2018. Mutation Testing Advances: An Analysis and Survey. doi:10.1016/bs.adcom.2018.03.015

  10. [10]

    PIT Project. 2026. Real World Mutation Testing. Retrieved April 16, 2026 from https://pitest.org/

  11. [11]

    C. Thunes. [n.d.]. javalang. PyPI project page, https://pypi.org/project/javalang/. Accessed April 16, 2026

  12. [12]

    Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. 2024. Benchmarking Benchmark Leakage in Large Language Models. arXiv preprint arXiv:2404.18824 (2024). https://arxiv.org/abs/2404.18824

  13. [13]

    Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. 2025. SWE-bench Goes Live! arXiv preprint arXiv:2505.23419 (2025). https://arxiv.org/abs/2505.23419