pith. machine review for the scientific record.

arxiv: 2605.04320 · v2 · submitted 2026-05-05 · 💻 cs.SE

Recognition: no theorem link

Reproduction Test Generation for Java SWE Issues

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords reproduction test generation · Java · benchmark · software engineering · test generation · issue reports · TDD-Bench-Java · e-Otter++

The pith

This paper introduces TDD-Bench-Java, a 250-instance benchmark for Java reproduction test generation, and e-Otter++, a generator that produces reproduction tests for Java issues directly from their issue reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the absence of reproduction tests in Java software projects by creating the first dedicated benchmark and an adapted generator for the task. TDD-Bench-Java supplies 250 curated instances drawn from popular open-source Java repositories to model repository-level reproduction test generation. The solution adapts a Python reproduction test generator to Java and reports high performance on the benchmark as well as on a separate proprietary dataset. Reproduction tests supply concrete execution feedback that confirms a reported issue exists before a fix and disappears afterward, aiding diagnosis and validation steps in development. The work therefore extends prior Python-centric research to support enterprise Java codebases.
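
To make the mechanism concrete, the following is a minimal JUnit 5 sketch of what a reproduction test looks like. The issue, VersionParser, and Version are invented for illustration; they do not come from the paper or its benchmark.

    // Hypothetical issue: "VersionParser drops the patch component of a
    // semantic version". The test encodes the expected behavior, so it
    // fails on the pre-fix code and passes once the fix lands.
    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class VersionParserReproductionTest {

        @Test
        void parseKeepsPatchComponent() {
            Version v = VersionParser.parse("1.4.2");
            assertEquals(2, v.patch());
        }
    }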

Core claim

This paper introduces TDD-Bench-Java, the first benchmark for Java repository-level reproduction test generation that comprises 250 instances sourced from popular open-source repositories, together with e-Otter++ for Java, which adapts a state-of-the-art Python reproduction test generator and yields high performance on both TDD-Bench-Java and a contamination-free proprietary dataset.

What carries the argument

TDD-Bench-Java benchmark of 250 Java issue instances and the e-Otter++ adaptation of the Python reproduction test generator.

Load-bearing premise

The 250 curated instances are representative of typical Java software-engineering issues and the Python-to-Java adaptation transfers without major language-specific obstacles.

What would settle it

Running e-Otter++ on a fresh collection of Java issue reports drawn from repositories outside the benchmark, and finding that it produces incorrect or no reproduction tests for most cases, would show that the claimed performance does not hold.
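
A minimal sketch of that per-instance check, assuming Maven projects with the generated test present in both the pre-fix and post-fix checkouts; the paths, names, and use of Maven are illustrative assumptions, not the paper's harness.

    import java.io.IOException;
    import java.nio.file.Path;

    public class FailToPassCheck {

        // True iff the generated test fails on the pre-fix checkout and
        // passes on the post-fix checkout, i.e. it reproduces the issue
        // and validates the fix.
        static boolean failToPass(Path oldCheckout, Path newCheckout, String testClass)
                throws IOException, InterruptedException {
            return !runTest(oldCheckout, testClass) && runTest(newCheckout, testClass);
        }

        // Runs one test class via Maven Surefire; exit code 0 means it passed.
        static boolean runTest(Path checkout, String testClass)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder("mvn", "-q", "test", "-Dtest=" + testClass)
                    .directory(checkout.toFile())
                    .inheritIO()
                    .start();
            return p.waitFor() == 0;
        }
    }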

Figures

Figures reproduced from arXiv:2605.04320 by Avraham Shinnar, Jatin Ganhotra, Martin Hirzel, and Toufique Ahmed.

Figure 1. Evaluation harness for bug reproduction test.
Figure 2. Overview of test generation pipeline. The paper uses the fail-to-pass rate as its primary evaluation metric, following prior work [4, 23]: the percentage of generated tests that fail on the original code (c_old) and pass after applying the fix (c_new), indicating successful reproduction and validation of the issue. A minimal sketch of this metric appears after the figure list.
Figure 3. Performance of Otter, e-Otter, and e-Otter++ on …
Figure 4. e-Otter++ generated test for fasterxml__jackson…
Figure 5. Comparing the code patches in open-source and proprietary data.
Figure 6. Comparing the word count of issue descriptions in …
Figure 7. e-Otter performance on Open- and Closed-sourced …
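
The fail-to-pass rate in the Figure 2 caption reduces to a simple aggregation over per-instance verdicts. A minimal Java sketch follows; the Outcome record is invented for illustration and is not the paper's evaluation code.

    import java.util.List;

    public class FailToPassRate {

        // Per-instance verdict: did the generated test fail on c_old and
        // pass on c_new?
        record Outcome(boolean failsOnOld, boolean passesOnNew) {}

        // Percentage of instances whose test both fails before the fix and
        // passes after it -- the paper's primary evaluation metric.
        static double failToPassRate(List<Outcome> outcomes) {
            long reproduced = outcomes.stream()
                    .filter(o -> o.failsOnOld() && o.passesOnNew())
                    .count();
            return 100.0 * reproduced / outcomes.size();
        }
    }
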
Original abstract

Given an issue on a software repository, a reproduction test confirms its presence in the code before it gets fixed and its absence after. Reproduction tests provide crucial execution-based feedback for diagnosis and validation during software development. Unfortunately, they are usually missing. Therefore, recent work has introduced both benchmarks and a thriving literature on solutions for reproduction test generation from issues. However, that work has focused on Python and neglected other languages such as Java, which is important for enterprise software. This paper introduces both a benchmark and a solution for Java repository-level reproduction test generation. The benchmark, TDD-Bench-Java, is the first to model this problem and comprises 250 instances sourced from popular open-source repositories. The solution, e-Otter++ for Java, adapts a state-of-the-art reproduction test generator for Python to yield high performance on Java. To evaluate in an industry setting, besides empirical results with TDD-Bench-Java, this paper also presents results with a contamination-free proprietary dataset. Overall, we hope that this paper contributes to bringing better diagnosis and validation to Java software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TDD-Bench-Java, the first benchmark for Java repository-level reproduction test generation consisting of 250 instances sourced from popular open-source repositories, along with e-Otter++ for Java, an adaptation of a state-of-the-art Python reproduction test generator. It claims high performance on both the benchmark and a contamination-free proprietary dataset to support better diagnosis and validation in Java software development.

Significance. If the performance claims hold with proper validation, this would be a meaningful contribution by extending reproduction test generation to Java, an important language for enterprise software where such tests are often missing. The creation of an independent benchmark and evaluation on a proprietary dataset supply external grounding without evident circular dependence, addressing a gap left by prior Python-focused work.

major comments (2)
  1. [Abstract] The assertion of 'high performance' on TDD-Bench-Java and the proprietary dataset supplies no quantitative metrics, error bars, or detailed validation procedures, which is load-bearing for the central claim of practical utility and prevents assessment of whether observed results reflect genuine generalization.
  2. [Abstract] The benchmark description provides no selection protocol for the 250 instances, issue-type distribution, build-system coverage (Maven/Gradle), or analysis of Java-specific obstacles such as static typing effects on test synthesis, JUnit lifecycle differences, or dependency resolution; these omissions are load-bearing for the representativeness of TDD-Bench-Java and the fidelity of the Python-to-Java adaptation.
minor comments (1)
  1. The exact adaptations made in e-Otter++ for Java from the Python original should be specified to support reproducibility and to allow readers to assess language-specific transfer.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [Abstract] The assertion of 'high performance' on TDD-Bench-Java and the proprietary dataset supplies no quantitative metrics, error bars, or detailed validation procedures, which is load-bearing for the central claim of practical utility and prevents assessment of whether observed results reflect genuine generalization.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics to support the performance claims. The experimental results section of the manuscript reports specific success rates, pass@k metrics, and comparisons on both TDD-Bench-Java and the proprietary dataset, along with the evaluation protocol used to ensure contamination-free assessment. We will revise the abstract to incorporate representative performance figures and a brief reference to the validation approach, allowing readers to better gauge the results at a glance. revision: yes

  2. Referee: [Abstract] The benchmark description provides no selection protocol for the 250 instances, issue-type distribution, build-system coverage (Maven/Gradle), or analysis of Java-specific obstacles such as static typing effects on test synthesis, JUnit lifecycle differences, or dependency resolution; these omissions are load-bearing for the representativeness of TDD-Bench-Java and the fidelity of the Python-to-Java adaptation.

    Authors: The abstract provides a high-level summary, while Section 3 of the manuscript details the benchmark construction, including the selection protocol from popular open-source repositories, issue-type distribution, coverage of Maven and Gradle build systems, and specific adaptations addressing Java challenges such as static typing, JUnit lifecycle, and dependency management. We acknowledge that the abstract could better signal these elements. We will update the abstract with a concise mention of the sourcing criteria, build-system coverage, and Java-specific adaptations to improve clarity on representativeness. revision: yes
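
One invented illustration (not from the paper) of the JUnit lifecycle point above: a generator ported from Python must emit different annotations and exception idioms depending on whether a repository uses JUnit 4 or JUnit 5. Parser is a hypothetical class under test.

    import static org.junit.jupiter.api.Assertions.assertThrows;

    import org.junit.jupiter.api.BeforeEach;
    import org.junit.jupiter.api.Test;

    class LifecycleExampleTest {
        private Parser parser;

        // JUnit 5 lifecycle hook; a JUnit 4 repository needs @Before
        // (org.junit.Before) here instead.
        @BeforeEach
        void setUp() {
            parser = new Parser();
        }

        // JUnit 5 asserts exceptions with assertThrows; JUnit 4 instead uses
        // @Test(expected = IllegalArgumentException.class). Emitting the
        // wrong dialect yields tests that fail to compile against the
        // repository's test dependencies.
        @Test
        void rejectsEmptyInput() {
            assertThrows(IllegalArgumentException.class, () -> parser.parse(""));
        }
    }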

Circularity Check

0 steps flagged

No significant circularity; independent benchmark creation and external-tool adaptation provide external grounding

Full rationale

The paper introduces TDD-Bench-Java as a new benchmark of 250 instances sourced from open-source repositories and adapts an existing state-of-the-art Python reproduction test generator (e-Otter) to Java as e-Otter++. Neither step reduces to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. The abstract and description explicitly position the benchmark as the first for Java repository-level issues and the solution as an adaptation of prior external work. Evaluation on a separate contamination-free proprietary dataset supplies external grounding. No equations, uniqueness theorems, or ansatzes are invoked that collapse back to the paper's own inputs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical software-engineering contribution; no mathematical free parameters, domain axioms, or invented entities are required by the central claim.

pith-pipeline@v0.9.0 · 5490 in / 1051 out tokens · 60846 ms · 2026-05-11T00:43:32.765018+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 21 canonical work pages · 2 internal anchors

  1. [1]

    Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel

  2. [2]

    Can LLMs replace manual annotation of software engineering artifacts? In Conference on Mining Software Repositories (MSR). 526–538

  3. [3]

    Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Validate SWE Patches. In International Conference on Machine Learning (ICML)

  4. [4]

    Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, and Martin Hirzel. 2026. Heterogeneous Prompting and Execution Feedback for SWE Issue Test Generation and Selection. In International Conference on Software Engineering (ICSE). https://arxiv.org/abs/2508.06365

  5. [5]

    Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. 2024. TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? https://arxiv.org/abs/2412.02883

  6. [6]

    Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2026. Unified Software Engineering agent as AI Software Engineer. In International Conference on Software Engineering (ICSE). https://arxiv.org/abs/2506.14683

  7. [7]

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. 2025. SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents. https://arxiv.org/abs/2505.20411

  8. [8]

    Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution. https://arxiv.org/abs/2503.12374

  9. [9]

    Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, and He Ye. 2025. Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases. https://arxiv.org/abs/2507.19942

  10. [10]

    Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, and Satish Chandra. 2025. Agentic Bug Reproduction for Effective Automated Program Repair at Google. https://arxiv.org/abs/2502.01821

  11. [11]

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/

  12. [12]

    Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, and Zhongxin Liu. 2025. NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition. https://arxiv.org/abs/2507.18130

  13. [13]

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ...

  14. [14]

    Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Re, and Azalia Mirhoseini. 2025. CodeMonkeys: Scaling Test-Time Compute for Software Engineering. https://arxiv.org/abs/2501.14723

  15. [15]

    Jatin Ganhotra, Sami Serhan, Antonio Abu Nassar, Avraham Shinnar, Ziv Nevo, and Martin Hirzel. 2026. Resolving Java Code Repository Issues with iSWE Agent. https://arxiv.org/abs/2603.11356

  16. [16]

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations (ICLR)

  17. [17]

    René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis (ISSTA). 437–440. https://doi.org/10.1145/2610384.2628055

  18. [18]

    Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction. In International Conference on Software Engineering (ICSE). 2312–2323. https://doi.org/10.1109/ICSE48619.2023.00194

  19. [19]

    Lara Khatib, Noble Saji Mathews, and Meiyappan Nagappan. 2026. AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests. In International Conference on Software Engineering (ICSE)

  20. [20]

    KeFan Li, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, and Weifeng Lv. 2025. InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution. https://arxiv.org/abs/2511.16004

  21. [21]

    Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. https://arxiv.org/abs/2506.12286

  22. [22]

    Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157

  23. [23]

    Martin Mirchev, Ridwan Shariffdeen, Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2026. AutoCodeRover: Agentic Program Repair for SonarQube Issues. In Industry paper at Symposium on the Foundations of Software Engineering (FSE-Industry)

  24. [24]

    Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. In Conference on Neural Information Processing Systems (NeurIPS)

  25. [25]

    Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. 2025. SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents. https://arxiv.org/abs/2504.08703

  26. [26]

    Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, and Saikat Dutta. 2026. OmniCode: A Benchmark for Evaluating Software Engineering Agents. https://arxiv.org/abs/2602.02262

  27. [27]

    Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. 2025. Human-In-The-Loop Software Development Agents. In International Conference on Software Engineering: Software Engineering in Practice track (ICSE-SEIP). 342–352. https://doi.org/10.1109/ICSE...

  28. [28]

    Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. In International Symposium on Software Testing and Analysis (ISSTA). https://doi.org/10.1145/3728963

  29. [29]

    Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, and Cuiyun Gao. 2024. AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions. https://arxiv.org/abs/2411.18015

  30. [30]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

  31. [31]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-based Software Engineering Agents. In Symposium on the Foundations of Software Engineering (FSE). 801–824. https://doi.org/10.1145/3715754

  32. [32]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent-computer Interfaces Enable Automated Software Engineering. In Conference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2024/hash/5a7c947568c1b1328ccc5230172e1e7c-Abstra...

  33. [33]

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Shulin Xin, Linhao Zhang, Qi Liu, Aoyan Li, Lu Chen, Xiaojian Zhong, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Ming Ding, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. In NeurIPS Datasets and Benchmarks Track. https://openr...