Reproduction Test Generation for Java SWE Issues
Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3
The pith
This paper introduces TDD-Bench-Java, a 250-instance benchmark for Java repository-level reproduction test generation, and e-Otter++ for Java, which generates reproduction tests from issue reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper introduces TDD-Bench-Java, the first benchmark for Java repository-level reproduction test generation, comprising 250 instances sourced from popular open-source repositories, together with e-Otter++ for Java, which adapts a state-of-the-art Python reproduction test generator and achieves high performance on both TDD-Bench-Java and a contamination-free proprietary dataset.
What carries the argument
TDD-Bench-Java benchmark of 250 Java issue instances and the e-Otter++ adaptation of the Python reproduction test generator.
Load-bearing premise
The 250 curated instances are representative of typical Java software-engineering issues and the Python-to-Java adaptation transfers without major language-specific obstacles.
What would settle it
Running e-Otter++ on a fresh collection of Java issue reports drawn from repositories outside the benchmark and finding that it produces incorrect or no reproduction tests for most cases would show the claimed performance does not hold.
Figures
Original abstract
Given an issue on a software repository, a reproduction test confirms its presence in the code before it gets fixed and its absence after. Reproduction tests provide crucial execution-based feedback for diagnosis and validation during software development. Unfortunately, they are usually missing. Therefore, recent work has introduced both benchmarks and a thriving literature on solutions for reproduction test generation from issues. However, that work has focused on Python and neglected other languages such as Java, which is important for enterprise software. This paper introduces both a benchmark and a solution for Java repository-level reproduction test generation. The benchmark, TDD-Bench-Java, is the first to model this problem and comprises 250 instances sourced from popular open-source repositories. The solution, e-Otter++ for Java, adapts a state-of-the-art reproduction test generator for Python to yield high performance on Java. To evaluate in an industry setting, besides empirical results with TDD-Bench-Java, this paper also presents results with a contamination-free proprietary dataset. Overall, we hope that this paper contributes to bringing better diagnosis and validation to Java software development.
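To make the fail-to-pass property described above concrete, here is a minimal sketch (a hypothetical example, not taken from the paper): a reproduction test exercises the behavior the issue report demands, so it fails against the buggy pre-fix code and passes against the fixed code. The `firstWord` helper and its off-by-one bug are invented for illustration.

```java
import java.util.function.UnaryOperator;

public class ReproductionTestSketch {
    // Hypothetical pre-fix version: off-by-one bug reported in the issue.
    static String firstWordBuggy(String s) {
        int i = s.indexOf(' ');
        return i < 0 ? s : s.substring(0, i + 1); // bug: includes the trailing space
    }

    // Hypothetical post-fix version.
    static String firstWordFixed(String s) {
        int i = s.indexOf(' ');
        return i < 0 ? s : s.substring(0, i);
    }

    // The reproduction test: encodes the expected behavior from the issue report.
    static boolean reproductionTest(UnaryOperator<String> firstWord) {
        return "hello".equals(firstWord.apply("hello world"));
    }

    public static void main(String[] args) {
        // A valid reproduction test is fail-to-pass:
        // it fails on the pre-fix code and passes on the post-fix code.
        System.out.println("before fix: " + reproductionTest(ReproductionTestSketch::firstWordBuggy));
        System.out.println("after fix:  " + reproductionTest(ReproductionTestSketch::firstWordFixed));
    }
}
```

Benchmarks such as the one reviewed here typically automate exactly this check: run the generated test against the repository at the pre-fix and post-fix commits and require the fail-then-pass pattern.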
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TDD-Bench-Java, the first benchmark for Java repository-level reproduction test generation consisting of 250 instances sourced from popular open-source repositories, along with e-Otter++ for Java, an adaptation of a state-of-the-art Python reproduction test generator. It claims high performance on both the benchmark and a contamination-free proprietary dataset to support better diagnosis and validation in Java software development.
Significance. If the performance claims hold with proper validation, this would be a meaningful contribution by extending reproduction test generation to Java, an important language for enterprise software where such tests are often missing. The creation of an independent benchmark and evaluation on a proprietary dataset supply external grounding without evident circular dependence, addressing a gap left by prior Python-focused work.
Major comments (2)
- [Abstract] The claim of 'high performance' on TDD-Bench-Java and the proprietary dataset is given without quantitative metrics, error bars, or a detailed validation procedure. This evidence is load-bearing for the central claim of practical utility, and its absence prevents readers from assessing whether the observed results reflect genuine generalization.
- [Abstract] The benchmark description gives no selection protocol for the 250 instances, no issue-type distribution, no build-system coverage (Maven/Gradle), and no analysis of Java-specific obstacles such as the effect of static typing on test synthesis, JUnit lifecycle differences, or dependency resolution. These omissions are load-bearing for the representativeness of TDD-Bench-Java and the fidelity of the Python-to-Java adaptation.
Minor comments (1)
- The exact adaptations made in e-Otter++ for Java from the Python original should be specified to support reproducibility and to allow readers to assess language-specific transfer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our contributions.
Point-by-point responses
-
Referee: [Abstract] The claim of 'high performance' on TDD-Bench-Java and the proprietary dataset is given without quantitative metrics, error bars, or a detailed validation procedure. This evidence is load-bearing for the central claim of practical utility, and its absence prevents readers from assessing whether the observed results reflect genuine generalization.
Authors: We agree that the abstract would be strengthened by including key quantitative metrics to support the performance claims. The experimental results section reports specific success rates, pass@k metrics, and comparisons on both TDD-Bench-Java and the proprietary dataset, along with the evaluation protocol used to ensure contamination-free assessment. We will revise the abstract to incorporate representative performance figures and a brief reference to the validation approach, allowing readers to gauge the results at a glance. Revision: yes.
-
Referee: [Abstract] The benchmark description gives no selection protocol for the 250 instances, no issue-type distribution, no build-system coverage (Maven/Gradle), and no analysis of Java-specific obstacles such as the effect of static typing on test synthesis, JUnit lifecycle differences, or dependency resolution. These omissions are load-bearing for the representativeness of TDD-Bench-Java and the fidelity of the Python-to-Java adaptation.
Authors: The abstract provides a high-level summary, while Section 3 details the benchmark construction, including the selection protocol from popular open-source repositories, the issue-type distribution, coverage of Maven and Gradle build systems, and specific adaptations addressing Java challenges such as static typing, the JUnit lifecycle, and dependency management. We acknowledge that the abstract could better signal these elements. We will update it with a concise mention of the sourcing criteria, build-system coverage, and Java-specific adaptations to improve clarity on representativeness. Revision: yes.
Circularity Check
No significant circularity; independent benchmark creation and external-tool adaptation
Full rationale
The paper introduces TDD-Bench-Java as a new benchmark of 250 instances sourced from open-source repositories and adapts an existing state-of-the-art Python reproduction test generator (e-Otter) to Java as e-Otter++. Neither step reduces to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. The abstract and description explicitly position the benchmark as the first for Java repository-level issues and the solution as an adaptation of prior external work. Evaluation on a separate contamination-free proprietary dataset supplies external grounding. No equations, uniqueness theorems, or ansatzes are invoked that collapse back to the paper's own inputs. The derivation chain is therefore self-contained.
Reference graph
Works this paper leans on
- [1] Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel
- [2] Can LLMs replace manual annotation of software engineering artifacts? In Conference on Mining Software Repositories (MSR). 526–538
- [3] Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Validate SWE Patches. In International Conference on Machine Learning (ICML)
- [4]
- [5]
- [6]
- [7] Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. 2025. SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents. https://arxiv.org/abs/2505.20411
- [8] Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution. https://arxiv.org/abs/2503.12374
- [9]
- [10]
- [11] Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/
- [12]
- [13] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ...
- [14]
- [15]
- [16] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations (ICLR)
- [17] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis (ISSTA). 437–440. https://doi.org/10.1145/2610384.2628055
- [18]
- [19] Lara Khatib, Noble Saji Mathews, and Meiyappan Nagappan. 2026. AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests. In International Conference on Software Engineering (ICSE)
- [20]
- [21]
- [22] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157
- [23] Martin Mirchev, Ridwan Shariffdeen, Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2026. AutoCodeRover: Agentic Program Repair for SonarQube Issues. In Industry paper at Symposium on the Foundations of Software Engineering (FSE-Industry)
- [24] Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. In Conference on Neural Information Processing Systems (NeurIPS)
- [25] Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buccholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. 2025. SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents. https://arxiv.org/abs/2504.08703
- [26] Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, and Saikat Dutta. 2026. OmniCode: A Benchmark for Evaluating Software Engineering Agents. https://arxiv.org/abs/2602.02262
- [27] Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. 2025. Human-In-The-Loop Software Development Agents. In International Conference on Software Engineering: Software Engineering in Practice track (ICSE-SEIP). 342–352. https://doi.org/10.1109/ICSE...
- [28] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. In International Symposium on Software Testing and Analysis (ISSTA). https://doi.org/10.1145/3728963
- [29]
- [30] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...
- [31] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-based Software Engineering Agents. In Symposium on the Foundations of Software Engineering (FSE). 801–824. https://doi.org/10.1145/3715754
- [32] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent-computer Interfaces Enable Automated Software Engineering. In Conference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2024/hash/5a7c947568c1b1328ccc5230172e1e7c-Abstra...
- [33] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Shulin Xin, Linhao Zhang, Qi Liu, Aoyan Li, Lu Chen, Xiaojian Zhong, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Ming Ding, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. In NeurIPS Datasets and Benchmarks Track. https://openr...