MigrationBench: Repository-Level Code Migration Benchmark from Java 8
With the rapid advancement of large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, but they primarily focus on code generation and issue-resolution tasks. In contrast, we introduce a new coding benchmark, MigrationBench, with a distinct focus: code migration. MigrationBench aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17, 21). It includes a full dataset of 5,102 repositories and a subset, named selected, of 300 repositories. The selected subset is a representative sample curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose an agentic framework and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. For the selected subset with Claude-4.5-Sonnet, our agentic framework achieves success rates (pass@1) of 71.67% and 53.33% for minimal and maximal migration respectively. The dataset and evaluation source code are available at https://huggingface.co/collections/AmazonScience/migrationbench and https://github.com/amazon-science/MigrationBench respectively.
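To make the reported pass@1 figures concrete, the following is a minimal sketch of how such a rate could be computed, assuming one migration attempt per repository so that pass@1 reduces to the fraction of repositories migrated successfully. The function name `pass_at_1` and the success counts 215 and 160 are illustrative assumptions, inferred from the reported percentages and the 300-repository subset size; the abstract does not state these counts.

```python
def pass_at_1(successes: int, total: int) -> float:
    """Return the success rate in percent, assuming one attempt per repository."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return 100.0 * successes / total


# Size of the selected subset, as stated in the abstract.
SELECTED_SIZE = 300

# Hypothetical success counts, inferred from the reported percentages
# (assumption: not given in the source).
minimal_rate = pass_at_1(215, SELECTED_SIZE)
maximal_rate = pass_at_1(160, SELECTED_SIZE)

print(f"minimal migration: {minimal_rate:.2f}%")  # minimal migration: 71.67%
print(f"maximal migration: {maximal_rate:.2f}%")  # maximal migration: 53.33%
```

Under these assumptions, the two reported rates would correspond to roughly 215 and 160 of the 300 selected repositories.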