ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java
Pith reviewed 2026-05-20 23:00 UTC · model grok-4.3
The pith
The strongest state-of-the-art coding agent reaches only about 15 percent aggregate test pass on focused-layer enterprise Java framework migrations, and about 12 percent on whole applications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScarfBench demonstrates that behavior-preserving cross-framework refactoring of enterprise Java applications is not yet reliably achievable by current coding agents, as measured by low success rates across 204 tasks and a derived taxonomy of failures at build, deploy, and test stages.
What carries the argument
The benchmark's 34 application triples and application-specific executable oracles that require a candidate to compile, deploy in a target runtime container, and pass behavioral tests over the observable interface.
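The staged oracle described above can be sketched as a small classifier that reports the earliest stage at which a candidate fails. This is only an illustration, not the benchmark's actual harness: the names `Stage`, `OracleResult`, and `first_failing_stage` are hypothetical, and treating an empty test suite as a test-stage failure is an assumption made here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    BUILD = "build"
    DEPLOY = "deploy"
    TEST = "test"


@dataclass(frozen=True)
class OracleResult:
    """Hypothetical record of one candidate migration's oracle run."""
    compiled: bool
    deployed: bool
    tests_passed: int
    tests_total: int


def first_failing_stage(result: OracleResult) -> Optional[Stage]:
    """Return the earliest failed stage, or None if every stage passed."""
    if not result.compiled:
        return Stage.BUILD
    if not result.deployed:
        return Stage.DEPLOY
    # Assumption: an empty test suite cannot demonstrate equivalence.
    if result.tests_total == 0 or result.tests_passed < result.tests_total:
        return Stage.TEST
    return None
```

Under this sketch, a candidate counts as fully behaviorally equivalent only when `first_failing_stage` returns `None`; a candidate that compiles and deploys but fails one behavioral test is attributed to the test stage.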
Load-bearing premise
The 34 expert-written applications and their oracles are representative of real enterprise migration problems and correctly measure behavior preservation.
What would settle it
A new coding agent that achieves over 70 percent aggregate test pass rate on the full set of 204 tasks while producing implementations that independent review confirms preserve observable behavior.
Original abstract
Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target framework; the agent must synthesize a target implementation preserving the source behavior. Correctness is evaluated by an application-specific executable oracle: the candidate must compile, deploy in a containerized target runtime, and pass behavioral tests over the application's observable interface. We evaluate five state-of-the-art coding agents on ScarfBench. The strongest achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target. Difficulty is asymmetric across framework directions and architectural layers: Spring<->Quarkus is the most tractable pair, and Jakarta-targeted migrations are hardest. From LLM-as-a-judge and expert adjudication of failed-task traces, we derive a taxonomy of recurring failure categories spanning build, deploy, and test stages. We release the benchmark, harness, and agent traces at https://scarfbench.info.
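The counts in the abstract are consistent with a simple enumeration: three framework variants per application, and one directed task per ordered pair of frameworks. The sketch below reproduces the arithmetic under that assumption, which matches the reported totals but is not stated explicitly in the abstract. Application identifiers are placeholder integers.

```python
from collections import Counter
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
NUM_APPLICATIONS = 34  # 29 focused single-layer + 5 whole applications

# One variant of each application per framework.
variants = NUM_APPLICATIONS * len(FRAMEWORKS)

# Ordered (source, target) pairs with source != target.
directed_pairs = list(permutations(FRAMEWORKS, 2))

# Assumption: every application contributes one task per directed pair.
tasks = [
    (app_id, source, target)
    for app_id in range(NUM_APPLICATIONS)
    for source, target in directed_pairs
]

tasks_per_pair = Counter((source, target) for _, source, target in tasks)

print(variants, len(directed_pairs), len(tasks))  # 102 6 204
```

If the assumption holds, each of the six directed pairs carries 34 tasks, so differences in difficulty between directions are not an artifact of unequal task counts.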
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications between Spring, Jakarta EE, and Quarkus. It is constructed from 34 expert-written implementation triples (29 focused single-layer and 5 whole applications) that produce 102 variants, 1946 files, and 204 directed migration tasks. Correctness is defined via application-specific executable oracles requiring successful compilation, containerized deployment, and passage of behavioral tests on the observable interface. Evaluation of five state-of-the-art coding agents shows the strongest achieving 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, with only one of 204 tasks producing a fully behaviorally equivalent target. The work also derives a taxonomy of recurring failure categories and releases the benchmark, harness, and agent traces.
Significance. If the oracles are shown to be reliable and the application set representative, ScarfBench would fill a notable gap in software-engineering benchmarks by providing a reproducible, executable measure of automated cross-framework migration capability in enterprise Java. The public release of the full benchmark, harness, and agent traces is a clear strength that supports reproducibility and community follow-up work. The reported performance gap and failure taxonomy could usefully direct future research on LLM agents for multi-layer refactoring involving dependency injection, persistence, and deployment configuration.
major comments (2)
- [Abstract, benchmark construction paragraph] The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.
- [Abstract, benchmark construction paragraph] The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.
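One way the false-positive/negative rates requested in the first major comment could be computed is sketched below, assuming a validation set in which experts have independently labelled each candidate migration as behavior-preserving or not. The function name and input format are hypothetical, and no such validation data is reported in the abstract.

```python
from typing import Iterable, Optional, Tuple


def oracle_error_rates(
    labelled: Iterable[Tuple[bool, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Compute (false_positive_rate, false_negative_rate) for an oracle.

    Each item is (expert_says_equivalent, oracle_passes).
    A false positive is a non-equivalent migration the oracle accepts;
    a false negative is an equivalent migration the oracle rejects.
    A rate is None when its denominator is zero.
    """
    false_pos = false_neg = equivalent_total = non_equivalent_total = 0
    for expert_says_equivalent, oracle_passes in labelled:
        if expert_says_equivalent:
            equivalent_total += 1
            if not oracle_passes:
                false_neg += 1
        else:
            non_equivalent_total += 1
            if oracle_passes:
                false_pos += 1
    fp_rate = false_pos / non_equivalent_total if non_equivalent_total else None
    fn_rate = false_neg / equivalent_total if equivalent_total else None
    return fp_rate, fn_rate
```

Running the oracle over the expert-written target variants themselves would populate the known-good side of such a set: any rejection there is a false negative by construction.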
minor comments (2)
- The abstract would benefit from an explicit breakdown of the 204 tasks by source-target framework pair to clarify the reported asymmetry (e.g., Spring<->Quarkus vs. Jakarta-targeted).
- The release statement at the end of the abstract should include a brief note on the license and expected maintenance of the artifacts at https://scarfbench.info.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on ScarfBench. The comments on oracle validation and application representativeness are well-taken and point to areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and outline planned revisions.
Point-by-point responses
Referee: Abstract, paragraph on benchmark construction: The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.
Authors: We agree that explicit quantitative validation details would improve interpretability. The oracles rely on expert-constructed implementation triples where behavioral equivalence was ensured through the observable interface (public APIs, endpoints, and test suites). In the revised manuscript we will add a new subsection under benchmark construction that describes the triple development process, including that each triple received review by multiple domain experts for functional equivalence, and we will report available test coverage statistics for the behavioral oracles. We will also explicitly note the limitation regarding non-observable internal state, which is inherent to black-box oracles, and discuss why the compile-deploy-test pipeline reduces false positives in practice. These additions will be reflected in both the abstract and main text.
Revision: yes
Referee: Abstract, paragraph on benchmark construction: The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.
Authors: We acknowledge the value of quantifying selection criteria to mitigate concerns about bias. The 34 applications were chosen by experts to span representative enterprise patterns such as dependency injection, JPA persistence, REST handling, and configuration differences across Spring, Jakarta EE, and Quarkus. In the revision we will expand Section 3 to include explicit selection criteria, a summary table of application characteristics (e.g., LOC ranges, layer coverage, framework distribution), and a brief discussion of how the set reflects common real-world migration scenarios. While exhaustive sampling of all enterprise Java codebases is not feasible, these additions will better support the generalizability claims.
Revision: yes
Circularity Check
No circularity: empirical benchmark evaluation on external agents
Full rationale
The paper constructs ScarfBench from 34 expert-written application triples and measures performance of five external state-of-the-art coding agents, reporting direct empirical pass rates (15.3% focused, 12.2% whole-application) and one fully equivalent target out of 204 tasks. These outcomes are obtained by executing the agents against the provided source applications and oracles; they do not reduce to any fitted parameter, self-definition, or self-citation chain. No equations or derivations appear; the central claims rest on the benchmark's construction and external evaluation rather than internal re-use of prior results by the same authors. The paper is therefore self-contained against external benchmarks and agents.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert-written implementation triples provide valid behavior-preserving variants across frameworks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "We introduce SCARFBENCH, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications... 204 directed migration tasks... scored by 1,331 expert-written tests in a containerized harness"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.