arxiv: 2605.06754 · v2 · pith:SGPTAII7new · submitted 2026-05-07 · 💻 cs.SE

ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Pith reviewed 2026-05-20 23:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords: benchmark · enterprise Java · framework migration · behavior-preserving refactoring · coding agents · Spring · Jakarta EE · Quarkus

The pith

Even the strongest state-of-the-art coding agent passes only about 15 percent of behavioral tests on enterprise Java framework migrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScarfBench, a collection of 34 expert-written Java applications with 204 directed migration tasks between Spring, Jakarta EE, and Quarkus. Each task supplies a working source application and requires an agent to produce an equivalent implementation in the target framework. Correctness is checked by executable oracles that verify compilation, containerized deployment, and passage of behavioral tests on the observable interface. Evaluation of five current agents shows the strongest reaching just 15.3 percent aggregate test pass rate on focused-layer tasks and 12.2 percent on whole applications, with only a single task producing a fully behaviorally equivalent result. The benchmark also surfaces asymmetric difficulty across framework pairs and layers, plus recurring failure modes in build, deploy, and test phases.
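The task count follows from simple enumeration, which the sketch below makes explicit. It assumes that every application exists in all three frameworks and that each ordered (source, target) pair is one task; the application names are placeholders, not the benchmark's real names.

```python
from itertools import permutations

FRAMEWORKS = ("spring", "jakarta", "quarkus")


def directed_tasks(app_names):
    """Return one (app, source, target) tuple per ordered framework pair."""
    return [
        (app, source, target)
        for app in app_names
        for source, target in permutations(FRAMEWORKS, 2)
    ]


# Placeholder names: 29 focused single-layer apps and 5 whole apps.
apps = [f"focused-{i:02d}" for i in range(29)] + [f"whole-{i}" for i in range(5)]
variants = [(app, fw) for app in apps for fw in FRAMEWORKS]
tasks = directed_tasks(apps)

print(len(apps), len(variants), len(tasks))  # 34 102 204
```

Under these assumptions each of the six directed framework pairs receives 34 tasks.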

Core claim

ScarfBench demonstrates that behavior-preserving cross-framework refactoring of enterprise Java applications is not yet reliably achievable by current coding agents, as measured by low success rates across 204 tasks and a derived taxonomy of failures at build, deploy, and test stages.

What carries the argument

The benchmark's 34 application triples and application-specific executable oracles that require a candidate to compile, deploy in a target runtime container, and pass behavioral tests over the observable interface.
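A minimal sketch of such a staged oracle is given below. It is not the benchmark's harness: `run_oracle`, `OracleResult`, and the callable-based stages are hypothetical names chosen for illustration, and real build and deploy steps would invoke a build tool and a container runtime rather than plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OracleResult:
    failed_stage: Optional[str]  # "build", "deploy", "test", or None on full success
    tests_passed: int
    tests_total: int

    @property
    def fully_equivalent(self) -> bool:
        # Full equivalence here means every stage succeeded and every test passed.
        return self.failed_stage is None and self.tests_total > 0


def run_oracle(
    build: Callable[[], bool],
    deploy: Callable[[], bool],
    tests: List[Callable[[], bool]],
) -> OracleResult:
    """Run the stages in order; a failed stage means later stages never execute."""
    total = len(tests)
    if not build():
        return OracleResult("build", 0, total)
    if not deploy():
        return OracleResult("deploy", 0, total)
    passed = sum(1 for test in tests if test())
    return OracleResult(None if passed == total else "test", passed, total)


# Hypothetical candidate that builds and deploys but fails one of two tests.
result = run_oracle(lambda: True, lambda: True, [lambda: True, lambda: False])
print(result.failed_stage, result.tests_passed, result.tests_total)  # test 1 2
```

Recording the first failing stage is what allows failures to be grouped into build, deploy, and test categories.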

Load-bearing premise

The 34 expert-written applications and their oracles are representative of real enterprise migration problems and correctly measure behavior preservation.

What would settle it

A new coding agent that achieves over 70 percent aggregate test pass rate on the full set of 204 tasks while producing implementations that independent review confirms preserve observable behavior.
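The two headline quantities can diverge sharply, as the sketch below illustrates. It assumes the aggregate pass rate is pooled (tests passed over tests attempted across all tasks); the paper may define it differently, for example as a mean of per-task rates. The function names and the data are made up for illustration.

```python
def aggregate_pass_rate(results):
    """Pooled pass rate over (passed, total) pairs, one pair per task."""
    total = sum(t for _, t in results)
    passed = sum(p for p, _ in results)
    return passed / total if total else 0.0


def fully_equivalent_count(results):
    """Number of tasks whose candidate passes every behavioral test."""
    return sum(1 for p, t in results if t > 0 and p == t)


# Made-up per-task results: a task that fails to build contributes 0 passed tests.
results = [(0, 10), (3, 10), (10, 10), (0, 20)]
print(f"{aggregate_pass_rate(results):.1%}", fully_equivalent_count(results))
# 26.0% 1
```

A threshold such as the 70 percent figure above constrains only the first quantity; confirming preserved behavior corresponds more closely to the second.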

Figures

Figures reproduced from arXiv: 2605.06754 by Advait Pavuluri, Ashita Saxena, Baishakhi Ray, Bridget McGinn, George Safta, Michele Merler, Rahul Krishna, Raju Pavuluri, Srikanth Tamilselvam.

Figure 1: Migration is a structural transformation across heterogeneous artifacts: porting Spring to Jakarta expands a 3-line interface into a 14-line CDI bean, rewrites derived queries as hand-written JPQL, externalizes auto-config into JPA and CDI descriptors, and adds a Java–XML string binding.
Figure 2: Overview of ScarfBench construction: 34 application families implemented across Spring, …
Figure 3: Each panel routes source-framework migration …
Figure 4 visualizes the aggregate SCARF leaderboard from …
Original abstract

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target framework; the agent must synthesize a target implementation preserving the source behavior. Correctness is evaluated by an application-specific executable oracle: the candidate must compile, deploy in a containerized target runtime, and pass behavioral tests over the application's observable interface. We evaluate five state-of-the-art coding agents on ScarfBench. The strongest achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target. Difficulty is asymmetric across framework directions and architectural layers: Spring<->Quarkus is the most tractable pair, and Jakarta-targeted migrations are hardest. From LLM-as-a-judge and expert adjudication of failed-task traces, we derive a taxonomy of recurring failure categories spanning build, deploy, and test stages. We release the benchmark, harness, and agent traces at https://scarfbench.info.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications between Spring, Jakarta EE, and Quarkus. It is constructed from 34 expert-written implementation triples (29 focused single-layer and 5 whole applications) that produce 102 variants, 1946 files, and 204 directed migration tasks. Correctness is defined via application-specific executable oracles requiring successful compilation, containerized deployment, and passage of behavioral tests on the observable interface. Evaluation of five state-of-the-art coding agents shows the strongest achieving 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, with only one of 204 tasks producing a fully behaviorally equivalent target. The work also derives a taxonomy of recurring failure categories and releases the benchmark, harness, and agent traces.

Significance. If the oracles are shown to be reliable and the application set representative, ScarfBench would fill a notable gap in software-engineering benchmarks by providing a reproducible, executable measure of automated cross-framework migration capability in enterprise Java. The public release of the full benchmark, harness, and agent traces is a clear strength that supports reproducibility and community follow-up work. The reported performance gap and failure taxonomy could usefully direct future research on LLM agents for multi-layer refactoring involving dependency injection, persistence, and deployment configuration.

major comments (2)
  1. [Abstract, benchmark construction paragraph] The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.
  2. [Abstract, benchmark construction paragraph] The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.
minor comments (2)
  1. The abstract would benefit from an explicit breakdown of the 204 tasks by source-target framework pair to clarify the reported asymmetry (e.g., Spring<->Quarkus vs. Jakarta-targeted).
  2. The release statement at the end of the abstract should include a brief note on the license and expected maintenance of the artifacts at https://scarfbench.info.
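The oracle validation requested in major comment 1 could be summarized as in the sketch below. It assumes a labeled set of candidates whose true equivalence is known independently (for instance, the expert-written variants as known-good cases and deliberately broken variants as known-bad cases); `oracle_error_rates` and the sample verdicts are hypothetical and do not come from the paper.

```python
def oracle_error_rates(verdicts):
    """verdicts: iterable of (truly_equivalent, oracle_accepts) booleans."""
    false_pos = false_neg = positives = negatives = 0
    for truly_equivalent, accepted in verdicts:
        if truly_equivalent:
            positives += 1
            false_neg += not accepted  # a good migration the oracle rejected
        else:
            negatives += 1
            false_pos += accepted  # a broken migration the oracle accepted
    return {
        "false_negative_rate": false_neg / positives if positives else None,
        "false_positive_rate": false_pos / negatives if negatives else None,
    }


# Made-up verdicts: three known-good and three known-bad candidates.
sample = [
    (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False),
]
print(oracle_error_rates(sample))
```

A high false-negative rate would mean the reported pass rates understate agent capability; a high false-positive rate would mean they overstate it.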

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on ScarfBench. The comments on oracle validation and application representativeness are well-taken and point to areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and outline planned revisions.

Point-by-point responses
  1. Referee: Abstract, paragraph on benchmark construction: The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.

    Authors: We agree that explicit quantitative validation details would improve interpretability. The oracles rely on expert-constructed implementation triples where behavioral equivalence was ensured through the observable interface (public APIs, endpoints, and test suites). In the revised manuscript we will add a new subsection under benchmark construction that describes the triple development process, including that each triple received review by multiple domain experts for functional equivalence, and we will report available test coverage statistics for the behavioral oracles. We will also explicitly note the limitation regarding non-observable internal state, which is inherent to black-box oracles, and discuss why the compile-deploy-test pipeline reduces false positives in practice. These additions will be reflected in both the abstract and main text. revision: yes

  2. Referee: Abstract, paragraph on benchmark construction: The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.

    Authors: We acknowledge the value of quantifying selection criteria to mitigate concerns about bias. The 34 applications were chosen by experts to span representative enterprise patterns such as dependency injection, JPA persistence, REST handling, and configuration differences across Spring, Jakarta EE, and Quarkus. In the revision we will expand Section 3 to include explicit selection criteria, a summary table of application characteristics (e.g., LOC ranges, layer coverage, framework distribution), and a brief discussion of how the set reflects common real-world migration scenarios. While exhaustive sampling of all enterprise Java codebases is not feasible, these additions will better support the generalizability claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation on external agents

Full rationale

The paper constructs ScarfBench from 34 expert-written application triples and measures performance of five external state-of-the-art coding agents, reporting direct empirical pass rates (15.3% focused, 12.2% whole-application) and one fully equivalent target out of 204 tasks. These outcomes are obtained by executing the agents against the provided source applications and oracles; they do not reduce to any fitted parameter, self-definition, or self-citation chain. No equations or derivations appear; the central claims rest on the benchmark's construction and external evaluation rather than internal re-use of prior results by the same authors. The paper is therefore self-contained against external benchmarks and agents.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that expert-written triples faithfully represent behavior-preserving migrations; no free parameters, new physical entities, or ad-hoc mathematical constructs are introduced.

axioms (1)
  • Domain assumption: Expert-written implementation triples provide valid behavior-preserving variants across frameworks.
    The benchmark is built directly from these 34 applications and 102 variants.

pith-pipeline@v0.9.0 · 5878 in / 1347 out tokens · 59173 ms · 2026-05-20T23:00:57.839178+00:00 · methodology

