ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Advait Pavuluri; Ashita Saxena; Baishakhi Ray; Bridget McGinn; George Safta; Michele Merler; Rahul Krishna; Raju Pavuluri; Srikanth Tamilselvam

arxiv: 2605.06754 · v2 · pith:SGPTAII7new · submitted 2026-05-07 · 💻 cs.SE

Pith reviewed 2026-05-20 23:00 UTC · model grok-4.3

classification 💻 cs.SE

keywords benchmark · enterprise Java · framework migration · behavior-preserving refactoring · coding agents · Spring · Jakarta EE · Quarkus

The pith

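The strongest of five state-of-the-art coding agents passes only about 15 percent of behavioral tests on enterprise Java framework migrations, and only one of 204 tasks yields a fully equivalent result.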

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScarfBench, a collection of 34 expert-written Java applications with 204 directed migration tasks between Spring, Jakarta EE, and Quarkus. Each task supplies a working source application and requires an agent to produce an equivalent implementation in the target framework. Correctness is checked by executable oracles that verify compilation, containerized deployment, and passage of behavioral tests on the observable interface. Evaluation of five current agents shows the strongest reaching just 15.3 percent aggregate test pass rate on focused-layer tasks and 12.2 percent on whole applications, with only a single task producing a fully behaviorally equivalent result. The benchmark also surfaces asymmetric difficulty across framework pairs and layers, plus recurring failure modes in build, deploy, and test phases.
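The two headline numbers measure different things: an aggregate test pass rate and a count of fully equivalent tasks. The sketch below illustrates how they can diverge. It assumes the aggregate is a pooled ratio of passed tests to total tests, and that a candidate that fails to compile or deploy earns no test credit; the paper may define the aggregate differently. The names `ScarfMetrics` and `TaskResult` are hypothetical and are not taken from the ScarfBench harness.

```java
import java.util.List;

final class ScarfMetrics {
    /** Hypothetical per-task outcome; fields are illustrative, not the harness schema. */
    record TaskResult(boolean compiled, boolean deployed, int testsPassed, int testsTotal) {
        /** Assumption: tests only count when the candidate both compiled and deployed. */
        int creditedPasses() {
            return (compiled && deployed) ? testsPassed : 0;
        }

        /** A task counts as fully equivalent only if every behavioral test passed. */
        boolean fullyEquivalent() {
            return testsTotal > 0 && creditedPasses() == testsTotal;
        }
    }

    /** Assumed definition: pooled passed tests over pooled total tests. */
    static double aggregateTestPass(List<TaskResult> results) {
        int passed = 0;
        int total = 0;
        for (TaskResult r : results) {
            passed += r.creditedPasses();
            total += r.testsTotal();
        }
        return total == 0 ? 0.0 : (double) passed / total;
    }

    /** Number of tasks on which all behavioral tests passed. */
    static long fullyEquivalentCount(List<TaskResult> results) {
        return results.stream().filter(TaskResult::fullyEquivalent).count();
    }

    public static void main(String[] args) {
        // Toy data, not benchmark results.
        List<TaskResult> toy = List.of(
            new TaskResult(true, true, 10, 10),   // all tests pass
            new TaskResult(true, true, 2, 10),    // partial pass
            new TaskResult(true, false, 0, 10),   // deploy failure
            new TaskResult(false, false, 0, 10)); // build failure
        System.out.println("aggregate test pass = " + aggregateTestPass(toy));
        System.out.println("fully equivalent tasks = " + fullyEquivalentCount(toy));
    }
}
```

On the toy data, 12 of 40 tests pass (a 30 percent aggregate) while only one of four tasks is fully equivalent, which mirrors the gap between the reported pass rates and the single fully equivalent task.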

Core claim

ScarfBench demonstrates that behavior-preserving cross-framework refactoring of enterprise Java applications is not yet reliably achievable by current coding agents, as measured by low success rates across 204 tasks and a derived taxonomy of failures at build, deploy, and test stages.

What carries the argument

The benchmark's 34 application triples and application-specific executable oracles that require a candidate to compile, deploy in a target runtime container, and pass behavioral tests over the observable interface.
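The compile, deploy, and test requirement can be read as a gated pipeline in which each stage runs only if the previous one succeeded. The following is a minimal sketch of that gating under stated assumptions: the stage callbacks stand in for a real build tool, container runtime, and test runner, and the `StagedOracle`, `Outcome`, and `Verdict` names are hypothetical rather than part of the released harness.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

final class StagedOracle {
    /** Where a candidate migration ended up; labels are illustrative. */
    enum Outcome { BUILD_FAILED, DEPLOY_FAILED, TESTS_FAILED, EQUIVALENT }

    /** Final outcome plus behavioral test counts. */
    record Verdict(Outcome outcome, int testsPassed, int testsTotal) {}

    /**
     * Runs the stages in order and stops at the first failure.
     * The suppliers are placeholders for real build, deploy, and test steps.
     */
    static Verdict evaluate(BooleanSupplier compile,
                            BooleanSupplier deploy,
                            IntSupplier runTests,
                            int testsTotal) {
        if (!compile.getAsBoolean()) {
            return new Verdict(Outcome.BUILD_FAILED, 0, testsTotal);
        }
        if (!deploy.getAsBoolean()) {
            return new Verdict(Outcome.DEPLOY_FAILED, 0, testsTotal);
        }
        int passed = runTests.getAsInt();
        Outcome outcome = (passed == testsTotal) ? Outcome.EQUIVALENT : Outcome.TESTS_FAILED;
        return new Verdict(outcome, passed, testsTotal);
    }

    public static void main(String[] args) {
        // Toy candidate: compiles and deploys, but passes 3 of 5 behavioral tests.
        Verdict verdict = evaluate(() -> true, () -> true, () -> 3, 5);
        System.out.println(verdict);
    }
}
```

Stopping at the first failed stage is also what allows failures to be attributed to a build, deploy, or test phase, as in the failure taxonomy the paper reports.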

Load-bearing premise

The 34 expert-written applications and their oracles are representative of real enterprise migration problems and correctly measure behavior preservation.

What would settle it

A new coding agent that achieves over 70 percent aggregate test pass rate on the full set of 204 tasks while producing implementations that independent review confirms preserve observable behavior.

Figures

Figures reproduced from arXiv: 2605.06754 by Advait Pavuluri, Ashita Saxena, Baishakhi Ray, Bridget McGinn, George Safta, Michele Merler, Rahul Krishna, Raju Pavuluri, Srikanth Tamilselvam.

**Figure 1.** Migration is a structural transformation across heterogeneous artifacts: porting Spring to Jakarta expands a 3-line interface into a 14-line CDI bean, rewrites derived queries as hand-written JPQL, externalizes auto-config into JPA and CDI descriptors, and adds a Java–XML string binding.

**Figure 2.** Overview of ScarfBench construction: 34 application families implemented across Spring, …

**Figure 3.** Each panel routes source-framework migration …

**Figure 4.** Visualizes the aggregate SCARF leaderboard from …

Original abstract

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured. We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert-written implementation triples across Spring, Jakarta EE, and Quarkus: 34 applications (29 focused single-layer, 5 whole) yielding 102 variants (~151K lines across 1946 source and test files) and 204 directed refactoring tasks. Each task gives an agent a working source application and a target framework; the agent must synthesize a target implementation preserving the source behavior. Correctness is evaluated by an application-specific executable oracle: the candidate must compile, deploy in a containerized target runtime, and pass behavioral tests over the application's observable interface. We evaluate five state-of-the-art coding agents on ScarfBench. The strongest achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target. Difficulty is asymmetric across framework directions and architectural layers: Spring<->Quarkus is the most tractable pair, and Jakarta-targeted migrations are hardest. From LLM-as-a-judge and expert adjudication of failed-task traces, we derive a taxonomy of recurring failure categories spanning build, deploy, and test stages. We release the benchmark, harness, and agent traces at https://scarfbench.info.
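The counts in the abstract are consistent with a simple enumeration: 34 applications times 3 frameworks gives 102 variants, and 34 applications times 6 ordered framework pairs gives 204 directed tasks. The sketch below reproduces that arithmetic. It assumes every application contributes one task per ordered source and target pair, which would imply 34 tasks per direction; the abstract does not state the per-pair breakdown, and the type names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TaskEnumerator {
    enum Framework { SPRING, JAKARTA_EE, QUARKUS }

    /** One directed migration task: an application, a source framework, a target framework. */
    record Task(int appId, Framework source, Framework target) {}

    /** Assumption: each application yields one task per ordered pair of distinct frameworks. */
    static List<Task> enumerate(int applications) {
        List<Task> tasks = new ArrayList<>();
        for (int app = 1; app <= applications; app++) {
            for (Framework source : Framework.values()) {
                for (Framework target : Framework.values()) {
                    if (source != target) {
                        tasks.add(new Task(app, source, target));
                    }
                }
            }
        }
        return tasks;
    }

    /** One implementation variant per application per framework. */
    static int variants(int applications) {
        return applications * Framework.values().length;
    }

    public static void main(String[] args) {
        int applications = 34; // 29 focused single-layer + 5 whole applications
        System.out.println("variants = " + variants(applications));
        System.out.println("directed tasks = " + enumerate(applications).size());
    }
}
```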

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScarfBench gives a practical new benchmark for Java framework migrations with low agent success rates, but the oracles and app selection need clearer validation.

The main point is that this paper builds a benchmark of 204 directed tasks from 34 expert triples across Spring, Jakarta EE, and Quarkus, then shows that five current coding agents top out at 15.3 percent test pass on focused migrations and 12.2 percent on whole applications, with only one task producing a fully equivalent result. That gap is the headline result.

The construction itself is new: prior benchmarks handled bug fixes or language updates but left cross-framework behavior-preserving refactors unmeasured. The executable oracles that check compile, container deploy, and observable-interface tests are a reasonable way to score success, and the release of the harness plus traces lets others inspect the failures directly. The taxonomy of recurring issues across build, deploy, and test stages also gives concrete directions for improvement. The asymmetry they report, with Spring-Quarkus pairs easier than Jakarta-targeted ones, lines up with what one would expect from framework differences.

On the soft side, the oracle correctness and selection of the 34 applications still sit on expert judgment without the kind of quantitative checks that would make the numbers fully robust. Details on inter-expert agreement, false-positive rates on known-good migrations, or coverage of non-observable state would strengthen the claim that the low scores reflect agent limits rather than benchmark artifacts. If those checks are in the full methods they are not highlighted in the abstract, so a referee would likely ask for them.

This work is aimed at researchers building or evaluating refactoring agents for enterprise code. Anyone running agent experiments in that space would get value from the task set and the failure breakdown. It is worth sending for peer review because the benchmark artifact and the empirical gap are substantive enough to justify referee time, even if the validation evidence needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications between Spring, Jakarta EE, and Quarkus. It is constructed from 34 expert-written implementation triples (29 focused single-layer and 5 whole applications) that produce 102 variants, 1946 files, and 204 directed migration tasks. Correctness is defined via application-specific executable oracles requiring successful compilation, containerized deployment, and passage of behavioral tests on the observable interface. Evaluation of five state-of-the-art coding agents shows the strongest achieving 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, with only one of 204 tasks producing a fully behaviorally equivalent target. The work also derives a taxonomy of recurring failure categories and releases the benchmark, harness, and agent traces.

Significance. If the oracles are shown to be reliable and the application set representative, ScarfBench would fill a notable gap in software-engineering benchmarks by providing a reproducible, executable measure of automated cross-framework migration capability in enterprise Java. The public release of the full benchmark, harness, and agent traces is a clear strength that supports reproducibility and community follow-up work. The reported performance gap and failure taxonomy could usefully direct future research on LLM agents for multi-layer refactoring involving dependency injection, persistence, and deployment configuration.

major comments (2)

Abstract, paragraph on benchmark construction: The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.

Abstract, paragraph on benchmark construction: The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.

minor comments (2)

The abstract would benefit from an explicit breakdown of the 204 tasks by source-target framework pair to clarify the reported asymmetry (e.g., Spring<->Quarkus vs. Jakarta-targeted).
The release statement at the end of the abstract should include a brief note on the license and expected maintenance of the artifacts at https://scarfbench.info.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on ScarfBench. The comments on oracle validation and application representativeness are well-taken and point to areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and outline planned revisions.

Referee: Abstract, paragraph on benchmark construction: The description states that each task is evaluated by an application-specific executable oracle requiring compile + deploy + behavioral tests, yet provides no quantitative validation of oracle correctness (e.g., inter-expert agreement on triple construction, false-positive/negative rates on known-good migrations, or coverage of non-observable state). This is load-bearing for the central empirical claims because the headline results (15.3% focused-layer and 12.2% whole-application pass rates, 1/204 fully equivalent) are only interpretable if the oracles accurately detect behavior-preserving migrations.

Authors: We agree that explicit quantitative validation details would improve interpretability. The oracles rely on expert-constructed implementation triples where behavioral equivalence was ensured through the observable interface (public APIs, endpoints, and test suites). In the revised manuscript we will add a new subsection under benchmark construction that describes the triple development process, including that each triple received review by multiple domain experts for functional equivalence, and we will report available test coverage statistics for the behavioral oracles. We will also explicitly note the limitation regarding non-observable internal state, which is inherent to black-box oracles, and discuss why the compile-deploy-test pipeline reduces false positives in practice. These additions will be reflected in both the abstract and main text.

Revision: yes

Referee: Abstract, paragraph on benchmark construction: The selection criteria and representativeness of the 34 expert-written application triples (29 focused + 5 whole) are not quantified, leaving open the possibility of selection bias toward unusually difficult cases or omission of common enterprise patterns. This directly affects the generalizability of the difficulty findings and the claim that the benchmark captures real migration challenges.

Authors: We acknowledge the value of quantifying selection criteria to mitigate concerns about bias. The 34 applications were chosen by experts to span representative enterprise patterns such as dependency injection, JPA persistence, REST handling, and configuration differences across Spring, Jakarta EE, and Quarkus. In the revision we will expand Section 3 to include explicit selection criteria, a summary table of application characteristics (e.g., LOC ranges, layer coverage, framework distribution), and a brief discussion of how the set reflects common real-world migration scenarios. While exhaustive sampling of all enterprise Java codebases is not feasible, these additions will better support the generalizability claims.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation on external agents

The paper constructs ScarfBench from 34 expert-written application triples and measures performance of five external state-of-the-art coding agents, reporting direct empirical pass rates (15.3% focused, 12.2% whole-application) and one fully equivalent target out of 204 tasks. These outcomes are obtained by executing the agents against the provided source applications and oracles; they do not reduce to any fitted parameter, self-definition, or self-citation chain. No equations or derivations appear; the central claims rest on the benchmark's construction and external evaluation rather than internal re-use of prior results by the same authors. The paper is therefore self-contained against external benchmarks and agents.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that expert-written triples faithfully represent behavior-preserving migrations; no free parameters, new physical entities, or ad-hoc mathematical constructs are introduced.

axioms (1)

Domain assumption: Expert-written implementation triples provide valid behavior-preserving variants across frameworks. The benchmark is built directly from these 34 applications and 102 variants.

pith-pipeline@v0.9.0 · 5878 in / 1347 out tokens · 59173 ms · 2026-05-20T23:00:57.839178+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We introduce SCARFBENCH, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications... 204 directed migration tasks... scored by 1,331 expert-written tests in a containerized harness

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

