{"paper":{"title":"ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Current coding agents succeed on only 15 percent of behavior-preserving cross-framework migrations in enterprise Java.","cross_cats":[],"primary_cat":"cs.SE","authors_text":"Advait Pavuluri, Ashita Saxena, Baishakhi Ray, Bridget McGinn, George Safta, Michele Merler, Rahul Krishna, Raju Pavuluri, Srikanth Tamilselvam","submitted_at":"2026-05-07T16:05:35Z","abstract_excerpt":"Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence, request handling, and deployment. Existing software-engineering benchmarks cover bug fixing, feature implementation, and language or version modernization, but leave cross-framework refactoring largely unmeasured.\n  We introduce ScarfBench, a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications. It is built from expert"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The strongest agent achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 34 expert-written application triples and their associated test oracles are representative of real-world cross-framework migration difficulty and that passing the oracles guarantees behavior preservation outside the tested interface.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ScarfBench supplies 204 cross-framework Java migration tasks where the best agent passes only 15.3% of focused and 12.2% of whole-application tests.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Current coding agents succeed on only 15 percent of behavior-preserving cross-framework migrations in enterprise Java.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d19b79440a8d1d591c98a408f3b2f3a8cca8f85685ee8feef12f2dec7c5ab64a"},"source":{"id":"2605.06754","kind":"arxiv","version":2},"verdict":{"id":"a69dcdd9-4f52-45ec-9f43-2d387a4d2053","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-11T00:45:27.015711Z","strongest_claim":"The strongest agent achieves only 15.3% aggregate test pass on focused-layer migrations and 12.2% on whole applications, and only one of the 204 tasks yields a fully behaviorally equivalent target.","one_line_summary":"ScarfBench supplies 204 cross-framework Java migration tasks where the best agent passes only 15.3% of focused and 12.2% of whole-application tests.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 34 expert-written application triples and their associated test oracles are representative of real-world cross-framework migration difficulty and that passing the oracles guarantees behavior preservation outside the tested interface.","pith_extraction_headline":"Current coding agents succeed on only 15 percent of behavior-preserving cross-framework migrations in enterprise Java."},"integrity":{"clean":false,"summary":{"advisory":1,"critical":0,"by_detector":{"doi_compliance":{"total":1,"advisory":1,"critical":0,"informational":0}},"informational":0},"endpoint":"/pith/2605.06754/integrity.json","findings":[{"note":"DOI in the printed bibliography is fragmented by whitespace or line breaks. A longer candidate (10.1145/3793302.3793331.Keycloak) was visible in the surrounding text but could not be confirmed against doi.org as printed.","detector":"doi_compliance","severity":"advisory","ref_index":1,"audited_at":"2026-05-19T12:38:00.120586Z","detected_doi":"10.1145/3793302.3793331.Keycloak","finding_type":"recoverable_identifier","verdict_class":"incontrovertible","detected_arxiv_id":null}],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T18:31:18.864973Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T12:38:00.120586Z","status":"completed","version":"1.0.0","findings_count":1}],"snapshot_sha256":"2b73a4899fd74ae8c99b98bd081fda8771aa73f51fa46fa62e9cff33c5d1fddc"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"060f4b69c6962250900948366c608656b338bd0d6eb8c9caa0231aea82e216e6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}