
arxiv: 2605.17535 · v1 · pith:LOONUOBWnew · submitted 2026-05-17 · 💻 cs.SE

AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

Pith reviewed 2026-05-19 22:35 UTC · model grok-4.3

classification 💻 cs.SE
keywords legacy modernization · multi-agent LLMs · business logic preservation · behavioral specification graph · code migration · LLM agents · software engineering

The pith

A multi-agent framework with an explicit behavioral graph preserves business logic during legacy modernization where direct LLM translation loses it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that legacy modernization usually discards implicit business rules when treated as syntax translation. AgentModernize instead uses four specialized agents to extract rules, record them in a Behavioral Specification Graph, generate new code, and apply validation feedback. On eight telecom and banking scenarios the full pipeline with feedback was the only setup that produced non-zero preservation rates across every model backbone, while simpler prompting methods scored zero. The graph itself recovered 91.2 percent of the gold-standard rules, so the remaining difficulty lies in turning the graph into correct code. This matters because critical systems in finance and telecom depend on rules that are never written down as explicit statements.

Core claim

AgentModernize treats modernization as behavioral preservation rather than syntax translation. Four agents extract implicit rules from legacy code, encode them in an explicit Behavioral Specification Graph, generate target code from the graph, and iterate with validation feedback. In controlled tests on eight scenarios using three different model backbones, only the complete pipeline with feedback achieved non-zero mean preservation rates on every backbone; prompt-only and chain-of-thought baselines scored zero everywhere. The graph captured 91.2 percent of the gold-standard rules, indicating that extraction succeeds but code generation from the specification remains the bottleneck.
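The extract, specify, generate, validate loop described above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function names, the `ValidationReport` type, the toy stand-in agents, and the `max_rounds` limit are all assumptions introduced here, since this page does not describe the agents' interfaces or the number of feedback rounds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ValidationReport:
    passed: int
    total: int
    failures: List[str]  # names of rules whose tests failed


def modernize(
    legacy_source: str,
    extract: Callable[[str], List[str]],
    specify: Callable[[List[str]], dict],
    generate: Callable[[dict, List[str]], str],
    validate: Callable[[str], ValidationReport],
    max_rounds: int = 3,  # assumption: the page does not state a round limit
) -> Tuple[str, ValidationReport]:
    """Run the four stages once, then regenerate with validation feedback."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    graph = specify(extract(legacy_source))  # the specification is built before any code
    feedback: List[str] = []
    best = None
    for _ in range(max_rounds):
        code = generate(graph, feedback)
        report = validate(code)
        if best is None or report.passed > best[1].passed:
            best = (code, report)
        if report.passed == report.total:
            break
        feedback = report.failures  # failed rules go back to the generator
    return best


# Toy stand-ins for the four agents (hypothetical, for illustration only).
RULES = ["late_fee", "grace_period"]


def toy_extract(source: str) -> List[str]:
    return [rule for rule in RULES if rule in source]


def toy_specify(rules: List[str]) -> dict:
    return {"rules": rules}


def toy_generate(graph: dict, feedback: List[str]) -> str:
    # Deliberately drops every rule but the first unless feedback names it.
    kept = graph["rules"][:1] + [r for r in feedback if r in graph["rules"]]
    return "\n".join(f"def {rule}(): ..." for rule in kept)


def toy_validate(code: str) -> ValidationReport:
    failures = [rule for rule in RULES if f"def {rule}(" not in code]
    return ValidationReport(len(RULES) - len(failures), len(RULES), failures)


legacy = "IF late_fee THEN ... UNLESS grace_period ..."
agents = (toy_extract, toy_specify, toy_generate, toy_validate)
_, one_shot = modernize(legacy, *agents, max_rounds=1)
_, with_feedback = modernize(legacy, *agents)
print(one_shot.passed, with_feedback.passed)
```

In this toy setup the single-pass run keeps only one of the two rules, while the run with feedback recovers both. It mirrors the shape of the claim, not its measured magnitude.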

What carries the argument

Behavioral Specification Graph (BSG), an explicit, inspectable structure that records extracted business rules, edge cases, and cross-module constraints before any new code is written.
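One way to picture such a structure is a small typed graph whose nodes are rules or edge cases and whose edges are constraints between them. The sketch below is an assumption made for illustration: this page does not give the paper's BSG schema, so the field names, the `kind` values, and the billing example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RuleNode:
    rule_id: str
    kind: str         # hypothetical values: "rule" or "edge_case"
    module: str       # legacy module the rule was extracted from
    description: str


@dataclass
class BehavioralSpecGraph:
    nodes: Dict[str, RuleNode] = field(default_factory=dict)
    # Each edge is (source rule id, target rule id, relation label).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: RuleNode) -> None:
        if node.rule_id in self.nodes:
            raise ValueError(f"duplicate rule id: {node.rule_id}")
        self.nodes[node.rule_id] = node

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        for rule_id in (src, dst):
            if rule_id not in self.nodes:
                raise KeyError(f"unknown rule id: {rule_id}")
        self.edges.append((src, dst, relation))

    def cross_module_constraints(self) -> List[Tuple[str, str, str]]:
        """Edges whose endpoints live in different legacy modules."""
        return [
            (src, dst, rel)
            for src, dst, rel in self.edges
            if self.nodes[src].module != self.nodes[dst].module
        ]


# Illustrative content only; these rules are not taken from the paper.
bsg = BehavioralSpecGraph()
bsg.add_node(RuleNode("R1", "rule", "billing", "Apply late fee after due date"))
bsg.add_node(RuleNode("R2", "edge_case", "accounts", "No fee for suspended accounts"))
bsg.add_node(RuleNode("R3", "rule", "billing", "Round fees to two decimals"))
bsg.add_edge("R1", "R2", "requires")
bsg.add_edge("R1", "R3", "followed_by")
print(bsg.cross_module_constraints())
```

Keeping the graph as plain data is what makes it inspectable: a reviewer or a checker can list every cross-module constraint before any target code exists.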

If this is right

  • Explicit specification steps before code generation are required to retain business logic that direct translation loses.
  • Feedback between validation and generation agents is necessary to reach a non-zero mean preservation rate on every model backbone.
  • Once rules are captured in the graph, the remaining performance gap occurs during code generation from that graph.
  • Specialized multi-agent pipelines outperform single-prompt or chain-of-thought LLM approaches for logic-preserving migration.
  • With the full feedback loop in place, mean preservation is non-zero on all three backbones tested, although it stays below 20% on each (9.4%, 8.1%, and 19.4%).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-based intermediate could be applied to other migration tasks where domain rules must survive, such as scientific computing or database refactoring.
  • Combining the BSG with automated checkers might verify that generated code satisfies the extracted rules without relying solely on test execution.
  • The results suggest that similar structured intermediaries could help LLMs on any task that needs to preserve implicit constraints rather than just produce fluent output.
  • Evaluating the method on larger, publicly available legacy systems would test whether the eight-scenario results generalize.

Load-bearing premise

The gold-standard tests used for evaluation cover every implicit business rule, edge case, and cross-module constraint present in the original legacy systems.

What would settle it

A new legacy codebase where the full AgentModernize pipeline with feedback still produces zero business-logic preservation on the supplied tests, or where the BSG misses rules that the tests do not exercise.

Figures

Figures reproduced from arXiv: 2605.17535 by Marnim Galib, Sheikh Nazib Ahmed.

Figure 1: AgentModernize Pipeline Architecture. Solid arrows show the forward […]
Figure 2: Per-scenario BER comparison across three models. All three models […]
Original abstract

Legacy modernization breaks business logic. Most tools and LLM-based approaches treat modernization as syntax translation, losing implicit rules, edge-case handling, and cross-module constraints. We present AgentModernize, a multi-agent framework that treats modernization as a behavioral preservation problem. Four specialized agents handle extraction, specification, code generation, and validation. The key intermediate artifact -- a Behavioral Specification Graph (BSG) -- forces extracted business logic to be explicit and inspectable before any code is generated. We evaluated on LegacyModernize-8, eight scenarios spanning telecom and banking, using three models (GPT-4o-mini, GPT-4o, GPT-5.3-codex) under a fair protocol: same gold-standard tests, 3 trials, temperature 0.0. Full AgentModernize with feedback was the only configuration with non-zero mean BER under every backbone. SP-LLM and CoT-LLM scored 0.0% on every scenario, on every backbone. AgentModernize without feedback scored 0.0% mean BER with GPT-4o-mini and GPT-5.3-codex; under GPT-4o it achieved non-zero BER only on S1 (44.4%; 5.6% mean over scenarios). Mean BER for full AgentModernize was 9.4% (mini), 8.1% (GPT-4o), and 19.4% (codex). The BSG captures 91.2% of gold-standard rules, confirming that the bottleneck is code generation, not extraction.
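The abstract reports BER without defining it. The simulated rebuttal further down this page gives it as the mean, over scenarios, of the fraction of gold-standard tests passed. The Python sketch below implements that definition. The per-scenario test counts are hypothetical, since this page does not list them; they were chosen so that one scenario passing 4 of 9 tests and seven scenarios passing none reproduces the 44.4% and 5.6% figures quoted for AgentModernize without feedback under GPT-4o.

```python
from typing import Mapping, Tuple


def behavioral_equivalence_rate(results: Mapping[str, Tuple[int, int]]) -> float:
    """Mean over scenarios of passed / total gold-standard tests (a fraction)."""
    if not results:
        raise ValueError("at least one scenario is required")
    ratios = []
    for scenario, (passed, total) in results.items():
        if total <= 0 or not 0 <= passed <= total:
            raise ValueError(f"invalid counts for {scenario}: {passed}/{total}")
        ratios.append(passed / total)
    return sum(ratios) / len(ratios)


# Hypothetical counts: S1 passes 4 of 9 tests, S2..S8 pass none.
no_feedback_gpt4o = {"S1": (4, 9)}
no_feedback_gpt4o.update({f"S{i}": (0, 9) for i in range(2, 9)})

s1_rate = 4 / 9
mean_ber = behavioral_equivalence_rate(no_feedback_gpt4o)
print(f"S1: {s1_rate:.1%}, mean BER: {mean_ber:.1%}")
```

Because the mean is taken over per-scenario ratios rather than over pooled tests, a single partially passing scenario out of eight contributes one eighth of its own rate, which is how 44.4% on S1 becomes 5.6% overall.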

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AgentModernize, a multi-agent LLM framework for legacy modernization that treats the task as behavioral preservation rather than syntactic translation. It uses four specialized agents and an intermediate Behavioral Specification Graph (BSG) to extract, specify, generate, and validate business logic. On the LegacyModernize-8 benchmark (eight telecom and banking scenarios) with three backbones and a fixed protocol of gold-standard tests, three trials, and temperature 0.0, the full system with feedback is claimed to be the only configuration yielding non-zero mean BER on every backbone (9.4% for GPT-4o-mini, 8.1% for GPT-4o, 19.4% for GPT-5.3-codex), while SP-LLM and CoT-LLM score exactly 0.0% everywhere; the BSG is reported to capture 91.2% of gold-standard rules, so the bottleneck is asserted to be code generation rather than extraction.

Significance. If the evaluation protocol and gold-standard completeness hold, the work would offer a concrete, inspectable intermediate representation for behavioral preservation in LLM-driven modernization, with measurable gains over simple prompting baselines on domain-specific scenarios; the multi-agent feedback loop and explicit BSG artifact are strengths that could be adopted more broadly in software engineering tooling for critical systems.

major comments (2)
  1. [Evaluation section] The abstract and evaluation report specific BER values (e.g., 9.4% mean BER for full AgentModernize with GPT-4o-mini) and the 91.2% BSG capture rate, yet supply no definition or formula for BER, no per-trial variance or error bars despite the three-trial protocol, and no description of how the gold-standard tests and rules were constructed or validated for completeness. These omissions leave the central claim, that extraction succeeds while generation is the sole limiter, only partially supported.
  2. [Evaluation section] The interpretation that 'the bottleneck is code generation, not extraction' depends on the LegacyModernize-8 gold-standard tests comprehensively encoding all implicit business logic, edge cases, and cross-module constraints from the original legacy systems. The manuscript provides no evidence or argument that the test suite is exhaustive; if undocumented invariants or rare paths exist outside the gold set, the 0.0% baseline scores could indicate test incompleteness rather than total behavioral failure, and the 91.2% BSG figure would be an upper bound on a partial rule set.
minor comments (1)
  1. [Abstract] The acronym BER is introduced without expansion or definition; a parenthetical definition or reference to its formal definition should appear on first use.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments focusing on the evaluation section. We address each point below and will revise the manuscript to improve clarity, add missing details, and appropriately qualify our claims.

point-by-point responses
  1. Referee: [Evaluation section] The abstract and evaluation report specific BER values (e.g., 9.4% mean BER for full AgentModernize with GPT-4o-mini) and the 91.2% BSG capture rate, yet supply no definition or formula for BER, no per-trial variance or error bars despite the three-trial protocol, and no description of how the gold-standard tests and rules were constructed or validated for completeness. These omissions leave the central claim, that extraction succeeds while generation is the sole limiter, only partially supported.

    Authors: We agree that the manuscript should have included an explicit definition of BER and more details on the gold-standard. BER is the Behavioral Equivalence Rate, defined as the average percentage of gold-standard test cases passed by the modernized code across scenarios. The formula is BER = (1/|S|) * sum_{s in S} (passed_tests_s / total_tests_s), where S is the set of scenarios. With temperature fixed at 0.0, all three trials produced identical outputs, yielding zero variance; we will state this explicitly and report per-scenario results. We will also add a description that the gold-standard tests and rules were derived via systematic reverse engineering of the legacy codebases by domain experts, followed by independent validation. These changes will be made in the revised evaluation section. revision: yes

  2. Referee: [Evaluation section] The interpretation that 'the bottleneck is code generation, not extraction' depends on the LegacyModernize-8 gold-standard tests comprehensively encoding all implicit business logic, edge cases, and cross-module constraints from the original legacy systems. The manuscript provides no evidence or argument that the test suite is exhaustive; if undocumented invariants or rare paths exist outside the gold set, the 0.0% baseline scores could indicate test incompleteness rather than total behavioral failure, and the 91.2% BSG figure would be an upper bound on a partial rule set.

    Authors: We acknowledge that no finite test suite can be proven exhaustive for legacy systems that may contain undocumented behaviors. Our claim is scoped to the LegacyModernize-8 benchmark: the BSG captures 91.2% of rules in the provided gold-standard, and only the full AgentModernize configuration produces non-zero BER on those tests while baselines score zero. In revision we will rephrase the bottleneck statement to be conditional on the gold-standard ('within the evaluated test suite, code generation is the primary limiter') and add an explicit limitations discussion noting that untested invariants could exist. This qualifies the interpretation without changing the reported results. revision: partial
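The second exchange turns on what the capture rate is measured against. The Python sketch below works that through, assuming the rate is the share of reference rules that also appear in the BSG; this page does not give the paper's exact matching procedure, and the rule identifiers and set sizes are hypothetical. It shows how a rate computed against the gold set can overstate coverage of the legacy system when undocumented rules exist and extraction missed them too.

```python
from typing import Set


def capture_rate(extracted: Set[str], reference: Set[str]) -> float:
    """Fraction of reference rules that also appear in the extracted set."""
    if not reference:
        raise ValueError("reference rule set must not be empty")
    return len(extracted & reference) / len(reference)


# Hypothetical rule sets; the identifiers and sizes are not from the paper.
gold_rules = {f"G{i}" for i in range(1, 11)}     # 10 rules in the gold standard
undocumented = {"U1", "U2"}                      # rules no test exercises
true_rules = gold_rules | undocumented           # what the legacy system really does

bsg_rules = gold_rules - {"G10"}                 # extraction missed one gold rule

against_gold = capture_rate(bsg_rules, gold_rules)
against_true = capture_rate(bsg_rules, true_rules)
print(f"vs gold: {against_gold:.1%}, vs true behavior: {against_true:.1%}")
```

Here the same BSG scores 90.0% against the gold set and 75.0% against the full rule set. The gap depends entirely on how many rules sit outside the gold standard and whether extraction found them, which is the quantity the referee says the manuscript leaves unmeasured.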

standing simulated objections not resolved
  • Absolute demonstration that the gold-standard tests cover every possible undocumented invariant or rare execution path in the original legacy systems

Circularity Check

0 steps flagged

No circularity: empirical results from external benchmarks

full rationale

The paper's key claims rest on experimental outcomes from running AgentModernize variants, SP-LLM, and CoT-LLM against fixed gold-standard tests on the LegacyModernize-8 benchmark. Quantities such as mean BER values, the 91.2% BSG rule capture rate, and the observation that only the full configuration yields non-zero BER are direct measurements from those runs, not quantities derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or ansatzes are presented that reduce outputs to inputs; the evaluation protocol uses the same external test suite for all methods, making the comparison independent of the framework's internal artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The framework introduces the BSG as a new intermediate representation and relies on the unstated assumption that LLM agents can reliably extract implicit business logic into an explicit graph form without systematic omissions.

invented entities (1)
  • Behavioral Specification Graph (BSG) · no independent evidence
    purpose: Explicit, inspectable representation of extracted business logic before code generation
    Presented as the key intermediate artifact that forces business logic to be explicit.

pith-pipeline@v0.9.0 · 5822 in / 1264 out tokens · 46569 ms · 2026-05-19T22:35:52.892603+00:00 · methodology
