AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

Marnim Galib; Sheikh Nazib Ahmed

arxiv: 2605.17535 · v1 · pith:LOONUOBWnew · submitted 2026-05-17 · 💻 cs.SE


Pith reviewed 2026-05-19 22:35 UTC · model grok-4.3

classification 💻 cs.SE

keywords legacy modernization · multi-agent LLMs · business logic preservation · behavioral specification graph · code migration · LLM agents · software engineering


The pith

A multi-agent framework with an explicit behavioral graph preserves business logic during legacy modernization where direct LLM translation loses it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that legacy modernization usually discards implicit business rules when treated as syntax translation. AgentModernize instead uses four specialized agents to extract rules, record them in a Behavioral Specification Graph, generate new code, and apply validation feedback. On eight telecom and banking scenarios the full pipeline with feedback was the only setup that produced non-zero preservation rates across every model backbone, while simpler prompting methods scored zero. The graph itself recovered 91.2 percent of the gold-standard rules, so the remaining difficulty lies in turning the graph into correct code. This matters because critical systems in finance and telecom depend on rules that are never written down as explicit statements.

Core claim

AgentModernize treats modernization as behavioral preservation rather than syntax translation. Four agents extract implicit rules from legacy code, encode them in an explicit Behavioral Specification Graph, generate target code from the graph, and iterate with validation feedback. In controlled tests on eight scenarios using three different model backbones, only the complete pipeline with feedback achieved non-zero mean preservation rates on every backbone; prompt-only and chain-of-thought baselines scored zero everywhere. The graph captured 91.2 percent of the gold-standard rules, indicating that extraction succeeds but code generation from the specification remains the bottleneck.
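The control flow described above (extract, specify, generate, validate, then feed failures back into generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Rule`, `ValidationReport`, `modernize`, and the toy agents are all hypothetical names, and the real agents would be LLM calls rather than the string-matching stubs used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Every name below is hypothetical; the review does not describe the paper's interfaces.

@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str

@dataclass
class ValidationReport:
    failed_rule_ids: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failed_rule_ids

Graph = Dict[str, Rule]  # stand-in for the Behavioral Specification Graph

def modernize(
    legacy_source: str,
    extract: Callable[[str], List[Rule]],
    specify: Callable[[List[Rule]], Graph],
    generate: Callable[[Graph, List[str]], str],
    validate: Callable[[str, Graph], ValidationReport],
    max_rounds: int = 3,
) -> Tuple[str, ValidationReport, int]:
    """Run extract -> specify -> generate -> validate, feeding failures back."""
    graph = specify(extract(legacy_source))
    feedback: List[str] = []
    code, report, rounds = "", ValidationReport(list(graph)), 0
    for rounds in range(1, max_rounds + 1):
        code = generate(graph, feedback)
        report = validate(code, graph)
        if report.passed:
            break
        feedback = report.failed_rule_ids  # validation feedback for the next attempt
    return code, report, rounds

# Toy agents so the loop can be exercised without any model calls.
def toy_extract(src: str) -> List[Rule]:
    lines = [ln.strip() for ln in src.splitlines() if ln.strip()]
    return [Rule(f"R{i}", text) for i, text in enumerate(lines, 1)]

def toy_specify(rules: List[Rule]) -> Graph:
    return {r.rule_id: r for r in rules}

def toy_generate(graph: Graph, feedback: List[str]) -> str:
    # First attempt covers only the first rule; retries add whatever validation flagged.
    covered = sorted(set(list(graph)[:1]) | set(feedback))
    return "\n".join(f"# implements {rid}" for rid in covered)

def toy_validate(code: str, graph: Graph) -> ValidationReport:
    present = set(code.splitlines())
    return ValidationReport([rid for rid in graph if f"# implements {rid}" not in present])

legacy = "Charge a late fee after 30 days overdue.\nWaive the late fee for zero balances."
code, report, rounds = modernize(legacy, toy_extract, toy_specify, toy_generate, toy_validate)
print(rounds, report.passed)
```

With `max_rounds=1` the same call stops after the first attempt and reports the rule that was missed, which loosely mirrors the no-feedback ablation discussed in the abstract.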

What carries the argument

Behavioral Specification Graph (BSG), an explicit, inspectable structure that records extracted business rules, edge cases, and cross-module constraints before any new code is written.
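One way to picture such a structure is a small typed graph. The sketch below is an assumption for illustration only: the three node kinds come from the description above, but the schema, the class names (`NodeKind`, `SpecNode`, `BehavioralSpecGraph`), the relation labels, and the example billing rules are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple

class NodeKind(Enum):
    RULE = "rule"
    EDGE_CASE = "edge_case"
    CROSS_MODULE = "cross_module_constraint"

@dataclass(frozen=True)
class SpecNode:
    node_id: str
    kind: NodeKind
    module: str      # legacy module the behavior was extracted from
    statement: str   # human-readable statement, so the graph stays inspectable

@dataclass
class BehavioralSpecGraph:
    nodes: Dict[str, SpecNode] = field(default_factory=dict)
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)  # (src, relation, dst)

    def add_node(self, node: SpecNode) -> None:
        if node.node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node.node_id}")
        self.nodes[node.node_id] = node

    def link(self, src: str, relation: str, dst: str) -> None:
        for node_id in (src, dst):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node id: {node_id}")
        self.edges.add((src, relation, dst))

    def neighbors(self, node_id: str, relation: str) -> List[str]:
        return sorted(d for s, r, d in self.edges if s == node_id and r == relation)

    def by_kind(self, kind: NodeKind) -> List[SpecNode]:
        return [n for n in self.nodes.values() if n.kind is kind]

# Hypothetical billing example, invented for illustration.
bsg = BehavioralSpecGraph()
bsg.add_node(SpecNode("R1", NodeKind.RULE, "billing", "Apply a late fee once an invoice is overdue."))
bsg.add_node(SpecNode("E1", NodeKind.EDGE_CASE, "billing", "No late fee when the outstanding balance is zero."))
bsg.add_node(SpecNode("C1", NodeKind.CROSS_MODULE, "accounts", "Suspended accounts accrue no fees."))
bsg.link("R1", "refined_by", "E1")
bsg.link("R1", "constrained_by", "C1")
print(bsg.neighbors("R1", "refined_by"))
```

Keeping each statement as plain text next to its kind and module is what would make the artifact reviewable by a person before any target code exists.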

If this is right

Explicit specification steps before code generation are required to retain business logic that direct translation loses.
Feedback between validation and generation agents is necessary to reach positive preservation on every scenario and model.
Once rules are captured in the graph, the remaining performance gap occurs during code generation from that graph.
Specialized multi-agent pipelines outperform single-prompt or chain-of-thought LLM approaches for logic-preserving migration.
The approach works across different model sizes when the full feedback loop is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-based intermediate could be applied to other migration tasks where domain rules must survive, such as scientific computing or database refactoring.
Combining the BSG with automated checkers might verify that generated code satisfies the extracted rules without relying solely on test execution.
The results suggest that similar structured intermediaries could help LLMs on any task that needs to preserve implicit constraints rather than just produce fluent output.
Evaluating the method on larger, publicly available legacy systems would test whether the eight-scenario results generalize.

Load-bearing premise

The gold-standard tests used for evaluation cover every implicit business rule, edge case, and cross-module constraint present in the original legacy systems.

What would settle it

A new legacy codebase where the full AgentModernize pipeline with feedback still produces zero business-logic preservation on the supplied tests, or where the BSG misses rules that the tests do not exercise.

Figures

Figures reproduced from arXiv: 2605.17535 by Marnim Galib, Sheikh Nazib Ahmed.

**Figure 2.** Per-scenario BER comparison across three models. All three models …

Original abstract

Legacy modernization breaks business logic. Most tools and LLM-based approaches treat modernization as syntax translation, losing implicit rules, edge-case handling, and cross-module constraints. We present AgentModernize, a multi-agent framework that treats modernization as a behavioral preservation problem. Four specialized agents handle extraction, specification, code generation, and validation. The key intermediate artifact -- a Behavioral Specification Graph (BSG) -- forces extracted business logic to be explicit and inspectable before any code is generated. We evaluated on LegacyModernize-8, eight scenarios spanning telecom and banking, using three models (GPT-4o-mini, GPT-4o, GPT-5.3-codex) under a fair protocol: same gold-standard tests, 3 trials, temperature 0.0. Full AgentModernize with feedback was the only configuration with non-zero mean BER under every backbone. SP-LLM and CoT-LLM scored 0.0% on every scenario, on every backbone. AgentModernize without feedback scored 0.0% mean BER with GPT-4o-mini and GPT-5.3-codex; under GPT-4o it achieved non-zero BER only on S1 (44.4%; 5.6% mean over scenarios). Mean BER for full AgentModernize was 9.4% (mini), 8.1% (GPT-4o), and 19.4% (codex). The BSG captures 91.2% of gold-standard rules, confirming that the bottleneck is code generation, not extraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames legacy modernization around an explicit Behavioral Specification Graph built by multi-agent LLMs, and shows this beats direct prompting on their benchmark, but the evaluation leaves test coverage and metric details thin.


The main thing here is that AgentModernize treats modernization as behavioral preservation rather than syntax translation. It runs four specialized agents to extract rules, build a Behavioral Specification Graph, generate code, and validate, with a feedback loop. On the LegacyModernize-8 scenarios the full version is the only setup that gets non-zero mean BER across every model, while SP-LLM and CoT-LLM stay at zero. The BSG itself is reported to capture 91.2% of the gold-standard rules, which the authors read as evidence that extraction is not the limiter and code generation is the remaining bottleneck.

Referee Report

2 major / 1 minor

Summary. The paper introduces AgentModernize, a multi-agent LLM framework for legacy modernization that treats the task as behavioral preservation rather than syntactic translation. It uses four specialized agents and an intermediate Behavioral Specification Graph (BSG) to extract, specify, generate, and validate business logic. On the LegacyModernize-8 benchmark (eight telecom and banking scenarios) with three backbones and a fixed protocol of gold-standard tests, three trials, and temperature 0.0, the full system with feedback is claimed to be the only configuration yielding non-zero mean BER on every backbone (9.4% for GPT-4o-mini, 8.1% for GPT-4o, 19.4% for GPT-5.3-codex), while SP-LLM and CoT-LLM score exactly 0.0% everywhere; the BSG is reported to capture 91.2% of gold-standard rules, so the bottleneck is asserted to be code generation rather than extraction.

Significance. If the evaluation protocol and gold-standard completeness hold, the work would offer a concrete, inspectable intermediate representation for behavioral preservation in LLM-driven modernization, with measurable gains over simple prompting baselines on domain-specific scenarios; the multi-agent feedback loop and explicit BSG artifact are strengths that could be adopted more broadly in software engineering tooling for critical systems.

major comments (2)

[Evaluation section] The abstract and evaluation report specific BER values (e.g., 9.4% mean BER for full AgentModernize with GPT-4o-mini) and the 91.2% BSG capture rate, yet supply no definition or formula for BER, no per-trial variance or error bars despite the three-trial protocol, and no description of how the gold-standard tests and rules were constructed or validated for completeness. These omissions leave the central claim (that extraction succeeds while generation is the sole limiter) only partially supported.
[Evaluation section] The interpretation that 'the bottleneck is code generation, not extraction' depends on the LegacyModernize-8 gold-standard tests comprehensively encoding all implicit business logic, edge cases, and cross-module constraints from the original legacy systems. The manuscript provides no evidence or argument that the test suite is exhaustive; if undocumented invariants or rare paths exist outside the gold set, the 0.0% baseline scores could indicate test incompleteness rather than total behavioral failure, and the 91.2% BSG figure would be an upper bound on a partial rule set.

minor comments (1)

[Abstract] The acronym BER is introduced without expansion or definition; a parenthetical definition or reference to its formal definition should appear on first use.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments focusing on the evaluation section. We address each point below and will revise the manuscript to improve clarity, add missing details, and appropriately qualify our claims.


Referee: [Evaluation section] The abstract and evaluation report specific BER values (e.g., 9.4% mean BER for full AgentModernize with GPT-4o-mini) and the 91.2% BSG capture rate, yet supply no definition or formula for BER, no per-trial variance or error bars despite the three-trial protocol, and no description of how the gold-standard tests and rules were constructed or validated for completeness. These omissions leave the central claim (that extraction succeeds while generation is the sole limiter) only partially supported.

Authors: We agree that the manuscript should have included an explicit definition of BER and more details on the gold-standard. BER is the Behavioral Equivalence Rate, defined as the average percentage of gold-standard test cases passed by the modernized code across scenarios. The formula is BER = (1/|S|) * sum_{s in S} (passed_tests_s / total_tests_s), where S is the set of scenarios. With temperature fixed at 0.0, all three trials produced identical outputs, yielding zero variance; we will state this explicitly and report per-scenario results. We will also add a description that the gold-standard tests and rules were derived via systematic reverse engineering of the legacy codebases by domain experts, followed by independent validation. These changes will be made in the revised evaluation section. revision: yes
Referee: [Evaluation section] The interpretation that 'the bottleneck is code generation, not extraction' depends on the LegacyModernize-8 gold-standard tests comprehensively encoding all implicit business logic, edge cases, and cross-module constraints from the original legacy systems. The manuscript provides no evidence or argument that the test suite is exhaustive; if undocumented invariants or rare paths exist outside the gold set, the 0.0% baseline scores could indicate test incompleteness rather than total behavioral failure, and the 91.2% BSG figure would be an upper bound on a partial rule set.

Authors: We acknowledge that no finite test suite can be proven exhaustive for legacy systems that may contain undocumented behaviors. Our claim is scoped to the LegacyModernize-8 benchmark: the BSG captures 91.2% of rules in the provided gold-standard, and only the full AgentModernize configuration produces non-zero BER on those tests while baselines score zero. In revision we will rephrase the bottleneck statement to be conditional on the gold-standard ('within the evaluated test suite, code generation is the primary limiter') and add an explicit limitations discussion noting that untested invariants could exist. This qualifies the interpretation without changing the reported results. revision: partial
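The BER definition given in the first response (the mean over scenarios of passed tests divided by total tests) can be checked with a few lines of arithmetic. The sketch below is illustrative: the function names are hypothetical, and the per-scenario test counts are assumptions, since the paper's actual counts are not reported here. A count of 9 tests for S1 is chosen only because 4/9 rounds to the reported 44.4%.

```python
from typing import Mapping, Tuple

def scenario_rate(passed: int, total: int) -> float:
    """Fraction of gold-standard tests passed in one scenario."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must lie between 0 and total")
    return passed / total

def mean_ber(results: Mapping[str, Tuple[int, int]]) -> float:
    """BER = (1/|S|) * sum over scenarios s of passed_s / total_s."""
    if not results:
        raise ValueError("at least one scenario is required")
    return sum(scenario_rate(p, t) for p, t in results.values()) / len(results)

# Assumed counts: S1 passes 4 of 9 tests, S2..S8 pass none (9 tests each is a guess).
results = {"S1": (4, 9)}
results.update({f"S{i}": (0, 9) for i in range(2, 9)})

ber = mean_ber(results)
print(f"S1 rate: {100 * scenario_rate(4, 9):.1f}%")  # 44.4%
print(f"mean BER over 8 scenarios: {100 * ber:.1f}%")  # 5.6%
```

Under these assumed counts the result matches the abstract's figures for AgentModernize without feedback on GPT-4o (44.4% on S1, 5.6% mean over scenarios). Because the mean is taken over per-scenario rates, every scenario carries equal weight regardless of how many tests it contains.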

standing simulated objections not resolved

Absolute demonstration that the gold-standard tests cover every possible undocumented invariant or rare execution path in the original legacy systems

Circularity Check

0 steps flagged

No circularity: empirical results from external benchmarks


The paper's key claims rest on experimental outcomes from running AgentModernize variants, SP-LLM, and CoT-LLM against fixed gold-standard tests on the LegacyModernize-8 benchmark. Quantities such as mean BER values, the 91.2% BSG rule capture rate, and the observation that only the full configuration yields non-zero BER are direct measurements from those runs, not quantities derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or ansatzes are presented that reduce outputs to inputs; the evaluation protocol uses the same external test suite for all methods, making the comparison independent of the framework's internal artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The framework introduces the BSG as a new intermediate representation and relies on the unstated assumption that LLM agents can reliably extract implicit business logic into an explicit graph form without systematic omissions.

invented entities (1)

Behavioral Specification Graph (BSG) · no independent evidence
purpose: Explicit, inspectable representation of extracted business logic before code generation
Presented as the key intermediate artifact that forces business logic to be explicit.

pith-pipeline@v0.9.0 · 5822 in / 1264 out tokens · 46569 ms · 2026-05-19T22:35:52.892603+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1] S. Comella-Dorda, K. Wallnau, R. Seacord, and J. Robert, “A survey of legacy system modernization approaches,” Software Engineering Institute, Carnegie Mellon University, Tech. Rep. CMU/SEI-2000-TN-003, 2000.

[2] H. P. Breivold, I. Crnkovic, and M. Larsson, “A systematic review of software architecture evolution research,” Information and Software Technology, vol. 54, no. 1, pp. 16–40, 2012. doi:10.1016/j.infsof.2011.08.002

[3] R. Kazman, S. G. Woods, and S. J. Carrière, “Requirements for integrating software architecture and reengineering models: CORUM II,” in Proc. Working Conference on Reverse Engineering (WCRE), 1998. doi:10.1109/WCRE.1998.723185

[4] S. Rugaber and K. Stirewalt, “Model-driven reverse engineering,” IEEE Software, vol. 21, no. 4, pp. 45–53, 2004. doi:10.1109/MS.2004.23

[5] M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.

[6] R. Li et al., “StarCoder: may the source be with you!” arXiv preprint arXiv:2305.06161, 2023.

[7] C. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Transactions on Software Engineering and Methodology, 2024. doi:10.1145/3695988

[8] H. Brunelière, J. Cabot, G. Dupé, and F. Madiot, “MoDisco: A model driven reverse engineering framework,” Information and Software Technology, vol. 56, no. 8, pp. 1012–1032, 2014. doi:10.1016/j.infsof.2014.04.007

[9] A. De Lucia, A. R. Fasolino, and M. Napoli, “An approach for reverse engineering of COBOL-based applications,” in Proc. European Conference on Software Maintenance and Reengineering, 2001. doi:10.1109/CSMR.2001.914982

[10] Z. Li et al., “Large language models for code analysis: Do LLMs really do their job?” arXiv preprint arXiv:2310.12357, 2023.

[11] Z. Li et al., “Automating code review activities by large-scale pre-training,” in Proc. 30th ACM Joint European Software Engineering Conference (ESEC/FSE), 2022. doi:10.1145/3540250.3549081

[12] C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in Proc. 30th ACM Joint European Software Engineering Conference (ESEC/FSE), 2022. doi:10.1145/3540250.3549101

[13] R. Pan et al., “Lost in translation: A study of bugs introduced by large language models while translating code,” in Proc. IEEE/ACM International Conference on Software Engineering (ICSE), 2024. doi:10.1145/3597503.3639226

[14] S. Hong et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” in ICLR, 2024.

[15] C. Qian et al., “ChatDev: Communicative agents for software development,” in ACL, 2024. doi:10.18653/v1/2024.acl-long.810

[16] K. Huang et al., “AgentCoder: Multi-agent-based code generation with iterative testing and optimisation,” arXiv preprint arXiv:2312.13010, 2023.

[17] R. Milner, Communication and Concurrency. Prentice Hall, 1989. ISBN: 978-0-13-115007-2

[18] W. McKeeman, “Differential testing for software,” Digital Technical Journal, vol. 10, no. 1, pp. 100–107, 1998.

[19] T. Y. Chen, S. C. Cheung, and S. M. Yiu, “Metamorphic testing: A new approach for generating next test cases,” Tech. Rep. HKUST-CS98-01, 1998.

[20] J. Yang et al., “SWE-agent: Agent-computer interfaces enable automated software engineering,” in NeurIPS, 2024.

[21] X. Wang et al., “OpenHands: An open platform for AI software developers as generalist agents,” arXiv preprint arXiv:2407.16741, 2024.

[22] G. T. Leavens, A. L. Baker, and C. Ruby, “JML: A notation for detailed design,” in Behavioral Specifications of Businesses and Systems, H. Kilov, B. Rumpe, and I. Simmonds, Eds. Springer, pp. 175–188, 1999. doi:10.1007/978-1-4615-5229-1_12

[23] B. Meyer, “Applying design by contract,” IEEE Computer, vol. 25, no. 10, pp. 40–51, 1992. doi:10.1109/2.161279

[24] Object Management Group, “Architecture-Driven Modernization: Knowledge Discovery Meta-Model (KDM), v1.4,” OMG Standard formal/2016-02-01, 2016. Available: https://www.omg.org/spec/KDM/1.4

[25] A. Madaan et al., “Self-Refine: Iterative refinement with self-feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.

[26] N. Shinn et al., “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
