pith. machine review for the scientific record.

arxiv: 2605.09894 · v1 · submitted 2026-05-11 · 💻 cs.SE · cs.MA

Recognition: 1 theorem link · Lean Theorem

Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:20 UTC · model grok-4.3

classification 💻 cs.SE · cs.MA
keywords COBOL modernization · Python migration · LLM orchestration · deterministic execution · agentic workflows · code translation · software engineering · robustness evaluation

The pith

Deterministic orchestration achieves accuracy comparable to LLM-controlled orchestration for COBOL-to-Python modernization, with better robustness and up to 3.5x lower token consumption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways of running modernization pipelines from COBOL to Python. In one, a fixed sequence of steps and checks controls the process. In the other, the language model decides what tool to call next at each point. By keeping the models, prompts, and tools identical and only changing who makes the decisions, the study measures differences in correctness, consistency across runs, and cost. It finds that the fixed approach performs as well on accuracy but varies less between runs and uses far fewer tokens.

Core claim

In a controlled study holding models, prompts, tools, and source programs constant, deterministic orchestration (a fixed execution policy with explicit validation stages) produces functional correctness comparable to LLM-controlled orchestration across multiple models. At the same time, it improves worst-case robustness, reduces performance variability across repeated runs, and cuts token consumption by up to 3.5x.

What carries the argument

Orchestration strategy: whether a predetermined execution policy or the LLM itself selects and sequences the tool executions and validation checks in the modernization workflow.
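To make the distinction concrete, here is a minimal sketch of the two control strategies in Python. Every name in it (translate, validate, llm_controlled_orchestration, the random routing stand-in) is invented for illustration and is not taken from the paper's framework; this is a sketch of the shape of the comparison, not the implementation.

```python
import random
from typing import Optional

def translate(cobol: str, feedback: str = "") -> str:
    """Stand-in for the real translation prompt; returns placeholder Python."""
    note = f" (repair: {feedback})" if feedback else ""
    return f"# Python translation of a {len(cobol)}-char COBOL program{note}"

def validate(stage: str, code: str) -> bool:
    """Stand-in validator; a real one would run syntax checks or unit tests."""
    return True

def deterministic_orchestration(cobol: str) -> str:
    """Fixed policy: the same stages run in the same order on every run."""
    code = translate(cobol)
    for stage in ("syntax", "unit_tests", "io_equivalence"):
        if not validate(stage, code):
            code = translate(cobol, feedback=f"failed {stage}")  # bounded repair
    return code

def llm_controlled_orchestration(cobol: str, max_steps: int = 10) -> str:
    """Agentic policy: a model call chooses the next tool at every step."""
    code: Optional[str] = None
    for _ in range(max_steps):
        # random.choice stands in for an LLM routing decision; each such
        # decision consumes tokens even when it does no useful work.
        action = random.choice(["translate", "validate", "stop"])
        if action == "translate":
            code = translate(cobol)
        elif action == "validate" and code is not None:
            validate("unit_tests", code)
        elif action == "stop" and code is not None:
            break
    return code if code is not None else translate(cobol)

print(deterministic_orchestration("IDENTIFICATION DIVISION. PROGRAM-ID. DEMO."))
```

The deterministic path makes a fixed, predictable number of model calls; the agentic path spends tokens on every routing decision and can take a different path on every run, which is where the variability and cost differences come from.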

If this is right

  • Functional correctness remains the same whether control is fixed or delegated to the model.
  • Worst-case outcomes improve under deterministic control because bad runs are less likely.
  • Variability in results across multiple executions of the same task decreases.
  • Operational costs drop significantly due to lower token usage in deterministic runs; a rough worked example follows this list.
  • Structured workflows with clear validation points favor fixed policies over full agentic control for stability and efficiency.
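A rough worked example of the robustness and cost claims above, using invented per-run numbers (none come from the paper) to show how mean accuracy, worst-case accuracy, run-to-run spread, and the token ratio would be computed:

```python
import statistics

# Hypothetical per-run accuracies and token counts for one model.
deterministic = {"accuracy": [0.82, 0.81, 0.83, 0.82, 0.82],
                 "tokens": [12_000, 12_100, 11_900, 12_000, 12_050]}
agentic = {"accuracy": [0.84, 0.70, 0.83, 0.75, 0.81],
           "tokens": [42_000, 38_000, 45_000, 40_000, 44_000]}

for name, runs in (("deterministic", deterministic), ("agentic", agentic)):
    acc = runs["accuracy"]
    print(f"{name:13s} mean={statistics.mean(acc):.3f} "
          f"worst={min(acc):.3f} stdev={statistics.stdev(acc):.4f} "
          f"tokens/run={statistics.mean(runs['tokens']):.0f}")

# At equal per-token pricing, a 3.5x token reduction leaves 1/3.5 ≈ 29% of
# the agentic cost, i.e. roughly 71% lower operational cost.
ratio = statistics.mean(agentic["tokens"]) / statistics.mean(deterministic["tokens"])
print(f"token ratio (agentic / deterministic) ≈ {ratio:.1f}x")
```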

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benefits may appear in other legacy code migration tasks where the steps are well understood in advance.
  • Teams handling production modernization might choose deterministic methods to avoid unpredictable costs and outputs.
  • Hybrid systems could be explored where the LLM suggests but does not control the overall flow; a sketch follows this list.
  • The findings question whether agentic LLM workflows are necessary for all software engineering tasks with defined processes.
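As a sketch of the hybrid idea flagged above (entirely hypothetical; the paper does not describe such a system), the model can be confined to an advisory role: it proposes a repair strategy, while a fixed policy decides what to accept and bounds the loop. All helper names below are invented.

```python
ALLOWED_SUGGESTIONS = {"retranslate", "relax_formatting", "give_up"}

def translate_stub(cobol: str, hint: str = "") -> str:
    """Stand-in translator; the hint tags which repair strategy was applied."""
    return f"# Python for a {len(cobol)}-char program ({hint or 'first pass'})"

def validate_stub(code: str) -> str:
    """Return a failure description, or '' if the candidate passes."""
    return "" if "retranslate" in code else "unit test 3 failed"

def llm_suggest(failure: str) -> str:
    """Stand-in for a single advisory model call; it never executes tools."""
    return "retranslate"

def hybrid_orchestration(cobol: str, max_repairs: int = 2) -> str:
    code = translate_stub(cobol)
    for _ in range(max_repairs):               # the fixed policy bounds the loop
        failure = validate_stub(code)
        if not failure:
            break
        suggestion = llm_suggest(failure)
        if suggestion in ALLOWED_SUGGESTIONS:  # policy filters what it accepts
            code = translate_stub(cobol, hint=suggestion)
    return code

print(hybrid_orchestration("IDENTIFICATION DIVISION. PROGRAM-ID. DEMO."))
```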

Load-bearing premise

That the tested COBOL programs and evaluation metrics capture the full range of real-world modernization challenges, and that holding every other factor fixed completely isolates the impact of the orchestration method.

What would settle it

Finding a set of COBOL programs or metrics where LLM-controlled orchestration produces measurably higher correctness or lower overall costs than the deterministic version.

Figures

Figures reproduced from arXiv: 2605.09894 by Naing Oo Lwin, Rajesh Kumar.

Figure 1. Overview of the ATLAS system architecture and end-to-end COBOL-to-Python modernization workflow.
Figure 2. Comparison of deterministic orchestration and …
Figure 3. Token usage per successful translation for …
Figure 4. Cost comparison between deterministic and …
read the original abstract

Modernizing legacy COBOL systems remains difficult due to scarce expertise, large and long-lived codebases, and strict correctness requirements. Recent large language model (LLM)-based modernization systems increasingly rely on agentic workflows in which the model controls multi-step tool execution. However, it remains unclear whether delegating execution control to the LLM improves correctness, robustness, or efficiency in structured software engineering workflows. We present a controlled empirical study of deterministic and LLM-controlled orchestration for COBOL-to-Python modernization. Using a unified experimental framework, we hold the language models, prompts, tools, configurations, and source programs constant while varying only the execution control strategy. This isolates orchestration as the sole experimental variable. We evaluate both approaches using functional correctness, robustness across repeated stochastic runs, and computational efficiency. Across multiple models, deterministic orchestration achieves comparable computational accuracy to LLM-controlled orchestration while improving worst-case robustness and reducing performance variability across runs. Deterministic execution also reduces token consumption by up to 3.5x, leading to substantially lower operational cost. These results suggest that, in structured modernization workflows with explicit validation stages, fixed execution policies provide more stable and cost-efficient behavior than fully agentic orchestration without reducing translation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper describes a controlled empirical comparison of deterministic orchestration versus LLM-controlled (agentic) orchestration for COBOL-to-Python code modernization. By fixing the underlying models, prompts, tools, configurations, and source programs, and varying only the execution control strategy, the study evaluates functional correctness, robustness across stochastic runs, and computational efficiency (token consumption). The key finding is that deterministic orchestration matches LLM-controlled orchestration in accuracy while offering better worst-case robustness, lower run-to-run variability, and up to a 3.5× reduction in token usage.

Significance. If these results hold under broader conditions, the work has significant implications for the design of LLM-based software modernization tools. The strength of the study lies in its controlled design that isolates the orchestration variable, providing clear evidence against the assumption that delegating control to LLMs always improves outcomes in structured workflows. This could encourage more hybrid or deterministic approaches in agentic systems, leading to more reliable and cost-effective solutions in legacy system modernization.

major comments (2)
  1. [Experimental Setup] The representativeness of the source COBOL programs is not sufficiently addressed. The manuscript does not specify the size, complexity, or features (such as database interactions, file I/O, or intricate business rules) of the programs used. As noted in the stress-test, if these are limited to simple procedural code, the advantages in robustness and the 3.5x token reduction may not extend to typical real-world COBOL modernization workloads, weakening the generalizability of the conclusions.
  2. [Results and Analysis] The claims regarding reduced performance variability and improved worst-case robustness lack supporting statistical analysis, such as standard deviation calculations, variance tests, or p-values across the repeated runs. Without these, the quantitative support for 'reducing performance variability' remains qualitative.
minor comments (1)
  1. [Abstract] Consider specifying the exact number of models tested and the number of repeated runs to allow readers to better gauge the reliability of the 'across multiple models' and variability claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below in detail. We believe these suggestions will help improve the clarity and rigor of the work, and we indicate where revisions will be made in the next version.

read point-by-point responses
  1. Referee: [Experimental Setup] The representativeness of the source COBOL programs is not sufficiently addressed. The manuscript does not specify the size, complexity, or features (such as database interactions, file I/O, or intricate business rules) of the programs used. As noted in the stress-test, if these are limited to simple procedural code, the advantages in robustness and the 3.5x token reduction may not extend to typical real-world COBOL modernization workloads, weakening the generalizability of the conclusions.

    Authors: We agree that the manuscript would benefit from more explicit characterization of the COBOL programs to address concerns about representativeness. In the revised version, we will add a new subsection in the Experimental Setup detailing the programs' sizes (LOC), structural complexity, and key features including file I/O, database interactions, and business rules. The selected programs come from an established benchmark for COBOL modernization and include a range of complexities, as partially indicated in our stress-test. We will also expand the discussion of limitations to note that while our controlled comparison isolates orchestration effects, further validation on larger, more diverse real-world codebases is warranted. revision: yes

  2. Referee: [Results and Analysis] The claims regarding reduced performance variability and improved worst-case robustness lack supporting statistical analysis, such as standard deviation calculations, variance tests, or p-values across the repeated runs. Without these, the quantitative support for 'reducing performance variability' remains qualitative.

    Authors: We concur that incorporating formal statistical analysis would strengthen the evidence for our claims on variability and robustness. The revised manuscript will report standard deviations and other descriptive statistics for the metrics across repeated runs. We will also include results from variance tests (e.g., F-test or Levene's test) comparing the two orchestration strategies and discuss the statistical significance of the observed differences in variability. This will move the support from qualitative to quantitative while maintaining the integrity of the original findings. revision: yes
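A minimal sketch of the variance comparison the authors commit to, assuming hypothetical per-run accuracy scores (the arrays are illustrative, not the paper's data):

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy per repeated stochastic run, one array per strategy.
det = np.array([0.82, 0.81, 0.83, 0.82, 0.82, 0.83, 0.81, 0.82])
llm = np.array([0.84, 0.70, 0.83, 0.75, 0.81, 0.79, 0.85, 0.72])

print(f"std dev: deterministic={det.std(ddof=1):.4f}, llm-controlled={llm.std(ddof=1):.4f}")

# Levene's test for equality of variances (robust to non-normality).
stat, p = stats.levene(det, llm)
print(f"Levene W={stat:.3f}, p={p:.4f}")  # small p suggests unequal variances

# Worst-case robustness: compare minima across the repeated runs.
print(f"worst run: deterministic={det.min():.2f}, llm-controlled={llm.min():.2f}")
```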

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivations or self-referential reductions

full rationale

The paper conducts a controlled empirical study that isolates orchestration strategy by holding models, prompts, tools, configurations, and source programs fixed while reporting functional correctness, robustness, and token consumption directly from experimental runs. No equations, fitted parameters, derivations, or self-citations appear in the load-bearing claims; results are not reduced to prior quantities by construction. The analysis stands on its own experimental runs rather than on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical comparison that relies on standard software-engineering evaluation practices and does not introduce new free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5513 in / 1126 out tokens · 48805 ms · 2026-05-12T04:20:15.212632+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1]

    arXiv preprint arXiv:2509.12973 (2025)

    Aamer Aljagthami, Mohammed Banabila, Musab Alshehri, Mohammed Kabini, and Mohammad D. Alahmadi. 2025. Evaluating Large Language Models for Code Translation: Effects of Prompt Language and Prompt Design. arXiv:2509.12973 [cs.SE] https://arxiv.org/abs/2509.12973

  2. [2]

    Agnieszka Ciborowska, Aleksandar Chakarov, and Rahul Pandita. 2021. Contemporary COBOL: Developers' Perspectives on Defects and Defect Location. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 227–238. doi:10.1109/icsme52107.2021.00027

  3. [3]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. arXiv:2402.01680 [cs.CL] https://arxiv.org/abs/2402.01680

  4. [4]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

  5. [5]

    Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised Translation of Programming Languages. arXiv:2006.03511 [cs.CL] https://arxiv.org/abs/2006.03511

  6. [6]

    Maria Emilia Mazzolenis and Ruirui Zhang. 2025. Agent WARPP: Workflow Adherence via Runtime Parallel Personalization. arXiv:2507.19543 [cs.AI] https://arxiv.org/abs/2507.19543

  7. [7]

    Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, and Michele Catasta

  8. [8]

    Measuring the impact of programming language distribution

    Measuring the Impact of Programming Language Distribution. arXiv:2302.01973 [cs.SE] https://arxiv.org/abs/2302.01973

  9. [9]

    Jialing Pan, Adrien Sadé, Jin Kim, Eric Soriano, Guillem Sole, and Sylvain Flamant

  10. [10]

    SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation

    SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation. arXiv:2310.15539 [cs.CL] https://arxiv.org/abs/2310.15539

  11. [11]

    Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Kun Zhao. 2025. Blueprint First, Model Second: A Framework for Deterministic LLM Workflow. arXiv:2508.02721 [cs.SE] https://arxiv.org/abs/2508.02721

  12. [12]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL] https://arxiv.org/abs/2302.04761

  13. [13]

    Qingxiao Tao, Tingrui Yu, Xiaodong Gu, and Beijun Shen. 2024. Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? arXiv:2410.09812 [cs.SE] https://arxiv.org/abs/2410.09812

  14. [14]

    Michele Tufano, Anisha Agarwal, Jinu Jang, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2024. AutoDev: Automated AI-Driven Development. arXiv:2403.08299 [cs.SE] https://arxiv.org/abs/2403.08299

  15. [15]

    Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, and Subbarao Kambhampati. 2023. On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark). arXiv:2302.06706 [cs.AI] https://arxiv.org/abs/2302.06706

  16. [16]

    Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li. 2024. Exploring and Unleashing the Power of Large Language Models in Automated Code Translation. arXiv:2404.14646 [cs.SE] https://arxiv.org/abs/2404.14646

  17. [17]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629

  18. [18]

    Qianqian Zhang, Jiajia Liao, Heting Ying, Yibo Ma, Haozhan Shen, Jingcheng Li, Peng Liu, Lu Zhang, Chunxin Fang, Kyusong Lee, Ruochen Xu, and Tiancheng Zhao. 2025. Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research. arXiv:2505.24354 [cs.CL] https://arxiv.org/abs/2505.24354