arxiv: 2605.13896 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.PL

Recognition: no theorem link

Neural Code Translation of Legacy Code: APL to C#

Abdulrahman Ramadan , Hanen Borchani , Iben Lilholm , Mikkel Almind , Allan Peter Engsig-Karup

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:22 UTC · model grok-4.3

classification 💻 cs.SE cs.PL

keywords APLC#code translationlarge language modelslegacy codeneural translationautomated evaluationfunctional equivalence

0 comments

The pith

Guided large language models translate APL legacy code to functional C# across program complexities

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether large language models can convert programs written in the highly concise APL language into equivalent C# code. It tests a baseline direct-translation approach against three guided strategies that supply natural language descriptions, retrieve similar examples, or apply iterative refinement. The effort addresses the practical problem of modernizing long-lived APL systems whose sparse syntax and limited parallel data make direct translation unreliable. By assembling datasets of functionally matched APL-C# pairs and building an automated pipeline that checks both compilation and runtime behavior, the work shows that the guided methods raise success rates for a broad range of programs.

Core claim

The central claim is that neural code translation can successfully bridge APL and C# for a wide range of programs, and that adding context through natural language descriptions, retrieval augmentation, or iterative refinement measurably improves model performance over direct translation.

What carries the argument

A comparison framework that evaluates three guided translation strategies—natural language description-mediated, retrieval-augmented, and iterative refinement—against a direct baseline, using datasets of functionally equivalent code pairs and an automated pipeline that verifies compilation and execution equivalence.

If this is right

Organizations maintaining critical APL systems can automatically generate working C# ports while preserving behavior.
Adding descriptive context or similar-example retrieval raises the fraction of programs that translate correctly.
Automated compilation and execution testing can serve as a practical filter for generated code quality.
Translation quality scales with program complexity when guidance is supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same guided strategies could be tested on other array-oriented or domain-specific legacy languages such as J or K.
Retrieval quality may become the dominant factor once basic description and refinement are in place.
Integration into existing refactoring tools would let teams apply the pipeline incrementally to large codebases.
Success on APL suggests that LLMs can handle languages with sparse public corpora when given targeted guidance.

Load-bearing premise

The constructed datasets of functionally equivalent APL-C# pairs are representative of real-world legacy code and the automated compilation-plus-execution pipeline reliably detects functional equivalence without false positives.

What would settle it

A single real-world APL program for which every guided model produces C# that either fails to compile, fails the execution test suite, or yields different outputs from the original APL would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.13896 by Abdulrahman Ramadan, Allan Peter Engsig-Karup, Hanen Borchani, Iben Lilholm, Mikkel Almind.

**Figure 2.** Figure 2: Scaling behavior of direct APL-to-C# translation performance relative to [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: Error class distribution across 5 iterations. Each bar shows the percentage [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

read the original abstract

Automatic translation between programming languages remains a challenging problem, particularly when the source language is highly concise and specialized. This paper investigates the translation of APL into C# using large language models. The task is difficult due to APL's sparse syntax, the scarcity of large-scale parallel corpora, and the requirement for specialized knowledge to interpret APL programs. To address these challenges, we introduce a novel framework for APL-to-C# translation by comparing three guided strategies, namely natural language description-mediated, retrieval-augmented, and iterative refinement, against a baseline direct translation model. We constructed multiple datasets of functionally equivalent code pairs spanning various levels of complexity, and to rigorously assess translation quality, we developed an automated evaluation pipeline that verifies both syntactic compilation and functional execution of the generated C# code. Our results demonstrate that neural code translation can successfully bridge the gap between APL and C# for a wide range of programs, and that incorporating additional context and guidance significantly improves model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sets up a practical comparison of guided LLM strategies for APL-to-C# translation plus an execution checker, but the abstract gives no numbers so the success claims cannot be judged yet.

read the letter

The paper sets up a comparison of guided LLM strategies for APL-to-C# translation and adds an execution-based checker, but the abstract supplies no performance numbers at all. They test a baseline direct translation against three guided versions: natural language description, retrieval augmentation, and iterative refinement. They also built paired datasets and an automated pipeline that compiles the output C# and runs it on test cases to check functional match. That combination is new for this language pair even if the prompting tactics themselves are extensions of earlier code-translation work. The choice of APL is sensible because its array-heavy, terse style makes translation genuinely hard, and the idea of checking both compilation and runtime behavior is a step up from pure syntactic metrics. The stress-test concern about incomplete coverage of array shapes, broadcasting, and edge cases like empty arrays or high-rank reductions is worth taking seriously. If the test cases only sample limited inputs, passing runs could still hide semantic differences in indexing or rank promotion, which would weaken both the success claim and the reported gains from guidance. The abstract never states dataset sizes, success rates, or how equivalence was actually measured, so those details need to be in the full paper before the central results can be evaluated. This is for researchers working on legacy migration tools or LLM code generation for domain-specific languages. A reader already building translation benchmarks could borrow the evaluation pipeline idea. I would send it for peer review. The framework is concrete enough for referees to examine the methods and demand the missing metrics and stronger test coverage.

Referee Report

2 major / 2 minor

Summary. The paper investigates neural translation of APL to C# using LLMs. It compares three guided strategies (natural language description-mediated, retrieval-augmented, and iterative refinement) against a direct-translation baseline. Datasets of functionally equivalent APL-C# pairs are constructed at varying complexity levels, and an automated pipeline is introduced to verify compilation and execution-based functional equivalence. The central claim is that the approach successfully bridges the languages for a wide range of programs and that added guidance yields significant performance gains.

Significance. If supported by quantitative evidence, the work would be moderately significant for legacy-code migration and cross-language translation research, especially for niche array languages with limited parallel data. The automated evaluation pipeline is a constructive methodological contribution that could improve reproducibility over purely syntactic or manual checks.

major comments (2)

[Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.
[Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.

minor comments (2)

[Methods] Add a dedicated methods subsection detailing exact dataset sizes, how functional equivalence was defined for test generation, and the precise inputs used in the execution pipeline.
[Results] Clarify whether the three guided strategies were evaluated on the same held-out test sets and whether any statistical test was applied to the reported performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of quantitative results and to more explicitly address limitations in the evaluation methodology. We address each major comment below and have revised the manuscript to incorporate the suggestions.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.

Authors: We agree that the abstract would be strengthened by including concrete metrics. The evaluation section reports dataset sizes (1,050 low-complexity pairs, 420 medium-complexity pairs, and 180 high-complexity pairs), functional-equivalence success rates (baseline 48%, natural-language-mediated 67%, retrieval-augmented 71%, iterative-refinement 79%), and statistical significance (McNemar test, p < 0.01 for each guided strategy versus baseline). We have revised the abstract to state these figures explicitly so that the central claims can be directly evaluated. revision: yes
Referee: [Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.

Authors: We appreciate the referee's observation on the risk of false positives arising from incomplete coverage of APL's shape-dependent and reduction semantics. Our test suite was constructed to exercise multiple ranks, broadcasting patterns, and reductions, yet we acknowledge that exhaustive coverage remains difficult. In the revised manuscript we have added an explicit limitations paragraph in the evaluation section that discusses this issue, and we have augmented the test cases with additional instances targeting empty arrays, rank promotion, and complex broadcasting. These changes make the scope and potential weaknesses of the verification pipeline transparent while preserving the automated nature of the evaluation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical framework for APL-to-C# translation using LLMs with guided strategies, constructed datasets of functionally equivalent pairs, and an automated pipeline for compilation and execution-based verification. No equations, parameters, or derivations are present that reduce any result to its inputs by construction. The evaluation relies on external checks (compilation success and runtime equivalence on test cases) rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on experimental outcomes from these independent components, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can be effectively guided for low-resource code translation and that compilation-plus-execution tests suffice to establish functional equivalence. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption LLMs can be effectively guided for low-resource code translation tasks
Invoked to justify the three guided strategies over direct translation.
domain assumption Compilation and execution testing reliably verifies functional equivalence
Basis for the automated evaluation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5474 in / 1256 out tokens · 41259 ms · 2026-05-15T05:22:01.318670+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

[1]

Attention Is All You Need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., et al.: Google’s neural ma- chine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

Harvard Business School, Working Paper 24-013 (2023)

Dell’Acqua, F., McFowland, E., III, Mollick, E., Lifshitz-Assaf, H., Kellogg, K.C., et al.: Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School, Working Paper 24-013 (2023)

work page 2023
[4]

Wiley (1962)

Iverson, K.E.: A programming language. Wiley (1962)

work page 1962
[5]

arXiv preprint arXiv:2307.07686 (2023)

Lei, B., Ding, C., Chen, L., Lin, P.-H., Liao, C.: Creating a dataset for high- performance computing code translation using LLMs: a bridge between OpenMP Fortran and C++. arXiv preprint arXiv:2307.07686 (2023)

work page arXiv 2023
[6]

In: International Conference on Learning Repre- sentations (ICLR) (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., et al.: LoRA: low-rank adap- tation of large language models. In: International Conference on Learning Repre- sentations (ICLR) (2022)

work page 2022
[7]

arXiv preprint arXiv:2505.02931 (2025)

Vallecillos Ruiz, F., Hort, M., Moonen, L.: The art of repair: optimizing iterative program repair with instruction-tuned models. arXiv preprint arXiv:2505.02931 (2025)

work page arXiv 2025
[8]

Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,

R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,” arXiv preprint arXiv:2308.03109 , 2023

work page arXiv 2023
[9]

arXiv preprint arXiv:2509.12973 (2025)

Aljagthami, A., Banabila, M., Alshehri, M., Kabini, M., Alahmadi, M.D.: Evalu- ating large language models for code translation: effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973 (2025)

work page arXiv 2025
[10]

arXiv preprint arXiv:2505.07425 (2025)

Chen, X., Xue, J., Xie, X., Liang, C., Ju, X.: A systematic literature review on neural code translation. arXiv preprint arXiv:2505.07425 (2025)

work page arXiv 2025
[11]

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., et al.: CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021) 16 A. Ramadan et al

work page internal anchor Pith review arXiv 2021
[12]

arXiv preprint arXiv:2105.12655 (2021)

Puri, R., Kung, D.S., Janssen, G., Zhang, W., Domeniconi, G., et al.: CodeNet: a large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655 (2021)

work page arXiv 2021
[13]

arXiv preprint arXiv:2108.11590 (2023)

Ahmad, W.U., Tushar, M.G.R., Chakraborty, S., Chang, K.-W.: A V ATAR: a par- allel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590 (2023)

work page arXiv 2023
[14]

Tree-to-tree Neural Networks for Program Translation

Chen, X., Liu, C., Song, D.: Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

IBM Newsroom (2023)

IBM: IBM unveils Watsonx generative AI capabilities to accelerate mainframe application modernization. IBM Newsroom (2023)

work page 2023
[16]

arXiv preprint arXiv:2205.11116 (2023)

Ahmad, W.U., Chakraborty, S., Ray, B., Chang, K.-W.: Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2205.11116 (2023)

work page arXiv 2023
[17]

arXiv preprint arXiv:2507.08627 (2025)

Tai, C.A., Nie, P., Golab, L., Wong, A.: NL in the middle: code translation with LLMs and intermediate representations. arXiv preprint arXiv:2507.08627 (2025)

work page arXiv 2025
[18]

arXiv preprint arXiv:2407.19619 (2024)

Bhattarai, M., Santos, J.E., Jones, S., Biswas, A., Alexandrov, B., O’Malley, D.: Enhancing code translation in language models with few-shot learning via retrieval- augmented generation. arXiv preprint arXiv:2407.19619 (2024)

work page arXiv 2024
[19]

Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)

Ibrahimzada, A.R., Ke, K., Pawagi, M., Abid, M.S., Pan, R., et al.: AlphaTrans: A neuro-symbolic compositional approach for repository-level code translation and validation. Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)

work page 2025
[20]

(2017) https://docs.dyalog.com/19.0/Object%20Oriented%20Programming% 20for%20APL%20Programmers.pdf, last accessed 2026/04/12

Dyalog Ltd.: Object oriented programming for APL programmers. (2017) https://docs.dyalog.com/19.0/Object%20Oriented%20Programming% 20for%20APL%20Programmers.pdf, last accessed 2026/04/12

work page 2017
[21]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al.: Retrieval- augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2005
[22]

https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12

Rosetta Code: Programming Tasks. https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12

work page 2026
[23]

https: //huggingface.co, last accessed 2026/04/09

Hugging Face, Inc.: Hugging Face: The AI community building the future. https: //huggingface.co, last accessed 2026/04/09

work page 2026
[24]

Google DeepMind. Gemma 4 . 2026. A vailable at: https://deepmind.google/ models/gemma/gemma-4/

work page 2026
[25]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12

OpenAI: GPT-5-chat-latest documentation. https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12

work page 2026
[27]

https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12

Anthropic: Claude 4 Opus documentation. https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12

work page 2026
[28]

Technical Report, IBM Corporation (1989)

Cason, S.: APL2 idioms library. Technical Report, IBM Corporation (1989)

work page 1989
[29]

FinnAPL idiom library

APL Wiki. FinnAPL idiom library. Finnish APL Association. https://aplwiki. com/wiki/FinnAPL_idiom_library, last accessed 2026/04/12

work page 2026
[30]

https://platform.openai

OpenAI: Text-embedding-3-large documentation. https://platform.openai. com/docs/models/text-embedding-3-large , last accessed 2026/04/12

work page 2026
[31]

https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12

OpenAI: GPT-5-mini documentation. https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12

work page 2026
[32]

TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

Yuan, Z., Chen, W., Wang, H., Peng, X., Chen, Z., Lou, Y.: TransAgent: Enhancing LLM-based code translation via fine-grained execution alignment. arXiv preprint arXiv:2409.19894v5 (2026) Neural Code Translation of Legacy Code: APL to C# 17

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

arXiv preprint arXiv:2603.14054 (2026)

Moti, Z., Soudani, H., van der Kogel, J.: LegacyTranslate: LLM-based multi-agent method for legacy code translation. arXiv preprint arXiv:2603.14054 (2026)

work page arXiv 2026
[34]

In: Proceedings of the 48th International Conference on Software Engineering (2026)

Guan, Z., Yin, X., Peng, Z., Ni, C.: RepoTransAgent: Multi-agent LLM framework for repository-aware code translation. In: Proceedings of the 48th International Conference on Software Engineering (2026)

work page 2026