pith. machine review for the scientific record. sign in

arxiv: 2605.13896 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.PL

Recognition: no theorem link

Neural Code Translation of Legacy Code: APL to C#

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:22 UTC · model grok-4.3

classification 💻 cs.SE cs.PL
keywords APLC#code translationlarge language modelslegacy codeneural translationautomated evaluationfunctional equivalence
0
0 comments X

The pith

Guided large language models translate APL legacy code to functional C# across program complexities

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether large language models can convert programs written in the highly concise APL language into equivalent C# code. It tests a baseline direct-translation approach against three guided strategies that supply natural language descriptions, retrieve similar examples, or apply iterative refinement. The effort addresses the practical problem of modernizing long-lived APL systems whose sparse syntax and limited parallel data make direct translation unreliable. By assembling datasets of functionally matched APL-C# pairs and building an automated pipeline that checks both compilation and runtime behavior, the work shows that the guided methods raise success rates for a broad range of programs.

Core claim

The central claim is that neural code translation can successfully bridge APL and C# for a wide range of programs, and that adding context through natural language descriptions, retrieval augmentation, or iterative refinement measurably improves model performance over direct translation.

What carries the argument

A comparison framework that evaluates three guided translation strategies—natural language description-mediated, retrieval-augmented, and iterative refinement—against a direct baseline, using datasets of functionally equivalent code pairs and an automated pipeline that verifies compilation and execution equivalence.

If this is right

  • Organizations maintaining critical APL systems can automatically generate working C# ports while preserving behavior.
  • Adding descriptive context or similar-example retrieval raises the fraction of programs that translate correctly.
  • Automated compilation and execution testing can serve as a practical filter for generated code quality.
  • Translation quality scales with program complexity when guidance is supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guided strategies could be tested on other array-oriented or domain-specific legacy languages such as J or K.
  • Retrieval quality may become the dominant factor once basic description and refinement are in place.
  • Integration into existing refactoring tools would let teams apply the pipeline incrementally to large codebases.
  • Success on APL suggests that LLMs can handle languages with sparse public corpora when given targeted guidance.

Load-bearing premise

The constructed datasets of functionally equivalent APL-C# pairs are representative of real-world legacy code and the automated compilation-plus-execution pipeline reliably detects functional equivalence without false positives.

What would settle it

A single real-world APL program for which every guided model produces C# that either fails to compile, fails the execution test suite, or yields different outputs from the original APL would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.13896 by Abdulrahman Ramadan, Allan Peter Engsig-Karup, Hanen Borchani, Iben Lilholm, Mikkel Almind.

Figure 1
Figure 1. Figure 1: Proposed guided APL-to-C# translation workflow. [PITH_FULL_IMAGE:figures/full_fig_p012_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Scaling behavior of direct APL-to-C# translation performance relative to [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Error class distribution across 5 iterations. Each bar shows the percentage [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
read the original abstract

Automatic translation between programming languages remains a challenging problem, particularly when the source language is highly concise and specialized. This paper investigates the translation of APL into C# using large language models. The task is difficult due to APL's sparse syntax, the scarcity of large-scale parallel corpora, and the requirement for specialized knowledge to interpret APL programs. To address these challenges, we introduce a novel framework for APL-to-C# translation by comparing three guided strategies, namely natural language description-mediated, retrieval-augmented, and iterative refinement, against a baseline direct translation model. We constructed multiple datasets of functionally equivalent code pairs spanning various levels of complexity, and to rigorously assess translation quality, we developed an automated evaluation pipeline that verifies both syntactic compilation and functional execution of the generated C# code. Our results demonstrate that neural code translation can successfully bridge the gap between APL and C# for a wide range of programs, and that incorporating additional context and guidance significantly improves model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates neural translation of APL to C# using LLMs. It compares three guided strategies (natural language description-mediated, retrieval-augmented, and iterative refinement) against a direct-translation baseline. Datasets of functionally equivalent APL-C# pairs are constructed at varying complexity levels, and an automated pipeline is introduced to verify compilation and execution-based functional equivalence. The central claim is that the approach successfully bridges the languages for a wide range of programs and that added guidance yields significant performance gains.

Significance. If supported by quantitative evidence, the work would be moderately significant for legacy-code migration and cross-language translation research, especially for niche array languages with limited parallel data. The automated evaluation pipeline is a constructive methodological contribution that could improve reproducibility over purely syntactic or manual checks.

major comments (2)
  1. [Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.
  2. [Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.
minor comments (2)
  1. [Methods] Add a dedicated methods subsection detailing exact dataset sizes, how functional equivalence was defined for test generation, and the precise inputs used in the execution pipeline.
  2. [Results] Clarify whether the three guided strategies were evaluated on the same held-out test sets and whether any statistical test was applied to the reported performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of quantitative results and to more explicitly address limitations in the evaluation methodology. We address each major comment below and have revised the manuscript to incorporate the suggestions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including concrete metrics. The evaluation section reports dataset sizes (1,050 low-complexity pairs, 420 medium-complexity pairs, and 180 high-complexity pairs), functional-equivalence success rates (baseline 48%, natural-language-mediated 67%, retrieval-augmented 71%, iterative-refinement 79%), and statistical significance (McNemar test, p < 0.01 for each guided strategy versus baseline). We have revised the abstract to state these figures explicitly so that the central claims can be directly evaluated. revision: yes

  2. Referee: [Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.

    Authors: We appreciate the referee's observation on the risk of false positives arising from incomplete coverage of APL's shape-dependent and reduction semantics. Our test suite was constructed to exercise multiple ranks, broadcasting patterns, and reductions, yet we acknowledge that exhaustive coverage remains difficult. In the revised manuscript we have added an explicit limitations paragraph in the evaluation section that discusses this issue, and we have augmented the test cases with additional instances targeting empty arrays, rank promotion, and complex broadcasting. These changes make the scope and potential weaknesses of the verification pipeline transparent while preserving the automated nature of the evaluation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical framework for APL-to-C# translation using LLMs with guided strategies, constructed datasets of functionally equivalent pairs, and an automated pipeline for compilation and execution-based verification. No equations, parameters, or derivations are present that reduce any result to its inputs by construction. The evaluation relies on external checks (compilation success and runtime equivalence on test cases) rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on experimental outcomes from these independent components, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can be effectively guided for low-resource code translation and that compilation-plus-execution tests suffice to establish functional equivalence. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption LLMs can be effectively guided for low-resource code translation tasks
    Invoked to justify the three guided strategies over direct translation.
  • domain assumption Compilation and execution testing reliably verifies functional equivalence
    Basis for the automated evaluation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5474 in / 1256 out tokens · 41259 ms · 2026-05-15T05:22:01.318670+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

  1. [1]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2023)

  2. [2]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., et al.: Google’s neural ma- chine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  3. [3]

    Harvard Business School, Working Paper 24-013 (2023)

    Dell’Acqua, F., McFowland, E., III, Mollick, E., Lifshitz-Assaf, H., Kellogg, K.C., et al.: Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School, Working Paper 24-013 (2023)

  4. [4]

    Wiley (1962)

    Iverson, K.E.: A programming language. Wiley (1962)

  5. [5]

    arXiv preprint arXiv:2307.07686 (2023)

    Lei, B., Ding, C., Chen, L., Lin, P.-H., Liao, C.: Creating a dataset for high- performance computing code translation using LLMs: a bridge between OpenMP Fortran and C++. arXiv preprint arXiv:2307.07686 (2023)

  6. [6]

    In: International Conference on Learning Repre- sentations (ICLR) (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., et al.: LoRA: low-rank adap- tation of large language models. In: International Conference on Learning Repre- sentations (ICLR) (2022)

  7. [7]

    arXiv preprint arXiv:2505.02931 (2025)

    Vallecillos Ruiz, F., Hort, M., Moonen, L.: The art of repair: optimizing iterative program repair with instruction-tuned models. arXiv preprint arXiv:2505.02931 (2025)

  8. [8]

    Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,

    R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,” arXiv preprint arXiv:2308.03109 , 2023

  9. [9]

    arXiv preprint arXiv:2509.12973 (2025)

    Aljagthami, A., Banabila, M., Alshehri, M., Kabini, M., Alahmadi, M.D.: Evalu- ating large language models for code translation: effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973 (2025)

  10. [10]

    arXiv preprint arXiv:2505.07425 (2025)

    Chen, X., Xue, J., Xie, X., Liang, C., Ju, X.: A systematic literature review on neural code translation. arXiv preprint arXiv:2505.07425 (2025)

  11. [11]

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., et al.: CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021) 16 A. Ramadan et al

  12. [12]

    arXiv preprint arXiv:2105.12655 (2021)

    Puri, R., Kung, D.S., Janssen, G., Zhang, W., Domeniconi, G., et al.: CodeNet: a large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655 (2021)

  13. [13]

    arXiv preprint arXiv:2108.11590 (2023)

    Ahmad, W.U., Tushar, M.G.R., Chakraborty, S., Chang, K.-W.: A V ATAR: a par- allel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590 (2023)

  14. [14]

    Tree-to-tree Neural Networks for Program Translation

    Chen, X., Liu, C., Song, D.: Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691 (2018)

  15. [15]

    IBM Newsroom (2023)

    IBM: IBM unveils Watsonx generative AI capabilities to accelerate mainframe application modernization. IBM Newsroom (2023)

  16. [16]

    arXiv preprint arXiv:2205.11116 (2023)

    Ahmad, W.U., Chakraborty, S., Ray, B., Chang, K.-W.: Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2205.11116 (2023)

  17. [17]

    arXiv preprint arXiv:2507.08627 (2025)

    Tai, C.A., Nie, P., Golab, L., Wong, A.: NL in the middle: code translation with LLMs and intermediate representations. arXiv preprint arXiv:2507.08627 (2025)

  18. [18]

    arXiv preprint arXiv:2407.19619 (2024)

    Bhattarai, M., Santos, J.E., Jones, S., Biswas, A., Alexandrov, B., O’Malley, D.: Enhancing code translation in language models with few-shot learning via retrieval- augmented generation. arXiv preprint arXiv:2407.19619 (2024)

  19. [19]

    Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)

    Ibrahimzada, A.R., Ke, K., Pawagi, M., Abid, M.S., Pan, R., et al.: AlphaTrans: A neuro-symbolic compositional approach for repository-level code translation and validation. Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)

  20. [20]

    (2017) https://docs.dyalog.com/19.0/Object%20Oriented%20Programming% 20for%20APL%20Programmers.pdf, last accessed 2026/04/12

    Dyalog Ltd.: Object oriented programming for APL programmers. (2017) https://docs.dyalog.com/19.0/Object%20Oriented%20Programming% 20for%20APL%20Programmers.pdf, last accessed 2026/04/12

  21. [21]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al.: Retrieval- augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401 (2021)

  22. [22]

    https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12

    Rosetta Code: Programming Tasks. https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12

  23. [23]

    https: //huggingface.co, last accessed 2026/04/09

    Hugging Face, Inc.: Hugging Face: The AI community building the future. https: //huggingface.co, last accessed 2026/04/09

  24. [24]

    Google DeepMind. Gemma 4 . 2026. A vailable at: https://deepmind.google/ models/gemma/gemma-4/

  25. [25]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  26. [26]

    https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12

    OpenAI: GPT-5-chat-latest documentation. https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12

  27. [27]

    https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12

    Anthropic: Claude 4 Opus documentation. https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12

  28. [28]

    Technical Report, IBM Corporation (1989)

    Cason, S.: APL2 idioms library. Technical Report, IBM Corporation (1989)

  29. [29]

    FinnAPL idiom library

    APL Wiki. FinnAPL idiom library. Finnish APL Association. https://aplwiki. com/wiki/FinnAPL_idiom_library, last accessed 2026/04/12

  30. [30]

    https://platform.openai

    OpenAI: Text-embedding-3-large documentation. https://platform.openai. com/docs/models/text-embedding-3-large , last accessed 2026/04/12

  31. [31]

    https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12

    OpenAI: GPT-5-mini documentation. https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12

  32. [32]

    TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

    Yuan, Z., Chen, W., Wang, H., Peng, X., Chen, Z., Lou, Y.: TransAgent: Enhancing LLM-based code translation via fine-grained execution alignment. arXiv preprint arXiv:2409.19894v5 (2026) Neural Code Translation of Legacy Code: APL to C# 17

  33. [33]

    arXiv preprint arXiv:2603.14054 (2026)

    Moti, Z., Soudani, H., van der Kogel, J.: LegacyTranslate: LLM-based multi-agent method for legacy code translation. arXiv preprint arXiv:2603.14054 (2026)

  34. [34]

    In: Proceedings of the 48th International Conference on Software Engineering (2026)

    Guan, Z., Yin, X., Peng, Z., Ni, C.: RepoTransAgent: Multi-agent LLM framework for repository-aware code translation. In: Proceedings of the 48th International Conference on Software Engineering (2026)