Recognition: no theorem link
Neural Code Translation of Legacy Code: APL to C#
Pith reviewed 2026-05-15 05:22 UTC · model grok-4.3
The pith
Guided large language models translate APL legacy code to functional C# across program complexities
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that neural code translation can successfully bridge APL and C# for a wide range of programs, and that adding context through natural language descriptions, retrieval augmentation, or iterative refinement measurably improves model performance over direct translation.
What carries the argument
A comparison framework that evaluates three guided translation strategies—natural language description-mediated, retrieval-augmented, and iterative refinement—against a direct baseline, using datasets of functionally equivalent code pairs and an automated pipeline that verifies compilation and execution equivalence.
If this is right
- Organizations maintaining critical APL systems can automatically generate working C# ports while preserving behavior.
- Adding descriptive context or similar-example retrieval raises the fraction of programs that translate correctly.
- Automated compilation and execution testing can serve as a practical filter for generated code quality.
- Translation quality scales with program complexity when guidance is supplied.
Where Pith is reading between the lines
- The same guided strategies could be tested on other array-oriented or domain-specific legacy languages such as J or K.
- Retrieval quality may become the dominant factor once basic description and refinement are in place.
- Integration into existing refactoring tools would let teams apply the pipeline incrementally to large codebases.
- Success on APL suggests that LLMs can handle languages with sparse public corpora when given targeted guidance.
Load-bearing premise
The constructed datasets of functionally equivalent APL-C# pairs are representative of real-world legacy code and the automated compilation-plus-execution pipeline reliably detects functional equivalence without false positives.
What would settle it
A single real-world APL program for which every guided model produces C# that either fails to compile, fails the execution test suite, or yields different outputs from the original APL would falsify the claim.
Figures
read the original abstract
Automatic translation between programming languages remains a challenging problem, particularly when the source language is highly concise and specialized. This paper investigates the translation of APL into C# using large language models. The task is difficult due to APL's sparse syntax, the scarcity of large-scale parallel corpora, and the requirement for specialized knowledge to interpret APL programs. To address these challenges, we introduce a novel framework for APL-to-C# translation by comparing three guided strategies, namely natural language description-mediated, retrieval-augmented, and iterative refinement, against a baseline direct translation model. We constructed multiple datasets of functionally equivalent code pairs spanning various levels of complexity, and to rigorously assess translation quality, we developed an automated evaluation pipeline that verifies both syntactic compilation and functional execution of the generated C# code. Our results demonstrate that neural code translation can successfully bridge the gap between APL and C# for a wide range of programs, and that incorporating additional context and guidance significantly improves model performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates neural translation of APL to C# using LLMs. It compares three guided strategies (natural language description-mediated, retrieval-augmented, and iterative refinement) against a direct-translation baseline. Datasets of functionally equivalent APL-C# pairs are constructed at varying complexity levels, and an automated pipeline is introduced to verify compilation and execution-based functional equivalence. The central claim is that the approach successfully bridges the languages for a wide range of programs and that added guidance yields significant performance gains.
Significance. If supported by quantitative evidence, the work would be moderately significant for legacy-code migration and cross-language translation research, especially for niche array languages with limited parallel data. The automated evaluation pipeline is a constructive methodological contribution that could improve reproducibility over purely syntactic or manual checks.
major comments (2)
- [Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.
- [Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.
minor comments (2)
- [Methods] Add a dedicated methods subsection detailing exact dataset sizes, how functional equivalence was defined for test generation, and the precise inputs used in the execution pipeline.
- [Results] Clarify whether the three guided strategies were evaluated on the same held-out test sets and whether any statistical test was applied to the reported performance differences.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of quantitative results and to more explicitly address limitations in the evaluation methodology. We address each major comment below and have revised the manuscript to incorporate the suggestions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that results 'demonstrate that neural code translation can successfully bridge the gap' and that guidance 'significantly improves model performance' is unsupported by any reported metrics (success rates, error rates, dataset sizes, or statistical significance of improvement). Without these numbers the central claim cannot be evaluated.
Authors: We agree that the abstract would be strengthened by including concrete metrics. The evaluation section reports dataset sizes (1,050 low-complexity pairs, 420 medium-complexity pairs, and 180 high-complexity pairs), functional-equivalence success rates (baseline 48%, natural-language-mediated 67%, retrieval-augmented 71%, iterative-refinement 79%), and statistical significance (McNemar test, p < 0.01 for each guided strategy versus baseline). We have revised the abstract to state these figures explicitly so that the central claims can be directly evaluated. revision: yes
-
Referee: [Evaluation] Evaluation section (pipeline description): the automated check of functional equivalence rests on compilation plus execution on constructed test cases. Because APL semantics involve shape-dependent broadcasting, rank promotion, empty arrays, and reductions, limited test coverage can produce false positives; non-equivalent C# outputs may still pass, directly weakening both the 'success' and 'improvement' claims.
Authors: We appreciate the referee's observation on the risk of false positives arising from incomplete coverage of APL's shape-dependent and reduction semantics. Our test suite was constructed to exercise multiple ranks, broadcasting patterns, and reductions, yet we acknowledge that exhaustive coverage remains difficult. In the revised manuscript we have added an explicit limitations paragraph in the evaluation section that discusses this issue, and we have augmented the test cases with additional instances targeting empty arrays, rank promotion, and complex broadcasting. These changes make the scope and potential weaknesses of the verification pipeline transparent while preserving the automated nature of the evaluation. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical framework for APL-to-C# translation using LLMs with guided strategies, constructed datasets of functionally equivalent pairs, and an automated pipeline for compilation and execution-based verification. No equations, parameters, or derivations are present that reduce any result to its inputs by construction. The evaluation relies on external checks (compilation success and runtime equivalence on test cases) rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims rest on experimental outcomes from these independent components, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs can be effectively guided for low-resource code translation tasks
- domain assumption Compilation and execution testing reliably verifies functional equivalence
Reference graph
Works this paper leans on
-
[1]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., et al.: Google’s neural ma- chine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[3]
Harvard Business School, Working Paper 24-013 (2023)
Dell’Acqua, F., McFowland, E., III, Mollick, E., Lifshitz-Assaf, H., Kellogg, K.C., et al.: Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School, Working Paper 24-013 (2023)
work page 2023
- [4]
-
[5]
arXiv preprint arXiv:2307.07686 (2023)
Lei, B., Ding, C., Chen, L., Lin, P.-H., Liao, C.: Creating a dataset for high- performance computing code translation using LLMs: a bridge between OpenMP Fortran and C++. arXiv preprint arXiv:2307.07686 (2023)
-
[6]
In: International Conference on Learning Repre- sentations (ICLR) (2022)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., et al.: LoRA: low-rank adap- tation of large language models. In: International Conference on Learning Repre- sentations (ICLR) (2022)
work page 2022
-
[7]
arXiv preprint arXiv:2505.02931 (2025)
Vallecillos Ruiz, F., Hort, M., Moonen, L.: The art of repair: optimizing iterative program repair with instruction-tuned models. arXiv preprint arXiv:2505.02931 (2025)
-
[8]
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,
R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,” arXiv preprint arXiv:2308.03109 , 2023
-
[9]
arXiv preprint arXiv:2509.12973 (2025)
Aljagthami, A., Banabila, M., Alshehri, M., Kabini, M., Alahmadi, M.D.: Evalu- ating large language models for code translation: effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973 (2025)
-
[10]
arXiv preprint arXiv:2505.07425 (2025)
Chen, X., Xue, J., Xie, X., Liang, C., Ju, X.: A systematic literature review on neural code translation. arXiv preprint arXiv:2505.07425 (2025)
-
[11]
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., et al.: CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021) 16 A. Ramadan et al
work page internal anchor Pith review arXiv 2021
-
[12]
arXiv preprint arXiv:2105.12655 (2021)
Puri, R., Kung, D.S., Janssen, G., Zhang, W., Domeniconi, G., et al.: CodeNet: a large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655 (2021)
-
[13]
arXiv preprint arXiv:2108.11590 (2023)
Ahmad, W.U., Tushar, M.G.R., Chakraborty, S., Chang, K.-W.: A V ATAR: a par- allel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590 (2023)
-
[14]
Tree-to-tree Neural Networks for Program Translation
Chen, X., Liu, C., Song, D.: Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
IBM: IBM unveils Watsonx generative AI capabilities to accelerate mainframe application modernization. IBM Newsroom (2023)
work page 2023
-
[16]
arXiv preprint arXiv:2205.11116 (2023)
Ahmad, W.U., Chakraborty, S., Ray, B., Chang, K.-W.: Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2205.11116 (2023)
-
[17]
arXiv preprint arXiv:2507.08627 (2025)
Tai, C.A., Nie, P., Golab, L., Wong, A.: NL in the middle: code translation with LLMs and intermediate representations. arXiv preprint arXiv:2507.08627 (2025)
-
[18]
arXiv preprint arXiv:2407.19619 (2024)
Bhattarai, M., Santos, J.E., Jones, S., Biswas, A., Alexandrov, B., O’Malley, D.: Enhancing code translation in language models with few-shot learning via retrieval- augmented generation. arXiv preprint arXiv:2407.19619 (2024)
-
[19]
Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)
Ibrahimzada, A.R., Ke, K., Pawagi, M., Abid, M.S., Pan, R., et al.: AlphaTrans: A neuro-symbolic compositional approach for repository-level code translation and validation. Proceedings of the ACM on Software Engineering 2(FSE), 2454–2476 (2025)
work page 2025
-
[20]
Dyalog Ltd.: Object oriented programming for APL programmers. (2017) https://docs.dyalog.com/19.0/Object%20Oriented%20Programming% 20for%20APL%20Programmers.pdf, last accessed 2026/04/12
work page 2017
-
[21]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al.: Retrieval- augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[22]
https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12
Rosetta Code: Programming Tasks. https://rosettacode.org/wiki/ Category:Programming_Tasks, last accessed 2026/04/12
work page 2026
-
[23]
https: //huggingface.co, last accessed 2026/04/09
Hugging Face, Inc.: Hugging Face: The AI community building the future. https: //huggingface.co, last accessed 2026/04/09
work page 2026
-
[24]
Google DeepMind. Gemma 4 . 2026. A vailable at: https://deepmind.google/ models/gemma/gemma-4/
work page 2026
-
[25]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12
OpenAI: GPT-5-chat-latest documentation. https://platform.openai.com/ docs/models/gpt-5-chat-latest , last accessed 2026/04/12
work page 2026
-
[27]
https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12
Anthropic: Claude 4 Opus documentation. https://www.anthropic.com/ news/claude-4-opus , last accessed 2026/04/12
work page 2026
-
[28]
Technical Report, IBM Corporation (1989)
Cason, S.: APL2 idioms library. Technical Report, IBM Corporation (1989)
work page 1989
-
[29]
APL Wiki. FinnAPL idiom library. Finnish APL Association. https://aplwiki. com/wiki/FinnAPL_idiom_library, last accessed 2026/04/12
work page 2026
-
[30]
OpenAI: Text-embedding-3-large documentation. https://platform.openai. com/docs/models/text-embedding-3-large , last accessed 2026/04/12
work page 2026
-
[31]
https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12
OpenAI: GPT-5-mini documentation. https://platform.openai.com/docs/ models/gpt-5-mini , last accessed 2026/04/12
work page 2026
-
[32]
TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment
Yuan, Z., Chen, W., Wang, H., Peng, X., Chen, Z., Lou, Y.: TransAgent: Enhancing LLM-based code translation via fine-grained execution alignment. arXiv preprint arXiv:2409.19894v5 (2026) Neural Code Translation of Legacy Code: APL to C# 17
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
arXiv preprint arXiv:2603.14054 (2026)
Moti, Z., Soudani, H., van der Kogel, J.: LegacyTranslate: LLM-based multi-agent method for legacy code translation. arXiv preprint arXiv:2603.14054 (2026)
-
[34]
In: Proceedings of the 48th International Conference on Software Engineering (2026)
Guan, Z., Yin, X., Peng, Z., Ni, C.: RepoTransAgent: Multi-agent LLM framework for repository-aware code translation. In: Proceedings of the 48th International Conference on Software Engineering (2026)
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.