HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer
Pith reviewed 2026-05-10 06:29 UTC · model grok-4.3
The pith
Cross-lingual transfer from C++ raises low-resource language repair Pass@1 from 1.67% to 11.97% on CodeLlama.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HELO-APR is a two-stage framework that constructs high-quality low-resource training data by synthesizing buggy-fixed pairs from high-resource counterparts while preserving defect type consistency and idiomaticity, then applies curriculum learning that progresses through high-resource repair learning, cross-lingual repair alignment, and low-resource repair adaptation. On xCodeEval this raises Pass@1 from 31.32% to 48.65% on DeepSeek-Coder-6.7B and from 1.67% to 11.97% on CodeLlama-7B, while lifting average target compilation rate on CodeLlama from 49.77% to 91.98%. On Defects4Ruby it also increases BLEU-4 from 61.20 to 66.79 and ROUGE-1 from 76.76 to 83.59 on CodeLlama-7B.
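Pass@1 is the headline metric throughout. For reference, a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), which for k = 1 reduces to the fraction of correct samples per problem, averaged over the benchmark — this is background on the metric, not code from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated for one problem,
    c of them passing all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k: int = 1) -> float:
    """Mean per-problem estimate; results is a list of (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1), Pass@1 is simply the share of problems fixed on the first try, which is how headline numbers like 48.65% are read.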
What carries the argument
HELO-APR, the two-stage framework that first synthesizes LRPL buggy-fixed pairs from HRPL counterparts and then runs a three-phase curriculum of HRPL repair learning, cross-lingual alignment, and LRPL adaptation.
If this is right
- LLMs reach substantially higher Pass@1 scores on Ruby and Rust repairs without large native training sets.
- Generated patches exhibit markedly higher syntactic validity, with compilation rates rising above 90% on CodeLlama.
- The gains extend to real-world benchmarks such as Defects4Ruby, producing patches closer to developer-written fixes.
- Ablation results confirm that both the data-synthesis step and each curriculum phase contribute measurably to the final performance.
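The compilation-rate metric referenced above is just the share of generated patches that pass a syntax or compile check. A minimal sketch, using Python's built-in `compile()` as a stand-in checker (for Ruby or Rust one would shell out to `ruby -c` or `rustc` instead; that tooling choice is an assumption, not a detail from the paper):

```python
def python_syntax_ok(src: str) -> bool:
    """Stand-in syntax checker; swap in a per-language command for Ruby/Rust."""
    try:
        compile(src, "<patch>", "exec")
        return True
    except SyntaxError:
        return False

def compilation_rate(patches, syntax_ok=python_syntax_ok) -> float:
    """Fraction of candidate patches that pass the syntax/compile check."""
    return sum(1 for p in patches if syntax_ok(p)) / len(patches)
```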
Where Pith is reading between the lines
- The same synthesis-plus-curriculum pattern could be tested on other low-data software engineering tasks such as bug localization or test generation.
- Varying the high-resource source language or adding more low-resource targets would test how far defect-type preservation can stretch before the synthesized data loses utility.
- Larger base models might show even bigger relative gains if their cross-lingual alignment capacity exceeds that of the 7B-scale models evaluated here.
Load-bearing premise
Synthesizing low-resource language buggy-fixed pairs from high-resource language examples must preserve defect type consistency while producing idiomatic target code that supplies effective training signals.
What would settle it
Running the same synthesis and curriculum procedure on a new low-resource language whose defect distribution differs markedly from the high-resource source, and observing either no Pass@1 gain or a drop below direct fine-tuning baselines, would show that the transfer fails to deliver usable supervision.
Original abstract
Large Language Models (LLMs) perform well on automatic program repair (APR) for high-resource programming languages (HRPLs), but their effectiveness drops sharply in low-resource programming languages (LRPLs), due to a lack of sufficient verified buggy-fixed pairs for APR training. To address this challenge, we propose HELO-APR (High-resource Enabled LOw-resource APR), a two-stage APR framework that enables cross-lingual transfer of repair knowledge from HRPLs to LRPLs. HELO-APR (1) constructs high-quality LRPL training data by synthesizing LRPL buggy-fixed pairs from HRPL counterparts, preserving defect type consistency while ensuring the synthesized code is idiomatic, and then (2) adopts a curriculum learning strategy that progressively performs HRPL repair learning, cross-lingual repair alignment, and LRPL repair adaptation, improving repair effectiveness in LRPLs. Using C++ as the source HRPL and Ruby and Rust as the target LRPLs, experiments on xCodeEval show that HELO-APR consistently outperforms strong baselines, increasing Pass@1 from 31.32% to 48.65% on DeepSeek-Coder-6.7B and from 1.67% to 11.97% on CodeLlama-7B, while improving syntactic validity by raising the average target compilation rate on CodeLlama from 49.77% to 91.98%. On Defects4Ruby, HELO-APR increases BLEU-4 from 61.20 to 66.79 and ROUGE-1 from 76.76 to 83.59 on CodeLlama-7B, indicating higher similarity to developer patches in real-world settings. Finally, we conduct ablation studies to assess the necessity of each core component. These results suggest that verified cross-lingual supervision provides a reusable approach for improving LLM-based repair in low-resource languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HELO-APR, a two-stage framework for automatic program repair (APR) in low-resource programming languages (LRPLs) such as Ruby and Rust. It first synthesizes LRPL buggy-fixed pairs from high-resource counterparts (C++ as HRPL) while claiming to preserve defect types and idiomaticity, then applies curriculum learning with HRPL repair pretraining, cross-lingual alignment, and LRPL adaptation. Experiments on xCodeEval report Pass@1 gains (e.g., 31.32% to 48.65% on DeepSeek-Coder-6.7B) and higher compilation rates; results on Defects4Ruby show improved BLEU/ROUGE similarity to developer patches. Ablations assess component necessity.
Significance. If the synthesis step produces valid supervision, the approach offers a practical route to bootstrap APR for LRPLs from abundant HRPL data, addressing a clear resource disparity. The empirical gains on public benchmarks and the curriculum design are potentially reusable, and the ablation studies provide some isolation of effects. However, the absence of independent checks on the synthesized data limits how much weight the performance claims can carry for the field.
major comments (3)
- [§3] §3 (Data Construction): The central claim that HRPL-to-LRPL synthesis preserves defect-type consistency and produces idiomatic, compilable LRPL pairs is asserted without any quantitative validation (e.g., defect-type agreement rate, human idiomaticity scores, or comparison to native LRPL bugs). This step supplies all supervision for the subsequent adaptation stage; without such checks the reported Pass@1 lifts (31.32%→48.65%, 1.67%→11.97%) cannot be confidently attributed to cross-lingual transfer rather than data volume or ordering artifacts.
- [§4.1] §4.1 (Experimental Setup): Baseline implementations, hyper-parameter matching, and statistical significance tests for the Pass@1 and compilation-rate improvements are not described in sufficient detail. Without these, it is impossible to determine whether the gains over the listed strong baselines are reproducible or merely reflect differences in training regime.
- [§4.3] §4.3 (Ablation Studies): The ablations demonstrate necessity of each stage, yet they do not isolate the contribution of synthesis quality itself (e.g., by comparing against randomly translated or non-idiomatic pairs). This leaves open whether the curriculum ordering or the sheer volume of synthesized data drives the results.
minor comments (2)
- [§3.3] Notation for the three curriculum stages (HRPL repair learning, cross-lingual alignment, LRPL adaptation) is introduced without a compact diagram or equation summarizing the progressive loss schedule.
- [§2] The paper cites prior cross-lingual transfer work but does not compare against recent multilingual code models that already incorporate some Ruby/Rust data; a brief discussion of why those baselines were omitted would strengthen the positioning.
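For readers wanting the gist of the three-phase schedule the first minor comment asks to see formalized, one plausible reading is strictly sequential fine-tuning over three datasets. This is a hypothetical sketch: the paper's actual mixing ratios, losses, and transition criteria are not given here.

```python
def curriculum_stream(hrpl_pairs, aligned_pairs, lrpl_pairs):
    """Yield (phase, example) in strict curriculum order: HRPL repair,
    then cross-lingual aligned pairs, then LRPL repair. Hypothetical;
    HELO-APR's real schedule may interleave or reweight phases."""
    phases = [
        ("hrpl_repair", hrpl_pairs),
        ("cross_lingual_alignment", aligned_pairs),
        ("lrpl_adaptation", lrpl_pairs),
    ]
    for name, data in phases:
        for example in data:
            yield name, example
```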
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to strengthen the paper.
Point-by-point responses
Referee: [§3] §3 (Data Construction): The central claim that HRPL-to-LRPL synthesis preserves defect-type consistency and produces idiomatic, compilable LRPL pairs is asserted without any quantitative validation (e.g., defect-type agreement rate, human idiomaticity scores, or comparison to native LRPL bugs). This step supplies all supervision for the subsequent adaptation stage; without such checks the reported Pass@1 lifts (31.32%→48.65%, 1.67%→11.97%) cannot be confidently attributed to cross-lingual transfer rather than data volume or ordering artifacts.
Authors: We agree that the current manuscript lacks explicit quantitative validation for the synthesis step. In the revised version, we will add a dedicated subsection in §3 reporting defect-type agreement rates (via automated static analysis matching bug patterns between HRPL and synthesized LRPL pairs), human idiomaticity scores from a pilot evaluation on 100 samples, and direct comparisons of synthesized defect distributions against native LRPL bugs from Defects4Ruby. These additions will provide stronger evidence that performance gains stem from cross-lingual transfer rather than artifacts. revision: yes
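The two checks the authors promise are straightforward to operationalize. A sketch of both — defect-type agreement between source and synthesized pairs, and a distribution-match check against native bugs — assuming defect-type labels come from some external classifier or static analyzer (the labels and their source are hypothetical):

```python
from collections import Counter

def defect_type_agreement(hrpl_labels, lrpl_labels) -> float:
    """Fraction of synthesized LRPL pairs whose classified defect type
    matches the source HRPL pair's type."""
    assert len(hrpl_labels) == len(lrpl_labels)
    matches = sum(h == l for h, l in zip(hrpl_labels, lrpl_labels))
    return matches / len(hrpl_labels)

def distribution_shift(native_labels, synthesized_labels) -> float:
    """Total-variation distance between native and synthesized defect-type
    distributions: 0 means identical, 1 means disjoint."""
    a, b = Counter(native_labels), Counter(synthesized_labels)
    na, nb = len(native_labels), len(synthesized_labels)
    return 0.5 * sum(abs(a[k] / na - b[k] / nb) for k in set(a) | set(b))
```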
Referee: [§4.1] §4.1 (Experimental Setup): Baseline implementations, hyper-parameter matching, and statistical significance tests for the Pass@1 and compilation-rate improvements are not described in sufficient detail. Without these, it is impossible to determine whether the gains over the listed strong baselines are reproducible or merely reflect differences in training regime.
Authors: We acknowledge that §4.1 requires more implementation detail. We will expand this section to include all hyper-parameter values, exact baseline configurations (with code references or prompt templates), and statistical significance results using paired tests (e.g., McNemar's test) and bootstrap confidence intervals on the Pass@1 and compilation-rate metrics to demonstrate reproducibility. revision: yes
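Both proposed tests are standard and cheap. A stdlib-only sketch, assuming per-problem 0/1 Pass@1 outcomes for baseline and treatment on the same benchmark problems (the exact McNemar p-value is computed from the two discordant counts):

```python
import random
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value. b = problems only the baseline
    fixes, c = problems only the treatment fixes."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bootstrap_diff_ci(base, treat, iters=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Pass@1 difference (treat - base),
    resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(base)
    diffs = sorted(
        sum(treat[i] - base[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(iters)
    )
    return diffs[int(alpha / 2 * iters)], diffs[int((1 - alpha / 2) * iters) - 1]
```

A CI on the difference that excludes zero, together with a small McNemar p-value, would support the reproducibility claim the referee asks for.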
Referee: [§4.3] §4.3 (Ablation Studies): The ablations demonstrate necessity of each stage, yet they do not isolate the contribution of synthesis quality itself (e.g., by comparing against randomly translated or non-idiomatic pairs). This leaves open whether the curriculum ordering or the sheer volume of synthesized data drives the results.
Authors: The existing ablations isolate the curriculum stages. To further isolate synthesis quality, we will add new ablation experiments in the revised §4.3 that compare our synthesized pairs against randomly translated equivalents and non-idiomatic variants, quantifying their impact on final Pass@1 to clarify the role of synthesis quality versus data volume or ordering. revision: yes
Circularity Check
No circularity: empirical results from external benchmarks
full rationale
The paper presents an empirical method (data synthesis from HRPL to LRPL followed by curriculum learning) whose performance claims are measured on independent benchmarks (xCodeEval, Defects4Ruby) rather than any internal derivation, equation, or fitted parameter that reduces to the method's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described framework or results. The synthesis assumption is an unverified modeling choice, not a circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs can be effectively fine-tuned on synthesized cross-lingual code repair data.
- Domain assumption: repair knowledge can be aligned across programming languages through curriculum learning.