pith. machine review for the scientific record.

arxiv: 2604.17016 · v1 · submitted 2026-04-18 · 💻 cs.SE

Recognition: unknown

HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:29 UTC · model grok-4.3

classification 💻 cs.SE
keywords automatic program repair · low-resource programming languages · cross-lingual transfer · curriculum learning · code synthesis · LLM fine-tuning · xCodeEval

The pith

Cross-lingual transfer from C++ raises low-resource language repair Pass@1 from 1.67% to 11.97% on CodeLlama.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to improve automatic program repair for languages that lack enough verified buggy-fixed examples by borrowing repair knowledge from languages that have abundant data. It first creates training examples in the target low-resource languages by deriving them from high-resource language cases while keeping the same kinds of defects and producing natural code. It then trains models in three progressive stages: learning repair on the high-resource language, aligning the repair process across languages, and finally adapting to the low-resource language. A sympathetic reader would care because this approach could let existing large models produce useful fixes for Ruby and Rust without first collecting massive new datasets in those languages.

Core claim

HELO-APR is a two-stage framework that constructs high-quality low-resource training data by synthesizing buggy-fixed pairs from high-resource counterparts while preserving defect type consistency and idiomaticity, then applies curriculum learning that progresses through high-resource repair learning, cross-lingual repair alignment, and low-resource repair adaptation. On xCodeEval this raises Pass@1 from 31.32% to 48.65% on DeepSeek-Coder-6.7B and from 1.67% to 11.97% on CodeLlama-7B, while lifting average target compilation rate on CodeLlama from 49.77% to 91.98%. On Defects4Ruby it also increases BLEU-4 from 61.20 to 66.79 and ROUGE-1 from 76.76 to 83.59 on CodeLlama-7B.
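
For orientation, Pass@1 is the fraction of bugs for which a single sampled patch passes all tests. A minimal sketch of the standard unbiased pass@k estimator (Chen et al.'s formulation; whether HELO-APR samples one or many candidates per bug is not specified in this summary):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator: n = candidate patches sampled per
        # bug, c = candidates that pass all tests, k = evaluation budget.
        # Returns the probability that at least one of k draws passes.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical illustration: 20 samples per bug, 3 passing fixes.
    print(f"pass@1 = {pass_at_k(20, 3, 1):.2f}")  # pass@1 = 0.15

Averaged over all bugs in a benchmark, pass@1 reduces to the plain success rate when only one candidate is sampled per bug.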

What carries the argument

HELO-APR, the two-stage framework that first synthesizes LRPL buggy-fixed pairs from HRPL counterparts and then runs a three-phase curriculum of HRPL repair learning, cross-lingual alignment, and LRPL adaptation.
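
A minimal sketch of how such a three-phase curriculum could be wired together (the data volumes, 1:1 pairing strategy, and training stub below are assumptions for illustration, not the paper's implementation):

    from dataclasses import dataclass

    @dataclass
    class RepairPair:
        buggy: str
        fixed: str
        lang: str  # "cpp" for the HRPL; "ruby" or "rust" for the LRPL

    def train_stage(state: dict, batch: list, tag: str) -> dict:
        # Stand-in for one supervised fine-tuning pass (buggy -> fixed);
        # here it only records what each stage saw.
        state.setdefault("stages", []).append((tag, len(batch)))
        return state

    def helo_apr_curriculum(hrpl: list, lrpl_synth: list) -> dict:
        # Phase 1: repair learning on abundant HRPL pairs.
        state = train_stage({}, hrpl, "hrpl_repair")
        # Phase 2: cross-lingual alignment on mixed HRPL/LRPL examples
        # (the 1:1 mix here is an assumption made for this sketch).
        mixed = hrpl[: len(lrpl_synth)] + lrpl_synth
        state = train_stage(state, mixed, "cross_lingual_alignment")
        # Phase 3: adaptation on the synthesized LRPL pairs alone.
        return train_stage(state, lrpl_synth, "lrpl_adaptation")

    cpp = [RepairPair("int x = v[n];", "int x = v[n - 1];", "cpp")] * 4
    ruby = [RepairPair("x = v[n]", "x = v[n - 1]", "ruby")] * 2
    print(helo_apr_curriculum(cpp, ruby)["stages"])
    # [('hrpl_repair', 4), ('cross_lingual_alignment', 4), ('lrpl_adaptation', 2)]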

If this is right

  • LLMs reach substantially higher Pass@1 scores on Ruby and Rust repairs without large native training sets.
  • Generated patches exhibit markedly higher syntactic validity, with compilation rates rising above 90% on CodeLlama (a sketch of measuring such a rate follows this list).
  • The gains extend to real-world benchmarks such as Defects4Ruby, producing patches closer to developer-written fixes.
  • Ablation results confirm that both the data-synthesis step and each curriculum phase contribute measurably to the final performance.
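
A sketch of how a target compilation rate like the one above could be measured for Ruby, using the interpreter's built-in syntax check (`ruby -c`); the paper's actual validity harness is not described here, and a full check for Rust would invoke rustc instead:

    import subprocess

    def ruby_syntax_ok(source: str) -> bool:
        # `ruby -c` parses the program and exits 0 ("Syntax OK")
        # without running it; requires a local Ruby install.
        proc = subprocess.run(
            ["ruby", "-c", "-e", source],
            capture_output=True, text=True,
        )
        return proc.returncode == 0

    def compilation_rate(patches: list) -> float:
        if not patches:
            return 0.0
        return sum(ruby_syntax_ok(p) for p in patches) / len(patches)

    # Hypothetical candidates: one valid patch, one with a syntax error.
    print(compilation_rate(["puts [1, 2].sum", "def f(; end"]))  # 0.5

Note that this measures syntactic validity only; semantic correctness still requires running the benchmark's tests.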

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis-plus-curriculum pattern could be tested on other low-data software engineering tasks such as bug localization or test generation.
  • Varying the high-resource source language or adding more low-resource targets would test how far defect-type preservation can stretch before the synthesized data loses utility.
  • Larger base models might show even bigger relative gains if their cross-lingual alignment capacity exceeds that of the 7B-scale models evaluated here.

Load-bearing premise

Synthesizing low-resource language buggy-fixed pairs from high-resource language examples must preserve defect type consistency while producing idiomatic target code that supplies effective training signals.
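
Operationally, the premise implies a synthesize-and-filter loop of roughly the following shape (a minimal sketch: the translation call, defect tagger, and validity check are hypothetical stand-ins, not the paper's pipeline):

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Pair:
        buggy: str
        fixed: str
        defect_type: str  # e.g. "off-by-one", "wrong-operator"

    def synthesize_lrpl_pair(
        hrpl: Pair,
        translate: Callable[[str, str], str],        # assumed LLM translation call
        classify_defect: Callable[[str, str], str],  # assumed defect-type tagger
        is_valid: Callable[[str], bool],             # e.g. a `ruby -c` style check
        target: str = "ruby",
    ) -> Optional[Pair]:
        # Port both sides of the HRPL pair into the target language.
        buggy = translate(hrpl.buggy, target)
        fixed = translate(hrpl.fixed, target)
        # Filter 1: the fixed side must be valid target-language code.
        if not is_valid(fixed):
            return None
        # Filter 2: the same defect type must survive the translation.
        if classify_defect(buggy, fixed) != hrpl.defect_type:
            return None
        return Pair(buggy, fixed, hrpl.defect_type)

If either filter rejects most candidates, the premise is in trouble: the surviving data is either scarce or no longer representative of the source defect distribution.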

What would settle it

Running the same synthesis and curriculum procedure on a new low-resource language whose defect distribution differs markedly from the high-resource source, and observing no Pass@1 gain (or a drop below direct fine-tuning baselines), would show the transfer fails to deliver usable supervision.

Figures

Figures reproduced from arXiv: 2604.17016 by Boyang Yang, Liuye Guo, Tao Zheng, Tieke He, Yidong Wan, You Lv, Zhipeng Wang, Zhuowei Wang.

Figure 1. Overview of the cross-lingual APR workflow (upper) and four critical challenges limiting its effectiveness.
Figure 2. LRPL dataset construction: pairs that faithfully reproduce the defect behaviors exhibited in HRPLs (§2.2).
Figure 3. Three-stage curriculum learning framework of HELO-APR. The model is progressively trained via …
original abstract

Large Language Models (LLMs) perform well on automatic program repair (APR) for high-resource programming languages (HRPLs), but their effectiveness drops sharply in low-resource programming languages (LRPLs), due to a lack of sufficient verified buggy-fixed pairs for APR training. To address this challenge, we propose HELO-APR (High-resource Enabled LOw-resource APR), a two-stage APR framework that enables cross-lingual transfer of repair knowledge from HRPLs to LRPLs. HELO-APR (1) constructs high-quality LRPL training data by synthesizing LRPL buggy-fixed pairs from HRPL counterparts, preserving defect type consistency while ensuring the synthesized code is idiomatic, and then (2) adopts a curriculum learning strategy that progressively performs HRPL repair learning, cross-lingual repair alignment, and LRPL repair adaptation, improving repair effectiveness in LRPLs. Using C++ as the source HRPL and Ruby and Rust as the target LRPLs, experiments on xCodeEval show that HELO-APR consistently outperforms strong baselines, increasing Pass@1 from 31.32% to 48.65% on DeepSeek-Coder-6.7B and from 1.67% to 11.97% on CodeLlama-7B, while improving syntactic validity by raising the average target compilation rate on CodeLlama from 49.77% to 91.98%. On Defects4Ruby, HELO-APR increases BLEU-4 from 61.20 to 66.79 and ROUGE-1 from 76.76 to 83.59 on CodeLlama-7B, indicating higher similarity to developer patches in real-world settings. Finally, we conduct ablation studies to assess the necessity of each core component. These results suggest that verified cross-lingual supervision provides a reusable approach for improving LLM-based repair in low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HELO-APR, a two-stage framework for automatic program repair (APR) in low-resource programming languages (LRPLs) such as Ruby and Rust. It first synthesizes LRPL buggy-fixed pairs from high-resource counterparts (C++ as HRPL) while claiming to preserve defect types and idiomaticity, then applies curriculum learning with HRPL repair pretraining, cross-lingual alignment, and LRPL adaptation. Experiments on xCodeEval report Pass@1 gains (e.g., 31.32% to 48.65% on DeepSeek-Coder-6.7B) and higher compilation rates; results on Defects4Ruby show improved BLEU/ROUGE similarity to developer patches. Ablations assess component necessity.

Significance. If the synthesis step produces valid supervision, the approach offers a practical route to bootstrap APR for LRPLs from abundant HRPL data, addressing a clear resource disparity. The empirical gains on public benchmarks and the curriculum design are potentially reusable, and the ablation studies provide some isolation of effects. However, the absence of independent checks on the synthesized data limits how much weight the performance claims can carry for the field.

major comments (3)
  1. §3 (Data Construction): The central claim that HRPL-to-LRPL synthesis preserves defect-type consistency and produces idiomatic, compilable LRPL pairs is asserted without any quantitative validation (e.g., defect-type agreement rate, human idiomaticity scores, or comparison to native LRPL bugs). This step supplies all supervision for the subsequent adaptation stage; without such checks the reported Pass@1 lifts (31.32%→48.65%, 1.67%→11.97%) cannot be confidently attributed to cross-lingual transfer rather than data volume or ordering artifacts.
  2. §4.1 (Experimental Setup): Baseline implementations, hyper-parameter matching, and statistical significance tests for the Pass@1 and compilation-rate improvements are not described in sufficient detail. Without these, it is impossible to determine whether the gains over the listed strong baselines are reproducible or merely reflect differences in training regime.
  3. §4.3 (Ablation Studies): The ablations demonstrate necessity of each stage, yet they do not isolate the contribution of synthesis quality itself (e.g., by comparing against randomly translated or non-idiomatic pairs). This leaves open whether the curriculum ordering or the sheer volume of synthesized data drives the results.
minor comments (2)
  1. §3.3: Notation for the three curriculum stages (HRPL repair learning, cross-lingual alignment, LRPL adaptation) is introduced without a compact diagram or equation summarizing the progressive loss schedule.
  2. §2: The paper cites prior cross-lingual transfer work but does not compare against recent multilingual code models that already incorporate some Ruby/Rust data; a brief discussion of why those baselines were omitted would strengthen the positioning.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to strengthen the paper.

point-by-point responses
  1. Referee: §3 (Data Construction): The central claim that HRPL-to-LRPL synthesis preserves defect-type consistency and produces idiomatic, compilable LRPL pairs is asserted without any quantitative validation (e.g., defect-type agreement rate, human idiomaticity scores, or comparison to native LRPL bugs). This step supplies all supervision for the subsequent adaptation stage; without such checks the reported Pass@1 lifts (31.32%→48.65%, 1.67%→11.97%) cannot be confidently attributed to cross-lingual transfer rather than data volume or ordering artifacts.

    Authors: We agree that the current manuscript lacks explicit quantitative validation for the synthesis step. In the revised version, we will add a dedicated subsection in §3 reporting defect-type agreement rates (via automated static analysis matching bug patterns between HRPL and synthesized LRPL pairs), human idiomaticity scores from a pilot evaluation on 100 samples, and direct comparisons of synthesized defect distributions against native LRPL bugs from Defects4Ruby. These additions will provide stronger evidence that performance gains stem from cross-lingual transfer rather than artifacts. revision: yes

  2. Referee: §4.1 (Experimental Setup): Baseline implementations, hyper-parameter matching, and statistical significance tests for the Pass@1 and compilation-rate improvements are not described in sufficient detail. Without these, it is impossible to determine whether the gains over the listed strong baselines are reproducible or merely reflect differences in training regime.

    Authors: We acknowledge that §4.1 requires more implementation detail. We will expand this section to include all hyper-parameter values, exact baseline configurations (with code references or prompt templates), and statistical significance results using paired tests (e.g., McNemar's test) and bootstrap confidence intervals on the Pass@1 and compilation-rate metrics to demonstrate reproducibility (a sketch of the exact McNemar computation follows these responses). revision: yes

  3. Referee: §4.3 (Ablation Studies): The ablations demonstrate necessity of each stage, yet they do not isolate the contribution of synthesis quality itself (e.g., by comparing against randomly translated or non-idiomatic pairs). This leaves open whether the curriculum ordering or the sheer volume of synthesized data drives the results.

    Authors: The existing ablations isolate the curriculum stages. To further isolate synthesis quality, we will add new ablation experiments in the revised §4.3 that compare our synthesized pairs against randomly translated equivalents and non-idiomatic variants, quantifying their impact on final Pass@1 to clarify the role of synthesis quality versus data volume or ordering. revision: yes
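
For the paired significance testing promised in response 2, an exact McNemar test on per-bug Pass@1 outcomes needs only the two discordant counts. A self-contained sketch (the counts in the example are invented for illustration):

    from math import comb

    def mcnemar_exact(b: int, c: int) -> float:
        # Exact two-sided McNemar test on paired pass/fail outcomes.
        # b = bugs only the baseline fixes; c = bugs only the new
        # system fixes. Under H0 each discordant bug is a fair coin.
        n, k = b + c, min(b, c)
        if n == 0:
            return 1.0
        p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
        return min(1.0, p)

    # Hypothetical: 4 bugs fixed only by the baseline, 21 only by HELO-APR.
    print(f"p = {mcnemar_exact(4, 21):.5f}")  # p = 0.00091

Bootstrap confidence intervals on the Pass@1 difference follow the same pattern: resample bugs with replacement and recompute the paired difference.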

Circularity Check

0 steps flagged

No circularity: empirical results from external benchmarks

full rationale

The paper presents an empirical method (data synthesis from HRPL to LRPL followed by curriculum learning) whose performance claims are measured on independent benchmarks (xCodeEval, Defects4Ruby) rather than any internal derivation, equation, or fitted parameter that reduces to the method's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described framework or results. The synthesis assumption is an unverified modeling choice, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard machine-learning assumptions about fine-tuning and transfer rather than new invented entities or many explicit free parameters.

axioms (2)
  • domain assumption: LLMs can be effectively fine-tuned on synthesized cross-lingual code repair data
    Invoked in the description of the training stages and data construction.
  • domain assumption: Repair knowledge can be aligned across programming languages through curriculum learning
    Central premise of the second stage of the framework.

pith-pipeline@v0.9.0 · 5677 in / 1463 out tokens · 56744 ms · 2026-05-10T06:29:21.507469+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 8 canonical work pages

  1. [1]

    Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. 2025. SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement. arXiv:2410.20285 [cs.AI] https://arxiv.org/abs/2410.20285

  2. [2]

    Razan Baltaji, Saurabh Pujar, Martin Hirzel, Louis Mandel, Luca Buratti, and Lav R Varshney. 2025. Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study. Transactions on Machine Learning Research (2025).

  3. [3]

    Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. 2021. TFix: Learning to fix coding errors with a text-to-text transformer. In International Conference on Machine Learning. PMLR, 780–791.

  4. [4]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv:2403.17134 [cs.SE] https://arxiv.org/abs/2403.17134

  5. [5]

    Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael Greenberg, Abhinav Jangda, and Arjun Guha. 2024. Knowledge transfer from high-resource to low-resource programming languages for code LLMs. Proceedings of the ACM on Programming Languages 8, OOPSLA2 (2024), 677–708.

  6. [6]

    Meghdad Dehghan, Mohammadreza Saeidi, Rohit Dandamudi, Jie JW Wu, Fatemeh H Fard, and Gema Rodríguez-Pérez. Defects4Ruby: Benchmarking and Analyzing Bug Detection and Repair for Ruby Using Language Models. https://jie-jw-wu.github.io/assets/ICPC_2025_RENE.pdf

  8. [8]

    Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. 2022. VulRepair: a T5-based automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 935–947.

  9. [9]

    Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, and Omer Tripp. 2024. A deep dive into large language models for automated bug localization and repair. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1471–1493.

  10. [10]

    Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering. 12–23.

  11. [11]

    Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173.

  12. [12]

    Sathvik Joel, Jie Wu, and Fatemeh Fard. 2024. A survey on LLM-based code generation for low-resource and domain-specific programming languages. ACM Transactions on Software Engineering and Methodology (2024).

  13. [13]

    Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2023. xCodeEval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv preprint arXiv:2303.03004 (2023).

  15. [15]

    Jiaolong Kong, Mingfei Cheng, Xiaofei Xie, Shangqing Liu, Xiaoning Du, and Qi Guo. 2024. ContrastRepair: Enhancing conversation-based automated program repair via contrastive test case pairs. arXiv preprint arXiv:2403.01971 (2024).

  16. [16]

    Wenqiang Luo, Jacky Wai Keung, Boyang Yang, Jacques Klein, Tegawendé F Bissyandé, Haoye Tian, and Bach Le. 2025. Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent Refinement. arXiv preprint arXiv:2503.22512 (2025).

  17. [17]

    Wenqiang Luo, Jacky Wai Keung, Boyang Yang, He Ye, Claire Le Goues, Tegawendé F Bissyandé, Haoye Tian, and Bach Le. 2026. When fine-tuning LLMs meets data privacy: An empirical study of federated learning in LLM-based program repair. ACM Transactions on Software Engineering and Methodology 35, 3 (2026), 1–46.

  18. [18]

    Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2025. Alibaba LingmaAgent: Improving automated issue resolution via comprehensive repository exploration. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 238–249.

  19. [19]

    Farhad Pourpanah, Moloud Abdar, Yuxuan Luo, Xinlei Zhou, Ran Wang, Chee Peng Lim, Xi-Zhao Wang, and QM Jonathan Wu. 2022. A review of generalized zero-shot learning methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2022), 4051–4070.

  20. [20]

    Weishi Wang, Yue Wang, Steven Hoi, and Shafiq Joty. 2023. Towards low-resource automatic program repair with meta-learning and pretrained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 6954–6968.

  21. [21]

    Zhipeng Wang, Tieke He, Ruoyu Zhao, and Tao Zheng. 2025. Exploration and Improvement of Capabilities of LLMs in Code Refinement Task. International Journal of Software & Informatics 15, 2 (2025).

  22. [22]

    Kyle Wong, Alfonso Amayuelas, Liangming Pan, and William Yang Wang. 2025. Investigating the transferability of code repair for low-resource programming languages. In Findings of the Association for Computational Linguistics: NAACL 2025. 3410–3432.

  23. [23]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.

  24. [24]

    Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971.

  25. [25]

    Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831.

  26. [26]

    Boyang Yang, Zijian Cai, Fengling Liu, Bach Le, Lingming Zhang, Tegawendé F Bissyandé, Yang Liu, and Haoye Tian

  27. [27]

    A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications.arXiv preprint arXiv:2506.23749(2025)

  28. [28]

    Boyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F Bissyandé, and Shunfu Jin. 2024. CREF: An LLM-based conversational software repair framework for programming tutors. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 882–894.

  29. [29]

    Boyang Yang, Haoye Tian, Jiadong Ren, Hongyu Zhang, Jacques Klein, Tegawendé Bissyandé, Claire Le Goues, and Shunfu Jin. 2025. MORepair: Teaching LLMs to repair code via multi-objective fine-tuning. ACM Transactions on Software Engineering and Methodology (2025).

  30. [30]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

  31. [31]

    He Ye, Aidan ZH Yang, Chang Hu, Yanlin Wang, Tao Zhang, and Claire Le Goues. 2025. Adversarial Reasoning for Repair Based on Inferred Program Intent. arXiv preprint arXiv:2505.13008 (2025).

  32. [32]

    Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Aaron Quigley, Yuyu Luo, Tianqi Luo, Gelareh Mohammadi, Qinghua Lu, and Liming Zhu. 2024. DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models. arXiv preprint arXiv:2411.01606 (2024).

  33. [33]

    Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin

  34. [34]

    InProceedings of the 31st ACM SIGSOFT international symposium on software testing and analysis

    CIRCLE: Continual repair across programming languages. InProceedings of the 31st ACM SIGSOFT international symposium on software testing and analysis. 678–690

  35. [35]

    Jialu Zhang, José Pablo Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2024. PyDex: Repairing bugs in introductory Python assignments using LLMs. Proceedings of the ACM on Programming Languages 8, OOPSLA1 (2024), 1100–1124.

  37. [37]

    Jipeng Zhang, Jianshu Zhang, Yuanzhe Li, Renjie Pi, Rui Pan, Runtao Liu, Zheng Ziqiang, and Tong Zhang. 2025. Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language. In Findings of the Association for Computational Linguistics: ACL 2025. 10865–10882

  38. [38]

    Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A survey of learning-based automated program repair. ACM Transactions on Software Engineering and Methodology 33, 2 (2023), 1–69.

  39. [39]

    Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2024. APPT: Boosting automated patch correctness prediction via fine-tuning pre-trained models. IEEE Transactions on Software Engineering 50, 3 (2024), 474–494.

  40. [40]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.