arxiv: 2409.19894 · v5 · submitted 2024-09-30 · 💻 cs.SE · cs.AI

TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

Pith reviewed 2026-05-23 20:53 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code translation · large language models · multi-agent systems · execution alignment · error localization · program repair · software maintenance

The pith

TransAGENT corrects errors in LLM code translations by using multi-agent fine-grained execution alignment to locate faulty blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TransAGENT, a multi-agent system that improves LLM-based code translation by identifying and fixing errors through detailed execution comparisons between source and target programs. Traditional learning-based methods are limited by scarce parallel training data, and even strong LLMs produce translations with syntax and semantic errors that limit their use in software maintenance. On a newly constructed benchmark of recent tasks, the work reports that alignment-based localization yields measurable gains in both translation accuracy and program repair. A sympathetic reader would care because reliable cross-language code migration remains a practical bottleneck in development workflows.

Core claim

TransAGENT is a novel multi-agent system that eliminates errors during LLM-based code translation. The main insight is to localize error-prone code blocks via fine-grained execution alignment between source and target code. Evaluated on a newly constructed benchmark of recent programming tasks to mitigate data leakage, TransAGENT outperforms the latest UniTrans by up to 33.3% in translation accuracy and achieves an average improvement of 56.7% over Agentless in program repair performance, with ablation studies and tests across LLMs confirming its effectiveness and generalizability.

What carries the argument

Fine-grained execution alignment between source and target code, performed by a multi-agent system to localize error-prone blocks.
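To make the idea concrete, here is a minimal sketch of block-level localization. It is not the paper's algorithm: the `BlockSnapshot` type, the `first_divergent_block` function, and the assumptions that blocks are already paired by a shared id and that variable names are already mapped between languages are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BlockSnapshot:
    """Variable values observed right after one code block finished running."""
    block_id: int            # hypothetical id shared by a source block and its translation
    values: dict[str, Any]   # variable name -> value after the block ran


def first_divergent_block(
    source_trace: list[BlockSnapshot],
    target_trace: list[BlockSnapshot],
) -> Optional[int]:
    """Return the id of the first aligned block where the two runs disagree.

    Assumes both traces were recorded on the same test input.
    Returns None when no difference is observed.
    """
    for src, tgt in zip(source_trace, target_trace):
        if src.block_id != tgt.block_id:
            # The runs visited different blocks at this step (control flow split).
            return tgt.block_id
        shared = src.values.keys() & tgt.values.keys()
        if any(src.values[name] != tgt.values[name] for name in shared):
            return src.block_id
    if len(source_trace) != len(target_trace):
        # One run stopped early; report the first block only the longer run reached.
        shorter = min(len(source_trace), len(target_trace))
        longer = source_trace if len(source_trace) > len(target_trace) else target_trace
        return longer[shorter].block_id
    return None


# Made-up traces: the translated block 2 computes a different result.
source_trace = [
    BlockSnapshot(0, {"total": 0}),
    BlockSnapshot(1, {"total": 6}),
    BlockSnapshot(2, {"result": 3}),
]
target_trace = [
    BlockSnapshot(0, {"total": 0}),
    BlockSnapshot(1, {"total": 6}),
    BlockSnapshot(2, {"result": 2}),
]
print(first_divergent_block(source_trace, target_trace))  # prints 2
```

The sketch compares only variables present in both snapshots, which is one way an alignment could hide a real difference: a value that is never observed on one side is never checked.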

If this is right

  • Translation accuracy rises by as much as 33.3 percent relative to the prior UniTrans method.
  • Program repair performance improves by an average of 56.7 percent compared with the Agentless baseline.
  • The gains hold when the underlying LLM is swapped, indicating broad applicability.
  • A fresh benchmark of recent tasks reduces the risk that reported numbers reflect memorized training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment technique could be applied to same-language code repair tasks where test coverage is sparse.
  • If alignment succeeds with partial executions, it may support migration of legacy systems that lack comprehensive test suites.
  • Integration with other agent workflows for code generation could create end-to-end pipelines for cross-language refactoring.

Load-bearing premise

Fine-grained execution alignment between source and target code can reliably localize error-prone blocks even without complete test suites and without creating alignment artifacts that hide real differences.

What would settle it

A set of translated programs where the alignment step marks a block as correct yet the block still produces wrong outputs on valid inputs, or marks an incorrect block while missing the actual error location.

Figures

Figures reproduced from arXiv: 2409.19894 by Hanlin Wang, Weitong Chen, Xin Peng, Yiling Lou, Zhenpeng Chen, Zhiqiang Yuan.

Figure 1: Example of Fixing Semantic Errors in Target Java Program
Figure 3: Workflow of TRANSAGENT
Figure 4: Source Python and Ground-truth Java Program of …
Figure 5: Prompts in Syntax Error Fixer
Figure 6: Prompts in Coder Aligner
Figure 7: Prompts in Semantic Patch Generation
Figure 8: Example of Mapping Results of TransMap and T…
Original abstract

Code translation transforms code between programming languages while preserving functionality, which is critical in software development and maintenance. While traditional learning-based code translation methods have limited effectiveness due to the lack of sufficient parallel training data, Large Language Models (LLMs) have recently advanced this field with their strong code generation and comprehension capabilities. However, code translated by LLMs still suffers from diverse quality issues, such as syntax and semantic errors. In this work, we propose TransAGENT, a novel multi-agent system that eliminates the errors during LLM-based code translation. The main insight of TransAGENT is to localize error-prone code blocks via fine-grained execution alignment between source and target code. We evaluate TransAGENT on a newly constructed benchmark of recent programming tasks to mitigate data leakage. TransAGENT outperforms the latest UniTrans by up to 33.3% in translation accuracy and achieves an average improvement of 56.7% over Agentless in program repair performance. We also conduct an ablation study and evaluate TransAGENT across different LLMs, demonstrating its effectiveness and strong generalizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TransAGENT, a multi-agent system for LLM-based code translation that localizes error-prone blocks via fine-grained execution alignment between source and target code. It constructs a new benchmark of recent programming tasks to reduce data leakage, reports up to 33.3% higher translation accuracy than UniTrans and 56.7% average improvement over Agentless on program repair, and includes ablation studies plus evaluations across multiple LLMs to demonstrate generalizability.

Significance. If the empirical gains hold under scrutiny, the work offers a practical mechanism for improving semantic fidelity in cross-language translation by grounding LLM outputs in execution traces rather than static analysis alone. The new benchmark construction is a constructive contribution for the field, and the multi-agent framing with explicit alignment could generalize to other code maintenance tasks.

major comments (3)
  1. [§3] §3 (Method), execution alignment procedure: the central claim that fine-grained alignment reliably localizes errors without complete test suites is not supported by a concrete algorithm or pseudocode; the description leaves open how partial traces are matched and whether alignment artifacts could mask semantic differences, which directly underpins the reported accuracy deltas.
  2. [§4.1] §4.1 (Benchmark), Table 1: the construction details for the new benchmark (task selection criteria, leakage mitigation steps, and test-suite coverage statistics) are insufficient to assess whether the 33.3% and 56.7% gains are robust or sensitive to post-hoc choices; no inter-rater agreement or leakage audit is reported.
  3. [§4.2] §4.2 (Results), accuracy and repair tables: the improvements are presented as point estimates without statistical significance tests, confidence intervals, or variance across random seeds; this weakens the claim that TransAGENT consistently outperforms the baselines.
minor comments (2)
  1. [Abstract] The abstract and §1 use “up to 33.3%” and “average improvement of 56.7%” without clarifying whether these are relative or absolute gains or on which exact metric subsets.
  2. [Figure 2] Figure 2 (agent workflow) would benefit from explicit labeling of the execution-alignment step and data flow between agents.
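The first minor comment matters because the same pair of accuracies reads very differently under the two conventions. The short sketch below shows the arithmetic; the accuracy values are invented for illustration and are not taken from the paper, and the function names are hypothetical.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in percentage points between two accuracies given in percent."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline accuracy."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical accuracies, chosen only to show how far the two readings diverge.
baseline_acc = 60.0   # percent, made-up baseline
improved_acc = 80.0   # percent, made-up result

print(f"absolute: {absolute_gain(baseline_acc, improved_acc):.1f} points")  # 20.0 points
print(f"relative: {relative_gain(baseline_acc, improved_acc):.1f} %")       # 33.3 %
```

Under these made-up numbers, a 20-point absolute gain and a 33.3% relative gain describe the same result, which is why the metric convention needs to be stated.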

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the method, benchmark, and evaluation. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Method), execution alignment procedure: the central claim that fine-grained alignment reliably localizes errors without complete test suites is not supported by a concrete algorithm or pseudocode; the description leaves open how partial traces are matched and whether alignment artifacts could mask semantic differences, which directly underpins the reported accuracy deltas.

    Authors: We agree that a formal algorithmic description would improve clarity and reproducibility. In the revised manuscript we will add pseudocode (as a new Algorithm 1 in §3) that explicitly specifies the trace-matching procedure for partial executions, the similarity metric used, and safeguards against masking semantic differences (e.g., by requiring both syntactic and semantic equivalence checks on aligned blocks). revision: yes

  2. Referee: [§4.1] §4.1 (Benchmark), Table 1: the construction details for the new benchmark (task selection criteria, leakage mitigation steps, and test-suite coverage statistics) are insufficient to assess whether the 33.3% and 56.7% gains are robust or sensitive to post-hoc choices; no inter-rater agreement or leakage audit is reported.

    Authors: We acknowledge the need for greater transparency. The revised §4.1 and Table 1 will include: (i) explicit task-selection criteria (problems posted after 2023 on LeetCode/Codeforces with at least three test cases), (ii) leakage-mitigation steps (timestamp filtering plus manual overlap checks against common pre-training corpora), (iii) test-suite coverage statistics (average number of tests per task and branch coverage), and (iv) results of a leakage audit together with inter-rater agreement (Cohen’s κ) for any manual verification steps. revision: yes

  3. Referee: [§4.2] §4.2 (Results), accuracy and repair tables: the improvements are presented as point estimates without statistical significance tests, confidence intervals, or variance across random seeds; this weakens the claim that TransAGENT consistently outperforms the baselines.

    Authors: We agree that statistical rigor is required. The revised §4.2 will report: (i) paired statistical significance tests (Wilcoxon signed-rank) with p-values, (ii) 95% confidence intervals computed via bootstrap resampling, and (iii) standard deviation across five random seeds for both translation accuracy and program-repair metrics. revision: yes
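As a rough illustration of what the third response promises, the sketch below computes a percentile bootstrap interval over paired per-task differences and a standard deviation across seeds, using only the Python standard library. All data are hypothetical, the `bootstrap_ci` helper is an assumption of this sketch, and the Wilcoxon signed-rank test is omitted because it is not in the standard library.

```python
import random
import statistics


def bootstrap_ci(
    diffs: list[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-task outcomes (1 = translation passed all tests, 0 = failed).
system_pass = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
baseline_pass = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
diffs = [float(s - b) for s, b in zip(system_pass, baseline_pass)]

low, high = bootstrap_ci(diffs)
print(f"mean difference: {statistics.fmean(diffs):.2f}")
print(f"95% CI: [{low:.2f}, {high:.2f}]")

# Hypothetical accuracies from five seeds, to show the spread the referee asks for.
seed_accuracies = [0.71, 0.69, 0.73, 0.70, 0.72]
print(f"std across seeds: {statistics.stdev(seed_accuracies):.3f}")
```

Pairing the outcomes by task, rather than resampling each system independently, keeps task difficulty from inflating the interval.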

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents an empirical multi-agent system (TransAGENT) whose core contribution is fine-grained execution alignment for error localization in LLM code translation. Evaluation relies on a newly constructed benchmark and direct comparisons to external baselines (UniTrans, Agentless) with reported accuracy deltas. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The derivation chain consists of system design followed by external benchmarking; all performance claims are falsifiable against independent test suites and do not reduce to internal definitions or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical system paper. No free parameters, mathematical axioms, or invented physical entities are invoked; the contribution is an engineered workflow whose correctness is asserted via benchmark results.


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 7.0

    Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...

  2. uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

    cs.CR 2026-05 unverdicted novelty 6.0

    uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.

  3. Project-Level C-to-Rust Translation via Pointer Knowledge Graphs

    cs.SE 2025-10 unverdicted novelty 6.0

    PtrTrans builds a Pointer Knowledge Graph with points-to flows, struct abstractions, and Rust annotations to guide LLMs toward project-level C-to-Rust translations that cut unsafe code by 99.9% and raise functional co...

  4. Neural Code Translation of Legacy Code: APL to C#

    cs.SE 2026-05 unverdicted novelty 5.0

    Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.

  5. Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

    cs.SE 2026-05 unverdicted novelty 5.0

    Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.

  6. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 5.0

    A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.

  7. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 6 Pith papers · 2 internal anchors

  1. [1]

    Migrating monoliths to microservices-based customizable multi-tenant cloud-native apps

    Sindre Grønstøl Haugeland, Phu Hong Nguyen, Hui Song, and Franck Chauvel. Migrating monoliths to microservices-based customizable multi-tenant cloud-native apps. In 47th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2021, Palermo, Italy, September 1-3, 2021 , pages 170–177. IEEE, 2021

  2. [2]

    Transforming monolithic applications to microservices with mono2micro

    Transforming monolithic applications to microservices with mono2micro. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3–3, 2021

  3. [3]

    Legacy web application modernization by generating a rest service layer

    Roberto Rodriguez Echeverria, Fernando Macias, Victor Manuel Pavon, Jose Maria Conejero, and Fernando Sanchez Figueroa. Legacy web application modernization by generating a rest service layer. IEEE Latin America Transactions, 13(7):2379–2383, 2015

  4. [4]

    Mahdi Fahmideh, Farhad Daneshgar, Ghassan Beydoun, and Fethi A. Rabhi. Challenges in migrating legacy software systems to the cloud an empirical study. CoRR, abs/2004.10724, 2020

  5. [5]

    CARGO: ai-guided dependency analysis for migrating monolithic applications to microservices architecture

    Vikram Nitin, Shubhi Asthana, Baishakhi Ray, and Rahul Krishna. CARGO: ai-guided dependency analysis for migrating monolithic applications to microservices architecture. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, October 10-14, 2022, pages 20:1–20:12. ACM, 2022

  6. [6]

    Unsupervised translation of programming languages

    Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020

  7. [7]

    Leveraging automated unit tests for unsupervised code translation

    Baptiste Rozière, Jie Zhang, François Charton, Mark Harman, Gabriel Synnaeve, and Guillaume Lample. Leveraging automated unit tests for unsupervised code translation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  8. [8]

    Code translation with compiler representations

    Marc Szafraniec, Baptiste Rozière, Hugh Leather, Patrick Labatut, François Charton, and Gabriel Synnaeve. Code translation with compiler representations. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  9. [9]

    Summarize and generate to back-translate: Unsupervised translation of programming languages

    Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Summarize and generate to back-translate: Unsupervised translation of programming languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 1520–1534. Association f...

  10. [10]

    Yiqing Xie, Atharva Naik, Daniel Fried, and Carolyn P. Rosé. Data augmentation for code translation with comparable corpora and multiple references. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 13725–13739. Association for Computational Linguistics, 2023

  11. [11]

    Lost in translation: A study of bugs introduced by large language models while translating code

    Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, and et al. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024 , pages 82:1–82:13. ACM, 2024

  12. [12]

    Exploring and unleashing the power of large language models in automated code translation

    Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li. Exploring and unleashing the power of large language models in automated code translation. Proc. ACM Softw. Eng., 1(FSE):1585–1608, 2024

  13. [13]

    Reasoning runtime behavior of a program with llm: How far are we? arXiv e-prints, 2024

    Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, and Xin Xia. Reasoning runtime behavior of a program with llm: How far are we? arXiv e-prints, 2024

  14. [14]

    Large language model-based agents for software engineering: A survey, 2024

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey, 2024

  15. [15]

    Transmap: Pinpointing mistakes in neural code translation

    Bo Wang, Ruishi Li, Mingkai Li, and Prateek Saxena. Transmap: Pinpointing mistakes in neural code translation. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023 , pages 999–1011. ACM, 2023

  16. [16]

    deepseek-coder-6.7b instruct. 2023

  17. [17]

    minimumArrayLength. 2024.01

  18. [18]

    minOperations. 2024.03

  19. [19]

    minOrAfterOperations. 2024.01

  20. [20]

    Basic block distribution analysis to find periodic behavior and simulation points in applications

    T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, pages 3–14, 2001

  21. [21]

    Gamma: Revisiting template-based automated program repair via mask prediction

    Quanjun Zhang, Chunrong Fang, Tongke Zhang, Bowen Yu, Weisong Sun, and Zhenyu Chen. Gamma: Revisiting template-based automated program repair via mask prediction. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023, pages 535–547. IEEE, 2023

  22. [22]

    Copiloting the copilots: Fusing large language models with completion engines for automated program repair

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023 , ...

  23. [23]

    The plastic surgery hypothesis in the era of large language models

    Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. The plastic surgery hypothesis in the era of large language models. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023, pages 522–534. IEEE, 2023

  24. [24]

    Less training, more repairing please: revisiting automated program repair via zero-shot learning

    Chunqiu Steven Xia and Lingming Zhang. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14-18, 2022 , pages 959–971. ACM, 2022

  25. [25]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Novemb...

  26. [26]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  27. [27]

    Towards better chain-of-thought prompting strategies: A survey

    Zihan Yu, Liang He, Zhen Wu, Xinyu Dai, and Jiajun Chen. Towards better chain-of-thought prompting strategies: A survey. CoRR, abs/2310.04959, 2023

  28. [28]

    DOBF: A deobfuscation pre-training objective for programming languages

    Marie-Anne Lachaux, Baptiste Rozière, Marc Szafraniec, and Guillaume Lample. DOBF: A deobfuscation pre-training objective for programming languages. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 14967–14979, 2021

  29. [29]

    Codexglue: A machine learning benchmark dataset for code understanding and generation

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, and et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021...

  30. [30]

    AVATAR: A parallel corpus for java-python program translation

    Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. AVATAR: A parallel corpus for java-python program translation. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 2268–

  31. [31]

    Association for Computational Linguistics, 2023

  32. [32]

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. Lexical statistical machine translation for language migration. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE’13, Saint Petersburg, Russian Federation, August 18-26, 2013, pages 651–654. ACM, 2013

  33. [33]

    Tree-to-tree neural networks for program translation

    Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings . OpenReview.net, 2018

  34. [34]

    Project codenet: A large-scale AI for code dataset for learning a diversity of coding tasks

    Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, and et al. Project codenet: A large-scale AI for code dataset for learning a diversity of coding tasks. CoRR, abs/2105.12655, 2021

  35. [35]

    Ming Zhu, Karthik Suresh, and Chandan K. Reddy. Multilingual code snippets training for program translation. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022...

  36. [36]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on N...

  37. [37]

    Llama-3-8B-Instruct. 2023

  38. [38]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297, 2020

  39. [39]

    Elements of survey sampling, volume 15

    Ravindra Singh and Naurang Singh Mangat. Elements of survey sampling, volume 15. Springer Science & Business Media, 2013

  40. [40]

    Mathematical statistics with applications

    Dennis Wackerly, William Mendenhall, and Richard L Scheaffer. Mathematical statistics with applications. Cengage Learning, 2014

  41. [41]

    An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers

    J. Richard Landis and Gary G. Koch. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33(2):363–374, 1977

  42. [42]

    cxgo: C to Go transpiler. 2024

  43. [43]

    https://github.com/mono/sharpen, 2020

    Sharpen. https://github.com/mono/sharpen, 2020

  44. [44]

    https://github.com/paulirwin/JavaToCSharp, 2024

    JavaToCSharp. https://github.com/paulirwin/JavaToCSharp, 2024

  45. [45]

    Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. Migrating code with statistical machine translation. In 36th International Conference on Software Engineering, ICSE ’14, Companion Proceedings, Hyderabad, India, May 31 - June 07, 2014, pages 544–547. ACM, 2014

  46. [46]

    Svetoslav Karaivanov, Veselin Raychev, and Martin T. Vechev. Phrase-based statistical translation of programming languages. In Onward! 2014, Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, part of SPLASH ’14, Portland, OR, USA, October 20-24, 2014, pages 173–184. ACM, 2014

  47. [47]

    Learning to generate pseudo-code from source code using statistical machine translation (T)

    Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. Learning to generate pseudo-code from source code using statistical machine translation (T). In 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015 , pages 574–584. IEEE Computer...

  48. [48]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, and et al. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021

  49. [49]

    Self-collaboration code generation via chatgpt

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. CoRR, abs/2304.07590, 2023

  50. [50]

    Evaluating and improving chatgpt for unit test generation

    Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proc. ACM Softw. Eng. , 1(FSE):1703–1726, 2024

  51. [51]

    Automated repair of programs from large language models

    Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. Automated repair of programs from large language models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023 , pages 1469–1481. IEEE, 2023

  52. [52]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023 , pages 1482–1494. IEEE, 2023

  53. [53]

    Toufique Ahmed and Premkumar T. Devanbu. Few-shot training llms for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, October 10-14, 2022 , pages 177:1–177:5. ACM, 2022

  54. [54]

    An empirical study on using large language models for multi-intent comment generation

    Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. An empirical study on using large language models for multi-intent comment generation. CoRR, abs/2304.11384, 2023

  55. [55]

    Ahead of time mutation based fault localisation using statistical inference

    Jinhan Kim, Gabin An, Robert Feldt, and Shin Yoo. Ahead of time mutation based fault localisation using statistical inference. In 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021, Wuhan, China, October 25-28, 2021, pages 253–263. IEEE, 2021

  56. [56]

    Metallaxis-fl: mutation-based fault localization

    Mike Papadakis and Yves Le Traon. Metallaxis-fl: mutation-based fault localization. Softw. Test. Verification Reliab., 25(5-7):605–628, 2015

  57. [57]

    FATOC: bug isolation based multi-fault localization by using OPTICS clustering

    Yonghao Wu, Zheng Li, Yong Liu, and Xiang Chen. FATOC: bug isolation based multi-fault localization by using OPTICS clustering. J. Comput. Sci. Technol., 35(5):979–998, 2020

  58. [58]

    Enhancing bug localization using phase-based approach

    Amr Mansour Mohsen, Hesham A. Hassan, Khaled Wassif, Ramadan Moawad, and Soha Makady. Enhancing bug localization using phase-based approach. IEEE Access, 11:35901–35913, 2023

  59. [59]

    Fast changeset-based bug localization with BERT

    Agnieszka Ciborowska and Kostadin Damevski. Fast changeset-based bug localization with BERT. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 946–957. ACM, 2022

  60. [60]

    Trobo: A novel deep transfer model for enhancing cross-project bug localization

    Ziye Zhu, Yu Wang, and Yun Li. Trobo: A novel deep transfer model for enhancing cross-project bug localization. In Knowledge Science, Engineering and Management - 14th International Conference, KSEM 2021, Tokyo, Japan, August 14-16, 2021, Proceedings, Part I, volume 12815 of Lecture Notes in Computer Science, pages 529–541. Springer, 2021

  61. [61]

    A preliminary evaluation of llm-based fault localization

    Sungmin Kang, Gabin An, and Shin Yoo. A preliminary evaluation of llm-based fault localization. CoRR, abs/2308.05487, 2023

  62. [62]

    Pruning dynamic slices with confidence

    Xiangyu Zhang, Neelam Gupta, and Rajiv Gupta. Pruning dynamic slices with confidence. In Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, pages 169–180. ACM, 2006

  63. [63]

    REPT: reverse debugging of failures in deployed software

    Weidong Cui, Xinyang Ge, Baris Kasikci, Ben Niu, Upamanyu Sharma, Ruoyu Wang, and Insu Yun. REPT: reverse debugging of failures in deployed software. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018, pages 17–32. USENIX Association, 2018

  64. [64]

    Shaping program repair space with existing patches and similar code

    Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018, Amsterdam, The Netherlands, July 16-21, 2018, pages 298–309. ACM, 2018

  65. [65]

    ARJA: automated repair of java programs via multi-objective genetic programming

    Yuan Yuan and Wolfgang Banzhaf. ARJA: automated repair of java programs via multi-objective genetic programming. IEEE Trans. Software Eng., 46(10):1040–1067, 2020

  66. [66]

    ASTOR: a program repair library for java (demo)

    Matias Martinez and Martin Monperrus. ASTOR: a program repair library for java (demo). In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016, pages 441–444. ACM, 2016

  67. [67]

    Precise condition synthesis for program repair

    Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. Precise condition synthesis for program repair. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 416–426. IEEE / ACM, 2017

  68. [68]

    Nopol: Automatic repair of conditional statement bugs in java programs

    Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clément, et al. Nopol: Automatic repair of conditional statement bugs in java programs. CoRR, abs/1811.04211, 2018

  69. [69]

    Ultra-large repair search space with automatically mined templates: The cardumen mode of astor

    Matias Martinez and Martin Monperrus. Ultra-large repair search space with automatically mined templates: The cardumen mode of astor. In Search-Based Software Engineering - 10th International Symposium, SSBSE 2018, Montpellier, France, September 8-9, 2018, Proceedings, volume 11036 of Lecture Notes in Computer Science, pages 65–86. Springer, 2018

  70. [70]

    Tbar: revisiting template-based automated program repair

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. Tbar: revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, Beijing, China, July 15-19, 2019, pages 31–42. ACM, 2019

  71. [71]

    Fixminer: Mining relevant fix patterns for automated program repair

    Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. Fixminer: Mining relevant fix patterns for automated program repair. Empir. Softw. Eng., 25(3):1980–2024, 2020

  72. [72]

    AVATAR: fixing semantic bugs with fix patterns of static analysis violations

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. AVATAR: fixing semantic bugs with fix patterns of static analysis violations. In 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019, Hangzhou, China, February 24-27, 2019, pages 456–467. IEEE, 2019

  73. [73]

    Sequencer: Sequence-to-sequence learning for end-to-end program repair

    Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. Sequencer: Sequence-to-sequence learning for end-to-end program repair. IEEE Trans. Software Eng., 47(9):1943–1959, 2021

  74. [74]

    Coconut: combining context-aware neural translation models using ensemble for program repair

    Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. Coconut: combining context-aware neural translation models using ensemble for program repair. In ISSTA ’20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020, pages 101–114. ACM, 2020

  75. [75]

    Tare: Type-aware neural program repair

    Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. Tare: Type-aware neural program repair. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1443–1455. IEEE, 2023

  76. [76]

    A survey of learning-based automated program repair

    Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol., 33(2):55:1–55:69, 2024

  77. [77]

    Pre-trained model-based automated software vulnerability repair: How far are we?

    Quanjun Zhang, Chunrong Fang, Bowen Yu, Weisong Sun, Tongke Zhang, and Zhenyu Chen. Pre-trained model-based automated software vulnerability repair: How far are we? IEEE Trans. Dependable Secur. Comput., 21(4):2507–2525, 2024

  78. [78]

    Fixing rust compilation errors using llms

    Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, and Aseem Rastogi. Fixing rust compilation errors using llms. CoRR, abs/2308.05177, 2023

  79. [79]

    Repair is nearly generation: Multilingual program repair with llms

    Harshit Joshi, José Pablo Cambronero Sánchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radicek. Repair is nearly generation: Multilingual program repair with llms. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Sympos...