Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Ali Jannesari; Bin Lei; Caiwen Ding; Chunhua Liao; Dunzhi Zhou; Le Chen; Nuo Xu; Pei-Hung Lin; Rajeev Thakur; Winson Chen

arxiv: 2512.03086 · v2 · pith:H6F3GTQGnew · submitted 2025-11-29 · 💻 cs.PL · cs.AI · cs.SE

Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

classification 💻 cs.PL · cs.AI · cs.SE

keywords code · data · translation · beyond · dialogues · functional · generation · like

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
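The Questioner-Solver loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `Dialogue`, `refine_translation`, `toy_solver`, `toy_questioner`, and `toy_check` are hypothetical, the two "LLMs" are hard-coded stubs, and the compiler and unit-test step is replaced by a simple string check. The sketch only shows the general shape: a Solver proposes a translation, an external check produces feedback, a Questioner turns that feedback into the next prompt, and every turn is recorded as dialogue data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Dialogue:
    """One generated training sample: the recorded turns plus a verification flag."""
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)
    verified: bool = False


def refine_translation(
    source: str,
    solver: Callable[[str], str],
    questioner: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Dialogue:
    """Run a Questioner-Solver loop until the check passes or rounds run out."""
    dialogue = Dialogue()
    prompt = f"Translate this code:\n{source}"
    for _ in range(max_rounds):
        dialogue.turns.append(("questioner", prompt))
        candidate = solver(prompt)
        dialogue.turns.append(("solver", candidate))
        ok, feedback = check(candidate)  # stands in for compiler + unit tests
        if ok:
            dialogue.verified = True
            break
        prompt = questioner(candidate, feedback)  # feed the diagnostics back
    return dialogue


# --- Hypothetical stubs standing in for the two LLMs and the toolchain ---

def toy_solver(prompt: str) -> str:
    # First attempt drops a semicolon; the retry (prompt mentions "error") fixes it.
    if "error" in prompt:
        return "int add(int a, int b) { return a + b; }"
    return "int add(int a, int b) { return a + b }"


def toy_check(candidate: str) -> Tuple[bool, str]:
    # A real pipeline would compile the candidate and run unit tests here.
    if "a + b;" in candidate:
        return True, ""
    return False, "error: expected ';' before '}'"


def toy_questioner(candidate: str, feedback: str) -> str:
    return f"The compiler reported: {feedback}\nPlease repair:\n{candidate}"


fortran_source = "function add(a, b) result(c)\n  c = a + b\nend function"
sample = refine_translation(fortran_source, toy_solver, toy_questioner, toy_check)
for role, text in sample.turns:
    print(f"[{role}] {text}")
print("verified:", sample.verified)
```

In this toy run the first candidate fails the check, the diagnostic is folded into a follow-up prompt, and the second candidate passes, so the recorded dialogue has four turns and is marked verified. Keeping the failed attempt and its feedback in `turns` is what distinguishes such a sample from a plain source-target code pair.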

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM-Based Porting of Optimized C++ to CUDA Through Deoptimization and Reoptimization
cs.DC · 2026-06 · unverdicted · novelty 6.0

Deopt-Reopt workflow for LLM-based C++ to CUDA porting shows mixed performance gains over direct translation depending on kernel, model, and success rate, with no universal benefit.