arxiv: 2512.03086 · v2 · pith:H6F3GTQGnew · submitted 2025-11-29 · 💻 cs.PL · cs.AI · cs.SE

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

classification 💻 cs.PL · cs.AI · cs.SE
keywords code, data, translation, beyond, dialogues, functional, generation, like

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline produces 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Based Porting of Optimized C++ to CUDA Through Deoptimization and Reoptimization

    cs.DC 2026-06 unverdicted novelty 6.0

    Deopt-Reopt workflow for LLM-based C++ to CUDA porting shows mixed performance gains over direct translation depending on kernel, model, and success rate, with no universal benefit.