TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
Pith reviewed 2026-05-15 10:18 UTC · model grok-4.3
The pith
Correctness is not a reliable proxy for execution efficiency in LLM-based code translation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACE shows that efficiency must be measured separately from correctness: 23.5 percent of functionally correct LLM translations display pronounced runtime inefficiency, the top correctness model does not lead in speed, and smaller open-source models can outperform larger ones on time efficiency.
What carries the argument
The TRACE benchmark of 1,000 efficiency-critical tasks, each augmented with stress tests that reveal efficiency degradations overlooked by small-scale correctness tests.
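To make the mechanism concrete, here is a minimal sketch (ours, not the paper's harness) of how a stress test catches a degradation that a small correctness test misses:

```python
# Minimal sketch (not TRACE's harness): a translation that is correct on a
# small test but quadratic, caught only when the input is scaled up.
import time

def dedupe_translated(items):
    """Correct but quadratic: 'x not in out' scans a list at every step."""
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

# Small-scale correctness test: passes instantly and reveals nothing.
assert dedupe_translated([1, 2, 2, 3]) == [1, 2, 3]

# Stress test: scale the input and watch wall-clock time grow superlinearly.
for n in (1_000, 5_000, 20_000):
    data = list(range(n))
    start = time.perf_counter()
    dedupe_translated(data)
    print(f"n={n:>6}: {time.perf_counter() - start:.3f}s")
```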
Load-bearing premise
The 1,000 tasks and their stress tests represent the efficiency degradations that would appear in real production code.
What would settle it
A study in which all correct translations from the evaluated LLMs match or exceed the runtime performance of human reference implementations under the same stress tests.
original abstract
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.
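To ground the dominant category, here is a hypothetical construct mismatch of the kind the taxonomy names (our example, not one drawn from the benchmark): a source program's double-ended queue translated literally into a Python list, functionally identical but with O(n) front insertions.

```python
# Hypothetical construct mismatch (illustrative, not from TRACE): a source
# program's deque translated into a plain Python list. Both translations
# are functionally correct; only their scaling differs.
from collections import deque
import time

def reverse_stream_list(items):
    out = []                       # mismatch: list.insert(0, ...) is O(n)
    for x in items:
        out.insert(0, x)
    return out

def reverse_stream_deque(items):
    out = deque()                  # idiomatic target: appendleft is O(1)
    for x in items:
        out.appendleft(x)
    return list(out)

data = list(range(50_000))
for fn in (reverse_stream_list, reverse_stream_deque):
    start = time.perf_counter()
    result = fn(data)
    elapsed = time.perf_counter() - start
    assert result == data[::-1]    # identical observable behavior
    print(f"{fn.__name__}: {elapsed:.3f}s")
```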
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the TRACE benchmark, consisting of 1,000 efficiency-critical tasks in C++, Java, and Python, each augmented with stress tests to evaluate the execution efficiency of code translations produced by LLMs. Based on an evaluation of 28 LLMs, it claims that functional correctness is not a reliable indicator of efficiency, as evidenced by the correctness leader (Claude-4-think) achieving only mid-level time efficiency and being outperformed by smaller open-source models such as Qwen2.5-Coder-14B-Instruct. The paper further reports that 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). It concludes that inference-time prompt strategies provide only modest improvements, indicating a lack of intrinsic efficiency awareness in current LLMs.
Significance. Should the findings prove robust upon detailed examination of the methods and data, this work would be significant for the software engineering and AI communities. It establishes efficiency as a critical and previously overlooked dimension in LLM-based code translation, beyond the common focus on correctness. The creation of a specialized benchmark with stress tests designed to expose efficiency degradations offers a valuable resource for future research. The empirical demonstration that smaller models can surpass larger ones in efficiency and that nearly a quarter of correct translations are inefficient provides actionable insights for improving LLM code generation practices.
major comments (1)
- [Abstract] The abstract presents precise quantitative results, including the 23.5% rate of pronounced inefficiency and its breakdown into algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%), as well as the relative performance of specific models like Claude-4-think and Qwen2.5-Coder-14B-Instruct. However, it contains no description of the experimental methodology, such as the definition and measurement of execution efficiency, the construction of the 1,000 tasks and associated stress tests, the criteria for identifying and categorizing inefficiencies, or the statistical analysis used. This omission is load-bearing because these details are necessary to substantiate the central claim that correctness is not a reliable proxy for efficiency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of TRACE for the software engineering and AI communities. We address the major comment on the abstract below and will revise the manuscript to improve clarity.
point-by-point responses
- Referee: [Abstract] As quoted in the major comment above: the abstract reports precise quantitative results (the 23.5% inefficiency rate and its breakdown, the relative performance of Claude-4-think and Qwen2.5-Coder-14B-Instruct) but gives no description of the experimental methodology needed to substantiate the central claim.
Authors: We agree that the abstract, in the interest of brevity, omits key methodological details, which are fully elaborated in the body of the manuscript (Sections 3 and 4). Execution efficiency is defined as wall-clock runtime on stress-test inputs designed to expose scalability issues; the 1,000 tasks were curated from competitive programming and real-world scenarios emphasizing efficiency-critical operations (e.g., loops, recursion, data structures); stress tests were generated by scaling input sizes; and inefficiency categories were derived from manual code inspection, with inter-annotator agreement reported. To address the concern directly, we will revise the abstract to incorporate a concise methodology summary (approximately 40 words) while preserving the quantitative findings. This revision will make the central claim more self-contained while staying within the venue's length constraints. revision: yes
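A minimal sketch of the measurement the response describes, assuming wall-clock timing over geometrically scaled inputs; the 4x size step, the flag threshold of 8, and all function names are our illustrative choices, not the paper's:

```python
# Minimal sketch of scaled-input timing (our assumptions, not TRACE's
# harness): time a candidate translation at geometrically growing input
# sizes and flag it when runtime grows much faster than the input does.
import time

def timed(fn, make_input, sizes=(1_000, 4_000, 16_000), flag_ratio=8.0):
    """Return per-size wall-clock times and a superlinear-growth flag.

    With 4x size steps, a linear-time function grows roughly 4x per step
    and a quadratic one roughly 16x; flag_ratio=8 splits the difference
    while tolerating timing noise.
    """
    times = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    flagged = any(t2 > flag_ratio * t1 for t1, t2 in zip(times, times[1:]))
    return times, flagged

# A linear baseline passes; a correct-but-quadratic variant gets flagged.
for name, fn in (("linear", sorted),
                 ("quadratic", lambda xs: [xs.index(x) for x in xs])):
    times, flagged = timed(fn, lambda n: list(range(n)))
    print(name, [f"{t:.4f}s" for t in times], "flagged:", flagged)
```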
Circularity Check
No significant circularity
full rationale
This is a direct empirical benchmark paper. The abstract describes construction of the TRACE dataset (1,000 tasks with stress tests) and reports observed performance statistics across 28 LLMs. No equations, derivations, fitted parameters, predictions that reduce to inputs, or load-bearing self-citations appear in the provided text. All claims are presented as direct measurement outcomes rather than derived results, so the evaluation chain contains no circular reductions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 1,000 selected tasks represent efficiency-critical translation scenarios across the three languages.
- domain assumption: Stress tests expose efficiency issues that standard functional tests miss.