A Multi-Language Perspective on the Robustness of LLM Code Generation
Large language models have recently gained significant traction, and their use now extends to code generation tasks. While this field has attracted considerable attention, testing and evaluating the robustness of code generation models remains underexplored. Previous studies have focused primarily on code generation models for Python, overlooking other widely used programming languages. In this work, we conduct a comprehensive comparative analysis of the robustness of several prominent code generation models and investigate whether robustness can be improved by repairing perturbed docstrings with an LLM. We further examine how performance varies across programming languages. To this end, we introduce perturbations in four key areas of the prompt: docstring, function name, syntax, and format, and we compile and release a dedicated dataset for this purpose. Our results show that all models consistently degrade under perturbations across all three languages, though the magnitude of the degradation varies with the language and perturbation type. Larger model size does not reliably improve robustness, and semantic perturbations prove at least as disruptive as syntactic ones. Our LLM-based docstring repair yields only marginal gains for simple perturbations and can degrade performance for semantic ones, highlighting the limits of prompt-level mitigation.
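As a rough illustration of what such prompt perturbations might look like, here is a minimal Python sketch. The function name `perturb_prompt` and the specific operators (an adjacent-character swap in the docstring, renaming the declared function, collapsing blank lines) are assumptions for illustration only, not the paper's actual operators, and they cover three of the four perturbation areas (docstring, function name, format), omitting syntax-level changes.

```python
import random
import re


def perturb_prompt(prompt: str, kind: str, seed: int = 0) -> str:
    """Return a perturbed copy of a code-generation prompt.

    Hypothetical operators, loosely matching three of the paper's
    four perturbation areas (docstring, function name, format).
    """
    rng = random.Random(seed)
    if kind == "docstring_typo":
        # Docstring noise: swap two adjacent characters inside the docstring.
        m = re.search(r'"""(.+?)"""', prompt, re.DOTALL)
        if m and len(m.group(1)) > 1:
            body = list(m.group(1))
            i = rng.randrange(len(body) - 1)
            body[i], body[i + 1] = body[i + 1], body[i]
            return prompt[:m.start(1)] + "".join(body) + prompt[m.end(1):]
    elif kind == "rename_function":
        # Replace the declared function name with an uninformative one.
        return re.sub(r"def \w+\(", "def f(", prompt, count=1)
    elif kind == "collapse_blank_lines":
        # Format-only change: remove blank lines without touching semantics.
        return re.sub(r"\n\s*\n", "\n", prompt)
    return prompt


prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
print(perturb_prompt(prompt, "rename_function"))
```

Presumably, in a setup like the paper's, the model is queried with both the original and the perturbed prompt and scored against the same functional-correctness tests, so any drop in pass rate is attributable to the perturbation alone.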
Forward citations
Cited by 7 Pith papers
- HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
  The HEJ-Robust benchmark shows that LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.
- HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
  LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
  Many reported failures in LLM-based code translation are false negatives caused by evaluation-pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments, rather than incorrect translations.
- Social Bias in LLM-Generated Code: Benchmark and Mitigation
  LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
  Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
- Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
  SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with an F1 of 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification...
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
  A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.