Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
Pith reviewed 2026-05-08 03:01 UTC · model grok-4.3
The pith
A finetuned small model detects defective task descriptions for LLM code generation with an F1 of 0.804, outperforming far larger general-purpose models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469) and Claude Sonnet 4 (F1 = 0.518), and it generalizes to unseen Under-Specification defects in the original benchmark descriptions. The robustness of LLMs to task-description defects depends primarily on the type of defect and the characteristics of the task description, rather than on model capacity, with Under-Specification defects being the most severe. Benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience.
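As a refresher on the headline numbers, here is a minimal sketch of how F1 and MCC are computed, using a binary defective-vs-clean simplification with illustrative labels rather than the paper's data:

```python
# Minimal sketch: computing F1 and MCC for a defect detector.
# Binary simplification (defective vs. clean); labels are illustrative.
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # 1 = description is defective
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # detector's predictions

print(f"F1  = {f1_score(y_true, y_pred):.3f}")           # 0.800
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.3f}")   # 0.600
```

Unlike F1, MCC accounts for true negatives and collapses to zero for a detector that flags everything, which is why the MCC gap between SpecValidator (0.745) and the baselines (0.281, 0.359) is the more telling figure.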
What carries the argument
SpecValidator, a lightweight classifier obtained by parameter-efficient finetuning of a small model to identify Lexical Vagueness, Under-Specification, and Syntax-Formatting defects in task descriptions.
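The review does not reproduce the paper's base model or training configuration; the sketch below shows the general parameter-efficient recipe the description implies, using Hugging Face `peft` with LoRA adapters. The base checkpoint, LoRA rank, hyperparameters, and the `train_ds`/`eval_ds` datasets are all illustrative assumptions.

```python
# Hedged sketch of the recipe the paper describes (parameter-efficient
# finetuning of a small model as a defect classifier). The base checkpoint,
# LoRA rank, and hyperparameters are assumptions, not the paper's setup.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

LABELS = ["none", "lexical_vagueness", "under_specification", "syntax_formatting"]

base = "microsoft/deberta-v3-small"  # stand-in for the unnamed small model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=len(LABELS))

# LoRA freezes the base weights and trains small low-rank adapters instead.
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS, r=16,
                                         lora_alpha=32, lora_dropout=0.1))
model.print_trainable_parameters()  # typically ~1% of the base parameters

# `train_ds` / `eval_ds`: hypothetical tokenized datasets of
# (task description, defect label) pairs; construction omitted here.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specvalidator", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-4),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```

The appeal of this recipe is operational: adapter weights are a few megabytes, so a detector of this kind could sit in front of a code-generation pipeline at negligible cost.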
If this is right
- Under-Specification defects cause the most severe impact on LLM code generation correctness.
- LLM robustness to defective descriptions varies more with defect type and description structure than with model size.
- Task descriptions with richer context lead to more reliable code generation from LLMs.
- Small finetuned models can outperform large general-purpose models in detecting prompt defects for code tasks.
Where Pith is reading between the lines
- Integrating such a detector into coding tools could flag and suggest fixes for poor prompts before code is generated.
- Efforts to improve task description quality may deliver larger gains in code accuracy than scaling model size alone.
- The observed generalization suggests similar lightweight classifiers could be trained for prompt defects in other LLM applications.
Load-bearing premise
The three defect categories and the labeled examples used for finetuning are representative of the defects that real users actually produce when writing task descriptions for LLM code generation.
What would settle it
Collect task descriptions written by real users for code generation tasks, run SpecValidator to label defects, generate code with an LLM from both defective and clean versions, and check whether detected defects predict measurably lower code correctness.
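A concrete sketch of that experiment follows; `spec_validator`, `generate_code`, `run_tests`, and `dataset` are hypothetical interfaces, not artifacts released with the paper.

```python
# Hypothetical sketch of the settling experiment. `spec_validator`,
# `generate_code`, `run_tests`, and `dataset` are assumed interfaces.
from statistics import mean

def pass_rate(items):
    """items: (task_description, test_suite) pairs; returns fraction passing."""
    return mean(run_tests(generate_code(desc), suite) for desc, suite in items)

# Split real user-written tasks by the detector's verdict.
flagged = [(d, t) for d, t in dataset if spec_validator(d) != "none"]
clean   = [(d, t) for d, t in dataset if spec_validator(d) == "none"]

# The claim is supported if flagged descriptions yield a measurably
# lower pass rate than clean ones.
print("flagged pass rate:", pass_rate(flagged))
print("clean pass rate:  ", pass_rate(clean))
```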
Original abstract
Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects, Lexical Vagueness, Under-Specification and Syntax-Formatting on 3 benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs in task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpecValidator, a lightweight classifier obtained via parameter-efficient fine-tuning of a small language model, to detect three categories of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) in task descriptions for LLM-based code generation. It reports evaluation results on three benchmarks of varying structure, with SpecValidator achieving F1 = 0.804 and MCC = 0.745 while outperforming GPT-5-mini (F1 = 0.469, MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518, MCC = 0.359). The work further claims that the model generalizes to detect unseen Under-Specification defects in the original (real) benchmark descriptions and analyzes how defect type and task-description characteristics affect LLM robustness, finding Under-Specification most severe and richer-context benchmarks like LiveCodeBench more resilient.
Significance. If the ground-truth labels are reliable and representative, the results would be significant for the software engineering and LLM application communities. A small, efficient detector that outperforms much larger general-purpose models on defect detection, combined with evidence of generalization to real unseen defects and the finding that robustness depends more on defect type than model scale, would offer both a practical tool for prompt validation and actionable guidance on benchmark and task-description design. The emphasis on structured descriptions for reliable code generation aligns with growing interest in prompt engineering and LLM reliability.
major comments (3)
- [§3] §3 (Methodology, defect labeling subsection): The paper provides no description of the labeling protocol used to create ground-truth labels for Lexical Vagueness, Under-Specification, and Syntax-Formatting, including the number of annotators, their qualifications, the exact criteria applied, or any inter-annotator agreement statistics. Because the headline F1/MCC numbers, the outperformance claims, and the generalization result all presuppose that these labels are accurate and representative of real user defects, the absence of validation metrics is load-bearing for the central empirical claims.
- [§4.3] §4.3 (Generalization to unseen defects): The claim that SpecValidator detects 'unknown' Under-Specification defects in the original benchmark descriptions requires a concrete account of how these defects were identified and labeled as Under-Specification in the test set without prior annotation. If the identification was performed post-hoc by the authors or via the same model, this undermines the assertion of true generalization to unseen issues rather than circular or post-hoc interpretation.
- [§4.1] §4.1 (Baseline evaluation): The comparison against GPT-5-mini and Claude Sonnet 4 does not specify whether these models received identical input formatting, few-shot examples, or output constraints as those used during fine-tuning and inference of SpecValidator. Any difference in prompting strategy could confound the reported performance gap and weaken the conclusion that the small fine-tuned model is inherently superior for this task.
minor comments (2)
- [Abstract] The abstract and results sections refer to 'GPT-5-mini'; confirm the exact model identifier and version used, as this name does not correspond to a currently released model and may cause confusion for readers.
- [§4] Table or figure captions for the benchmark results should explicitly state the number of test instances per defect category and per benchmark to allow readers to assess the statistical reliability of the F1 and MCC figures.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of methodological transparency that we have addressed through revisions. We respond to each major comment below.
Point-by-point responses
Referee: [§3] §3 (Methodology, defect labeling subsection): The paper provides no description of the labeling protocol used to create ground-truth labels for Lexical Vagueness, Under-Specification, and Syntax-Formatting, including the number of annotators, their qualifications, the exact criteria applied, or any inter-annotator agreement statistics. Because the headline F1/MCC numbers, the outperformance claims, and the generalization result all presuppose that these labels are accurate and representative of real user defects, the absence of validation metrics is load-bearing for the central empirical claims.
Authors: We agree that a clear description of the labeling process is necessary to support the validity of our empirical results. In the revised manuscript we have expanded the defect labeling subsection in §3 to specify that labels were created by two co-authors with expertise in software engineering and NLP. We now include the complete annotation guidelines for each defect type in a new appendix. We did not collect multiple independent annotations for every instance and therefore cannot report inter-annotator agreement statistics; this limitation is now explicitly noted in the revised text together with the criteria used. These additions directly address the concern about the reliability of the ground-truth labels. revision: yes
Referee: [§4.3] §4.3 (Generalization to unseen defects): The claim that SpecValidator detects 'unknown' Under-Specification defects in the original benchmark descriptions requires a concrete account of how these defects were identified and labeled as Under-Specification in the test set without prior annotation. If the identification was performed post-hoc by the authors or via the same model, this undermines the assertion of true generalization to unseen issues rather than circular or post-hoc interpretation.
Authors: We have clarified the procedure in the revised §4.3. The original benchmark descriptions carried no prior defect annotations. For the generalization test we performed an independent manual review of a sample of these descriptions using the same explicit criteria later documented in the appendix. Instances identified as Under-Specification by this review were held out and used solely as an evaluation set; SpecValidator had never seen them during training. We now provide concrete examples of the identified defects and the decision rules applied, making clear that the process was manual and independent of the model itself. revision: yes
Referee: [§4.1] §4.1 (Baseline evaluation): The comparison against GPT-5-mini and Claude Sonnet 4 does not specify whether these models received identical input formatting, few-shot examples, or output constraints as those used during fine-tuning and inference of SpecValidator. Any difference in prompting strategy could confound the reported performance gap and weaken the conclusion that the small fine-tuned model is inherently superior for this task.
Authors: We confirm that every model, including the two large language models, was evaluated under an identical zero-shot regime. The same input formatting, task instruction, and output constraint (return one of the three defect labels or 'none') were used for all systems. No few-shot examples were supplied to any model. The exact prompt template is now reproduced in the appendix, and §4.1 has been updated to state the uniformity of the evaluation protocol explicitly. revision: yes
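For concreteness, an illustrative prompt of the kind the rebuttal describes; the paper's actual template is said to live in its appendix and is not reproduced here.

```python
# Illustrative zero-shot prompt of the kind the rebuttal describes; the
# paper's exact template is in its appendix and is not reproduced here.
PROMPT_TEMPLATE = """You are reviewing a task description for code generation.
Classify it into exactly one category:
- lexical_vagueness
- under_specification
- syntax_formatting
- none

Respond with the category label only.

Task description:
{description}
"""

def build_prompt(description: str) -> str:
    # The same formatting is used for SpecValidator and both LLM baselines,
    # with no few-shot examples, per the stated evaluation protocol.
    return PROMPT_TEMPLATE.format(description=description)
```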
Circularity Check
No significant circularity in empirical evaluation
full rationale
The paper reports an empirical ML study: defect categories are defined, data is labeled, a small model is parameter-efficiently fine-tuned, and performance (F1/MCC) is measured on held-out benchmarks plus generalization tests on original descriptions. No equations, derivations, or self-citations reduce any claimed result to its own inputs by construction. The central claims rest on direct experimental measurements against external benchmarks rather than tautological re-labeling or fitted-parameter renaming. This is the standard non-circular pattern for supervised detection papers.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The three defect categories (Lexical Vagueness, Under-Specification, Syntax-Formatting) are exhaustive and mutually exclusive for the purposes of detection.
- Domain assumption: Finetuning a small model on the provided labeled examples produces a detector whose decisions generalize beyond the training distribution.
Reference graph
Works this paper leans on
- [1] Anthropic. 2024. Claude 4 Model Card. Anthropic Documentation. https://www.anthropic.com
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 1877–1901.
- [4] Junkai Chen, Zhenhao Li, Xing Hu, and Xin Xia. 2026. NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations. ACM Trans. Softw. Eng. Methodol. 35, 4, Article 89 (March 2026), 20 pages. doi:10.1145/3745764
- [6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- [7] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. In Proceedings of the 12th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=KuPixIqPiq
- [8] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36.
- [9] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268.
- [10] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wenjie Zhang, Wenhu Chen, Kexin Bi, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
- [11] Asma Hamidi, Ahmed Khanfir, and Mike Papadakis. 2025. Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants. In IEEE International Conference on Software Testing, Verification and Validation, ICST 2025 Workshops (Naples, Italy). IEEE, 347–357. doi:10.1109/ICSTW64639.2025.10962508
- [12] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML). 2790–2799.
- [13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).
- [14] Binyuan Hui, Jian Yang, Zeyu Cui, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint (2024).
- [15] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL].
- [16] Naman Jain, Jiayi Han, Alex Gu, William Yang, Yiming Li, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974 (2024).
- [17]
- [18] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. A Survey on Large Language Models for Code Generation. ACM Transactions on Software Engineering and Methodology 35, 2 (2026), 1–72.
- [19]
- [20] Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2025. Structured Chain-of-Thought Prompting for Code Generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–23. doi:10.1145/3690635
- [21] Yujia Li et al. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
- [22] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36 (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-C...
- [23] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR).
- [24] Anton Lozhkov, Raymond Li, Vaibhav Chaudhary, et al. 2024. StarCoder2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173 (2024).
- [25] Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2149–2160. doi:10.1109/ICSE48619.2...
- [26] Brian W. Matthews. 1975. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta 405, 2 (1975), 442–451.
- [27] Alexandre Matton et al. 2024. On Leakage of Code Generation Evaluation Datasets. In Findings of the Association for Computational Linguistics: EMNLP. 13215–13223.
- [28] Bertrand Meyer. 1985. On Formalism in Specifications. IEEE Software 2, 1 (1985), 6–26. doi:10.1109/MS.1985.229776
- [29] Mistral AI. 2024. Codestral: Hello, World! Technical report. https://mistral.ai/news/codestral
- [30] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2332–2354.
- [31] OpenAI. 2025. GPT-5 Technical Report. https://openai.com. Accessed 2026.
- [32] Shunyu Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023).
- [33] Fazle Rabbi, Zishuo Ding, and Jinqiu Yang. 2025. A Multi-Language Perspective on the Robustness of LLM Code Generation. arXiv preprint arXiv:2504.19108 (2025). https://arxiv.org/abs/2504.19108
- [34] Martin Riddell, Ansong Ni, and Arman Cohan. 2024. Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models. In Proceedings of ACL. 14116–14137.
- [35] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Tan, Yossi Adi, Jingyu Liu, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
- [36] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design. In Proceedings of the International Conference on Learning Representations (ICLR).
- [37]
- [38] Mohammed Latif Siddiq, Simantika Dristi, Joanna C.S. Saha, and Joanna Santos. The Fault in our Stars: Quality Assessment of Code Generation Benchmarks. In Proceedings of the IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 201–212.
- [40] Zhao Tian, Junjie Chen, and Xiaofei Zhang. 2025. µFiX: Fixing Large Language Models' Specification Misunderstanding for Better Code Generation. In Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE). IEEE/ACM, 1514–1526.
- [41] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
- [42] Axel van Lamsweerde. 2009. Requirements Engineering: From System Goals to UML Models to Software Specifications. Wiley. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-EHEP000863.html
- [43] Andreas Vogelsang, Alexander Korn, Giovanna Broccia, Alessio Ferrari, Jannik Fischbach, and Chetan Arora. 2025. On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability. In Proceedings of the IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE/ACM.
- [44] Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. 2023. ReCode: Robustness Evaluation of Code Generation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 13234–13274.
- [45] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35.
- [46] Jie J.W. Wu and Fatemeh H. Fard. 2025. HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agents. ACM Transactions on Software Engineering and Methodology 34, 7 (2025). doi:10.1145/3715109
- [47] Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. EvoEval: Evolving Coding Benchmarks via LLM. In Proceedings of the Conference on Language Modeling (COLM).
- [48] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. 46595–46623.
- [49] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. 2024. PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (LAMPS).
- [50] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Bin...