Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
Pith reviewed 2026-05-08 03:01 UTC · model grok-4.3
The pith
A finetuned small model detects defective task descriptions for LLM code generation with an F1 of 0.804, outperforming far larger general-purpose models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469) and Claude Sonnet 4 (F1 = 0.518), and it generalizes to unseen Under-Specification defects in the original benchmark descriptions. The robustness of LLMs to task-description defects depends primarily on the type of defect and the characteristics of the task description, rather than on model capacity, with Under-Specification defects being the most severe. Benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience.
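As a refresher on the headline numbers, here is a minimal sketch of how F1 and MCC are computed, using a binary defective-vs-clean simplification with illustrative labels rather than the paper's data:

```python
# Minimal sketch: computing F1 and MCC for a defect detector.
# Binary simplification (defective vs. clean); labels are illustrative.
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # 1 = description is defective
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # detector's predictions

print(f"F1  = {f1_score(y_true, y_pred):.3f}")           # 0.800
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.3f}")   # 0.600
```

Unlike F1, MCC accounts for true negatives and collapses to zero for a detector that flags everything, which is why the MCC gap between SpecValidator (0.745) and the baselines (0.281, 0.359) is the more telling figure.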
What carries the argument
SpecValidator, a lightweight classifier obtained by parameter-efficient finetuning of a small model to identify Lexical Vagueness, Under-Specification, and Syntax-Formatting defects in task descriptions.
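The review does not reproduce the paper's base model or training configuration; the sketch below shows the general parameter-efficient recipe the description implies, using Hugging Face `peft` with LoRA adapters. The base checkpoint, LoRA rank, hyperparameters, and the `train_ds`/`eval_ds` datasets are all illustrative assumptions.

```python
# Hedged sketch of the recipe the paper describes (parameter-efficient
# finetuning of a small model as a defect classifier). The base checkpoint,
# LoRA rank, and hyperparameters are assumptions, not the paper's setup.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

LABELS = ["none", "lexical_vagueness", "under_specification", "syntax_formatting"]

base = "microsoft/deberta-v3-small"  # stand-in for the unnamed small model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=len(LABELS))

# LoRA freezes the base weights and trains small low-rank adapters instead.
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS, r=16,
                                         lora_alpha=32, lora_dropout=0.1))
model.print_trainable_parameters()  # typically ~1% of the base parameters

# `train_ds` / `eval_ds`: hypothetical tokenized datasets of
# (task description, defect label) pairs; construction omitted here.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specvalidator", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-4),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```

The appeal of this recipe is operational: adapter weights are a few megabytes, so a detector of this kind could sit in front of a code-generation pipeline at negligible cost.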
If this is right
- Under-Specification defects cause the most severe impact on LLM code generation correctness.
- LLM robustness to defective descriptions varies more with defect type and description structure than with model size.
- Task descriptions with richer context lead to more reliable code generation from LLMs.
- Small finetuned models can outperform large general-purpose models in detecting prompt defects for code tasks.
Where Pith is reading between the lines
- Integrating such a detector into coding tools could flag and suggest fixes for poor prompts before code is generated.
- Efforts to improve task description quality may deliver larger gains in code accuracy than scaling model size alone.
- The observed generalization suggests similar lightweight classifiers could be trained for prompt defects in other LLM applications.
Load-bearing premise
The three defect categories and the labeled examples used for finetuning are representative of the defects that real users actually produce when writing task descriptions for LLM code generation.
What would settle it
Collect task descriptions written by real users for code generation tasks, run SpecValidator to label defects, generate code with an LLM from both defective and clean versions, and check whether detected defects predict measurably lower code correctness.
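A concrete sketch of that experiment follows; `spec_validator`, `generate_code`, `run_tests`, and `dataset` are hypothetical interfaces, not artifacts released with the paper.

```python
# Hypothetical sketch of the settling experiment. `spec_validator`,
# `generate_code`, `run_tests`, and `dataset` are assumed interfaces.
from statistics import mean

def pass_rate(items):
    """items: (task_description, test_suite) pairs; returns fraction passing."""
    return mean(run_tests(generate_code(desc), suite) for desc, suite in items)

# Split real user-written tasks by the detector's verdict.
flagged = [(d, t) for d, t in dataset if spec_validator(d) != "none"]
clean   = [(d, t) for d, t in dataset if spec_validator(d) == "none"]

# The claim is supported if flagged descriptions yield a measurably
# lower pass rate than clean ones.
print("flagged pass rate:", pass_rate(flagged))
print("clean pass rate:  ", pass_rate(clean))
```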
Original abstract
Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects, Lexical Vagueness, Under-Specification and Syntax-Formatting on 3 benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs in task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpecValidator, a lightweight classifier obtained via parameter-efficient fine-tuning of a small language model, to detect three categories of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) in task descriptions for LLM-based code generation. It reports evaluation results on three benchmarks of varying structure, with SpecValidator achieving F1 = 0.804 and MCC = 0.745 while outperforming GPT-5-mini (F1 = 0.469, MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518, MCC = 0.359). The work further claims that the model generalizes to detect unseen Under-Specification defects in the original (real) benchmark descriptions and analyzes how defect type and task-description characteristics affect LLM robustness, finding Under-Specification most severe and richer-context benchmarks like LiveCodeBench more resilient.
Significance. If the ground-truth labels are reliable and representative, the results would be significant for the software engineering and LLM application communities. A small, efficient detector that outperforms much larger general-purpose models on defect detection, combined with evidence of generalization to real unseen defects and the finding that robustness depends more on defect type than model scale, would offer both a practical tool for prompt validation and actionable guidance on benchmark and task-description design. The emphasis on structured descriptions for reliable code generation aligns with growing interest in prompt engineering and LLM reliability.
major comments (3)
- [§3] §3 (Methodology, defect labeling subsection): The paper provides no description of the labeling protocol used to create ground-truth labels for Lexical Vagueness, Under-Specification, and Syntax-Formatting, including the number of annotators, their qualifications, the exact criteria applied, or any inter-annotator agreement statistics. Because the headline F1/MCC numbers, the outperformance claims, and the generalization result all presuppose that these labels are accurate and representative of real user defects, the absence of validation metrics is load-bearing for the central empirical claims.
- [§4.3] §4.3 (Generalization to unseen defects): The claim that SpecValidator detects 'unknown' Under-Specification defects in the original benchmark descriptions requires a concrete account of how these defects were identified and labeled as Under-Specification in the test set without prior annotation. If the identification was performed post-hoc by the authors or via the same model, this undermines the assertion of true generalization to unseen issues rather than circular or post-hoc interpretation.
- [§4.1] §4.1 (Baseline evaluation): The comparison against GPT-5-mini and Claude Sonnet 4 does not specify whether these models received identical input formatting, few-shot examples, or output constraints as those used during fine-tuning and inference of SpecValidator. Any difference in prompting strategy could confound the reported performance gap and weaken the conclusion that the small fine-tuned model is inherently superior for this task.
minor comments (2)
- [Abstract] The abstract and results sections refer to 'GPT-5-mini'; confirm the exact model identifier and version used, as this name does not correspond to a currently released model and may cause confusion for readers.
- [§4] Table or figure captions for the benchmark results should explicitly state the number of test instances per defect category and per benchmark to allow readers to assess the statistical reliability of the F1 and MCC figures.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of methodological transparency that we have addressed through revisions. We respond to each major comment below.
Point-by-point responses
Referee: [§3] §3 (Methodology, defect labeling subsection): The paper provides no description of the labeling protocol used to create ground-truth labels for Lexical Vagueness, Under-Specification, and Syntax-Formatting, including the number of annotators, their qualifications, the exact criteria applied, or any inter-annotator agreement statistics. Because the headline F1/MCC numbers, the outperformance claims, and the generalization result all presuppose that these labels are accurate and representative of real user defects, the absence of validation metrics is load-bearing for the central empirical claims.
Authors: We agree that a clear description of the labeling process is necessary to support the validity of our empirical results. In the revised manuscript we have expanded the defect labeling subsection in §3 to specify that labels were created by two co-authors with expertise in software engineering and NLP. We now include the complete annotation guidelines for each defect type in a new appendix. We did not collect multiple independent annotations for every instance and therefore cannot report inter-annotator agreement statistics; this limitation is now explicitly noted in the revised text together with the criteria used. These additions directly address the concern about the reliability of the ground-truth labels. revision: yes
Referee: [§4.3] §4.3 (Generalization to unseen defects): The claim that SpecValidator detects 'unknown' Under-Specification defects in the original benchmark descriptions requires a concrete account of how these defects were identified and labeled as Under-Specification in the test set without prior annotation. If the identification was performed post-hoc by the authors or via the same model, this undermines the assertion of true generalization to unseen issues rather than circular or post-hoc interpretation.
Authors: We have clarified the procedure in the revised §4.3. The original benchmark descriptions carried no prior defect annotations. For the generalization test we performed an independent manual review of a sample of these descriptions using the same explicit criteria later documented in the appendix. Instances identified as Under-Specification by this review were held out and used solely as an evaluation set; SpecValidator had never seen them during training. We now provide concrete examples of the identified defects and the decision rules applied, making clear that the process was manual and independent of the model itself. revision: yes
Referee: [§4.1] §4.1 (Baseline evaluation): The comparison against GPT-5-mini and Claude Sonnet 4 does not specify whether these models received identical input formatting, few-shot examples, or output constraints as those used during fine-tuning and inference of SpecValidator. Any difference in prompting strategy could confound the reported performance gap and weaken the conclusion that the small fine-tuned model is inherently superior for this task.
Authors: We confirm that every model, including the two large language models, was evaluated under an identical zero-shot regime. The same input formatting, task instruction, and output constraint (return one of the three defect labels or 'none') were used for all systems. No few-shot examples were supplied to any model. The exact prompt template is now reproduced in the appendix, and §4.1 has been updated to state the uniformity of the evaluation protocol explicitly. revision: yes
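For concreteness, an illustrative prompt of the kind the rebuttal describes; the paper's actual template is said to live in its appendix and is not reproduced here.

```python
# Illustrative zero-shot prompt of the kind the rebuttal describes; the
# paper's exact template is in its appendix and is not reproduced here.
PROMPT_TEMPLATE = """You are reviewing a task description for code generation.
Classify it into exactly one category:
- lexical_vagueness
- under_specification
- syntax_formatting
- none

Respond with the category label only.

Task description:
{description}
"""

def build_prompt(description: str) -> str:
    # The same formatting is used for SpecValidator and both LLM baselines,
    # with no few-shot examples, per the stated evaluation protocol.
    return PROMPT_TEMPLATE.format(description=description)
```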
Circularity Check
No significant circularity in empirical evaluation
full rationale
The paper reports an empirical ML study: defect categories are defined, data is labeled, a small model is parameter-efficiently fine-tuned, and performance (F1/MCC) is measured on held-out benchmarks plus generalization tests on original descriptions. No equations, derivations, or self-citations reduce any claimed result to its own inputs by construction. The central claims rest on direct experimental measurements against external benchmarks rather than tautological re-labeling or fitted-parameter renaming. This is the standard non-circular pattern for supervised detection papers.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The three defect categories (Lexical Vagueness, Under-Specification, Syntax-Formatting) are exhaustive and mutually exclusive for the purposes of detection.
- Domain assumption: Finetuning a small model on the provided labeled examples produces a detector whose decisions generalize beyond the training distribution.
Reference graph
Works this paper leans on
- [1] Anthropic. 2024. Claude 4 Model Card. Anthropic Documentation. https://www.anthropic.com
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 1877–1901.
- [4] Junkai Chen, Zhenhao Li, Xing Hu, and Xin Xia. 2026. NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations. ACM Trans. Softw. Eng. Methodol. 35, 4, Article 89 (March 2026), 20 pages. doi:10.1145/3745764
- [6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- [7] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. In Proceedings of the 12th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=KuPixIqPiq
- [8] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36.
- [9] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268.
- [10] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wenjie Zhang, Wenhu Chen, Kexin Bi, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
- [11] Asma Hamidi, Ahmed Khanfir, and Mike Papadakis. 2025. Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants. In IEEE International Conference on Software Testing, Verification and Validation, ICST 2025 Workshops (Naples, Italy). IEEE, 347–357. doi:10.1109/ICSTW64639.2025.10962508
- [12] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML). 2790–2799.
- [13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).
- [14] Binyuan Hui, Jian Yang, Zeyu Cui, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint (2024).
- [15] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL].
- [16] Naman Jain, Jiayi Han, Alex Gu, William Yang, Yiming Li, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974 (2024).
- [17]
- [18] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. A Survey on Large Language Models for Code Generation. ACM Transactions on Software Engineering and Methodology 35, 2 (2026), 1–72.
- [19]
- [20] Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2025. Structured Chain-of-Thought Prompting for Code Generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–23. doi:10.1145/3690635
- [21] Yujia Li et al. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
- [22] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36 (NeurIPS). https://proceedings.neurips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-C...
- [23] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR).
- [24] Anton Lozhkov, Raymond Li, Vaibhav Chaudhary, et al. 2024. StarCoder2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173 (2024).
- [25] Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2149–2160. doi:10.1109/ICSE48619.2...
- [26] Brian W. Matthews. 1975. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta 405, 2 (1975), 442–451.
- [27] Alexandre Matton et al. 2024. On Leakage of Code Generation Evaluation Datasets. In Findings of the Association for Computational Linguistics: EMNLP. 13215–13223.
- [28] Bertrand Meyer. 1985. On Formalism in Specifications. IEEE Software 2, 1 (1985), 6–26. doi:10.1109/MS.1985.229776
- [29] Mistral AI. 2024. Codestral: Hello, World! Technical report. https://mistral.ai/news/codestral
- [30] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2332–2354.
- [31] OpenAI. 2025. GPT-5 Technical Report. https://openai.com. Accessed 2026.
- [32] Shunyu Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590 (2023).
- [33] Fazle Rabbi, Zishuo Ding, and Jinqiu Yang. 2025. A Multi-Language Perspective on the Robustness of LLM Code Generation. arXiv preprint arXiv:2504.19108 (2025). https://arxiv.org/abs/2504.19108
- [34] Martin Riddell, Ansong Ni, and Arman Cohan. 2024. Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models. In Proceedings of ACL. 14116–14137.
- [35] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Tan, Yossi Adi, Jingyu Liu, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
- [36] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design. In Proceedings of the International Conference on Learning Representations (ICLR).
- [37]
- [38] Mohammed Latif Siddiq, Simantika Dristi, Joanna C.S. Saha, and Joanna Santos. The Fault in our Stars: Quality Assessment of Code Generation Benchmarks. In Proceedings of the IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 201–212.
- [40] Zhao Tian, Junjie Chen, and Xiaofei Zhang. 2025. µFiX: Fixing Large Language Models' Specification Misunderstanding for Better Code Generation. In Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE). IEEE/ACM, 1514–1526.
- [41] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
- [42] Axel van Lamsweerde. 2009. Requirements Engineering: From System Goals to UML Models to Software Specifications. Wiley. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-EHEP000863.html
- [43] Andreas Vogelsang, Alexander Korn, Giovanna Broccia, Alessio Ferrari, Jannik Fischbach, and Chetan Arora. 2025. On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability. In Proceedings of the IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE/ACM.
- [44] Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. 2023. ReCode: Robustness Evaluation of Code Generation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 13234–13274.
- [45] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35.
- [46] Jie J.W. Wu and Fatemeh H. Fard. 2025. HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agents. ACM Transactions on Software Engineering and Methodology 34, 7 (2025). doi:10.1145/3715109
- [47] Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. EvoEval: Evolving Coding Benchmarks via LLM. In Proceedings of the Conference on Language Modeling (COLM).
- [48] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. 46595–46623.
- [49] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. 2024. PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (LAMPS).
- [50] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Bin...