pith. machine review for the scientific record.

arxiv: 2604.24712 · v1 · submitted 2026-04-27 · 💻 cs.SE


When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation


Pith reviewed 2026-05-08 02:54 UTC · model grok-4.3

classification 💻 cs.SE
keywords prompt engineering · LLM code generation · under-specification · code correctness · benchmark comparison · HumanEval · LiveCodeBench · robustness to prompt changes

The pith

Under-specification in prompts can increase the correctness of code generated by large language models on structurally rich tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how changes that make prompts less complete affect the accuracy of code produced by ten different large language models. On minimal benchmarks such as HumanEval, those changes tend to lower correctness, but the same changes applied to the richer LiveCodeBench produce almost no net drop and sometimes raise correctness instead. The difference arises because rich task descriptions contain repeated information across descriptions, constraints, examples, and input-output rules, so losing one part does not remove all necessary details. In some cases the loss actually helps by removing wording that steers the model toward a wrong solution it has seen before. The authors trace the helpful cases to three repeatable patterns and conclude that prompt structure itself, not just model size, determines how sensitive the output is to missing details.
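The buffering effect described here can be illustrated with a toy sketch. The section names and the deletion "mutation" below are illustrative assumptions, not the paper's actual mutation operators: deleting one section of a structurally rich task description leaves its information partly recoverable from the remaining sections.

```python
# Toy illustration of structural redundancy in a rich task description.
# Section names and the deletion "mutation" are hypothetical, not the
# paper's actual mutation operators.

def apply_underspecification(prompt: dict, section: str) -> dict:
    """Simulate an under-specification mutation by dropping one section."""
    return {k: v for k, v in prompt.items() if k != section}

rich_prompt = {
    "description": "Return the k largest elements of nums in descending order.",
    "constraints": "1 <= k <= len(nums); elements may repeat.",
    "examples": "top_k([5, 1, 9], k=2) -> [9, 5]",
    "io": "Input: a list of ints and an int k. Output: a list of ints.",
}

mutated = apply_underspecification(rich_prompt, "constraints")

# The explicit constraint is gone, but the example and I/O sections still
# carry enough signal to recover the intended behavior.
assert "constraints" not in mutated
assert "k=2" in mutated["examples"]
```

On a minimal HumanEval-style prompt, by contrast, the dropped section would often be the only place the detail appeared.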

Core claim

Robustness to prompt under-specification is not a fixed property of the model. On minimal-specification benchmarks the same mutations that remove details reduce correctness, yet on LiveCodeBench the mutations produce near-zero net change because losses in one part of the description are offset by gains when misleading lexical or structural cues are broken. Manual review of the improved cases identifies three consistent mechanisms: disruption of over-fitted terminology, removal of misleading constraints, and elimination of spurious identifier triggers. The study therefore shows that structurally rich task descriptions can both buffer against under-specification and, in identifiable cases, let under-specification improve correctness.

What carries the argument

The interaction between prompt under-specification mutations and structural redundancy across descriptions, constraints, examples, and I/O conventions in task statements.

If this is right

  • Structurally rich task descriptions substantially reduce the negative impact of under-specification on code correctness.
  • Certain prompt mutations can improve correctness by breaking misleading lexical or structural cues that trigger incorrect retrieval.
  • Three repeatable categories of prompt change—disrupting over-fitted terms, removing misleading constraints, and eliminating spurious identifiers—produce positive effects.
  • Prompt writers can use the identified categories to create descriptions that are more robust to small wording variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Engineers building code-generation tools may benefit from deliberately adding redundant phrasing and varied examples to user-facing prompts.
  • Benchmark suites for evaluating LLM robustness should include both minimal and richly specified tasks rather than relying on HumanEval-style problems alone.
  • The same cue-breaking mechanism could be tested in non-code domains such as mathematical problem solving or legal document drafting where descriptions also contain repeated constraints.

Load-bearing premise

The particular wording changes tested here capture the kinds of under-specification that real users introduce when they write prompts for code generation.

What would settle it

Applying the identical set of under-specification mutations to a new benchmark that also contains high structural redundancy and checking whether the rate of correctness improvements remains comparable to the rate observed on LiveCodeBench.
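A minimal sketch of the per-task comparison such a replication would run, assuming one boolean pass/fail result per task under the original and mutated prompts (the data shapes are assumptions, not the paper's evaluation harness):

```python
# Compare per-task correctness under original vs. mutated prompts.
# Input shapes are assumed: one boolean per task (pass/fail).

def mutation_effect(orig: list[bool], mut: list[bool]) -> dict:
    n = len(orig)
    net_delta = (sum(mut) - sum(orig)) / n  # net change in pass rate
    improved = sum(m and not o for o, m in zip(orig, mut))
    degraded = sum(o and not m for o, m in zip(orig, mut))
    return {
        "net_delta": net_delta,
        "improvement_rate": improved / n,
        "degradation_rate": degraded / n,
    }

# HumanEval-like pattern: the mutation only degrades.
print(mutation_effect([True, True, True, False], [True, False, True, False]))
# LiveCodeBench-like pattern: a gain offsets a loss, so net delta is zero
# even though individual tasks flipped in both directions.
print(mutation_effect([True, False, True, False], [False, True, True, False]))
```

The key quantity is the improvement rate, not just the net delta: a near-zero net delta with nonzero improvement and degradation rates is exactly the counterbalancing pattern the paper reports for LiveCodeBench.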

Figures

Figures reproduced from arXiv: 2604.24712 by Amal Akli, Maxime Cordy, Mike Papadakis, Yves Le Traon.

Figure 1. A LiveCodeBench example where removing a constraint
Figure 2. Pass@1 delta under LV mutations on HumanEval
Figure 3. Normalized last-layer last-token attention distributed across prompt regions for HumanEval (Signature, Description,
Figure 4. Example of an LV mutation improving performance by neutralizing a domain-specific vocabulary cue.
Original abstract

Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language prompts, particularly under-specification, can substantially reduce code correctness; however, these findings are largely based on minimal-specification benchmarks such as HumanEval and MBPP, where limited structural redundancy may exaggerate sensitivity. In this exploratory study, we investigate how prompt structure, task complexity, and specification richness interact with LLM robustness to prompt mutations. We evaluate 10 different models across HumanEval and the structurally richer LiveCodeBench. Our results reveal that robustness is not a fixed property of LLMs but is highly dependent on prompt structure: the same under-specification mutations that degrade performance on HumanEval have near-zero net effect on LiveCodeBench due to redundancy across descriptions, constraints, examples, and I/O conventions. Surprisingly, we also find that prompt mutations can improve correctness. In LiveCodeBench, under-specification often breaks misleading lexical or structural cues that trigger incorrect retrieval-based solution strategies, leading to correctness improvements that counterbalance degradations. Manual analysis identifies consistent mechanisms behind these improvements, including the disruption of over-fitted terminology, removal of misleading constraints, and elimination of spurious identifier triggers. Overall, our study shows that structurally rich task descriptions can substantially mitigate the negative effects of under-specification and, in some cases, even enhance correctness. We outline categories of prompt modifications that positively influence the behavior of LLM code generation, offering practical insights for writing robust prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an exploratory empirical study on how prompt under-specification mutations affect the correctness of LLM-generated code. It evaluates 10 models on HumanEval (minimal specification) versus the structurally richer LiveCodeBench, applying the same mutations across both. Results show degradation on HumanEval but near-zero net effect on LiveCodeBench due to redundancy in descriptions, constraints, examples, and I/O conventions. The study also reports that mutations can improve correctness on LiveCodeBench, attributing this to disruption of misleading cues, with mechanisms (disruption of over-fitted terminology, removal of misleading constraints, elimination of spurious identifier triggers) identified via manual analysis. The paper concludes that richer task structures mitigate negative effects and offers categories of beneficial prompt modifications.

Significance. If the quantitative deltas and explanatory mechanisms hold, the work demonstrates that LLM robustness to prompt variation is not intrinsic but depends on specification richness, with direct implications for prompt engineering practices in code generation. The contrast between benchmarks provides a useful lens on when under-specification harms versus helps, and the multi-model scope strengthens generalizability. The identification of improvement mechanisms, if rigorously validated, supplies actionable categories beyond the usual 'more specification is better' advice.

major comments (2)
  1. [Results (manual mechanism analysis)] The manual analysis of mechanisms behind correctness improvements (disruption of over-fitted terminology, removal of misleading constraints, elimination of spurious identifier triggers) is presented without pre-registered coding categories, blinded annotation, or inter-annotator agreement. Because the same mutations produce opposite net effects across benchmarks, this analysis is load-bearing for the claim that improvements arise specifically from breaking misleading cues rather than random variation; post-hoc interpretation risks undermining the explanatory component of the robustness argument.
  2. [Evaluation Methodology and Results] The evaluation description does not report the number of tasks or prompt variants per benchmark, nor any statistical controls, significance tests, or confidence intervals for the performance deltas (e.g., the 'near-zero net effect' on LiveCodeBench or the counterbalancing improvements). These details are necessary to assess whether the reported patterns are reliable or could be driven by small samples or model-specific variance.
minor comments (2)
  1. [Abstract] The abstract states that 10 models were used but does not name them or point to the methods section; adding this information would improve immediate readability.
  2. [Results] Figure or table captions could more explicitly link the plotted deltas to the specific mutation categories to help readers connect quantitative results to the later mechanism discussion.
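The uncertainty reporting the second major comment asks for could take the form of a paired bootstrap over per-task deltas. The sketch below uses simulated pass/fail data at HumanEval's scale of 164 tasks; the data and shapes are assumptions, not the paper's results:

```python
# Paired bootstrap confidence interval for a pass@1 delta, the kind of
# uncertainty reporting the referee requests. Data are simulated.
import random

def bootstrap_delta_ci(orig, mut, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile CI for the mean per-task delta, resampling tasks."""
    rng = random.Random(seed)
    n = len(orig)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample task indices
        deltas.append(sum(mut[i] - orig[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Simulated paired outcomes for 164 tasks: the mutation flips 10 passes
# to failures, so the true mean delta is -10/164.
orig = [1] * 120 + [0] * 44
mut = [1] * 110 + [0] * 54
lo, hi = bootstrap_delta_ci(orig, mut)
print(f"95% CI for pass@1 delta: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would distinguish a real degradation from sampling noise; a "near-zero net effect" claim is strongest when the interval is both narrow and centered on zero.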

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments on our exploratory study. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Results (manual mechanism analysis)] The manual analysis of mechanisms behind correctness improvements (disruption of over-fitted terminology, removal of misleading constraints, elimination of spurious identifier triggers) is presented without pre-registered coding categories, blinded annotation, or inter-annotator agreement. Because the same mutations produce opposite net effects across benchmarks, this analysis is load-bearing for the claim that improvements arise specifically from breaking misleading cues rather than random variation; post-hoc interpretation risks undermining the explanatory component of the robustness argument.

    Authors: We acknowledge that the manual analysis is qualitative and post-hoc, which is a limitation given its role in explaining the counterbalancing improvements. As an exploratory study, the analysis was performed by the authors to surface recurring patterns across improvement cases rather than to test pre-defined hypotheses. In the revision we will expand the description of the analysis process (including the number of cases examined and concrete examples for each mechanism), explicitly state the absence of pre-registration and inter-annotator agreement as a limitation, and frame the identified categories as hypotheses suitable for future confirmatory work. This preserves the exploratory contribution while addressing concerns about explanatory rigor. revision: partial

  2. Referee: [Evaluation Methodology and Results] The evaluation description does not report the number of tasks or prompt variants per benchmark, nor any statistical controls, significance tests, or confidence intervals for the performance deltas (e.g., the 'near-zero net effect' on LiveCodeBench or the counterbalancing improvements). These details are necessary to assess whether the reported patterns are reliable or could be driven by small samples or model-specific variance.

    Authors: We agree that clearer reporting of sample sizes and uncertainty measures is needed. The manuscript already states that we evaluate 10 models on the full HumanEval set (164 tasks) and the LiveCodeBench set using the same set of under-specification mutations, but we will add an explicit subsection that reports the exact task counts, the number of prompt variants generated per task, and appropriate statistical summaries (e.g., confidence intervals or non-parametric tests) for the reported deltas. These additions will allow readers to evaluate the reliability of the near-zero net effect and the counterbalancing improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on fixed benchmarks with no derivations or fitted predictions

full rationale

This is a purely empirical exploratory study that measures LLM code-generation correctness on fixed, external benchmarks (HumanEval and LiveCodeBench) under controlled prompt mutations. No equations, first-principles derivations, or predictive models are claimed; results consist of direct performance deltas and post-hoc manual categorization of observed improvements. The manual analysis is interpretive and could carry selection bias (as noted in the skeptic load-bearing attack), but this is a methodological limitation, not a circular reduction of any claimed result to its own inputs. No self-citations are used to justify uniqueness theorems or ansatzes, and no parameters are fitted to a subset then renamed as predictions. The central claims rest on observable benchmark outcomes rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical exploratory study; central claims rest on experimental design choices rather than new theory. No free parameters are fitted to data, and no new entities are postulated.

axioms (2)
  • domain assumption The applied prompt mutations constitute valid tests of under-specification
    The study defines and applies specific mutations to simulate under-specification without external validation of their representativeness.
  • domain assumption LiveCodeBench descriptions contain sufficient redundancy to buffer against under-specification
    This is invoked to explain the near-zero net effect and occasional improvements.

pith-pipeline@v0.9.0 · 5598 in / 1396 out tokens · 46775 ms · 2026-05-08T02:54:27.335547+00:00 · methodology

