arxiv: 2607.00711 · v1 · pith:Z5ESAJVZnew · submitted 2026-07-01 · 💻 cs.SE

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation

Pith reviewed 2026-07-02 08:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLMs · code generation · requirement clarification · ambiguity resolution · interactive benchmark · software engineering · AI4SE

The pith

LLMs that generate code well still struggle to clarify ambiguous requirements and get worse with more ambiguities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ClarifyCodeBench, an interactive benchmark built from real-world programming tasks that includes manual annotations of ambiguity types along with example clarification questions and answers. It evaluates six leading LLMs using two new metrics that score how efficiently and precisely models ask questions to resolve unclear parts of a task description. The results show that models strong at writing code from clear prompts do not automatically become good at spotting ambiguities, that extra reasoning steps improve final code correctness more than they improve clarification, and that clarification quality falls quickly once multiple ambiguities are present. This setup highlights a gap between one-shot code synthesis and the back-and-forth needed in actual software projects.

Core claim

ClarifyCodeBench supplies annotated real-world tasks, N ambiguity types, clarification questions, and ground-truth answers. Its two metrics are Turn-discounted Key Question Rate, which reduces credit for inefficient or redundant questions, and Optimal Round Adherence, which checks whether the model reaches a complete specification in the expected number of turns. Systematic tests of six state-of-the-art LLMs produce three findings: code-generation strength does not imply clarification strength, heavier reasoning improves code correctness but yields only marginal clarification gains, and clarification performance drops sharply as ambiguity density rises.
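
The exact formulas for the two metrics are not reproduced on this page. As a rough illustration only, the sketch below assumes one plausible reading: a key question matched at turn t earns a DCG-style credit of 1/log2(t + 2), normalized by the best achievable score, and a dialogue adheres to the optimal round count when it resolves every ambiguity in exactly the annotated number of turns. The function names, the discount, and the data layout are all assumptions made for this example, not the authors' definitions.

```python
import math


def turn_discounted_kqr(hit_per_turn, num_key_questions):
    """Illustrative turn-discounted key question rate (assumed form).

    hit_per_turn[t] is True when the question asked at turn t (0-indexed)
    matches an annotated key question that was not matched earlier.
    Misses and repeats score 0 and push later hits to more heavily
    discounted turns. Callers should pass at most num_key_questions hits.
    """
    if num_key_questions <= 0:
        raise ValueError("num_key_questions must be positive")
    # Assumed discount: 1 / log2(turn + 2), as in DCG-style ranking metrics.
    gained = sum(1.0 / math.log2(t + 2)
                 for t, hit in enumerate(hit_per_turn) if hit)
    # Best case: all key questions asked in the first turns, one per turn.
    ideal = sum(1.0 / math.log2(t + 2) for t in range(num_key_questions))
    return gained / ideal


def optimal_round_adherence(dialogues):
    """Illustrative ORA (assumed form): share of dialogues that resolve
    every ambiguity in exactly the annotated optimal number of turns.

    Each dialogue is a tuple (turns_used, optimal_turns, fully_resolved).
    """
    if not dialogues:
        raise ValueError("need at least one dialogue")
    on_target = sum(1 for used, optimal, resolved in dialogues
                    if resolved and used == optimal)
    return on_target / len(dialogues)


# Two key questions asked back to back versus after one wasted turn.
efficient = turn_discounted_kqr([True, True], num_key_questions=2)
wasteful = turn_discounted_kqr([False, True, True], num_key_questions=2)
ora = optimal_round_adherence([(2, 2, True), (3, 2, True), (2, 2, False)])
```

Under these assumptions the wasted first turn lowers the score even though both key questions are eventually asked, which is the behavior the summary attributes to the turn-discounted metric.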

What carries the argument

ClarifyCodeBench benchmark together with Turn-discounted Key Question Rate and Optimal Round Adherence metrics that quantify interactive clarification quality.

If this is right

  • Strong code generation ability does not guarantee effective requirement clarification.
  • Extra computational reasoning steps improve code correctness more than they improve ambiguity detection.
  • Clarification performance declines sharply once multiple ambiguities appear in a specification.
  • AI4SE research must move from static one-shot synthesis toward interactive elicitation methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes focused only on code completion may need to be supplemented with explicit clarification dialogue data.
  • Hybrid human-AI workflows could remain necessary for projects whose requirements contain many ambiguities.
  • Extending the benchmark to longer, open-ended conversations might expose further limits not captured by the current optimal-round metric.

Load-bearing premise

The manually chosen ambiguities and the two proposed metrics correctly capture what matters for clarification quality in real software engineering work.

What would settle it

A controlled study in which an LLM scores high on ClarifyCodeBench yet still produces many incorrect implementations when given the same ambiguous tasks by actual developers would undermine the claim that the benchmark measures relevant capability.

Figures

Figures reproduced from arXiv: 2607.00711 by Dongming Jin, Ge Li, Kechi Zhang, Yihong Dong, Yongmin Li, Zheng Fang, Zhi Jin.

Figure 1: An ambiguous code generation requirement.
Figure 2: Pass@1 (%) under four input settings: Complete Requirement (Full Req.), Ambiguous Requirement (Ambig. Req.), …
Figure 3: Distribution of ambiguity types in Clarify…
Figure 5: Overview of the evaluation protocol.
  4.1 Studied LLMs. We evaluate a broad set of advanced LLMs, covering both non-reasoning and reasoning models. The models in our study include GPT-4o [15] and GPT-5 [29] from OpenAI, Gemini-2.5-Flash [7] from Google, Claude-Sonnet-4.5 [2] from Anthropic, DeepSeek-V3.2 [25] from DeepSeek, and Qwen3-235B-A22B [37] from Qwen. These models have achieved strong results on a ra…
Figure 4: System prompt used in our interactive evaluation.
Figure 6: Hit rate across ambiguity types for non-thinking…
Figure 7: Hit rate across difficulty levels. D𝑛-H𝑚 indicates that a requirement contains 𝑛 ambiguities and the LLM successfully asks 𝑚 matched questions.
Figure 9: Wrong Answer Case.
  … leaves the kick semantics near the boundary partially underspecified, which can reduce the confidence of the model in applying a higher-level state abstraction. As a result, the same model falls back to a conservative literal simulation under the ambiguous requirement by explicitly maintaining the entire mutable grid, whereas under the full requirement it abstracts the problem as a 0-1…
Figure 10: Time Limit Exceeded Case.
  … study examines advanced LLMs under a unified prompting setting in order to preserve comparability across models. Therefore, the findings primarily reflect the clarification ability of base models rather than the end-to-end performance of agent systems. Extending the evaluation to agentic workflows is therefore a natural and important direction for future work. 6.2 Threats to Val…
Original abstract

Large Language Models have emerged as programming assistants. However, the efficacy of code generation is constrained by the quality of input requirements, which are frequently ambiguous, incomplete, or underspecified. While LLMs excel at one-shot code synthesis, their ability to proactively clarify intent remains underexplored, as a critical trait for robust software engineering. Existing benchmarks largely overlook this interactive bottleneck, assuming perfectly specified prompts that do not reflect the iterative nature of requirement elicitation. To bridge this gap, we introduce ClarifyCodeBench, a novel interactive benchmark for evaluating LLMs' capability in resolving requirement ambiguity. Constructed from real-world programming tasks, ClarifyCodeBench features high-quality manual annotations, including N unique ambiguity types, associated clarification questions, and corresponding ground-truth answers. Furthermore, we formalize two rigorous metrics to assess the interaction quality: Turn-discounted Key Question Rate, which penalizes inefficient questioning, and Optimal Round Adherence, which measures the precision of the elicitation process. We conduct a systematic evaluation of six state-of-the-art LLMs using ClarifyCodeBench. Our empirical results yield three critical insights: 1) Capability Decoupling: Strong code generation performance does not inherently translate to effective requirement clarification; 2) The Reasoning Paradox: While increased computational thinking enhances code correctness, it yields marginal gains in identifying ambiguities; 3) The Multi-ambiguity Ceiling: LLMs' clarification performance degrades sharply as the density of ambiguities increases, revealing a significant bottleneck in handling complex, real-world specifications. Our work underscores the necessity for future AI4SE research to transition from static synthesis to interactive elicitation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClarifyCodeBench, an interactive benchmark constructed from real-world programming tasks with high-quality manual annotations of ambiguity types, clarification questions, and ground-truth answers. It defines two metrics (Turn-discounted Key Question Rate and Optimal Round Adherence) and evaluates six state-of-the-art LLMs, reporting three insights: capability decoupling between code generation and clarification, a reasoning paradox where more computation aids correctness but not ambiguity detection, and a multi-ambiguity ceiling where performance degrades with higher ambiguity density.

Significance. If the benchmark construction, annotation validity, and metric soundness are rigorously established, the work would be significant for AI4SE by demonstrating that current LLMs have fundamental limitations in interactive requirement elicitation and by motivating a shift from static code synthesis to interactive clarification. The three insights, if supported, would provide actionable guidance on model weaknesses in handling real-world specifications.

major comments (2)
  1. [ClarifyCodeBench construction] The section describing ClarifyCodeBench construction asserts construction from real-world tasks with 'high-quality manual annotations' including N unique ambiguity types, associated questions, and ground-truth answers, but supplies no protocol details, annotator count, inter-annotator agreement, ambiguity identification method, or ground-truth validation procedure. This is load-bearing for the central claims, as the three insights (capability decoupling, reasoning paradox, multi-ambiguity ceiling) cannot be interpreted without evidence that the benchmark faithfully represents software engineering practice.
  2. [Evaluation] The evaluation section reports results on six LLMs and the three insights but provides no dataset size, statistical tests, or validation that the two proposed metrics correlate with actual software engineering outcomes. Without these, the reported effects cannot be distinguished from benchmark artifacts.
minor comments (2)
  1. [Abstract] The abstract uses the placeholder 'N' for the number of ambiguity types; replace with the actual count and ensure consistency throughout the manuscript.
  2. [Metrics] Ensure all metric definitions and any tables reporting per-model or per-ambiguity results include clear notation and units.
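
Major comment 1 asks for inter-annotator agreement. For two annotators assigning one ambiguity type per item, a common statistic is Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a minimal illustration of that computation; the label names are hypothetical placeholders, since this page does not list the benchmark's ambiguity types, and nothing here is taken from the authors' annotation protocol.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ambiguity-type labels from two annotators on four tasks.
annotator_1 = ["type_A", "type_B", "type_B", "type_A"]
annotator_2 = ["type_A", "type_B", "type_A", "type_A"]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```

With three of four labels matching and a chance agreement of 0.5, the example yields κ = 0.5. More than two annotators would call for a different statistic, such as Fleiss' kappa.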

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the transparency of the benchmark and evaluation.

  1. Referee: [ClarifyCodeBench construction] The section describing ClarifyCodeBench construction asserts construction from real-world tasks with 'high-quality manual annotations' including N unique ambiguity types, associated questions, and ground-truth answers, but supplies no protocol details, annotator count, inter-annotator agreement, ambiguity identification method, or ground-truth validation procedure. This is load-bearing for the central claims, as the three insights (capability decoupling, reasoning paradox, multi-ambiguity ceiling) cannot be interpreted without evidence that the benchmark faithfully represents software engineering practice.

    Authors: We agree that the absence of detailed annotation protocols limits the interpretability of the benchmark and the three insights. The submitted manuscript provides only a high-level description. In the revised version we will expand the construction section with the requested information: annotator count and expertise, inter-annotator agreement, the ambiguity identification procedure, and ground-truth validation steps. This addition will directly support claims about fidelity to real-world software engineering practice.
    Revision: yes

  2. Referee: [Evaluation] The evaluation section reports results on six LLMs and the three insights but provides no dataset size, statistical tests, or validation that the two proposed metrics correlate with actual software engineering outcomes. Without these, the reported effects cannot be distinguished from benchmark artifacts.

    Authors: We agree that the evaluation section requires greater statistical rigor and justification. The manuscript currently omits explicit dataset size and statistical tests. We will revise it to report the dataset size, include statistical significance tests for the observed differences, and add a discussion of how the two metrics align with established software engineering outcomes in the literature on requirement elicitation. A new empirical correlation study is beyond the scope of the current work, but the expanded discussion will address the concern about benchmark artifacts.
    Revision: partial
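
The rebuttal promises significance tests without naming one. For two models evaluated on the same tasks with pass/fail outcomes (as in a Pass@1 comparison), an exact McNemar test is one common choice. The sketch below assumes that setup; the function name and the example data are hypothetical, and it is not the test the authors have committed to.

```python
from math import comb


def mcnemar_exact(pass_a, pass_b):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    pass_a[i] and pass_b[i] say whether model A and model B solved task i.
    Only tasks where the two models disagree carry information.
    """
    if len(pass_a) != len(pass_b):
        raise ValueError("outcome lists must have equal length")
    only_a = sum(1 for a, b in zip(pass_a, pass_b) if a and not b)
    only_b = sum(1 for a, b in zip(pass_a, pass_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0  # the models never disagree, so there is no evidence
    # Under the null hypothesis each discordant task favours either model
    # with probability 1/2, so the smaller count follows Binomial(n, 0.5).
    smaller = min(only_a, only_b)
    tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2 ** discordant
    return min(1.0, 2.0 * tail)


# Hypothetical outcomes: model A solves 8 tasks that model B fails.
p_value = mcnemar_exact([True] * 8 + [True] * 2, [False] * 8 + [True] * 2)
```

Because the test conditions on discordant pairs, tasks that both models pass or both fail do not affect the p-value, which suits benchmarks where most tasks are either easy or hard for every model.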

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivation chain


The paper is a pure empirical study introducing ClarifyCodeBench and reporting LLM evaluation results. No equations, derivations, fitted parameters, or predictions appear in the provided text. The three insights are direct observations from benchmark runs, not reductions of any claimed result to its own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and metrics but does not introduce fitted parameters, unproved axioms, or new postulated entities; it relies on standard practices of manual annotation and empirical LLM evaluation.



Reference graph

Works this paper leans on

41 extracted references · 18 canonical work pages · 12 internal anchors

  [1] Mehmet Akhoroz and Caglar Yildirim. 2025. Conversational AI as a Coding Assistant: Understanding Programmers’ Interactions with and Expectations from Large Language Models for Coding. CoRR abs/2503.16508 (2025).

  [2] Anthropic. 2025. Claude Sonnet 4.5 System Card. Technical Report. Anthropic. https://www.anthropic.com/claude-sonnet-4-5-system-card

  [3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).

  [4] Sher Badshah and Hassan Sajjad. 2025. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA. In Proceedings of the 9th Widening NLP Workshop. 251–267.

  [5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).

  [6] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46.

  [7] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).

  [8] Alan Davis, Scott Overmyer, Kathleen Jordan, Joseph Caruso, Fatma Dandashi, Anhtuan Dinh, Gary Kincaid, Glen Ledeboer, Patricia Reynolds, Pradip Sitaram, et al. 1993. Identifying and measuring quality in a software requirements specification. In [1993] Proceedings First International Software Metrics Symposium. IEEE, 141–152.

  [9] Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2024. Self-collaboration code generation via chatgpt. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–38.

  [10] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268.

  [11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).

  [12] Hojae Han, Seung-won Hwang, Rajhans Samdani, and Yuxiong He. 2025. Convcodeworld: Benchmarking conversational code generation in reproducible feedback environments. arXiv preprint arXiv:2502.19852 (2025).

  [13] A Handbook. 2003. From contract drafting to software specification: Linguistic sources of ambiguity.

  [14] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021).

  [15] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024).

  [16] IEC ISO. 2018. IEEE: ISO/IEC/IEEE 29148: 2018-Systems and software engineering-Life cycle processes-Requirements engineering. Technical report, ISO IEEE IEC.

  [17] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974 (2024).

  [18] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.

  [19] Kalervo Järvelin and Jaana Kekäläinen. 2017. IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR Forum, Vol. 51. ACM New York, NY, USA, 243–250.

  [20] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–30.

  [21] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2025. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120 (2025).

  [22] J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics (1977), 159–174.

  [23] Maya Larbi, Amal Akli, Mike Papadakis, Rihab Bouyousfi, Maxime Cordy, Federica Sarro, and Yves Le Traon. 2025. When prompts go wrong: Evaluating code model robustness to ambiguous, contradictory, and incomplete task descriptions. arXiv preprint arXiv:2507.20439 (2025).

  [24] Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2025. Structured chain-of-thought prompting for code generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–23.

  [25] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025).

  [26] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. Clarifygpt: A framework for enhancing llm-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2332–2354.

  [27] Elise Paradis, Kate Grey, Quinn Madison, Daye Nam, Andrew Macvean, Vahid Meimand, Nan Zhang, Benjamin Ferrari-Church, and Satish Chandra. 2025. How Much Does AI Impact Development Speed? an Enterprise-Based Randomized Controlled Trial. In SEIP@ICSE. IEEE, 618–629.

  [28] Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward. Inf. Softw. Technol. 178 (2025), 107610.

  [29] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025).

  [30] Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, and Graham Neubig. 2025. Interactive agents to overcome ambiguity in software engineering. arXiv preprint arXiv:2502.13069 (2025).

  [31] Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, and An Fu. 2025. Codeif-bench: Evaluating instruction-following capabilities of large language models in interactive code generation. arXiv preprint arXiv:2503.22688 (2025).

  [32] Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, and Wentao Zhang. 2025. Codeflowbench: A multi-turn, iterative benchmark for complex code generation. arXiv preprint arXiv:2504.21751 (2025).

  [33] Justin D. Weisz, Shraddha Vijay Kumar, Michael J. Muller, Karen-Ellen Browne, Arielle Goldberg, Katrin Ellice Heintze, and Shagun Bajpai. 2025. Examining the Use and Impact of an AI Code Assistant on Developer Productivity and Experience in the Enterprise. In CHI Extended Abstracts. ACM, 673:1–673:13.

  [34] Jie JW Wu, Manav Chaudhary, Davit Abrahamyan, Arhaan Khaku, Anjiang Wei, and Fatemeh H Fard. 2025. Clarifycoder: Clarification-aware fine-tuning for programmatic problem solving. arXiv e-prints (2025), arXiv–2504.

  [35] Jie JW Wu and Fatemeh H Fard. 2025. Humanevalcomm: Benchmarking the communication competence of code generation for llms and llm agents. ACM Transactions on Software Engineering and Methodology 34, 7 (2025), 1–42.

  [36] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying llm-based software engineering agents. Proceedings of the ACM on Software Engineering 2, FSE (2025), 801–824.

  [37] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

  [38] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

  [39] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. [n. d.]. Autocoderover: Autonomous program improvement, 2024. URL https://arxiv.org/abs/2404.05427 ([n. d.]).

  [40] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623.