Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new benchmarks showing effectiveness.
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
citing papers explorer
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
Automated Functional Testing for Malleable Mobile Application Driven from User Intent
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.
-
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
-
ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification
ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new benchmarks showing effectiveness.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
-
Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.