A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
hub
A Survey on Evaluating Large Language Models in Code Generation Tasks
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
TRAILS infers code correctness by aggregating LLM judgments on input-output pairs from category-partitioned specification tests, improving MCC by up to 39% over Zero-Shot COT on LiveCodeBench and CoCoClaNeL.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
LLM-generated cryptographic Rust code compiles successfully only 23% of the time and contains detectable vulnerabilities in 57% of the cases that do compile.
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.
citing papers explorer
-
An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code
LLM-generated cryptographic Rust code compiles successfully only 23% of the time and contains detectable vulnerabilities in 57% of the cases that do compile.