Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
20 Pith papers cite this work. Polarity classification is still indexing.
roles
background 3
polarities
background 3
representative citing papers
The first empirical study of crate hallucination in Rust code generation finds rates that are consistent across models and insensitive to parameters, and it tests prompt-based mitigation.
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
FASE approximates functional correctness via MST on structural and semantic dissimilarity graphs, reporting 25% better Spearman correlation and 19% better ROC AUC than LLM-based semantic entropy at 0.3% runtime cost on HumanEval and BigCodeBench.
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
BashCoder-R1 applies CPT, L-CoT SFT, and R-GRPO to reach higher syntax, robustness, and functionality rates than baselines on the new BashBench benchmark of 952 tasks.
A layered framework with physical gatekeepers, fidelity analysis against reference VQE circuits, and a consistency metric identifies five LLM failure modes in quantum circuit generation and reveals that some apparent model errors originated in the evaluation harness itself.
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
citing papers explorer
- LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
LibEvoBench benchmark shows LLMs are version-oblivious on evolving APIs, with documentation helping but version specification not.
- When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation
The first empirical study of crate hallucination in Rust code generation finds rates that are consistent across models and insensitive to parameters, and it tests prompt-based mitigation.
- Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
- Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
- FASE: Fast Adaptive Semantic Entropy for Code Quality
FASE approximates functional correctness via MST on structural and semantic dissimilarity graphs, reporting 25% better Spearman correlation and 19% better ROC AUC than LLM-based semantic entropy at 0.3% runtime cost on HumanEval and BigCodeBench.
- Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
- Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
- SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
- ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
- FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.
- BashCoder-R1: Towards Robust and Explainable Bash Code Generation with Robustness-Aware Group Relative Policy Optimization
BashCoder-R1 applies CPT, L-CoT SFT, and R-GRPO to reach higher syntax, robustness, and functionality rates than baselines on the new BashBench benchmark of 952 tasks.
- Gatekeepers and Hallucinations: A Layered Evaluation Framework for LLM-Driven Quantum Circuit Generation
A layered framework with physical gatekeepers, fidelity analysis against reference VQE circuits, and a consistency metric identifies five LLM failure modes in quantum circuit generation and reveals that some apparent model errors originated in the evaluation harness itself.
- A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
- ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.
- Can LLMs be Effective Code Contributors? A Study on Open-source Projects
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
- Context-Guided Decompilation: A Step Towards Re-executability
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.