Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

2024 · arXiv 2404.00971

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

LibEvoBench benchmark shows LLMs are version-oblivious on evolving APIs, with documentation helping but version specification not.

When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation

cs.SE · 2026-06-07 · unverdicted · novelty 7.0

First empirical study shows crate hallucination in Rust LLMs has consistent rates across models insensitive to parameters and tests prompt-based mitigation.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

cs.SE · 2025-09-26 · unverdicted · novelty 7.0

A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.

FASE: Fast Adaptive Semantic Entropy for Code Quality

cs.SE · 2026-06-08 · unverdicted · novelty 6.0

FASE approximates functional correctness via MST on structural and semantic dissimilarity graphs, reporting 25% better Spearman correlation and 19% better ROCAUC than LLM-based semantic entropy at 0.3% runtime cost on HumanEval and BigCodeBench.

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.

Uncertainty Quantification for LLM-based Code Generation

cs.SE · 2026-05-12 · unverdicted · novelty 6.0

RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

cs.SE · 2026-02-28 · unverdicted · novelty 6.0

ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.

FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

cs.AI · 2025-10-29 · unverdicted · novelty 6.0

FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

cs.SE · 2025-10-01 · conditional · novelty 6.0

PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.

BashCoder-R1: Towards Robust and Explainable Bash Code Generation with Robustness-Aware Group Relative Policy Optimization

cs.SE · 2026-06-26 · unverdicted · novelty 5.0

BashCoder-R1 applies CPT, L-CoT SFT, and R-GRPO to reach higher syntax, robustness, and functionality rates than baselines on the new BashBench benchmark of 952 tasks.

Gatekeepers and Hallucinations: A Layered Evaluation Framework for LLM-Driven Quantum Circuit Generation

quant-ph · 2026-06-16 · unverdicted · novelty 5.0

A layered framework with physical gatekeepers, fidelity analysis against reference VQE circuits, and a consistency metric identifies five LLM failure modes in quantum circuit generation and reveals that some apparent model errors originated in the evaluation harness itself.

A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox

cs.CR · 2026-05-20 · unverdicted · novelty 5.0

ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

cs.DC · 2026-04-26 · unverdicted · novelty 5.0

ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.

Can LLMs be Effective Code Contributors? A Study on Open-source Projects

cs.SE · 2026-04-25 · unverdicted · novelty 5.0

LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.

Context-Guided Decompilation: A Step Towards Re-executability

cs.SE · 2025-11-03 · unverdicted · novelty 5.0

ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.

iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation

cs.SE · 2026-04-21

citing papers explorer

Showing 17 of 17 citing papers after filters.

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models
cs.SE · 2026-06-24 · unverdicted · none · ref 42
LibEvoBench benchmark shows LLMs are version-oblivious on evolving APIs, with documentation helping but version specification not.

When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation
cs.SE · 2026-06-07 · unverdicted · none · ref 24
First empirical study shows crate hallucination in Rust LLMs has consistent rates across models insensitive to parameters and tests prompt-based mitigation.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE · 2026-04-03 · unverdicted · none · ref 24
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
cs.SE · 2025-09-26 · unverdicted · none · ref 30
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.

FASE: Fast Adaptive Semantic Entropy for Code Quality
cs.SE · 2026-06-08 · unverdicted · none · ref 26
FASE approximates functional correctness via MST on structural and semantic dissimilarity graphs, reporting 25% better Spearman correlation and 19% better ROCAUC than LLM-based semantic entropy at 0.3% runtime cost on HumanEval and BigCodeBench.

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification
cs.CL · 2026-05-27 · unverdicted · none · ref 15
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.

Uncertainty Quantification for LLM-based Code Generation
cs.SE · 2026-05-12 · unverdicted · none · ref 56
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model
cs.CL · 2026-05-11 · unverdicted · none · ref 29
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
cs.AI · 2026-04-19 · unverdicted · none · ref 87
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
cs.SE · 2026-02-28 · unverdicted · none · ref 46
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.

FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
cs.AI · 2025-10-29 · unverdicted · none · ref 15
FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.

BashCoder-R1: Towards Robust and Explainable Bash Code Generation with Robustness-Aware Group Relative Policy Optimization
cs.SE · 2026-06-26 · unverdicted · none · ref 26
BashCoder-R1 applies CPT, L-CoT SFT, and R-GRPO to reach higher syntax, robustness, and functionality rates than baselines on the new BashBench benchmark of 952 tasks.

Gatekeepers and Hallucinations: A Layered Evaluation Framework for LLM-Driven Quantum Circuit Generation
quant-ph · 2026-06-16 · unverdicted · none · ref 15
A layered framework with physical gatekeepers, fidelity analysis against reference VQE circuits, and a consistency metric identifies five LLM failure modes in quantum circuit generation and reveals that some apparent model errors originated in the evaluation harness itself.

A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
cs.CR · 2026-05-20 · unverdicted · none · ref 60
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
cs.DC · 2026-04-26 · unverdicted · none · ref 6
ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.

Can LLMs be Effective Code Contributors? A Study on Open-source Projects
cs.SE · 2026-04-25 · unverdicted · none · ref 9
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.

Context-Guided Decompilation: A Step Towards Re-executability
cs.SE · 2025-11-03 · unverdicted · none · ref 40
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
