Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
14 Pith papers cite this work.
citing papers explorer
- Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
- Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
- Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
- SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
- Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
- iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation
iCoRe improves Fail-to-Pass rates to 42.0% and 52.8% on two bug reproduction benchmarks by using correlation-aware iterative retrieval instead of standard semantic or BM25 methods.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
- ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
- FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.
- Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
- A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
- ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.
- Can LLMs be Effective Code Contributors? A Study on Open-source Projects
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
- Context-Guided Decompilation: A Step Towards Re-executability
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.