Canonical reference

A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, 35(2):1–72

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim · 2026 · ACM Transactions on Software Engineering and Methodology · DOI 10.1145/3747588

Canonical reference. 88% of classified citations from Pith papers cite this work as background.

22 Pith papers citing it

75 external citations · Crossref

Background 88% of classified citations

citation-role summary

background: 7 · method: 1

citation-polarity summary

background: 7 · use method: 1

representative citing papers

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

cs.SE · 2026-05-07 · conditional · novelty 8.0

LLMs frequently specify library versions with known CVEs in generated code (36-56% of tasks), show low compatibility (20-63%), and converge on the same risky versions across models.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

cs.SE · 2026-04-14 · accept · novelty 7.0

CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

cs.LG · 2026-05-14 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.

Coding Agents Don't Know When to Act

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

Coding agents exhibit action bias by proposing undesirable changes on already-fixed issues 35-65% of the time, and explicit reproduction instructions only partially mitigate this while creating new abstention errors.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

RuC: HDL-Agnostic Rule Completion Benchmark Generation

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-in-the-Middle prompting performed best.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Probabilistic Programs of Thought

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.

On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

cs.SE · 2026-04-15 · unverdicted · novelty 6.0

Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

Babbling Suppression: Making LLMs Greener One Token at a Time

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.

MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue

cs.CL · 2026-03-06 · unverdicted · novelty 6.0

MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.

How Robustly do LLMs Understand Execution Semantics?

cs.SE · 2026-02-24 · unverdicted · novelty 6.0

Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

cs.SE · 2026-05-08 · unverdicted · novelty 4.0 · 2 refs

Iterative refinement boosts LLM success in generating quantum solvers that match classical results, but more advanced models shift from execution errors to hard-to-detect numerical inaccuracies.

VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation

cs.AR · 2026-02-04 · unverdicted · novelty 4.0

Empirical study identifies patterns in how model classes respond to structured prompts, optimization, and other techniques across two Verilog benchmarks.

Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap

cs.SE · 2025-05-26 · unverdicted · novelty 4.0

A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.

From Helpful to Trustworthy: LLM Agents for Pair Programming

cs.SE · 2026-04-11 · unverdicted · novelty 3.0

A research proposal for three studies on multi-agent LLM pair programming that externalizes intent and uses automated validation to increase trustworthiness.

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

cs.CL · 2026-04-28

Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis

cs.SE · 2026-03-28

citing papers explorer

Showing 22 of 22 citing papers.

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions · cs.SE · 2026-05-07 · conditional · none · ref 26
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering · cs.SE · 2026-05-17 · unverdicted · none · ref 24
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates · cs.SE · 2026-04-29 · unverdicted · none · ref 19
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation · cs.SE · 2026-04-14 · accept · none · ref 15
One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging · cs.CL · 2026-04-03 · unverdicted · none · ref 20
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation · cs.LG · 2025-11-11 · unverdicted · none · ref 11
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling · cs.LG · 2026-05-14 · conditional · none · ref 2
Coding Agents Don't Know When to Act · cs.SE · 2026-05-08 · unverdicted · none · ref 6
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code · cs.SE · 2026-05-06 · accept · none · ref 51
RuC: HDL-Agnostic Rule Completion Benchmark Generation · cs.AR · 2026-04-30 · unverdicted · none · ref 1
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering · cs.AI · 2026-04-22 · unverdicted · none · ref 21
Probabilistic Programs of Thought · cs.CL · 2026-04-19 · unverdicted · none · ref 4
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation · cs.SE · 2026-04-15 · unverdicted · none · ref 19
Babbling Suppression: Making LLMs Greener One Token at a Time · cs.SE · 2026-04-08 · unverdicted · none · ref 20
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue · cs.CL · 2026-03-06 · unverdicted · none · ref 9
How Robustly do LLMs Understand Execution Semantics? · cs.SE · 2026-02-24 · unverdicted · none · ref 24
Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation · cs.SE · 2026-05-08 · unverdicted · none · ref 26 · 2 links
VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation · cs.AR · 2026-02-04 · unverdicted · none · ref 8
Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap · cs.SE · 2025-05-26 · unverdicted · none · ref 84
From Helpful to Trustworthy: LLM Agents for Pair Programming · cs.SE · 2026-04-11 · unverdicted · none · ref 8
Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling · cs.CL · 2026-04-28 · unreviewed · ref 14
Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis · cs.SE · 2026-03-28 · unreviewed · ref 31
