
In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20

Mastropaolo · 2023 · arXiv 8619.2023

Canonical reference. 76% of citing Pith papers cite this work as background.

115 Pith papers citing it

Background 76% of classified citations

citation-role summary

background 32 · baseline 2 · method 2 · extension 1 · other 1

citation-polarity summary

background 29 · support 3 · baseline 2 · use method 2 · extend 1 · unclear 1

authors

Mastropaolo

representative citing papers

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

cs.SE · 2026-06-10 · unverdicted · novelty 8.0

Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.

Autoregressive, Yet Revisable: In Decoding Revision for Secure Code Generation

cs.SE · 2026-02-01 · unverdicted · novelty 8.0

Stream of Revision adds action tokens to LLM decoding so the model can revise its own code history on the fly, cutting vulnerabilities in generated code with little added cost.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

cs.SE · 2024-03-25 · conditional · novelty 8.0 · 2 refs

RepairAgent autonomously repairs 164 bugs on Defects4J including 39 not fixed by prior techniques by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection

cs.CR · 2026-07-02 · unverdicted · novelty 7.0

EvoVuln evolves executable detection policies for five smart-contract vulnerability types using cold-start synthetic testing followed by few-shot refinement on five vulnerable and five safe contracts, reaching 71% macro F1 and enabling a small model to beat a large zero-shot model by 19 points at un

Quantum Mutant Equivalence via Transpilation

cs.SE · 2026-06-25 · unverdicted · novelty 7.0

TBE identifies 32.1% of 92,011 equivalent surviving quantum mutants (29,536) via OpenQASM comparison after transpilation, reporting 100% precision and 82% accuracy on 348,299 mutants.

What You See Is Not What You Execute: Memory-Based Runtime SBOM Generation for Supply Chain Security

cs.CR · 2026-06-22 · unverdicted · novelty 7.0 · 4 refs

MEM-SBOM generates runtime SBOMs for Python applications by recovering modules, versions, and dependency graphs from volatile memory via Volatility 3 plugins, achieving 100% extraction accuracy on 51 apps.

The Correctness Illusion in LLM-Generated GPU Kernels

cs.SE · 2026-06-18 · accept · novelty 7.0

Controlled corpus testing shows that fixed allclose oracles in LLM kernel benchmarks certify transcription-buggy kernels as correct while seeded fuzzing with fp64 references does not.

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software

cs.SE · 2026-05-20 · unverdicted · novelty 7.0 · 3 refs

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.

Code Generation by Differential Test Time Scaling

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · unverdicted · novelty 7.0 · 6 refs

PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.

The Death Spiral of Open Source Projects: A Post-Mortem Analysis of Pull Request Workflow Dynamics

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

Large-scale analysis of inactive GitHub repositories shows open source projects die primarily from insufficient value and ecosystem dynamics, not from pull request workflow problems, despite a common pattern of declining activity.

Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation

cs.SE · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

SMT-LLM builds a constraint graph from PyPI metadata and AST-derived imports, solves it with Z3, and uses LLM imputation only when needed, resolving 83.6% of HG2.9K snippets versus PLLM's 54.8% while cutting median time by 6.3x and LLM calls by 11x.

ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

cs.SE · 2026-05-10 · unverdicted · novelty 7.0

ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns

cs.CR · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

VulKey introduces hierarchical expert knowledge abstractions to guide LLMs in vulnerability repair, reporting 31.5% accuracy on PrimeVul (7.6% above best baseline) and strong results on Vul4J.

ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

ClozeMaster masks bracketed structures in historical Rust bug code and uses LLMs to infill them, generating test programs that discovered 27 confirmed bugs in rustc and mrustc while outperforming existing fuzzers.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

Isolating Recurring Execution-Dependent Abnormal Patterns on NISQ Quantum Devices

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

QRisk isolates backend-specific abnormal error patterns on NISQ devices via delta debugging and mitigates them with commuting gate swaps, cutting excess noise by 24-45% on IBM backends where noise models predict no difference.

Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

cs.AR · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.

Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities

cs.CR · 2026-04-15 · conditional · novelty 7.0

LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.

CIR+CVN: Bridging LLM Semantic Understanding and Petri-Net Verification for Concurrent Programs

cs.PL · 2026-04-10 · unverdicted · novelty 7.0 · 2 refs

An LLM synthesizes an alias-free concurrency model (CIR) from natural language that is translated to a Petri net (CVN) for exhaustive verification and targeted repair, with goal-reachability checks to avoid incomplete fixes.

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage

cs.SE · 2026-04-02 · unverdicted · novelty 7.0

REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.

citing papers explorer

Showing 1 of 1 citing paper after filters.

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

cs.LG · 2026-05-22 · unverdicted · none · ref 44

PromptAudit evaluates five prompting strategies across five LLMs on 1000 CVEs and finds chain-of-thought prompting yields the strongest overall performance while adaptive chain-of-thought and self-consistency reduce effective results.
