In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20
Canonical reference. 76% of citing Pith papers cite this work as background.
citing papers explorer
-
Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps
Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.
-
Autoregressive, Yet Revisable: In Decoding Revision for Secure Code Generation
Stream of Revision adds action tokens to LLM decoding so the model can revise its own code history on the fly, cutting vulnerabilities in generated code with little added cost.
-
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair
RepairAgent autonomously repairs 164 bugs on Defects4J, including 39 not fixed by prior techniques, by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.
-
Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection
EvoVuln evolves executable detection policies for five smart-contract vulnerability types using cold-start synthetic testing followed by few-shot refinement on five vulnerable and five safe contracts, reaching 71% macro F1 and enabling a small model to beat a large zero-shot model by 19 points at un
-
Quantum Mutant Equivalence via Transpilation
TBE identifies 32.1% of 92,011 equivalent surviving quantum mutants (29,536) via OpenQASM comparison after transpilation, reporting 100% precision and 82% accuracy on 348,299 mutants.
-
What You See Is Not What You Execute: Memory-Based Runtime SBOM Generation for Supply Chain Security
MEM-SBOM generates runtime SBOMs for Python applications by recovering modules, versions, and dependency graphs from volatile memory via Volatility 3 plugins, achieving 100% extraction accuracy on 51 apps.
-
The Correctness Illusion in LLM-Generated GPU Kernels
Controlled corpus testing shows that fixed allclose oracles in LLM kernel benchmarks certify transcription-buggy kernels as correct while seeded fuzzing with fp64 references does not.
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
Code Generation by Differential Test Time Scaling
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
-
Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
-
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in the baseline setting.
-
Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
-
The Death Spiral of Open Source Projects: A Post-Mortem Analysis of Pull Request Workflow Dynamics
Large-scale analysis of inactive GitHub repositories shows open source projects die primarily from insufficient value and ecosystem dynamics, not from pull request workflow problems, despite a common pattern of declining activity.
-
Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation
SMT-LLM builds a constraint graph from PyPI metadata and AST-derived imports, solves it with Z3, and uses LLM imputation only when needed, resolving 83.6% of HG2.9K snippets versus PLLM's 54.8% while cutting median time by 6.3x and LLM calls by 11x.
-
ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing
ConCovUp uses static analysis to ground LLM test generation and backward tracing to produce concurrent test drivers that raise average shared-memory access pair coverage from 36.6% to 68.1% on nine real-world libraries.
-
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
-
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns
VulKey introduces hierarchical expert knowledge abstractions to guide LLMs in vulnerability repair, reporting 31.5% accuracy on PrimeVul (7.6% above best baseline) and strong results on Vul4J.
-
ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs
ClozeMaster masks bracketed structures in historical Rust bug code and uses LLMs to infill them, generating test programs that discovered 27 confirmed bugs in rustc and mrustc while outperforming existing fuzzers.
-
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
-
Isolating Recurring Execution-Dependent Abnormal Patterns on NISQ Quantum Devices
QRisk isolates backend-specific abnormal error patterns on NISQ devices via delta debugging and mitigates them with commuting gate swaps, cutting excess noise by 24-45% on IBM backends where noise models predict no difference.
-
Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.
-
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities
LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.
-
CIR+CVN: Bridging LLM Semantic Understanding and Petri-Net Verification for Concurrent Programs
An LLM synthesizes an alias-free concurrency model (CIR) from natural language that is translated to a Petri net (CVN) for exhaustive verification and targeted repair, with goal-reachability checks to avoid incomplete fixes.
-
REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.
-
Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
-
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
-
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
-
ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
ContractEval benchmark on 364 tasks shows code LLMs achieve 75-82% functional pass@1 but 0% contract satisfaction under standard prompting, rising only to 23-41% with explicit contracts.
-
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
CodeCureAgent achieves 96.8% plausible fixes and 86.3% correct fixes for 1,000 SonarQube warnings across 106 Java projects using an agentic LLM framework.
-
Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators
Once4All synthesizes LLM-based generators from extracted SMT grammars and populates formula skeletons to fuzz Z3 and cvc5, discovering 43 confirmed bugs with 40 fixed.
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL applies RL to software evolution data to train LLMs, achieving 41% on SWE-bench Verified and generalizing to other reasoning tasks.
-
NESA: Relational Neuro-Symbolic Static Program Analysis
NESA presents a neuro-symbolic framework that decomposes static analyses into policy-defined sub-problems solved by parsers and LLMs to enable compilation-free customizable analysis with reduced hallucinations.
-
Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points
ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.
-
Prompt Coverage Adequacy
Prompt Coverage Adequacy, measured via attention boosting in LLMs, is associated with fault detection and uncovers over 30% more faults than traditional code coverage when guiding test generation across two datasets.
-
AdaTrans: Automated C to Rust Transformation via Error-Adaptive Repair
AdaTrans uses strategy-driven RAG, error-stratified transformation, and multi-stage validation to reach 95.51% mean compilation pass rate and 81.09% solve rate on 104 algorithmic problems with only 1.19% unsafe files.
-
Uncovering Similar but Different Packages in PyPI and Potential Security Threats
Large-scale analysis of 200K PyPI packages identifies 1,361 replicated popular packages, 256 replicated vulnerable packages, and 7 new replicated malicious packages, showing replication as a security threat vector.
-
TraceView: Interactive Visualization of Agentic Program Repair Trajectories
TraceView organizes agentic APR trajectories into Thought-Action-Result components for semantic labeling and renders them as interactive graphs, with a user study of five researchers showing improved scannability and understanding.
-
Holmes: Multimodal Agentic Diagnosis for Mixed-Language Mobile Crashes at Industrial Scale
Holmes is a multimodal multi-agent system using a hierarchical Retrieve-Explore-Reason architecture to automate root cause analysis of mobile crashes, achieving 87.6% function-level accuracy and 98% time reduction on real WeChat data.
-
Differential Zonotopes for Verifying Global Robustness of DNNs
Differential halo zonotopes enable static verification of global robustness in DNNs by jointly propagating pairs of perturbed inputs while bounding divergence, with a relaxed confidence-based variant.
-
ATTAIN: Automated Exploit Failure Analysis through Trace-Driven Diff Analysis
ATTAIN is a three-module trace-driven framework that combines exploit execution, LLM-guided diff search, and evidence-based judgment to identify affected library versions for CVEs, reporting 93.24% F1 on 224 CVEs across 25,943 versions.
-
Type-Error Ablation and AI Coding Agents
Ablation experiment in Shplait finds that detailed type error messages improve AI agents' type-error repair rates over minimal messages or dynamic errors, with type systems adding further benefit.
-
FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation
FPMoE applies a sparse MoE architecture with per-language routed experts and a shared expert to improve LLM code generation on functional languages, outperforming fine-tuned baselines while matching larger models with 3B active parameters.
-
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
PromptAudit evaluates five prompting strategies across five LLMs on 1000 CVEs and finds chain-of-thought prompting yields the strongest overall performance while adaptive chain-of-thought and self-consistency reduce effective results.
-
Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection
ReasonVul deploys three LLM agents with independent analysis and structured debate to achieve 40% PairAcc and 72.52% F1 on PrimeVul, outperforming baselines by 81% in PairAcc.
-
Task Abstention for Large Language Models in Code Generation
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
-
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
PROBE turns runtime telemetry from failed software engineering agent runs into evidence-grounded diagnoses and actionable recovery guidance, achieving 65.37% diagnosis accuracy and 21.79% recovery rate on 257 cases.
-
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.