In: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 34, pp 27
Mixed citation behavior. Most common role is background (67%).
citing papers explorer
- LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms
  LLM-based merge conflict resolution performs well on imbalanced conflicts but struggles with large or non-English inputs, while search-based methods show better generalization and strength on balanced conflicts.
- SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models
  SiblingRepair uses LLMs with semantic sibling detection and simultaneous/iterative repair strategies to outperform prior multi-hunk APR tools like Hercules on Defects4J and GHRB benchmarks.
- Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
  MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
- The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE
  Software engineering scope expands beyond executable code to semi-executable artifacts best diagnosed by the new six-ring Semi-Executable Stack model.
- Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
  Multiple-choice queries synthesized from Hoare triples enable more reliable identification of intended programs than labeled-example supervision in active learning for program disambiguation.
- Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
  Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
- AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
  AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
- When Specifications Meet Reality: Uncovering API Inconsistencies in Ethereum Infrastructure
  APIDiffer automatically detects 72 API inconsistencies across 11 Ethereum clients using specification-guided test generation and LLM-based false-positive filtering, with 90% of bugs confirmed by developers.
- AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification
  AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
- A Methodological Analysis of Empirical Studies in Quantum Software Testing
  A systematic analysis of 59 quantum software testing empirical studies reveals highly diverse designs, inconsistent reporting, and open methodological challenges, leading to recommendations for future work.
- Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
  A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
- Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators
  Once4All synthesizes LLM-based generators from extracted SMT grammars and populates formula skeletons to fuzz Z3 and cvc5, discovering 43 confirmed bugs with 40 fixed.
- Beyond the Tip of the Iceberg: Understanding SATD in Dockerfiles through the Lens of Co-evolution
  Analysis of SATD in Dockerfiles shows 27% of admissions and 40% of repayments are coupled to non-Dockerfile artifacts, with coupled events repaid faster overall and external dependencies as a key trigger.
- QUTest: A Native Testing Framework for Quantum Programs
  QUTest is a native OpenQASM testing framework that encodes Arrange/Act/Assert tests and 12 assertion types via pragma comments while remaining compatible with existing tools.
- MuMuTestUp: Mutation-based Multi-Agent Test Case Update
  MuMuTestUp is a mutation-guided multi-agent framework for updating test cases in evolving software that strengthens assertions via surviving mutants, targets specific coverage gaps, and uses semantic search instead of exact matching.
- Robust Mutation Analysis of Quantum Programs Under Noise
  Noise from quantum hardware simulators significantly alters mutant detection distances, making equivalent mutants harder to separate from faults, with output-distribution metrics reaching 73.03% accuracy and 74.89% F1-score under device-specific thresholds.
- AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
  AutoSOUP automates component-level memory-safety verification by generating Safety-Oriented Unit Proofs via three techniques and a hybrid LLM-plus-program-synthesis architecture called LLM-As-Function-Call.
- Quality-Driven Selective Mutation for Deep Learning
  A dual-axis quality framework ranks DL mutation operators by statistical resistance and Jaccard-based realism to real faults, enabling up to 55.6% fewer mutants on held-out validation data without dropping baseline performance.
- QuanForge: A Mutation Testing Framework for Quantum Neural Networks
  QuanForge introduces statistical mutation killing and nine post-training mutation operators for QNNs to distinguish test suites and localize vulnerable circuit regions.
- SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
  SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
- Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics
  GLMTest integrates code property graphs and GNNs with LLMs to steer test case generation toward targeted branches, raising branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark.
- Debugging Performance Issues in WebAssembly Runtimes via Mutation-based Inference
  WarpL uses mutation to find and isolate suboptimal instruction sequences causing performance issues in WebAssembly runtimes by comparing machine code of original and non-problematic mutant programs.
- Beyond Crash-to-Patch: Patch Evolution for Linux Kernel Repair
  Reconstructing 6946 syzbot bug-fix lifecycles reveals that accepted kernel patches are non-local and reviewer-constrained, enabling PatchAdvisor to improve automated repair quality over baselines via retrieval and diagnostic guidance.
- PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
  PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.
- Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study
  APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
- Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths
  PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
- PatchTrack: A Comprehensive Analysis of ChatGPT's Influence on Pull Request Outcomes
  Empirical analysis of 338 PRs with self-admitted ChatGPT usage shows low full integration (median 25%), selective adaptation patterns, and broader influence on developer reasoning during reviews.
- A Study of LLMs' Preferences for Libraries and Programming Languages
  Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
- UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection Models
  UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.
- XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
  XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.
- Improving MPI Error Detection and Repair with Large Language Models and Bug References
  Augmenting LLMs with bug references, few-shot learning, chain-of-thought, and RAG improves MPI error detection accuracy from 44% to 77% and generalizes across models.
- OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine
  OpDiffer applies LLMs and static analysis to opcode-level differential testing of EVMs, reporting 26 previously unknown bugs across nine implementations along with coverage gains and an estimate that 7.21% of real contracts could trigger the bugs.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  Systematic review of 145 papers on LLM-based log analysis, providing a unified taxonomy, common design patterns, evaluation practices, and challenges for deployment under drift and limited labels.
- MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation
  MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.
- To Vibe Research or Not to Vibe Research? Generative AI in Qualitative Research
  Generative AI suitability in qualitative research depends primarily on the approach (small-q positivist/post-positivist or Big Q non-positivist) along with skills, ethics, and personal preferences.