An empirical study of usages, updates and risks of third-party libraries in Java projects
14 Pith papers cite this work.
citing papers explorer
- Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship
Stratified analysis of AIDev PRs shows co-authorship effects on AI agent merge rates are artefacts of agent composition, repository selection, and PR commit structure rather than causal benefits.
- REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange
REStack is a new public dataset of 12k+ RE discussions from Stack Exchange sites, enriched with 23 LDA-derived topics grouped into six categories and community-derived difficulty metadata.
- What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants
An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
- Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
- IntentTester: Intent-Driven Multi-agent Framework for Cross-Library Test Migration
IntentTester migrates tests across libraries using TDL abstraction and multi-agent LLM synthesis, achieving 85% correctness and 74% effectiveness versus 51% and 43% for baselines on nine projects in JSON, HTML, and Time domains.
- Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
- Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study
APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
- MR-Adopt: Automatic Deduction of Input Transformation Function for Metamorphic Testing
MR-Adopt deduces input transformations from hard-coded MR test cases using LLMs, data-flow refinement, and output-relation selection to enable reuse with new source inputs.
- MR-Scout: Automated Synthesis of Metamorphic Relations from Existing Test Cases
MR-Scout extracts over 11,000 metamorphic-relation-encoded test cases from 701 OSS projects, codifies 97% of them as high-quality generators, and shows they raise line coverage by 13.52% and mutation score by 9.42% on programs that already have developer tests.
- All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application
Analysis of 252 bug fixes in an LLM-powered multi-market web app found 44% escaped through four seams invisible to component unit tests, motivating a four-seam verification framework.
- How Do Developers Use Migration Guides? A Case Study of Log4j
Developers most frequently reference the full Log4j migration guide in pull request descriptions (82.81% of cases) and continue consulting it during post-update maintenance tasks.
- From Paradigm Shift to Audit Rift: Empirical Analysis and Validation of Security Audit Methodologies for Asynchronous Smart Contract Systems
Empirical review of 233 real-world vulnerabilities from 34 TON audits produces a specialized checklist for asynchronous message handling, supported by case studies and an 11-person practitioner survey.
- To Vibe Research or Not to Vibe Research? Generative AI in Qualitative Research
Generative AI suitability in qualitative research depends primarily on the approach (small-q positivist/post-positivist or Big Q non-positivist) along with skills, ethics, and personal preferences.
- Efficient Black-Box Fault Localization for System-Level Test Code Using Large Language Models