AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.
arXiv:2502.06215 [cs.SE] https://arxiv.org/abs/2502.06215 Manuscript submitted to ACM
8 Pith papers cite this work.
representative citing papers
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
citing papers explorer
- BT-APE: A Computationally Light Backtracking Approach to Automatic Prompt Engineering for Requirements Classification
  BT-APE automates prompt engineering for requirements classification using backtracking search and dynamic examples, matching PE2 accuracy while using 72% fewer tokens and 66% less time than that baseline.
- SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models
  SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
- Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
  Empirical benchmarks show distribution similarity between adaptation and pretraining data increases practical privacy leakage in DP-adapted LLMs at fixed theoretical guarantees, with LoRA providing strongest protection for OOD cases.
- PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
  PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.
- OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing
  OpenAnt is an open-source pipeline that uses code decomposition, LLM-based adversarial verification, and automated dynamic testing to find vulnerabilities in large projects like OpenSSL and WordPress while claiming lower false positives.
- Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt
  Larger LLMs reproduce constructional productivity via entrenchment in coercion cases with nonce words but fail to use statistical preemption to avoid overgeneralizing semantically plausible but unobserved patterns.