RepairBench: Leaderboard of frontier models for program repair
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
citing papers explorer
- RubberDuckBench: A Benchmark for AI Coding Assistants
  RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
- Guidelines for Empirical Studies in Software Engineering involving Large Language Models
  The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
- Security Incentivization: An Empirical Study of how Micropayments Impact Code Security
  Linking team bonuses to automated security scan results reduced issue density in a controlled experiment with 84 students across 14 teams.