Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models
Canonical reference. 88% of citing Pith papers cite this work as background.
Citation roles: background 8
citing papers explorer
- ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code
  Developers using AI showed the same core problem-solving behaviors as those without but differed in how they became stuck and recovered, with AI helping or hindering in specific cases.
- CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
  CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.
- Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph
  EvoGraph turns linear AI-assisted programming into a manipulable graph of branching histories, reducing cognitive load and enabling better iteration according to a user study with 20 developers.
- REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
  REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.
- Journeys of Parents with LGBTQ+ Children: How Trauma and Healing Reshape Identity and (Mis)Informating Practices
  A qualitative study of South Korean parents shows that trauma and healing after learning a child is LGBTQ+ leads to identity reconstruction as supportive parents and more critical, protective informating practices.
- What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook
  Empirical analysis of 4707 MoltBook posts shows AI-only technical discourse focuses on security, trust, and abstract topics while lacking concrete runtime and project details found in human GitHub discussions.
- Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
  Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
- Designing Around Stigma: Human-Centered LLMs for Menstrual Health
  Researchers created a stigma-aware WhatsApp chatbot for menstrual health education in Pakistan through co-design workshops and a two-week deployment, yielding insights on its use for challenging taboos alongside tensions around trust and cultural explanations.
- Decision-Oriented Programming with Aporia
  Aporia makes design decisions explicit and interactive in AI-assisted programming, leading to higher engagement and 5x fewer mental model disagreements with code in a 14-person user study compared to a baseline agent.
- Polite But Boring? Trade-offs Between Engagement and Psychological Reactance to Chatbot Feedback Styles
  Polite chatbot feedback lowers psychological reactance and boosts behavioral intentions but lacks engagement, whereas verbal leakage heightens surprise and engagement at the expense of increased reactance.
- PatchTrack: A Comprehensive Analysis of ChatGPT's Influence on Pull Request Outcomes
  Empirical analysis of 338 PRs with self-admitted ChatGPT usage shows low full integration (median 25%), selective adaptation patterns, and broader influence on developer reasoning during reviews.
- The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study
  Longitudinal surveys show AI coding assistants reduce time on code writing but increase supervisory verification tasks, with stable productivity perceptions yet rising reports of worsened developer experience.
- Relationships Between Trust, Compliance, and Performance for Novice Programmers Using AI Code Generation
  Among novice programmers using AI code generators, trust did not predict compliance with suggestions, while performance correlated with both compliance and increased subsequent trust.
- "If You're Very Clever, No One Knows You've Used It": The Social Dynamics of Developing Generative AI Literacy in the Workplace
  Hiding generative AI use to signal expertise reduces knowledge sharing and transparency among workplace colleagues.
- "Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
  User study reveals nine LLM failure categories in SE tasks and quantifies abandonment factors from 26 participants.
- Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models
  Smaller LLMs produce functional but limited Python code with variable quantization effects and quality/maintainability concerns that require validation before use.
- Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks
  A survey of user studies on LLM use in programming that identifies interaction behaviors, mixed benefits and weaknesses, and factors influencing human and task performance.