archive
Every paper Pith has read. Search by title, abstract, or pith.
1797 papers in cs.SE · page 14
-
Brief role-model stories in lectures support belonging in software courses
Supporting Belonging in Software Engineering Through Role Models Exposure
-
Intent compilation turns partial goals into binding AI artifacts
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
-
Product context retrieval lifts AI coding compliance from 46% to 95%
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
-
Speculative societies prompt OSS practitioners to rethink designer roles
What If We Work Together? Fostering Reflections on Designer Inclusion in Open Source Software Through Speculative Design
-
Evidence rules stop research agents at the right time
Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination
-
LLMs biased to Python limit multilingual code tasks
Large Language Models for Multilingual Code Intelligence: A Survey
-
LLM auditors find fatal errors in agent benchmarks
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
-
Fine-tuning shifts AI safety scores in unpredictable ways
Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
-
The paper introduces FGDM, a four-agent framework that converts code into flow graphs and…
FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting
-
Under-specified prompts raise code correctness on rich tasks
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
-
Small finetuned model detects bad LLM code prompts at F1 0.80
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
-
Fine-tuned LLMs hit 1.00 structural fidelity on multi-file DSL edits
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
-
SLMs on phones work only when given the smallest tasks
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
-
Mobile AI works reliably only when models do the least
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
-
LLM tools break standard evaluation rules in software engineering
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
-
Markov chains predict LLM agent success times from traces
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
-
Pipeline migrates monoliths to serverless with 100% deployment success
Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis
-
Review of 80 studies charts transformer use for finding code vulnerabilities
A systematic literature Review for Transformer-based Software Vulnerability detection
-
Automated checks match developer labels only 44-62% for code review bots
Understanding the Limits of Automated Evaluation for Code Review Bots in Practice
-
Survey maps student AI use across capstone projects
How Do Software Engineering Students Use Generative AI in Real-World Capstone Projects? An Empirical Baseline Study
-
Structured knowledge turns LLM training into debuggable code
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
-
Tool generates personas to boost OSS developer empathy
Putting a Face to the Issue: Fostering User Empathy of Open Source Software Developers With PersonaFlow
-
More reviewer bot comments slow agentic PR resolution
On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories
-
Models reach only 74% on code questions linking definitions to calls
SWE-QA: A Dataset and Benchmark for Complex Code Understanding
-
Multi-agent SZZ raises F1 scores for vulnerability commit detection by up to 65%
MAS-SZZ: Multi-Agentic SZZ Algorithm for Vulnerability-Inducing Commit Identification
-
Humans drive creativity in design even when using LLMs
Exploring Creativity in Human-Human-LLM Collaborative Software Design
-
One plugin interface unifies controls across diffusion models
Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
-
Evolving memory boosts private library code generation by 16%
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
-
Dynamic agents hit 95% success generating hardware reference models
RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation
-
Basic agent with ADI fixes 63.8% of SWE-bench tasks
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
-
Software framework lets AI close the business experimentation loop
Closing the Loop: A Software Framework for AI to Support Business Decision Making
-
Go projects contain 7,473 crypto API misuses with uneven detector coverage
Evaluating Cryptographic API Misuse Detectors for Go
-
Developers link to full migration guides in 83% of pull requests
How Do Developers Use Migration Guides? A Case Study of Log4j
-
Benchmark plus sentiment predicts AI agent adoption
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
-
Linking bug reports to fixes lifts vulnerability detection to 0.941 F1
Vulnerability Identification by Harnessing Inter-connected Multi-Source Information
-
Multi-agent constraints make decompiled binaries executable in 84-97% of cases
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
-
Optimas automates GPU code optimization with 100% correctness
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
-
LLM system automates 45% of support sessions from copilot corrections
Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows
-
6-33% of code review comments in scientific software are not useful
Characterizing the Usefulness of Code Review Comments in Scientific Software for Software Quality and Scientific Rigor
-
Five-layer AI agent matches top coding tools on benchmarks
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
-
Fine-tuned LLMs answer code queries with focused UML diagrams
Query2Diagram: Answering Developer Queries with UML Diagrams
-
Frontier agents succeed in only 20% of multi-day coworker tasks
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
-
LLMs classify code review comments using comment and diff
Automated Classification of Human Code Review Comments with Large Language Models
-
DAG modeling doubles agent failure detection over end-to-end checks
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
-
Grammar loop aligns CPS safety rules with simulations
Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong
-
Requirements guide tests to detect 22-25 more business logic bugs
Uncovering Business Logic Bugs via Semantics-Driven Unit Test Generation
-
LLM uncertainty propagates across workflows and people
Uncertainty Propagation in LLM-Based Systems
-
Agents link browser symptoms to backend causes at 19.7% accuracy
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
-
Prompt chaining lifts LLM accuracy on scientific text classification
Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models