archive
Every paper Pith has read. Search by title, abstract, or pith.
1797 papers in cs.SE · page 8
-
RAG with LLMs catches 91 percent of false kernel bug reports
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
-
Natural-language rewrite lifts code retrieval scores
Do not copy and paste! Rewriting strategies for code retrieval
-
Scenario models automate VR app tests and catch more failures
System Test Generation for Virtual Reality Applications using Scenario Models
-
Search finds small perturbations that break robot vision 3-7x better
Search-based Robustness Testing of Laptop Refurbishing Robotic Software
-
Iterative refinement boosts LLM quantum solver success
Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation
-
Iterative checks boost LLM quantum solver success
Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation
-
Prefill signals from small models locate multi-agent failures
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
-
Prefill signals from small LLMs locate root failures in agent traces
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
-
Multi-shot prompts boost agreement only for Claude Haiku
Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study
-
Multi-stage training boosts Java-to-Cangjie code translation 6%
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
-
Unclear roles top ML team challenges in semiconductors
Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry
-
Open-source low-code editor builds and deploys AI web apps
Low-code and no-code with BESSER to create and deploy smart web applications
-
Compile rate misleads on LLM game scene quality
Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
-
Dual-space loop refines virtual cell models by routing failures to right level
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
-
AI backends gain one admission seam for governance across requests
Execution Envelopes: A Shared Admission Contract for Backend AI Execution Requests
-
LLM agents reach only 30-55% on full repo generation from scratch
RepoZero: Can LLMs Generate a Code Repository from Scratch?
-
Top LLM agents complete only 30-55% of code repositories from scratch
RepoZero: Can LLMs Generate a Code Repository from Scratch?
-
Framework ties agent architecture to lifecycle for reliable CUAs
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
-
Authority transfer, not task performance, defines agentic CI/CD
From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
-
Replay script matches frontier models on computer-use benchmarks
Computer Use at the Edge of the Statistical Precipice
-
LLM agents fix under half of architectural code smells
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
-
Language descriptions become solvable constraints for AV tests
Traffic Scenario Orchestration from Language via Constraint Satisfaction
-
This paper reviews studies linking lack of belonging to higher burnout in software…
Guidelines for Cultivating a Sense of Belonging to Reduce Developer Burnout
-
MySQL and PostgreSQL top DBMS use in open-source Java history
Analyzing the Adoption of Database Management Systems Throughout the History of Open Source Projects
-
Best coding agents pass under 16 percent of Java framework migrations
ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java
-
Agents pass only 15% of Java framework migration tests
ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java
-
AI code needs fewer updates than human code
To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
-
AI code receives less maintenance than human code
To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
-
LLM agents drop 30 points on backend tasks with full constraints
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
-
DAG replay preserves AI work state exactly with zero churn
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
-
LLMs pick vulnerable library versions in 37-56% of tasks
Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
-
LLM-based method repairs sibling code bugs across locations
SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models
-
Self-healing framework raises LLM agent success rates
A Self-Healing Framework for Reliable LLM-Based Autonomous Agents
-
Symbolic traces train 8B model to beat 32B on code violation detection
Teaching LLMs Program Semantics via Symbolic Execution Traces
-
0.1% of PyPI packages carry 80% of maintenance impact
Modeling Dependency-Propagated Ecosystem Impact of Changes in Maintenance Activities: Evaluating Support Strategies in the PyPI Network
-
0.1% of PyPI packages carry 80% of ecosystem impact
Modeling Dependency-Propagated Ecosystem Impact of Changes in Maintenance Activities: Evaluating Support Strategies in the PyPI Network
-
LLM judges flip up to 9% of safety verdicts on equivalent policy rewordings
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
-
Protocol tests agent effort to recover design intent from code
BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases
-
Agents top out near 47% F1 on updating project tests after changes
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
-
One model beats coding specialists by 9% with utility-driven RL
Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
-
AST patterns identify algorithms more accurately than LLMs or clone detectors
Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition
-
Tool detects how LLMs create risks in GitHub CI workflows
Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows
-
Multi-agent workflow lifts AI coding success by 6.5 percent
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
-
Multi-agent workflow lifts algorithmic solving by 6.5 percent
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
-
Automatic metrics fail to judge non-English code comments
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
-
AI code security fixes often create new weaknesses
On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies
-
Ontology guides agent for better requirements interviews
From Chat to Interview: Agentic Requirements Elicitation with an Experience Ontology
-
Real IDE traces expose overestimation in simulated coding assistant tests
An Empirical Study of Proactive Coding Assistants in Real-World Software Development
-
Coding agents need proactivity, not just autonomy
Agentic Coding Needs Proactivity, Not Just Autonomy