archive
Every paper Pith has read.
1797 papers in cs.SE · page 1
-
Claude agent verifies programs at 98 percent success rate
Agentic Proving for Program Verification
-
Agents fail quantitative goals without progress tracking
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
-
JVM microbenchmarks yield misleading results from unrealistic profiles
Misleading Microbenchmarks on the Java Virtual Machines
-
SQL benchmarks turned into Java Stream tests expose best parallel patterns
JEDI: Java Evaluation of Declarative and Imperative Queries
-
Rust auto-enforces 48% of applicable MISRA C++ rules
MISRust: Mapping MISRA-C++ Coding Guidelines to the Rust Programming Language
-
Enterprise AI needs risk reduction testing
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems
-
Compiler framework cuts runtime 45% while holding energy fixed
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization
-
AI coding assistants cut coding time but double reports of worsened experience
The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study
-
Philosophical dispositions produce 51% unique AI code review findings
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study
-
All seven LLMs generate vulnerable code in developer-like tests
Security of LLM-generated Code: A Comparative Analysis
-
Kubernetes agent framework shows retrieval yields only partial falsification
A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification
-
Input-output time proxies best match expert code comprehension rankings
On the Reliability of Code Comprehension Proxies
-
Flipping optimization branches reveals 21 DBMS performance bugs
Finding Performance Issues in Database Systems by Exploiting Dormant Code Paths
-
LLM code smells found in 73.5% of analyzed systems
LLM Code Smells: A Taxonomy and Detection Approach
-
Toolkit automates annotation of child-caregiver eye-tracking videos
GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction
-
FAME detects log anomalies per message with 76x less labeling
FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection
-
One handler generates both streaming API and MCP tool
HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools
-
Contractual skills turn agent instructions into inspectable task contracts
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents
-
AI framework secures cardless banking against fraud
Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms
-
Multiple metrics required to judge synthetic data for tool-calling agents
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
-
Rejections overstate AI agent errors in open source PRs
Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
-
Refinement more than doubles compilability of agent patches
"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution
-
Explicit baseline fixes attribution errors in neural explanations
The Neglected Baseline in Model Interpretation
-
Adversarial scaling reveals LLM code weaknesses
VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
-
GenAI adds hidden costs to developer well-being
At What Cost? Software Developers' Well-Being in the Age of GenAI
-
Trial harnesses let AI agents turn outcomes into process updates
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
-
Attacks lift autonomous agent risk rate from 28.3% to 52.6%
Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions
-
The paper describes an architecture that combines DevOps practices with decentralized…
An Architecture for Decentralised Deployment and Operation of Blockchain Applications
-
LLMs verify only 10% of test suites on code mutations
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
-
Syntax-driven repair fixes 97.5% of network config errors
Astragalus: Automatic Configuration Repair for Production Networks
-
System repairs TEE partitioning errors at 87.6 percent success
Automated Repair of TEE Partitioning Issues via DSL-Guided and LLM-Assisted Patching
-
LLM mocks let symbolic execution find TEE input flaws
Finding Missing Input Validation in TEEs via LLM-Assisted Symbolic Execution
-
Patch-guided trajectories raise SWE agent fixes by 10.8 points at 15% lower cost
From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents
-
LLM summaries add context
Deterministic vs. Probabilistic Summarisation: An Empirical Trade-off Study in Design Pattern Centric Java Code
-
PITMuS maps bytecode mutants to source edits for fresh bug datasets
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction
-
Four principles let LLM agent build correct fuzz harnesses
Quality-Assured Fuzz Harness Generation via the Four Principles Framework
-
Multi-agent LLM system finds 29 zero-day vulnerabilities
FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction
-
Agile workshop formulates four propositions to close research gaps
The 2nd Workshop on Agile Practice & Research: A Summary and Call For Research
-
ReproFlake supplies scripts to reproduce failures in 1115 flaky tests
A Dataset of Reproducible Flaky-Test Failures
-
Dataset unifies 73k binaries with build variations and CVE history
ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage
-
Hybrid OOD monitors lift LLM failure recall from 39 to 45 percent
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
-
LLMs reach 100% consistency adapting grammars to metamodel changes
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution
-
AI refactoring PRs improve quality in 22.5% of cases
Quality and Security Signals in AI-Generated Python Refactoring Pull Requests
-
Agents propose specs, solvers verify LLM-generated code
Agentic Model Checking
-
Stdlib reimplementations match third-party Python library speeds
Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries
-
Voxel reconstruction validates navmeshes with less exploration
Validating Navmesh using Geometry: Voxel-Based Analysis with Prioritized Exploration
-
Agents pass visible tests but fail held-out usage tests as tasks lengthen
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
-
SPLE review compares adoption models and AI challenges
Software Product Line Engineering: Adoption, Tooling and AI Era Challenges
-
Multi-agent system turns full LLM traces into evidence-backed insights
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
-
Multi-agent reports raise LLM scaffold performance by 30 points
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents