archive

Every paper Pith has read.

1797 papers in cs.SE · page 1

cs.AI 2026-05-22 reviewed

Claude agent verifies programs at 98 percent success rate
Agentic Proving for Program Verification

Alessandro Sosso +2
cs.LG 2026-05-22 reviewed

Agents fail quantitative goals without progress tracking
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

Yuandao Cai +4
cs.PL 2026-05-22 reviewed

JVM microbenchmarks yield misleading results from unrealistic profiles
Misleading Microbenchmarks on the Java Virtual Machines

Filippo Schiavio +2
cs.PL 2026-05-22 reviewed

SQL benchmarks turned into Java Stream tests expose best parallel patterns
JEDI: Java Evaluation of Declarative and Imperative Queries

Filippo Schiavio +1
cs.SE 2026-05-22 reviewed

Rust auto-enforces 48% of applicable MISRA C++ rules
MISRust: Mapping MISRA-C++ Coding Guidelines to the Rust Programming Language

Marius Molz +4
cs.SE 2026-05-22 reviewed

Enterprise AI needs risk reduction testing
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

Chitra Badagi +3
cs.PL 2026-05-22 reviewed

Compiler framework cuts runtime 45% while holding energy fixed
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization

Amirhosein Sadr +1
cs.SE 2026-05-22 reviewed

AI coding assistants cut coding time but double reports of worsened experience
The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study

Annie Vella +1
cs.SE 2026-05-21 reviewed

Philosophical dispositions produce 51% unique AI code review findings
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

Kaushal Bansal
cs.SE 2026-05-21 reviewed

All seven LLMs generate vulnerable code in developer-like tests
Security of LLM-generated Code: A Comparative Analysis

Srivathsan G Morkonda +2
cs.SE 2026-05-21 reviewed

Kubernetes agent framework shows retrieval yields only partial falsification
A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Joshua Odmark +2
cs.SE 2026-05-21 reviewed

Input-output time proxies best match expert code comprehension rankings
On the Reliability of Code Comprehension Proxies

Erfan Arvan +3
cs.SE 2026-05-21 reviewed

Flipping optimization branches reveals 21 DBMS performance bugs
Finding Performance Issues in Database Systems by Exploiting Dormant Code Paths

Jinsheng Ba +1
cs.SE 2026-05-21 reviewed

LLM code smells found in 73.5% of analyzed systems
LLM Code Smells: A Taxonomy and Detection Approach

Zacharie Chenail-Larcher +4
cs.CV 2026-05-21 reviewed

Toolkit automates annotation of child-caregiver eye-tracking videos
GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

Iba Baig +7
cs.SE 2026-05-21 reviewed

FAME detects log anomalies per message with 76x less labeling
FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Huanchi Wang +5
cs.AI 2026-05-21 reviewed

One handler generates both streaming API and MCP tool
HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

Edwin Jose
cs.SE 2026-05-21 reviewed

Contractual skills turn agent instructions into inspectable task contracts
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Ting Liu
cs.CR 2026-05-21 reviewed

AI framework secures cardless banking against fraud
Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

Md Israfeel
cs.CL 2026-05-21 reviewed

Multiple metrics required to judge synthetic data for tool-calling agents
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang +3
cs.SE 2026-05-21 reviewed

Rejections overstate AI agent errors in open source PRs
Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

Sien Reeve O. Peralta +10
cs.SE 2026-05-21 reviewed

Refinement more than doubles compilability of agent patches
"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

Zhao Tian +6
cs.CV 2026-05-21 reviewed

Explicit baseline fixes attribution errors in neural explanations
The Neglected Baseline in Model Interpretation

Yongjin Cui +1
cs.LG 2026-05-21 reviewed

Adversarial scaling reveals LLM code weaknesses
VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

Yifan Bai +9
cs.SE 2026-05-21 reviewed

GenAI adds hidden costs to developer well-being
At What Cost? Software Developers' Well-Being in the Age of GenAI

Mariam Guizani +3
cs.MA 2026-05-21 reviewed

Trial harnesses let AI agents turn outcomes into process updates
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Chengcheng Wang +5
cs.CR 2026-05-21 reviewed

Attacks lift autonomous agent risk rate from 28.3% to 52.6%
Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

Jianan Ma +10
cs.SE 2026-05-21 reviewed

The paper describes an architecture that combines DevOps practices with decentralized…
An Architecture for Decentralised Deployment and Operation of Blockchain Applications

Fabian Stiehle +2
cs.SE 2026-05-21 reviewed

LLMs verify only 10% of test suites on code mutations
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

Yuxuan Sun +8
cs.NI 2026-05-21 reviewed

Syntax-driven repair fixes 97.5% of network config errors
Astragalus: Automatic Configuration Repair for Production Networks

Zhenrong Gu +3
cs.SE 2026-05-21 reviewed

System repairs TEE partitioning errors at 87.6 percent success
Automated Repair of TEE Partitioning Issues via DSL-Guided and LLM-Assisted Patching

Chengyan Ma +6
cs.SE 2026-05-21 reviewed

LLM mocks let symbolic execution find TEE input flaws
Finding Missing Input Validation in TEEs via LLM-Assisted Symbolic Execution

Chengyan Ma +5
cs.SE 2026-05-21 reviewed

Patch-guided trajectories raise SWE agent fixes by 10.8 points at 15% lower cost
From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

Murong Ma +9
cs.SE 2026-05-21 reviewed

LLM summaries add context
Deterministic vs. Probabilistic Summarisation: An Empirical Trade-off Study in Design Pattern Centric Java Code

Najam Nazar +1
cs.SE 2026-05-21 reviewed

PITMuS maps bytecode mutants to source edits for fresh bug datasets
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

Tasfia Tasnim +1
cs.CR 2026-05-20 reviewed

Four principles let LLM agent build correct fuzz harnesses
Quality-Assured Fuzz Harness Generation via the Four Principles Framework

Ze Sheng +5
cs.CR 2026-05-20 reviewed

Multi-agent LLM system finds 29 zero-day vulnerabilities
FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Ze Sheng +4
cs.SE 2026-05-20 reviewed

Agile workshop formulates four propositions to close research gaps
The 2nd Workshop on Agile Practice & Research: A Summary and Call For Research

Karen Eilers +5
cs.SE 2026-05-20 reviewed

ReproFlake supplies scripts to reproduce failures in 1115 flaky tests
A Dataset of Reproducible Flaky-Test Failures

Suzzana Rafi +5
cs.CR 2026-05-20 reviewed

Dataset unifies 73k binaries with build variations and CVE history
ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

Chang Liu +5
cs.AI 2026-05-20 reviewed

Hybrid OOD monitors lift LLM failure recall from 39 to 45 percent
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Dylan Feng +3
cs.CL 2026-05-20 reviewed

LLMs reach 100% consistency adapting grammars to metamodel changes
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

Weixing Zhang +4
cs.SE 2026-05-20 reviewed

AI refactoring PRs improve quality in 22.5% of cases
Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Mohamed Almukhtar +2
cs.SE 2026-05-20 reviewed

Agents propose specs, solvers verify LLM-generated code
Agentic Model Checking

Youcheng Sun +3
cs.SE 2026-05-20 reviewed

Stdlib reimplementations match third-party Python library speeds
Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

Peng Ding +1
cs.SE 2026-05-20 reviewed

Voxel reconstruction validates navmeshes with less exploration
Validating Navmesh using Geometry: Voxel-Based Analysis with Prioritized Exploration

Ramesh Raghavan +5
cs.SE 2026-05-20 reviewed

Agents pass visible tests but fail held-out usage tests as tasks lengthen
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao +3
cs.SE 2026-05-20 reviewed

SPLE review compares adoption models and AI challenges
Software Product Line Engineering: Adoption, Tooling and AI Era Challenges

Najam Nazar
cs.AI 2026-05-20 reviewed

Multi-agent system turns full LLM traces into evidence-backed insights
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik +8
cs.AI 2026-05-20 reviewed

Multi-agent reports raise LLM scaffold performance by 30 points
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik +8