archive
Every paper Pith has read. Search by title, abstract, or pith.
1797 papers in cs.SE · page 4
-
Non-self-fixed ATD lingers longer with many developers' changes
The Dangers of Non-Self-Fixed Architecture Technical Debt and Its Impact on Time-to-Fix
-
Concept alignment lifts code search accuracy 15x on new data
XSearch: Explainable Code Search via Concept-to-Code Alignment
-
Small open LLMs match large ones at grammar-based DSL generation
From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs
-
AI agents solve at most 39% of real version upgrade tasks
RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
-
BootstrapAgent distills repo setup into reusable contracts
BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge
-
Early QA in annotation pipelines cuts costs more than late checks
Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation
-
Intra-thread duplication catches 39% more defective servers
ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
-
Bayesian sequential tests cut quantum verification costs
Bayesian Sequential Verification for Budget-Aware Quantum Program Testing
-
Chained mutators mostly interfere but some synergize in LLM jailbreaks
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs
-
LLM agent finds 24 zero-day privilege escalations in microservices
Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis
-
Runtime structure cuts retry costs in agentic coding by 51.7%
Runtime-Structured Task Decomposition for Agentic Coding Systems
-
Agent turns I/O examples into code via guided evolutionary search
From I/O to Code with Discovery Agent
-
Semantically grounded agents detect memory bugs in binaries
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
-
Viverra adds verified assertions to LLM-generated C code
Viverra: Text-to-Code with Guarantees
-
Test generation uncovers 2.56x more privacy leaks in code LLMs
Probing Privacy Leaks in LLM-based Code Generation via Test Generation
-
Agentic AI matures fastest where outputs can be tested automatically
Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle
-
Architecture docs let agents migrate eight C repos to Rust
Documentation-Guided Agentic Codebase Migration from C to Rust
-
Documentation blueprint enables full C-to-Rust repo migration
Documentation-Guided Agentic Codebase Migration from C to Rust
-
ML classifier beats rules at spotting BDD refactoring chances
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
-
Memory agent keeps repo documentation consistent
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
-
Retriever beats generator in RAG for code tasks
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
-
Stale code snippets make models output outdated helpers
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
-
Disguised compliance rules let attackers hijack LLM agents
Exploiting LLM Agent Supply Chains via Payload-less Skills
-
Multi-agent system automates full library fuzzing lifecycle
FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing
-
Agents resolve 45 percent of chained package upgrades
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
-
Size filter trims 80 percent of tokens from LLM repo inputs
Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
-
Valid microservice APIs often fail for AI agents
Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System
-
Hydra cuts LLM code gen latency up to 71% with rollback repairs
Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support
-
Web agents should plan before seeing page content
Web Agents Should Adopt the Plan-Then-Execute Paradigm
-
Failure-guided fuzzing beats random testing for HQC programs
Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs
-
Prompt strategy explains more variation in test diversity than model size when using LLMs…
LLM-Based Robustness Testing of Microservice Applications: An Empirical Study
-
Constrained edits merge checkpoints to lift code agent scores
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
-
AI agents speed creation of digital music instruments
Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments
-
Method-level change-proneness beats class-level for test minimization
Method-level Change-proneness: A Better Metric for Black-box Test Suite Minimization
-
Benchmark shows AI agents recall 42-83 percent of property-based testing bugs
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
-
LLMs detect 42-83% of semantic bugs with property-test prompts
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
-
LLM with SMT solver audits natural-language requirements
Neurosymbolic Auditing of Natural-Language Software Requirements
-
LLMs reach only 52% accuracy on HMSC semantic tasks
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
-
LLMs reach only 52% accuracy on HMSC formal semantics
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
-
CARS attributes AV collisions to driver faults
Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles
-
SkillOps is a plug-in framework that maintains LLM agent skill libraries by representing…
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
-
Quantifier rewrites and non-alias specs speed GPU verification ninefold
Scalable Deductive Verification of Data-Level Parallel Programs
-
AI agents drop 37-58% on hardware vs software tasks
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
-
Open standards let one agent model run consistently in three simulators
Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles
-
Runtime pruning cuts tokens 49% for local LLM fault localization
SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
-
Call stack data improves RL game testing agents
CA2: Code-Aware Agent for Automated Game Testing
-
Runtime harness mediates AI agent actions on code projects
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
-
This paper finds that code generated by large language models has overall readability…
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
-
Noise reshapes mutant detection in quantum programs
Robust Mutation Analysis of Quantum Programs Under Noise
-
Readiness metrics show near-zero link to research software execution success
ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment