archive
Every paper Pith has read. Search by title, abstract, or pith.
1797 papers in cs.SE · page 2
-
Fortran scientific codes harbor many undefined-behavior-like defects
RSE of a Quantum Transport Code and its Effects
-
LLMs turn technical privacy details into clear reports for workers
Transforming Privacy Artifacts into Accessible Reports for Non-Technical Stakeholders
-
27% of Dockerfile SATD admissions couple with other files
Beyond the Tip of the Iceberg: Understanding SATD in Dockerfiles through the Lens of Co-evolution
-
RL fine-tuning lifts code generation pass@1 by 19% on MBPP
Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards
-
Spectral distances flag Trojaned DNN updates after one step
Detecting Trojaned DNNs via Spectral Regression Analysis
-
Small classifier beats LLMs at pulling exact text from papers
ACL-Verbatim: hallucination-free question answering for research
-
Refusal rate misranks LLMs on bio safety
RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
4 Piths -
Five checkpoints enforce policy in generalist agents
Governance by Construction for Generalist Agents
-
Bioinformatics bug detection rises 30-38% with new full-context dataset
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
-
LLMs endorse 32% of their own behavior-changing code rewrites
Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization
-
Contextual data makes code smell detection more actionable
An Event-Driven Tool for Context-Aware Code Smell Detection Using SmellDSL
-
State management beats workspace isolation in multi-agent tasks
Multi-agent Collaboration with State Management
-
LLM agent accuracy drops to 0.54-0.62 without labels
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
-
Privacy views raise coaching adherence from 0.48 to 0.74
Privacy-by-Design Adaptive Group Assignment for Digital Lifestyle Coaching at Scale
-
Frama-C plugin checks non-functional rules for automotive C
Contract Based Verification of Non-functional Requirements for Embedded Automotive C Code
-
LLM tests catch all 16 anomalies where manual checks find only 7
A Multi-Layer Testing Framework for Automated Data Quality Assurance in Cloud-Native ELT Pipelines
-
Code gen picks winner by clustering behaviors on auto-generated inputs
Code Generation by Differential Test Time Scaling
-
Agentic AI coding improves with structured verification loops
Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development
-
Methodology turns Bodies of Knowledge into assessable competencies
A Semantic-Web Oriented Competency Model for Engineering Programs
-
Four-part SDB contract organizes LLM agent runtimes
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
-
Taxonomy organizes 248 studies on combined program analyses
Combined Program Analysis Techniques: A Systematic Mapping Study
-
Staged analysis improves LLM recovery of ROS 2 architectures
Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction
-
Cleaner code reduces agent token use by 7-8% with no change in success
Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
-
Agent skills from expert methods beat docs for PostgreSQL tuning
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
-
Health data lakehouse shown usable for mixed-skill teams
OpenHealth Lake: Designing and testing a data lakehouse platform for health applications
-
LLMs simplify OOD but omit key abstractions
Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?
-
LLMs optimize code via priors
Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
-
Hard-coded verifiers beat LLM judges at matching human evaluations
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
-
Quantum tests can live inside .qasm circuit files
QUTest: A Native Testing Framework for Quantum Programs
-
Agent fixes 89% of flaws in source-free industrial software
SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities
-
Criterion-level pairwise judgments lift code judge accuracy to 66.3%
CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging
-
Study catalogs 301 real tile-program bugs from GitHub
Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection
-
Single-file AI tools push accessibility boundaries outward
The Accessibility Capability Boundary: Operational Limits and Expansion Potential of AI-Generated Browser-Native Accessibility Systems
-
One LLM system optimizes text to beat specialists on six tasks
optimize_anything: A Universal API for Optimizing any Text Parameter
-
Governance recipe lifts LLM skill-library performance from 0.26 to 0.58
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
-
MILP solves fairness repair for neural networks with formal guarantees
Provable Fairness Repair for Deep Neural Networks
-
Dependency repair shrinks programs 52 percent more than syntax-only reducers
DRReduce: Enhancing Syntax-Guided Program Reduction with Dependency Reconstruction
-
Code models now decide when to answer and when to defer
When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions
-
Input adaptation cuts code model mispredictions without retraining
On-the-Fly Input Adaptation for Reliable Code Intelligence
-
MOCHA improves agent skill correctness on every task
MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
-
Multi-agent system hardens test updates with mutations
MuMuTestUp: Mutation-based Multi-Agent Test Case Update
-
Self-healing web apps detect faults at 90.7% and recover 56% faster
When Web Apps Heal Themselves: A MAPE-K Based Approach to Fault Tolerance and Adaptive Recovery
-
LLM agents turn switch manuals into graphs at 97-99% accuracy
Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems
-
AI restructures open source docs to cut cognitive overload
Restructure This: Using AI to Restructure Onboarding Documents to Reduce Cognitive Overload
-
RL agent refines prompts to boost LLM code pass rates
Prompt Optimization for LLM Code Generation via Reinforcement Learning
-
Multi-agent pipeline extracts traceable specs from legacy code
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
-
Stripping consent declarations raises overeager rate in coding agents
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
-
q-log odds lift BM25 NDCG@10 by 89% on code search
Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
-
One engineer with AI agents finishes four-person project in half the time
One Developer Is All You Need: A Case Study of an AI-Augmented One-Person Squad in a Brownfield Enterprise