archive
Every paper Pith has read.
1797 papers in cs.SE
-
Two-agent system repairs LLM agent bugs more effectively
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
-
Three patterns mark how teams respond to GitHub Actions failures
Beyond the YAML File: Understanding Real-World GitHub Actions Workflow Adoption
-
Hugging Face data drives dynamic AI model card updates
Toward Reusability of AI Models Using Dynamic Updates of AI Documentation
-
AI code shows 1.8 times more quiet-failure risks than human code
AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code
-
Logging tools need multilingual checks to be reliable
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
-
QRisk cuts quantum noise 45% by avoiding recurring error patterns
Isolating Recurring Execution-Dependent Abnormal Patterns on NISQ Quantum Devices
-
Analysis extracts unit tests from integration tests
Augmenting unit test suites from integration tests
-
Technology research software forms its own overlooked category
Technology Research Software: An Often Overlooked Category of Research Software
-
Reverse-engineered specs yield 94% APR success on Defects4J
Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications
-
Adaptive AI personas teach coding tool use
Agentic Education: Using Claude Code to Teach Claude Code
-
Modeling projects as networks provides more consistent estimates of resilience to key…
Project resilience as network robustness
-
ML automation targets RISC-V certification costs for cars
RISC-V Functional Safety for Autonomous Automotive Systems: An Analytical Framework and Research Roadmap for ML-Assisted Certification
-
Models pass tests by regenerating code
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
-
LLMs pass 76% of tests but edit with under 45% precision
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
-
LLMs detect design patterns with promising accuracy
A Pilot Study on Detecting Software Design Patterns with Large Language Models: An Empirical Evaluation
-
KnowPilot improves domain text generation by merging priors
KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks
-
T2MRec matches tasks to MCP servers via semantic and structural cues
From Language to Action: Enhancing LLM Task Efficiency with Task-Aware MCP Server Recommendation
-
Kimi-K2.5 at 3 bits tops models on React Native app task
React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
-
Personas in requirements engineering align clinical AI trainers with real practice
Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training
-
Adaptive router lifts LLM code repair accuracy by 32 percent
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
-
MoE routing overlaps 11x random even for different code tokens
Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy
-
Agentic AI governance misses links from rules to provable actions
Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
-
Real token tracking matches AI dev costs within 2%
AI Observability for Developer Productivity Tools: Bridging Cost Awareness and Code Quality
-
Local command center unifies dev tools and raises AI readiness
Workstream: A Local-First Developer Command Center for the AI-Augmented Engineering Workflow
-
Transfer from C++ improves Ruby and Rust repair Pass@1 by 17 points
HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer
-
Memory cascade resolves 86% of Python dependency issues
MEMRES: A Memory-Augmented Resolver with Confidence Cascade for Agentic Python Dependency Resolution
-
Co-versioning run-time behavior with code reveals hidden changes
Treating Run-time Execution History as a First-Class Citizen: Co-Versioning Run-time Behavior alongside Code
-
Gleaner sampler raises RCA accuracy above full dataset at 1 percent rate
Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics
-
Prompt tweaks flip LLM judge verdicts on identical code
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
-
App reviews flag persistent ethical barriers in mobile apps
Exploring Ethical Concerns of Mobile Applications from App Reviews: A Literature Survey
-
Prompt method halves AI bias sensitivity in software tasks
Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering
-
AI slop creates a tragedy of the commons in software
AI Slop and the Software Commons
-
This paper empirically tests 22 agentic AI frameworks on three reasoning benchmarks and…
Agentic Frameworks for Reasoning Tasks: An Empirical Study
-
Conversational agents help high school students with CSP
Investigating Conversational Agents to Support Secondary School Students Learning CSP
-
Survey of 280 researchers diagnoses barriers to cumulative knowledge in software
From Papers to Progress: Rethinking Knowledge Accumulation in Software Engineering
-
Fixing requirement mismatches raises LLM code success
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
-
Multi-modal verifier raises certified synthesis success rate
Certified Program Synthesis with a Multi-Modal Verifier
-
Contrastive training lifts LLM code detection accuracy to 78 percent
LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
-
The paper identifies a 'Keyword Shortcut' bias in existing code localization benchmarks…
Neurosymbolic Repo-level Code Localization
-
MLIR unifies equivalence checking from algorithms to netlists
EquivFusion: Unifying Hardware Equivalence Checking from Algorithms to Netlists via MLIR
-
The paper introduces flowR, a VS Code and Positron extension that builds dataflow graphs…
Supporting the Comprehension of Data Analysis Scripts
-
Small programs can have up to 76 configuration options
Small Yet Configurable: Unveiling Null Variability in Software
-
Removals lag additions so toggle counts keep rising in large systems
Feature Toggle Dynamics in Large-Scale Systems: Prevalence, Growth, Lifespan, and Benchmarking
-
QMutBench gives 700k quantum mutants to benchmark tests
QMutBench: A Dataset of Quantum Circuit Mutants
-
Tool pairs LLMs with symbolic checks to create Python contracts
SpecPylot: Python Specification Generation using Large Language Models
-
LLM evolves coding skill by generating its own failure tests
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
-
One LLM improves code by making its own adversarial tests
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
-
Model unites text, code and images in one retrieval system
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
-
The paper models quantum error budget allocation as a potential game among logical…
A Game Theoretic Approach for Optimizing Quantum Error Budget Distribution
-
Symbolic guardrails enforce 74% of agent safety policies
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility