archive
Every paper Pith has read.
1797 papers in cs.SE · page 3
-
Framework choice reverses meaning of agent behavior signals
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
-
CommitDistill hits 0.75 retrieval rate from git history at 256-char budget
CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories
-
Debating LLMs catch more code vulnerabilities
Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection
-
Multi-model feedback doubles AI solves on contest problems
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
-
ProcCtrlBench detects process defects in LLM coding agents missed by outcome scores
ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
-
Process benchmark catches mid-task defects in LLM coding agents
ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
-
Tool localizes node errors in multi-agent LLM workflows
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
-
Two-level router cuts log QA latency 55%
LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems
-
Verify gate turns agent completion into inspectable admission control
Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study
-
Verify gate renders multi-agent completions inspectable and fail-closed
Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study
-
Agentic RAG reaches 78% top-1 file bug localization
BLAgent: Agentic RAG for File-Level Bug Localization
-
Call-site context lifts code model pass rates
Contextualized Code Pretraining for Code Generation
-
Two-stage LLM workflow verifies code against natural language rules
LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report
-
Retrieval system compresses Lean proofs over 70 percent
Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search
-
AI feedback helps Scrum Masters spot their own negative emotions live
EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness
-
Framework keeps AI-assisted scientific code traceable under NQA-1
Bridging the Gap on AI-Assisted Scientific Software Development Through Transparency and Traceability
-
Guided checks at code boundaries boost translation pass rates
Verifier-Guided Code Translation via Meta-Step Decoding
-
CFS and GA tuning lift fault prediction accuracy to 88.4%
A Feature-Driven Framework for Software Fault Prediction
-
LLMs subclassify invalid bug root causes and generate fixes
Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports
-
Inverted API exploration yields verified tool-call data
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
-
Five-stage AI workflow could ease the code review bottleneck
Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review
-
Multi-agent setup with graphs keeps business rules in legacy modernization
AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs
-
Agents fail 95% of SaaS tasks before business logic
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
-
LLM agent builds formal models by repairing verification errors
Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair
-
ContraFix fixes 84% of C/C++ vulnerabilities at low cost
ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse
-
Memory layers raise repo vulnerability repair to 58%
MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair
-
Diagnostic probes recover 45-62% of mislabeled GUI failures
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
-
Models hit only 6 Mythos bug targets out of 54 attempts with files supplied
Benchmarking Mythos-Linked Bug Rediscovery
-
PLC-BinX predicts PLC binary toolchains with 100 percent accuracy
One Step Further: Understanding PLC Binaries Through Cross-Platform Reverse Engineering and Function-Level Semantic Analysis
-
Ontology organizes foundations of software languages
Towards an Ontology for the Foundations of Software Languages
-
Block-level slicing triples LLM bug finds in 19K-line processor
Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing
-
No LLM clears 80 percent on observation contract compliance
ContractBench: Can LLM Agents Preserve Observation Contracts?
-
Context graphs guide LLMs to resolve code merge conflicts better
Rover: Context-aware Conflict Resolution with LLM
-
Automated TDD lifts AI web app success by 34-48 points
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
-
Static checks boost diffusion code RL performance
Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
-
Region allocators keep locality edge on modern hardware
Reconsidering "Reconsidering Custom Memory Allocation"
-
LLM package hallucinations shrink to 4.6-6.1% but 127 names stay common
The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort
-
LLMs skip hallucination-prone code tasks via execution checks
Task Abstention for Large Language Models in Code Generation
-
Low-code DevOps speeds tasks but adds security and governance risks
Low-Code Paradox in DevOps: Security and Governance Insights from Practitioners
-
FIDO times firmware inputs at availability checks to lift coverage
Stop Starving or Stuffing Me: Boosting Firmware Fuzzing Efficiency with On-demand Input Delivery
-
78% of open source AI policies allow GenAI contributions
AI Policy, Disclosure, and Human in the Loop: How Are Contribution Guidelines Adapting to GenAI?
-
GitHub projects standardize on README
What's Inside a GitHub Repository? An Empirical Study on the Contents of 10K Projects
-
Core compiler reuse via LSP powers fast IDE for Move
Optimizing an IDE for an Evolving Language Ecosystem
-
LLM and search methods trade off strengths in fixing merge conflicts
LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms
-
AR test framework tracks stable areas in videos for 55.8% coverage
TARIPlay: A Test Framework for AR Applications based on Interactive Area Tracking in Playback Videos
-
Gemini on trillion internal tokens cuts developer iterations 23%
Customizing an LLM for Enterprise Software Engineering
-
Adapted LLM cuts developer iterations by 23 percent
Customizing an LLM for Enterprise Software Engineering
-
Manufacturing ransomware recovery goes beyond backups
From Backup Restoration to Minimum Viable Factory Recovery: A Systematization of Ransomware Recovery in Manufacturing Systems