archive
Every paper Pith has read.
1797 papers in cs.SE · page 11
-
LLM repair models drop over 50% on minor code tweaks
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
-
Evaluation issues cause many false failures in LLM code translation
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
-
Evaluation errors inflate LLM code translation failure rates
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
-
Agentic critic loop keeps code docs synced to changes
DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion
-
Binary patching works via decompile-repair-recompile
SCRIBE: Practical Static Binary Patching via Binary-Aware Recompilation of Decompiled Code
-
Datalog DSL in Lean translates queries to provable theorems
A Shallow Embedding of Datalog in Lean
-
Foundation models detect Java refactoring bugs at 93.8% accuracy
Foundation Models as Oracles for Refactoring Correctness Detection
-
GitHub Actions audit finds 28% compliance with LLM hybrid checks
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing
-
This paper evaluates training-free classification of conventional commit messages using…
Conventional Commit Classification using Large Language Models and Prompt Engineering
-
ACDL standardizes precise descriptions of LLM agent contexts
A Language for Describing Agentic LLM Contexts
-
LLM agents cut false positives in security scans by 88 percent
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
-
Declarative framework cuts RAG tuning code changes by 95%
AutoRAGTuner: A Declarative Framework for Automatic Optimization of RAG Pipelines
-
QSAF turns 34 circuit primitives into reusable hybrid-system components
Quantum Software Architecture Framework (QSAF): A Component-Based Framework for Designing Hybrid Quantum-Classical Systems
-
Expert patterns boost LLM vulnerability repair accuracy
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns
-
Sprint simulation teaches empirical control in Scrum projects
A Lightweight Scrum Sprint Simulation to Help Learners Traverse the Empirical Process Control Threshold Concept
-
Safety-gated memory for RL coding agents hits 80% accuracy
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
-
Neuro-symbolic agents block invalid requirements by design
Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse
-
Genetic programming evolves scaling policies that cut microservice resource use
Genetic Programming for Self-Adaptive Auto-Scaling of Microservices
-
Unrestricted autonomy breaks LLM test repair in enterprise UIs
Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM-Driven Discovery and Self-Correction
-
LLM spec accuracy drops 20 percent after removing deceptive outputs
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
-
ChatGPT supports nine categories of software design tasks
Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey
-
Turing machine extension defines context-awareness
On defining and modeling context-awareness
-
LLM feedback agents improve test coverage on C and Python code
FeedbackLLM: Metadata driven Multi-Agentic Language Agnostic Test Case Generator with Evolving prompt and Coverage Feedback
-
Interactive agents clarify vague specs before STL generation
ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification
-
AI code output rises but reliability lags without strong specs
The Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development
-
DDD simulator runs same microservice code under multiple consistency models
A Domain-Driven Design Simulator for Business Logic-Rich Microservice Systems
-
Platform links every AI prompt to its code edits for replay
RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
-
ProMoTA links high-level models to code with full traceability
ProMoTA: a model-driven framework for end-to-end traceability analysis
-
Shor ECDLP oracle in Qrisp breaks control semantics
Semantics-Based Verification of an Implemented Shor Oracle for ECDLP in Qrisp
-
LLM agents reproduce materials findings at 54 percent
Can Coding Agents Reproduce Findings in Computational Materials Science?
-
GeoContra lifts LLM GIS correctness by 26 percent via contracts
GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair
-
350k code preference pairs train multi-criteria reward models
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
-
350k code preferences train flexible multilingual reward models
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
-
Pass-rate rewards fail to beat binary rewards in code RL
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
-
Practitioners identify gaps in end-to-end autonomous driving tests
From Research to Practice: An Interactive Rapid Review of Autonomous Driving System Testing in Industry
-
ML predicts energy of code blocks from static features
EnCoDe: Energy Estimation of Source Code At Design-Time
-
Dataset shows API recommenders weaken on deep calls
Q-ARE: An Evaluation Dataset for Query Based API Recommendation
-
Dense retrieval beats sparse for issue-commit links
Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval
-
PPO agent picks prompts for higher test coverage
PPO guided Agentic Pipeline for Adaptive Prompt Selection and Test Case Generation
-
Curriculum training lifts LLM code generation accuracy
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
-
Agent skills remain untrusted until verified by runtime
Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
-
Agent skills stay untrusted until they pass verification tests
Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
-
LLMs infill masked bug reports to uncover 27 Rust compiler bugs
ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs
-
Fairness monitor agent cuts bias in LLM code by 65 percent
Social Bias in LLM-Generated Code: Benchmark and Mitigation
-
Agile team embeds log-based fraud alerts via weekly iterations
Integrating Log-Based Security Analytics in Agile Workflows: A Real-World Experience Report
-
Code model released openly after risk checks find no new threats
Code World Model Preparedness Report
-
Code model cleared for open release after risk checks
Code World Model Preparedness Report
-
Encrypted string operations enable private conformance checking
A Privacy-Preserving Approach to Conformance Checking
-
Software leadership is managerial and interpersonal
What Characterizes a Software Leader? Identifying Leadership Practices from Practitioners Social Media
-
Deptex finds true vulnerability reach by combining graphs and language models
DEPTEX: Organization-First, Open Source Dependency Risk Monitoring