archive
Every paper Pith has read. Search by title, abstract, or pith.
1797 papers in cs.SE · page 7
-
Regional zoom beats global Pareto in 84-89% of SE tasks
Zoom, Don't Wander: Why Regional Search Outperforms Pareto Reasoning and Global Optimization in Budget-Constrained SBSE
-
LLM smart contracts score 8.29 points above human versions
SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
-
ConCovUp lifts concurrency test coverage from 37% to 68%
ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing
-
Belief-revision agents verify code authorship without training
MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification
-
Multi-agent belief revision verifies code authors without training
MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification
-
Multi-agent system verifies code authorship without training
MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification
-
Ethical safeguards prioritized in cost model for LLM education use
Prediction Model of Motivators and Demotivators of Integrating Large Language Models in Software Engineering Education: An Empirical Study
-
Model optimizes cost-efficient LLM integration in software engineering classes
Prediction Model of Motivators and Demotivators of Integrating Large Language Models in Software Engineering Education: An Empirical Study
-
Execution traces create first noise-free test for LLM code understanding
An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning
-
LLM sim code runs but solves wrong physics
Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code
-
Merlin turns natural language into CodeQL queries that raise accuracy 3.8x
Generating Complex Code Analyzers from Natural Language Questions
-
Memoized heuristics scale ion-trap qubit mapping
Scaling Qubit Mapping and Routing With Position Graph Abstraction and Memoization
-
Krone decomposes logs into entity-action-status units for modular anomaly detection
Detect, Localize, and Explain: Interactive Hierarchical Log Anomaly Analytics with LLM Augmentation
-
Line-level rewards raise program repair success to 40.7% on SWE-bench
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
-
Line-level credit in RL lifts program repair to 40.7% on SWE-bench
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
-
Dual rewards boost code repair to 40.7% on SWE-bench
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
-
Developer reviews expose LLM code flaws missed by benchmarks
Evaluating LLM-Generated Code: A Benchmark and Developer Study
-
Fuzzer finds 64 inconsistencies in Solidity compilers
ParityFuzz: Finding Inconsistencies across Solidity Compilers via Fine-Grained Mutation and Differential Analysis
-
Containment verification proves AI safety guarantees independent of alignment
Containment Verification: AI Safety Guarantees Independent of Alignment
-
Semantic distance beats disagreement counts for LLM code uncertainty
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
-
Skill drift is contract violation in LLM agent libraries
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
-
Three-layer gate turns agent failures into bounded fixes
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
-
LLMs mine tactics that let CoqHammer prove 24% more theorems
A Learning Method for Symbolic Systems Using Large Language Models
-
Execution fingerprints beat text voting for LLM code
Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
-
Sketching strategies outperform flat sampling for code at a fixed budget
Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching
-
EvidenT repairs 54% of RISC-V package build failures
EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair
-
Models reach 92 percent on code but only 5 percent on provable code
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
-
Benchmark reveals LLM-based CUDA fixers often degrade code to satisfy tests
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
-
Dataset collects 15k configs for AI coding tools
A Dataset of Agentic AI Coding Tool Configurations
-
AI agents omit runtime details in their own technical talks
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook
-
AI agents talk security and trust more than specific code issues
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook
-
Benchmark scores coding agents on engineering quality beyond bug fixes
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
-
Hardware attestation signs build provenance without trusting operators
Kettle: Attested builds for verifiable software provenance
-
Cyclic tuning raises RAG quality by up to 54 percent
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
-
AI agents start most PRs but humans keep merge authority
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
-
Collaborator AIs open most PRs while humans keep merge control
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
-
Adding one vector switches which tool a language model calls
Tool Calling is Linearly Readable and Steerable in Language Models
-
Annotating similar past faults guides LLMs in test code fault localization
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
-
Trace comparison creates a score for design conformance
Evaluating Design Conformance Through Trace Comparison
-
One rules engine powers play
Mazocarta: A Seeded Procedural Deckbuilder for Instrumented Game Development
-
Bidirectional analysis finds 118 unsafe flows in 87 MCP servers
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
-
Security designs link to code checks in only a few ways
Can I Check What I Designed? Mapping Security Design DSLs to Code Analyzers
-
Unified AST labels and graph matching link equivalent code across languages
Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching
-
Agents patch code on 35-65% of already-fixed bugs
Coding Agents Don't Know When to Act
-
Neuro-symbolic method detects threats in stripped industrial binaries
Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software
-
SARC enforces agent constraints at runtime for zero hard violations
SARC: A Governance-by-Architecture Framework for Agentic AI Systems
-
Manifesto recasts scaled agile around AI as first-class participant
The AI-Native Large-Scale Agile Software Development Manifesto
-
Manifesto puts AI at core of large-scale agile development
The AI-Native Large-Scale Agile Software Development Manifesto
-
Search tunes LLMs to cut harmful responses
SafeTune: Search-based Harmfulness Minimisation for Large Language Models
-
First benchmark supplies real data for LLM hyperparameter tuning
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems