archive
Every paper Pith has read.
1797 papers in cs.SE · page 5
-
Tool finds 545 reference counting bugs in Linux kernel drivers
Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers
-
DrvHorn uncovers 545 reference counting bugs in Linux v6.6 drivers
Automatic Detection of Reference Counting Bugs in Linux Kernel Drivers
-
Contrastive semantic model improves code translation
Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization
-
LLMs lag experts on system-level performance code
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
-
Toolkit standardizes benchmarks for screenshot-to-code models
UIBenchKit: A unified toolkit for design-to-code model evaluation
-
Code agents solve far fewer issues in full cycles than isolated tasks
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
-
Code models miss over 93% of fixes from changes alone
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
-
Bonuses for security scans cut issue density in team code
Security Incentivization: An Empirical Study of how Micropayments Impact Code Security
-
LLM JSON stays valid inside tight token budgets
TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints
-
Deeper thought per algorithm beats more candidates under fixed tokens
Effective Harness Engineering for Algorithm Discovery with Coding Agents
-
Protocols govern generated code via invariants and evidence chains
Protocol-Driven Development: Governing Generated Software Through Invariants and Continuous Evidence
-
Protocols admit generated code only via signed compliance evidence
Protocol-Driven Development: Governing Generated Software Through Invariants and Continuous Evidence
-
Protocols, not code, decide if generated software is admissible
Protocol-Driven Development: Governing Generated Software Through Invariants and Continuous Evidence
-
10.7% of SWE-agent passes are lucky trial-and-error
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
-
Metadata layer turns legacy SAS reports into AI-ready data
A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study
-
Open-source projects follow product life cycles
Project Life Cycles in Open-Source Software
-
cozy is a comparative binary analysis tool that uses symbolic execution to find…
Finding a Crab in the C: Assured Translation via Comparative Symbolic Execution
-
Natural language runs grid analyses in under two minutes
Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics
-
Lattice structures LLM judgments for reliable program analysis
Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
-
LLMs match human accuracy in spotting usability requirements in reviews
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
-
Fine-tuned open LLM matches ChatGPT on code feedback quality
Fine-Tuning Models for Automated Code Review Feedback
-
Docker container makes Basilisk GN&C simulations reproducible
Basilisk and Docker for Reproducible GN&C Simulation: A Workflow Reference
-
Nine LLM audits on prompts found 51 defects and converged to zero
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
-
MinTEJ terminal editor for Julia uses less memory than VS Code
Minimalistic Terminal Editor for Julia Programming -- MinTEJ: A Friendly Approach for a Scientific Programmer
-
LLMs fail most at strategy in GitHub issue fixes
Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
-
Partial programs control risk in LLM code generation
Uncertainty Quantification for LLM-based Code Generation
-
Dataset delivers 449 reproducible locator breaks in web GUI tests
ReproBreak: A Dataset of Reproducible Web Locator Breaks
-
Dataset supplies 2440 proprietary industrial repositories
CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
-
Harness design stabilizes small language models at 95 percent success
It's Not the Size: Harness Design Determines Operational Stability in Small Language Models
-
Metamorphic testing and LLMs strengthen each other for AI quality checks
Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
-
Framework embeds values in CPS human monitoring rules
HM-Req: A Framework for Embedding Values within CPS Human Monitoring Requirements
-
Diversified replicas detect correlated faults by ignoring addresses
Divergent Multi-Version Execution (DME): Canonical Instruction-Trace Fault Detection via Structural Address-Space Decorrelation
-
Microservices process thousands of documents per hour with OCR and LLMs
Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
-
Agent decision traces vary up to 43 points in completeness across SDKs
Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes
-
Guided LLMs translate APL legacy code to working C#
Neural Code Translation of Legacy Code: APL to C#
-
Print statements teach code models to reason step by step
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
-
Value and popularity drive OSS survival
The Death Spiral of Open Source Projects: A Post-Mortem Analysis of Pull Request Workflow Dynamics
-
Compiled interfaces cut agent token use by 57%
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces
-
Bug localization replication fails after fixing data leak
An Extensive Replication Study of the ABLoTS Approach for Bug Localization
-
SMT-LLM resolves Python dependencies at 83.6 percent
Breaking the Dependency Chaos: A Constraint-Driven Python Dependency Resolution Strategy with Selective LLM Imputation
-
Seminar sets six research priorities for agents and software engineering
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar
-
597-line harness supports fair comparisons of LLM pen-testing agents
Cochise: A Reference Harness for Autonomous Penetration Testing
-
Compiler feedback lifts neural decompilation success to 83.9 percent
Decaf: Improving Neural Decompilation with Automatic Feedback and Search
-
Mined tokens lift LLM flaky test F1-score to 69.34%
NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
-
Risk lattice turns consent clicks into reusable options
Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization
-
LLMs generate natural language specs to verify code compositionally
Natural Language based Specification and Verification
-
Ranking own code attempts boosts single-sample accuracy
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
-
Ranking own code attempts boosts single-rollout accuracy to match Best-of-4
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
-
SysML model drives hardware verification directly via server link
SHIA: A Direct SysML-Hardware Interface Architecture for Model-Centric Verification
-
4714 GitHub workflows hijackable via crafted comments
Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution