archive
Every paper Pith has read.
1797 papers in cs.SE · page 17
-
Fuzzing finds bugs in deductive verifiers
Crash-free Deductive Verifiers
-
DynaHug catches malicious ML models by watching runtime behavior
Malicious ML Model Detection by Learning Dynamic Behaviors
-
Tool flags code-doc mismatches only when tests prove the mismatch
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
-
Framework maps stakeholder views to formal SysML v2 architecture
Towards Formalising Stakeholder Context using SysML v2
-
EnergyTrackr flags energy spikes in Java commits
Systematic Detection of Energy Regression and Corresponding Code Patterns in Java Projects
-
LLM agents reach only 35 percent CTF checkpoint completion
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
-
Mocking info from tests guides LLMs to better unit tests
Improving LLM-Driven Test Generation by Learning from Mocking Information
-
Simulated debugging boosts LLM bug fixes by 26% on Defects4J
DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging
-
Four-layer workspace structures human-AI co-development of VA tools
BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications
-
Iterative retriever lifts bug test generation rates by 20-32 percent
iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation
-
Large models sketch edits, small models apply them
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
-
Empathic IDE matches standard tools on learning but helps more with errors
Towards More Empathic Programming Environments: An Experimental Empathic AI-Enhanced IDE
-
Mutations expose inconsistencies in 15% of Code LLM responses
MUCOCO: Automated Consistency Testing of Code LLMs
-
Multimodal AI spots GUI defects in multi-window mobile apps
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
-
Adversarial agents eliminate 79% of LLM defect candidates
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
-
Security is relative to project contracts
Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis
-
Framework turns aerospace requirements into LTL at 85% precision
Automated LTL Specification Generation from Industrial Aerospace Requirements
-
SVGD seeds raise ADS safety violation rates
From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing
-
Graph interface turns AI coding into branching explorations
Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph
-
AI-human loop cuts bug report labeling effort by 196%
Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning
-
Structural checks raise EDA code success without runtime debugging
Structural Verification for Reliable EDA Code Generation without Tool-in-the-Loop Debugging
-
Cutoff theorem bounds verification search for DSLTrans properties
Tractable Verification of Model Transformations: A Cutoff-Theorem Approach for DSLTrans
-
AI transformation methods lack systematic guidance on ML task derivation
From Business Problems to AI Solutions: Where Does Transformation Support Fail
-
Only 0.4% of Android apps match privacy policies to their logs
Do Privacy Policies Match with the Logs? An Empirical Study of Privacy Disclosure in Android Application Logs
-
Sentence transformers filter SCA alerts to 89% F1
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
-
Direct TypeScript compiler parser speeds up large-repo indexing for AI agents
TypeScript Repository Indexing for Code Agent Retrieval
-
Agent builds playable web games from prompts where LLMs fail
OpenGame: Open Agentic Coding for Games
-
AI software ecosystems show emergent failures from agent interactions
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
-
Co-locating tests yields near-perfect AI code preservation
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
-
AI bot PR frequency tied to lower CI/CD success rates
Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows
-
Context composition causally shapes LLM failure explanation quality
From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
-
Modular adapters beat fine-tuning on hard SQL queries
LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL
-
LLM pipeline formalizes specs into properties at 77.8% accuracy
Towards an Agentic LLM-based Approach to Requirement Formalization from Unstructured Specifications
-
WebCompass benchmark evaluates full web coding workflows for AI models
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
-
Real execution replaces mental simulation in LLM coding
SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution
-
SystemC prototypes remove false positives from embedded fuzzing
Stateful Embedded Fuzzing with Peripheral-Accurate SystemC Virtual Prototypes
-
Processes and pipes made lightweight for far memory accelerators
Proxics: an efficient programming model for far memory accelerators
-
Fairness-first design thinking embeds equity into software architecture
Fairness-First Design Thinking for Software Architecture
-
7B model beats larger LLMs at code translation without parallel examples
CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
-
AI systems should treat choices as governed tuned variables
Statistical Software Engineering with Tuned Variables
-
API sequence mining boosts library fuzz coverage by 8.54%
MASFuzzer: Fuzz Driver Generation and Adaptive Scheduling via Multidimensional API Sequences
-
PTMs are added late in projects and accumulate rather than being replaced
When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems
-
Framework detects every GitHub abuse type above 89% accuracy
Weaponizing the Commons: A Taxonomy and Detection Framework of Abuse on GitHub
-
Ten cache smells affect 89% of GitLab CI/CD projects
Cache-Related Smells in GitLab CI/CD: Comprehensive Catalog, Automated Detection, and Empirical Evidence
-
Graph consensus layer must replace code as AI coding artifact
Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
-
Joint prompt and tool optimization raises agent success 5-20%
JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
-
Video analysis lets AI grade diverse Scratch programs accurately
Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation
-
Ground truth tests show debloaters remove needed code or keep extras
Revisiting Code Debloating with Ground Truth-based Evaluation
-
GLMTest raises branch accuracy to 50% by conditioning on code graphs
Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics