archive
Every paper Pith has read.
1797 papers in cs.SE · page 13
-
Quantum circuits cover conditions well but paths poorly
Probabilistic Condition, Decision and Path Coverage of Circuit-based Quantum Programs
-
MoE models match human graders on math rubrics where 70B model fails
Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
-
Seven recommendations guide LLM adoption in software teams
Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development
-
Pipeline builds consistent graphs from C
Graph Construction and Matching for Imperative Programs using Neural and Structural Methods
-
Natural language scenarios generate higher-coverage tests than BDD
PICKLES: a Natural Language Framework for Requirement Specification and Model-Based Testing
-
Solidity semantic clones detected with 97% recall via code and comments
Identifying and Characterizing Semantic Clones of Solidity Functions
-
Knowledge graph drives 3x faster documentation with 85% fewer tokens
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
-
Speculative decoding speeds up SE tasks more for small models
An Empirical Study of Speculative Decoding on Software Engineering Tasks
-
LLMs vary widely in screening papers for software SLRs
Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
-
Swarm optimizer cuts vehicle offload response times
Towards Intelligent Computation Offloading in Dynamic Vehicular Networks: A Scalable Multilayer Pipeline
-
Asset shells keep OCL constraints inside MBSE models
Asset Administration Shell-Based OCL Validation Framework for Model-Based System Engineering
-
Software engineering shifts from code generation to AI delegation
Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering
-
Only 23% of LLM-generated Rust crypto code compiles
An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code
-
Survey finds disconnect between program structure and adaptive security tests
Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches
-
Review shows LLMs automate data tasks in software engineering studies
LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda
-
LLM observability layers mature but integration lags
AI Observability for Large Language Model Systems: A Multi-Layer Analysis of Monitoring Approaches from Confidence Calibration to Infrastructure Tracing
-
LLM pipeline lifts bug report completeness from 8% to 96%
ImproBR: Bug Report Improver Using LLMs
-
Multi-view training detects AI code on unseen languages at 0.845 F1
UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection
-
LLM turns uncovered code into valid bug reports at 85 percent rate
LLM-Guided Issue Generation from Uncovered Code Segments
-
LLM tool turns uncovered code into prioritized bug reports
LLM-Guided Issue Generation from Uncovered Code Segments
-
Splitting code viewing from editing raises agent success 2.1% at 17.9% lower cost
SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent
-
GenDetect turns a single observed DeFi attack into reusable detection rules by…
GenDetect: Generalizing Reactive Detection for Resilience Against Imitative DeFi Attack Cascade
-
Carbon-tax ordering cuts LLM memory by up to 49x
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
-
Multi-LLM pipeline extracts 734 trajectories from GitHub issues
From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions
-
LLM REST tests lose effectiveness on faulty code and vague specs
RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
-
Evolved harnesses raise coding-agent pass@1 from 69.7% to 77%
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
-
Ten AHE iterations lift coding-agent pass@1 to 77%
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
-
RSEs form a collective identity that shapes their wellbeing
Does social identity matter in software engineering? Assessing the case of research software engineers
-
Developer roles drive microservices coupling more than architecture
Key Developer Roles and Organizational Coupling in Microservices: A Longitudinal Analysis
-
Code metrics match plagiarism tools in ranking performance
Can Code Evaluation Metrics Detect Code Plagiarism?
-
Scenarios compose into online tests for robot systems
Scenario-based System Testing for Distributed Robotics Applications
-
Multi-agent editing lifts code success to 68.6 percent
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
-
Code-comment alignment lifts F1 scores by up to 27% in vulnerability detection
Learning Generalizable Multimodal Representations for Software Vulnerability Detection
-
Classical ML beats transformers for bug report fault localization
Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics
-
Bug report text trains models to find faults in robotics code
Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics
-
GPT tools draft spreadsheet models but fail to reproduce them consistently
Spreadsheet Modeling Experiments Using GPTs on Small Problem Statements and the Wall Task
-
LLMs generate Given-When-Then tests for FMU simulations
Using Large Language Models for Black-Box Testing of FMU-Based Simulations
-
PLM choice outweighs GNN backbone in code hybrid models
PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection
-
12,000 tests quantify energy costs of mobile settings
An Empirical Analysis of Mobile Energy Consumption Across User Configurations
-
MBSE models must be co-designed as AI-queryable knowledge bases
AI as Consumer and Participant: A Co-Design Agenda for MBSE Substrates and Methodology
-
MLLMs suggest ranked usability fixes from videos
Recommending Usability Improvements with Multimodal Large Language Models
-
LLMs inconsistent on equivalent code versions
CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction
-
Commit structure lifts test prioritization in CI
Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration
-
R³-SQL reaches 75.03 accuracy on BIRD-dev for Text-to-SQL
R$^3$-SQL: Ranking Reward and Resampling for Text-to-SQL
-
VisualNeo connects visual queries to Neo4j for graph searches
VisualNeo: Bridging the Gap between Visual Query Interfaces and Graph Query Engines
-
MARD is a multi-agent system that uses large language models to detect Android malware by…
MARD: A Multi-Agent Framework for Robust Android Malware Detection
-
DiRe preserves 3-4 times more topology than UMAP at equal speed
DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale
-
Conformance checking runs on homomorphically encrypted logs
Secure Conformance Checking using Token-based Replay and Homomorphic Encryption
-
Four agents turn incomplete Rust CVEs into analyzable tests
Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets