archive
Every paper Pith has read.
1797 papers in cs.SE · page 16
-
Models generate correct code without public tests
You Don't Need Public Tests to Generate Correct Code
-
Bug variants reveal memorization in LLM repair models
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
-
Decomposition plus mutation refines LLM specs to verify real programs
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
-
Bounds on neural-net safety probability under random inputs
Probabilistic Verification of Neural Networks via Efficient Probabilistic Hull Generation
-
GPT dominates generative AI use in IT project management
A systematic review of generative AI usage for IT project management
-
Ambiguous requirements cut LLM code accuracy
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
-
IRAP turns vague specs into math functions with 40x gains
Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation
-
Modular checks push GUI agents past human performance on OSWorld
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
-
The authors adapted their prior mdok method for machine-generated text detection to…
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
-
Three LLM experts detect code vulnerabilities at 77% F1 for two cents
Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
-
SBOM mismatches produce inconsistent vulnerability reports
Hidden Dependencies and Component Variants in SBOM-Based Software Composition Analysis
-
Meta-predicates flag unsuitable evidence in clinical AI rules upfront
Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages
-
Execution feedback beats pipeline complexity for 1-3B code models
Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
-
Ground-truth dataset exposes differences in vulnerability detectors
A Ground-Truth-Based Evaluation of Vulnerability Detection Across Multiple Ecosystems
-
GPU runs 20,000 GWAS phenotypes in 20 minutes
TorchGWAS: GPU-accelerated GWAS for thousands of quantitative phenotypes
-
POMDP models hidden user states to auto-refine LLM prompts
Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
-
37% of AI governance prompts miss key structure
Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework
-
LLM gateways often swap models and misbill users
Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways
-
The paper examines residual security risks in patched code by measuring semantic and…
Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach
-
Value conflict tests show alignment faking in 7B models
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
-
LLM tool provides 24/7 feedback for software engineering students
Autonomous LLM-generated Feedback for Student Exercises in Introductory Software Engineering Courses
-
Coding agents retain just 44% of code in real commits
SWE-chat: Coding Agent Interactions From Real Users in the Wild
-
Serverless toolkit builds urban visual analytics prototypes in hours
Autark: A Serverless Toolkit for Prototyping Urban Visual Analytics Systems
-
High AUC does not ensure defect models beat random at all thresholds
Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading
-
QuanForge distinguishes QNN test suites and finds weak circuit regions
QuanForge: A Mutation Testing Framework for Quantum Neural Networks
-
GNNs spot LLM-written safety cases at F1 0.94
Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis
-
LLM regex masks raise log parsing accuracy to 97.6%
DeepParse: Hybrid Log Parsing with LLM-Synthesized Regex Masks
-
LLMs reach 88-89% accuracy on product line blueprint analysis
Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis
-
Hybrid detector finds 893k eliminable duplicate BDD steps
Reducing Maintenance Burden in Behaviour-Driven Development: A Paraphrase-Robust Duplicate-Step Detector with a 1.1M-Step Open Benchmark
-
Security commit messages remain largely uninformative
On the Informativeness of Security Commit Messages: A Large-scale Replication Study
-
Guardrails from requirements and models stabilize AI agents
Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings
-
RL trains 7B model to build websites rivaling 671B LLMs
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
-
Reasoning models forecast parallel code races without tool calls
Learning Reasoning World Models for Parallel Code
-
LLMs detect logging security issues at 13-52 percent accuracy
Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
-
Static tool catches invented symbols in LLM API migrations
Hallucination Inspector: A Fact-Checking Judge for API Migration
-
LLM agents confirm 84% of Node.js taint vulnerabilities
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
-
Dual tasks test if LLMs grasp code execution flow
The Path Not Taken: Duality in Reasoning about Program Execution
-
LLM absorbs long contexts into fixed parameters with causal sync
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
-
Joint optimizations cut multi-agent edge latency by 62 percent at 200 agents
A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing
-
Nine quantum-HPC stacks share design patterns for unifying layers
Quantum-HPC Software Stacks and the openQSE Reference Architecture: A Survey
-
Code snippets prove 20 percent more library calls executable
FIKA: Expanding Dependency Reachability with Executability Guarantees
-
Review charts automation routes for quantum software and AI
Automated Quantum Software and AI Engineering
-
Platform uses containers and supervised AI chat for reproducible biomedical workflows
Biomedical systems biology workflow orchestration and execution with PoSyMed
-
AI security PRs introduce recurring flaws but often merge
Insights into Security-Related AI-Generated Pull Requests
-
Vision models turn GUI bug videos into replays 72% of the time
ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models
-
LLM GUI code compiles but rarely plays without errors
PlayCoder: Making LLM-Generated GUI Code Playable
-
One open codebase trains vision-language-action models end-to-end
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
-
Predictive autoscaler holds Node.js latency at 26 ms in ramps
Predictive Autoscaling for Node.js on Kubernetes: Lower Latency, Right-Sized Capacity
-
Reflection and planning lift theorem proving 22% with fixed LLM calls
On Reasoning-Centric LLM-based Automated Theorem Proving
-
Fine-tuned LLMs raise XSS obfuscation match rate to 0.22
Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection