archive
Every paper Pith has read.
1797 papers in cs.SE · page 6
-
Fuzzer finds 15 vulnerabilities in LLM serving engines
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
-
Symbolic analysis quantifies fraction of inputs changed by patches
Quantitative Symbolic Patch Impact Analysis
-
DMI-Lib cuts LLM internal observability overhead to 0.4-6.8 percent
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
-
Code editor plugin logs student sessions for education datasets
Using Logs to support Programming Education
-
Git-like trace lets meta-agents fork past states 5x faster than Docker
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
-
Pipeline builds dataset of 347 real C++ performance patches
CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits
-
Benchmark shows CAD models miss fine details and complex operations
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
-
BenchCAD benchmark shows AI simplifies complex CAD designs
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
-
StartFlow helps non-experts build clearer startup prototypes
StartFlow: From Method Conception to Multi-Perspective Evaluation in UX Prototyping for Software Startups
-
LLM agents top out below 60% success in complex tool sandboxes
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
-
LLM agents top out below 60% on complex tool tasks
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
-
Unitaria composes quantum block encodings like NumPy arrays
Unitaria: Quantum Linear Algebra via Block Encodings
-
AutoSOUP generates unit proofs for component memory safety via LLM hybrid
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
-
AI leaves all problem-solving behaviors intact in code extension tasks
ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code
-
Masking bad steps inside failed runs lifts agent resolution 3.7 percent
Step Rejection Fine-Tuning: A Practical Distillation Recipe
-
Autoencoder context compression fails on multi-step coding agents
On Problems of Implicit Context Compression for Software Engineering Agents
-
New benchmark tests agents on cracking binaries from executables
CrackMeBench: Binary Reverse Engineering for Agents
-
LLARS unifies LLM prompt engineering
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
-
Logic prover turns G-code collisions into LLM correction signals
Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic
-
Neuro-symbolic loop fixes G-code via spatial proof failures
Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic
-
Neuro-symbolic system turns G-code collisions into bounding-box fixes
Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic
-
Separation logic catches CNC collisions as spatial data races
Separation Logic for Verifying Physical Collisions of CNC Programs
-
Separation logic verifies CNC collisions as spatial data races
Separation Logic for Verifying Physical Collisions of CNC Programs
-
VLMs automate robot task oracles from video
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
-
VISOR automates robot test oracles using vision-language models
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
-
DREAMS tool cuts time for DRM model creation and revision
DREAMS: Modelling Support for Research into Engineering and Artistic Design
-
Vision loop polishes LaTeX documents to publication standards
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
-
ReXCL tool automates requirements extraction and classification
Read, Extract, Classify: A Tool for Smarter Requirements Engineering
-
Margin-aware geometry reduces distortions in imbalanced vulnerability detection
MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection
-
Tiered AI agent framework adapts review to risk and separates duties
Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
-
Usability demands trick LLMs into insecure code
Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements
-
LLM agents discover 40 bugs in V8 JavaScript engine
Agentic Fuzzing: Opportunities and Challenges
-
Simulator models edge computing on optical networks
GenioSim: A Novel Simulation Platform for Edge Computing over Optical Networks
-
Config file structure has no effect on coding agent adherence
Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
-
Move prover checks first-class functions with state changes
Formal Verification of Imperative First-Class Functions in Move
-
Move Prover verifies first-class imperative functions
Formal Verification of Imperative First-Class Functions in Move
-
Hybrid analysis and AI infer Move specifications
Combining Mechanical and Agentic Specification Inference for Move
-
Tool pairs weakest-precondition analysis with AI to infer Move specs
Combining Mechanical and Agentic Specification Inference for Move
-
Iterative prompting beats single-pass for complex graph tasks
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
-
Benchmark finds LLM graph failures peak at multi-constraint tasks
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
-
LLMs recover from flawed partial reasoning only 29% of the time
TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications
-
Deterministic orchestration matches LLM accuracy with 3.5x lower costs
Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization
-
Cloning duplicates many agent tools in public marketplaces
Evaluating Tool Cloning in Agentic-AI Ecosystems
-
Cloning duplicates 60-85% of high-similarity tool pairs in agent ecosystems
Evaluating Tool Cloning in Agentic-AI Ecosystems
-
Shared contract makes agent benchmark gate change controller choice
An Executable Benchmarking Suite for Tool-Using Agents
-
GenAI turns software engineering from code writing to intent oversight
From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability
-
Trajectory context lifts tool accuracy from 39% to 57%
Trajectory Supervision for Continual Tool-Use Learning in LLMs
-
Pre-execution rubrics lift tool agents to 0.86 accuracy
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
-
Pre-execution rubrics lift tool-use success to 0.86 average
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
-
Pre-execution rubrics lift tool agent reliability to 0.86
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement