6 Pith papers cite this work.
citation summary
years: 2026 (6)
roles: method (2)
polarities: use method (2)
citing papers explorer
- FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
  FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
- Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
  False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
- ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship
  ComPASS creates tool-augmented LLM agents for substantive social support, releases the first personalized benchmark ComPASS-Bench, and fine-tunes ComPASS-Qwen to outperform its base model while matching larger LLMs.
- PrivacyAkinator: Articulating Key Privacy Design Decisions by Answering LLM-Generated Multiple-choice Questions
  PrivacyAkinator uses LLM-generated questions grounded in data-flow representations and a news-mined design space to help developers surface privacy decisions, yielding 47% more decisions identified in 73% less time than PRAM in a 24-person study.
- Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
  An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.
- Automated Classification of Human Code Review Comments with Large Language Models
  LLMs reach moderate macro-F1 scores of 0.36-0.37 when classifying code review comments into six smells and three useful intents, with one-shot examples helping some models on intent labels.