The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
hub
Wang, and Sadid Hasan
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
A two-stage AST-based crossover and mutation operator with LLM repair expands the search space in LLM-driven heuristic evolution and improves performance on TSP and online bin packing.
Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP VLMs, and LLMs.
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.
EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming baselines on four datasets with linear indexing cost and zero token overhead.
COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.
Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.
MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.
LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results when users over-permission.
Generative AI systems arise from statistical data processing that produces human-like outputs, creating a mismatch with traditional computer expectations and positioning educational researchers to lead in studying and applying them.
citing papers explorer
-
Empirical Bayes Conformal Prediction for Vision and Language Models
Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP VLMs, and LLMs.
-
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.