{"total":14,"items":[{"citing_arxiv_id":"2605.22608","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21347","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-20T16:13:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14892","ref_index":28,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"message, or an incorrect tool invocation can propagate through successive interaction rounds, triggering cascading failures that obscure the original root cause [20,27]. As execution trajectories grow longer and inter-agent dependencies deepen, the causal chain linking a failure to its downstream consequences becomes increasingly opaque, rendering manual identification of the responsible agent and the decisive step both inefficient and unscalable [28,29,30]. Meanwhile, even when root causes are successfully identified, existing multi-agent systems largely lack the ability to translate diagnostic insights into structural adaptation, whether by reorganizing coordination topologies, revising role assignments, or refining collaboration policies in light of observed failure patterns [31,32]. These two deficits in"},{"citing_arxiv_id":"2605.14865","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Holistic Evaluation and Failure Diagnosis of AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:12:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12925","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation","primary_cat":"cs.SE","submitted_at":"2026-05-13T03:00:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Quality tier labels - - - -✓ Curation support Outcome Outcome - -✓ No LLM calls needed - - -✓ ✓ Table 5: Framework comparison. AGENTLENSis the only system that combines SWE-specific trajectory analysis, multi- dimensional scoring, validated phase labels, and structured inefficiency attribution. Domain Granularity Multi-dim. Inefficiency Labelκ AgentBoard Multi (9) Sub-goal✓- - Graphectory SWE Trace✓- - MAST Multi-agent Trace✓- 0.88 TRAIL Multi (3) Step✓Partial - Web-Shepherd Web Step scalar - - SWE-RM SWE Trajectory scalar - - ABC SWE Task checklist - - AGENTLENSSWE Trajectory✓ ✓0.933 states from different agents are recognized as covering the same ground-truth action during PTA merging. The intent-stage labeling flow (B."},{"citing_arxiv_id":"2605.11225","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement","primary_cat":"cs.AI","submitted_at":"2026-05-11T20:43:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, et al. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025. [3] J. Chen, H. Li, J. Yang, Y . Liu, and Q. Ai. Enhancing llm-based agents via global planning and hierarchical execution.arXiv preprint arXiv:2504.16563, 2025. [4] D. Deshpande, V . Gangal, H. Mehta, J. Krishnan, A. Kannappan, and R. Qian. Trail: Trace reasoning and agentic issue localization.arXiv preprint arXiv:2505.08638, 2025. [5] Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, N. Duan, and C. W. Critic: Large language models can self-correct with tool-interactive critiquing. InInternational Conference on Learning Representations"},{"citing_arxiv_id":"2604.23455","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend","primary_cat":"cs.SE","submitted_at":"2026-04-25T22:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18240","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation","primary_cat":"cs.AI","submitted_at":"2026-04-20T13:23:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17699","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents","primary_cat":"cs.SE","submitted_at":"2026-04-20T01:28:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"since fewer developers are involved in the field. Prior research has established various taxonomies for software bugs, ranging from general software systems [19] to specific deep learning artifacts and attention-based mechanisms [37, 43]. While recent studies have begun to evaluate the performance of agents on coding benchmarks and trace their autonomous workflows [23, 76], these evaluations primarily focus on agent effectiveness rather than agent correctness. To date, no empirical study has systematically investigated fix patterns that can be used to fix bugs within agen- tic systems. Furthermore, although several frameworks have been proposed that employ agent-based architectures for automated bug fixing in conventional software systems [ 76] [81] [71], these ap-"},{"citing_arxiv_id":"2604.17658","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Self-Improving Error Diagnosis in Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-19T23:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22819","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A pragmatic approach to regulating AI agents","primary_cat":"cs.CY","submitted_at":"2026-04-16T13:04:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.15232","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling","primary_cat":"cs.SE","submitted_at":"2026-01-21T18:13:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02393","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Process-Centric Analysis of Agentic Software Systems","primary_cat":"cs.SE","submitted_at":"2025-12-02T04:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.13334","ref_index":220,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Context Engineering for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-07-17T17:50:36+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Components (§4) Context Generation Retrieval & (§4.1) e.g.,Chain-of-Thought [1147], Zero-shot CoT [559], ToT [1255], GoT [69], Self-consistency [1123], ReAct [1254], Auto-CoT [1107], Automatic Prompt [311] , CLEAR Framework [708], RAG [597], Cognitive Prompting [564], KAPING [48], Dynamic Assembly [311],etc. Context Processing (§4.2) e.g.,Mamba [1267], LongNet [220], FlashAttention [200], Ring Attention [682], YaRN [839], Infini-attention [798], StreamingLLM [1185], InfLLM [1184], Self-Refine [741], Reflexion [964], StructGPT [495], GraphFormers [1230], KG Integration [1330], Long CoT [148], MLLMs [49],etc. Context Management (§4.3) e.g.,Context Compression [321], StreamingLLM [1185], KV Cache Management [1399],"}],"limit":50,"offset":0}