{"total":15,"items":[{"citing_arxiv_id":"2606.11686","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness","primary_cat":"cs.CL","submitted_at":"2026-06-10T05:55:13+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Layer-isolated evaluation decomposes LLM agents into per-layer deterministic no-LLM test slices whose locked baselines localize regressions that aggregate pass rates mask.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10315","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents","primary_cat":"cs.CL","submitted_at":"2026-06-09T02:11:01+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical study of a production multi-turn ordering agent finds LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05670","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do More Agents Help? 
Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows","primary_cat":"cs.AI","submitted_at":"2026-06-04T03:50:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Under controlled identical protocols, only one of six multi-agent LLM systems marginally exceeds a single-agent baseline on benchmark-balanced accuracy while the rest trail and cost more; a runtime workflow reaches 66.72% on GAIA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01961","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoMedBench: Towards Medical AutoResearch with Agentic AI Models","primary_cat":"cs.AI","submitted_at":"2026-06-01T09:22:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20530","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentAtlas: Beyond Outcome Leaderboards for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T22:05:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentAtlas introduces a diagnostic taxonomy and audit protocol to evaluate LLM agent control decisions and trajectories beyond final outcome success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19743","ref_index":36,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EngiAI: A Multi-Agent Framework and Benchmark Suite for 
LLM-Driven Engineering Design","primary_cat":"cs.AI","submitted_at":"2026-05-19T12:12:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EngiAI introduces a LangGraph-based multi-agent framework and a three-part benchmark suite for LLM-driven engineering design, reporting high task completion rates for proprietary models on Beams2D and Photonics2D problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14865","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Holistic Evaluation and Failure Diagnosis of AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:12:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07054","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sell More, Play Less: Benchmarking LLM Realistic Selling Skill","primary_cat":"cs.CL","submitted_at":"2026-04-08T13:06:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09514","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EcoGym: Evaluating LLMs for 
Long-Horizon Plan-and-Execute in Interactive Economies","primary_cat":"cs.CL","submitted_at":"2026-02-10T08:12:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13564","ref_index":278,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Memory in the Age of AI Agents","primary_cat":"cs.CL","submitted_at":"2025-12-15T17:22:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20857","ref_index":154,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","primary_cat":"cs.CL","submitted_at":"2025-11-25T21:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04565","ref_index":112,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From 
Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", knowledge graphs) to locate relevant information by traversing nodes and edges [6, 38, 46]. Hybrid Retriever combines both sparse and dense retrieval approaches to leverage the strengths of explicit term matching and semantic similarity [6, 50, 105]. LLM as Retriever involves the use of LLMs to directly retrieve relevant knowledge based on input queries [112]. 3.3 Generator The Generator in RAG systems is essentially an LLM. It can be an original pre-trained language model, such as T5 [136], FLAN [185] and LLaMA [174], or a black-box pre-trained language model, such as GPT-3 [14], GPT-4 [2], Gemini [169], Claude [24]. 
Alternatively, the generator can also be a fine-tuned language model specifically tailored for a particular"},{"citing_arxiv_id":"2501.09686","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-01-16T17:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"Figure 4: Typical training-free test-time enhancing methods: verbal reinforcement search, memory-based reinforcement, and agentic system search. Table 3: A list of representative works of training-free test-time reinforcing. Method Category Representative literature Verbal Reinforcement Search Individual Agent Romera et al.[115], Shojaee et al.[130], Mysocki et al.[162],Ma et al.[88] Multi-Agent Chen et al.[20],Zhou et al.[199], Le et al.[69] ,Yu et al.[176] Embodied Agent Boiko et al.[13] Memory-based Reinforcement Experiential Learning Zhang et al.[184],Gao et al. 
[39] Qian et al.[108] Reflective Learning Shinn et al.[129],Sun et al.[138], Sun et al.[190] Concept Learning Zhang et al.[188],Gao et al.[40] , Guan et al.[44] Agentic System Search"},{"citing_arxiv_id":"2410.23218","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","primary_cat":"cs.CL","submitted_at":"2024-10-30T17:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[32] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. [33] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024. 
[34] Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178, 2024. [35] Meta AI. Introducing meta Llama 3: The most capable openly available LLM to date, April 2024. URL https://ai.meta.com/blog/meta-llama-3/."}],"limit":50,"offset":0}