{"total":13,"items":[{"citing_arxiv_id":"2606.31976","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TreeAgent: A Generalizable Multi-Agent Framework for Automated Bias Labeling in Forestry via Compiled Expert Rules and Vision-Language Models","primary_cat":"cs.AI","submitted_at":"2026-06-30T17:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TreeAgent uses a Decoupled Declarative Decision (D3) Framework to orchestrate expert rules and VLMs for tree bias classification, outperforming supervised ML baselines with reduced expert labeling effort.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29654","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds","primary_cat":"cs.AI","submitted_at":"2026-06-28T23:46:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A kNN lower-confidence-bound approach for act-or-defer decisions in multi-agent LLM debates respects user-declared wrong-action budgets while achieving high automation rates on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29425","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-06-28T14:40:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixture of Debaters uses MoE to enable dynamic self-debate inside one model, claiming better accuracy than multi-agent systems at 3.7x lower latency and 87% fewer tokens on 
multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27409","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement","primary_cat":"cs.MA","submitted_at":"2026-06-25T10:52:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01667","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATLAS: Agentic Test-time Learning-to-Allocate Scaling","primary_cat":"cs.LG","submitted_at":"2026-06-01T04:19:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24755","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language 
Models","primary_cat":"cs.AI","submitted_at":"2026-05-23T22:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent LLM system with majority voting achieves reported Micro F1 of 0.872 for delusion detection and 0.779 for classification on naturalistic speech transcripts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08478","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Independent Sampling Outperforms Agentic Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":", 2024] and identify recurring failure modes, including unproductive iterative loops, ineffective debugging strategies, and inability to find the key algorithm. Analogous inefficiencies have been observed in prior studies of multi-agent debate and collaboration, where much of the observed performance gain can be attributed to aggregation of independent trials rather than multi-agent interaction [Choi et al., 2025, Li et al., 2024, Liang et al., 2023]. Indeed, independent k-shot attempts naturally emphasize exploration, allowing rare but correct solution paths to be discovered early at relatively low cost. To place these observations in a broader context, we provide a theoretical discussion of how to allocate resources to success-or-fail solvers under a fixed budget. 
We model this problem as an integer program, solve"},{"citing_arxiv_id":"2605.01347","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2026-05-02T09:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[7] Qianglong Chen, Feng Ji, Feng-Lin Li, Guohai Xu, Ming Yan, Ji Zhang, and Yin Zhang. AMTSS: An adaptive multi-teacher single-student knowledge distillation framework for multilingual language inference. arXiv preprint arXiv:2305.07928, 2023. [8] Yiqun Chen et al. Improving retrieval-augmented generation through multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2025. [9] Hyeong Kyu Choi et al. Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025. [10] DeepSeek-AI. DeepSeek-V4 technical report. Technical report, DeepSeek, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. [11] Yilun Du et al. 
Improving factuality and reasoning in language models through multiagent"},{"citing_arxiv_id":"2604.15972","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weak-Link Optimization for Multi-Agent Reasoning and Collaboration","primary_cat":"cs.AI","submitted_at":"2026-04-17T11:36:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"weak agents, compromise the overall reliability of the system by inducing inaccurate reasoning, unreliable decisions, and error-prone outputs. Conventional design paradigms, which emphasize stronger reasoning agents or incorporate simple consensus mechanisms such as voting [18] and debate [19], remain susceptible to instability and exhibit high performance variability despite their effectiveness [20]. This fragility manifests specifically as: 1) Error accumulation across reasoning stages: In task decomposition, outputs of preceding agents serve as inputs for subsequent ones. Low-accuracy or miscalibrated outputs from any agent may propagate errors downstream, amplifying their impact. 
2) Consensus degradation under heterogeneous agent"},{"citing_arxiv_id":"2604.07667","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation","primary_cat":"cs.AI","submitted_at":"2026-04-09T00:15:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Conformal Social Choice aggregates verbalized probabilities from LLM debates via linear opinion pooling and uses split conformal prediction to generate prediction sets that guarantee inclusion of the correct answer with probability at least 1-alpha, enabling adjustable safe act-or-escalate decisions","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07007","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power","primary_cat":"cs.MA","submitted_at":"2026-04-08T12:28:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentCity introduces a Separation of Power constitutional architecture on blockchain for governing autonomous agent economies through agent legislation, automated execution, and human accountability.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Management Science, 65(11), 5171-5187. [10] Chen, X. et al. (2026). Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems: A Blockchain-Driven Approach. IEEE Transactions on Network Science and Engineering. arXiv:2509.16736. [11] Chitra, T., & Kulkarni, K. (2022). Improving Proof of Stake Economic Security via MEV Redistribution. arXiv. [12] Choi, H. K., Zhu, X., & Li, S. (2025). 
Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? NeurIPS 2025 Spotlight. arXiv:2508.17536. [13] Christoffersen, P. J. K., Haupt, A., & Hadfield-Menell, D. (2023). Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL. Proc. AAMAS. [14] CMAG Authors. (2025)."},{"citing_arxiv_id":"2604.02863","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EMS: Multi-Agent Voting via Efficient Majority-then-Stopping","primary_cat":"cs.AI","submitted_at":"2026-04-03T08:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent work demonstrates that majority voting accounts for the vast majority of performance gains typically attributed to complex multi-agent debate, suggesting that expensive inter-agent communication rounds are often unnecessary [15, 22]. Furthermore, standard voting protocols are particularly optimal for reasoning-based tasks, significantly outperforming other decision-making structures [23]. Despite these advances, most existing approaches follow a reasoning-first-aggregation-later paradigm, which introduces significant computational waste as the final decision is often reachable before all agents complete their reasoning. Our work addresses this gap by formulating majority voting as a reliability-aware agent scheduling problem. 
3 Methodology"},{"citing_arxiv_id":"2601.22297","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2026-01-29T20:21:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}