{"total":120,"items":[{"citing_arxiv_id":"2606.10106","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04602","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parthenon Law: A Self-Evolving Legal-Agent Framework","primary_cat":"cs.AI","submitted_at":"2026-06-03T08:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parthenon is a self-evolving legal-agent framework that factors components for traceability and uses an anti-leakage learning loop to improve from scored failures on legal matters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31097","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition","primary_cat":"cs.DB","submitted_at":"2026-05-29T10:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecDB generates a 23,779-line Rust database via LLM subagents that matches PostgreSQL and MySQL tpmC on TPC-C while using roughly 3% of their code 
size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23108","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study","primary_cat":"cs.SE","submitted_at":"2026-05-21T23:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An empirical evaluation of philosophical dispositions constraining AI code review on 50 PRs shows 46% human convergence, 75% unique findings, zero author-judged false positives, and 51% findings absent from generic prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22343","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators","primary_cat":"cs.MA","submitted_at":"2026-05-21T11:29:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion 
units that link trial signals to updated research behaviors and harness repairs in autonomous systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20425","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-19T19:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20173","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T17:54:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19140","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-18T21:48:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Formalizes 
interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19099","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-18T20:37:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17292","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation","primary_cat":"cs.AI","submitted_at":"2026-05-17T07:12:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MetaCogAgent equips multi-agent LLMs with metacognitive self-assessment, adaptive delegation, and capability learning to reach 82.4% accuracy on a 700-task benchmark while using fewer API calls than baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17159","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MADP: A Multi-Agent Pipeline for Sustainable Document Processing with 
Human-in-the-Loop","primary_cat":"cs.AI","submitted_at":"2026-05-16T21:18:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16508","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Scaling Laws of Skills in LLM Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-15T18:05:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15556","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TopoClaw: A Human-Centric and Topology-Aware Agent Operating System","primary_cat":"cs.HC","submitted_at":"2026-05-15T02:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware 
governance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12376","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-12T16:42:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Second, broader frameworks often prioritize other data science tasks-such as automated machine learning (AutoML) [36, 45] or table-based question answering [33, 39]-over the full spectrum of table processing. Third, even those systems explicitly targeting table processing either rely on generic multi-agent scaffolds, which lack table-specific data understanding [8, 27], or employ hand-crafted pipelines that do not emphasize adaptive, feedback-driven profiling [17, 26]. This limitation is evident in current tools. For example, CleanAgent performs column-type annotation but does not actively sample or inspect the actual cell values during code generation [26]. As a result, when given an ambiguous instruction such as \"standardize the currency column, \" it has no knowledge of the concrete values present in the column"},{"citing_arxiv_id":"2605.12280","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt-Engineering Quality Assurance","primary_cat":"cs.SE","submitted_at":"2026-05-12T15:39:04+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rather than benchmark performance. 
MetaGPT [17] operationalizes Standard Operating Procedures for multi-agent collaboration; AEGIS's seven-lane structure is a domain-specialized instance of that pattern. Naqvi, Baqar, and Mohammad [18] contrast iterative closed-loop multi-agent testing against static single-shot test generation. 1.4.5. Companion Work The companion preprint [19] addresses the runtime behavior of the same AEGIS pipeline, including STRIDE-based adversarial testing and FMEA-based safety analysis. The 51 specification defects reported here are distinct from the 51 STRIDE-categorized adversarial code findings reported in the companion: those are P0-P3 priority code vulnerabilities (lock handling, race conditions, subprocess"},{"citing_arxiv_id":"2605.12087","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems","primary_cat":"cs.AI","submitted_at":"2026-05-12T13:09:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"serves typed intermediate artifacts, allows explicit supersession, and keeps downstream dependencies addressable. 3 Background and Related Work 3.1 Agent Loops, Tool Use, and Intermediate Reasoning Agentic LLM systems commonly interleave reasoning, tool use, and revision inside iterative loops. ReAct [1] remains the canonical example of this paradigm, while Toolformer [2], AutoGen [6], CAMEL [3], Voyager [4], and MetaGPT [5] explore richer tool ecosystems, role structures, and multi-agent coordination. 
These systems demonstrate that substantial capability gains can be achieved by placing models inside structured harnesses rather than treating them as isolated text generators. Work on chain-of-thought, scratchpads, self-refinement, and search further shows that intermediate reasoning can"},{"citing_arxiv_id":"2605.11136","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales","primary_cat":"cs.AI","submitted_at":"2026-05-11T18:42:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber 1 Introduction Large Language Models (LLMs) [21] excel at reasoning [35], coding, and recall. Multi-agent systems (MAS) built on LLMs assign roles and communication patterns across multiple LLM instances [11, 25, 15, 19, 36]. Deployed over continual task streams, such systems should improve with experience: breakthroughs should inform later tasks, and recurring task types should be routed to the best-suited agents. However, evolving a multi-agent system is fundamentally different from evolving a single agent N times in parallel. 
A single-agent learner, such as Reflexion [28] or ExpeL [43], evolves only one"},{"citing_arxiv_id":"2605.10913","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces","primary_cat":"cs.AI","submitted_at":"2026-05-11T17:50:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As LLM-based agentic systems mature, we see an increasing prevalence of agents that act on other agents at runtime, across several recent lines of work. Asawa et al. [3], Lin et al. [23] develop advisor agents that learn intervention policies from execution traces; pipeline optimizers such as GEPA and MetaHarness edit agent workflows [2, 19]; and Hou et al. [10], Ji et al. [12] build tree-search RL that branches rollouts to extract per-step credit. We call these systems meta-agents: higher-order agents that operate over other agents and their execution traces. Meta-agents are becoming increasingly central to extracting capability from agents [45]. Table 1: Substrate capabilities. 
/ / = supported / unsupported / partial as a single in-process operation."},{"citing_arxiv_id":"2605.10528","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics","primary_cat":"cond-mat.stat-mech","submitted_at":"2026-05-11T13:13:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ditional exploratory runs, the actual total is somewhat higher, probably closer to 3×10^9. This computational cost is the primary reason for restricting finite-size scaling to roughly half a decade in L. [1] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, Large language model based multi-agents: A survey of progress and challenges (2024), arXiv:2402.01680 [cs.CL]. [2] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, MetaGPT: Meta programming for a multi-agent collaborative framework (2024), arXiv:2308.00352 [cs.AI]. [3] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. 
Mordatch, Improving factuality and reasoning in"},{"citing_arxiv_id":"2605.10516","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"specific benchmarks across software engineering [9], planning [30, 31], function calling [16], and system administration [14]. These benchmarks primarily measure task completion rates and accuracy, with some incorporating trajectory comparison metrics [13] or pass@k requirements [34]. Among tools, Harbor [6] provides infrastructure for containerized agent execution with verifiable task environments, and MetaGPT [8] and ChatDev [17] enable capability evaluation in multi-agent collaboration architectures. Recent work in measurement science has proposed sequential hypothesis testing for trajectory quality [24] but has not addressed consistency. 
Research on consistency and robustness in LLMs has focused primarily on single-turn output, with multiple open challenges including the coverage of agentic scenarios [15]."},{"citing_arxiv_id":"2605.10500","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillEvolver: Skill Learning as a Meta-Skill","primary_cat":"cs.AI","submitted_at":"2026-05-11T12:58:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10440","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents","primary_cat":"cs.CY","submitted_at":"2026-05-11T12:11:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"planner task success against ground-truth itineraries or reward models. They do not model a commission channel, do not define a welfare rule that governance parameters can act on, and do not pair a commission message against a counterfactual. 
Multi-agent simulations. Generative Agents [10] initiated a line of multi-agent social simulation extended by AutoGen [30], CAMEL [31], MetaGPT [32], AgentVerse [33], SOTOPIA [34], GovSim [35], and Stakeholders [36]; economically grounded variants include EconAgent [37], CompeteAI [38], RecAgent [39], Turing Experiments [40], and Homo Silicus [11]. Task-competence benchmarks AgentBench [41] and AgentBoard [42] measure within-task progress for a single agent. None of these parametrizes a welfare rule and sweeps gov-"},{"citing_arxiv_id":"2605.10286","ref_index":94,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks","primary_cat":"cs.AI","submitted_at":"2026-05-11T09:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10223","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution","primary_cat":"cs.AI","submitted_at":"2026-05-11T09:03:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The Dynamic Tiered AgentRunner framework uses risk-adaptive tiering, separation of powers across agents, and verifier-recovery loops to enable governable and resilient enterprise AI execution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and recover from agent misbehavior independent of the underlying LLM's alignment. 
2.1 Soft Constraints Are Not Governance AutoGen [3] provides flexible multi-agent conversation patterns with customizable termination conditions. CrewAI [5] formalizes role-based task decomposition with configurable delegation. LangGraph [6] enables graph-based agent orchestration with conditional routing. MetaGPT [4] and ChatDev [7] demonstrate impressive role-playing for software engineering. These are excellent orchestration tools. But they operate under a critical assumption: agents are well-behaved by construction. Their \"constraints\" are prompt-level instructions (\"You are a careful reviewer...\") or conversation-level patterns (\"Agent B reviews Agent A's output\")."},{"citing_arxiv_id":"2605.09703","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-10T18:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-Agent Frameworks: Structure or Scale In multi-agent architectures, role specialization is more effective than scaling. Nevertheless, how to classify the roles of the agents and how the agents interact still require further exploration. CAMEL [19] shows that assigning different dialogue roles can achieve collaborative problem-solving beyond single-agent benchmarks, while MetaGPT [25] further structures agent communication through standardized operational procedures. 
These frameworks demonstrate that collaboration is helpful, but their structure stems from heuristics of task decomposition rather than domain theory. Increasing the number of agents does not solve this limitation. Qian et al. discovered that performance follows a rea-"},{"citing_arxiv_id":"2605.08904","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08831","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System","primary_cat":"cs.RO","submitted_at":"2026-05-09T09:36:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"often necessitates repeated task decomposition, process design, and resource configuration conducted by multiple domain experts. This process is inherently time-consuming and labor-intensive, thereby limiting the responsiveness to multi-variety and small-batch manufacturing demands. 
Large Language Model (LLM) based agent systems in diverse cognitive tasks [2, 3], we investigate a question whether a similar multi-agent paradigm can be applied to task planning in flexible assembly. Specifically, we address the following question: Can a multi-agent based task planning framework enable autonomous understanding, reasoning, and coordination in assembly processes? The realization of such a system could significantly mitigate human intervention, expedite production line"},{"citing_arxiv_id":"2605.07358","ref_index":6,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","primary_cat":"cs.IR","submitted_at":"2026-05-08T07:10:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Large language model (LLM)-based agents are emerging as a powerful paradigm for automating complex tasks. Fundamentally, an LLM-based agent is an autonomous system that leverages an LLM as its cognitive engine to perceive its environment, interpret task context, reason over abstract goals, and execute actions through planning, tool use, memory retrieval, and structured interaction [1]-[6]. Recent pioneering systems, such as OpenClaw [7] Manus [8], and Claude Code [9], vividly exemplify this paradigm, marking a broader transition in intelligent systems from passive response generation to proactive, action-oriented task execution. 
As LLM-based agents are deployed in a growing range of scenarios and entrusted with increasingly complex tasks, tool"},{"citing_arxiv_id":"2605.06365","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:39:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work 2.2 Agentic LLM Systems ReAct [9] introduced interleaved reasoning and action. Toolformer [10] trains models to use tools via self-supervision. A parallel line of work studies agent frameworks and multi-agent organizations, including CAMEL [11], AutoGen [14], MetaGPT [13], and Voyager [12]. These systems show that substantial capability gains can be achieved by placing the model inside richer interaction loops, role structures, and tool environments. These approaches share a control-loop architecture: s_{t+1} = f(s_t, LLM, tools) (1) where state is implicit. 
The strength of this family of systems is adaptability: loops can choose tools, revise prompts, and redirect work at"},{"citing_arxiv_id":"2605.03195","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?","primary_cat":"cs.AI","submitted_at":"2026-05-04T22:24:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tures, small language models for agentic workloads, RL for multi-turn LLM-based agents and LLM-judge / rubric-based reward design. We discuss each below. 2.1 Multi-Agent and Subagent Architectures Decomposing complex tasks across multiple agents has been studied extensively. AutoGen [11] provides a flexible framework for agent to agent conversation. Works like MetaGPT [12], ChatDev [13], explore role-based collaboration between multiple agents. He et al. [14] systematically reviewed the landscape of LLM-based multi-agent systems for software engineering highlighting the current capabilities and limitations of these approaches. 
Anthropic's multi-agent research [15] system adopts the orchestrator-worker pattern, where a lead"},{"citing_arxiv_id":"2605.00382","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Social Bias in LLM-Generated Code: Benchmark and Mitigation","primary_cat":"cs.SE","submitted_at":"2026-05-01T04:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27647","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tail-aware N-version Machine Learning Models for Reliable API Recommendation","primary_cat":"cs.SE","submitted_at":"2026-04-30T09:42:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27209","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves","primary_cat":"cs.SE","submitted_at":"2026-04-29T21:28:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching 
F1=0.768 versus a 0.364 baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26590","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development","primary_cat":"cs.SE","submitted_at":"2026-04-29T12:15:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-case study plus survey produces seven actionable recommendations for efficient and responsible LLM use in industrial software engineering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26523","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates","primary_cat":"cs.SE","submitted_at":"2026-04-29T10:43:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"work applies knowledge graphs to the full documentation lifecycle, particularly for both generation and maintenance. RepoDoc fills this gap by leveraging RepoKG as the semantic backbone to precisely identify, organize, and update affected documentation components throughout the lifecycle. Benchmarks for Code Documentation Generation. Existing benchmarks like CodeSearchNet [15] and CodeXGLUE [19] focus on function-level metrics. 
Recent work like CodeWikiBench evaluates repository-level documentation but lacks assessment of incremental updates. Other recent evaluations, like Long Code Arena [4], address large-scale repositories but focus on input length and scalability challenges rather than documentation lifecycle management."},{"citing_arxiv_id":"2605.00034","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets","primary_cat":"cs.CR","submitted_at":"2026-04-28T01:27:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24110","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latency and Cost of Multi-Agent Intelligent Tutoring at Scale","primary_cat":"cs.CY","submitted_at":"2026-04-27T07:07:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Priority PayGo keeps multi-agent tutoring responses under 4 seconds even at 50 concurrent users, while costs stay below textbook prices per student.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23940","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Constraint-Guided Multi-Agent Decompilation for Executable Binary 
Recovery","primary_cat":"cs.SE","submitted_at":"2026-04-27T01:28:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23897","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MarketBench: Evaluating AI Agents as Market Participants","primary_cat":"cs.AI","submitted_at":"2026-04-26T21:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from added context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23853","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation","primary_cat":"cs.AI","submitted_at":"2026-04-26T19:44:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23781","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClawMark: A Living-World Benchmark 
for Multi-Turn, Multi-Day, Multimodal Coworker Agents","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23579","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration","primary_cat":"cs.MM","submitted_at":"2026-04-26T07:34:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":107,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agent, and this section argues that defense at L5 cannot be 
reduced to per-agent hardening. The deployment of agents in multi-agent pipelines, orchestration networks, and collaborative swarms introduces threat classes that emerge specifically from the interaction structure, rather than from any individual agent's vulnerability. Production multi-agent frameworks include AutoGen [106], MetaGPT [107], ChatDev [108], CAMEL [109], and TaskWeaver [110], each organizing inter-agent communication differently and thereby presenting distinct attack surfaces at the coordination layer. Generative agent simulations [111] show that emergent social behaviors arise when many agents interact over long horizons, and adversarial influence can propagate through such networks without generating explicit"},{"citing_arxiv_id":"2604.23088","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code Broker: A Multi-Agent System for Automated Code Quality Assessment","primary_cat":"cs.SE","submitted_at":"2026-04-25T00:53:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Code Broker deploys a five-agent hierarchy that combines LLM semantic analysis with static linting to generate actionable Python code quality reports.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21282","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection","primary_cat":"cs.CR","submitted_at":"2026-04-23T04:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet 
suite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20801","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Synthesizing Multi-Agent Harnesses for Vulnerability Discovery","primary_cat":"cs.CR","submitted_at":"2026-04-22T17:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"context, the model exhibits lost-in-the-middle effects [27]: it drops earlier analysis, repeats work it already did, and abandons long-horizon strategies. Third, a single trace cannot explore multiple hypotheses in parallel; it must commit to one strategy at a time, and a dead-end wastes the entire budget. These limitations are well documented in the agent-systems literature [20, 21, 39] and are the reason that current state-of-the-art systems do not rely on a single agent. 
Production systems instead split the work across specialized agents, each with its own prompt, LLM model (the underlying language model assigned to that role), and tools, much like a small security team: an analyst extracts the preconditions a valid input"},{"citing_arxiv_id":"2604.20398","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-22T10:04:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sites [27], abstracting away the intricacies of modern web applications, such as dynamic routing, state management, authentication flows, and cross-page navigation in modern web applications. Conversely, multi-agent orchestration frameworks attempt to decompose the task by assigning different specialized sub-agents to implement discrete subtasks, such as UI layout, backend logic, and testing, and then integrating their outputs [13, 11, 12, 28]. However, such modularity introduces brittle inter-agent dependency chains, where small inconsistencies in contracts, file names, or interface definitions can cascade into non-functional builds. 
Although this issue can be mitigated through multi-turn execution with refined feedback, doing so results in substantial token costs and high"},{"citing_arxiv_id":"2604.20273","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks","primary_cat":"cs.AI","submitted_at":"2026-04-22T07:20:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the items current models most frequently miss [16]; our empirically-hardest-100 construction (section 4.3) adopts this idea. 2.3 LLM-as-Judge Evaluation The LLM-as-judge paradigm was popularized by Zheng et al. [25] (MT-Bench and Chatbot Arena), who demonstrated that capable LLMs, particularly GPT-4, can approximate human preference scoring on open-ended tasks at above 80% agreement. Kim et al. [13] trained Prometheus, an open-source 13-billion-parameter fine-tuned evaluator LLM, to produce fine-grained rubric-grounded verdicts with reference answers. Critical follow-up work documented systematic biases: Wang et al. 
[21] show that LLM judges exhibit pronounced positional bias in the sense that simply reordering candidate responses in the prompt can flip the quality ranking."},{"citing_arxiv_id":"2604.20261","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data","primary_cat":"cs.AI","submitted_at":"2026-04-22T07:09:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}