{"total":28,"items":[{"citing_arxiv_id":"2606.30246","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration","primary_cat":"cs.AI","submitted_at":"2026-06-29T12:56:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Clarus is a four-layer collaboration infrastructure with a project-agent-resource model that reformulates research as an open, traceable, multi-participant process.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30092","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts","primary_cat":"cs.AI","submitted_at":"2026-06-29T10:29:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"HRL-IM/CBS encodes battlefield states via influence map hashing and uses cluster-based scripts in a multi-Q-table hierarchy for StarCraft micromanagement, claiming competitive results with improved sample efficiency and interpretability over deep RL baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25584","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation","primary_cat":"cs.RO","submitted_at":"2026-05-25T08:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SwarmCF enables robots to achieve low error on unseen task pairs with per-robot sample complexity linear in rank d rather than task 
count n by running decentralized low-rank matrix completion on masked broadcast data in the zero-knowledge MRTA regime.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24516","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Punishment for Cooperation in Mixed-Motive Games","primary_cat":"cs.MA","submitted_at":"2026-05-23T11:01:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"APC adapts punishment via dynamic probability and a reward-guided defection awareness module to foster cooperation in iterated public goods games and sequential social dilemmas, outperforming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18024","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T08:14:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14892","ref_index":205,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying 
cross-stage challenges.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"mainly arising from prompt instructions rather than model heterogeneity. Similarly, Generative Agents [110] simulate human-like behaviours through multiple agent instances derived from the same base model, enabling complex social interactions and emergent narratives in a controlled, homogeneous environment. Frameworks like SelfCorrect-Agent [207] and Chateval [206] further explore homogeneous multi-agent setups for robust reasoning and evaluation, leveraging symmetric LLM-based agents to iteratively correct or debate responses. VillagerAgent [205] demonstrates that even in graph-structured task coordination, MAS can be instantiated from identical models, emphasizing interaction over agent heterogeneity. Building on these observations, homogeneous roles offer several practical advantages."},{"citing_arxiv_id":"2605.14235","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quantum Advantage in Multi Agent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-14T01:03:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entangled QMARL agents approach the Tsirelson bound of 0.854 in CHSH while unentangled versions match classical baselines, and hybrid quantum-classical setups outperform both in CoopNav.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13554","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation","primary_cat":"cs.LG","submitted_at":"2026-05-13T13:58:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CPPO is an on-policy 
contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11880","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T09:56:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"In this paper, we adopt a variational representation of f-divergences between a set of older trajectories and a set of more recently generated trajectories to estimate the density ratios. Theorem 1 [Nguyen et al., 2010]. Assume that f has first-order derivatives f′ at [0, +∞). ∀ P, Q ∈ P(X) such that P ≪ Q and ω: X → R+, D_f(P∥Q) ≥ E_P[f′(ω(x))] − E_Q[f*(f′(ω(x)))] (5), where f* denotes the convex conjugate and the equality is achieved when ω = dP/dQ. According to Theorem 1, the density ratio ω(s, a) := dπ(s, a)/dD(s, a) can be estimated by the samples from two sets of trajectories. 
One of the two sets of trajectories, d_D, can be sampled from the regular large replay buffer (off-policy buffer D_off) from original value-based MARL algorithms"},{"citing_arxiv_id":"2605.18809","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Metric-Gradient Projection for Stable Multi-Agent Policy Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:02:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HPML projects multi-agent update fields onto the closest metric-gradient potential flow via Hodge decomposition, yielding Lyapunov potentials and equilibrium-gap bounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08391","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-08T19:00:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SACHI enriches agent representations via graph transformer convolutions over inter-agent graphs to enable holistic information integration, outperforming baselines across five cooperative tasks with statistical significance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agent reinforcement learning (MARL) inherits this challenge in full generality: a team of agents must learn a joint policy under which each agent's action is individually executable yet collectively coherent, often from partial and local observations of a shared environment [2, 3]. 
The past decade has seen a remarkable influx of cooperative MARL algorithms, spanning value decomposition [4, 5], communication learning, and policy-gradient methods [6, 7], motivated by applications in autonomous driving, warehouse logistics, robotic manipulation, and network routing [8, 9, 10]. Yet across this landscape, a persistent difficulty remains: how should a joint policy be structured so that each agent's locally-computed action is compatible with the actions of teammates it cannot observe?"},{"citing_arxiv_id":"2605.06825","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Randomness is sometimes necessary for coordination","primary_cat":"cs.AI","submitted_at":"2026-05-07T18:27:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06557","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning","primary_cat":"cs.MA","submitted_at":"2026-05-07T16:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new controlled testbed and coordination diagnostics show that multi-agent RL methods achieving similar returns can differ substantially in redundant assignments, diversity, and 
efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05727","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing","primary_cat":"cs.DC","submitted_at":"2026-05-07T06:19:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00751","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search","primary_cat":"cs.LG","submitted_at":"2026-05-01T16:02:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NonZero introduces an interaction score and bandit-formalized proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmarks without full joint-action enumeration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22452","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents","primary_cat":"cs.AI","submitted_at":"2026-04-24T11:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow 
interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17191","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do LLM-derived graph priors improve multi-agent coordination?","primary_cat":"cs.LG","submitted_at":"2026-04-19T01:40:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-generated coordination graph priors improve multi-agent reinforcement learning performance on MPE benchmarks, with models as small as 1.5B parameters proving effective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03189","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reflective Context Learning: Studying the Optimization Primitives of Context Space","primary_cat":"cs.LG","submitted_at":"2026-04-03T17:05:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20078","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning","primary_cat":"cs.MA","submitted_at":"2026-02-23T17:45:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DG-PG augments policy gradients with descent signals from analytical models to reduce estimator 
variance from O(N) to O(1), preserve game equilibria, and achieve agent-independent sample complexity while converging on 1500-agent tasks where baselines fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21972","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic","primary_cat":"cs.AI","submitted_at":"2026-01-29T16:50:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.07122","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework","primary_cat":"cs.CR","submitted_at":"2026-01-12T01:25:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CyberOps-Bots is a hierarchical LLM-empowered multi-agent RL framework that reports 68.5% higher network availability and 34.7% better jumpstart performance in new scenarios without retraining on real cloud datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.17676","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GLo-MAPPO: Multi-Agent Deep Reinforcement Learning for Energy-Efficient UAV-Assisted LoRa Networks","primary_cat":"cs.NI","submitted_at":"2025-09-22T12:19:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLo-MAPPO applies centralized-training decentralized-execution MAPPO with a gain-based association scheme to jointly optimize LoRa parameters and UAV paths, 
yielding higher weighted energy efficiency than prior MARL baselines in simulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.15519","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fully Decentralized Cooperative Multi-Agent Reinforcement Learning is A Context Modeling Problem","primary_cat":"cs.LG","submitted_at":"2025-09-19T01:52:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DAC models fully decentralized cooperative MARL as a context modeling problem, using latent variables for joint policies to fix non-stationarity in value updates and relative overgeneralization in value estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.01049","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies","primary_cat":"cs.LG","submitted_at":"2025-08-01T20:07:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.13388","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reflection of Episodes: Learning to Play Game from Expert and Self Experiences","primary_cat":"cs.AI","submitted_at":"2025-02-19T02:53:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ROE framework lets LLM defeat Very 
Hard bot in TextStarCraft II via keyframe selection, expert/self-experience decisions, and post-game reflection for new self-experience.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.03506","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimistic {\\epsilon}-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning","primary_cat":"cs.MA","submitted_at":"2025-02-05T12:06:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Optimistic ε-Greedy Exploration adds decoupled optimistic networks that converge in probability to maximum returns and samples from them with probability ε to increase optimal joint-action frequency in CTDE MARL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02844","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-02-05T02:59:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.12266","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Growing Action Spaces","primary_cat":"cs.LG","submitted_at":"2019-06-28T15:35:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A curriculum of growing action spaces combined with simultaneous off-policy value estimation 
accelerates learning in large multi-agent action spaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}