{"paper":{"title":"The Horizon Threshold in Cooperative Multi-Agent Reward-Free Exploration","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Setting learning phases equal to the horizon H allows polynomial agents to approximate MDP dynamics in multi-agent reward-free exploration.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Idan Barnea, Orin Levy, Yishay Mansour","submitted_at":"2026-02-01T21:44:11Z","abstract_excerpt":"We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"When the number of learning phases equals H, we present a computationally efficient algorithm that uses only Õ(S^6 H^6 A / ε²) agents to obtain an ε approximation of the dynamics (i.e., yields an ε-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to ρ < H phases requires at least A^{H/ρ} agents to achieve constant accuracy.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The MDP is tabular and finite-horizon, and the learning proceeds in independent phases where each agent is assigned a policy and executes it without intra-phase communication.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Θ(H) learning phases are necessary and sufficient for polynomial-agent ε-accurate dynamics estimation in multi-agent reward-free exploration of finite-horizon tabular MDPs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Setting learning phases equal to the horizon H allows polynomial agents to approximate MDP dynamics in multi-agent reward-free exploration.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c61aa029dd055e9c47650591034855f998106fa4bcd6ee21871edefc8d840819"},"source":{"id":"2602.01453","kind":"arxiv","version":3},"verdict":{"id":"b648f2b2-d099-41cd-92be-ef7266ab6a15","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:22:30.324972Z","strongest_claim":"When the number of learning phases equals H, we present a computationally efficient algorithm that uses only Õ(S^6 H^6 A / ε²) agents to obtain an ε approximation of the dynamics (i.e., yields an ε-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to ρ < H phases requires at least A^{H/ρ} agents to achieve constant accuracy.","one_line_summary":"Θ(H) learning phases are necessary and sufficient for polynomial-agent ε-accurate dynamics estimation in multi-agent reward-free exploration of finite-horizon tabular MDPs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The MDP is tabular and finite-horizon, and the learning proceeds in independent phases where each agent is assigned a policy and executes it without intra-phase communication.","pith_extraction_headline":"Setting learning phases equal to the horizon H allows polynomial agents to approximate MDP dynamics in multi-agent reward-free exploration."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}