{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:FECGQRX6RDAS532D4LSBA6OMVJ","short_pith_number":"pith:FECGQRX6","schema_version":"1.0","canonical_sha256":"29046846fe88c12eef43e2e41079ccaa53c0884ccba9b9fb88fb5c6d1e796d15","source":{"kind":"arxiv","id":"2406.04692","version":1},"attestation_state":"computed","paper":{"title":"Mixture-of-Agents Enhances Large Language Model Capabilities","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Ben Athiwaratkun, Ce Zhang, James Zou, Jue Wang, Junlin Wang","submitted_at":"2024-06-07T07:04:10Z","abstract_excerpt":"Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary inf"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2406.04692","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2024-06-07T07:04:10Z","cross_cats_sorted":[],"title_canon_sha256":"3eca579e8e4dad3dfa8d40b5946b086c28c9157866b150f939aa4112d48a229b","abstract_canon_sha256":"6100c257d76c8a53e9e86512a1e9e0ee9b8ac073424fc1f3ba63b79b9d39ce1e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.842415Z","signature_b64":"0GWbRdqCF5AmpnLfulJjOMh1MG7D0jUZvF4RZ+nkQRdsCxRhF4pwQfHfkVxOBcsosZF1NLGsM5yCaqhbOUMxCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"29046846fe88c12eef43e2e41079ccaa53c0884ccba9b9fb88fb5c6d1e796d15","last_reissued_at":"2026-05-17T23:38:46.841917Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.841917Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Mixture-of-Agents Enhances Large Language Model Capabilities","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Ben Athiwaratkun, Ce Zhang, James Zou, Jue Wang, Junlin Wang","submitted_at":"2024-06-07T07:04:10Z","abstract_excerpt":"Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary inf"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That feeding previous-layer outputs as auxiliary information to each agent will reliably improve response quality without introducing noise or compounding errors from weaker models in the ensemble.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7831bfcb06858447f1bc80802afddaad7ce774cf4314d5abf63ac9743f9d596c"},"source":{"id":"2406.04692","kind":"arxiv","version":1},"verdict":{"id":"60d7e71d-fafb-4831-a8ac-572accea2ab0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T19:25:42.086172Z","strongest_claim":"MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.","one_line_summary":"A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That feeding previous-layer outputs as auxiliary information to each agent will reliably improve response quality without introducing noise or compounding errors from weaker models in the ensemble.","pith_extraction_headline":"A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context."},"references":{"count":27,"sample":[{"doi":"","year":null,"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","ref_index":1,"cited_arxiv_id":"2309.16609","is_internal_anchor":true},{"doi":"","year":1901,"title":"D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al","work_id":"de1141da-47d9-4d67-9302-a1f3da653478","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","work_id":"eac74d79-d8d1-49dd-8565-53d713a84fff","ref_index":3,"cited_arxiv_id":"2308.07201","is_internal_anchor":true},{"doi":"","year":null,"title":"Reconcile: Round-table conference improves reasoning via consensus among diverse llms","work_id":"b594889e-fc31-4736-8dad-d37cc1f62a8d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Active prompting with chain-of-thought for large language models","work_id":"693ffda3-001d-4079-bb6e-21d98981a8fd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":27,"snapshot_sha256":"c0a3c0a111b378b2b3f025aa55bc6501843cfacaecfabf8fd69a7f7e707eed18","internal_anchors":15},"formal_canon":{"evidence_count":3,"snapshot_sha256":"f07e69986a458bd00d0a70f1be95cf87bc1dc57719b07774d040c7e421e09d0f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2406.04692","created_at":"2026-05-17T23:38:46.841999+00:00"},{"alias_kind":"arxiv_version","alias_value":"2406.04692v1","created_at":"2026-05-17T23:38:46.841999+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2406.04692","created_at":"2026-05-17T23:38:46.841999+00:00"},{"alias_kind":"pith_short_12","alias_value":"FECGQRX6RDAS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"FECGQRX6RDAS532D","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"FECGQRX6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":19,"internal_anchor_count":19,"sample":[{"citing_arxiv_id":"2512.04695","citing_title":"TRINITY: An Evolved LLM Coordinator","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2512.22579","citing_title":"SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2602.19509","citing_title":"Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01750","citing_title":"Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03809","citing_title":"Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27586","citing_title":"Trace-Level Analysis of Information Contamination in Multi-Agent Systems","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2407.21787","citing_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22906","citing_title":"Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities","ref_index":157,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01566","citing_title":"Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02290","citing_title":"Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04097","citing_title":"CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18133","citing_title":"Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures","ref_index":128,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07244","citing_title":"Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14585","citing_title":"Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05216","citing_title":"SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16205","citing_title":"ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17950","citing_title":"CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19049","citing_title":"Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21190","citing_title":"SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning","ref_index":34,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ","json":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ.json","graph_json":"https://pith.science/api/pith-number/FECGQRX6RDAS532D4LSBA6OMVJ/graph.json","events_json":"https://pith.science/api/pith-number/FECGQRX6RDAS532D4LSBA6OMVJ/events.json","paper":"https://pith.science/paper/FECGQRX6"},"agent_actions":{"view_html":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ","download_json":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ.json","view_paper":"https://pith.science/paper/FECGQRX6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2406.04692&json=true","fetch_graph":"https://pith.science/api/pith-number/FECGQRX6RDAS532D4LSBA6OMVJ/graph.json","fetch_events":"https://pith.science/api/pith-number/FECGQRX6RDAS532D4LSBA6OMVJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ/action/storage_attestation","attest_author":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ/action/author_attestation","sign_citation":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ/action/citation_signature","submit_replication":"https://pith.science/pith/FECGQRX6RDAS532D4LSBA6OMVJ/action/replication_record"}},"created_at":"2026-05-17T23:38:46.841999+00:00","updated_at":"2026-05-17T23:38:46.841999+00:00"}