{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:4F4CSDS3JUYT4CXTN6WB3MYQAQ","short_pith_number":"pith:4F4CSDS3","schema_version":"1.0","canonical_sha256":"e178290e5b4d313e0af36fac1db310041b0ab0879963349fe3d04da5142c5cfd","source":{"kind":"arxiv","id":"2409.02060","version":2},"attestation_state":"computed","paper":{"title":"OLMoE: Open Mixture-of-Experts Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"OLMoE shows a 7B-parameter sparse MoE model with 1B active parameters per token can outperform denser models like Llama2-13B.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Akshita Bhagia, Alexander Wettig, Ali Farhadi, Amanpreet Singh, Binyuan Hui, David Wadden, Dirk Groeneveld, Douwe Kiela, Dustin Schwenk, Hannaneh Hajishirzi, Jacob Morrison, Kyle Lo, Luca Soldaini, Nathan Lambert, Niklas Muennighoff, Noah A. Smith, Oyvind Tafjord, Pang Wei Koh, Pete Walsh, Sewon Min, Shane Arora, Tim Dettmers, Weijia Shi, Yuling Gu","submitted_at":"2024-09-03T17:08:20Z","abstract_excerpt":"We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. 
We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.02060","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-09-03T17:08:20Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"ea4f149eedbdfb33fbd75aabf8e4f4c5741ff88e5d58565436ddf2c697541b16","abstract_canon_sha256":"9c16d9c7d6967a5232dece92619a6ab9873c90da68f13440851db7ecfbd86e8e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.557263Z","signature_b64":"O79EFSvIKyoQLQDmeoUWcLN0A8tGRlRJPMcqKCfEj7K68D3+e5WASQhzdgijydCthaAplS+4+7//1irAyjIwBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e178290e5b4d313e0af36fac1db310041b0ab0879963349fe3d04da5142c5cfd","last_reissued_at":"2026-05-17T23:38:47.556662Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.556662Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"OLMoE: Open Mixture-of-Experts Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"OLMoE shows a 7B-parameter sparse MoE model with 1B active parameters per token can outperform denser models like Llama2-13B.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Akshita Bhagia, Alexander Wettig, Ali Farhadi, Amanpreet 
Singh, Binyuan Hui, David Wadden, Dirk Groeneveld, Douwe Kiela, Dustin Schwenk, Hannaneh Hajishirzi, Jacob Morrison, Kyle Lo, Luca Soldaini, Nathan Lambert, Niklas Muennighoff, Noah A. Smith, Oyvind Tafjord, Pang Wei Koh, Pete Walsh, Sewon Min, Shane Arora, Tim Dettmers, Weijia Shi, Yuling Gu","submitted_at":"2024-09-03T17:08:20Z","abstract_excerpt":"We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That benchmark comparisons are fair across models trained under different data regimes, token counts, and optimization details, with no post-hoc selection affecting the reported gains.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"OLMoE-1B-7B is an open MoE language model activating 1B parameters per token that outperforms models with similar active parameters after pretraining on 5T tokens.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"OLMoE shows a 7B-parameter sparse MoE model with 1B active 
parameters per token can outperform denser models like Llama2-13B.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"dff8ac6123674789f53169a0b33dac4e5f958e9fbb8f8dfb3dfc0f9d15421dcf"},"source":{"id":"2409.02060","kind":"arxiv","version":2},"verdict":{"id":"0731d0fc-a34d-44d9-b11d-5db04e88e6f8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T14:39:15.667936Z","strongest_claim":"Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B.","one_line_summary":"OLMoE-1B-7B is an open MoE language model activating 1B parameters per token that outperforms models with similar active parameters after pretraining on 5T tokens.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That benchmark comparisons are fair across models trained under different data regimes, token counts, and optimization details, with no post-hoc selection affecting the reported gains.","pith_extraction_headline":"OLMoE shows a 7B-parameter sparse MoE model with 1B active parameters per token can outperform denser models like Llama2-13B."},"references":{"count":236,"sample":[{"doi":"","year":2024,"title":"Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R","work_id":"001a7ac5-5354-4c88-b32a-2c271d2e7626","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"01. 
AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Ya","work_id":"d091a2ff-37d3-474b-b7ca-74f41db22bd0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","work_id":"2081683d-c693-4aa8-b386-b91bcb43db8a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, an","work_id":"1ee59b5b-7853-4c22-9bb7-4edad6f50081","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. 
SantaCoder: don’t reach for","work_id":"3c52f2b9-4ba4-4dce-a830-db00b6a20e73","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":236,"snapshot_sha256":"73b5838e0fa3c65b7167c72298e5b57b6b38e20f854eb393c48d3645541ed650","internal_anchors":0},"formal_canon":{"evidence_count":3,"snapshot_sha256":"f5f3f00d934c8eeefa61beaa35fa6c3b762e549175b0300f795e7d957e8db3e1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.02060","created_at":"2026-05-17T23:38:47.556765+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.02060v2","created_at":"2026-05-17T23:38:47.556765+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.02060","created_at":"2026-05-17T23:38:47.556765+00:00"},{"alias_kind":"pith_short_12","alias_value":"4F4CSDS3JUYT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"4F4CSDS3JUYT4CXT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"4F4CSDS3","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":23,"internal_anchor_count":23,"sample":[{"citing_arxiv_id":"2509.21892","citing_title":"Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2509.25041","citing_title":"GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2510.05497","citing_title":"Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2411.04996","citing_title":"Mixture-of-Transformers: A Sparse and Scalable Architecture 
for Multi-Modal Foundation Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19349","citing_title":"ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution","ref_index":242,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14053","citing_title":"LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00883","citing_title":"Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2603.06003","citing_title":"EvoESAP: Non-Uniform Expert Pruning for Sparse MoE","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13997","citing_title":"HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14200","citing_title":"How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12476","citing_title":"Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03598","citing_title":"Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08292","citing_title":"Hierarchical Mixture-of-Experts with Two-Stage Optimization","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10655","citing_title":"BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight 
Quantization","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06665","citing_title":"UniPool: A Globally Shared Expert Pool for Mixture-of-Experts","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23036","citing_title":"Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06206","citing_title":"Federation of Experts: Communication Efficient Distributed Inference for Large Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05365","citing_title":"ZAYA1-8B Technical Report","ref_index":196,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08133","citing_title":"Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07260","citing_title":"When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03598","citing_title":"Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26039","citing_title":"RaMP: Runtime-Aware Megakernel Polymorphism for 
Mixture-of-Experts","ref_index":2,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ","json":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ.json","graph_json":"https://pith.science/api/pith-number/4F4CSDS3JUYT4CXTN6WB3MYQAQ/graph.json","events_json":"https://pith.science/api/pith-number/4F4CSDS3JUYT4CXTN6WB3MYQAQ/events.json","paper":"https://pith.science/paper/4F4CSDS3"},"agent_actions":{"view_html":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ","download_json":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ.json","view_paper":"https://pith.science/paper/4F4CSDS3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.02060&json=true","fetch_graph":"https://pith.science/api/pith-number/4F4CSDS3JUYT4CXTN6WB3MYQAQ/graph.json","fetch_events":"https://pith.science/api/pith-number/4F4CSDS3JUYT4CXTN6WB3MYQAQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ/action/storage_attestation","attest_author":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ/action/author_attestation","sign_citation":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ/action/citation_signature","submit_replication":"https://pith.science/pith/4F4CSDS3JUYT4CXTN6WB3MYQAQ/action/replication_record"}},"created_at":"2026-05-17T23:38:47.556765+00:00","updated_at":"2026-05-17T23:38:47.556765+00:00"}