pith. machine review for the scientific record. sign in

arxiv: 2409.02060 · v2 · pith:4F4CSDS3new · submitted 2024-09-03 · 💻 cs.CL · cs.AI· cs.LG

OLMoE: Open Mixture-of-Experts Language Models

Pith reviewed 2026-05-16 14:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords mixture of expertssparse activationlanguage modelsopen sourcepretraininginstruction tuningmodel efficiencybenchmark evaluation
0
0 comments X

The pith

OLMoE shows a 7B-parameter sparse MoE model with 1B active parameters per token can outperform denser models like Llama2-13B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OLMoE-1B-7B, a language model with 7 billion total parameters that activates only 1 billion for each input token through sparse expert routing. It pretrains this model on 5 trillion tokens and releases an instruction-tuned version, claiming better results than other models with matching active parameters and even some larger dense models. The authors also release the full training data, code, weights, and logs to make the work reproducible. This setup matters to readers because it points to a way to get strong performance at lower inference cost while keeping the entire system open. The central object carrying the argument is the MoE routing that keeps most parameters inactive per token.

Core claim

OLMoE-1B-7B uses a sparse Mixture-of-Experts architecture with 7 billion total parameters but only 1 billion active per token; after pretraining on 5 trillion tokens it surpasses available models with similar active parameter counts and exceeds some larger dense models such as Llama2-13B-Chat and DeepSeekMoE-16B, with all components including data and code released openly.

What carries the argument

Sparse Mixture-of-Experts routing that activates only a fixed subset of experts for each input token while keeping total parameter count higher.

If this is right

  • Inference cost scales with active parameters only, allowing larger total models without proportional runtime increase.
  • Routing specialization observed in the model separates knowledge across experts, supporting efficient scaling.
  • Full release of data and logs enables direct replication and targeted improvements to MoE training recipes.
  • Instruction tuning on the base model produces a chat version that retains the efficiency gains on downstream tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may encourage future open models to favor sparse designs over uniform dense scaling for cost-sensitive applications.
  • Routing analysis could be extended to test whether expert specialization improves with larger numbers of experts or different capacity factors.
  • Combining the open weights with quantization or distillation might further lower deployment barriers for edge use cases.
  • The released training logs allow external checks on whether routing stability holds across different hardware setups.

Load-bearing premise

Performance differences on benchmarks arise mainly from the MoE design rather than from variations in training data, token volume, or optimization choices across compared models.

What would settle it

Training a dense 7B-parameter model on the exact same 5-trillion-token dataset and data mix, then showing it matches or exceeds OLMoE scores on the same benchmarks, would falsify the claimed advantage of sparse activation.

read the original abstract

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OLMoE, a fully open sparse Mixture-of-Experts language model with 7B total parameters but only 1B active parameters per token. Pretrained on 5 trillion tokens, OLMoE-1B-7B and its instruct variant are reported to outperform all available models with comparable active parameters and to surpass larger dense models such as Llama2-13B-Chat and DeepSeekMoE-16B on standard benchmarks. The manuscript includes experiments on MoE training dynamics, routing analysis showing high expert specialization, and full open-sourcing of weights, data, code, and logs.

Significance. If the performance claims hold after controlling for training data scale and optimization differences, the work would be significant for demonstrating efficient inference via MoE with strong benchmark results in a fully open setting. The complete release of training artifacts strengthens reproducibility and enables community follow-up, which is a clear positive contribution to open LLM research.

major comments (2)
  1. [§4, Table 1] §4 (Experiments) and Table 1: The central claim that OLMoE-1B-7B outperforms Llama2-13B-Chat and DeepSeekMoE-16B rests on direct benchmark numbers, yet the text provides no matched controls for pretraining token count (5T for OLMoE vs. ~2T for Llama-2) or data mixture; without retraining baselines under identical conditions or an ablation isolating the MoE routing contribution, the attribution of gains to the 1B-active design versus data scale remains unverified.
  2. [§4.1, Table 2] §4.1 and Table 2: No error bars, standard deviations, or multiple-run statistics are reported for any benchmark scores, and baseline evaluation details (exact prompt templates, data exclusion criteria, and post-training setups) are insufficient; this undermines confidence in the statistical reliability of the reported superiority over models with similar active parameters.
minor comments (2)
  1. [Figure 4] Figure 4 (routing visualization): The expert activation heatmaps lack quantitative specialization metrics (e.g., entropy or load-balance statistics per layer) that would strengthen the qualitative claim of high specialization.
  2. [Related Work] Related Work section: Several recent MoE papers on routing stability and expert capacity are cited but not compared quantitatively to the OLMoE training recipe.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on rigorous controls and statistical reporting, and we address each major comment below with specific plans for revision where appropriate.

read point-by-point responses
  1. Referee: [§4, Table 1] §4 (Experiments) and Table 1: The central claim that OLMoE-1B-7B outperforms Llama2-13B-Chat and DeepSeekMoE-16B rests on direct benchmark numbers, yet the text provides no matched controls for pretraining token count (5T for OLMoE vs. ~2T for Llama-2) or data mixture; without retraining baselines under identical conditions or an ablation isolating the MoE routing contribution, the attribution of gains to the 1B-active design versus data scale remains unverified.

    Authors: We agree that matched controls would provide stronger causal attribution. Retraining Llama-2 or DeepSeekMoE under identical 5T-token conditions is not feasible given the scale of compute involved. Our contribution instead demonstrates that an open 1B-active MoE model trained on 5T tokens achieves competitive or superior results to available dense and MoE baselines. We will revise Section 4 and the Table 1 caption to explicitly discuss the data-scale differences and note that performance is reported relative to publicly available models. We also expand the routing analysis in §4.3 to further illustrate how expert specialization contributes to efficiency, providing indirect support for the MoE design. revision: partial

  2. Referee: [§4.1, Table 2] §4.1 and Table 2: No error bars, standard deviations, or multiple-run statistics are reported for any benchmark scores, and baseline evaluation details (exact prompt templates, data exclusion criteria, and post-training setups) are insufficient; this undermines confidence in the statistical reliability of the reported superiority over models with similar active parameters.

    Authors: We acknowledge that error bars would increase confidence. However, the prohibitive cost of multiple independent pretraining runs on 5T tokens makes this impractical. We will add a limitations paragraph in §4.1 explaining this constraint and expand the evaluation appendix with exact prompt templates, data filtering criteria, and post-training details to improve reproducibility and allow readers to assess baseline setups more precisely. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model release with external benchmark comparisons

full rationale

The paper introduces OLMoE as an empirical artifact: a 7B-parameter MoE model with 1B active parameters pretrained on 5T tokens, followed by instruction tuning and direct reporting of benchmark scores against external baselines. No equations, derivations, or first-principles predictions appear in the provided text; performance claims rest on measured perplexity and downstream task numbers rather than any fitted parameter being relabeled as a prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields limited visibility into exact hyperparameters; model relies on standard transformer and MoE routing assumptions without new invented entities.

axioms (1)
  • standard math Standard transformer layer assumptions and MoE top-k routing mechanics hold without modification
    Invoked implicitly for all MoE language model training

pith-pipeline@v0.9.0 · 5531 in / 1132 out tokens · 72504 ms · 2026-05-16T14:39:15.667936+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 8.0

    HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...

  2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  3. How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

    cs.LG 2026-05 unverdicted novelty 7.0

    The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...

  4. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  5. BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.

  6. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  7. Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks

    cs.NE 2026-05 unverdicted novelty 7.0

    Multi-hop graph analysis of RNNs reveals temporal information routing and motivates resolvent regularization that outperforms L1 by enforcing pathway-level sparsity aligned with task structure.

  8. Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks

    cs.NE 2026-05 unverdicted novelty 7.0

    RNN computation is recovered from multi-hop graph pathways, and constraining these pathways via resolvent regularization yields improved temporal sparsity and task performance over standard L1.

  9. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  10. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  11. Hierarchical Mixture-of-Experts with Two-Stage Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...

  12. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  13. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...

  14. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  15. RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kern...

  16. Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    Alloc-MoE allocates a fixed expert activation budget using layer-level dynamic programming based on sensitivity and token-level score-based redistribution, delivering 1.15x prefill and 1.34x decode speedups on DeepSee...

  17. Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

    cs.LG 2026-03 unverdicted novelty 6.0

    Shared biases across LLMs from common pretraining misalign with teaching quality and negatively correlate with intended student learning outcomes, with model ensembles amplifying the misalignment.

  18. Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

    cs.DC 2025-10 conditional novelty 6.0

    Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.

  19. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

    cs.DC 2025-09 unverdicted novelty 6.0

    GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.

  20. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

    cs.CL 2025-09 unverdicted novelty 6.0

    ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...

  21. Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    cs.CL 2024-11 conditional novelty 6.0

    MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

  22. LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

    cs.LG 2026-01 unverdicted novelty 3.0

    A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

Reference graph

Works this paper leans on

236 extracted references · 236 canonical work pages · cited by 21 Pith papers

  1. [1]

    Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S ´ebastien Bubeck, Qin Cai, Martin Cai, Caio C ´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling ...

  2. [2]

    01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and...

  3. [3]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr´on, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

  4. [4]

    Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. A Survey on Data Selection for Language Models

  5. [5]

    Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Car- los Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars!

  6. [6]

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf

  7. [7]

    SmolLM - blazingly fast and remarkably powerful

  8. [8]

    Zeyuan Allen-Zhu and Yuanzhi Li. 2024. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

  9. [9]

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxan- dra Cojocaru, M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo

  10. [10]

    The Falcon Series of Open Language Models

  11. [11]

    Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Pas- sos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Pas- sos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing...

  12. [12]

    Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 20...

  13. [13]

    Jiang, Jia Deng, Stella Biderman, and Sean Welleck

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2023. Llemma: An Open Language Model For Mathematics

  14. [14]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization

  15. [15]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Sheng- guang Wu, Benfeng...

  16. [16]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

  17. [17]

    Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  18. [18]

    Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, Meng Lee, Emad Mostaque, Michael Pieler, Nikhil Pinnaparju, Paulo Rocha, Harry Saini, Han- nah Teufel, Niccolo Zanichelli, and Carlos Riquelme. 2024. Stable LM 2 1.6B Technical Report

  19. [19]

    Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. Conditional Computation in Neural Networks for faster models

  20. [20]

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A Suite for Ana- lyzing Large Language Models Across Training and Scaling. 27

  21. [21]

    Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A

    Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Ab- basi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, An- thony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y . Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pav...

  22. [22]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. PIQA: Reasoning about Physical Commonsense in Natural Language

  23. [23]

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Gold- ing, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Wein- bach. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model

  24. [24]

    Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow

  25. [25]

    Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. 2023. The Foundation Model Transparency Index

  26. [26]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Lan- guage Models are Few-Shot Learners

  27. [27]

    Tianle Cai. 2023. Mixtral from Mistral

  28. [28]

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, S...

  29. [29]

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels

  30. [30]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo- hammad Bavari...

  31. [31]

    Soumith Chintala. 2024. GPT-4 MoE. 28

  32. [32]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al

  33. [33]

    PaLM: Scaling Language Modeling with Pathways

  34. [34]

    Brown, Miljan Martic, Shane Legg, and Dario Amodei

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei

  35. [35]

    Deep reinforcement learning from human preferences

  36. [36]

    Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jor- dan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Al- bin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinya...

  37. [37]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

  38. [38]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

  39. [39]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems

  40. [40]

    Together Computer. 2023. RedPajama: An Open Source Recipe to Reproduce LLaMA train- ing dataset

  41. [41]

    R ´obert Csord ´as, Kazuki Irie, J ¨urgen Schmidhuber, Christopher Potts, and Christopher D. Manning. 2024. MoEUT: Mixture-of-Experts Universal Transformers

  42. [42]

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. UltraFeedback: Boosting Language Models with High-quality Feedback

  43. [43]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y . Wu, Zhenda Xie, Y . K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

  44. [44]

    Databricks. 2024. DBRX

  45. [45]

    Dauphin, Angela Fan, Michael Auli, and David Grangier

    Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language Modeling with Gated Convolutional Networks

  46. [46]

    DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A....

  47. [47]

    Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

    DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian L...

  48. [48]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Car- los Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin ...

  49. [49]

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser

  50. [50]

    Universal Transformers

  51. [51]

    Nolan Dey, Gurpreet Gosal, Zhiming, Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. 2023. Cerebras-GPT: Open Compute-Optimal Lan- guage Models Trained on the Cerebras Wafer-Scale Cluster

  52. [52]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. PaLM-E: An Emb...

  53. [53]

    Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

  54. [54]

    Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, and Angela Fan

  55. [55]

    Tricks for Training Sparse Translation Models

  56. [56]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, An- thony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur 30 Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, et al. 2024. The Llama 3 Herd of Models

  57. [57]

    Hashimoto

    Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length- Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

  58. [58]

    David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. 2014. Learning Factored Represen- tations in a Deep Mixture of Experts

  59. [59]

    Kenneth Enevoldsen, M ´arton Kardos, Niklas Muennighoff, and Kristoffer Laigaard Nielbo

  60. [60]

    The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilin- gual and Monolingual Text Embedding

  61. [61]

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model Alignment as Prospect Theoretic Optimization

  62. [62]

    Guerreiro, Ant ´onio Loison, Duarte M

    Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, Ant ´onio Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, Jo ˜ao Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, Franc ¸ois Yvon, Andr´e F. T. Martins, Gautier Viaud, C´eline Hudelot, and Pierre Colombo. 2024. CroissantLLM: A Truly Bilingual French-English Language Model

  63. [63]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  64. [64]

    Dimakis, Gabriel Il- harco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig Schmidt

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Il- harco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, ...

  65. [65]

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2022. MegaBlocks: Effi- cient Sparse Training with Mixture-of-Experts

  66. [66]

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

  67. [67]

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A frame- work for few-shot language model evaluation

  68. [68]

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

  69. [69]

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. 2024. Zamba: A Compact 7B SSM Hybrid Model

  70. [70]

    Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning

  71. [71]

    Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, and Jesse Dodge. 2023. Catwalk: A Unified Language Model Evaluation Framework for Many Datasets. 31

  72. [72]

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muen- nighoff, Aak...

  73. [73]

    Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. 2017. Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

  74. [74]

    Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Ha- jishirzi. 2024. OLMES: A Standard for Language Model Evaluations

  75. [75]

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C ´esar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, S ´ebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks Are All You Need

  76. [76]

    Xu Owen He. 2024. Mixture of A Million Experts

  77. [77]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding

  78. [78]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset

  79. [79]

    Rae, Oriol Vinyals, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  80. [80]

    Jiwoo Hong, Noah Lee, and James Thorne. 2024. ORPO: Monolithic Preference Optimiza- tion without Reference Model

Showing first 80 references.