DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Pith reviewed 2026-05-11 22:45 UTC · model grok-4.3
The pith
DeepSeekMoE performs comparably to larger or dense baselines at lower compute by segmenting experts finely and isolating shared experts, which lets the routed experts specialize.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeekMoE activates mK experts chosen from a pool of mN finely segmented experts while also keeping Ks experts permanently active as shared experts. The fine segmentation allows more precise routing combinations, and the shared experts absorb common patterns so that the routed experts can specialize without redundancy. Experiments from 2B to 145B parameters show that these two design choices produce performance on par with larger or denser baselines at substantially lower compute.
What carries the argument
The DeepSeekMoE architecture, which finely segments the expert pool into mN units (activating mK) and isolates Ks shared experts to capture common knowledge.
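A minimal sketch of how such a layer might route a single token, in plain Python. It assumes softmax affinities against per-expert centroid vectors, and it deducts the Ks shared experts from both the pool and the active budget (mN - Ks routed experts, mK - Ks of them active). That accounting, the toy linear experts, and all names (`make_expert`, `moe_forward`, `centroids`) are illustrative assumptions, not the paper's code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical stand-in for an FFN expert: one dim x dim linear map."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return lambda u: [sum(wij * uj for wij, uj in zip(row, u)) for row in w]


def moe_forward(u, shared, routed, centroids, top_k):
    """Route one token vector u through shared and routed experts."""
    # Affinity of the token to each routed expert.
    scores = softmax([sum(c * x for c, x in zip(cen, u)) for cen in centroids])
    # Only the top_k highest-affinity routed experts are evaluated.
    chosen = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)[:top_k]
    out = list(u)  # residual connection
    for expert in shared:  # shared experts: always active, no gate
        out = [o + y for o, y in zip(out, expert(u))]
    for i in chosen:  # routed experts: weighted by their affinity score
        out = [o + scores[i] * y for o, y in zip(out, routed[i](u))]
    return out, sorted(chosen)


# Illustrative sizes (assumed, not from the paper): N=4, K=2, m=4, Ks=1.
N, K, m, Ks, dim = 4, 2, 4, 1, 8
rng = random.Random(0)
shared = [make_expert(dim, rng) for _ in range(Ks)]
routed = [make_expert(dim, rng) for _ in range(m * N - Ks)]
centroids = [[rng.gauss(0, 1) for _ in range(dim)] for _ in routed]
token = [rng.gauss(0, 1) for _ in range(dim)]

out, chosen = moe_forward(token, shared, routed, centroids, top_k=m * K - Ks)
print(len(routed), "routed experts,", len(chosen), "active, plus", Ks, "shared")
```

The shared experts bypass the router entirely, so whatever they learn is available to every token; only the routed experts compete for the remaining active slots.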
If this is right
- MoE models can be made to approach the performance of dense models of equal total parameter count.
- Expert specialization improves when routing choices become finer-grained without raising the number of active parameters.
- Shared experts reduce redundancy among routed experts, freeing capacity for specialized knowledge.
- The same design pattern scales from 2B to at least 145B parameters while retaining its efficiency edge over conventional MoE.
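The point about finer-grained routing can be made concrete by counting how many distinct sets of active experts the router can choose from. The sketch below uses illustrative sizes (16 experts, top-2, split factor m = 4); the function name is hypothetical. Because each fine-grained expert is 1/m the size of an original one, activating mK of them costs the same as activating K originals.

```python
import math


def routing_combinations(n_experts, k_active, m=1):
    """Count distinct active-expert sets when each expert is split m ways."""
    return math.comb(m * n_experts, m * k_active)


coarse = routing_combinations(16, 2)      # top-2 of 16 experts
fine = routing_combinations(16, 2, m=4)   # top-8 of 64 smaller experts
print(coarse, fine)  # 120 4426165368
```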
Where Pith is reading between the lines
- Designers of future MoE systems may adopt similar segmentation and shared-expert patterns to control compute while growing total capacity.
- Specialization metrics such as pairwise expert overlap or knowledge-diversity scores could become standard diagnostics for new MoE variants.
- The approach may combine naturally with other efficiency methods like quantization or pruning to push the efficiency frontier further.
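As a hedged illustration of what a "pairwise expert overlap" diagnostic could look like, the sketch below takes one possible definition: the mean Jaccard overlap between the sets of tokens routed to each pair of experts. The source does not define the metric, so this definition, the function name, and the toy routing traces are all assumptions.

```python
from itertools import combinations


def pairwise_expert_overlap(assignments, n_experts):
    """Mean Jaccard overlap between the token sets routed to each expert pair.

    assignments: one list of chosen expert indices per token.
    """
    tokens_per_expert = [set() for _ in range(n_experts)]
    for token_id, experts in enumerate(assignments):
        for e in experts:
            tokens_per_expert[e].add(token_id)
    overlaps = []
    for a, b in combinations(range(n_experts), 2):
        union = tokens_per_expert[a] | tokens_per_expert[b]
        if union:  # skip pairs where neither expert received a token
            inter = tokens_per_expert[a] & tokens_per_expert[b]
            overlaps.append(len(inter) / len(union))
    return sum(overlaps) / len(overlaps) if overlaps else 0.0


# Toy routing traces (made up for illustration): 4 tokens, 4 experts, top-2.
paired = [[0, 1], [0, 1], [2, 3], [2, 3]]
rotating = [[0, 1], [1, 2], [2, 3], [3, 0]]
print(pairwise_expert_overlap(paired, 4), pairwise_expert_overlap(rotating, 4))
```

Under this definition a lower value means experts see more distinct token populations. Note that experts which are always co-activated score a full overlap, so the number reflects co-activation as much as redundancy and would need a knowledge-level metric alongside it.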
Load-bearing premise
The measured gains in performance and efficiency arise from the fine segmentation and shared-expert isolation rather than from unreported differences in training data, optimizer, or total compute.
What would settle it
A side-by-side training run that uses identical data, optimizer, and total compute for both a standard top-K GShard model and a DeepSeekMoE model, then compares their final validation loss and downstream task scores.
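One way to make "identical data, optimizer, and total compute" checkable is to diff the two run descriptions and flag anything that differs besides the architecture. The sketch below assumes a hypothetical `RunConfig` record; its field names and every value are placeholders, not settings from the paper.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical training-run description; field names are illustrative."""
    architecture: str
    data_mixture: str
    train_tokens: int
    optimizer: str
    learning_rate: float
    total_flops: float


def uncontrolled_differences(a, b, allowed=("architecture",)):
    """Return fields that differ between two runs, other than the allowed ones."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k not in allowed and da[k] != db[k]}


# Placeholder values only; they are not taken from the paper.
gshard = RunConfig("gshard_top_k", "corpus_v1", 100_000_000_000, "adamw", 1e-3, 1e21)
dsmoe = RunConfig("deepseekmoe", "corpus_v1", 100_000_000_000, "adamw", 1e-3, 1e21)
print(uncontrolled_differences(gshard, dsmoe))  # {}
```

An empty result means the comparison differs only in architecture; any returned field is a candidate confound for the measured gap.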
Original abstract
In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the DeepSeekMoE architecture for Mixture-of-Experts language models, introducing two strategies: (1) fine segmentation of experts into mN total experts while activating mK of them for more flexible combinations, and (2) isolating Ks shared experts to capture common knowledge and reduce redundancy among routed experts. Starting from a 2B-parameter model, it reports that DeepSeekMoE 2B achieves performance comparable to GShard 2.9B (with 1.5x expert parameters and computation) and nearly matches its dense counterpart; scaling to 16B parameters yields performance comparable to LLaMA2 7B at ~40% of the computation; preliminary scaling to 145B parameters shows advantages over GShard and comparability to DeepSeek 67B at 28.5% (or possibly 18.2%) of the computation.
Significance. If the reported gains can be attributed to the proposed strategies rather than uncontrolled variables, the work would advance MoE designs by demonstrating a path to better expert specialization and substantially lower activated compute at scale. The progression from 2B to 145B provides a useful empirical scaling study, and the near-parity with dense models at the small scale is a notable data point for the efficiency upper bound of MoE.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the central performance claims (DeepSeekMoE 2B vs. GShard 2.9B, DeepSeekMoE 16B vs. LLaMA2 7B, and 145B scaling) are presented without error bars, without stating the number of independent runs, and without indicating whether the 145B result is a single training run. This undermines the ability to judge whether the observed differences are statistically reliable or reproducible.
- [§3 and §4] §3 (Architecture) and §4 (Experiments): no ablation studies are reported that isolate the contribution of the two proposed strategies (fine segmentation into mN experts and isolation of Ks shared experts) from differences in training data distribution, total tokens seen, optimizer hyperparameters, or wall-clock compute. The comparisons therefore cannot support the claim that the gains arise from improved expert specialization.
- [§4] §4 (Scaling experiments): the statement that DeepSeekMoE 16B uses “only about 40% of computations” relative to LLaMA2 7B and that the 145B model uses 28.5% (or 18.2%) relative to DeepSeek 67B requires explicit confirmation that training tokens, data mixture, and total FLOPs were matched or controlled; absent such controls, the attribution to the MoE architecture is not load-bearing.
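The first comment's request for error bars amounts to reporting a spread over independent runs. A minimal sketch, assuming final benchmark scores from several seeds are available; the function name and the scores below are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation over independent training runs."""
    if len(scores) < 2:
        # A single run gives a point estimate but no spread.
        return scores[0], None
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder scores for three hypothetical seeds (not paper results).
mean, std = summarize_runs([0.50, 0.52, 0.54])
single_mean, single_std = summarize_runs([0.51])
print(f"{mean:.3f} +/- {std:.3f}", single_std)
```

The single-run branch returns no spread at all, which is the situation the comment describes: a gap between two single runs cannot be compared against run-to-run variance.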
minor comments (2)
- [Abstract] Abstract: the parenthetical “maybe even 18.2%” for the 145B computation ratio is presented without a corresponding configuration, table entry, or footnote explaining the alternative setting.
- [§3] Notation: the relationship between the conventional top-K out of N and the new mK out of mN is introduced in the abstract but would benefit from an explicit equation or diagram in §3 showing how the routing and activation change.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our paper. Below, we provide detailed responses to each major comment and indicate the revisions we plan to make.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the central performance claims (DeepSeekMoE 2B vs. GShard 2.9B, DeepSeekMoE 16B vs. LLaMA2 7B, and 145B scaling) are presented without error bars, without stating the number of independent runs, and without indicating whether the 145B result is a single training run. This undermines the ability to judge whether the observed differences are statistically reliable or reproducible.
Authors: We acknowledge that the manuscript includes no error bars and does not state the number of runs. Training large MoE models, especially at 145B parameters, is extremely resource-intensive, which makes multiple independent runs impractical, so all reported results come from single training runs. We will revise the abstract and §4 to state this explicitly and to clarify that the 145B scaling is preliminary. For the 2B and 16B models, we have internal evidence of stability and will note it in the paper.
Revision: partial
-
Referee: [§3 and §4] §3 (Architecture) and §4 (Experiments): no ablation studies are reported that isolate the contribution of the two proposed strategies (fine segmentation into mN experts and isolation of Ks shared experts) from differences in training data distribution, total tokens seen, optimizer hyperparameters, or wall-clock compute. The comparisons therefore cannot support the claim that the gains arise from improved expert specialization.
Authors: We agree that dedicated ablation studies would better isolate the effects of fine-grained expert segmentation and shared experts. In our experiments, we aimed to keep other factors constant across comparisons, but we did not perform exhaustive ablations controlling for every variable. We will add a subsection to the revised §4 that discusses the contribution of each strategy, with additional 2B-scale experiments comparing models with and without shared experts and at different segmentation levels, and that notes the limits on fully controlling all hyperparameters at larger scales.
Revision: partial
-
Referee: [§4] §4 (Scaling experiments): the statement that DeepSeekMoE 16B uses “only about 40% of computations” relative to LLaMA2 7B and that the 145B model uses 28.5% (or 18.2%) relative to DeepSeek 67B requires explicit confirmation that training tokens, data mixture, and total FLOPs were matched or controlled; absent such controls, the attribution to the MoE architecture is not load-bearing.
Authors: We confirm that the compute savings are calculated from activated parameters per token, with training data and token counts matched as closely as possible to the baselines. DeepSeekMoE 16B and LLaMA2 7B were both trained on approximately the same number of tokens from similar data distributions, and the 40% figure refers to the reduction in FLOPs due to sparse activation. The 145B comparison is computed the same way. We will revise §4 to give a detailed breakdown of the training setup, including total tokens processed, data mixture, and the exact method for computing relative compute (activated FLOPs), so that the claims are well supported.
Revision: yes
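To show what "relative compute from activated parameters" could mean in practice, the sketch below uses the common rule of thumb of roughly 2 forward FLOPs per active parameter per token. That rule, the function names, and the parameter counts are assumptions for illustration; the paper's own accounting may differ, for example by including attention-score FLOPs.

```python
def forward_flops_per_token(active_params):
    """Rough forward-pass FLOPs per token: about 2 per active parameter."""
    return 2 * active_params


def relative_activated_compute(moe_activated_params, dense_params):
    """Per-token compute of a sparse model relative to a dense baseline."""
    return forward_flops_per_token(moe_activated_params) / forward_flops_per_token(dense_params)


# Placeholder sizes chosen to give a round ratio; not the models' real counts.
ratio = relative_activated_compute(moe_activated_params=2.0e9, dense_params=5.0e9)
print(f"{ratio:.0%}")  # 40%
```

A ratio computed this way says nothing about whether the two models saw the same tokens or data mixture, which is why the referee asks for those to be reported separately.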
Circularity Check
No circularity: empirical architecture proposal with measured benchmarks
The paper proposes DeepSeekMoE, an MoE variant using fine expert segmentation (mN experts, mK activated) and Ks shared experts, then reports empirical results: DeepSeekMoE 2B matches GShard 2.9B performance, approaches dense 2B upper bound, and 16B matches LLaMA2 7B at ~40% compute. These are trained-model evaluations on held-out benchmarks, not predictions derived from equations. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations of uniqueness theorems appear. Architecture choices are design decisions justified by specialization goals; performance attribution is experimental, not a closed mathematical reduction. The derivation chain is self-contained as standard empirical ML work.
Axiom & Free-Parameter Ledger
free parameters (2)
- m (expert segmentation factor)
- Ks (number of shared experts)
Forward citations
Cited by 47 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
-
Mixture of Layers with Hybrid Attention
Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
-
SDG-MoE: Signed Debate Graph Mixture-of-Experts
SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.
-
SDG-MoE: Signed Debate Graph Mixture-of-Experts
SDG-MoE adds learned support and critique graphs plus disagreement-gated message passing to MoE models, yielding 19.8% better validation perplexity than the strongest baseline in three-seed pretraining.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
OneTrackerV2 unifies multimodal tracking via Meta Merger and Dual Mixture-of-Experts to reach state-of-the-art results on five tasks and 12 benchmarks with efficiency and robustness when modalities are missing.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...
-
Path-Constrained Mixture-of-Experts
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Combining pre-trained models via localized model averaging
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
-
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
-
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
Rethinking LLM Ensembling from the Perspective of Mixture Models
ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
Geometric Routing Enables Causal Expert Control in Mixture of Experts
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
-
The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise
Expert specialization in MoEs is an emergent effect of hidden state geometry due to linear routers, not domain expertise, as confirmed empirically across models and explained by a proof on load-balancing effects.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
-
Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism
A novel adaptive MoE-based semantic communication system jointly routes experts using real-time CSI and semantic image content for improved MIMO wireless image transmission.
-
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.
-
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
A dimensionless parameter E = T*H/(O+B) >= 0.5 is claimed to guarantee zero dead experts in Mixture-of-Experts models, eliminating the need for auxiliary load-balancing losses.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
-
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
-
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.
-
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
-
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
-
Does a Global Perspective Help Prune Sparse MoEs Elegantly?
GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and...
-
Qwen3 Technical Report
-
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
OneRec unifies retrieval and ranking in a generative recommender using session-wise decoding and iterative DPO-based preference alignment, achieving real-world gains on Kuaishou.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation
CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
Reference graph
Works this paper leans on
-
[1]
E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Heslow, J. Launay, Q. Malartic, B. Noune, B. Pannier, and G. Penedo. Falcon-40B : an open large language model with state-of-the-art performance, 2023
work page 2023
-
[2]
M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X. V. Lin, J. Du, S. Iyer, R. Pasunuru, G. Anantharaman, X. Li, S. Chen, H. Akin, M. Baines, L. Martin, X. Zhou, P. S. Koura, B. O'Horo, J. Wang, L. Zettlemoyer, M. T. Diab, Z. Kozareva, and V. Stoyanov. Efficient large scale language modeling with mixtures of experts. In Y. Goldberg, Z. ...
-
[4]
S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International...
work page 2023
-
[5]
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intellig...
-
[7]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert - Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...
work page 2020
-
[8]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[12]
D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei. Stablemoe: Stable routing strategy for mixture of experts. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022 , pages 7085--7095. As...
-
[14]
N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S. Meier - Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, and C. Cui. Glam: Efficient scaling of language models with mixture-of-experts. In...
work page 2022
-
[15]
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HL...
-
[16]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961
work page internal anchor Pith review arXiv 2021
-
[17]
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[18]
X. Geng and H. Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama
work page 2023
-
[19]
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, and P. B. Gibbons. Pipedream: Fast and efficient pipeline parallel DNN training. CoRR, abs/1806.03377, 2018. URL http://arxiv.org/abs/1806.03377
work page Pith review arXiv 2018
-
[21]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
work page 2021
-
[22]
Hai-llm: An efficient and lightweight tool for training large models, 2023
High-Flyer. Hai-llm: An efficient and lightweight tool for training large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[23]
Neural Computation 9(8), 1735–1780 (1997)
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computing, 9 0 (8): 0 1735--1780, 1997. URL https://doi.org/10.1162/neco.1997.9.8.1735
-
[24]
Training Compute-Optimal Large Language Models
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022. d...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.15556 2022
-
[25]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023
-
[26]
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computing, 3 0 (1): 0 79--87, 1991. URL https://doi.org/10.1162/neco.1991.3.1.79
-
[27]
M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computing, 6 0 (2): 0 181--214, 1994. URL https://doi.org/10.1162/neco.1994.6.2.181
-
[28]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
M. Joshi , E. Choi , D. Weld , and L. Zettlemoyer . triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension . arXiv e-prints, art. arXiv:1705.03551, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[29]
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023
-
[30]
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019
-
[31]
G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy. RACE: large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 785--794. Association for Computational Lingu...
-
[32]
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb
- [33]
-
[34]
J. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. Zhang, P. Wang, A. Wang, L. Jiang, X. Jia, J. Zhang, J. Zhang, X. Zou, Z. Li, X. Deng, J. Liu, J. Xue, H. Zhou, J. Ma, J. Yu, Y. Li, W. Lin, J. Zhou, J. Tang, and H. Yang. M6: A chinese multimodal pretrainer. CoRR, abs/2103.00823, 2021. URL https://arxiv.org/abs/2103.00823
-
[35]
S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3214--3252. Association for Computational Lingu...
-
[36]
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7
-
[37]
D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2021
-
[38]
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi:10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774
-
[39]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: memory optimizations toward training trillion parameter models. In C. Cuicchi, I. Qualters, and W. T. Kramer, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, pag...
-
[40]
S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022,...
-
[41]
X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su, Q. Liu, and J. Yao. PanGu-Σ: Towards trillion parameter language model with sparse heterogeneous computing. CoRR, abs/2303.10845, 2023. URL https://doi.org/10.48550/arXiv.2303.10845
-
[42]
S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston. Hash layers for large sparse models. CoRR, abs/2106.04426, 2021. URL https://arxiv.org/abs/2106.04426
-
[43]
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019
-
[44]
T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,...
-
[45]
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016. doi:10.18653/V1/P16-1162. URL https://doi.org/10...
-
[46]
N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150
-
[47]
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg
-
[48]
S. Shen, L. Hou, Y. Zhou, N. Du, S. Longpre, J. Wei, H. W. Chung, B. Zoph, W. Fedus, X. Chen, T. Vu, Y. Wu, W. Chen, A. Webson, Y. Li, V. Zhao, H. Yu, K. Keutzer, T. Darrell, and D. Zhou. Flan-moe: Scaling instruction-finetuned language models with sparse mixture of experts. CoRR, abs/2305.14705, 2023. doi:10.48550/ARXIV.2305.14705. URL https://doi.org/10...
-
[52]
Together-AI. Redpajama-data: An open source recipe to reproduce llama training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data
-
[53]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi:10.48550/arXiv.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971
-
[55]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998--6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845a...
-
[56]
B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021
-
[57]
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A chinese language understanding evaluation benchmark. In D. Scott, N. B...
-
[58]
F. Xue, Z. Zheng, Y. Fu, J. Ni, Z. Zheng, W. Zhou, and Y. You. Openmoe: Open mixture-of-experts language models. https://github.com/XueFuzhao/OpenMoE, 2023
-
[59]
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4791--4800. Assoc...
- [60]
-
[61]
C. Zheng, M. Huang, and A. Sun. ChID: A large-scale Chinese idiom dataset for cloze test. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 778--787. Association for Computational L...
-
[62]
Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Z. Chen, Q. V. Le, and J. Laudon. Mixture-of-experts with expert choice routing. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/2f00ecd787b432c1d36f3de9800728eb-Abstract-Conference.html
-
[63]
B. Zoph. Designing effective sparse expert models. In IEEE International Parallel and Distributed Processing Symposium, IPDPS Workshops 2022, Lyon, France, May 30 - June 3, 2022, page 1044. IEEE, 2022. URL https://doi.org/10.1109/IPDPSW55747.2022.00171