Recognition: 2 theorem links
· Lean TheoremDeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Pith reviewed 2026-05-11 05:30 UTC · model grok-4.3
The pith
DeepSeek-V2 shows a Mixture-of-Experts model with 236 billion total parameters but only 21 billion activated per token can match top open-source language models while lowering training costs and inference demands.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-V2 is a Mixture-of-Experts model with 236 billion total parameters of which 21 billion activate for each token and a maximum context length of 128 thousand tokens. It incorporates Multi-head Latent Attention to compress the key-value cache into a compact latent vector and DeepSeekMoE to perform sparse computation during training. After pretraining on a high-quality 8.1 trillion token corpus and subsequent supervised fine-tuning plus reinforcement learning, the model surpasses the performance of DeepSeek 67B while cutting training costs by 42.5 percent, shrinking the KV cache by 93.3 percent, and raising maximum generation throughput by a factor of 5.76. The chat versions of DeepSeek
What carries the argument
Multi-head Latent Attention (MLA) and DeepSeekMoE, which together compress the KV cache and restrict computation to a sparse subset of experts so that model capacity grows without proportional increases in active parameters or memory.
If this is right
- Training budgets for strong language models can be reduced without sacrificing benchmark results.
- Inference hardware can support higher throughput and longer contexts because the KV cache occupies far less memory.
- Open-source models become more competitive with closed systems when active parameter counts stay low.
- Sparse activation patterns allow scaling total model size while keeping per-token compute manageable.
- Fine-tuning steps such as SFT and RL can further unlock capability after economical pretraining.
Where Pith is reading between the lines
- Similar sparse designs could be adapted to reduce energy use in large-scale AI training across different model families.
- The efficiency gains may make 128K context practical for more real-time or interactive applications.
- Future experiments could test whether the same active-parameter ratio holds performance when the model is scaled beyond 236 billion total parameters.
- The approach connects efficiency improvements directly to accessibility for researchers with modest compute resources.
Load-bearing premise
The reported performance and efficiency advantages arise from the MLA and DeepSeekMoE designs rather than from differences in training data selection or unstated implementation details.
What would settle it
An independent replication that trains the exact architecture on a comparable corpus but fails to match the claimed performance, cost savings, or throughput gains would falsify the central claim.
read the original abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DeepSeek-V2, a 236B total parameter Mixture-of-Experts language model with 21B activated parameters per token and 128K context length. It introduces Multi-head Latent Attention (MLA) for KV cache compression and DeepSeekMoE for sparse computation, pretrained on 8.1T tokens followed by SFT and RL. The central claims are that it significantly outperforms DeepSeek-67B while reducing training costs by 42.5%, KV cache by 93.3%, and increasing generation throughput by 5.76x, achieving top-tier performance among open-source models despite the low activated parameter count.
Significance. If the performance and efficiency results hold under fair, standardized evaluations, this would represent a meaningful advance in economical LLM scaling by showing that targeted architectural innovations in attention and MoE routing can deliver strong results with substantially lower active compute and memory costs. The large-scale pretraining corpus and measured gains provide concrete data points that could inform future work on sparse models.
major comments (2)
- Evaluation section: The top-tier performance claim with only 21B activated parameters is load-bearing for the paper's contribution. The manuscript must explicitly document the evaluation protocol (number of shots, prompt templates, decoding strategy, and temperature) applied identically to DeepSeek-V2 and all baselines (including DeepSeek-67B and other open-source models). Without this, it remains possible that reported gains reflect differences in evaluation setup rather than the MLA or DeepSeekMoE innovations.
- Training and efficiency claims (abstract and §3): The 42.5% training cost reduction and 93.3% KV cache reduction are presented as direct consequences of the architecture. The paper should provide the precise calculation method (e.g., total FLOPs, wall-clock time on specified hardware, or token throughput) and confirm that the comparison to DeepSeek-67B normalizes for the 8.1T token corpus and any differences in training infrastructure.
minor comments (2)
- Abstract: The phrase 'top-tier performance' is used without reference to specific benchmark scores or tables; adding one or two key numbers (e.g., average score on MMLU or GSM8K) would improve clarity for readers.
- Notation: The definitions of MLA and DeepSeekMoE are introduced in the abstract and early sections; a brief equation or diagram reference in the main text would help readers quickly locate the formal description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed both major comments by expanding the manuscript with explicit documentation of the evaluation protocol and precise methodological details on the efficiency calculations. These revisions strengthen the clarity and reproducibility of our claims without altering the core results.
read point-by-point responses
-
Referee: Evaluation section: The top-tier performance claim with only 21B activated parameters is load-bearing for the paper's contribution. The manuscript must explicitly document the evaluation protocol (number of shots, prompt templates, decoding strategy, and temperature) applied identically to DeepSeek-V2 and all baselines (including DeepSeek-67B and other open-source models). Without this, it remains possible that reported gains reflect differences in evaluation setup rather than the MLA or DeepSeekMoE innovations.
Authors: We agree that documenting a uniform evaluation protocol is essential to substantiate the performance claims. In the revised manuscript, we have added a new subsection (Section 4.1) that fully specifies the protocol: all models (DeepSeek-V2, DeepSeek-67B, and other open-source baselines) were evaluated using the LM Evaluation Harness with identical prompt templates, 0-shot prompting for the majority of benchmarks (5-shot only where standard practice requires it, e.g., certain MMLU subsets), greedy decoding (temperature = 0, no top-p or nucleus sampling), and the same maximum generation length. This ensures the reported gains are attributable to the architectural contributions rather than evaluation discrepancies. revision: yes
-
Referee: Training and efficiency claims (abstract and §3): The 42.5% training cost reduction and 93.3% KV cache reduction are presented as direct consequences of the architecture. The paper should provide the precise calculation method (e.g., total FLOPs, wall-clock time on specified hardware, or token throughput) and confirm that the comparison to DeepSeek-67B normalizes for the 8.1T token corpus and any differences in training infrastructure.
Authors: We thank the referee for requesting greater precision on these figures. In the revision, we have expanded Section 3 and the abstract with explicit calculation details. The 93.3% KV cache reduction is obtained by comparing the memory footprint of MLA's compressed latent vector (dimension d_c per head) against the full KV cache of standard multi-head attention (2 * num_heads * head_dim per token); the percentage is computed as (1 - (d_c / (2 * num_heads * head_dim))) * 100%. The 42.5% training cost reduction is derived from total FLOPs required to process the identical 8.1T token corpus: DeepSeekMoE activates only 21B parameters per token versus 67B for the dense baseline, yielding lower effective compute per token. Both models were trained on the same 8.1T tokens using the same A100 GPU cluster; costs are normalized by aggregate FLOPs with no differences in infrastructure or data. Wall-clock time and token throughput measurements on identical hardware are also now reported for transparency. revision: yes
Circularity Check
No circularity: empirical model presentation with measured results
full rationale
The paper introduces DeepSeek-V2 as an MoE model with MLA and DeepSeekMoE architectures, describes its training on an 8.1T-token corpus followed by SFT/RL, and reports direct empirical measurements of performance, training cost savings (42.5%), KV cache reduction (93.3%), and throughput gains. No mathematical derivation chain, first-principles predictions, or fitted parameters are claimed; results are obtained from actual pretraining and evaluation runs. Comparisons to DeepSeek-67B and other models are presented as measured outcomes rather than outputs derived from the model's own inputs or self-citations. The architecture descriptions and efficiency claims rest on the explicit design choices (latent attention compression, sparse MoE routing) and observed hardware metrics, without reduction to prior fitted constants or self-referential definitions. This is a standard empirical systems paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- Expert count and routing hyperparameters
axioms (1)
- domain assumption Transformer attention and feed-forward layers remain effective when sparsified via expert routing
invented entities (2)
-
Multi-head Latent Attention (MLA)
no independent evidence
-
DeepSeekMoE
no independent evidence
Forward citations
Cited by 60 Pith papers
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
-
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBench-X benchmark shows task category predicts LLM kernel correctness better than method choice, iterative refinement trades performance for higher success rates, and correctness does not ensure efficiency gains...
-
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization rema...
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
-
DPC: A Distributed Page Cache over CXL
DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
-
The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
-
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
-
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
-
EMO: Frustratingly Easy Progressive Training of Extendable MoE
EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...
-
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay
NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better samp...
-
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
-
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...
-
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Mixture of Heterogeneous Grouped Experts for Language Modeling
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
-
Multi-LLM Token Filtering and Routing for Sequential Recommendation
MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
-
Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Local Platt scaling on three fine-grained confidence scores reduces calibration error for LLM-based automated code revision across tasks and models compared to global scaling alone.
-
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
-
WIO: Upload-Enabled Computational Storage on CXL SSDs
WIO enables reversible computational storage on CXL SSDs via WebAssembly actors and zero-copy migration, achieving up to 2x throughput and 3.75x lower write latency.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...
-
Why Attend to Everything? Focus is the Key
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
-
mHC: Manifold-Constrained Hyper-Connections
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
Reference graph
Works this paper leans on
-
[1]
AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
work page 2024
-
[3]
Anthropic. Introducing Claude , 2023. URL https://www.anthropic.com/index/introducing-claude
work page 2023
-
[7]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[11]
D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066, 2024. URL https://doi.org/10.48550/arXiv.2401.06066
work page internal anchor Pith review doi:10.48550/arxiv.2401.06066 2024
-
[12]
T. Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning, 2023
work page 2023
-
[13]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. URL https://doi.org/10.48550/arXiv.2401.02954
work page internal anchor Pith review doi:10.48550/arxiv.2401.02954 2024
-
[16]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961
work page internal anchor Pith review arXiv 2021
-
[17]
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile : An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[18]
Introducing gemini: our largest and most capable ai model, 2023
Google. Introducing gemini: our largest and most capable ai model, 2023. URL https://blog.google/technology/ai/google-gemini-ai/
work page 2023
-
[19]
A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution, 2024
work page 2024
-
[22]
High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[23]
Kvquant: Towards 10 million context length llm inference with kv cache quantization,
C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami. Kvquant: Towards 10 million context length LLM inference with KV cache quantization. CoRR, abs/2401.18079, 2024. URL https://doi.org/10.48550/arXiv.2401.18079
-
[25]
arXiv preprint arXiv:2305.08322 , year=
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval : A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023
-
[29]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[31]
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021 . OpenReview.net, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb
work page 2021
- [32]
-
[33]
W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang. Ccpm: A chinese classical poetry matching dataset, 2021
work page 2021
-
[36]
Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b
work page 2024
-
[37]
OpenAI. Introducing ChatGPT , 2022. URL https://openai.com/blog/chatgpt
work page 2022
-
[38]
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [39]
-
[42]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020
work page 2020
-
[43]
C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 8583--8595, 2021. URL https://proceedings.neurips.cc/paper/202...
work page 2021
-
[44]
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019
work page 2019
-
[46]
N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[47]
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017 . OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg
work page 2017
-
[48]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568: 0 127063, 2024
work page 2024
-
[49]
K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension, 2019
work page 2019
-
[51]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, . Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[53]
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023
work page 2023
-
[57]
Y. Zhao, C. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. CoRR, abs/2310.19102, 2023. URL https://doi.org/10.48550/arXiv.2310.19102
-
[59]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
work page 2023
-
[61]
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[63]
Zero bubble pipeline parallelism.arXiv preprint arXiv:2401.10241, 2023
Zero Bubble Pipeline Parallelism , author=. arXiv preprint arXiv:2401.10241 , year=
-
[64]
The Eleventh International Conference on Learning Representations,
Freda i and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =
work page 2023
-
[65]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. arXiv preprint arXiv:2403.07974 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[66]
Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=
work page 2024
-
[67]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. arXiv preprint arXiv:2305.13245 , year=
work page internal anchor Pith review arXiv
- [68]
-
[69]
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct , author=. arXiv preprint arXiv:2308.09583 , year=
-
[70]
Zhibin Gou and Zhihong Shao and Yeyun Gong and Yelong Shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.17452 , eprinttype =. 2309.17452 , timestamp =
-
[71]
Wenhu Chen and Xueguang Ma and Xinyi Wang and William W. Cohen , title =. CoRR , volume =. 2022 , url =. doi:10.48550/ARXIV.2211.12588 , eprinttype =. 2211.12588 , timestamp =
work page internal anchor Pith review doi:10.48550/arxiv.2211.12588 2022
-
[72]
International Conference on Machine Learning,
Luyu Gao and Aman Madaan and Shuyan Zhou and Uri Alon and Pengfei Liu and Yiming Yang and Jamie Callan and Graham Neubig , editor =. International Conference on Machine Learning,. 2023 , url =
work page 2023
-
[73]
Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou , title =. NeurIPS , year =
-
[74]
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,
Swaroop Mishra and Matthew Finlayson and Pan Lu and Leonard Tang and Sean Welleck and Chitta Baral and Tanmay Rajpurohit and Oyvind Tafjord and Ashish Sabharwal and Peter Clark and Ashwin Kalyan , editor =. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,. 2022 , url =. doi:10.18653/V1/2022.EMNLP-MAIN.392 , timestamp =
-
[75]
arXiv preprint arXiv:2309.05653 , year=
Xiang Yue and Xingwei Qu and Ge Zhang and Yao Fu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.05653 , eprinttype =. 2309.05653 , timestamp =
-
[76]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Longhui Yu and Weisen Jiang and Han Shi and Jincheng Yu and Zhengying Liu and Yu Zhang and James T. Kwok and Zhenguo Li and Adrian Weller and Weiyang Liu , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.12284 , eprinttype =. 2309.12284 , timestamp =
work page internal anchor Pith review doi:10.48550/arxiv.2309.12284 2023
-
[77]
T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1147
- [78]
-
[79]
Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
- [80]
-
[81]
HAI-LLM: 高效且轻量的大模型训练工具 , author =
-
[82]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Megatron-lm: Training multi-billion parameter language models using model parallelism , author=. arXiv preprint arXiv:1909.08053 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[83]
Efficient large-scale language model training on gpu clusters using megatron-lm , author=. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=
-
[84]
Proceedings of Machine Learning and Systems , volume=
Reducing activation recomputation in large transformer models , author=. Proceedings of Machine Learning and Systems , volume=
-
[85]
and Ermon, Stefano and Rudra, Atri and R
Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. Flash. Advances in Neural Information Processing Systems , year=
-
[86]
Dao, Tri , year=. Flash
-
[87]
Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
-
[88]
Advances in neural information processing systems , volume=
Attention is all you need , author=. Advances in neural information processing systems , volume=
-
[89]
Zero: Memory optimizations toward training trillion parameter models , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=
work page 2020
-
[90]
CCPM: A Chinese Classical Poetry Matching Dataset , author=. 2021 , eprint=
work page 2021
-
[91]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. 2018 , eprint=
work page 2018
- [92]
- [93]
-
[94]
Introducing Gemini: our largest and most capable AI model , author =
-
[95]
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension , author=. 2019 , eprint=
work page 2019
-
[96]
A Span-Extraction Dataset for C hinese Machine Reading Comprehension
Cui, Yiming and Liu, Ting and Che, Wanxiang and Xiao, Li and Chen, Zhipeng and Ma, Wentao and Wang, Shijin and Hu, Guoping. A Span-Extraction Dataset for C hinese Machine Reading Comprehension. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (E...
-
[97]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=
work page 2019
-
[98]
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? , author=. 2023 , eprint=
work page 2023
-
[99]
Measuring Massive Multitask Language Understanding
Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[100]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Challenging big-bench tasks and whether chain-of-thought can solve them , author=. arXiv preprint arXiv:2210.09261 , year=
work page internal anchor Pith review arXiv
-
[101]
Program Synthesis with Large Language Models
Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[102]
CLUE : A C hinese Language Understanding Evaluation Benchmark
Liang Xu and Hai Hu and Xuanwei Zhang and Lu Li and Chenjie Cao and Yudong Li and Yechen Xu and Kai Sun and Dian Yu and Cong Yu and Yin Tian and Qianqian Dong and Weitang Liu and Bo Shi and Yiming Cui and Junyi Li and Jun Zeng and Rongzhao Wang and Weijian Xie and Yanting Li and Yina Patterson and Zuoyu Tian and Yiwen Zhang and He Zhou and Shaoweihua Liu ...
-
[103]
Li, Haonan and Zhang, Yixuan and Koto, Fajri and Yang, Yifei and Zhao, Hai and Gong, Yeyun and Duan, Nan and Baldwin, Timothy , journal=
-
[104]
Chujie Zheng and Minlie Huang and Aixin Sun , editor =. ChID:. Proceedings of the 57th Conference of the Association for Computational Linguistics,. 2019 , url =. doi:10.18653/V1/P19-1075 , timestamp =
-
[105]
doi:10.18653/v1/D17-1082 , pages =
Guokun Lai and Qizhe Xie and Hanxiao Liu and Yiming Yang and Eduard H. Hovy , editor =. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,. 2017 , url =. doi:10.18653/V1/D17-1082 , timestamp =
-
[106]
Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner , editor =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,. 2019 , url =. doi:10.18653/V1/N19-1246 , timestamp =
-
[107]
Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and others , journal=
-
[108]
LLaMA: Open and Efficient Foundation Language Models
Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[109]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton. Llama 2: Open Foundation and Fine-Tuned Chat Models , journal =. 2023 , url =. doi:10.48550/arXiv.2307.09288 , eprinttype =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.