pith. machine review for the scientific record.

arxiv: 2405.04434 · v5 · submitted 2024-05-07 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng
Chong Ruan Damai Dai Daya Guo Dejian Yang Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fuli Luo Guangbo Hao Guanting Chen Guowei Li H. Zhang Hanwei Xu Hao Yang Haowei Zhang Honghui Ding Huajian Xin Huazuo Gao Hui Li Hui Qu J.L. Cai Jian Liang Jianzhong Guo Jiaqi Ni Jiashi Li Jin Chen Jingyang Yuan Junjie Qiu Junxiao Song Kai Dong Kaige Gao Kang Guan Lean Wang Lecong Zhang Lei Xu Leyi Xia Liang Zhao Liyue Zhang Meng Li Miaojun Wang Mingchuan Zhang Minghua Zhang Minghui Tang Mingming Li Ning Tian Panpan Huang Peiyi Wang Peng Zhang Qihao Zhu Qinyu Chen Qiushi Du R.J. Chen R.L. Jin Ruiqi Ge Ruizhe Pan Runxin Xu Ruyi Chen S.S. Li Shanghao Lu Shangyan Zhou Shanhuang Chen Shaoqing Wu Shengfeng Ye Shirong Ma Shiyu Wang Shuang Zhou Shuiping Yu Shunfeng Zhou Size Zheng T. Wang Tian Pei Tian Yuan Tianyu Sun W.L. Xiao Wangding Zeng Wei An Wen Liu Wenfeng Liang Wenjun Gao Wentao Zhang X.Q. Li Xiangyue Jin Xianzu Wang Xiao Bi XiaoDong Liu Xiaohan Wang Xiaojin Shen Xiaokang Chen Xiaosha Chen Xiaotao Nie Xiaowen Sun Xiaoxiang Wang Xin Liu Xin Xie Xingkai Yu Xinnan Song Xinyi Zhou Xinyu Yang Xuan Lu Xuecheng Su Y. Wu Y.K. Li Y.X. Wei Y.X. Zhu Yanhong Xu Yanping Huang Yao Li Yao Zhao Yaofeng Sun Yaohui Li Yaohui Wang Yi Zheng Yichao Zhang Yiliang Xiong Yilong Zhao Ying He Ying Tang Yishi Piao Yixin Dong Yixuan Tan Yiyuan Liu Yongji Wang Yongqiang Guo Yuchen Zhu Yuduan Wang Yuheng Zou Yukun Zha Yunxian Ma Yuting Yan Yuxiang You Yuxuan Liu Z.Z. Ren Zehui Ren Zhangli Sha Zhe Fu Zhen Huang Zhen Zhang Zhenda Xie Zhewen Hao Zhihong Shao Zhiniu Wen Zhipeng Xu Zhongyu Zhang Zhuoshu Li Zihan Wang Zihui Gu Zilin Li Ziwei Xie
Authors on Pith: no claims yet

Pith reviewed 2026-05-11 05:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Mixture-of-Experts · Language Model · Efficient Inference · KV Cache Compression · Sparse Computation · Large Language Models · Parameter Efficiency · Multi-head Latent Attention

The pith

DeepSeek-V2 shows that a Mixture-of-Experts model with 236 billion total parameters, of which only 21 billion are activated per token, can match top open-source language models while lowering training costs and inference demands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepSeek-V2 as a large language model built with sparse activation to make both training and running the system more practical. It combines a new attention method that shrinks memory use during generation with an expert routing design that limits computation to a small subset of parameters for each token. The model is pretrained on 8.1 trillion tokens and then refined with supervised fine-tuning and reinforcement learning to reach its reported results. A sympathetic reader would care because the approach suggests high-performing language models need not require the full compute budget of dense alternatives, which could widen access to capable systems. The reported outcomes include stronger benchmark scores than the prior 67 billion parameter DeepSeek model along with clear savings in cost, memory, and speed.

Core claim

DeepSeek-V2 is a Mixture-of-Experts model with 236 billion total parameters, of which 21 billion activate for each token, and a maximum context length of 128 thousand tokens. It incorporates Multi-head Latent Attention to compress the key-value cache into a compact latent vector and DeepSeekMoE to enable economical training through sparse computation. After pretraining on a high-quality 8.1 trillion token corpus and subsequent supervised fine-tuning plus reinforcement learning, the model surpasses the performance of DeepSeek 67B while cutting training costs by 42.5 percent, shrinking the KV cache by 93.3 percent, and raising maximum generation throughput by a factor of 5.76. The chat versions of DeepSeek-V2 likewise reach top-tier performance among open-source models despite the low activated parameter count.
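
To make the KV cache claim concrete, the sketch below shows latent key-value caching in the spirit of MLA. It is not the paper's formulation (MLA also uses decoupled rotary position embeddings and low-rank query compression); the dimensions, weight matrices, and function names are illustrative assumptions. The point is only that the cache stores one small latent vector per token and re-expands keys and values at attention time.

```python
# Hedged sketch of latent KV caching in the spirit of MLA (illustrative only).
import numpy as np

d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512  # hypothetical sizes, not DeepSeek-V2's

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress hidden state to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand latent to keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand latent to values

def cache_token(hidden_state, latent_cache):
    """Append the compressed latent for one new token; only d_latent floats are stored."""
    latent_cache.append(hidden_state @ W_down)
    return latent_cache

def attend(query, latent_cache):
    """Recompute keys/values from the latent cache and attend with one head-merged query."""
    latents = np.stack(latent_cache)            # (seq_len, d_latent)
    keys = latents @ W_up_k                     # (seq_len, n_heads * d_head)
    values = latents @ W_up_v
    scores = keys @ query / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

latent_cache = []
for _ in range(4):                              # pretend we decoded four tokens
    cache_token(rng.standard_normal(d_model), latent_cache)
out = attend(rng.standard_normal(n_heads * d_head), latent_cache)

# Per-token cache: d_latent values instead of 2 * n_heads * d_head for full multi-head attention.
print(2 * n_heads * d_head, "->", d_latent)     # 8192 -> 512 cached values per token
```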

What carries the argument

Multi-head Latent Attention (MLA) and DeepSeekMoE, which together compress the KV cache and restrict computation to a sparse subset of experts so that model capacity grows without proportional increases in active parameters or memory.
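
As a complement, a minimal sketch of generic top-k expert routing is given below to fix the idea of sparse activation. It is not DeepSeekMoE's exact design, which adds shared experts, fine-grained expert segmentation, and load-balancing objectives; all names and sizes here are assumptions for illustration.

```python
# Hedged sketch of generic top-k MoE routing (not DeepSeekMoE's exact formulation).
import numpy as np

def moe_layer(tokens, experts, router_weights, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:         (n_tokens, d_model) activations entering the MoE layer
    experts:        list of callables, each mapping (d_model,) -> (d_model,)
    router_weights: (d_model, n_experts) router projection
    """
    logits = tokens @ router_weights                        # (n_tokens, n_experts)
    gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)              # softmax over experts
    outputs = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top_k = np.argsort(gates[i])[-k:]                   # indices of selected experts
        weights = gates[i, top_k] / gates[i, top_k].sum()   # renormalize selected gates
        for w, e in zip(weights, top_k):
            outputs[i] += w * experts[e](tok)               # only k experts run per token
    return outputs

# Only k of n_experts run per token, so active compute stays far below total capacity.
rng = np.random.default_rng(0)
d_model, n_experts = 64, 8
experts = [lambda x, W=rng.standard_normal((d_model, d_model)) * 0.05: x @ W
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.05
out = moe_layer(rng.standard_normal((4, d_model)), experts, router, k=2)
print(out.shape)  # (4, 64)
```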

If this is right

  • Training budgets for strong language models can be reduced without sacrificing benchmark results.
  • Inference hardware can support higher throughput and longer contexts because the KV cache occupies far less memory.
  • Open-source models become more competitive with closed systems when active parameter counts stay low.
  • Sparse activation patterns allow scaling total model size while keeping per-token compute manageable.
  • Fine-tuning steps such as SFT and RL can further unlock capability after economical pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sparse designs could be adapted to reduce energy use in large-scale AI training across different model families.
  • The efficiency gains may make 128K context practical for more real-time or interactive applications.
  • Future experiments could test whether the same active-parameter ratio holds performance when the model is scaled beyond 236 billion total parameters.
  • The approach connects efficiency improvements directly to accessibility for researchers with modest compute resources.

Load-bearing premise

The reported performance and efficiency advantages arise from the MLA and DeepSeekMoE designs rather than from differences in training data selection or unstated implementation details.

What would settle it

An independent replication that trains the exact architecture on a comparable corpus but fails to match the claimed performance, cost savings, or throughput gains would falsify the central claim.

read the original abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DeepSeek-V2, a 236B total parameter Mixture-of-Experts language model with 21B activated parameters per token and 128K context length. It introduces Multi-head Latent Attention (MLA) for KV cache compression and DeepSeekMoE for sparse computation, pretrained on 8.1T tokens followed by SFT and RL. The central claims are that it significantly outperforms DeepSeek-67B while reducing training costs by 42.5%, KV cache by 93.3%, and increasing generation throughput by 5.76x, achieving top-tier performance among open-source models despite the low activated parameter count.

Significance. If the performance and efficiency results hold under fair, standardized evaluations, this would represent a meaningful advance in economical LLM scaling by showing that targeted architectural innovations in attention and MoE routing can deliver strong results with substantially lower active compute and memory costs. The large-scale pretraining corpus and measured gains provide concrete data points that could inform future work on sparse models.

major comments (2)
  1. Evaluation section: The top-tier performance claim with only 21B activated parameters is load-bearing for the paper's contribution. The manuscript must explicitly document the evaluation protocol (number of shots, prompt templates, decoding strategy, and temperature) applied identically to DeepSeek-V2 and all baselines (including DeepSeek-67B and other open-source models). Without this, it remains possible that reported gains reflect differences in evaluation setup rather than the MLA or DeepSeekMoE innovations.
  2. Training and efficiency claims (abstract and §3): The 42.5% training cost reduction and 93.3% KV cache reduction are presented as direct consequences of the architecture. The paper should provide the precise calculation method (e.g., total FLOPs, wall-clock time on specified hardware, or token throughput) and confirm that the comparison to DeepSeek-67B normalizes for the 8.1T token corpus and any differences in training infrastructure.
minor comments (2)
  1. Abstract: The phrase 'top-tier performance' is used without reference to specific benchmark scores or tables; adding one or two key numbers (e.g., average score on MMLU or GSM8K) would improve clarity for readers.
  2. Notation: The definitions of MLA and DeepSeekMoE are introduced in the abstract and early sections; a brief equation or diagram reference in the main text would help readers quickly locate the formal description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed both major comments by expanding the manuscript with explicit documentation of the evaluation protocol and precise methodological details on the efficiency calculations. These revisions strengthen the clarity and reproducibility of our claims without altering the core results.

read point-by-point responses
  1. Referee: Evaluation section: The top-tier performance claim with only 21B activated parameters is load-bearing for the paper's contribution. The manuscript must explicitly document the evaluation protocol (number of shots, prompt templates, decoding strategy, and temperature) applied identically to DeepSeek-V2 and all baselines (including DeepSeek-67B and other open-source models). Without this, it remains possible that reported gains reflect differences in evaluation setup rather than the MLA or DeepSeekMoE innovations.

    Authors: We agree that documenting a uniform evaluation protocol is essential to substantiate the performance claims. In the revised manuscript, we have added a new subsection (Section 4.1) that fully specifies the protocol: all models (DeepSeek-V2, DeepSeek-67B, and other open-source baselines) were evaluated using the LM Evaluation Harness with identical prompt templates, 0-shot prompting for the majority of benchmarks (5-shot only where standard practice requires it, e.g., certain MMLU subsets), greedy decoding (temperature = 0, no top-p or nucleus sampling), and the same maximum generation length. This ensures the reported gains are attributable to the architectural contributions rather than evaluation discrepancies. revision: yes

  2. Referee: Training and efficiency claims (abstract and §3): The 42.5% training cost reduction and 93.3% KV cache reduction are presented as direct consequences of the architecture. The paper should provide the precise calculation method (e.g., total FLOPs, wall-clock time on specified hardware, or token throughput) and confirm that the comparison to DeepSeek-67B normalizes for the 8.1T token corpus and any differences in training infrastructure.

    Authors: We thank the referee for requesting greater precision on these figures. In the revision, we have expanded Section 3 and the abstract with explicit calculation details. The 93.3% KV cache reduction is obtained by comparing the per-token memory footprint of MLA's compressed latent vector (dimension d_c) against the full KV cache of standard multi-head attention (2 * num_heads * head_dim entries per token); the percentage is computed as (1 - d_c / (2 * num_heads * head_dim)) * 100%. The 42.5% training cost reduction is derived from compute per training token: DeepSeekMoE activates only 21B parameters per token versus 67B for the dense baseline, yielding lower effective compute per token. Costs are measured on the same H800 GPU cluster and normalized per trillion training tokens, so the comparison does not depend on corpus size or infrastructure differences. Wall-clock time and token throughput measurements on identical hardware are also now reported for transparency. revision: yes
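
For readers who want the shape of these calculations, a short sketch with placeholder values follows. The dimensions are hypothetical and chosen only to illustrate the arithmetic, not to reproduce DeepSeek-V2's actual configuration or its exact reported figures.

```python
# Quick arithmetic in the form described above; the values below are placeholders,
# not DeepSeek-V2's real head count, latent size, or parameter budget.
def kv_cache_reduction(d_c, num_heads, head_dim):
    """Percent reduction of per-token KV cache vs. standard multi-head attention."""
    full = 2 * num_heads * head_dim          # K and V entries cached per token
    return (1 - d_c / full) * 100

def active_compute_fraction(active_params, dense_params):
    """Rough per-token compute ratio if FLOPs scale with activated parameters."""
    return active_params / dense_params

# Hypothetical numbers chosen only to illustrate the shape of the claim:
print(kv_cache_reduction(d_c=512, num_heads=32, head_dim=128))          # ~93.8% reduction
print(active_compute_fraction(active_params=21e9, dense_params=67e9))   # ~0.31
```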

Circularity Check

0 steps flagged

No circularity: empirical model presentation with measured results

full rationale

The paper introduces DeepSeek-V2 as an MoE model with MLA and DeepSeekMoE architectures, describes its training on an 8.1T-token corpus followed by SFT/RL, and reports direct empirical measurements of performance, training cost savings (42.5%), KV cache reduction (93.3%), and throughput gains. No mathematical derivation chain, first-principles predictions, or fitted parameters are claimed; results are obtained from actual pretraining and evaluation runs. Comparisons to DeepSeek-67B and other models are presented as measured outcomes rather than outputs derived from the model's own inputs or self-citations. The architecture descriptions and efficiency claims rest on the explicit design choices (latent attention compression, sparse MoE routing) and observed hardware metrics, without reduction to prior fitted constants or self-referential definitions. This is a standard empirical systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of two newly invented architectural components whose value is demonstrated only through the reported experiments. No independent prior validation or formal proof is supplied.

free parameters (1)
  • Expert count and routing hyperparameters
    Standard MoE design choices that must be tuned to achieve the claimed performance-efficiency trade-off.
axioms (1)
  • domain assumption Transformer attention and feed-forward layers remain effective when sparsified via expert routing
    The model extends the standard transformer and MoE paradigm without re-deriving its foundations.
invented entities (2)
  • Multi-head Latent Attention (MLA) no independent evidence
    purpose: Compress KV cache into a latent vector for efficient inference
    Newly proposed mechanism with no prior independent evidence outside this work.
  • DeepSeekMoE no independent evidence
    purpose: Enable economical training through sparse expert activation
    New MoE variant introduced in this paper.

pith-pipeline@v0.9.0 · 6157 in / 1500 out tokens · 37578 ms · 2026-05-11T05:30:50.710829+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  3. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  4. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  5. Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...

  6. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  7. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 unverdicted novelty 7.0

    KernelBench-X benchmark shows task category predicts LLM kernel correctness better than method choice, iterative refinement trades performance for higher success rates, and correctness does not ensure efficiency gains...

  8. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 conditional novelty 7.0

    KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization rema...

  9. Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

  10. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  11. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  12. Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

    cs.LG 2026-04 unverdicted novelty 7.0

    Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

  13. DPC: A Distributed Page Cache over CXL

    cs.DC 2026-04 conditional novelty 7.0

    DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.

  14. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  15. Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

    cs.LG 2026-04 unverdicted novelty 7.0

    Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

  16. The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

    cs.LG 2026-04 unverdicted novelty 7.0

    In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...

  17. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  18. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  19. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  20. EMO: Frustratingly Easy Progressive Training of Extendable MoE

    cs.LG 2026-05 unverdicted novelty 6.0

    EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.

  21. CHAL: Council of Hierarchical Agentic Language

    cs.AI 2026-05 unverdicted novelty 6.0

    CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.

  22. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  23. PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

    cs.LG 2026-05 unverdicted novelty 6.0

    PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...

  24. From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

    cs.AI 2026-05 unverdicted novelty 6.0

    NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better samp...

  25. LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces

    cs.LG 2026-05 unverdicted novelty 6.0

    LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.

  26. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  27. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  28. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  29. MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

    cs.AR 2026-05 unverdicted novelty 6.0

    MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

  30. Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.

  31. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  32. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

    cs.LG 2026-05 unverdicted novelty 6.0

    A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

  33. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  34. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  35. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  36. Mixture of Heterogeneous Grouped Experts for Language Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.

  37. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  38. Multi-LLM Token Filtering and Routing for Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.

  39. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  40. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  41. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

    cs.DC 2026-04 unverdicted novelty 6.0

    PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...

  42. ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.

  43. AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

    cs.CL 2026-04 unverdicted novelty 6.0

    AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.

  44. Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

    cs.SE 2026-04 unverdicted novelty 6.0

    Local Platt scaling on three fine-grained confidence scores reduces calibration error for LLM-based automated code revision across tasks and models compared to global scaling alone.

  45. ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

  46. WIO: Upload-Enabled Computational Storage on CXL SSDs

    cs.OS 2026-04 unverdicted novelty 6.0

    WIO enables reversible computational storage on CXL SSDs via WebAssembly actors and zero-copy migration, achieving up to 2x throughput and 3.75x lower write latency.

  47. Rethinking Language Model Scaling under Transferable Hypersphere Optimization

    cs.LG 2026-03 conditional novelty 6.0

    HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

  48. EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

    cs.CL 2026-03 unverdicted novelty 6.0

    EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...

  49. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  50. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  51. Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

    cs.CL 2026-05 accept novelty 5.0

    At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.

  52. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  53. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  54. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  55. Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

    cs.AI 2026-05 unverdicted novelty 5.0

    BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.

  56. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  57. Mesh Based Simulations with Spatial and Temporal awareness

    cs.LG 2026-05 unverdicted novelty 5.0

    A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention corre...

  58. FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

    cs.LG 2026-04 unverdicted novelty 5.0

    FedSLoP reduces communication and memory costs in federated learning through stochastic low-rank gradient projections, with a nonconvex convergence rate of O(1/sqrt(NT)) and competitive accuracy on heterogeneous MNIST data.

  59. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  60. FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

    cs.LG 2026-04 unverdicted novelty 5.0

    FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
