pith. machine review for the scientific record.

arxiv: 2404.06395 · v3 · submitted 2024-04-09 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:52 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: small language models · MiniCPM · learning rate scheduler · scaling laws · model efficiency · data scaling · language model training

The pith

MiniCPM shows that 1.2B and 2.4B parameter models can match the performance of 7B to 13B large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MiniCPM, a family of small language models with 1.2 billion and 2.4 billion non-embedding parameters. The authors demonstrate that these models perform on par with much larger 7B to 13B models across various benchmarks. They achieve this through careful model scaling experiments and a new Warmup-Stable-Decay learning rate scheduler that supports efficient data scaling. The work also derives a compute-optimal data-to-model ratio higher than the previously accepted Chinchilla optimum. By releasing variants optimized for different tasks, the authors highlight practical ways to build capable yet resource-efficient language models.

Core claim

The central discovery is that the MiniCPM 1.2B and 2.4B models not only lead in their size categories but also achieve capabilities comparable to 7B-13B LLMs. This is enabled by extensive model wind tunnel experiments for stable scaling and by the Warmup-Stable-Decay learning rate scheduler, which facilitates continuous training and reveals training dynamics that support a higher optimal data-model ratio than the Chinchilla optimum. The paper further shows scalability in both model and data dimensions and introduces extended family members such as MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K that maintain strong performance.

What carries the argument

The Warmup-Stable-Decay (WSD) learning rate scheduler, which divides training into warmup, stable, and decay phases to enable continuous training, domain adaptation, and efficient exploration of data-model scaling laws without full retraining.
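The three phases can be sketched as a schedule function. The phase fractions and the exponential shape of the decay below are illustrative choices, not the paper's exact settings:

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.1, min_lr_ratio=0.1):
    """Warmup-Stable-Decay sketch: linear warmup to peak_lr, a long
    constant plateau, then a short final anneal toward min_lr_ratio * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # warmup phase
        return peak_lr * step / warmup_steps
    if step < stable_end:                         # stable phase
        return peak_lr
    # decay phase: exponential anneal (one plausible shape for the decay)
    progress = (step - stable_end) / decay_steps
    return peak_lr * (min_lr_ratio ** progress)
```

Because the plateau runs at a constant learning rate, any stable-phase checkpoint can be branched into a fresh decay run, which is what makes continuous training and cheap scaling-law sweeps possible without retraining from scratch.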

If this is right

  • Models of this size can be deployed more widely due to lower computational requirements while matching larger model performance.
  • Training strategies allow for higher data-to-compute ratios than previously thought optimal.
  • The approach scales to larger models and different architectures like mixture-of-experts.
  • Specific variants handle tasks such as preference optimization and long-context processing effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future research could prioritize optimizing small models over scaling to trillions of parameters for many applications.
  • On-device or edge AI deployments become more feasible with these efficient models.
  • Independent verification on held-out data would strengthen claims of parity.
  • The WSD scheduler might apply to training other types of neural networks beyond language models.

Load-bearing premise

The benchmarks used provide an unbiased measure of model capabilities without advantages from overlapping training data or post-training adjustments.

What would settle it

Evaluating the models on a new set of tasks designed to have no overlap with any training data and observing whether their performance remains comparable to 7B-13B models.

read the original abstract

The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive the much higher compute optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cementing MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniCPM, focusing on 1.2B and 2.4B non-embedding parameter small language models that are claimed to excel in their size categories and achieve performance on par with 7B-13B LLMs. It details scalable training via extensive model wind tunnel experiments for stable model scaling and a Warmup-Stable-Decay (WSD) learning rate scheduler for data scaling and continuous training, including an analysis of WSD training dynamics. From these, the authors derive a compute-optimal data-model ratio substantially higher than the Chinchilla optimum. The work also presents MiniCPM variants (DPO, MoE, 128K) and releases the models publicly.

Significance. If the performance parity and scaling claims hold under rigorous controls, this would be a notable contribution to efficient LLM research by demonstrating that carefully trained small models can match much larger ones, lowering computational barriers and providing practical insights into data-model scaling. The public model release and the WSD scheduler's support for continuous adaptation add reproducibility value and could influence future training protocols. The higher data-model ratio challenges existing scaling laws and merits follow-up, though its impact hinges on clearer validation.

major comments (2)
  1. [Abstract and Results] Abstract and main results: The central claim that the 1.2B and 2.4B variants 'demonstrate capabilities on par with 7B-13B LLMs' is load-bearing for the paper's contribution but is presented without error bars on benchmark scores, exact numerical baseline comparisons to specific 7B-13B models, or details on training data exclusion to rule out overlap. This omission makes independent verification of parity difficult and weakens the claim given potential selection effects.
  2. [Scaling experiments] Scaling experiments section: The derivation of the compute-optimal data-model ratio (higher than Chinchilla) is obtained directly from the authors' WSD-based runs; while not circular by construction, the section lacks explicit controls such as sensitivity analysis to run hyperparameters or comparison against independent scaling datasets, which is needed to establish the ratio as a general finding rather than run-specific.
minor comments (2)
  1. [Method] The description of the WSD scheduler would be clearer with an explicit figure showing the learning rate curve over training steps and how the stable and decay phases enable continuous training.
  2. [Results] Table or figure captions for benchmark results should include the exact model sizes and training tokens used for all compared baselines to facilitate direct comparison.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where they strengthen the work without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and main results: The central claim that the 1.2B and 2.4B variants 'demonstrate capabilities on par with 7B-13B LLMs' is load-bearing for the paper's contribution but is presented without error bars on benchmark scores, exact numerical baseline comparisons to specific 7B-13B models, or details on training data exclusion to rule out overlap. This omission makes independent verification of parity difficult and weakens the claim given potential selection effects.

    Authors: We agree that greater precision in the presentation of results would improve verifiability. In the revised manuscript we will add a dedicated comparison table listing exact scores for MiniCPM-1.2B and MiniCPM-2.4B against named 7B–13B baselines (LLaMA-2-7B, LLaMA-2-13B, Mistral-7B, Qwen-7B, etc.) on MMLU, BBH, HumanEval and GSM8K. We will also expand the data section to describe the deduplication pipeline and contamination checks performed against the evaluation benchmarks. Because our primary training runs used a single seed for computational efficiency, we cannot retroactively supply full error bars; we will instead report variance observed in auxiliary short runs and explicitly note this limitation. revision: partial
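The contamination check the authors commit to can be approximated with a verbatim n-gram overlap test, a common heuristic; the window size and whitespace tokenization here are illustrative simplifications:

```python
def ngram_set(text, n=13):
    """All n-token windows of a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc, eval_example, n=13):
    """Flag an eval example if any of its n-grams appears verbatim in a
    training document. Production pipelines normalize punctuation and
    hash the n-grams at scale, but the overlap logic is the same."""
    return bool(ngram_set(train_doc, n) & ngram_set(eval_example, n))
```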

  2. Referee: [Scaling experiments] Scaling experiments section: The derivation of the compute-optimal data-model ratio (higher than Chinchilla) is obtained directly from the authors' WSD-based runs; while not circular by construction, the section lacks explicit controls such as sensitivity analysis to run hyperparameters or comparison against independent scaling datasets, which is needed to establish the ratio as a general finding rather than run-specific.

    Authors: We accept that additional robustness checks are warranted. We will insert a sensitivity-analysis subsection that varies batch size, peak learning rate and decay length within the WSD schedule and shows the optimal data-to-model ratio remains stable. A full comparison against independent scaling corpora is not feasible given the compute budget required for retraining on entirely new datasets; we will therefore qualify the claim as holding under our experimental regime while discussing the conditions under which the higher ratio may generalize. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical scaling runs

full rationale

The paper derives its data-model scaling ratio and performance claims directly from new wind-tunnel experiments and WSD-scheduler runs on the MiniCPM models themselves. These are presented as fresh empirical observations rather than reductions of prior fitted constants or self-cited theorems. No equations equate a prediction to its own input by construction, and no load-bearing uniqueness result is imported from the authors' earlier work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard LLM scaling assumptions and the validity of the chosen benchmarks; the main empirical additions are the scheduler and the observed ratio.

free parameters (1)
  • compute-optimal data-model ratio
    Fitted from the authors' WSD-enabled scaling runs and reported as higher than the Chinchilla value.
axioms (1)
  • domain assumption: standard power-law scaling relationships between model size, data, and compute continue to hold in the 1-2B parameter regime.
    Invoked when using wind-tunnel experiments to extrapolate optimal ratios.
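Made concrete, the assumption is a Chinchilla-style parametric loss. The constants below are round illustrative values, not the paper's fits or Hoffmann et al.'s, and the scan shows how a tokens-per-parameter ratio falls out of a fixed compute budget:

```python
def optimal_allocation(compute_flops,
                       E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Scan model sizes N under the budget C ~= 6*N*D and minimize the
    assumed power-law loss L(N, D) = E + A/N**alpha + B/D**beta.
    All constants here are illustrative, not fitted values."""
    best = None
    for tenth_decade in range(60, 120):      # N from 1e6 to ~1e12 params
        N = 10 ** (tenth_decade / 10)
        D = compute_flops / (6 * N)          # tokens the budget buys at this N
        loss = E + A / N**alpha + B / D**beta
        if best is None or loss < best[0]:
            best = (loss, N, D)
    _, N, D = best
    return N, D, D / N                       # D/N = tokens per parameter
```

Under these toy constants a 1e21-FLOP budget lands near a 2B-parameter model; whether the true optimum sits at the Chinchilla ratio or at the higher ratio the paper reports depends entirely on the fitted exponents, which is exactly why the referee asks for sensitivity analysis.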

pith-pipeline@v0.9.0 · 5679 in / 1230 out tokens · 120097 ms · 2026-05-13T17:52:51.839196+00:00 · methodology


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  2. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  3. A Causal Language Modeling Detour Improves Encoder Continued Pretraining

    cs.CL 2026-05 conditional novelty 7.0

    A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.

  4. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  5. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  6. Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining

    cs.CL 2026-04 unverdicted novelty 7.0

    Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.

  7. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  10. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  11. Efficient Pre-Training with Token Superposition

    cs.CL 2026-05 unverdicted novelty 6.0

    Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.

  12. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  13. Exploring Audio Hallucination in Egocentric Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

  14. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  15. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  16. SinkTrack: Attention Sink based Context Anchoring for Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SinkTrack uses attention sink at the BOS token to anchor LLMs to initial context, reducing hallucination and forgetting with reported gains on benchmarks like SQuAD2.0 and M3CoT.

  17. Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

    cs.LG 2026-04 unverdicted novelty 6.0

    Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

  18. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  19. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.

  20. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  21. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  22. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  23. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  24. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  25. Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis

    cs.CR 2026-05 unverdicted novelty 4.0

    Fine-tuned small language models trained on a synthetic Windows event log dataset with remediation steps outperform larger models in issue detection and solution generation with lower computational cost.

  26. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  27. TinyLlama: An Open-Source Small Language Model

    cs.CL 2024-01 accept novelty 4.0

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 26 Pith papers · 28 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, Singapore,...

  3. [3]

    InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Association for Computational Linguistics. doi: 10.18653/v1/ 2023.emnlp-main.298. URL https://aclanthology.org/2023.emnlp-main.298. Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxan- dra Cojocaru, M ´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open langua...

  4. [4]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609 ,

  6. [6]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Accessed: date-of-access. Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  7. [7]

    Datasheet for the pile

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile. arXiv preprint arXiv:2201.07311,

  8. [8]

    Andrei Z Broder

    Accessed: [Insert Date of Access]. Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171) , pp. 21–29. IEEE,

  9. [9]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code. arXiv preprint arXiv:2107.03374,

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    19 Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  12. [12]

    Cerebras-gpt: Open compute- optimal language models trained on the cerebras wafer- scale cluster

    Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208,

  13. [13]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360,

  14. [14]

    Understanding emergent abilities of language models from the loss perspective

    Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796,

  15. [15]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: accurate post- training quantization for generative pre-trained transformers.CoRR, abs/2210.17323,

  16. [16]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    doi: 10.48550/ARXIV .2210.17323. URLhttps://doi.org/10.48550/arXiv.2210.17323. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  17. [17]

    Gemini: A Family of Highly Capable Multimodal Models

    Team Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  18. [18]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C ´esar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644,

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  20. [20]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  21. [21]

    Query- key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query- key normalization for transformers. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253, Online, November

  22. [22]

    Training Compute-Optimal Large Language Models

    Association for Computational Linguistics. doi: 10.18653/v1/2020. findings-emnlp.379. URL https://aclanthology.org/2020.findings-emnlp.379. 20 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large langu...

  23. [23]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

  24. [24]

    Unlock predictable scaling from emergent abilities

    Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, et al. Unlock predictable scaling from emergent abilities. arXiv preprint arXiv:2310.03262,

  25. [25]

    sharpdarts: Faster and more accurate differentiable architecture search

    Andrew Hundt, Varun Jain, and Gregory D Hager. sharpdarts: Faster and more accurate differentiable architecture search. arXiv preprint arXiv:1903.09900,

  26. [26]

    Accessed: date-of- access. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  27. [27]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  28. [28]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  29. [29]

    Sparse upcycling: Training mixture-of-experts from dense checkpoints.arXiv preprint arXiv:2212.05055, 2022

    Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055,

  30. [30]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Cheng- hao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a. Yuanzhi Li, S´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ...

  31. [31]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  33. [33]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,

  34. [34]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,

  35. [35]

    Beyond chinchilla- optimal: Accounting for inference in language model scaling laws,

    Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448,

  36. [36]

    Neural Machine Translation of Rare Words with Subword Units

    Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162. Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Victoria Lin, Noah A Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638,

  37. [37]

    Don’t decay the learning rate, increase the batch size

    Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489,

  38. [38]

    Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention

    Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, and Dong Yu. Zebra: Extending context window with layerwise grouped local-global attention. arXiv preprint arXiv:2312.08618, 2023.

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    LLMFarm team. LLMFarm, 2023a. URL https://github.com/guinmoon/LLMFarm. MLC team. MLC-LLM, 2023b. URL https://github.com/mlc-ai/mlc-llm. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  40. [40]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

  41. [41]

    Magicoder: Source Code Is All You Need

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.

  42. [42]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023.

  43. [43]

    Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

  44. [44]

    Effective long-context scaling of foundation models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.

  45. [45]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.

  46. [46]

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

  47. [47]

    Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

    Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.

  48. [48]

    Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

    Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952, 2024.

  49. [49]

    Yi: Open Foundation Models by 01.AI

    01.AI. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.

  50. [50]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

  51. [51]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024a. Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.

  52. [52]

    ∞ bench: Extending long context evaluation beyond 100k tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞Bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718, 2024b. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.

  53. [53]

    QK-norm and learning rate sensitivity

    After applying QK-norm, we observe a significant decrease in learning rate sensitivity, similar to Wortsman et al. (2023). However, as MiniCPM is itself an SLM, we do not require low learning rate sensitivity as long as we can find the best learning rate with Tensor Programs (Yang et al., 2022; 2023). Therefore, we do not introduce QK-norm and z-loss in training MiniCPM.
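    QK-norm normalizes queries and keys before the attention logits are computed, which bounds the logit scale and is what reduces the learning rate sensitivity discussed here. A minimal single-head sketch in plain Python (an RMS-style normalization without learned gain is assumed; all names are illustrative):

    ```python
    import math
    import random

    def rms_norm(v, eps=1e-6):
        # Scale a vector to unit root-mean-square (no learned gain, for brevity).
        rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
        return [x / rms for x in v]

    def attention_logit(q, k, d_head, qk_norm=False):
        # One query/key dot product. With QK-norm, q and k are normalized first,
        # so |logit| <= sqrt(d_head) regardless of how large the activations grow.
        if qk_norm:
            q, k = rms_norm(q), rms_norm(k)
        return sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)

    random.seed(0)
    q = [random.gauss(0, 10) for _ in range(64)]  # deliberately large activations
    k = [random.gauss(0, 10) for _ in range(64)]
    print(abs(attention_logit(q, k, 64, qk_norm=True)))   # bounded by sqrt(64) = 8
    print(abs(attention_logit(q, k, 64, qk_norm=False)))  # scales with activation magnitude
    ```

    The bound follows because each normalized vector has Euclidean norm sqrt(d_head), so the dot product is at most d_head before the 1/sqrt(d_head) scaling.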

  54. [54]

    The “shape” of the model, i.e., model width compared to model depth, is kept as similar as possible to avoid any potential performance variation.

    Name    N (B)    d_m    d_ff    d_h    n_h    L
    9M      0.009    320    800     64     5      8
    30M     0.036    512    1280    64     8      12
    70M     0.066    640    1600    64     10     14
    0.1B    0.109    768    1920    64     12     16
    0.17B   0.166    896    2240    64     14     18
    0.2B    0.241    1024   2560    64     16     20
    0.5B    0.499    1344   …       64     …      …
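    As a sanity check, the N column can be roughly reproduced from the shape columns, assuming four d_m × d_m attention projections plus a gated feed-forward block with three d_m × d_ff matrices per layer (an architectural assumption for this sketch, not stated in this excerpt; norm gains are ignored):

    ```python
    def non_embedding_params(d_m, d_ff, n_layers):
        # Per layer: 4*d_m^2 for the Q/K/V/O projections, plus 3*d_m*d_ff for a
        # gated FFN (gate, up, and down matrices). Embeddings are excluded.
        return n_layers * (4 * d_m * d_m + 3 * d_m * d_ff)

    # Rows from the wind-tunnel table: (name, d_m, d_ff, L, reported N in billions)
    for name, d_m, d_ff, L, n_b in [
        ("9M", 320, 800, 8, 0.009),
        ("30M", 512, 1280, 12, 0.036),
        ("70M", 640, 1600, 14, 0.066),
        ("0.1B", 768, 1920, 16, 0.109),
    ]:
        est = non_embedding_params(d_m, d_ff, L) / 1e9
        print(f"{name}: estimated {est:.3f}B vs reported {n_b}B")
    ```

    Under these assumptions the estimates land within a rounding error of the reported N values, which suggests the table counts non-embedding parameters only.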

  55. [55]

    WSD LRS runs with different end steps share the same stable training stage

    Cosine(T; s) =
        (s/W)·η,                              s < W
        0.1η + 0.45η·(1 + cos(πs/T)),         W < s < T
        0.1η,                                 s > T

    CosineLoop(T; s) =
        (s/W)·η,                              s < W
        0.1η + 0.45η·(1 + cos(πs/T)),         W < s

    Figure 15: Illustrative comparison between the Cosine LRS and the WSD LRS (learning rate vs. iteration; curves shown: Cosine(40N), WSD(40N, 4N), WSD(80N, 8N)).
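    Read as step-to-learning-rate functions, the two scheduler families can be sketched directly. The cosine form below anneals from η to 0.1η over [W, T]; the WSD decay shape is a free choice in the paper's formulation, and the linear decay and step counts here are assumptions purely for illustration:

    ```python
    import math

    def cosine_lr(s, T, W, eta):
        # Cosine LRS: linear warmup over W steps, cosine anneal from eta
        # down to 0.1 * eta at step T, then hold.
        if s < W:
            return s / W * eta
        if s < T:
            return 0.1 * eta + 0.45 * eta * (1 + math.cos(math.pi * s / T))
        return 0.1 * eta

    def wsd_lr(s, S, W, D, eta):
        # Warmup-Stable-Decay LRS: warmup over W steps, hold eta through the
        # stable stage, then decay over the final D steps before end step S.
        if s < W:
            return s / W * eta
        if s <= S - D:
            return eta
        return eta * (S - s) / D  # linear decay assumed, for illustration only

    # WSD runs with different end steps share the same stable stage, so one
    # stable-stage checkpoint can be decayed toward any chosen end step:
    print(wsd_lr(30_000, 40_000, 1_000, 4_000, eta=0.01))  # stable stage of WSD(40N, 4N)
    print(wsd_lr(30_000, 80_000, 1_000, 8_000, eta=0.01))  # same value in WSD(80N, 8N)
    ```

    This shared stable stage is what lets a single long run spawn fully decayed checkpoints at multiple data budgets without retraining from scratch.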

  56. [56]

    C. MiniCPM’s Vocabulary

    We convert the x-axis to computation FLOPs to compare the compute-optimal regime in the right part of the figure. Despite being small in parameter size, MiniCPM targets modeling a diverse data distribution, excelling in English and Chinese. Therefore, our vocabulary is relatively large. For the 2.4B model, we use a tokenizer consisting of …