pith. machine review for the scientific record.

arxiv: 2404.06395 · v3 · submitted 2024-04-09 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:52 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: small language models · MiniCPM · learning rate scheduler · scaling laws · model efficiency · data scaling · language model training

The pith

MiniCPM shows that 1.2B and 2.4B parameter models can match the performance of 7B to 13B large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MiniCPM, a family of small language models with 1.2 billion and 2.4 billion non-embedding parameters. The authors demonstrate that these models perform on par with much larger 7B to 13B models across various benchmarks. They achieve this through careful model scaling experiments and a new Warmup-Stable-Decay learning rate scheduler that supports efficient data scaling. The work also derives a compute-optimal data-to-model ratio higher than the previously accepted Chinchilla optimum. By releasing variants optimized for different tasks, the authors highlight practical ways to build capable yet resource-efficient language models.

Core claim

The central discovery is that the MiniCPM 1.2B and 2.4B models not only lead in their size categories but also achieve capabilities comparable to 7B-13B LLMs. This is enabled by extensive model wind tunnel experiments for stable scaling and by the Warmup-Stable-Decay learning rate scheduler, which facilitates continuous training and reveals training dynamics that support a higher optimal data-model ratio than the Chinchilla optimum. The paper further shows scalability in both model and data dimensions and introduces extended family members such as MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K that maintain strong performance.

What carries the argument

The Warmup-Stable-Decay (WSD) learning rate scheduler, which divides training into warmup, stable, and decay phases to enable continuous training, domain adaptation, and efficient exploration of data-model scaling laws without full retraining.
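The three phases can be sketched as a schedule function. The phase fractions and the exponential shape of the decay below are illustrative choices, not the paper's exact settings:

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.1, min_lr_ratio=0.1):
    """Warmup-Stable-Decay sketch: linear warmup to peak_lr, a long
    constant plateau, then a short final anneal toward min_lr_ratio * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # warmup phase
        return peak_lr * step / warmup_steps
    if step < stable_end:                         # stable phase
        return peak_lr
    # decay phase: exponential anneal (one plausible shape for the decay)
    progress = (step - stable_end) / decay_steps
    return peak_lr * (min_lr_ratio ** progress)
```

Because the plateau runs at a constant learning rate, any stable-phase checkpoint can be branched into a fresh decay run, which is what makes continuous training and cheap scaling-law sweeps possible without retraining from scratch.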

If this is right

  • Models of this size can be deployed more widely due to lower computational requirements while matching larger model performance.
  • Training strategies allow for higher data-to-compute ratios than previously thought optimal.
  • The approach scales to larger models and different architectures like mixture-of-experts.
  • Specific variants handle tasks such as preference optimization and long-context processing effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future research could prioritize optimizing small models over scaling to trillions of parameters for many applications.
  • On-device or edge AI deployments become more feasible with these efficient models.
  • Independent verification on held-out data would strengthen claims of parity.
  • The WSD scheduler might apply to training other types of neural networks beyond language models.

Load-bearing premise

The benchmarks used provide an unbiased measure of model capabilities without advantages from overlapping training data or post-training adjustments.

What would settle it

Evaluating the models on a new set of tasks designed to have no overlap with any training data and observing whether their performance remains comparable to 7B-13B models.

read the original abstract

The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive the much higher compute optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cementing MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniCPM, focusing on 1.2B and 2.4B non-embedding parameter small language models that are claimed to excel in their size categories and achieve performance on par with 7B-13B LLMs. It details scalable training via extensive model wind tunnel experiments for stable model scaling and a Warmup-Stable-Decay (WSD) learning rate scheduler for data scaling and continuous training, including an analysis of WSD training dynamics. From these, the authors derive a compute-optimal data-model ratio substantially higher than the Chinchilla optimum. The work also presents MiniCPM variants (DPO, MoE, 128K) and releases the models publicly.

Significance. If the performance parity and scaling claims hold under rigorous controls, this would be a notable contribution to efficient LLM research by demonstrating that carefully trained small models can match much larger ones, lowering computational barriers and providing practical insights into data-model scaling. The public model release and the WSD scheduler's support for continuous adaptation add reproducibility value and could influence future training protocols. The higher data-model ratio challenges existing scaling laws and merits follow-up, though its impact hinges on clearer validation.

major comments (2)
  1. [Abstract and Results] Abstract and main results: The central claim that the 1.2B and 2.4B variants 'demonstrate capabilities on par with 7B-13B LLMs' is load-bearing for the paper's contribution but is presented without error bars on benchmark scores, exact numerical baseline comparisons to specific 7B-13B models, or details on training data exclusion to rule out overlap. This omission makes independent verification of parity difficult and weakens the claim given potential selection effects.
  2. [Scaling experiments] Scaling experiments section: The derivation of the compute-optimal data-model ratio (higher than Chinchilla) is obtained directly from the authors' WSD-based runs; while not circular by construction, the section lacks explicit controls such as sensitivity analysis to run hyperparameters or comparison against independent scaling datasets, which is needed to establish the ratio as a general finding rather than run-specific.
minor comments (2)
  1. [Method] The description of the WSD scheduler would be clearer with an explicit figure showing the learning rate curve over training steps and how the stable and decay phases enable continuous training.
  2. [Results] Table or figure captions for benchmark results should include the exact model sizes and training tokens used for all compared baselines to facilitate direct comparison.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where they strengthen the work without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and main results: The central claim that the 1.2B and 2.4B variants 'demonstrate capabilities on par with 7B-13B LLMs' is load-bearing for the paper's contribution but is presented without error bars on benchmark scores, exact numerical baseline comparisons to specific 7B-13B models, or details on training data exclusion to rule out overlap. This omission makes independent verification of parity difficult and weakens the claim given potential selection effects.

    Authors: We agree that greater precision in the presentation of results would improve verifiability. In the revised manuscript we will add a dedicated comparison table listing exact scores for MiniCPM-1.2B and MiniCPM-2.4B against named 7B–13B baselines (LLaMA-2-7B, LLaMA-2-13B, Mistral-7B, Qwen-7B, etc.) on MMLU, BBH, HumanEval and GSM8K. We will also expand the data section to describe the deduplication pipeline and contamination checks performed against the evaluation benchmarks. Because our primary training runs used a single seed for computational efficiency, we cannot retroactively supply full error bars; we will instead report variance observed in auxiliary short runs and explicitly note this limitation. revision: partial
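The contamination check the authors commit to can be approximated with a verbatim n-gram overlap test, a common heuristic; the window size and whitespace tokenization here are illustrative simplifications:

```python
def ngram_set(text, n=13):
    """All n-token windows of a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc, eval_example, n=13):
    """Flag an eval example if any of its n-grams appears verbatim in a
    training document. Production pipelines normalize punctuation and
    hash the n-grams at scale, but the overlap logic is the same."""
    return bool(ngram_set(train_doc, n) & ngram_set(eval_example, n))
```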

  2. Referee: [Scaling experiments] Scaling experiments section: The derivation of the compute-optimal data-model ratio (higher than Chinchilla) is obtained directly from the authors' WSD-based runs; while not circular by construction, the section lacks explicit controls such as sensitivity analysis to run hyperparameters or comparison against independent scaling datasets, which is needed to establish the ratio as a general finding rather than run-specific.

    Authors: We accept that additional robustness checks are warranted. We will insert a sensitivity-analysis subsection that varies batch size, peak learning rate and decay length within the WSD schedule and shows the optimal data-to-model ratio remains stable. A full comparison against independent scaling corpora is not feasible given the compute budget required for retraining on entirely new datasets; we will therefore qualify the claim as holding under our experimental regime while discussing the conditions under which the higher ratio may generalize. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical scaling runs

full rationale

The paper derives its data-model scaling ratio and performance claims directly from new wind-tunnel experiments and WSD-scheduler runs on the MiniCPM models themselves. These are presented as fresh empirical observations rather than reductions of prior fitted constants or self-cited theorems. No equations equate a prediction to its own input by construction, and no load-bearing uniqueness result is imported from the authors' earlier work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard LLM scaling assumptions and the validity of the chosen benchmarks; the main empirical additions are the scheduler and the observed ratio.

free parameters (1)
  • compute-optimal data-model ratio
    Fitted from the authors' WSD-enabled scaling runs and reported as higher than the Chinchilla value.
axioms (1)
  • domain assumption: standard power-law scaling relationships between model size, data, and compute continue to hold in the 1-2B parameter regime.
    Invoked when using wind-tunnel experiments to extrapolate optimal ratios.
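Made concrete, the assumption is a Chinchilla-style parametric loss. The constants below are round illustrative values, not the paper's fits or Hoffmann et al.'s, and the scan shows how a tokens-per-parameter ratio falls out of a fixed compute budget:

```python
def optimal_allocation(compute_flops,
                       E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Scan model sizes N under the budget C ~= 6*N*D and minimize the
    assumed power-law loss L(N, D) = E + A/N**alpha + B/D**beta.
    All constants here are illustrative, not fitted values."""
    best = None
    for tenth_decade in range(60, 120):      # N from 1e6 to ~1e12 params
        N = 10 ** (tenth_decade / 10)
        D = compute_flops / (6 * N)          # tokens the budget buys at this N
        loss = E + A / N**alpha + B / D**beta
        if best is None or loss < best[0]:
            best = (loss, N, D)
    _, N, D = best
    return N, D, D / N                       # D/N = tokens per parameter
```

Under these toy constants a 1e21-FLOP budget lands near a 2B-parameter model; whether the true optimum sits at the Chinchilla ratio or at the higher ratio the paper reports depends entirely on the fitted exponents, which is exactly why the referee asks for sensitivity analysis.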

pith-pipeline@v0.9.0 · 5679 in / 1230 out tokens · 120097 ms · 2026-05-13T17:52:51.839196+00:00 · methodology


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  2. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  3. A Causal Language Modeling Detour Improves Encoder Continued Pretraining

    cs.CL 2026-05 conditional novelty 7.0

    A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.

  4. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  5. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  6. Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining

    cs.CL 2026-04 unverdicted novelty 7.0

    Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.

  7. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  10. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  11. Efficient Pre-Training with Token Superposition

    cs.CL 2026-05 unverdicted novelty 6.0

    Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.

  12. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  13. Exploring Audio Hallucination in Egocentric Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

  14. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  15. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  16. SinkTrack: Attention Sink based Context Anchoring for Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SinkTrack uses attention sink at the BOS token to anchor LLMs to initial context, reducing hallucination and forgetting with reported gains on benchmarks like SQuAD2.0 and M3CoT.

  17. Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

    cs.LG 2026-04 unverdicted novelty 6.0

    Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

  18. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  19. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.

  20. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  21. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  22. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  23. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  24. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  25. Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis

    cs.CR 2026-05 unverdicted novelty 4.0

    Fine-tuned small language models trained on a synthetic Windows event log dataset with remediation steps outperform larger models in issue detection and solution generation with lower computational cost.

  26. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  27. TinyLlama: An Open-Source Small Language Model

    cs.CL 2024-01 accept novelty 4.0

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 26 Pith papers · 28 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, Singapore,...

  3. [3]

    InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Association for Computational Linguistics. doi: 10.18653/v1/ 2023.emnlp-main.298. URL https://aclanthology.org/2023.emnlp-main.298. Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxan- dra Cojocaru, M ´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open langua...

  4. [4]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609 ,

  6. [6]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Accessed: date-of-access. Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  7. [7]

    Datasheet for the pile

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile. arXiv preprint arXiv:2201.07311,

  8. [8]

    Andrei Z Broder

    Accessed: [Insert Date of Access]. Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171) , pp. 21–29. IEEE,

  9. [9]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code. arXiv preprint arXiv:2107.03374,

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    19 Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  12. [12]

    Cerebras-gpt: Open compute- optimal language models trained on the cerebras wafer- scale cluster

    Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208,

  13. [13]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360,

  14. [14]

    Understanding emergent abilities of language models from the loss perspective

    Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796,

  15. [15]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: accurate post- training quantization for generative pre-trained transformers.CoRR, abs/2210.17323,

  16. [16]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    doi: 10.48550/ARXIV .2210.17323. URLhttps://doi.org/10.48550/arXiv.2210.17323. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  17. [17]

    Gemini: A Family of Highly Capable Multimodal Models

    Team Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  18. [18]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C ´esar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644,

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  20. [20]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  21. [21]

    Query- key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query- key normalization for transformers. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253, Online, November

  22. [22]

    Training Compute-Optimal Large Language Models

    Association for Computational Linguistics. doi: 10.18653/v1/2020. findings-emnlp.379. URL https://aclanthology.org/2020.findings-emnlp.379. 20 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large langu...

  23. [23]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

  24. [24]

    Unlock predictable scaling from emergent abilities

    Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, et al. Unlock predictable scaling from emergent abilities. arXiv preprint arXiv:2310.03262,

  25. [25]

    sharpdarts: Faster and more accurate differentiable architecture search

    Andrew Hundt, Varun Jain, and Gregory D Hager. sharpdarts: Faster and more accurate differentiable architecture search. arXiv preprint arXiv:1903.09900,

  26. [26]

    Accessed: date-of- access. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  27. [27]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  28. [28]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  29. [29]

    Sparse upcycling: Training mixture-of-experts from dense checkpoints.arXiv preprint arXiv:2212.05055, 2022

    Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055,

  30. [30]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Cheng- hao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a. Yuanzhi Li, S´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ...

  31. [31]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  33. [33]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,

  34. [34]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,

  35. [35]

    Beyond chinchilla- optimal: Accounting for inference in language model scaling laws,

    Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448,

  36. [36]

    Neural Machine Translation of Rare Words with Subword Units

    Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162. Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Victoria Lin, Noah A Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638,

  37. [37]

    Don’t decay the learning rate, increase the batch size

    Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489,

  38. [38]

    Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention

    Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, and Dong Yu. Zebra: Extending context window with layerwise grouped local-global attention. arXiv preprint arXiv:2312.08618, 2023.

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    LLMFarm team. LLMFarm, 2023a. URL https://github.com/guinmoon/LLMFarm. MLC team. MLC-LLM, 2023b. URL https://github.com/mlc-ai/mlc-llm. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  40. [40]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

  41. [41]

    Magicoder: Source Code Is All You Need

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.

  42. [42]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023.

  43. [43]

    Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

  44. [44]

    Effective long-context scaling of foundation models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.

  45. [45]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.

  46. [46]

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

  47. [47]

    Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

    Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.

  48. [48]

    Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

    Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952, 2024.

  49. [49]

    Yi: Open Foundation Models by 01.AI

    01.AI. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.

  50. [50]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

  51. [51]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024a. Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.

  52. [52]

    ∞ bench: Extending long context evaluation beyond 100k tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞Bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718, 2024b. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.

  53. [53]

    QK-norm and learning rate sensitivity

    After applying QK-norm, we observe a significant decrease in learning rate sensitivity, similar to Wortsman et al. (2023). However, as MiniCPM is itself an SLM, we do not require low learning rate sensitivity as long as we can find the best learning rate with Tensor Programs (Yang et al., 2022; 2023). Therefore, we do not introduce QK-norm and z-loss in training MiniCPM.
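    QK-norm normalizes queries and keys before the attention logits are computed, which bounds the logit scale and is what reduces the learning rate sensitivity discussed here. A minimal single-head sketch in plain Python (an RMS-style normalization without learned gain is assumed; all names are illustrative):

    ```python
    import math
    import random

    def rms_norm(v, eps=1e-6):
        # Scale a vector to unit root-mean-square (no learned gain, for brevity).
        rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
        return [x / rms for x in v]

    def attention_logit(q, k, d_head, qk_norm=False):
        # One query/key dot product. With QK-norm, q and k are normalized first,
        # so |logit| <= sqrt(d_head) regardless of how large the activations grow.
        if qk_norm:
            q, k = rms_norm(q), rms_norm(k)
        return sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)

    random.seed(0)
    q = [random.gauss(0, 10) for _ in range(64)]  # deliberately large activations
    k = [random.gauss(0, 10) for _ in range(64)]
    print(abs(attention_logit(q, k, 64, qk_norm=True)))   # bounded by sqrt(64) = 8
    print(abs(attention_logit(q, k, 64, qk_norm=False)))  # scales with activation magnitude
    ```

    The bound follows because each normalized vector has Euclidean norm sqrt(d_head), so the dot product is at most d_head before the 1/sqrt(d_head) scaling.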

  54. [54]

    The “shape” of the model, i.e., model width compared to model depth, is kept as similar as possible to avoid any potential performance variation.

    Name    N (B)    d_m    d_ff    d_h    n_h    L
    9M      0.009    320    800     64     5      8
    30M     0.036    512    1280    64     8      12
    70M     0.066    640    1600    64     10     14
    0.1B    0.109    768    1920    64     12     16
    0.17B   0.166    896    2240    64     14     18
    0.2B    0.241    1024   2560    64     16     20
    0.5B    0.499    1344   …       64     …      …
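    As a sanity check, the N column can be roughly reproduced from the shape columns, assuming four d_m × d_m attention projections plus a gated feed-forward block with three d_m × d_ff matrices per layer (an architectural assumption for this sketch, not stated in this excerpt; norm gains are ignored):

    ```python
    def non_embedding_params(d_m, d_ff, n_layers):
        # Per layer: 4*d_m^2 for the Q/K/V/O projections, plus 3*d_m*d_ff for a
        # gated FFN (gate, up, and down matrices). Embeddings are excluded.
        return n_layers * (4 * d_m * d_m + 3 * d_m * d_ff)

    # Rows from the wind-tunnel table: (name, d_m, d_ff, L, reported N in billions)
    for name, d_m, d_ff, L, n_b in [
        ("9M", 320, 800, 8, 0.009),
        ("30M", 512, 1280, 12, 0.036),
        ("70M", 640, 1600, 14, 0.066),
        ("0.1B", 768, 1920, 16, 0.109),
    ]:
        est = non_embedding_params(d_m, d_ff, L) / 1e9
        print(f"{name}: estimated {est:.3f}B vs reported {n_b}B")
    ```

    Under these assumptions the estimates land within a rounding error of the reported N values, which suggests the table counts non-embedding parameters only.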

  55. [55]

    WSD LRS runs with different end steps share the same stable training stage

    Cosine(T; s) =
        (s/W)·η,                              s < W
        0.1η + 0.45η·(1 + cos(πs/T)),         W < s < T
        0.1η,                                 s > T

    CosineLoop(T; s) =
        (s/W)·η,                              s < W
        0.1η + 0.45η·(1 + cos(πs/T)),         W < s

    Figure 15: Illustrative comparison between the Cosine LRS and the WSD LRS (learning rate vs. iteration; curves shown: Cosine(40N), WSD(40N, 4N), WSD(80N, 8N)).
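    Read as step-to-learning-rate functions, the two scheduler families can be sketched directly. The cosine form below anneals from η to 0.1η over [W, T]; the WSD decay shape is a free choice in the paper's formulation, and the linear decay and step counts here are assumptions purely for illustration:

    ```python
    import math

    def cosine_lr(s, T, W, eta):
        # Cosine LRS: linear warmup over W steps, cosine anneal from eta
        # down to 0.1 * eta at step T, then hold.
        if s < W:
            return s / W * eta
        if s < T:
            return 0.1 * eta + 0.45 * eta * (1 + math.cos(math.pi * s / T))
        return 0.1 * eta

    def wsd_lr(s, S, W, D, eta):
        # Warmup-Stable-Decay LRS: warmup over W steps, hold eta through the
        # stable stage, then decay over the final D steps before end step S.
        if s < W:
            return s / W * eta
        if s <= S - D:
            return eta
        return eta * (S - s) / D  # linear decay assumed, for illustration only

    # WSD runs with different end steps share the same stable stage, so one
    # stable-stage checkpoint can be decayed toward any chosen end step:
    print(wsd_lr(30_000, 40_000, 1_000, 4_000, eta=0.01))  # stable stage of WSD(40N, 4N)
    print(wsd_lr(30_000, 80_000, 1_000, 8_000, eta=0.01))  # same value in WSD(80N, 8N)
    ```

    This shared stable stage is what lets a single long run spawn fully decayed checkpoints at multiple data budgets without retraining from scratch.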

  56. [56]

    C. MiniCPM’s Vocabulary

    We convert the x-axis to computation FLOPs to compare the compute-optimal regime in the right part of the figure. Despite being small in parameter size, MiniCPM targets modeling a diverse data distribution, excelling in English and Chinese. Therefore, our vocabulary is relatively large. For the 2.4B model, we use a tokenizer consisting of …