arxiv: 2409.06624 · v4 · submitted 2024-09-10 · 💻 cs.CL · cs.AI · cs.LG

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Pith reviewed 2026-05-23 20:40 UTC · model grok-4.3

keywords: continual pre-training, Llama-3 70B, Additional Language Mixture Ratio, hyperparameter tuning, Chinese language, model adaptation, fine-tuning

The pith

The optimal pairing of language mixture ratio and learning rate found on Llama-3 8B is reused to improve Llama-3 70B's Chinese skills

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the selection of hyper-parameters for continual pre-training of large language models to acquire new language capabilities. The authors identify an optimal relationship between the Additional Language Mixture Ratio and the learning rate using the Llama-3 8B model, which then informs the training configuration for the 70B version. Applying this setup leads to enhanced performance on Chinese benchmarks and additional gains in math, coding, and emotional intelligence. The work demonstrates a practical way to manage the high costs of full-scale training by leveraging smaller models for hyper-parameter tuning before deployment on larger ones. This matters for making language adaptation of LLMs more accessible and efficient in real-world applications.

Core claim

Studying the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size directly indicates the optimal experimental setup for the 70B model. Through careful hyper-parameter choice and subsequent fine-tuning, the model capability improves on Chinese-related benchmarks as well as in math, coding, and emotional intelligence, with the final 70B version deployed successfully in a chat system.

What carries the argument

The correlation between Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) determined on the smaller 8B model to set up training for the 70B model.
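
As a minimal sketch of how such a small-model search might be read off, the snippet below takes a grid of validation losses indexed by (ALMR, LR), finds the best LR for each ALMR, and picks the overall best pair to carry over to the larger run. The loss values and the names `GRID`, `frontier`, and `best_pair` are illustrative assumptions, not numbers or code from the paper, and the paper's own selection procedure may differ.

```python
from typing import Dict, Tuple

# Hypothetical validation losses from small-model CPT runs, keyed by
# (ALMR in percent, learning rate). Placeholder values, NOT paper results.
GRID: Dict[Tuple[float, float], float] = {
    (10.0, 1e-5): 1.90, (10.0, 5e-5): 1.84, (10.0, 1e-4): 1.88,
    (30.0, 1e-5): 1.86, (30.0, 5e-5): 1.80, (30.0, 1e-4): 1.83,
    (50.0, 1e-5): 1.89, (50.0, 5e-5): 1.85, (50.0, 1e-4): 1.91,
}


def frontier(grid: Dict[Tuple[float, float], float]) -> Dict[float, Tuple[float, float]]:
    """For each ALMR, return (best LR, its loss) among the runs in the grid."""
    best: Dict[float, Tuple[float, float]] = {}
    for (almr, lr), loss in grid.items():
        if almr not in best or loss < best[almr][1]:
            best[almr] = (lr, loss)
    return best


def best_pair(grid: Dict[Tuple[float, float], float]) -> Tuple[float, float]:
    """Return the (ALMR, LR) pair with the lowest validation loss overall."""
    return min(grid, key=grid.get)


if __name__ == "__main__":
    for almr, (lr, loss) in sorted(frontier(GRID).items()):
        print(f"ALMR={almr:.0f}%  best LR={lr:g}  loss={loss:.2f}")
    print("pair to reuse at the larger scale:", best_pair(GRID))
```

Whether the pair chosen this way remains the best one at 70B is exactly the transfer assumption discussed under the load-bearing premise; the sketch only automates the 8B-side selection.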

If this is right

  • Enhanced Chinese language capabilities in the adapted 70B model.
  • Unexpected improvements in math, coding, and emotional intelligence domains.
  • Practical deployment of the improved model in a real chat system.
  • Lower training costs by avoiding full hyper-parameter searches at the largest scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller models can act as efficient testbeds for hyper-parameter decisions in continual pre-training of larger models.
  • Language adaptation through CPT may produce broad benefits across seemingly unrelated skills.
  • Similar ALMR-LR tuning could be tested for other target languages or specialized domains.

Load-bearing premise

The optimal ALMR and LR correlation identified on the 8B model can be directly applied to the 70B model to achieve similar improvements without further adjustment.

What would settle it

If training the 70B model with the ALMR-LR values from the 8B experiments yields no gains, or yields losses, on Chinese benchmarks relative to standard training, this would show that the correlation does not transfer.
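
One way to phrase that check, as a rough sketch only: compare per-benchmark scores of a baseline 70B run against the run that used the 8B-derived values, and require a gain on every Chinese benchmark. The function name `transfer_supported`, the `min_gain` threshold, and the example scores are hypothetical; the paper does not describe such a test, and a real comparison would also need to account for run-to-run variance.

```python
from typing import Dict, Iterable


def transfer_supported(
    baseline: Dict[str, float],
    tuned: Dict[str, float],
    benchmarks: Iterable[str],
    min_gain: float = 0.0,
) -> bool:
    """True only if `tuned` beats `baseline` by more than `min_gain` on every benchmark."""
    return all(tuned[name] - baseline[name] > min_gain for name in benchmarks)


if __name__ == "__main__":
    # Placeholder scores for illustration, NOT results from the paper.
    baseline_scores = {"C-Eval": 0.50, "LCSTS": 0.10}
    tuned_scores = {"C-Eval": 0.55, "LCSTS": 0.12}
    print(transfer_supported(baseline_scores, tuned_scores, ["C-Eval", "LCSTS"]))
```

Requiring a gain on every benchmark is a deliberately strict choice; an averaged criterion would be a reasonable alternative.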

Figures

Figures reproduced from arXiv: 2409.06624 by Kun Fan, Luo Ji, Ningyuan Xi, Qingqing Gu, Teng Chen, Yetao Wu.

Figure 1: CPT performance contours for different combinations of ALMR (in percentage) and LR on Llama-3 8B. The contour values correspond to validation loss (left) and averaged metrics (right). The cross points are experimental data points and the contours are extrapolated. The blue dashed lines indicate the efficient frontiers between ALMR and LR found from the contours. To determine the final choice of our ALMR and…
Figure 2: Typical metric plots of the CPT experiment on Llama-3 70B. Metrics include C-Eval, LCSTS, GSM8K, and HumanEval. The 70B CPT model outperforms Llama-3 in almost all the benchmarks, with the only exception of BBH.

Original abstract

Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicates the optimal experimental setup. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of LLM on a real-life chat system which obtains satisfying performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports continual pre-training (CPT) of Llama-3 8B and 70B to improve Chinese capabilities. It studies the correlation between Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) on the 8B model, asserts that this correlation directly determines the optimal setup for the 70B run, and claims that the resulting 70B model shows gains on Chinese benchmarks plus math, coding, and emotional-intelligence tasks, followed by deployment in a chat system.

Significance. If the 8B-derived ALMR-LR pairing transfers reliably to 70B, the approach would offer a lower-cost route to hyperparameter selection for large-scale CPT. The multi-domain gains and real-world deployment would strengthen its practical value, but the absence of scale-specific confirmation at 70B limits the strength of that conclusion.

major comments (2)
  1. [Abstract] Abstract: the claim that the ALMR-LR correlation identified on 8B 'directly indicates the optimal experimental setup' for the 70B CPT run is load-bearing yet unsupported; no 70B ablation, grid search, or sensitivity study is described that would confirm the same pairing remains optimal at the larger scale.
  2. [Abstract (and experimental results sections)] The manuscript provides no quantitative benchmark tables, scores, or figures for the claimed improvements on Chinese, math, coding, or emotional-intelligence tasks, so the magnitude and statistical significance of the reported gains cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by inclusion of at least one key quantitative result (e.g., benchmark delta) to ground the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and completeness.

  1. Referee: [Abstract] Abstract: the claim that the ALMR-LR correlation identified on 8B 'directly indicates the optimal experimental setup' for the 70B CPT run is load-bearing yet unsupported; no 70B ablation, grid search, or sensitivity study is described that would confirm the same pairing remains optimal at the larger scale.

    Authors: We agree the phrasing 'directly indicates' overstates the evidence, as the correlation was established via 8B experiments and applied to 70B without dedicated ablations at that scale due to computational expense. We will revise the abstract and methods to describe the 8B results as informing the 70B configuration under an assumption of transferability, while explicitly noting the lack of 70B-specific validation. revision: yes

  2. Referee: [Abstract (and experimental results sections)] The manuscript provides no quantitative benchmark tables, scores, or figures for the claimed improvements on Chinese, math, coding, or emotional-intelligence tasks, so the magnitude and statistical significance of the reported gains cannot be assessed.

    Authors: We will add explicit quantitative tables and figures in the experimental results section showing benchmark scores, deltas, and any available significance measures for the Chinese, math, coding, and emotional-intelligence tasks to enable direct evaluation of the gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hyperparameter transfer from 8B to 70B is an unverified assumption, not a self-referential derivation

The paper reports an empirical CPT study: ALMR-LR correlation is measured on Llama-3 8B and the resulting choice is applied to the 70B run, followed by benchmark evaluation. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided text. The transfer step is an explicit modeling assumption rather than a claim that the 70B outcome is forced by construction from the 8B data. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical determination of optimal ALMR and LR correlation, which involves tuning free parameters through experimentation on the 8B model. No explicit axioms or invented entities are described.

free parameters (2)
  • Additional Language Mixture Ratio (ALMR)
    The ratio is a hyperparameter optimized on the 8B model to find its correlation with LR.
  • Learning Rate (LR)
    The learning rate is tuned in correlation with ALMR on the 8B model.
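
To make the first free parameter concrete, here is a small sketch of how a mixture ratio could be turned into a data plan. It assumes ALMR means the share of additional-language data in the total CPT corpus, expressed in percent (as in the Figure 1 caption); the function names and the token budget are hypothetical and do not come from the paper.

```python
import random
from typing import Tuple


def split_token_budget(total_tokens: int, almr_percent: float) -> Tuple[int, int]:
    """Split a token budget into (additional-language tokens, original-corpus tokens)."""
    if not 0.0 <= almr_percent <= 100.0:
        raise ValueError("almr_percent must be between 0 and 100")
    additional = round(total_tokens * almr_percent / 100.0)
    return additional, total_tokens - additional


def sample_source(almr_percent: float, rng: random.Random) -> str:
    """Pick the corpus for the next training document with probability ALMR."""
    return "additional" if rng.random() < almr_percent / 100.0 else "original"


if __name__ == "__main__":
    # Hypothetical budget of 1,000 tokens at a 30% ratio.
    print(split_token_budget(1000, 30.0))
    rng = random.Random(0)
    print([sample_source(30.0, rng) for _ in range(5)])
```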
