A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Kun Fan; Luo Ji; Ningyuan Xi; Qingqing Gu; Teng Chen; Yetao Wu

arxiv: 2409.06624 · v4 · submitted 2024-09-10 · cs.CL · cs.AI · cs.LG

Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji

Pith reviewed 2026-05-23 20:40 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: continual pre-training · Llama-3 70B · Additional Language Mixture Ratio · hyperparameter tuning · Chinese language · model adaptation · fine-tuning

The pith

An ALMR and learning-rate relationship tuned on Llama-3 8B is transferred to 70B to improve its Chinese skills

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the selection of hyper-parameters for continual pre-training of large language models to acquire new language capabilities. The authors identify an optimal relationship between the Additional Language Mixture Ratio and the learning rate using the Llama-3 8B model, which then informs the training configuration for the 70B version. Applying this setup leads to enhanced performance on Chinese benchmarks and additional gains in math, coding, and emotional intelligence. The work demonstrates a practical way to manage the high costs of full-scale training by leveraging smaller models for hyper-parameter tuning before deployment on larger ones. This matters for making language adaptation of LLMs more accessible and efficient in real-world applications.

Core claim

Studying the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size directly indicates the optimal experimental setup for the 70B model. Through careful hyper-parameter choice and subsequent fine-tuning, the model capability improves on Chinese-related benchmarks as well as in math, coding, and emotional intelligence, with the final 70B version deployed successfully in a chat system.
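The small-scale search described in the core claim can be sketched as a grid sweep over ALMR and LR, assuming each pair is scored by a single validation loss. This is an illustration under stated assumptions, not the authors' procedure: the function names, the grid values, and the `toy_loss` stand-in are hypothetical, and a real run would replace `toy_loss` with an actual continual pre-training and evaluation step on the 8B model.

```python
import math
from itertools import product


def grid_search(almr_values, lr_values, evaluate):
    """Score every (ALMR, LR) pair and return all scores plus the best pair.

    `evaluate(almr, lr)` is assumed to return a loss (lower is better).
    """
    results = {(a, lr): evaluate(a, lr) for a, lr in product(almr_values, lr_values)}
    best = min(results, key=results.get)
    return results, best


def best_lr_per_almr(results):
    """For each ALMR value, pick the LR with the lowest loss.

    This gives a rough view of how the preferred LR moves with ALMR.
    """
    frontier = {}
    for (a, lr), loss in results.items():
        if a not in frontier or loss < frontier[a][1]:
            frontier[a] = (lr, loss)
    return {a: lr for a, (lr, _) in frontier.items()}


def toy_loss(almr, lr):
    """Synthetic placeholder surface, NOT data from the paper.

    It is shaped only so that the sketch has a unique minimum to find.
    """
    return (almr - 0.3) ** 2 + (math.log10(lr) + 5.0) ** 2
```

Calling `grid_search([0.1, 0.3, 0.5], [1e-6, 1e-5, 1e-4], toy_loss)` returns the full score table and the lowest-loss pair on the synthetic surface. Whether the pair found at 8B remains the best choice at 70B is exactly the transfer assumption the review questions.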

What carries the argument

The correlation between Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) determined on the smaller 8B model to set up training for the 70B model.

If this is right

Enhanced Chinese language capabilities in the adapted 70B model.
Unexpected improvements in math, coding, and emotional intelligence domains.
Practical deployment of the improved model in a real chat system.
Lower training costs by avoiding full hyper-parameter searches at the largest scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smaller models can act as efficient testbeds for hyper-parameter decisions in continual pre-training of larger models.
Language adaptation through CPT may produce broad benefits across seemingly unrelated skills.
Similar ALMR-LR tuning could be tested for other target languages or specialized domains.

Load-bearing premise

The optimal ALMR and LR correlation identified on the 8B model can be directly applied to the 70B model to achieve similar improvements without further adjustment.

What would settle it

If training the 70B model with the ALMR-LR values from the 8B experiments yields no gains, or yields losses, on Chinese benchmarks relative to standard training, this would show that the correlation does not transfer.
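The comparison above can be sketched as a per-benchmark delta check. This assumes scores are available as plain name-to-score mappings for a baseline run and a run using the transferred configuration. The function names and the rule "every shared benchmark must improve" are illustrative assumptions; the paper does not specify a decision rule.

```python
def benchmark_deltas(baseline, candidate):
    """Score difference (candidate minus baseline) for benchmarks present in both."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


def transfer_supported(deltas, min_gain=0.0):
    """True only if at least one benchmark was compared and all gains exceed min_gain.

    The all-benchmarks rule is an assumption made for this sketch.
    """
    return bool(deltas) and all(d > min_gain for d in deltas.values())
```

A stricter check would also account for run-to-run variance, which needs repeated runs or significance estimates that this sketch does not model.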

Figures

Figures reproduced from arXiv: 2409.06624 by Kun Fan, Luo Ji, Ningyuan Xi, Qingqing Gu, Teng Chen, Yetao Wu.

**Figure 1.** CPT performance contours for different combinations of ALMR (in percentage) and LR on Llama-3 8B. The contour values correspond to validation loss (left) and averaged metrics (right). The cross points are experimental data points and the contours are extrapolated. The blue dashed lines indicate the efficient frontiers between ALMR and LR found from the contours. To determine the final choice of our ALMR and…

**Figure 2.** Typical metric plots of the CPT experiment on Llama-3 70B. Metrics include CEval, LCSTS, GSM8K and HumanEval.

Adjacent paper text: Furthermore, our 70B CPT model outperforms Llama-3 in almost all the benchmarks, with the only exception of BBH.

Original abstract

Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicates the optimal experimental setup. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of LLM on a real-life chat system which obtains satisfying performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 8B hyperparameter choice is applied to 70B without scale validation.

The paper reports running a hyperparameter study on Llama-3 8B to find the best Additional Language Mixture Ratio paired with learning rate for Chinese continual pre-training. They then apply that combination to the 70B model and report gains on Chinese benchmarks plus math, coding, and emotional intelligence after fine-tuning, along with a deployed chat system. This is mostly a practical report on tuning for a specific model rather than a new technique. The focus on the mixture ratio and learning rate correlation as a lever is useful for people doing similar adaptations.

The main weakness is the direct jump from 8B results to 70B without any reported confirmation at the larger scale. The paper states that the 8B study directly indicates the optimal setup for 70B, but optimal mixture ratios and learning rates often change with model size because of differences in how the model absorbs new data. No ablations or additional searches at 70B are described, so the claimed improvements rest on an untested transfer.

The paper is aimed at practitioners who need to adapt open models like Llama to new languages. A reader interested in general principles or scaling laws will not find much new here. I would not bring this to the next reading group. I would not cite it in my own work. It does not deserve a serious referee because the central link between the 8B experiments and the 70B results lacks supporting evidence.

Referee Report

2 major / 1 minor

Summary. The paper reports continual pre-training (CPT) of Llama-3 8B and 70B to improve Chinese capabilities. It studies the correlation between Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) on the 8B model, asserts that this correlation directly determines the optimal setup for the 70B run, and claims that the resulting 70B model shows gains on Chinese benchmarks plus math, coding, and emotional-intelligence tasks, followed by deployment in a chat system.

Significance. If the 8B-derived ALMR-LR pairing transfers reliably to 70B, the approach would offer a lower-cost route to hyperparameter selection for large-scale CPT. The multi-domain gains and real-world deployment would strengthen its practical value, but the absence of scale-specific confirmation at 70B limits the strength of that conclusion.

major comments (2)

[Abstract] The claim that the ALMR-LR correlation identified on 8B 'directly indicates the optimal experimental setup' for the 70B CPT run is load-bearing yet unsupported; no 70B ablation, grid search, or sensitivity study is described that would confirm the same pairing remains optimal at the larger scale.
[Abstract (and experimental results sections)] The manuscript provides no quantitative benchmark tables, scores, or figures for the claimed improvements on Chinese, math, coding, or emotional-intelligence tasks, so the magnitude and statistical significance of the reported gains cannot be assessed.

minor comments (1)

[Abstract] The abstract would be strengthened by inclusion of at least one key quantitative result (e.g., benchmark delta) to ground the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and completeness.

Referee: [Abstract] The claim that the ALMR-LR correlation identified on 8B 'directly indicates the optimal experimental setup' for the 70B CPT run is load-bearing yet unsupported; no 70B ablation, grid search, or sensitivity study is described that would confirm the same pairing remains optimal at the larger scale.

Authors: We agree the phrasing 'directly indicates' overstates the evidence, as the correlation was established via 8B experiments and applied to 70B without dedicated ablations at that scale due to computational expense. We will revise the abstract and methods to describe the 8B results as informing the 70B configuration under an assumption of transferability, while explicitly noting the lack of 70B-specific validation. revision: yes
Referee: [Abstract (and experimental results sections)] The manuscript provides no quantitative benchmark tables, scores, or figures for the claimed improvements on Chinese, math, coding, or emotional-intelligence tasks, so the magnitude and statistical significance of the reported gains cannot be assessed.

Authors: We will add explicit quantitative tables and figures in the experimental results section showing benchmark scores, deltas, and any available significance measures for the Chinese, math, coding, and emotional-intelligence tasks to enable direct evaluation of the gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hyperparameter transfer from 8B to 70B is an unverified assumption, not a self-referential derivation

The paper reports an empirical CPT study: ALMR-LR correlation is measured on Llama-3 8B and the resulting choice is applied to the 70B run, followed by benchmark evaluation. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided text. The transfer step is an explicit modeling assumption rather than a claim that the 70B outcome is forced by construction from the 8B data. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical determination of optimal ALMR and LR correlation, which involves tuning free parameters through experimentation on the 8B model. No explicit axioms or invented entities are described.

free parameters (2)

Additional Language Mixture Ratio (ALMR): a hyperparameter optimized on the 8B model to find its correlation with LR.
Learning Rate (LR): tuned in correlation with ALMR on the 8B model.
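To make the first free parameter concrete, here is a minimal sketch of one way a mixture ratio could be applied when assembling a training stream. It assumes ALMR is the probability that any given training document is drawn from the additional-language corpus. That reading is an assumption for illustration: the paper may define the ratio over tokens rather than documents, and `mix_corpora` is a hypothetical helper, not part of any named library.

```python
import random


def mix_corpora(base_docs, extra_docs, almr, n_samples, seed=0):
    """Sample a training stream where each document comes from `extra_docs`
    with probability `almr` and from `base_docs` otherwise.

    `almr` is expressed as a fraction in [0, 1] (0.3 means 30 percent).
    """
    if not 0.0 <= almr <= 1.0:
        raise ValueError("almr must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible mixtures
    mixed = []
    for _ in range(n_samples):
        source = extra_docs if rng.random() < almr else base_docs
        mixed.append(rng.choice(source))
    return mixed
```

Sampling per document keeps the sketch short. A production pipeline would more likely balance by token count and stream from disk, which this example does not attempt.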

