Apertus LLM Family Expansion via Distillation and Quantization

Andrei Panferov; Dan Alistarh; Davit Melikidze; Martin Jaggi

arxiv: 2605.29128 · v2 · pith:M2EXHVG5new · submitted 2026-05-27 · 💻 cs.LG

Pith reviewed 2026-06-29 13:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords distillation · quantization · LLM family expansion · model compression · parameter reduction · hardware constraints · cost efficiency

The pith

Distillation and quantization expand an 8B LLM to a family of smaller models up to 4B parameters with competitive accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distillation combined with quantization offers a cost-effective method to create multiple sizes of LLMs from a single base model. Starting from an 8B parameter model trained on permissive data, the approach generates a family of models with fewer parameters suitable for diverse hardware. A sympathetic reader would care because it reduces the need for training each size from scratch while covering a wide range of deployment constraints. This validates a practical way to broaden LLM accessibility across budgets and systems.

Core claim

Distillation and quantization applied to the base 8B model produce a distilled family of models with up to 4B parameters trained on 1.7T permissive tokens, demonstrating cost-efficiency and strong accuracy performance across hardware requirements.

What carries the argument

The distillation process from the base model combined with quantization to produce varied sizes and formats.
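
As a rough illustration of the distillation half of this machinery, the sketch below computes a forward KL divergence between teacher and student next-token distributions at a single position. The text available here does not state the paper's exact loss, so the choice of forward KL, the `temperature` parameter, and all function names are assumptions. The paper excerpt quoted further down mentions pre-computed sparse teacher logits, whereas this toy version uses dense ones.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position (assumed loss)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Toy 4-token vocabulary; the numbers are made up for illustration.
teacher = [2.0, 1.0, 0.1, -1.0]
matching_student = [2.0, 1.0, 0.1, -1.0]
poor_student = [-1.0, 0.1, 1.0, 2.0]

loss_match = distillation_loss(teacher, matching_student)
loss_poor = distillation_loss(teacher, poor_student)
print(f"matching student: {loss_match:.4f}, poor student: {loss_poor:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a student would be trained to minimize under this assumed objective.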

If this is right

Multiple model sizes can be generated without separate full trainings from scratch.
The resulting models satisfy diverse hardware and budget constraints at lower overall cost.
Accuracy performance stays competitive while expanding coverage of system requirements.
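
The cost and accuracy points above can be read as a Pareto-front question, which is also how the paper's Figure 3 caption frames the trade-off. The sketch below shows one way to extract the non-dominated models from a set of (cost, accuracy) pairs. The model names and numbers are hypothetical and are not taken from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is a (name, cost, accuracy) tuple. Lower cost and higher
    accuracy are better. A point is dominated if another point is at least
    as good on both axes and strictly better on one.
    """
    front = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda p: p[1])


# Hypothetical (name, relative cost, accuracy) values for illustration only.
candidates = [
    ("small-bf16", 1.0, 0.50),
    ("small-4bit", 0.4, 0.48),
    ("medium-bf16", 3.0, 0.60),
    ("medium-4bit", 1.2, 0.59),
    ("weak-model", 2.0, 0.45),
]

front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
print(front_names)
```

In this toy data the quantized variants land on the front as intermediate points between the full-precision models, while `weak-model` is dropped because `small-bf16` is both cheaper and more accurate.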

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could apply to other base LLMs for efficient family expansion without retraining everything.
It suggests potential for testing even smaller model sizes or varied quantization bit-widths on the same base.
The approach connects to broader questions of how data volume interacts with compression techniques in deployment.
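
To make the bit-width point concrete, here is a minimal sketch of symmetric round-to-nearest quantization evaluated at several bit-widths. This is a deliberately simple stand-in, not the GPTQ or QAD recipes named in the paper's figures, and the weights, scale granularity, and function names are invented for illustration.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric integer quantization, one scale per list."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up weights; real layers would have thousands of values per scale group.
weights = [0.31, -0.72, 0.05, 0.98, -0.44, 0.12, -0.09, 0.67]

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize_symmetric(weights, bits)
    errors[bits] = mean_squared_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit reconstruction MSE: {errors[bits]:.6f}")
```

Reconstruction error rises as the bit-width shrinks, which is the accuracy cost that more careful quantization recipes try to recover.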

Load-bearing premise

The base 8B model and the large permissive dataset are of sufficient quality that distillation plus quantization will reliably produce smaller models whose accuracy remains competitive without needing extensive new hyperparameter search or additional data curation.

What would settle it

A direct comparison showing that the distilled models underperform significantly on standard benchmarks compared to similarly sized models trained from scratch would falsify the cost-effectiveness claim.

Figures

Figures reproduced from arXiv: 2605.29128 by Andrei Panferov, Dan Alistarh, Davit Melikidze, Martin Jaggi.

**Figure 1.** Training loss curves of Apertus-v1.1 models. Dashed line shows the loss of the teacher model (Apertus-8B-2509).

**Figure 2.** Multilingual performance macro average during pretraining of Apertus-v1.1 models and for a number of similar-sized models. Distillation allows Apertus-v1.1 models to achieve competitive performance while training on up to an order of magnitude less compute.

**Figure 3.** Visualization of the cost-accuracy trade-off for Apertus and Apertus-v1.1 models. Base models (left) are compared based on validation loss while instruction-tuned models (right) are compared based on downstream performance. Quantized models both optimize the trade-off and add intermediate points to the Pareto fronts.

**Figure 4.** Apertus-v1.1 quantization recipe ablation.

**Figure 5.** The effect of weight averaging (WA) over the last few base model checkpoints on post-training quantization for various data-types and algorithms. Checkpoints were taken every 1000 iterations.
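
The weight averaging mentioned in the Figure 5 caption can be sketched as a plain element-wise mean over the last few checkpoints. The caption does not say how the average is weighted, so uniform weighting is an assumption here, as are the parameter names and the toy values.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints (assumed scheme).

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. All checkpoints must share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [
            sum(values) / n
            for values in zip(*(ckpt[name] for ckpt in checkpoints))
        ]
        for name in checkpoints[0]
    }


# Three toy "checkpoints" standing in for the last few saved model states.
last_checkpoints = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 2.0], "layer.bias": [0.3]},
    {"layer.weight": [2.0, 5.0], "layer.bias": [0.6]},
]

averaged = average_checkpoints(last_checkpoints)
print(averaged)
```

The averaged state would then be handed to the post-training quantization step in place of the final checkpoint.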

Original abstract

The wide adoption of LLMs has led to their use in a great variety of applications and scenarios, such as chatbot assistants and data annotation, creating the need for the models to satisfy certain budget and hardware constraints. This has led to the trend of LLMs being released in batches consisting of similar models of various sizes, so that the family of models adheres to as wide a range of constraints as possible. In this paper, we validate distillation and quantization as a cost-effective way to expand model families to new sizes and hardware formats. Based on the open-recipe Apertus 8B LLM, we produce Apertus-v1.1, a distilled family of models with up to 4B parameters trained on 1.7T permissive-license tokens. We demonstrate cost-efficiency and strong accuracy performance of our approach for covering large ranges of hardware and systems requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a routine engineering application of distillation and quantization to one open 8B model to produce smaller variants, with no numbers shown to support the performance claims.

The paper takes the open Apertus 8B checkpoint and applies distillation followed by quantization to create the Apertus-v1.1 family of models up to 4B parameters, trained on 1.7T permissive-license tokens. The goal is to cover a wider range of hardware constraints at lower cost than training from scratch.

What is actually new is the specific family and the exact combination of base model, token count, and size targets. The underlying techniques are already standard practice across labs.

The work does a clear job laying out a practical pipeline that emphasizes permissive data and open recipes, which helps with reproducibility for deployment-focused groups.

The main soft spot is the complete absence of numbers. The abstract asserts strong accuracy and cost-efficiency but gives no benchmarks, baselines, ablations, or error bars. Without those, the central claim cannot be checked. The stress-test point lands: if the 8B base underperforms peer 7-8B models on the same tasks, that weakness will carry into the distilled versions, and the paper offers no direct comparison to rule this out. The assumption that the 1.7T mix needs no extra curation also stays untested in the provided text.

This is for practitioners who need a concrete recipe for model-family expansion on open checkpoints. A reader already running similar pipelines might skim the details for implementation notes, but the lack of evidence limits broader interest.

I would not send it to peer review until the full manuscript supplies the missing tables and the numbers are shown to hold up against reasonable baselines.

Referee Report

2 major / 1 minor

Summary. The paper claims that distillation combined with quantization provides a cost-effective method to expand the open-recipe Apertus 8B LLM into the Apertus-v1.1 family of models (up to 4B parameters) trained on 1.7T permissive-license tokens, thereby covering a wide range of hardware and system constraints while maintaining strong accuracy without extensive additional hyperparameter search or data curation.

Significance. If the empirical results hold with proper controls, the work supplies a practical, reproducible recipe for model-family expansion from a single base checkpoint using only permissive data; this could reduce the compute barrier for creating size- and format-diverse LLM families. The emphasis on permissive tokens is a clear strength for open research.

major comments (2)

[Abstract] Abstract: the central claim that distillation plus quantization yields 'strong accuracy performance' and 'competitive' smaller models without 'extensive new hyperparameter search' is unsupported by any quantitative metrics, baselines, or ablation tables in the abstract. This absence directly undermines verification of the result.
[Introduction] Introduction / base-model description: no direct head-to-head evaluation of the Apertus 8B checkpoint against contemporaneous 7-8B models (e.g., Llama-3-8B, Mistral-7B) is reported on the same downstream benchmarks later used for the distilled variants. This comparison is load-bearing for the weakest assumption that the 8B base already encodes sufficiently general capabilities.

minor comments (1)

[Abstract] The abstract refers to 'Apertus-v1.1' but does not define the exact parameter counts or quantization formats of the released family members.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to improve verifiability of claims and grounding of the base model. We address each point below and will revise the manuscript to incorporate the suggested changes.

Referee: [Abstract] Abstract: the central claim that distillation plus quantization yields 'strong accuracy performance' and 'competitive' smaller models without 'extensive new hyperparameter search' is unsupported by any quantitative metrics, baselines, or ablation tables in the abstract. This absence directly undermines verification of the result.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revision we will expand the abstract to include key performance numbers (e.g., average benchmark scores for the 4B and 2B distilled models relative to the 8B teacher and to published baselines of similar size) while remaining within length limits. These numbers are already present in the experimental tables and can be summarized without new experiments. revision: yes

Referee: [Introduction] Introduction / base-model description: no direct head-to-head evaluation of the Apertus 8B checkpoint against contemporaneous 7-8B models (e.g., Llama-3-8B, Mistral-7B) is reported on the same downstream benchmarks later used for the distilled variants. This comparison is load-bearing for the weakest assumption that the 8B base already encodes sufficiently general capabilities.

Authors: The manuscript centers on the distillation-plus-quantization expansion procedure rather than a re-evaluation of the teacher. Nevertheless, the referee is correct that a direct comparison on the same suite would strengthen the narrative. We will add a compact table (or reference to existing public evaluations of Apertus 8B) in the introduction or experimental setup section that reports Apertus 8B alongside Llama-3-8B and Mistral-7B on the identical downstream tasks used for the distilled variants. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of distillation/quantization pipeline

The paper reports an empirical procedure: starting from an existing Apertus 8B checkpoint, apply distillation on 1.7T tokens to obtain smaller models, then quantize. No equations, fitted parameters, or uniqueness theorems are invoked. The central claim (cost-effective family expansion with competitive accuracy) is presented as the outcome of running the pipeline and measuring benchmarks, not as a quantity derived from itself by definition or by a self-citation chain. The base-model quality and token-corpus sufficiency are treated as external prerequisites rather than results proven inside the paper; their status does not create a self-referential loop. Consequently the derivation chain contains no load-bearing step that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly assumes the base model and data are adequate inputs.

pith-pipeline@v0.9.1-grok · 5680 in / 1030 out tokens · 23261 ms · 2026-06-29T13:25:09.418252+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 18 canonical work pages · 8 internal anchors

[1]

Apertus, P., Hernández-Cano, A., Hägele, A., Huang, A

URL https://arxiv.org/abs/2502.06761. Apertus, P., Hernández-Cano, A., Hägele, A., Huang, A. H., Romanou, A., Solergibert, A.-J., Pasztor, B., Messmer, B., Garbaya, D., Ďurech, E. F., Hakimi, I., Giraldo, J. G., Ismayilzada, M., Foroutan, N., Moalla, S., Chen, T., Sabolčec, V., Xu, Y., Aerni, M., AlKhamissi, B., Mariñas, I. A., Amani, M. H., Ansar...

work page arXiv
[2]

Trevor J

URL https://arxiv.org/abs/2509.14233. Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation,

work page arXiv
[3]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

URL https://arxiv.org/abs/1308.3432. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

URL https://arxiv.org/abs/1803.05457. Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485,

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

URL https://arxiv.org/abs/2210.17323. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Measuring Massive Multitask Language Understanding

URL https://arxiv.org/abs/2009.03300. Huang, A. H. and Schlag, I. Deriving activation functions using integration,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[7]

URL https://arxiv.org/abs/2411.13010. Lee, J. H., Shin, S., Kim, V., You, J., and Chen, A. Unifying block-wise ptq and distillation-based qat for progressive quantization toward 2-bit instruction-tuned llms,

work page arXiv
[8]

Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., and Chandra, V

URL https://arxiv.org/abs/2506.09104. Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., and Chandra, V. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,

work page arXiv
[9]

Ma, Shuming, Hongyu Wang, Lingxiao Ma, et al

URL https://arxiv.org/abs/2402.14905. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization,

work page arXiv
[10]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T. L., Bari, M. S., Shen, S., Yong, Z.-X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F., Almubarak, K., Albanie, S., Alyafeai, Z., Webson, A., Raff, E., and Raffel, C. Crosslingual generalization through multitask finetuning,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

The ademamix optimizer: Better, faster, older

Pagliardini, M., Ablin, P., and Grangier, D. The ademamix optimizer: Better, faster, older. In Yue, Y., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.), International Conference on Learning Representations, volume 2025, pp. 64715–64757,

2025
[12]

iclr.cc/paper_files/paper/2025/file/a2cf225ba392627529efef14dc857e22-Paper-Conference

URL https://proceedings.iclr.cc/paper_files/paper/2025/file/a2cf225ba392627529efef14dc857e22-Paper-Conference.pdf. Peng, H., Lv, X., Bai, Y., Yao, Z., Zhang, J., Hou, L., and Li, J. Pre-training distillation for large language models: A design space exploration,

2025
[13]

Ponti, E

URL https://arxiv.org/abs/2410.16215. Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., and Korhonen, A. Xcopa: A multilingual dataset for causal commonsense reasoning,

work page arXiv
[14]

Xcopa: A multilingual dataset for causal commonsense reasoning

URL https://arxiv.org/abs/2005.00333. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model,

work page arXiv 2005
[15]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

URL https://arxiv.org/abs/2305.18290. Romanou, A., Foroutan, N., Sotnikova, A., Nelaturu, S. H., Singh, S., Maheshwary, R., Altomare, M., Chen, Z., Haggag, M., Amayuelas, A., et al. Include: Evaluating multilingual language understanding with regional knowledge. In International Conference on Learning Representations, volume 2025, pp. 83291–83322,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

URL https://arxiv.org/abs/1907.10641. Singh, S., Romanou, A., Fourrier, C., Adelani, D. I., Ngui, J. G., Vila-Suero, D., Limkonchotiwat, P., Marchisio, K., Leong, W. Q., Susanto, Y., Ng, R., Longpre, S., Ko, W.-Y., Ruder, S., Smith, M., Bosselut, A., Oh, A., Martins, A. F. T., Choshen, L., Ippolito, D., Ferrante, E., Fadaee, M., Ermis, B., and Hooker, ...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[17]

In Findings of the Association for Computational Linguistics: ACL 2024, pages 14182–14214, Bangkok, Thailand

URL https://arxiv.org/abs/2412.03304. Xin, M., Priyadarshi, S., Xin, J., Kartal, B., Vavre, A., Thekkumpate, A. K., Chen, Z., Mahabaleshwarkar, A. S., Shahaf, I., Bercovich, A., Patel, K., Velury, S. V., Luo, C., Cheng, Z., Chen, J., Yu, C.-H., Ping, W., Rybakov, O., Tajbakhsh, N., Olabiyi,...

work page arXiv
[18]

Quantization-aware distillation for NVFP4 inference accuracy recovery

URL https://arxiv.org/abs/2601.20088. Yang, Y., Zhang, Y., Tar, C., and Baldridge, J. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natura...

work page arXiv 2019
[19]

doi: 10.18653/v1/D19-1382

Association for Computational Linguistics. doi: 10.18653/v1/D19-1382. URL https://aclanthology.org/D19-1382/. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence?,

work page doi:10.18653/v1/d19-1382
[20]

HellaSwag: Can a Machine Really Finish Your Sentence?

URL https://arxiv.org/abs/1905.07830.

Table 6. Additional hyper-parameters.

| Model | LR | GBS | Total Iterations |
|---|---|---|---|
| Apertus-v1.1-0.5B | 6e-4 | 512 | 800000 |
| Apertus-v1.1-1.5B | 3e-4 | 512 | 800000 |
| Apertus-v1.1-4B | 2e-4 | 1024 | 400000 |

A. Codebases. The full codebases for the pre-training distillation, post-training, eva...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[21]

and Multilingual HellaSwag (Dac Lai et al., 2023). C. Additional Hyper-Parameters C.1. Pre-Training Details Additional per-model pre-training hyper-parameters are shown in Table

2023
[22]

For base models, we use the same sequence length and batch size as in pre-training

with cosine LR schedule. For base models, we use the same sequence length and batch size as in pre-training. For instruction-tuned models, we use slightly larger batch size of 512-2048 to compensate for smaller length of some post-training sequences. Similar to pre-training distillation, we pre-compute and store the sparse logits from the teacher model (A...

2048
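
The Table 6 hyper-parameters quoted in entry [20] allow a quick sanity check against the 1.7T-token figure in the abstract. The sketch below multiplies global batch size by total iterations and by a sequence length. The sequence length of 4096 is an assumption, since the text available here does not state it, and so is the reading of GBS as a count of sequences per step.

```python
# Per-model values quoted from Table 6: (global batch size, total iterations).
table6 = {
    "Apertus-v1.1-0.5B": (512, 800_000),
    "Apertus-v1.1-1.5B": (512, 800_000),
    "Apertus-v1.1-4B": (1024, 400_000),
}

# Assumption: sequence length is not given in the excerpted text.
ASSUMED_SEQ_LEN = 4096


def sequences_seen(gbs, iterations):
    """Total training sequences, assuming GBS counts sequences per step."""
    return gbs * iterations


def tokens_seen(gbs, iterations, seq_len=ASSUMED_SEQ_LEN):
    """Total training tokens under the assumed sequence length."""
    return sequences_seen(gbs, iterations) * seq_len


tokens = {name: tokens_seen(*cfg) for name, cfg in table6.items()}
for name, count in tokens.items():
    print(f"{name}: {count / 1e12:.2f}T tokens")
```

Under these assumptions all three configurations see the same number of sequences, and the token total comes out close to the 1.7T stated in the abstract. A different sequence length would scale the total proportionally.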
