Apertus LLM Family Expansion via Distillation and Quantization
Pith reviewed 2026-06-29 13:25 UTC · model grok-4.3
The pith
Distillation and quantization expand an 8B LLM to a family of smaller models up to 4B parameters with competitive accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distillation and quantization applied to the base 8B model produce a distilled family of models with up to 4B parameters trained on 1.7T permissive tokens, demonstrating cost-efficiency and strong accuracy performance across hardware requirements.
What carries the argument
The distillation process from the base model combined with quantization to produce varied sizes and formats.
If this is right
- Multiple model sizes can be generated without separate full trainings from scratch.
- The resulting models satisfy diverse hardware and budget constraints at lower overall cost.
- Accuracy performance stays competitive while expanding coverage of system requirements.
Where Pith is reading between the lines
- This method could apply to other base LLMs for efficient family expansion without retraining everything.
- It suggests potential for testing even smaller model sizes or varied quantization bit-widths on the same base.
- The approach connects to broader questions of how data volume interacts with compression techniques in deployment.
Load-bearing premise
The base 8B model and the large permissive dataset are of sufficient quality that distillation plus quantization will reliably produce smaller models whose accuracy remains competitive without needing extensive new hyperparameter search or additional data curation.
What would settle it
A direct comparison showing that the distilled models underperform significantly on standard benchmarks compared to similarly sized models trained from scratch would falsify the cost-effectiveness claim.
Figures
read the original abstract
The wide adoption of LLMs has led to their use in great variety of applications and scenarios, such as chatbot assistants and data annotation, creating the need for the models to satisfy certain budget and hardware constraints. This has led to the trend of LLMs being released in batches consisting of similar models of various sizes for the family of models to adhere to as wide of a range of constraints as possible. In this paper, we validate distillation and quantization as a cost-effective way to expand model families to new sizes and hardware formats. Based on the open-recipe Apertus 8B LLM, we produce Apertus-v1.1 - a distilled family of models with up to 4B parameters trained on 1.7T permissive license tokens. We demonstrate cost-efficiency and strong accuracy performance of our approach for covering large ranges of hardware and systems requirements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that distillation combined with quantization provides a cost-effective method to expand the open-recipe Apertus 8B LLM into the Apertus-v1.1 family of models (up to 4B parameters) trained on 1.7T permissive-license tokens, thereby covering a wide range of hardware and system constraints while maintaining strong accuracy without extensive additional hyperparameter search or data curation.
Significance. If the empirical results hold with proper controls, the work supplies a practical, reproducible recipe for model-family expansion from a single base checkpoint using only permissive data; this could reduce the compute barrier for creating size- and format-diverse LLM families. The emphasis on permissive tokens is a clear strength for open research.
major comments (2)
- [Abstract] Abstract: the central claim that distillation plus quantization yields 'strong accuracy performance' and 'competitive' smaller models without 'extensive new hyperparameter search' is unsupported by any quantitative metrics, baselines, or ablation tables in the abstract. This absence directly undermines verification of the result.
- [Introduction] Introduction / base-model description: no direct head-to-head evaluation of the Apertus 8B checkpoint against contemporaneous 7-8B models (e.g., Llama-3-8B, Mistral-7B) is reported on the same downstream benchmarks later used for the distilled variants. This comparison is load-bearing for the weakest assumption that the 8B base already encodes sufficiently general capabilities.
minor comments (1)
- [Abstract] The abstract refers to 'Apertus-v1.1' but does not define the exact parameter counts or quantization formats of the released family members.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to improve verifiability of claims and grounding of the base model. We address each point below and will revise the manuscript to incorporate the suggested changes.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that distillation plus quantization yields 'strong accuracy performance' and 'competitive' smaller models without 'extensive new hyperparameter search' is unsupported by any quantitative metrics, baselines, or ablation tables in the abstract. This absence directly undermines verification of the result.
Authors: We agree that the abstract would benefit from explicit quantitative support. In the revision we will expand the abstract to include key performance numbers (e.g., average benchmark scores for the 4B and 2B distilled models relative to the 8B teacher and to published baselines of similar size) while remaining within length limits. These numbers are already present in the experimental tables and can be summarized without new experiments. revision: yes
-
Referee: [Introduction] Introduction / base-model description: no direct head-to-head evaluation of the Apertus 8B checkpoint against contemporaneous 7-8B models (e.g., Llama-3-8B, Mistral-7B) is reported on the same downstream benchmarks later used for the distilled variants. This comparison is load-bearing for the weakest assumption that the 8B base already encodes sufficiently general capabilities.
Authors: The manuscript centers on the distillation-plus-quantization expansion procedure rather than a re-evaluation of the teacher. Nevertheless, the referee is correct that a direct comparison on the same suite would strengthen the narrative. We will add a compact table (or reference to existing public evaluations of Apertus 8B) in the introduction or experimental setup section that reports Apertus 8B alongside Llama-3-8B and Mistral-7B on the identical downstream tasks used for the distilled variants. revision: yes
Circularity Check
No circularity: empirical validation of distillation/quantization pipeline
full rationale
The paper reports an empirical procedure: starting from an existing Apertus 8B checkpoint, apply distillation on 1.7T tokens to obtain smaller models, then quantize. No equations, fitted parameters, or uniqueness theorems are invoked. The central claim (cost-effective family expansion with competitive accuracy) is presented as the outcome of running the pipeline and measuring benchmarks, not as a quantity derived from itself by definition or by a self-citation chain. The base-model quality and token-corpus sufficiency are treated as external prerequisites rather than results proven inside the paper; their status does not create a self-referential loop. Consequently the derivation chain contains no load-bearing step that reduces to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Apertus, P., Hern´andez-Cano, A., H¨agele, A., Huang, A
URL https:// arxiv.org/abs/2502.06761. Apertus, P., Hern´andez-Cano, A., H¨agele, A., Huang, A. H., Romanou, A., Solergibert, A.-J., Pasztor, B., Messmer, B., Garbaya, D., ˇDurech, E. F., Hakimi, I., Giraldo, J. G., Ismayilzada, M., Foroutan, N., Moalla, S., Chen, T., Sabolˇcec, V ., Xu, Y ., Aerni, M., AlKhamissi, B., Mari˜nas, I. A., Amani, M. H., Ansar...
- [2]
-
[3]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
URL https://arxiv. org/abs/1308.3432. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
URL https://arxiv.org/abs/ 1803.05457. Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V . Xnli: Evaluating cross-lingual sentence representations. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485,
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[5]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
URL https://arxiv. org/abs/2210.17323. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Measuring Massive Multitask Language Understanding
URL https: //arxiv.org/abs/2009.03300. Huang, A. H. and Schlag, I. Deriving activation functions using integration,
work page internal anchor Pith review Pith/arXiv arXiv 2009
- [7]
-
[8]
URLhttps://arxiv.org/abs/2506.09104. Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y ., Fedorov, I., Xiong, Y ., Chang, E., Shi, Y ., Krishnamoorthi, R., Lai, L., and Chandra, V . Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,
-
[9]
Ma, Shuming, Hongyu Wang, Lingxiao Ma, et al
URLhttps://arxiv.org/abs/2402.14905. Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization,
-
[10]
Decoupled Weight Decay Regularization
URL https://arxiv.org/abs/ 1711.05101. Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Bi- derman, S., Scao, T. L., Bari, M. S., Shen, S., Yong, Z.-X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F., Al- mubarak, K., Albanie, S., Alyafeai, Z., Webson, A., Raff, E., and Raffel, C. Crosslingual generalization through multitask finetuning,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
The ademamix optimizer: Better, faster, older
Pagliardini, M., Ablin, P., and Grangier, D. The ademamix optimizer: Better, faster, older. In Yue, Y ., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.),International Con- ference on Learning Representations, volume 2025, pp. 64715–64757,
2025
-
[12]
iclr.cc/paper_files/paper/2025/file/ a2cf225ba392627529efef14dc857e22-Paper-Conference
URL https://proceedings. iclr.cc/paper_files/paper/2025/file/ a2cf225ba392627529efef14dc857e22-Paper-Conference. pdf. Peng, H., Lv, X., Bai, Y ., Yao, Z., Zhang, J., Hou, L., and Li, J. Pre-training distillation for large language models: A design space exploration,
2025
- [13]
-
[14]
Xcopa: A multilingual dataset for causal commonsense reasoning
URL https: //arxiv.org/abs/2005.00333. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model,
-
[15]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
URL https://arxiv.org/abs/2305.18290. Romanou, A., Foroutan, N., Sotnikova, A., Nelaturu, S. H., Singh, S., Maheshwary, R., Altomare, M., Chen, Z., Hag- gag, M., Amayuelas, A., et al. Include: Evaluating multi- lingual language understanding with regional knowledge. InInternational Conference on Learning Representations, volume 2025, pp. 83291–83322,
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
URL https://arxiv.org/abs/ 1907.10641. Singh, S., Romanou, A., Fourrier, C., Adelani, D. I., Ngui, J. G., Vila-Suero, D., Limkonchotiwat, P., Marchisio, K., Leong, W. Q., Susanto, Y ., Ng, R., Longpre, S., Ko, W.-Y ., Ruder, S., Smith, M., Bosselut, A., Oh, A., Martins, A. F. T., Choshen, L., Ippolito, D., Ferrante, E., Fadaee, M., Ermis, B., and Hooker, ...
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[17]
URL https://arxiv.org/abs/2412.03304. Xin, M., Priyadarshi, S., Xin, J., Kartal, B., Vavre, A., Thekkumpate, A. K., Chen, Z., Mahabaleshwarkar, A. S., Shahaf, I., Bercovich, A., Patel, K., Velury, S. V ., Luo, C., Cheng, Z., Chen, J., Yu, C.-H., Ping, W., Rybakov, 7 Apertus LLM Family Expansion via Distillation and Quantization O., Tajbakhsh, N., Olabiyi,...
-
[18]
Quantization-aware distillation for NVFP4 inference accuracy recovery
URLhttps://arxiv.org/abs/2601.20088. Yang, Y ., Zhang, Y ., Tar, C., and Baldridge, J. PAWS-X: A cross-lingual adversarial dataset for paraphrase iden- tification. In Inui, K., Jiang, J., Ng, V ., and Wan, X. (eds.),Proceedings of the 2019 Conference on Empir- ical Methods in Natural Language Processing and the 9th International Joint Conference on Natura...
-
[19]
Association for Compu- tational Linguistics. doi: 10.18653/v1/D19-1382. URL https://aclanthology.org/D19-1382/. Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence?,
-
[20]
HellaSwag: Can a Machine Really Finish Your Sentence?
URL https://arxiv.org/abs/ 1905.07830. 8 Apertus LLM Family Expansion via Distillation and Quantization Table 6.Additional hyper-parameters. Model LR GBS Total Iterations Apertus-v1.1-0.5B 6e-4 512 800000 Apertus-v1.1-1.5B 3e-4 512 800000 Apertus-v1.1-4B 2e-4 1024 400000 A. Codebases The full codebases for the pre-training distillation, post-training, eva...
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[21]
and Multilingual HellaSwag (Dac Lai et al., 2023). C. Additional Hyper-Parameters C.1. Pre-Training Details Additional per-model pre-training hyper-parameters are shown in Table
2023
-
[22]
For base models, we use the same sequence length and batch size as in pre-training
with cosine LR schedule. For base models, we use the same sequence length and batch size as in pre-training. For instruction-tuned models, we use slightly larger batch size of 512-2048 to compensate for smaller length of some post-training sequences. Similar to pre-training distillation, we pre-compute and store the sparse logits from the teacher model (A...
2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.