Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3
The pith
A capacity-aware mixture law fitted on small models predicts optimal data mixes for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAMEL expresses validation loss through the nonlinear interaction of model capacity and data-source mixture weights, allowing the optimal mixture for a large target model to be identified from fits performed only on smaller models up to 7B parameters.
What carries the argument
The CAMEL capacity-aware mixture law that captures nonlinear dependence of validation loss on model size and mixture proportions.
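The review and abstract describe the law only qualitatively, so the block below is an illustrative guess at what a capacity-aware mixture law could look like: a power law in model size N whose irreducible loss, coefficient, and exponent all depend on the mixture weights r_i. Every symbol and the affine parameterization are assumptions of this sketch, not the authors' definition.

```latex
% Hypothetical capacity-aware mixture law (not taken from the paper).
% Letting the exponent depend on the mixture is one way to realize the
% "nonlinear interplay" between model size and mixture the abstract names.
\hat{L}(N, r) = E(r) + \frac{A(r)}{N^{\alpha(r)}},
\qquad
E(r) = e_0 + \sum_{i=1}^{K} e_i r_i,
\quad
A(r) = \exp\!\Big(a_0 + \sum_{i=1}^{K} a_i r_i\Big),
\quad
\alpha(r) = \alpha_0 + \sum_{i=1}^{K} \alpha_i r_i .
```

Under a form like this, the loss-minimizing mixture r*(N) shifts with N because the exponent itself depends on r, which is exactly the capacity awareness the core claim turns on.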
If this is right
- Mixture optimization can be completed at roughly half the usual compute cost.
- Validation loss serves as a reliable proxy for final benchmark performance via the added prediction law (a hedged sketch of one such mapping follows this list).
- Compute can be allocated across model scales to reduce prediction error for the target size.
- The extrapolated mixture improves downstream results when used on the 55B model.
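One concrete, hedged reading of the loss-to-benchmark bullet above is a saturating map from validation loss to accuracy. The sketch below is a minimal Python illustration; the sigmoid form, the parameter names, and every numeric value are placeholders chosen for this example, not quantities from the paper.

```python
import numpy as np

def predicted_accuracy(val_loss, ceiling=0.90, floor=0.25, midpoint=2.6, slope=6.0):
    """Hypothetical loss-to-benchmark law: accuracy approaches `ceiling` as
    validation loss falls and bottoms out near a chance-level `floor`.
    All parameter values are placeholders, not fitted numbers."""
    return floor + (ceiling - floor) / (1.0 + np.exp(slope * (val_loss - midpoint)))

# Chaining the two laws: CAMEL would predict a validation loss for the target
# model under a candidate mixture, and this map turns that loss into an
# end-to-end benchmark estimate.
for loss in (2.9, 2.6, 2.3):
    print(f"val loss {loss:.1f} -> predicted accuracy {predicted_accuracy(loss):.2f}")
```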
Where Pith is reading between the lines
- The same fitting strategy could be tested on other training variables such as sequence length or learning-rate schedules.
- If the law remains accurate at still larger scales, repeated small-model experiments could replace most large-scale mixture trials.
- Analogous capacity-aware laws might be derived for vision-language or multimodal data mixtures.
Load-bearing premise
The functional form and parameters fitted on models up to 7B parameters continue to hold without large error when applied to a 55B target model.
What would settle it
Train the 55B target model once with the CAMEL-predicted mixture and once with a mixture obtained by direct search or prior methods, then compare the resulting validation loss and benchmark accuracies.
Original abstract
A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a compute-efficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, we reduce mixture optimization costs by 50% and improve downstream benchmark performance by up to 3%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAMEL, a capacity-aware mixture law modeling validation loss via nonlinear interplay between model size and data mixture proportions, plus a loss-to-benchmark prediction law. These are fitted on small MoE models (up to 7B-A150M) to optimize mixtures and extrapolate to a 55B-A1.2B target, claiming 50% lower optimization costs and up to 3% benchmark gains versus prior methods.
Significance. If the extrapolation and functional forms prove robust, the pipeline could meaningfully cut the high cost of data-mixture search for large LLMs. The capacity-aware nonlinear terms address a known weakness of existing scaling laws. However, the absence of explicit equations, fitted parameters, error bars, and direct large-scale validation makes it difficult to judge whether the reported gains are reliable or artifacts of the small-scale regime.
major comments (2)
- [Method / CAMEL definition] The explicit functional form of the CAMEL law (including how model size and mixture interact nonlinearly) is not stated, nor are the fitted parameters, fitting procedure, or error metrics from the small-model runs (up to 7B-A150M). This is load-bearing for the extrapolation claim to 55B-A1.2B.
- [Experiments / Results] No details are given on the actual training runs or metrics at the 55B-A1.2B target scale that 'verify' the extrapolated mixture; the 50% cost reduction and 3% benchmark lift lack baselines, statistical significance, or ablation of the loss-to-benchmark predictor.
minor comments (1)
- [Abstract] The abstract refers to 'nonlinear interplay' without an equation or pointer to the defining section; adding a brief reference would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional clarity will strengthen the paper. We address each major comment below and will incorporate the requested details into the revised manuscript.
Point-by-point responses
Referee: [Method / CAMEL definition] The explicit functional form of the CAMEL law (including how model size and mixture interact nonlinearly) is not stated, nor are the fitted parameters, fitting procedure, or error metrics from the small-model runs (up to 7B-A150M). This is load-bearing for the extrapolation claim to 55B-A1.2B.
Authors: We agree that the explicit functional form, parameters, fitting procedure, and error metrics are essential for assessing the extrapolation. In the revision we will add the full mathematical definition of CAMEL (capturing the nonlinear capacity-aware interaction between model size and mixture proportions), the specific fitted parameter values obtained from the small-model experiments, the exact fitting procedure (nonlinear regression on validation losses), and quantitative error metrics such as RMSE on held-out data. These additions will directly support the reliability of the extrapolation to the 55B-A1.2B target. A hedged sketch of what such a fitting step could look like appears after these responses. Revision: yes.
Referee: [Experiments / Results] No details are given on the actual training runs or metrics at the 55B-A1.2B target scale that 'verify' the extrapolated mixture; the 50% cost reduction and 3% benchmark lift lack baselines, statistical significance, or ablation of the loss-to-benchmark predictor.
Authors: We will expand the experimental section to include full details on the 55B-A1.2B training runs (compute budget, training steps, hardware), the achieved validation loss and downstream benchmark scores, explicit baseline comparisons (uniform mixture and prior scaling-law methods), statistical significance where multiple runs are available, and a dedicated ablation isolating the contribution of the loss-to-benchmark predictor. These additions will substantiate the reported 50% cost reduction and up to 3% benchmark gains. Revision: yes.
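As a concrete reading of the fitting procedure promised in the first response, the sketch below runs nonlinear least squares over (model size, mixture, loss) observations and reports RMSE on a held-out scale. It reuses the illustrative functional form from earlier in this review, generates synthetic data rather than the paper's measurements, and should be read as the shape of the pipeline, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical capacity-aware law with a single free mixture weight r
# (the second source implicitly gets weight 1 - r). Not the paper's equation.
def camel_loss(X, e0, e1, a0, a1, al0, al1):
    N, r = X
    return (e0 + e1 * r) + np.exp(a0 + a1 * r) / N ** (al0 + al1 * r)

rng = np.random.default_rng(0)

# Synthetic "small-model" grid: three model sizes x four mixture weights.
sizes = np.repeat([1e8, 1e9, 7e9], 4)
mixes = np.tile([0.2, 0.4, 0.6, 0.8], 3)
true_params = (1.6, -0.1, 3.0, 0.5, 0.28, 0.05)
obs = camel_loss((sizes, mixes), *true_params) + rng.normal(0.0, 0.005, sizes.size)

# Fit on the two smaller scales; hold out the largest scale as an RMSE check,
# mirroring the idea of validating the law before extrapolating further.
train = sizes < 7e9
popt, _ = curve_fit(camel_loss, (sizes[train], mixes[train]), obs[train],
                    p0=[1.5, 0.0, 3.0, 0.0, 0.3, 0.0], maxfev=50_000)

pred = camel_loss((sizes[~train], mixes[~train]), *popt)
rmse = np.sqrt(np.mean((pred - obs[~train]) ** 2))
print(f"held-out RMSE at the largest fitted scale: {rmse:.4f}")
```

The same fitted parameters would then be evaluated at the target model size to rank candidate mixtures before any large-scale training run.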
Circularity Check
No circularity: CAMEL law is a fitted empirical model with explicit extrapolation and verification
Full rationale
The derivation fits a capacity-aware functional form to validation loss on small-scale runs (≤7B-A150M), then extrapolates the resulting optimal mixture to a 55B-A1.2B target and verifies performance. This is a standard scaling-law pipeline, not a self-definitional construction, a fitted input relabeled as a prediction, or an argument that reduces to self-citation. No equation collapses the large-model output to the small-model inputs by construction, and the paper reports an independent verification step on the target model.
Axiom & Free-Parameter Ledger
free parameters (1)
- CAMEL law parameters
axioms (1)
- Domain assumption: validation loss can be expressed as a deterministic nonlinear function of model size and mixture proportions.