pith. machine review for the scientific record.

arxiv: 2603.08022 · v2 · submitted 2026-03-09 · 💻 cs.LG

Recognition: no theorem link

Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords data mixture optimization · scaling laws · large language models · validation loss · mixture of experts · extrapolation

The pith

A capacity-aware mixture law fitted on small models predicts optimal data mixes for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAMEL to model how validation loss depends on both model size and the proportions of different training data sources. By fitting the law using runs on smaller models, the approach avoids the expense of searching mixtures directly on the target large model. A second law converts predicted validation loss into expected accuracy on downstream benchmarks. The authors determine how to spread a fixed compute budget across model sizes to keep extrapolation error low. When used to set the mixture for a 55B target, the method cuts optimization cost in half and raises benchmark scores by up to 3 percent.
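To make the pipeline concrete, the sketch below fits per-domain loss curves on synthetic small-model runs and then optimizes the mixture for a hypothetical 55B target. The functional form, domain names, and every number here are illustrative assumptions of this review, not the paper's CAMEL equation, which is not reproduced on this page.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

DOMAINS = ["math", "code", "knowledge"]

# Illustrative stand-in for a capacity-aware per-domain loss law
# (NOT the paper's CAMEL equation): L_j(M, r_j) = E + A / (M**alpha * r_j**beta).
def domain_law(X, E, A, alpha, beta):
    M, r_j = X
    return E + A / (M ** alpha * r_j ** beta)

# Synthetic small-model runs: three model sizes, three mixtures each (all numbers invented).
M_runs = np.repeat([3e8, 1e9, 7e9], 3)
r_runs = np.tile([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]], (3, 1))

# Generate per-domain validation losses from the same family, with noise,
# purely so the fitting code below has something consistent to recover.
rng = np.random.default_rng(0)
true_params = {"math": (1.0, 18.0, 0.12, 0.30),
               "code": (0.8, 22.0, 0.10, 0.25),
               "knowledge": (1.2, 14.0, 0.15, 0.35)}
losses = {d: domain_law((M_runs, r_runs[:, j]), *true_params[d])
             + rng.normal(0.0, 0.01, len(M_runs))
          for j, d in enumerate(DOMAINS)}

# Step 1: fit the law per domain on the small-model runs.
fits = {}
for j, d in enumerate(DOMAINS):
    popt, _ = curve_fit(domain_law, (M_runs, r_runs[:, j]), losses[d],
                        p0=[1.0, 10.0, 0.1, 0.3], maxfev=50000)
    fits[d] = popt

# Step 2: pick the mixture that minimizes the predicted average loss at the 55B target.
M_TARGET = 55e9

def predicted_target_loss(r):
    return float(np.mean([domain_law((M_TARGET, r[j]), *fits[d])
                          for j, d in enumerate(DOMAINS)]))

res = minimize(predicted_target_loss,
               x0=np.full(len(DOMAINS), 1.0 / len(DOMAINS)),
               bounds=[(0.02, 0.96)] * len(DOMAINS),
               constraints=({"type": "eq", "fun": lambda r: r.sum() - 1.0},))
print("predicted optimal 55B mixture:", dict(zip(DOMAINS, np.round(res.x, 3))))
```

The key design point the paper relies on is that the expensive step (training the target model) never appears inside the optimization loop; only the fitted law is queried at the target size.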

Core claim

CAMEL expresses validation loss through the nonlinear interaction of model capacity and data-source mixture weights, allowing the optimal mixture for a large target model to be identified from fits performed only on smaller models up to 7B parameters.

What carries the argument

The CAMEL capacity-aware mixture law that captures nonlinear dependence of validation loss on model size and mixture proportions.

If this is right

  • Mixture optimization can be completed at roughly half the usual compute cost.
  • Validation loss serves as a reliable proxy for final benchmark performance via the added prediction law (a hedged sketch of what such a mapping could look like follows this list).
  • Compute can be allocated across model scales to reduce prediction error for the target size.
  • The extrapolated mixture improves downstream results when used on the 55B model.
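The second bullet leans on the loss-to-benchmark prediction law. Its exact form is not reproduced on this page; the sketch below shows one common shape such a mapping could take, a saturating (sigmoid) function of per-domain validation losses, with every number invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical loss-to-benchmark mapping (an assumption of this sketch, not the paper's law):
# accuracy = floor + (ceil - floor) * sigmoid(b - sum_j w_j * loss_j)
def acc_from_losses(L, w1, w2, w3, b, floor, ceil):
    z = b - (w1 * L[:, 0] + w2 * L[:, 1] + w3 * L[:, 2])
    return floor + (ceil - floor) / (1.0 + np.exp(-z))

# Invented small-run observations: per-domain validation losses and a benchmark accuracy.
L_small = np.array([[3.2, 3.4, 3.0],
                    [2.9, 3.1, 2.7],
                    [2.6, 2.9, 2.5],
                    [2.4, 2.7, 2.3],
                    [2.2, 2.5, 2.1],
                    [2.0, 2.3, 1.9],
                    [1.8, 2.1, 1.7],
                    [1.6, 1.9, 1.5]])
acc_small = np.array([0.27, 0.29, 0.33, 0.38, 0.45, 0.54, 0.64, 0.72])

popt, _ = curve_fit(acc_from_losses, L_small, acc_small,
                    p0=[1.0, 1.0, 1.0, 5.0, 0.2, 0.9], maxfev=50000)

# Feed in the losses the mixture law forecasts at the target scale
# to get an end-to-end benchmark-accuracy estimate.
L_target = np.array([[1.4, 1.7, 1.3]])
print("predicted target benchmark accuracy:", acc_from_losses(L_target, *popt))
```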

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fitting strategy could be tested on other training variables such as sequence length or learning-rate schedules.
  • If the law remains accurate at still larger scales, repeated small-model experiments could replace most large-scale mixture trials.
  • Analogous capacity-aware laws might be derived for vision-language or multimodal data mixtures.

Load-bearing premise

The functional form and parameters fitted on models up to 7B parameters continue to hold without large error when applied to a 55B target model.
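One inexpensive way to probe this premise before spending target-scale compute is a held-out extrapolation check: fit the law only on the smaller scales and measure the error on the largest small-scale runs. The sketch below does exactly that with an invented stand-in law and synthetic numbers; it illustrates the protocol, not the paper's actual fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic runs (invented): model size M, weight r of one domain, and validation loss.
M = np.repeat([3e8, 1e9, 7e9], 3)
r = np.tile([0.2, 0.4, 0.6], 3)
loss = np.array([3.12, 2.85, 2.70,   # 0.3B runs
                 2.84, 2.60, 2.47,   # 1B runs
                 2.45, 2.27, 2.17])  # 7B runs, held out below

# Illustrative stand-in law (not the paper's CAMEL form).
def law(X, E, A, alpha, beta):
    M, r = X
    return E + A / (M ** alpha * r ** beta)

# Fit on the two smaller scales only, then score extrapolation to the held-out scale.
fit, held = M < 7e9, M >= 7e9
popt, _ = curve_fit(law, (M[fit], r[fit]), loss[fit],
                    p0=[1.0, 10.0, 0.1, 0.2], maxfev=50000)
pred = law((M[held], r[held]), *popt)
rmse = np.sqrt(np.mean((pred - loss[held]) ** 2))
print(f"held-out extrapolation RMSE at the largest fitted scale: {rmse:.3f}")
```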

What would settle it

Train the 55B target model once with the CAMEL-predicted mixture and once with a mixture obtained by direct search or prior methods, then compare the resulting validation loss and benchmark accuracies.

Figures

Figures reproduced from arXiv: 2603.08022 by Jingwei Li, Jingzhao Zhang, Xinran Gu.

Figure 1: Mixture optimization on the target model under different compute budgets. We evaluate different mixture extrapolation methods by applying them to a larger target model with varying training FLOPs as mixture optimization costs. CAMEL, our proposed method, identifies high-quality data mixtures with even less than the cost of one full training pass on the target model. As the optimization budget increases, C… view at source ↗

Figure 2: End-to-end framework for data mixture extrapolation under model scaling. We first fit a loss-to-benchmark mapping to relate validation loss to downstream benchmark accuracy (Section 2.2). We then model validation loss as a function of model size and data mixtures using sampled (M, r) pairs from smaller models (Section 2.1). These components together enable extrapolation to large models and direct optimizat… view at source ↗

Figure 3: Training loss observations for each domain across model sizes. We train on a mixed dataset of math and knowledge and log the training loss for each domain. While larger models reduce loss in both areas, the rates of reduction differ significantly. This non-uniform scaling implies that the effective parameters allocated to each domain are redistributed dynamically rather than proportionally as the model sca… view at source ↗

Figure 4: Error of loss-to-benchmark prediction. We model each downstream benchmark accuracy as a function of multiple validation losses. The scatter plots show predicted versus ground-truth scores on training and validation splits. The low prediction error demonstrates that validation losses can reliably predict downstream benchmark accuracy. See Appendix B.3 for details and results on other benchmarks… view at source ↗

Figure 5: Comparison between CAMEL and baseline scaling laws. We compare the fitting error of our proposed Capacity-Aware Mixture Law (CAMEL) with two baseline methods, DML (Ye et al., 2025) and SODM (Shukor et al., 2025). CAMEL achieves consistently lower fitting error and exhibits more stable extrapolation behavior across model scales. view at source ↗

Figure 6: Sampling strategies for fitting the scaling law. We illustrate several strategies for selecting training configurations to fit the scaling law. Each subfigure corresponds to one sampling strategy, showing how combinations of model size and data mixture are chosen. Each circle represents one sampling point, and the circle area is proportional to the model size… view at source ↗

Figure 7: Comparison of sampling strategies under fixed compute. For each expression, five strategies are compared under varying mixture optimization cost. The Hourglass strategy consistently achieves the lowest prediction error… view at source ↗

Figure 8: Performance on held-out benchmarks. Our method achieves the highest average accuracy on benchmarks not used during optimization, indicating strong generalization beyond the proxy objectives. This observation is consistent with prior work (Mizrahi et al., 2025) that mixtures optimized on diverse benchmarks generalize to unseen tasks… view at source ↗

Figure 9: CAMEL-derived mixtures for a balanced target at different model sizes. As model size increases, the optimal weight on Knowledge increases, while those on Math and Code decrease. This suggests that larger models absorb general knowledge more efficiently, so knowledge data should be given more weight at larger scales. This finding provides valuable guidance for determining the optimal data mixture for… view at source ↗

Figure 10: Ablation study of intrinsic domains. We vary the number of intrinsic domains k and evaluate the validation-loss prediction error. When k is small, the model has limited capacity, resulting in higher prediction error. As k increases, the error decreases and reaches its minimum around k = 5. Further increasing k leads to higher error, indicating overfitting and reduced robustness… view at source ↗

Figure 11: Predictions of all benchmarks. This figure shows predicted versus observed accuracies on all benchmarks. The close alignment indicates strong generalization of the proposed law. view at source ↗
read the original abstract

A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a compute-efficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, we reduce mixture optimization costs by 50% and improve downstream benchmark performance by up to 3%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CAMEL, a capacity-aware mixture law modeling validation loss via nonlinear interplay between model size and data mixture proportions, plus a loss-to-benchmark prediction law. These are fitted on small MoE models (up to 7B-A150M) to optimize mixtures and extrapolate to a 55B-A1.2B target, claiming 50% lower optimization costs and up to 3% benchmark gains versus prior methods.

Significance. If the extrapolation and functional forms prove robust, the pipeline could meaningfully cut the high cost of data-mixture search for large LLMs. The capacity-aware nonlinear terms address a known weakness of existing scaling laws. However, the absence of explicit equations, fitted parameters, error bars, and direct large-scale validation makes it difficult to judge whether the reported gains are reliable or artifacts of the small-scale regime.

major comments (2)
  1. [Method / CAMEL definition] The explicit functional form of the CAMEL law (including how model size and mixture interact nonlinearly) is not stated, nor are the fitted parameters, fitting procedure, or error metrics from the small-model runs (up to 7B-A150M). This is load-bearing for the extrapolation claim to 55B-A1.2B.
  2. [Experiments / Results] No details are given on the actual training runs or metrics at the 55B-A1.2B target scale that 'verify' the extrapolated mixture; the 50% cost reduction and 3% benchmark lift lack baselines, statistical significance, or ablation of the loss-to-benchmark predictor.
minor comments (1)
  1. [Abstract] The abstract refers to 'nonlinear interplay' without an equation or pointer to the defining section; adding a brief reference would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity will strengthen the paper. We address each major comment below and will incorporate the requested details into the revised manuscript.

read point-by-point responses
  1. Referee: [Method / CAMEL definition] The explicit functional form of the CAMEL law (including how model size and mixture interact nonlinearly) is not stated, nor are the fitted parameters, fitting procedure, or error metrics from the small-model runs (up to 7B-A150M). This is load-bearing for the extrapolation claim to 55B-A1.2B.

    Authors: We agree that the explicit functional form, parameters, fitting procedure, and error metrics are essential for assessing the extrapolation. In the revision we will add the full mathematical definition of CAMEL (capturing the nonlinear capacity-aware interaction between model size and mixture proportions), the specific fitted parameter values obtained from the small-model experiments, the exact fitting procedure (nonlinear regression on validation losses), and quantitative error metrics such as RMSE on held-out data. These additions will directly support the reliability of the extrapolation to the 55B-A1.2B target. revision: yes

  2. Referee: [Experiments / Results] No details are given on the actual training runs or metrics at the 55B-A1.2B target scale that 'verify' the extrapolated mixture; the 50% cost reduction and 3% benchmark lift lack baselines, statistical significance, or ablation of the loss-to-benchmark predictor.

    Authors: We will expand the experimental section to include full details on the 55B-A1.2B training runs (compute budget, training steps, hardware), the achieved validation loss and downstream benchmark scores, explicit baseline comparisons (uniform mixture and prior scaling-law methods), statistical significance where multiple runs are available, and a dedicated ablation isolating the contribution of the loss-to-benchmark predictor. These additions will substantiate the reported 50% cost reduction and up to 3% benchmark gains. revision: yes

Circularity Check

0 steps flagged

No circularity: CAMEL law is a fitted empirical model with explicit extrapolation and verification

full rationale

The derivation fits a capacity-aware functional form to validation loss on small-scale runs (≤7B-A150M), then extrapolates the resulting optimal mixture to a 55B-A1.2B target and verifies performance. This is a standard scaling-law pipeline and shows none of the usual circularity patterns: no self-definitional loop, no fitted input renamed as a prediction, no reliance on self-citation. No equations collapse the large-model output to the small-model inputs by construction, and the paper reports an independent verification step on the target model.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The abstract provides no explicit functional form or parameter list; the law necessarily introduces fitted parameters whose number and identifiability cannot be audited from the given text.

free parameters (1)
  • CAMEL law parameters
    Parameters of the capacity-aware mixture law are fitted to validation loss from small-model runs.
axioms (1)
  • domain assumption: Validation loss can be expressed as a deterministic nonlinear function of model size and mixture proportions.
    Core modeling assumption stated in the proposal of CAMEL.

pith-pipeline@v0.9.0 · 5498 in / 1246 out tokens · 43400 ms · 2026-05-15T14:19:29.927885+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 14 internal anchors

  1. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

  2. Bhagia, A., Liu, J., Wettig, A., Heineman, D., Tafjord, O., Jha, A. H., Soldaini, L., Smith, N. A., Groeneveld, D., Koh, P. W., et al. Establishing task scaling laws via compute-efficient model ladders. arXiv preprint arXiv:2412.04403.

  3. Blakeney, C., Paul, M., Larsen, B. W., Owen, S., and Frankle, J. Does your data spark joy? Performance gains from domain upsampling at the end of training. arXiv preprint arXiv:2406.03476.

  4. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  5. Chen, M. F., Murray, T., Heineman, D., Jordan, M., Hajishirzi, H., Ré, C., Soldaini, L., and Lo, K. Olmix: A framework for data mixing throughout LM development. arXiv preprint arXiv:2602.12237.

  6. Chen, Z., Cano, A. H., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Köpf, A., Mohtashami, A., et al. Meditron-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.

  7. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

  8. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  9. Colombo, P., Pires, T. P., Boudiaf, M., Culver, D., Melo, R., Corro, C., Martins, A. F., Esposito, F., Raposo, V. L., Morgado, S., et al. SaulLM-7B: A pioneering large language model for law. arXiv preprint arXiv:2403.03883.

  10. Diao, S., Yang, Y., Fu, Y., Dong, X., Su, D., Kliegl, M., Chen, Z., Belcak, P., Suhara, Y., Yin, H., et al. Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161.

  11. Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., et al. Language models scale reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540.

  12. Ge, C., Ma, Z., Chen, D., Li, Y., and Ding, B. Bimix: A bivariate data mixing law for language model pretraining. arXiv preprint arXiv:2405.14908.

  13. Gu, X., Lyu, K., Li, J., and Zhang, J. Data mixing can induce phase transitions in knowledge acquisition. arXiv preprint arXiv:2505.18091.

  14. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

  15. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  16. Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.

  17. Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.

  18. Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

  19. Kang, F., Sun, Y., Wen, B., Chen, S., Song, D., Mahmood, R., and Jia, R. Autoscale: Scale-aware data mixing for pre-training LLMs. arXiv preprint arXiv:2407.20177.

  20. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  21. Li, H., Zhang, Y., Koto, F., Yang, Y., Zhao, H., Gong, Y., Duan, N., and Baldwin, T. CMMLU: Measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 11260–11285.

  22. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a. · Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982.

  23. Liu, Q., Zheng, X., Muennighoff, N., Zeng, G., Dou, L., Pang, T., Jiang, J., and Lin, M. Regmix: Data mixture as regression for language model pre-training. In The Thirteenth International Conference on Learning Representations, 2024b. · Lourie, N., Hu, M. Y., and Cho, K. Scaling laws a…

  24. Mizrahi, D., Larsen, A. B. L., Allardice, J., Petryk, S., Gorokhov, Y., Li, J., Fang, A., Gardner, J., Gunter, T., and Dehghan, A. Language models improve when pretraining data matches target tasks. arXiv preprint arXiv:2507.12466.

  25. Shukor, M., Bethune, L., Busbridge, D., Grangier, D., Fini, E., El-Nouby, A., and Ablin, P. Scaling laws for optimal data mixtures. arXiv preprint arXiv:2507.09404.

  26. Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D., et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051.

  27. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.

  28. Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  29. Wei, T., Luan, J., Liu, W., Dong, S., and Wang, B. CMATH: Can your language model pass Chinese elementary school math test? arXiv preprint arXiv:2306.16636.

  30. Yıldız, Ç., Ravichandran, N. K., Sharma, N., Bethge, M., and Ermis, B. Investigating continual pretraining in large language models: Insights and implications. arXiv preprint arXiv:2402.17400.


    C.1. Details of Training Sets We consider a set of models trained at different scales M, each with a fixed set of mixture ratios. At each training scale, we evaluate multiple mixture ratios r, yielding a collection of (r, M) pairs used to fit the law. In particular, the scaling law is fitted using models with the same setup in Appendix B.1. After fitting ...