pith. sign in

arxiv: 2606.02008 · v1 · pith:4HLCE4A5new · submitted 2026-06-01 · 📊 stat.ML · cs.LG

Provable Data Scaling Law for Meta Learning via Complexity Minimization

Pith reviewed 2026-06-28 12:57 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords meta-learningcomplexity minimizationscaling lawfew-shot adaptationpre-trainingrepresentation learningsample complexity
0
0 comments X

The pith

A complexity minimization framework for meta-representation learning provably shows that few-shot adaptation error rates improve with more meta-training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces complexity minimization, a meta-representation learning method that evaluates the downstream model complexity best suited to each source domain and then minimizes the worst-case value of that complexity across domains. An end-to-end theoretical analysis from pre-training through downstream regression demonstrates that this procedure captures the observed scaling behavior, with the error rate of few-shot adaptation decreasing as the volume of meta-training data grows. The analysis holds without further restrictions on function classes or data distributions beyond the per-domain complexity evaluation. Empirically, adding complexity regularization to existing meta-learning algorithms improves downstream sample efficiency.

Core claim

By learning representations through evaluation of per-domain downstream model complexity followed by minimization of the worst-case such complexity across source domains, the framework ensures that the error rate of few-shot adaptation improves as the amount of meta-training data grows, as established by the full theoretical chain from pre-training to downstream regression.

What carries the argument

Complexity minimization, the mechanism that evaluates the downstream model complexity suited to each domain and minimizes its worst-case value across source domains to produce scalable representations.

If this is right

  • The error rate of few-shot adaptation decreases as meta-training data volume increases.
  • Representations produced by the framework exhibit scaling behavior in downstream regression tasks.
  • Adding complexity regularization to existing meta-learning methods improves downstream sample efficiency.
  • The scaling relation holds from the end-to-end analysis spanning pre-training to adaptation without extra restrictions on function classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same worst-case complexity objective might be adapted to explain scaling in non-meta pre-training settings such as language model pre-training.
  • If the per-domain complexity measure can be approximated efficiently, the framework could guide data collection priorities in large-scale meta-learning pipelines.
  • Testing the predicted rate of error improvement against observed curves on standard few-shot benchmarks would provide a direct check on the derived scaling.

Load-bearing premise

Downstream model complexity can be meaningfully evaluated per domain such that minimizing its worst-case value across source domains makes the scaling of representation quality follow directly from meta-training data volume.

What would settle it

A concrete counterexample or experiment in which increasing the amount of meta-training data fails to reduce the few-shot adaptation error rate when representations are learned under complexity minimization.

Figures

Figures reproduced from arXiv: 2606.02008 by Kazuto Fukuchi, Kota Matsui, Ryuichiro Hataya.

Figure 1
Figure 1. Figure 1: Conceptual diagram of complexity minimization. Each base-learner (bottom) assesses the best model complexity J †(d) n (ϕ) for its domain d and reports it to the meta-learner. The meta-learner (top) selects ϕ to minimize the worst-case best model complexity over the observed domains, thereby reducing the downstream sample complexity across all domains. Downstream regression problem In the downstream task, t… view at source ↗
Figure 2
Figure 2. Figure 2: Regularizing model complexity (parameter spectral norms) improves down￾stream sample efficiency. Test error rate on CIFAR-10 (log scale) vs. fine-tuning size (n, log scale) with different pre-training size m for four meta-learning algorithms trained on Mini-ImageNet in the 5-way 1-shot setting with and without model complexity regularization. algorithms are adopted, which are discussed in Section 6: first-… view at source ↗
Figure 3
Figure 3. Figure 3: Downstream test error rate on SVHN, CIFAR-10, CIFAR-100 (log scale) vs. fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p036_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Downstream test error rate on SVHN, CIFAR-10, CIFAR-100 (log scale) vs. fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p037_4.png] view at source ↗
read the original abstract

Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a complexity minimization framework for meta-representation learning that learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. It claims an end-to-end theoretical analysis spanning pre-training through downstream regression that proves the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, incorporating complexity regularization into existing meta-learning methods is shown to improve downstream sample efficiency.

Significance. If the central theoretical claim holds under standard assumptions on function classes and data distributions, the work would be significant for supplying a provable account of data scaling in pre-training for meta-learning, a phenomenon not fully explained by prior frameworks. The combination of the claimed end-to-end guarantee with empirical gains on sample efficiency constitutes a substantive contribution.

minor comments (2)
  1. [Abstract] Abstract: the description of how complexity is evaluated per domain and how the worst-case minimization is formalized would benefit from one additional sentence to make the framework's objective fully explicit without requiring the reader to infer it from the scaling claim.
  2. The empirical section would be strengthened by reporting the precise meta-learning baselines used and the magnitude of the sample-efficiency gains (e.g., reduction in shots needed to reach a target error).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, accurate summary of our contributions, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a new complexity-minimization framework for meta-representation learning and derives an end-to-end theoretical guarantee that few-shot regression error improves with meta-training data volume under that framework. No quoted equation or step reduces the scaling result to a tautology of the framework's own definitions, a fitted parameter renamed as prediction, or a self-citation chain. The analysis is presented as independent content within the stated assumptions on per-domain complexity evaluation and worst-case minimization; the derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the newly introduced complexity-minimization objective and the assumption that worst-case downstream complexity is a suitable proxy for scaling behavior; no explicit free parameters or external benchmarks are mentioned in the abstract.

axioms (1)
  • domain assumption Downstream model complexity can be evaluated per domain and minimized in the worst case across source domains
    This is the core definition of the proposed framework as stated in the abstract.
invented entities (1)
  • complexity minimization framework no independent evidence
    purpose: To enable theoretical analysis of pre-training scaling behavior
    Newly defined construct introduced in the paper; no independent evidence outside the framework itself is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5675 in / 1187 out tokens · 33376 ms · 2026-06-28T12:57:33.387734+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 11 canonical work pages · 2 internal anchors

  1. [1]

    and Cho, Kyunghyun

    Lourie, Nicholas and Hu, Michael Y. and Cho, Kyunghyun. Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.877

  2. [2]

    arXiv:2102.01293 , year=

    Scaling laws for transfer , author=. arXiv:2102.01293 , year=

  3. [3]

    Journal of Statistical Mechanics: Theory and Experiment , volume=

    How feature learning can improve neural scaling laws , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2025 , publisher=

  4. [4]

    Proceedings of the National Academy of Sciences , volume=

    Explaining neural scaling laws , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  5. [5]

    Advances in Neural Information Processing Systems , year=

    Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems , year=

  6. [6]

    arXiv:2001.08361 , year=

    Scaling laws for neural language models , author=. arXiv:2001.08361 , year=

  7. [7]

    arXiv:1712.00409 , year=

    Deep learning scaling is predictable, empirically , author=. arXiv:1712.00409 , year=

  8. [8]

    Advances in neural information processing systems , volume=

    Provable guarantees for self-supervised deep learning with spectral contrastive loss , author=. Advances in neural information processing systems , volume=

  9. [9]

    International conference on machine learning , pages=

    A theoretical analysis of contrastive unsupervised representation learning , author=. International conference on machine learning , pages=. 2019 , organization=

  10. [10]

    Forty-second International Conference on Machine Learning , year=

    Nonlinear transformers can perform inference-time feature learning , author=. Forty-second International Conference on Machine Learning , year=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Feature learning via mean-field langevin dynamics: classifying sparse parities and beyond , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    High-dimensional asymptotics of feature learning: How one gradient step improves the representation , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    Conference on Learning Theory , pages=

    Kernel and rich regimes in overparametrized models , author=. Conference on Learning Theory , pages=. 2020 , organization=

  14. [14]

    SIAM Journal on Applied Mathematics , volume=

    Mean field analysis of neural networks: A law of large numbers , author=. SIAM Journal on Applied Mathematics , volume=. 2020 , publisher=

  15. [15]

    Proceedings of the National Academy of Sciences , volume=

    A mean field view of the landscape of two-layer neural networks , author=. Proceedings of the National Academy of Sciences , volume=. 2018 , publisher=

  16. [16]

    Advances in neural information processing systems , volume=

    On lazy training in differentiable programming , author=. Advances in neural information processing systems , volume=

  17. [17]

    International conference on machine learning , pages=

    A convergence theory for deep learning via over-parameterization , author=. International conference on machine learning , pages=. 2019 , organization=

  18. [18]

    International Conference on Learning Representations , year=

    Gradient descent provably optimizes over-parameterized neural networks , author=. International Conference on Learning Representations , year=

  19. [19]

    Advances in neural information processing systems , volume=

    Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=

  20. [20]

    Journal of Statistical Mechanics: Theory and Experiment , volume=

    Deep double descent: Where bigger models and more data hurt , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

  21. [21]

    Proceedings of the National Academy of Sciences , volume=

    Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

  22. [22]

    International Conference on Learning Representations , year=

    Understanding deep learning requires rethinking generalization , author=. International Conference on Learning Representations , year=

  23. [23]

    International Conference on Learning Representations , year=

    A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks , author=. International Conference on Learning Representations , year=

  24. [24]

    Advances in neural information processing systems , volume=

    Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

  25. [25]

    International Conference on Machine Learning , pages=

    Diffusion models are minimax optimal distribution estimators , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  26. [26]

    International Conference on Machine Learning , pages=

    Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  27. [27]

    International conference on machine learning , pages=

    Approximation and non-parametric estimation of ResNet-type convolutional neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

  28. [28]

    2019 , booktitle=

    Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality , author=. 2019 , booktitle=

  29. [29]

    Neural networks , volume=

    Error bounds for approximations with deep ReLU networks , author=. Neural networks , volume=. 2017 , publisher=

  30. [30]

    arXiv preprint arXiv:2212.06817 , year=

    Rt-1: Robotics transformer for real-world control at scale , author=. arXiv preprint arXiv:2212.06817 , year=

  31. [31]

    Nature , volume=

    Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

  32. [32]

    nature , volume=

    Highly accurate protein structure prediction with AlphaFold , author=. nature , volume=. 2021 , publisher=

  33. [33]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Segment anything , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  34. [34]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  35. [35]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  36. [36]

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

  37. [37]

    International Conference on Learning Representations , year=

    Meta-learning with differentiable closed-form solvers , author=. International Conference on Learning Representations , year=

  38. [38]

    Advances in neural information processing systems , volume=

    Uniform convergence may be unable to explain generalization in deep learning , author=. Advances in neural information processing systems , volume=

  39. [39]

    Advances in Neural Information Processing Systems , volume=

    Generalization bounds for meta-learning: An information-theoretic analysis , author=. Advances in Neural Information Processing Systems , volume=

  40. [40]

    International Conference on Machine Learning , pages=

    Meta-learning by adjusting priors based on extended PAC-Bayes theory , author=. International Conference on Machine Learning , pages=. 2018 , organization=

  41. [41]

    International Conference on Machine Learning , pages=

    A PAC-Bayesian bound for lifelong learning , author=. International Conference on Machine Learning , pages=. 2014 , organization=

  42. [42]

    Journal of machine learning research , volume=

    Stability and generalization , author=. Journal of machine learning research , volume=

  43. [43]

    Advances in neural information processing systems , volume=

    Understanding benign overfitting in gradient-based meta learning , author=. Advances in neural information processing systems , volume=

  44. [44]

    Advances in Neural Information Processing Systems , volume=

    Provable generalization of overparameterized meta-learning trained with sgd , author=. Advances in Neural Information Processing Systems , volume=

  45. [45]

    Advances in Neural Information Processing Systems , volume=

    On the stability and generalization of meta-learning , author=. Advances in Neural Information Processing Systems , volume=

  46. [46]

    International Conference on Machine Learning , pages=

    A unified view on pac-bayes bounds for meta-learning , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  47. [47]

    Advances in Neural Information Processing Systems , volume=

    Towards sample-efficient overparameterized meta-learning , author=. Advances in Neural Information Processing Systems , volume=

  48. [48]

    International conference on machine learning , pages=

    Provable meta-learning of linear representations , author=. International conference on machine learning , pages=. 2021 , organization=

  49. [49]

    International conference on machine learning , pages=

    Sparse coding for multitask and transfer learning , author=. International conference on machine learning , pages=. 2013 , organization=

  50. [50]

    Journal of Machine Learning Research , volume=

    The benefit of multitask representation learning , author=. Journal of Machine Learning Research , volume=

  51. [51]

    Journal of artificial intelligence research , volume=

    A model of inductive bias learning , author=. Journal of artificial intelligence research , volume=

  52. [52]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Representation learning: A review and new perspectives , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2013 , publisher=

  53. [53]

    IEEE Transactions on knowledge and data engineering , volume=

    A survey on transfer learning , author=. IEEE Transactions on knowledge and data engineering , volume=. 2009 , publisher=

  54. [54]

    arXiv:1707.03141 , year=

    A simple neural attentive meta-learner , author=. arXiv:1707.03141 , year=

  55. [55]

    Advances in neural information processing systems , volume=

    Learning to learn by gradient descent by gradient descent , author=. Advances in neural information processing systems , volume=

  56. [56]

    International conference on artificial neural networks , pages=

    Learning to learn using gradient descent , author=. International conference on artificial neural networks , pages=. 2001 , organization=

  57. [57]

    International conference on learning representations , year=

    Optimization as a model for few-shot learning , author=. International conference on learning representations , year=

  58. [58]

    International conference on machine learning , pages=

    Meta-learning with memory-augmented neural networks , author=. International conference on machine learning , pages=. 2016 , organization=

  59. [59]

    Advances in neural information processing systems , volume=

    Meta-learning with implicit gradients , author=. Advances in neural information processing systems , volume=

  60. [60]

    arXiv:1707.09835 , year=

    Meta-sgd: Learning to learn quickly for few-shot learning , author=. arXiv:1707.09835 , year=

  61. [61]

    arXiv:1803.02999 , year=

    On first-order meta-learning algorithms , author=. arXiv:1803.02999 , year=

  62. [62]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Learning to compare: Relation network for few-shot learning , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  63. [63]

    Advances in neural information processing systems , volume=

    Prototypical networks for few-shot learning , author=. Advances in neural information processing systems , volume=

  64. [64]

    Advances in neural information processing systems , volume=

    Matching networks for one shot learning , author=. Advances in neural information processing systems , volume=

  65. [65]

    ACM computing surveys (csur) , volume=

    Generalizing from a few examples: A survey on few-shot learning , author=. ACM computing surveys (csur) , volume=. 2020 , publisher=

  66. [66]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Meta-learning in neural networks: A survey , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2021 , publisher=

  67. [67]

    Lepskii, O. V. , date =. On a. doi:10.1137/1135065 , url =

  68. [68]

    Lepski, O. V. and Mammen, E. and Spokoiny, V. G. , date =. Optimal Spatial Adaptation to Inhomogeneous Smoothness: An Approach Based on Kernel Estimates with Variable Bandwidth Selectors , shorttitle =. doi:10.1214/aos/1069362731 , url =

  69. [69]

    and Lee, Jason D

    Du, Simon Shaolei and Hu, Wei and Kakade, Sham M. and Lee, Jason D. and Lei, Qi , date =. Few-. International

  70. [70]

    Transformers as

    Bai, Yu and Chen, Fan and Wang, Huan and Xiong, Caiming and Mei, Song , date =. Transformers as. Advances in

  71. [71]

    Transformers Are

    Kim, Juno and Nakamaki, Tai and Suzuki, Taiji , date =. Transformers Are. Advances in

  72. [72]

    Transformers as

    Li, Yingcong and Ildiz, Muhammed Emrullah and Papailiopoulos, Dimitris and Oymak, Samet , date =. Transformers as. Proceedings of the 40th

  73. [73]

    Scaling Laws for Neural Language Models

    Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario , date =. Scaling. doi:10.48550/arXiv.2001.08361 , url =. 2001.08361 , eprinttype =

  74. [74]

    Scaling Laws for Autoregressive Generative Modeling

    Henighan, Tom and Kaplan, Jared and Katz, Mor and Chen, Mark and Hesse, Christopher and Jackson, Jacob and Jun, Heewoo and Brown, Tom B. and Dhariwal, Prafulla and Gray, Scott and Hallacy, Chris and Mann, Benjamin and Radford, Alec and Ramesh, Aditya and Ryder, Nick and Ziegler, Daniel M. and Schulman, John and Amodei, Dario and McCandlish, Sam , date =. ...

  75. [75]

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages=

    A Scaling Law for Syn2real Transfer: How Much Is Your Pre-training Effective? , author=. Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages=. 2022 , organization=

  76. [76]

    Finn, Chelsea and Abbeel, Pieter and Levine, Sergey , date =. Model-. Proceedings of the 34th

  77. [77]

    Hospedales, Timothy and Antoniou, Antreas and Micaelli, Paul and Storkey, Amos , date =. Meta-. doi:10.1109/TPAMI.2021.3079209 , url =

  78. [78]

    Proceedings of the 39th

    Collins, Liam and Mokhtari, Aryan and Oh, Sewoong and Shakkottai, Sanjay , date =. Proceedings of the 39th

  79. [79]

    The Thirty Seventh Annual Conference on Learning Theory , pages=

    Metalearning with very few samples per task , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

  80. [80]

    Provable

    Huang, Yu and Liang, Yingbin and Huang, Longbo , date =. Provable

Showing first 80 references.