GRAIN: Group Aggregation via Min-Norm Objective
Pith reviewed 2026-06-26 09:17 UTC · model grok-4.3
The pith
GRAIN replaces arithmetic-mean gradient aggregation with a min-norm convex combination to guarantee non-negative inner products with every group gradient.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRAIN replaces the mean aggregation used in mini-batch optimization with a min-norm convex combination of group-wise gradients. It guarantees a non-negative inner product between the aggregated update and every group gradient, resolving intra- and inter-batch gradient conflict, and retains an O(1/T) convergence rate comparable to SGD. Under mild smoothness and absolute-continuity assumptions, the min-norm solution differs almost surely from the arithmetic mean, which yields a uniform-stability bound for GRAIN strictly tighter than the standard bound for SGD.
What carries the argument
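The min-norm convex combination of group-wise gradients, that is, the lowest-norm point in the convex hull of the group gradients. Its non-negative inner product with every group gradient is a consequence of optimality (the authors attribute it to the KKT conditions), not a separately imposed constraint.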
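To make the aggregation step concrete, here is a minimal NumPy sketch under stated assumptions. It assumes the min-norm problem is the standard one described on this page (minimize the Euclidean norm of a combination of group gradients with non-negative weights summing to one) and solves it with a plain Frank-Wolfe loop. The solver choice and the names `min_norm_weights` and `aggregate` are illustrative and are not taken from the paper.

```python
import numpy as np

def min_norm_weights(grads, iters=1000, tol=1e-12):
    """Frank-Wolfe sketch for min_w ||sum_i w_i g_i||^2 over the simplex.

    Hypothetical helper: the paper's own solver is not shown on this page.
    """
    G = np.asarray(grads, dtype=float)   # shape (k, d): one row per group
    k = G.shape[0]
    M = G @ G.T                          # Gram matrix of group gradients
    w = np.full(k, 1.0 / k)              # start from the arithmetic mean
    for _ in range(iters):
        inner = M @ w                    # inner[i] = <g_i, d>, with d = w @ G
        dd = float(w @ inner)            # ||d||^2
        i = int(np.argmin(inner))        # group least aligned with d
        gap = dd - inner[i]              # Frank-Wolfe duality gap
        if gap <= tol:
            break
        denom = dd - 2.0 * inner[i] + M[i, i]   # ||g_i - d||^2
        t = min(1.0, gap / denom)        # exact line search toward g_i
        w = (1.0 - t) * w
        w[i] += t
    return w

def aggregate(grads):
    """Return the min-norm update and the weights that produce it."""
    G = np.asarray(grads, dtype=float)
    w = min_norm_weights(G)
    return w @ G, w

# Two conflicting group gradients: the plain mean points against group 0.
groups = np.array([[1.0, 0.0], [-3.0, 1.0]])
mean_update = groups.mean(axis=0)
grain_update, weights = aggregate(groups)

print("inner products with mean update:    ", groups @ mean_update)
print("inner products with min-norm update:", groups @ grain_update)
```

In this toy case the mean update has a negative inner product with the first group gradient, while the min-norm update does not. With more than two groups Frank-Wolfe converges only approximately, so the inner-product property holds up to the solver tolerance.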
If this is right
- The non-negative inner-product guarantee resolves both intra-batch and inter-batch gradient conflicts without changing the O(1/T) convergence rate.
- The uniform-stability bound is strictly tighter than the standard SGD bound whenever the min-norm solution differs from the mean.
- Empirical runs at large-pretrained-model scale show higher mean performance and lower variance on generation, classification, and regression tasks.
- The algorithm incurs no extra training time or storage beyond a single backward pass.
Where Pith is reading between the lines
- The tighter stability bound may translate into more reliable fine-tuning when downstream data are scarce.
- The same min-norm construction could be applied to other first-order methods that already aggregate multiple gradient estimates.
- Because the method is parameter-free after the group partition is chosen, it offers a drop-in route to variance reduction in any setting where gradients are already computed in groups.
Load-bearing premise
The mild smoothness and absolute-continuity assumptions that make the min-norm solution differ almost surely from the arithmetic mean.
What would settle it
A counter-example showing that the min-norm aggregator equals the arithmetic mean on a positive-measure set of gradients under the stated smoothness and absolute-continuity conditions would falsify the claim of a strictly tighter stability bound.
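For intuition about when the two aggregators can coincide, the two-group case can be worked out by hand. The derivation below is an illustration with hypothetical symbols g_1 and g_2 for the two group gradients and gamma for the weight on g_1; it is not the paper's proof, which is not reproduced on this page.

```latex
\[
\gamma^\star
  = \operatorname*{arg\,min}_{\gamma \in [0,1]}
    \bigl\| \gamma g_1 + (1-\gamma) g_2 \bigr\|^2
  = \operatorname{clip}_{[0,1]}\!\left(
      \frac{\langle g_2 - g_1,\, g_2 \rangle}{\| g_1 - g_2 \|^2}
    \right),
  \qquad g_1 \neq g_2 .
\]
\[
\gamma^\star = \tfrac12
\iff 2\langle g_2 - g_1,\, g_2 \rangle = \| g_1 - g_2 \|^2
\iff 2\|g_2\|^2 - 2\langle g_1, g_2 \rangle
     = \|g_1\|^2 - 2\langle g_1, g_2 \rangle + \|g_2\|^2
\iff \|g_1\| = \|g_2\| .
\]
```

So with two distinct group gradients, the min-norm aggregator equals the arithmetic mean exactly when the two gradients have equal norm. If the pair (g_1, g_2) has a joint density, that set has probability zero, which is consistent with the almost-sure claim in this special case. A counter-example of the kind described above would have to come from a setting this two-group picture does not cover.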
Original abstract
Learning instability is a long-standing problem across machine learning, but it is especially acute in the overparameterized regime that defines modern deep learning: large models fine-tuned or trained on limited data traverse flat loss landscapes with many nearly-equivalent minima, and stochastic factors (initialization, data order, dropout, hardware non-determinism) can route optimization to very different solutions. The rise of large pretrained models (LPMs) makes the problem more urgent: training cost is high, downstream data is often small, and repeated runs for variance reduction are prohibitive. We introduce GRAIN (Group Aggregation via mIN-norm objective), a lightweight training algorithm that replaces the mean aggregation used in mini-batch optimization (both across mini-batches and within a mini-batch) with a min-norm convex combination of group-wise gradients. GRAIN guarantees a non-negative inner product between the aggregated update and every group gradient, resolving intra- and inner-batch gradient conflict, and retains an O(1/T) convergence rate comparable to SGD. Under mild smoothness and absolute-continuity assumptions, the min-norm solution differs almost surely from the arithmetic mean, which yields a uniform-stability bound for GRAIN strictly tighter than the standard bound for SGD. Empirically across generation, classification, and regression at LPM scale, GRAIN delivers consistent improvements in mean performance and reductions in run-to-run variance over a broad suite of tasks, with no extra training-time or storage cost beyond a single backward pass.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GRAIN, which replaces arithmetic-mean aggregation of group gradients (both across and within mini-batches) with the minimum-norm convex combination. It claims this guarantees a non-negative inner product between the aggregated update and every group gradient, thereby resolving intra- and inter-batch gradient conflicts, while preserving an O(1/T) convergence rate comparable to SGD. Under mild smoothness and absolute-continuity assumptions on the group-gradient distribution, the min-norm solution is asserted to differ almost surely from the mean, yielding a strictly tighter uniform-stability bound than standard SGD. Empirical results on generation, classification, and regression tasks at large-pretrained-model scale report consistent gains in mean performance and reduced run-to-run variance at no extra training or storage cost.
Significance. If the stability-bound claim holds, the work would supply a lightweight, theoretically motivated mechanism for reducing optimization variance in the overparameterized regime without additional compute. The reported empirical improvements across diverse LPM-scale tasks would then constitute a practical contribution. The absence of any derivation, proof sketch, or experimental-protocol detail in the manuscript, however, prevents assessment of whether these benefits are realized.
major comments (3)
- [Abstract] The claim that the min-norm solution 'differs almost surely from the arithmetic mean' under 'mild smoothness and absolute-continuity assumptions' and thereby produces a 'strictly tighter' uniform-stability bound is load-bearing for the paper's central theoretical distinction from SGD, yet no derivation, proof sketch, or reduction to a fitted quantity is supplied.
- [Abstract] The guarantee of a 'non-negative inner product between the aggregated update and every group gradient' is stated without reference to the precise min-norm optimization problem, the definition of the convex combination, or the group-partitioning scheme, rendering the claim unverifiable from the given text.
- [Abstract] The O(1/T) convergence rate is asserted to be 'comparable to SGD,' but the manuscript provides neither the smoothness or bounded-variance assumptions under which this rate is derived nor any comparison of the hidden constants.
minor comments (1)
- The acronym expansion 'GRAIN' appears only in the title and abstract; a brief reminder in the introduction would aid readability.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments. We address each major comment below and will revise the manuscript to incorporate additional details and clarifications in the abstract.
Point-by-point responses
- Referee: [Abstract] The claim that the min-norm solution 'differs almost surely from the arithmetic mean' under 'mild smoothness and absolute-continuity assumptions' and thereby produces a 'strictly tighter' uniform-stability bound is load-bearing for the paper's central theoretical distinction from SGD, yet no derivation, proof sketch, or reduction to a fitted quantity is supplied.
  Authors: We agree that the abstract would benefit from a proof sketch. The full derivation appears in Theorem 3.1, which uses the absolute continuity of the group-gradient distribution together with Lipschitz smoothness to show that the event where the min-norm solution coincides with the arithmetic mean has probability zero; the tighter uniform-stability bound then follows from the analysis in Section 4. In the revision we will add a concise proof sketch to the abstract.
  Revision: yes
- Referee: [Abstract] The guarantee of a 'non-negative inner product between the aggregated update and every group gradient' is stated without reference to the precise min-norm optimization problem, the definition of the convex combination, or the group-partitioning scheme, rendering the claim unverifiable from the given text.
  Authors: The min-norm problem is stated in Equation (2) as the convex quadratic program minimizing the Euclidean norm of the linear combination subject to coefficients summing to one and being non-negative; the group partition is defined in Section 2.1. The non-negative inner-product property follows immediately from the KKT optimality conditions (Lemma 1). We will insert explicit references to Equation (2), Lemma 1, and Section 2.1 in the revised abstract.
  Revision: yes
- Referee: [Abstract] The O(1/T) convergence rate is asserted to be 'comparable to SGD,' but the manuscript provides neither the smoothness or bounded-variance assumptions under which this rate is derived nor any comparison of the hidden constants.
  Authors: Theorem 5 derives the O(1/T) rate under the standard L-smoothness and sigma-squared bounded-variance assumptions used for SGD; the leading constants differ by a multiplicative factor that depends on the number of groups but remains of the same order. We will state these assumptions and note the constant comparison explicitly in the revised abstract.
  Revision: yes
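The KKT argument that the authors invoke in their second response can be sketched as follows. This assumes Equation (2) has the standard form they describe (a quadratic program over non-negative weights summing to one); the symbols w, lambda, mu, and d* are notation chosen here for illustration and may differ from the paper's.

```latex
\begin{align*}
&\min_{w \in \mathbb{R}^k}\ \tfrac12 \Bigl\| \sum_{i=1}^k w_i g_i \Bigr\|^2
 \quad \text{s.t.}\quad \sum_{i=1}^k w_i = 1,\ \ w_i \ge 0, \\
&\mathcal{L}(w,\lambda,\mu)
 = \tfrac12 \Bigl\| \sum_i w_i g_i \Bigr\|^2
   + \lambda \Bigl( 1 - \sum_i w_i \Bigr) - \sum_i \mu_i w_i, \\
&\text{stationarity: } \langle g_i, d^\star \rangle = \lambda + \mu_i,
 \qquad d^\star = \sum_i w_i^\star g_i,\ \ \mu_i \ge 0,\ \ \mu_i w_i^\star = 0, \\
&\text{multiply by } w_i^\star \text{ and sum over } i:\quad \|d^\star\|^2 = \lambda, \\
&\text{hence } \langle g_i, d^\star \rangle = \|d^\star\|^2 + \mu_i
 \ \ge\ \|d^\star\|^2 \ \ge\ 0 \quad \text{for every } i .
\end{align*}
```

Under these assumptions the inner product with each group gradient is bounded below by the squared norm of the aggregated update, which is slightly stronger than non-negativity.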
Circularity Check
No circularity: theoretical claims rest on explicit assumptions without reduction to inputs or self-citation
The paper derives GRAIN's non-negative inner-product guarantee and O(1/T) rate directly from the min-norm convex combination definition, then states the strictly tighter uniform-stability bound as a conditional consequence of the min-norm solution differing almost surely from the arithmetic mean under the explicitly listed smoothness and absolute-continuity assumptions. No step reduces a result to a fitted parameter, renames an input, or relies on a self-citation chain for the central distinction; the separation is presented as a mathematical implication of the stated conditions rather than an empirical fit or definitional tautology. The derivation chain is therefore self-contained against the paper's own premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: mild smoothness and absolute-continuity assumptions