Meta Additive Model: Interpretable Sparse Learning With Auto Weighting

Hong Chen; Lingjuan Wu; Xinyue Liu; Xuelin Zhang

arXiv: 2604.20111 · v1 · submitted 2026-04-22 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-10 01:24 UTC · model grok-4.3

classification: cs.LG, cs.AI, stat.ML

keywords: sparse additive models, bilevel optimization, meta learning, loss reweighting, variable selection, robust regression, interpretable models

The pith

A bilevel optimization framework lets sparse additive models learn their own per-sample loss weights from meta data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the meta additive model (MAM) to address the sensitivity of standard sparse additive models to complex noise such as outliers, non-Gaussian errors, noisy labels, and class imbalance. Instead of relying on a fixed mean-squared-error loss and hand-chosen weights, it trains an MLP weighting function on meta data inside a bilevel loop that alternately optimizes the additive model and the weighting function. This keeps the interpretability and sparsity of additive models while adding automatic robustness. A reader would care because the approach removes the manual choice of a weighting function and its hyperparameters, and it comes with theoretical guarantees on convergence and variable selection consistency.

Core claim

MAM uses bilevel optimization in which an inner loop fits the sparse additive model for given weights while an outer loop trains an MLP on meta data to produce those weights, enabling adaptive reweighting for variable selection, robust regression, and imbalanced classification, with proved convergence, algorithmic generalization, and variable selection consistency under mild conditions.
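
One way to write down the bilevel structure described above is sketched here. The notation is an assumption made for illustration and is not taken from the paper: $\beta$ collects the additive model's coefficients with $f_\beta(x)=\sum_k f_k(x_k)$, $V_\theta$ is the MLP weighting function with parameters $\theta$, $\ell$ is the per-sample loss, $\Omega$ is a sparsity-inducing penalty with strength $\lambda$, and the training and meta sets have sizes $n$ and $m$. Feeding the per-sample loss into $V_\theta$ follows the Meta-Weight-Net convention and may differ from the input the paper actually uses.

```latex
\begin{aligned}
\theta^{*} &\in \arg\min_{\theta}\; \frac{1}{m}\sum_{j=1}^{m}
  \ell\bigl(y_j^{\mathrm{meta}},\, f_{\beta^{*}(\theta)}(x_j^{\mathrm{meta}})\bigr)
  && \text{(outer level, meta data)} \\
\text{s.t.}\quad \beta^{*}(\theta) &\in \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}
  V_{\theta}\bigl(\ell_i(\beta)\bigr)\,\ell_i(\beta) + \lambda\,\Omega(\beta),
  \qquad \ell_i(\beta) = \ell\bigl(y_i,\, f_{\beta}(x_i)\bigr)
  && \text{(inner level, training data)}
\end{aligned}
```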

What carries the argument

Bilevel optimization loop where the outer level trains an MLP to output per-sample loss weights from meta data and the inner level optimizes the sparse additive model parameters under those weights.
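
The loop above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' algorithm: a two-parameter sigmoid of the per-sample loss stands in for the MLP, a closed-form weighted ridge fit stands in for the sparse additive solver, the additive structure comes from a per-feature polynomial basis, and the outer gradient is taken by finite differences with a backtracking step. All function names (`additive_basis`, `sample_weights`, `inner_fit`, `meta_mse`, `bilevel_fit`) are hypothetical.

```python
import numpy as np

def additive_basis(X, degree=3):
    """Per-feature polynomial basis, so f(x) = sum_k f_k(x_k) (illustrative choice)."""
    return np.hstack([X ** k for k in range(1, degree + 1)])

def sample_weights(theta, losses):
    """Hypothetical 2-parameter weighting function standing in for the MLP.
    Maps each per-sample loss to a weight in [0, 1] via a stable sigmoid."""
    return 0.5 * (1.0 + np.tanh(0.5 * (theta[0] + theta[1] * losses)))

def inner_fit(Phi, y, w, lam=1e-2):
    """Inner level: weighted ridge regression in closed form.
    Ridge is an assumption here; it replaces the sparsity penalty."""
    Pw = Phi * w[:, None]
    A = Pw.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Pw.T @ y)

def meta_mse(beta, Phi_meta, y_meta):
    """Outer objective: mean squared error on the meta set."""
    return np.mean((Phi_meta @ beta - y_meta) ** 2)

def bilevel_fit(Phi, y, Phi_meta, y_meta, n_outer=30, lr=1.0, eps=1e-4):
    theta = np.zeros(2)  # theta = 0 gives uniform weights of 0.5
    beta = inner_fit(Phi, y, sample_weights(theta, np.zeros(len(y))))
    best_beta, best_theta = beta, theta
    best_score = meta_mse(beta, Phi_meta, y_meta)
    for _ in range(n_outer):
        losses = (Phi @ beta - y) ** 2  # per-sample losses under the current model

        def outer_obj(th):
            b = inner_fit(Phi, y, sample_weights(th, losses))
            return meta_mse(b, Phi_meta, y_meta)

        # Outer level: central finite-difference gradient with respect to theta.
        grad = np.zeros(2)
        for k in range(2):
            e = np.zeros(2)
            e[k] = eps
            grad[k] = (outer_obj(theta + e) - outer_obj(theta - e)) / (2 * eps)

        # Backtracking: halve the step until the meta loss does not increase.
        step, current = lr, outer_obj(theta)
        while step > 1e-8:
            candidate = theta - step * grad
            if outer_obj(candidate) <= current:
                theta = candidate
                break
            step *= 0.5

        # Inner level: refit the model under the updated weights.
        beta = inner_fit(Phi, y, sample_weights(theta, losses))
        score = meta_mse(beta, Phi_meta, y_meta)
        if score < best_score:
            best_beta, best_theta, best_score = beta, theta, score
    return best_beta, best_theta
```

Returning the iterate with the lowest meta-set error is a design choice of this sketch, similar to early stopping on a validation set. It also makes the dependence on the meta data explicit: the result can only be as good as the meta set is representative.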

Load-bearing premise

The meta data must be representative of the noise and corruption patterns in the main training set so that the learned weights actually improve rather than degrade the primary model.

What would settle it

If, on a dataset with known outliers or label noise, the variables selected by MAM show lower recovery accuracy or the predictions show higher error than a standard sparse additive model with fixed uniform weights, the claimed robustness benefit would be refuted.
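
The test described above needs a recovery score and a comparison rule. The sketch below is one hedged way to set that up: it assumes the true support is known (as in a synthetic benchmark), uses precision, recall, and F1 as the recovery measure, and compares a single pair of runs. The function names and the choice of F1 are illustrative assumptions, not the paper's evaluation protocol.

```python
def support_recovery(selected, true_support):
    """Precision, recall, and F1 of a selected variable set against the known true support."""
    selected, true_support = set(selected), set(true_support)
    hits = len(selected & true_support)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(true_support) if true_support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def robustness_claim_refuted(mam_f1, baseline_f1, mam_error, baseline_error):
    """True if MAM recovers variables worse, or predicts worse,
    than the uniform-weight sparse additive baseline."""
    return mam_f1 < baseline_f1 or mam_error > baseline_error
```

In practice the comparison would be repeated over several random seeds and corruption levels before drawing a conclusion from it.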

Figures

Figures reproduced from arXiv: 2604.20111 by Hong Chen, Lingjuan Wu, Xinyue Liu, Xuelin Zhang.

**Figure 2.** Visualization of the bilevel optimization process

**Figure 3.** Convergence curves of prediction accuracy versus training epoch for meta-based

**Figure 4.** The weighting curves of several loss functions and MAM. (a) The properties and

**Figure 5.** The variable weights learned by several methods for the CME dataset.

**Figure 6.** Visualization of the estimated component functions by MAM. Three types of

Original abstract

Sparse additive models have attracted much attention in high-dimensional data analysis due to their flexible representation and strong interpretability. However, most existing models are limited to single-level learning under the mean-squared error criterion, whose empirical performance can degrade significantly in the presence of complex noise, such as non-Gaussian perturbations, outliers, noisy labels, and imbalanced categories. The sample reweighting strategy is widely used to reduce the model's sensitivity to atypical data; however, it typically requires prespecifying the weighting functions and manually selecting additional hyperparameters. To address this issue, we propose a new meta additive model (MAM) based on the bilevel optimization framework, which learns data-driven weighting of individual losses by parameterizing the weighting function via an MLP trained on meta data. MAM is capable of a variety of learning tasks, including variable selection, robust regression estimation, and imbalanced classification. Theoretically, MAM provides guarantees on convergence in computation, algorithmic generalization, and variable selection consistency under mild conditions. Empirically, MAM outperforms several state-of-the-art additive models on both synthetic and real-world data under various data corruptions.
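
To make the contrast with prespecified weighting concrete, the snippet below shows the standard Huber weights from iteratively reweighted least squares. The threshold `delta` is the kind of hand-chosen hyperparameter the abstract refers to. The function name and default value are illustrative assumptions, and the paper does not necessarily compare against this exact scheme.

```python
import numpy as np

def huber_weights(residuals, delta=1.0):
    """Huber IRLS weights: 1 when |r| <= delta, delta / |r| otherwise.
    Assumes delta > 0, which must be chosen by hand."""
    abs_r = np.abs(np.asarray(residuals, dtype=float))
    return delta / np.maximum(abs_r, delta)
```

A learned weighting function replaces both the fixed functional form and the manual choice of `delta` with parameters fitted on meta data.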

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAM learns MLP-based loss weights for sparse additive models via bilevel optimization on meta data, which is a clean extension but only works if that meta data matches the noise in the target set.

The main thing to know is that the authors replace manual reweighting functions in sparse additive models with an MLP whose parameters are learned through bilevel optimization on a meta dataset. This lets the model adapt weights to outliers, label noise, or imbalance without hand-chosen weighting hyperparameters. That specific combination inside the additive-model setting is the new piece relative to earlier single-level reweighting work. The framework is set up to handle variable selection, robust regression, and imbalanced classification in one go, and the abstract states that convergence, generalization, and selection consistency hold under mild conditions. If the proofs are complete and the experiments are clean, this is a practical step for people who need interpretable models on messy high-dimensional data.

The soft spot is exactly the one the stress-test flags: the guarantees treat the learned weights as fixed once meta-training finishes, so they only apply if the meta data distribution actually reflects the noise structure in the main training set. If the meta set is unrepresentative, the outer optimization can produce weights that amplify bad samples instead of down-weighting them, and the consistency claims for the inner model would not hold. The paper would need to show explicit checks on meta-data construction and sensitivity to that choice.

Empirically they report gains on synthetic and real corrupted data, but without the full protocols it is hard to judge whether the baselines were fairly tuned or whether the bilevel solver is stable.

This paper is aimed at statisticians and ML researchers working on robust interpretable models for high-dimensional problems. A reader who already knows additive models and bilevel optimization would get the most out of it. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject, even though the meta-data assumption will need close examination in review.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Meta Additive Model (MAM), which extends sparse additive models via a bilevel optimization framework. An MLP is trained on meta data to parameterize a data-driven loss weighting function, enabling automatic handling of complex noise (outliers, label noise, imbalance) for tasks including variable selection, robust regression, and imbalanced classification. The authors claim theoretical guarantees on computational convergence, algorithmic generalization, and variable selection consistency under mild conditions, together with empirical outperformance versus state-of-the-art additive models on synthetic and real-world data under various corruptions.

Significance. If the bilevel meta-training reliably produces weights that satisfy the stated mild conditions and the consistency proofs hold, MAM would offer a principled way to automate reweighting in interpretable sparse models, reducing manual hyperparameter choices while preserving variable-selection interpretability. The combination of theoretical analysis with experiments on corrupted data is a strength; however, the framework's practical value depends on the meta-data distribution matching the target noise structure, which is not independently validated.

major comments (2)

[Abstract] Abstract: The variable selection consistency guarantee under mild conditions is stated to hold after meta-training the MLP weighting function, yet the derivation (presumably in §3) treats the learned weights as fixed inputs to the inner sparse additive optimization. No analysis is supplied showing that the bilevel solution preserves these mild conditions when the meta-data distribution differs from the primary data's noise structure (e.g., outlier rate or label noise level).
[Theoretical Analysis] Bilevel optimization setup (likely §3 and §4): The central claim that MAM improves robustness and generalization rests on the assumption that the meta-trained MLP yields weights that mitigate rather than amplify atypical samples. The manuscript provides no sensitivity analysis, bounds, or ablation on meta-data construction and MLP architecture; if this assumption fails, both the convergence and consistency results are voided.

minor comments (1)

[Abstract] The abstract and introduction would benefit from a concise statement of the precise bilevel objective and the form of the MLP weighting function to allow readers to assess the added complexity immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical guarantees and the practical assumptions in the bilevel framework. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: The variable selection consistency guarantee under mild conditions is stated to hold after meta-training the MLP weighting function, yet the derivation (presumably in §3) treats the learned weights as fixed inputs to the inner sparse additive optimization. No analysis is supplied showing that the bilevel solution preserves these mild conditions when the meta-data distribution differs from the primary data's noise structure (e.g., outlier rate or label noise level).

Authors: We thank the referee for this observation. The variable selection consistency result (Theorem 3) is derived for the inner sparse additive optimization with weights treated as fixed inputs that satisfy the stated mild conditions (boundedness, positivity, and sufficient decay on atypical samples). The meta-training step is intended to produce weights meeting these conditions in a data-driven manner. We agree that the manuscript does not supply a formal analysis proving that the bilevel solution automatically preserves the mild conditions under arbitrary distribution shifts between meta-data and primary data. In the revised version we will (i) revise the abstract to explicitly state that the consistency holds conditional on the meta-learned weights satisfying the conditions, and (ii) add a short discussion paragraph in Section 3 clarifying the role of meta-data representativeness as a practical modeling assumption rather than a proven invariance.
revision: partial

Referee: [Theoretical Analysis] Bilevel optimization setup (likely §3 and §4): The central claim that MAM improves robustness and generalization rests on the assumption that the meta-trained MLP yields weights that mitigate rather than amplify atypical samples. The manuscript provides no sensitivity analysis, bounds, or ablation on meta-data construction and MLP architecture; if this assumption fails, both the convergence and consistency results are voided.

Authors: The referee correctly notes that the robustness and generalization claims rely on the meta-trained weights down-weighting atypical samples. The convergence (Theorem 2) and consistency results are conditional on this property holding. While the paper supplies empirical support through experiments on synthetic and real-world data with controlled corruptions, it indeed contains no theoretical sensitivity bounds or analysis of how meta-data construction and MLP architecture influence the quality of the learned weights. We will address this by adding (i) an expanded ablation study in the experiments section examining alternative meta-data sampling strategies and MLP depths/widths, and (ii) a dedicated paragraph in the theoretical analysis section discussing the assumption and its practical implications. These additions will better delineate when the guarantees are expected to hold.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

The paper introduces MAM via bilevel optimization where an MLP weighting function is trained on separate meta data to reweight losses for the sparse additive model. Theoretical guarantees on convergence, generalization, and variable selection consistency are stated to hold under mild conditions that treat the learned weights as given after meta-training. This structure does not reduce any claimed prediction or consistency result to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain; the meta-training step is an explicit algorithmic component whose validity rests on external assumptions about meta-data representativeness rather than on the target theorems themselves. Empirical outperformance is presented separately and does not enter the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the bilevel optimization framework and the assumption that meta data can train an effective weighting MLP; the model introduces learned parameters in the MLP but no new physical entities.

free parameters (1)

MLP weights for loss weighting
Parameters of the MLP that generates per-sample weights are fitted during the outer-level optimization.

axioms (1)

domain assumption: Mild conditions suffice for convergence, algorithmic generalization, and variable selection consistency
Invoked to support the theoretical guarantees stated in the abstract.

