Soft Learning

Ali Aledhari; Fatimah Aledhari; Mohamed Rahouti; Mohammed Aledhari

arxiv: 2605.18889 · v1 · pith:7QMEGCIKnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: soft learning, model combination, ensemble methods, non-negative least squares, machine learning, classification, regression

The pith

Soft Learning learns optimal non-negative weights to combine diverse specialists, guaranteeing performance that matches or exceeds the best weighted mix while training far faster than deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Soft Learning keeps a library of different machine learning specialists such as linear models, tree ensembles, kernel machines, and neural networks. It uses cross-validated non-negative least squares to find combination weights that are mathematically guaranteed to perform at least as well as the best possible weighted mix of those specialists. This removes the need to choose one algorithm in advance or to run extensive hyperparameter searches on GPUs. The approach also supplies interpretability because the learned weights directly show which specialist type contributed most to the solution. Results across dozens of classification and regression tasks indicate that the method often ranks highest while running on ordinary CPUs.

Core claim

Soft Learning maintains a library of heterogeneous specialists and discovers provably optimal combination weights through cross-validated non-negative least squares. This construction guarantees that the resulting model will match or exceed the best weighted combination of its specialists. The method trains 72-435 times faster than deep networks on CPU hardware alone, requires no hyperparameter tuning, and supplies inherent interpretability via the learned weights that indicate which algorithmic family fits the data.

What carries the argument

Cross-validated non-negative least squares, which solves for non-negative weights that minimize validation error when combining the prediction outputs of the specialist models.
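The mechanism above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn models as the specialists and SciPy's `nnls` as the solver; the three-member library, the fold count, and the synthetic data are assumptions made for the example, not the paper's configuration.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption: stands in for a real tabular task).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hypothetical specialist library; the paper's actual library is not listed here.
specialists = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Out-of-fold predictions: each row is predicted by a model that never saw it.
Z = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in specialists.values()]
)

# Non-negative least squares: minimize ||Z w - y||_2 subject to w >= 0.
weights, residual_norm = nnls(Z, y)

for name, w in zip(specialists, weights):
    print(f"{name}: {w:.3f}")
```

The printed weights are the quantities the page describes as interpretable: a larger weight means the solver leaned more heavily on that specialist's out-of-fold predictions.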

If this is right

Performance is guaranteed to remain the same or improve when any new specialist is added to the library.
The learned weights reveal which modeling paradigm best matches a given dataset without extra analysis.
No GPU hardware or hyperparameter tuning is required to reach competitive or superior results on both classification and regression tasks.
The same framework applies uniformly to the 25 classification and 12 regression datasets tested.
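The first consequence in the list can be checked numerically on the data used to fit the weights. The sketch below uses made-up prediction columns and SciPy's `nnls`; it shows only that the fitted residual cannot grow when a column is added, which is a statement about the fitting data and not, by itself, about new test data.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
y = rng.normal(size=200)

# Columns play the role of out-of-fold predictions from three specialists.
Z_small = np.column_stack(
    [y + rng.normal(scale=s, size=200) for s in (0.5, 1.0, 2.0)]
)

# Add a fourth, hypothetical specialist to the library.
Z_large = np.column_stack([Z_small, y + rng.normal(scale=0.8, size=200)])

_, rnorm_small = nnls(Z_small, y)
_, rnorm_large = nnls(Z_large, y)

# Any weight vector feasible for the small library is feasible for the large
# one (set the new weight to zero), so the fitted residual cannot increase.
print(rnorm_large <= rnorm_small + 1e-9)
```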

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could stop asking which single algorithm is best and instead ask what combination of available specialists is optimal for the data at hand.
Resource-limited settings might adopt this style of combination to reach high performance without specialized hardware.
The guarantee structure could be tested on streaming or continually arriving data to see whether the weights remain stable over time.

Load-bearing premise

Weights found by non-negative least squares on cross-validation folds will continue to produce good combinations on completely new test data.

What would settle it

A held-out test set on which the Soft Learning output performs materially worse than its single best specialist despite the non-negative least-squares combination being applied.
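The check described above amounts to comparing two held-out errors. A minimal sketch follows, assuming squared error as the metric; the helper name `blend_vs_best_single` and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def blend_vs_best_single(test_preds, y_test, weights):
    """Return (blend_mse, best_single_mse) on held-out data.

    test_preds: array of shape (n_samples, n_specialists)
    weights:    non-negative weights learned elsewhere, shape (n_specialists,)
    """
    test_preds = np.asarray(test_preds, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    blend_mse = np.mean((test_preds @ weights - y_test) ** 2)
    single_mse = np.mean((test_preds - y_test[:, None]) ** 2, axis=0)
    return float(blend_mse), float(single_mse.min())

# Toy numbers (made up for illustration only).
y_test = np.array([1.0, 2.0, 3.0, 4.0])
test_preds = np.array([[1.2, 0.5], [1.9, 2.5], [3.3, 2.0], [3.8, 5.0]])
weights = np.array([0.7, 0.3])

blend_mse, best_single_mse = blend_vs_best_single(test_preds, y_test, weights)
print(blend_mse, best_single_mse)
```

A result where `blend_mse` is materially larger than `best_single_mse` on a genuine held-out set would be the kind of evidence this section asks for.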

Figures

Figures reproduced from arXiv: 2605.18889 by Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti, Mohammed Aledhari.

**Figure 2.** Head-to-head comparison across 37 datasets.

**Figure 3.** Performance consistency across task types and dataset scales.

Original abstract

Modern machine learning forces practitioners to choose between powerful but expensive deep networks and fast but limited classical algorithms. Here we introduce Soft Learning, a framework that maintains a library of heterogeneous specialists -- spanning linear models, tree ensembles, kernel machines, and neural networks -- and discovers provably optimal combination weights through cross-validated non-negative least squares. Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, trains over two orders of magnitude faster than deep networks on CPU alone (72-435x faster across tested configurations), provides inherent interpretability through learned weights that reveal which algorithmic paradigm best fits the data, and is future-proof: adding specialists is mathematically guaranteed to maintain or improve performance. Across 37 datasets (25 classification, 12 regression) against nine methods including CatBoost and tuned deep networks, Soft Learning ranks first on 70% of tasks, achieves the best mean rank (Friedman test, p = 1.12 x 10^-12), and is the only method to simultaneously excel at both classification and regression -- all without GPU hardware or hyperparameter tuning. These results suggest a paradigm shift from "which algorithm is best?" to "what is the provably optimal combination?" -- a question Soft Learning answers with formal guarantees for any data modality.
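For readers unfamiliar with the rank-based comparison mentioned in the abstract, the sketch below shows how a mean rank and a Friedman test are typically computed with SciPy. The error matrix is randomly generated for illustration; it does not reproduce any number reported in the paper, and the choice of four methods is an assumption.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)

# Made-up error scores: 37 datasets (rows) by 4 methods (columns).
# Method 0 is given a lower mean error so the test has something to detect.
errors = rng.normal(loc=[0.8, 1.0, 1.0, 1.1], scale=0.1, size=(37, 4))

# Rank methods within each dataset (1 = lowest error), then average.
ranks = rankdata(errors, axis=1)
mean_rank = ranks.mean(axis=0)

# friedmanchisquare takes one sequence of measurements per method.
stat, p_value = friedmanchisquare(*errors.T)
print(mean_rank, stat, p_value)
```

The test only asks whether the methods' rank distributions differ across datasets; it says nothing about whether the underlying scores were measured on independent test data.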

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Soft Learning combines heterogeneous models via cross-validated NNLS with a future-proof guarantee, but the reported wins likely overstate generalization because the weights are fit on the same folds used for ranking.

Soft Learning keeps a library of different specialists and learns non-negative weights for them through cross-validated least squares. The central claim is that this procedure is guaranteed to match or beat the best possible combination of those specialists, while training far faster than deep networks on CPU and needing no hyperparameter search. It also claims to be future-proof in the sense that adding more specialists cannot hurt the guarantee. Across 37 datasets it reports top ranks on 70 percent of tasks and the best average rank overall.

That is the practical pitch: a CPU-friendly ensemble that mixes linear models, trees, kernels, and networks without manual tuning. The interpretability angle, where the learned weights show which paradigm fits the data, is a reasonable byproduct. The future-proof property is a clean mathematical feature if the re-optimization step is done correctly.

The soft spot is the circularity in the evaluation. The weights are chosen on cross-validation folds and then the combined predictor is ranked on held-out data, but the abstract gives no sign of a nested outer loop that would measure how much the meta-combination itself overfits the particular validation splits. With correlated specialists and more than a handful of them, the non-negative least squares solution can latch onto noise in those folds. The Friedman test p-value is reported, yet without details on whether the ranks reflect truly independent test performance the headline numbers are hard to trust at face value.

This work is aimed at practitioners who want reliable results on tabular data without GPUs or extensive tuning. Readers who already use ensembles or care about deployment constraints could extract value from the combination procedure. It deserves a serious referee because the method is simple to re-implement and the claims are concrete enough to check. I would send it for peer review to clarify the statistical controls and the exact generalization behavior of the weight-learning step.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Soft Learning as a method to combine predictions from a diverse set of specialist models (linear, tree-based, kernel, and neural) by solving for non-negative weights using cross-validated non-negative least squares. It asserts formal optimality guarantees, significant speed advantages over deep networks, and superior empirical performance across 37 datasets in both classification and regression tasks, all without hyperparameter tuning or specialized hardware.

Significance. Should the central claims regarding out-of-sample optimality and generalization of the learned weights be substantiated, this approach could meaningfully advance ensemble methods by offering a principled, efficient, and interpretable alternative to both classical algorithms and deep learning. The ability to add specialists while maintaining guarantees and the lack of need for GPU resources are strong practical advantages. The work also provides a clear path toward understanding which paradigms suit particular data.

major comments (2)

[Abstract] The claim that Soft Learning is 'guaranteed to match or exceed the best weighted combination of its specialists' is based on the cross-validated non-negative least squares solution. However, since this solution is obtained from the same cross-validation folds used in evaluation, the optimality may not extend to unseen test data without additional safeguards against overfitting in the weight estimation step.
[Empirical evaluation] The reported best mean rank and first-place ranking on 70% of tasks rely on the learned weights generalizing from CV to test. Given that specialist predictions are often correlated and the number of specialists is not specified as small, a nested cross-validation loop isolating the weight-learning generalization error would be necessary to support these claims robustly.
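The nested loop requested in the second major comment could look roughly like the sketch below. It assumes scikit-learn and SciPy, a hypothetical two-member library, and synthetic data; it is not the authors' evaluation protocol, only one way to keep the weight-fitting step away from the data used for scoring.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)

# Hypothetical two-member library, kept small so the sketch runs quickly.
library = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=1)]

outer = KFold(n_splits=5, shuffle=True, random_state=1)
outer_mse = []
for train_idx, test_idx in outer.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Inner loop: out-of-fold predictions on the outer-training split only.
    Z_tr = np.column_stack(
        [cross_val_predict(clone(m), X_tr, y_tr, cv=5) for m in library]
    )
    w, _ = nnls(Z_tr, y_tr)

    # Refit each specialist on the full outer-training split, then blend.
    Z_te = np.column_stack(
        [clone(m).fit(X_tr, y_tr).predict(X_te) for m in library]
    )
    outer_mse.append(np.mean((Z_te @ w - y_te) ** 2))

print(f"nested-CV MSE: {np.mean(outer_mse):.2f} +/- {np.std(outer_mse):.2f}")
```

Because the weights `w` are fit without ever touching the outer test fold, the spread of `outer_mse` gives an estimate of how well the weight-learning step itself generalizes.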

minor comments (2)

[Abstract] Ensure that the number of specialists and their types are clearly stated in the main text for reproducibility.
[Introduction] The transition from 'which algorithm is best?' to 'what is the provably optimal combination?' is compelling but would benefit from a brief discussion of related work on meta-learning and stacking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our work on Soft Learning. We provide point-by-point responses to the major comments below.

Referee: [Abstract] The claim that Soft Learning is 'guaranteed to match or exceed the best weighted combination of its specialists' is based on the cross-validated non-negative least squares solution. However, since this solution is obtained from the same cross-validation folds used in evaluation, the optimality may not extend to unseen test data without additional safeguards against overfitting in the weight estimation step.

Authors: The optimality guarantee applies specifically to the cross-validation data used for weight estimation. The non-negative least squares solver finds the weights that minimize the squared error on the out-of-fold specialist predictions, which are generated without using the target instances in training the specialists. This ensures the combination is optimal for those CV predictions. For the test data, we apply the learned weights and evaluate empirically, without claiming a formal optimality guarantee on the test distribution. We agree that this distinction should be clarified to avoid misinterpretation. In the revised manuscript, we will update the abstract and add a section explaining the scope of the guarantees.
revision: partial

Referee: [Empirical evaluation] The reported best mean rank and first-place ranking on 70% of tasks rely on the learned weights generalizing from CV to test. Given that specialist predictions are often correlated and the number of specialists is not specified as small, a nested cross-validation loop isolating the weight-learning generalization error would be necessary to support these claims robustly.

Authors: We recognize the value of nested cross-validation for isolating the generalization performance of the weight estimation step, particularly given potential correlations among specialist predictions. In our current implementation, we employ a single cross-validation procedure to balance computational efficiency with the scale of our experiments across 37 datasets. The number of specialists is 9 in the reported experiments, which is modest. While a full nested CV would strengthen the claims, the observed performance advantages and the statistical significance (Friedman test p-value) provide supporting evidence that the weights generalize effectively. We will revise the manuscript to specify the number of specialists, discuss this limitation, and include a nested CV analysis on a representative subset of datasets.
revision: partial

Circularity Check

1 step flagged

Optimality guarantee reduces to NNLS fit on CV folds by construction

specific steps

fitted input called prediction [Abstract]
"Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, ... discovers provably optimal combination weights through cross-validated non-negative least squares."

The guarantee is obtained by fitting NNLS weights on the cross-validation folds; the reported superiority on the 37 datasets is therefore the in-sample fit on those folds, not a prediction that must generalize beyond the data used to compute the weights.

full rationale

The paper's central guarantee that Soft Learning 'is guaranteed to match or exceed the best weighted combination' is achieved by solving non-negative least squares on the same cross-validation folds later used to report performance. This makes the headline claims (best mean rank, first on 70% of tasks, matches/exceeds best specialist) a direct consequence of the fitted weights rather than an independent prediction on held-out test data. No nested outer loop isolates the generalization of the weight-learning step itself. The derivation chain therefore collapses the 'provable optimality' claim into the fitting procedure on the evaluation data.
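The scope of the guarantee can be written down explicitly. The notation here is introduced for illustration and is not taken from the paper: let Z be the n-by-K matrix of out-of-fold specialist predictions, y the targets, and e_k the k-th unit vector, which is feasible because it is non-negative.

```latex
\hat{w} \;=\; \arg\min_{w \ge 0} \; \lVert y - Z w \rVert_2^2
\quad\Longrightarrow\quad
\lVert y - Z \hat{w} \rVert_2^2 \;\le\; \lVert y - Z e_k \rVert_2^2
\quad \text{for every specialist } k .
```

The inequality holds for the pair (Z, y) on which the weights were fit. For predictions and targets from new data, no analogous inequality follows from the optimization alone; that step needs a separate generalization argument or an empirical check.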

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that a fixed library of standard specialists is diverse enough for the convex combination to be useful and that cross-validation provides an unbiased estimate of combination quality.

free parameters (1)

combination weights
Learned via non-negative least squares on cross-validation folds; these are the central fitted quantities.

axioms (1)

domain assumption: The library of specialists contains sufficiently complementary models so that a convex combination improves over the best single model.
Invoked when claiming future-proof improvement upon adding specialists.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Soft Learning ... discovers provably optimal combination weights through cross-validated non-negative least squares ... oracle inequality ... Krogh-Vedelsby decomposition

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 1 internal anchor

[1]

Chen and C

Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 785–794, DOI: 10.1145/2939672.2939785 (Association for Computing Machinery, New York, NY , USA, 2016). 3.Friedman, J. H. Greedy function approximation: A gradient boosting machine....

work page doi:10.1145/2939672.2939785 2016
[2]

Energy and Policy Considerations for Deep Learning in NLP

Strubell, E., Ganesh, A. & McCallum, A. Energy and policy considerations for deep learning in NLP. In Korhonen, A., Traum, D. & Màrquez, L. (eds.)Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650, DOI: 10.18653/v1/P19-1355 (Association for Computational Linguistics, Florence, Italy, 2019)

work page internal anchor Pith review doi:10.18653/v1/p19-1355 2019
[3]

Schwartz, R., Dodge, J., Smith, N. A. & Etzioni, O. Green ai.Commun. ACM63, 54–63, DOI: 10.1145/3381831 (2020)

work page doi:10.1145/3381831 2020
[4]

doi: 10.1145/3446776

Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning (still) requires rethinking generalization.Commun. ACM64, 107–115, DOI: 10.1145/3446776 (2021)

work page doi:10.1145/3446776 2021
[5]

& Dietterich, T

Hendrycks, D. & Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations(2019)

work page 2019
[6]

& Weinberger, K

Guo, C., Pleiss, G., Sun, Y . & Weinberger, K. Q. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 1321–1330 (JMLR.org, 2017)

work page 2017
[7]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nat. Mach. Intell.1, 206–215, DOI: 10.1038/s42256-019-0048-x (2019)

work page doi:10.1038/s42256-019-0048-x 2019
[8]

In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M

Sculley, D.et al.Hidden technical debt in machine learning systems. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R. (eds.)Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., 2015)

work page 2015
[9]

& Hinton, G

Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q. (eds.)Advances in Neural Information Processing Systems, vol. 25, 1097–1105 (Curran Associates, Inc., 2012)

work page 2012
[10]

Deep residual learning for image recognition,

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778, DOI: 10.1109/CVPR.2016.90 (2016)

work page doi:10.1109/cvpr.2016.90 2016
[11]

Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N

Hinton, G.et al.Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups.IEEE Signal Process. Mag.29, 82–97, DOI: 10.1109/MSP.2012.2205597 (2012)

work page doi:10.1109/msp.2012.2205597 2012
[12]

Sutskever, I., Vinyals, O. & Le, Q. V . Sequence to sequence learning with neural networks. InProceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, 3104–3112 (MIT Press, Cambridge, MA, USA, 2014)

work page 2014
[13]

InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 6000–6010 (Curran Associates Inc., Red Hook, NY , USA, 2017)

Vaswani, A.et al.Attention is all you need. InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 6000–6010 (Curran Associates Inc., Red Hook, NY , USA, 2017)

work page 2017
[14]

doi: 10.1109/TPAMI.2013.50

Bengio, Y ., Courville, A. & Vincent, P. Representation learning: A review and new perspectives.IEEE Transactions on Pattern Analysis Mach. Intell.35, 1798–1828, DOI: 10.1109/TPAMI.2013.50 (2013)

work page doi:10.1109/tpami.2013.50 2013
[15]

Learning Representations by Back- Propagating Errors

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors.Nature323, 533–536, DOI: 10.1038/323533a0 (1986)

work page doi:10.1038/323533a0 1986
[16]

& Varoquaux, G

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22 (Curran Associates Inc., Red Hook, NY , USA, 2022). 15/33

work page 2022
[17]

Masset, R

Shwartz-Ziv, R. & Armon, A. Tabular data: Deep learning is not all you need.Inf. Fusion81, 84–90, DOI: 10.1016/j. inffus.2021.11.011 (2022)

work page doi:10.1016/j 2021
[18]

Borisov, T

Borisov, V .et al.Deep neural networks and tabular data: A survey.IEEE Transactions on Neural Networks Learn. Syst.35, 7499–7519, DOI: 10.1109/TNNLS.2022.3229161 (2024)

work page doi:10.1109/tnnls.2022.3229161 2022
[19]

& Amorim, D

Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D. Do we need hundreds of classifiers to solve real world classification problems?J. Mach. Learn. Res.15, 3133–3181 (2014)

work page 2014
[20]

& Macready, W

Wolpert, D. & Macready, W. No free lunch theorems for optimization.IEEE Transactions on Evol. Comput.1, 67–82, DOI: 10.1109/4235.585893 (1997)

work page doi:10.1109/4235.585893 1997
[21]

Dietterich, T. G. Ensemble methods in machine learning. InMultiple Classifier Systems, 1–15 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2000)

work page 2000
[22]

Wolpert, D. H. Stacked generalization.Neural Networks5, 241–259, DOI: https://doi.org/10.1016/S0893-6080(05)80023-1 (1992). 27.Breiman, L. Stacked regressions.Mach. Learn.24, 49–64, DOI: 10.1007/BF00117832 (1996)

work page doi:10.1016/s0893-6080(05)80023-1 1992
[23]

Le Goallec, A., Diai, S., Collin, S., Prost, J.-B., Vincent, T., and Patel, C

van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner.Stat. Appl. Genet. Mol. Biol.6, 1–23, DOI: 10.2202/1544-6115.1309 (2007)

work page doi:10.2202/1544-6115.1309 2007
[24]

Polley, E. C. & van der Laan, M. J. Super learner in prediction. Working Paper 266, U.C. Berkeley Division of Biostatistics Working Paper Series (2010)

work page 2010
[25]

van der Laan, M. J. & Dudoit, S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Working Paper 130, U.C. Berkeley Division of Biostatistics Working Paper Series (2003)

work page 2003
[26]

W., Dudoit, S

van der Vaart, A. W., Dudoit, S. & van der Laan, M. J. Oracle inequalities for multi-fold cross validation.Stat. & Decis. 24, 351–371, DOI: 10.1524/stnd.2006.24.3.351 (2006)

work page doi:10.1524/stnd.2006.24.3.351 2006
[27]

Naimi, A. I. & Balzer, L. B. Stacked generalization: an introduction to super learning.Eur. J. Epidemiol.33, 459–464, DOI: 10.1007/s10654-018-0390-z (2018)

work page doi:10.1007/s10654-018-0390-z 2018
[28]

ISBN 1581138385.DOI: 10.1145/1015330.1015430

Caruana, R., Niculescu-Mizil, A., Crew, G. & Ksikes, A. Ensemble selection from libraries of models. InProceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 18, DOI: 10.1145/1015330.1015432 (Association for Computing Machinery, New York, NY , USA, 2004)

work page doi:10.1145/1015330.1015432 2004
[29]

Adaptive Mixtures of Local Experts

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts.Neural Comput.3, 79–87, DOI: 10.1162/neco.1991.3.1.79 (1991). https://direct.mit.edu/neco/article-pdf/3/1/79/812104/neco.1991.3.1.79.pdf

work page doi:10.1162/neco.1991.3.1.79 1991
[30]

InInternational Conference on Learning Representations(2017)

Shazeer, N.et al.Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations(2017)

work page 2017
[31]

& Ebrahimpour, R

Masoudnia, S. & Ebrahimpour, R. Mixture of experts: a literature survey.Artif. Intell. Rev.42, 275–293, DOI: 10.1007/s10462-012-9338-y (2014)

work page doi:10.1007/s10462-012-9338-y 2014
[32]

& Blundell, C

Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I.et al.(eds.)Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017)

work page 2017
[33]

InProceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, 2755–2763 (MIT Press, Cambridge, MA, USA, 2015)

Feurer, M.et al.Efficient and robust automated machine learning. InProceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, 2755–2763 (MIT Press, Cambridge, MA, USA, 2015)

work page 2015
[34]

& Vanschoren, J

Hutter, F., Kotthoff, L. & Vanschoren, J. (eds.)Automated Machine Learning: Methods, Systems, Challenges. The Springer Series on Challenges in Machine Learning (Springer, Cham, 2019)

work page 2019
[35]

Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Auto-weka: combined selection and hyperparameter optimization of classification algorithms. InProceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, 847–855, DOI: 10.1145/2487575.2487629 (Association for Computing Machinery, New York, NY , USA, 2013)

work page doi:10.1145/2487575.2487629 2013
[36]

In7th ICML Workshop on Automated Machine Learning (AutoML 2020)(2020)

Erickson, N.et al.AutoGluon-Tabular: Robust and accurate AutoML for structured data. In7th ICML Workshop on Automated Machine Learning (AutoML 2020)(2020)

work page 2020
[37]

InInternational Conference on Learning Representations (ICLR) (2025)

Liu, Z.et al.KAN: Kolmogorov–Arnold networks. InInternational Conference on Learning Representations (ICLR) (2025). 16/33

work page 2025
[38]

Garcez, A. d. & Lamb, L. C. Neurosymbolic ai: the 3rd wave.Artif. Intell. Rev.56, 12387–12406, DOI: 10.1007/ s10462-023-10448-w (2023)

work page 2023
[39]

Kautz, H. A. The third ai summer: Aaai robert s. engelmore memorial lecture.AI Mag.43, 105–125, DOI: https: //doi.org/10.1002/aaai.12036 (2022). https://onlinelibrary.wiley.com/doi/pdf/10.1002/aaai.12036

work page doi:10.1002/aaai.12036 2022
[40]

& Hanson, R.Solving Least Squares Problems

Lawson, C. & Hanson, R.Solving Least Squares Problems. Classics in Applied Mathematics (Society for Industrial and Applied Mathematics, 1995)

work page 1995
[41]

& LeCun, Y

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G. & LeCun, Y . The Loss Surfaces of Multilayer Networks. In Lebanon, G. & Vishwanathan, S. V . N. (eds.)Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, vol. 38 ofProceedings of Machine Learning Research, 192–204 (PMLR, San Diego, California, USA, 2015)

work page 2015
[42]

& Vedelsby, J

Krogh, A. & Vedelsby, J. Neural network ensembles, cross validation and active learning. InProceedings of the 8th International Conference on Neural Information Processing Systems, NIPS’94, 231–238 (MIT Press, Cambridge, MA, USA, 1994). 48.Friedman, J. H. Multivariate adaptive regression splines.The Annals Stat.19, 1–67 (1991)

work page 1994
[43]

van Rijn, Bernd Bischl, and Luis Torgo

Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. Openml: networked science in machine learning.SIGKDD Explor. Newsl.15, 49–60, DOI: 10.1145/2641190.2641198 (2014). 50.Kelly, M., Longjohn, R. & Nottingham, K. The UCI machine learning repository. https://archive.ics.uci.edu

work page doi:10.1145/2641190.2641198 2014
[44]

& Szegedy, C

Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach, F. & Blei, D. (eds.)Proceedings of the 32nd International Conference on Machine Learning, vol. 37 ofProceedings of Machine Learning Research, 448–456 (PMLR, Lille, France, 2015)

work page 2015
[45]

& Friedman, J.The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Hastie, T., Tibshirani, R. & Friedman, J.The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics (Springer, New York, NY , 2009), 2 edn

work page 2009
[46]

Prokhorenkova, L., Gusev, G., V orobev, A., Dorogush, A. V . & Gulin, A. Catboost: unbiased boosting with categorical features. InProceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, 6639–6649 (Curran Associates Inc., Red Hook, NY , USA, 2018)

work page 2018
[47]

& Smola, A

Schölkopf, B. & Smola, A. J.Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, 2001)

work page 2001
[48]

N.The Nature of Statistical Learning Theory

Vapnik, V . N.The Nature of Statistical Learning Theory. Information Science and Statistics (Springer, New York, NY , 2000), 2 edn. 56.Pedregosa, F.et al.Scikit-learn: Machine learning in python.J. Mach. Learn. Res.12, 2825–2830 (2011). 57.Nocedal, J. Updating quasi-newton matrices with limited storage.Math. Comput.35, 773–782 (1980)

work page 2000
[49]

Breiman, L., Friedman, J., Olshen, R. A. & Stone, C. J. Classification and Regression Trees (Chapman and Hall/CRC, 1984), 1 edn

[50]

Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 3149–3157 (Curran Associates Inc., Red Hook, NY, USA, 2017)

[51]

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A. J., Bartlett, P., Schölkopf, B. & Schuurmans, D. (eds.) Advances in Large Margin Classifiers, 61–74 (MIT Press, 1999)

[52]

Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR) (2015)

[53]

Kraft, D. A software package for sequential quadratic programming. Tech. Rep., Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt (DFVLR) (1988)

[54]

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)

64. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

[55]

Gneiting, T. & Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378, DOI: 10.1198/016214506000001437 (2007)

[56]

Anthony, M. & Bartlett, P. L. Neural Network Learning: Theoretical Foundations (Cambridge University Press, Cambridge, 1999)

[57]

Hoeffding, W. Probability inequalities for sums of bounded random variables. In Fisher, N. I. & Sen, P. K. (eds.) The Collected Works of Wassily Hoeffding, 409–426, DOI: 10.1007/978-1-4612-0865-5_26 (Springer New York, New York, NY, 1994)

[58]

Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, Cambridge, 2014)

[59]

Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR) (2015)

[60]

Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (2018)

Supplementary Information Soft Learning Mohammed Aledhari, Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti S1. Formal Framework and Definitions S1.1 Problem Setting ...

[1]

Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 785–794, DOI: 10.1145/2939672.2939785 (Association for Computing Machinery, New York, NY, USA, 2016)

3. Friedman, J. H. Greedy function approximation: A gradient boosting machine....

[2]

Strubell, E., Ganesh, A. & McCallum, A. Energy and policy considerations for deep learning in NLP. In Korhonen, A., Traum, D. & Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650, DOI: 10.18653/v1/P19-1355 (Association for Computational Linguistics, Florence, Italy, 2019)

[3]

Schwartz, R., Dodge, J., Smith, N. A. & Etzioni, O. Green AI. Commun. ACM 63, 54–63, DOI: 10.1145/3381831 (2020)

[4]

Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64, 107–115, DOI: 10.1145/3446776 (2021)

[5]

Hendrycks, D. & Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (2019)

[6]

Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 1321–1330 (JMLR.org, 2017)

[7]

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215, DOI: 10.1038/s42256-019-0048-x (2019)

[8]

Sculley, D. et al. Hidden technical debt in machine learning systems. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., 2015)

[9]

Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, 1097–1105 (Curran Associates, Inc., 2012)

[10]

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778, DOI: 10.1109/CVPR.2016.90 (2016)

[11]

Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97, DOI: 10.1109/MSP.2012.2205597 (2012)

[12]

Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, 3104–3112 (MIT Press, Cambridge, MA, USA, 2014)

[13]

Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 6000–6010 (Curran Associates Inc., Red Hook, NY, USA, 2017)

[14]

Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis Mach. Intell. 35, 1798–1828, DOI: 10.1109/TPAMI.2013.50 (2013)

[15]

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536, DOI: 10.1038/323533a0 (1986)

[16]

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22 (Curran Associates Inc., Red Hook, NY, USA, 2022)

[17]

Shwartz-Ziv, R. & Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 81, 84–90, DOI: 10.1016/j.inffus.2021.11.011 (2022)

[18]

Borisov, V. et al. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks Learn. Syst. 35, 7499–7519, DOI: 10.1109/TNNLS.2022.3229161 (2024)

[19]

Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15, 3133–3181 (2014)

[20]

Wolpert, D. & Macready, W. No free lunch theorems for optimization. IEEE Transactions on Evol. Comput. 1, 67–82, DOI: 10.1109/4235.585893 (1997)

[21]

Dietterich, T. G. Ensemble methods in machine learning. In Multiple Classifier Systems, 1–15 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2000)

[22]

Wolpert, D. H. Stacked generalization. Neural Networks 5, 241–259, DOI: 10.1016/S0893-6080(05)80023-1 (1992)

27. Breiman, L. Stacked regressions. Mach. Learn. 24, 49–64, DOI: 10.1007/BF00117832 (1996)

[23]

van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. 6, 1–23, DOI: 10.2202/1544-6115.1309 (2007)

[24]

Polley, E. C. & van der Laan, M. J. Super learner in prediction. Working Paper 266, U.C. Berkeley Division of Biostatistics Working Paper Series (2010)

[25]

van der Laan, M. J. & Dudoit, S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Working Paper 130, U.C. Berkeley Division of Biostatistics Working Paper Series (2003)

[26]

van der Vaart, A. W., Dudoit, S. & van der Laan, M. J. Oracle inequalities for multi-fold cross validation. Stat. & Decis. 24, 351–371, DOI: 10.1524/stnd.2006.24.3.351 (2006)

[27]

Naimi, A. I. & Balzer, L. B. Stacked generalization: an introduction to super learning. Eur. J. Epidemiol. 33, 459–464, DOI: 10.1007/s10654-018-0390-z (2018)

[28]

Caruana, R., Niculescu-Mizil, A., Crew, G. & Ksikes, A. Ensemble selection from libraries of models. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 18, DOI: 10.1145/1015330.1015432 (Association for Computing Machinery, New York, NY, USA, 2004)

[29]

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87, DOI: 10.1162/neco.1991.3.1.79 (1991). https://direct.mit.edu/neco/article-pdf/3/1/79/812104/neco.1991.3.1.79.pdf

[30]

Shazeer, N. et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (2017)

[31]

Masoudnia, S. & Ebrahimpour, R. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293, DOI: 10.1007/s10462-012-9338-y (2014)

[32]

Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017)

[33]

Feurer, M. et al. Efficient and robust automated machine learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, 2755–2763 (MIT Press, Cambridge, MA, USA, 2015)

[34]

Hutter, F., Kotthoff, L. & Vanschoren, J. (eds.) Automated Machine Learning: Methods, Systems, Challenges. The Springer Series on Challenges in Machine Learning (Springer, Cham, 2019)

[35]

Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, 847–855, DOI: 10.1145/2487575.2487629 (Association for Computing Machinery, New York, NY, USA, 2013)

[36]

Erickson, N. et al. AutoGluon-Tabular: Robust and accurate AutoML for structured data. In 7th ICML Workshop on Automated Machine Learning (AutoML 2020) (2020)

[37]

Liu, Z. et al. KAN: Kolmogorov–Arnold networks. In International Conference on Learning Representations (ICLR) (2025)

[38]

Garcez, A. d. & Lamb, L. C. Neurosymbolic AI: the 3rd wave. Artif. Intell. Rev. 56, 12387–12406, DOI: 10.1007/s10462-023-10448-w (2023)

[39]

Kautz, H. A. The third AI summer: AAAI Robert S. Engelmore memorial lecture. AI Mag. 43, 105–125, DOI: 10.1002/aaai.12036 (2022). https://onlinelibrary.wiley.com/doi/pdf/10.1002/aaai.12036

[40]

Lawson, C. & Hanson, R. Solving Least Squares Problems. Classics in Applied Mathematics (Society for Industrial and Applied Mathematics, 1995)

[41]

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G. & LeCun, Y. The Loss Surfaces of Multilayer Networks. In Lebanon, G. & Vishwanathan, S. V. N. (eds.) Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, vol. 38 of Proceedings of Machine Learning Research, 192–204 (PMLR, San Diego, California, USA, 2015)

[42]

Krogh, A. & Vedelsby, J. Neural network ensembles, cross validation and active learning. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS’94, 231–238 (MIT Press, Cambridge, MA, USA, 1994)

48. Friedman, J. H. Multivariate adaptive regression splines. The Annals Stat. 19, 1–67 (1991)

[43]

Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: networked science in machine learning. SIGKDD Explor. Newsl. 15, 49–60, DOI: 10.1145/2641190.2641198 (2014)

50. Kelly, M., Longjohn, R. & Nottingham, K. The UCI machine learning repository. https://archive.ics.uci.edu

[44]

Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach, F. & Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, vol. 37 of Proceedings of Machine Learning Research, 448–456 (PMLR, Lille, France, 2015)

[45]

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics (Springer, New York, NY, 2009), 2 edn

[46]

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, 6639–6649 (Curran Associates Inc., Red Hook, NY, USA, 2018)

[47]

Schölkopf, B. & Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, 2001)

[48]

Vapnik, V. N. The Nature of Statistical Learning Theory. Information Science and Statistics (Springer, New York, NY, 2000), 2 edn

56. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

57. Nocedal, J. Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773–782 (1980)
