arxiv: 2605.18889 · v1 · pith:7QMEGCIKnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Soft Learning

Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords soft learning · model combination · ensemble methods · non-negative least squares · machine learning · classification · regression

The pith

Soft Learning learns optimal non-negative weights to combine diverse specialists, guaranteeing performance that matches or exceeds the best weighted mix while training far faster than deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Soft Learning keeps a library of different machine learning specialists such as linear models, tree ensembles, kernel machines, and neural networks. It uses cross-validated non-negative least squares to find combination weights that are mathematically guaranteed to perform at least as well as the best possible weighted mix of those specialists. This removes the need to choose one algorithm in advance or to run extensive hyperparameter searches on GPUs. The approach also supplies interpretability because the learned weights directly show which specialist type contributed most to the solution. Results across dozens of classification and regression tasks indicate that the method often ranks highest while running on ordinary CPUs.

Core claim

Soft Learning maintains a library of heterogeneous specialists and discovers provably optimal combination weights through cross-validated non-negative least squares. This construction guarantees that the resulting model will match or exceed the best weighted combination of its specialists. The method trains 72-435 times faster than deep networks on CPU hardware alone, requires no hyperparameter tuning, and supplies inherent interpretability via the learned weights that indicate which algorithmic family fits the data.
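
The core claim can be written as a small optimization problem. The notation below is an assumption made for illustration and is not taken from the paper: y is the vector of n training targets, M is the number of specialists, and Z is the n-by-M matrix of out-of-fold specialist predictions.

```latex
% Z_{ij}: out-of-fold prediction of specialist j for training instance i
% (produced by a copy of specialist j that never saw instance i).
\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^{M},\; w \ge 0} \; \lVert y - Z w \rVert_2^2

% Every unit vector e_j is feasible (e_j \ge 0), so for each specialist j:
\lVert y - Z \hat{w} \rVert_2^2 \;\le\; \lVert y - Z e_j \rVert_2^2
  \;=\; \sum_{i=1}^{n} \bigl( y_i - Z_{ij} \bigr)^2
```

Read this way, the inequality is a statement about squared error on the cross-validation predictions used to fit the weights. It says nothing by itself about error on new data.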

What carries the argument

Cross-validated non-negative least squares, which solves for non-negative weights that minimize validation error when combining the prediction outputs of the specialist models.
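
A minimal sketch of this mechanism, assuming a regression task and three toy specialists written in NumPy. The specialists, the synthetic data, and the helper names (`poly_specialist`, `knn_specialist`, `out_of_fold_predictions`) are hypothetical stand-ins and not the paper's library or code; only `scipy.optimize.nnls` is a real library call.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (illustrative only, not from the paper).
n = 300
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 0] + rng.normal(0, 0.1, n)

def poly_specialist(degree):
    """Hypothetical specialist: least-squares polynomial of a fixed degree."""
    def fit(X_tr, y_tr):
        coefs = np.polyfit(X_tr[:, 0], y_tr, degree)
        return lambda X_new: np.polyval(coefs, X_new[:, 0])
    return fit

def knn_specialist(k):
    """Hypothetical specialist: k-nearest-neighbour average."""
    def fit(X_tr, y_tr):
        def predict(X_new):
            dist = np.abs(X_new[:, [0]] - X_tr[:, 0][None, :])  # (n_new, n_tr)
            nearest = np.argsort(dist, axis=1)[:, :k]
            return y_tr[nearest].mean(axis=1)
        return predict
    return fit

library = {
    "linear": poly_specialist(1),
    "cubic": poly_specialist(3),
    "knn5": knn_specialist(5),
}

def out_of_fold_predictions(X, y, library, n_folds=5, seed=0):
    """Column j holds specialist j's predictions for rows it was not trained on."""
    folds = np.random.default_rng(seed).permutation(len(y)) % n_folds
    Z = np.zeros((len(y), len(library)))
    for k in range(n_folds):
        held_out = folds == k
        for j, fit in enumerate(library.values()):
            predictor = fit(X[~held_out], y[~held_out])
            Z[held_out, j] = predictor(X[held_out])
    return Z

Z = out_of_fold_predictions(X, y, library)

# Non-negative least squares on the out-of-fold predictions.
weights, residual_norm = nnls(Z, y)

cv_mse_combo = np.mean((y - Z @ weights) ** 2)
cv_mse_single = np.mean((y[:, None] - Z) ** 2, axis=0)
print(dict(zip(library, np.round(weights, 3))))
```

The printed weights are what the review calls the interpretability signal: a larger weight means that specialist's out-of-fold predictions explain more of the target. By construction `cv_mse_combo` cannot exceed the smallest entry of `cv_mse_single`, because putting all weight on one specialist is itself a feasible solution.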

If this is right

  • Performance is guaranteed to remain the same or improve when any new specialist is added to the library.
  • The learned weights reveal which modeling paradigm best matches a given dataset without extra analysis.
  • No GPU hardware or hyperparameter tuning is required to reach competitive or superior results on both classification and regression tasks.
  • The same framework applies uniformly to the 25 classification and 12 regression datasets tested.
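
The first bullet can be illustrated numerically. The sketch below assumes that adding a specialist appends a column to the matrix of out-of-fold predictions while leaving the existing columns unchanged; the data are random numbers generated for illustration only.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n, m = 200, 6
y = rng.normal(size=n)

# Hypothetical out-of-fold predictions: one noisy copy of the target per specialist.
Z = y[:, None] + rng.normal(scale=1.0, size=(n, m))

# Fit NNLS weights on the first k specialists, for k = 1..m.
residuals = []
for k in range(1, m + 1):
    _, rnorm = nnls(Z[:, :k], y)
    residuals.append(rnorm)

# Any solution for k specialists stays feasible for k+1 (give the new one weight 0),
# so the fitted residual can only stay flat or shrink.
non_increasing = all(b <= a + 1e-8 for a, b in zip(residuals, residuals[1:]))
print(np.round(residuals, 4), non_increasing)
```

This monotonicity holds for the residual on the data used to fit the weights. Whether error on unseen data also stays flat or improves is the empirical question raised under the load-bearing premise.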

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could stop asking which single algorithm is best and instead ask what combination of available specialists is optimal for the data at hand.
  • Resource-limited settings might adopt this style of combination to reach high performance without specialized hardware.
  • The guarantee structure could be tested on streaming or continually arriving data to see whether the weights remain stable over time.

Load-bearing premise

Weights found by non-negative least squares on cross-validation folds will continue to produce good combinations on completely new test data.

What would settle it

A held-out test set on which the Soft Learning output performs materially worse than its single best specialist despite the non-negative least-squares combination being applied.
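
One way to make this check concrete is sketched below. The function name `settles_against` and the 5% relative tolerance used to define "materially worse" are assumptions chosen for illustration; the source does not specify a threshold or a metric, and mean squared error is used here only as an example.

```python
import numpy as np

def settles_against(test_preds, y_test, weights, rel_tol=0.05):
    """Compare the weighted combination with the best single specialist on held-out data.

    test_preds: (n_test, M) specialist predictions on data never used for weight fitting.
    Returns (flag, combo_mse, best_single_mse); flag is True when the combination's
    MSE exceeds the best single specialist's MSE by more than rel_tol (relative).
    """
    test_preds = np.asarray(test_preds, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    weights = np.asarray(weights, dtype=float)

    combo_mse = float(np.mean((y_test - test_preds @ weights) ** 2))
    single_mse = np.mean((y_test[:, None] - test_preds) ** 2, axis=0)
    best_single_mse = float(single_mse.min())
    flag = combo_mse > (1.0 + rel_tol) * best_single_mse
    return bool(flag), combo_mse, best_single_mse

# Toy usage: specialist 0 is perfect, specialist 1 is off by a constant 1.
y_test = np.array([1.0, 2.0, 3.0, 4.0])
test_preds = np.column_stack([y_test, y_test + 1.0])

bad = settles_against(test_preds, y_test, weights=[0.0, 1.0])   # all weight on the worse model
good = settles_against(test_preds, y_test, weights=[1.0, 0.0])  # all weight on the better model
```

For classification tasks the same comparison would need a different loss, and a single dataset flagged this way would count as a counterexample only if the gap is larger than ordinary sampling noise.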

Figures

Figures reproduced from arXiv: 2605.18889 by Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti, Mohammed Aledhari.

Figure 1. The Soft Learning framework: architecture, training pipeline, and specialist diversity.
Figure 2. Head-to-head comparison across 37 datasets.
Figure 3. Performance consistency across task types and dataset scales.
Original abstract

Modern machine learning forces practitioners to choose between powerful but expensive deep networks and fast but limited classical algorithms. Here we introduce Soft Learning, a framework that maintains a library of heterogeneous specialists -- spanning linear models, tree ensembles, kernel machines, and neural networks -- and discovers provably optimal combination weights through cross-validated non-negative least squares. Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, trains over two orders of magnitude faster than deep networks on CPU alone (72-435x faster across tested configurations), provides inherent interpretability through learned weights that reveal which algorithmic paradigm best fits the data, and is future-proof: adding specialists is mathematically guaranteed to maintain or improve performance. Across 37 datasets (25 classification, 12 regression) against nine methods including CatBoost and tuned deep networks, Soft Learning ranks first on 70% of tasks, achieves the best mean rank (Friedman test, p = 1.12 x 10^-12), and is the only method to simultaneously excel at both classification and regression -- all without GPU hardware or hyperparameter tuning. These results suggest a paradigm shift from "which algorithm is best?" to "what is the provably optimal combination?" -- a question Soft Learning answers with formal guarantees for any data modality.
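
For readers unfamiliar with the rank-based comparison mentioned in the abstract, the sketch below shows how a mean rank and a Friedman test are typically computed from a datasets-by-methods score matrix. The scores are synthetic and the method names are placeholders; nothing here reproduces the paper's numbers.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(2)
methods = ["method_a", "method_b", "method_c"]  # placeholder names
n_datasets = 37

# Synthetic accuracy-like scores, NOT the paper's results.
# method_a is given a small built-in advantage so the ranks differ.
base = rng.uniform(0.6, 0.9, size=(n_datasets, 1))
advantage = np.array([0.03, 0.0, -0.01])
scores = base + advantage + rng.normal(0, 0.01, size=(n_datasets, len(methods)))

# Rank methods within each dataset (rank 1 = highest score), then average over datasets.
ranks = rankdata(-scores, axis=1)
mean_rank = ranks.mean(axis=0)

# Friedman test: do the methods' rank distributions differ across datasets?
statistic, p_value = friedmanchisquare(*scores.T)

for name, r in zip(methods, mean_rank):
    print(f"{name}: mean rank {r:.2f}")
print(f"Friedman statistic {statistic:.2f}, p = {p_value:.3g}")
```

A small p-value indicates only that the methods are not all equivalent in rank; identifying which pairs differ usually requires a post-hoc test.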

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Soft Learning as a method to combine predictions from a diverse set of specialist models (linear, tree-based, kernel, and neural) by solving for non-negative weights using cross-validated non-negative least squares. It asserts formal optimality guarantees, significant speed advantages over deep networks, and superior empirical performance across 37 datasets in both classification and regression tasks, all without hyperparameter tuning or specialized hardware.

Significance. Should the central claims regarding out-of-sample optimality and generalization of the learned weights be substantiated, this approach could meaningfully advance ensemble methods by offering a principled, efficient, and interpretable alternative to both classical algorithms and deep learning. The ability to add specialists while maintaining guarantees and the lack of need for GPU resources are strong practical advantages. The work also provides a clear path toward understanding which paradigms suit particular data.

major comments (2)
  1. [Abstract] The claim that Soft Learning is 'guaranteed to match or exceed the best weighted combination of its specialists' is based on the cross-validated non-negative least squares solution. However, since this solution is obtained from the same cross-validation folds used in evaluation, the optimality may not extend to unseen test data without additional safeguards against overfitting in the weight estimation step.
  2. [Empirical evaluation] The reported best mean rank and first-place ranking on 70% of tasks rely on the learned weights generalizing from CV to test. Given that specialist predictions are often correlated and the number of specialists is not specified as small, a nested cross-validation loop isolating the weight-learning generalization error would be necessary to support these claims robustly.
minor comments (2)
  1. [Abstract] Ensure that the number of specialists and their types are clearly stated in the main text for reproducibility.
  2. [Introduction] The transition from 'which algorithm is best?' to 'what is the provably optimal combination?' is compelling but would benefit from a brief discussion of related work on meta-learning and stacking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our work on Soft Learning. We provide point-by-point responses to the major comments below.

  1. Referee: [Abstract] The claim that Soft Learning is 'guaranteed to match or exceed the best weighted combination of its specialists' is based on the cross-validated non-negative least squares solution. However, since this solution is obtained from the same cross-validation folds used in evaluation, the optimality may not extend to unseen test data without additional safeguards against overfitting in the weight estimation step.

    Authors: The optimality guarantee applies specifically to the cross-validation data used for weight estimation. The non-negative least squares solver finds the weights that minimize the squared error on the out-of-fold specialist predictions, which are generated without using the target instances in training the specialists. This ensures the combination is optimal for those CV predictions. For the test data, we apply the learned weights and evaluate empirically, without claiming a formal optimality guarantee on the test distribution. We agree that this distinction should be clarified to avoid misinterpretation. In the revised manuscript, we will update the abstract and add a section explaining the scope of the guarantees. revision: partial

  2. Referee: [Empirical evaluation] The reported best mean rank and first-place ranking on 70% of tasks rely on the learned weights generalizing from CV to test. Given that specialist predictions are often correlated and the number of specialists is not specified as small, a nested cross-validation loop isolating the weight-learning generalization error would be necessary to support these claims robustly.

    Authors: We recognize the value of nested cross-validation for isolating the generalization performance of the weight estimation step, particularly given potential correlations among specialist predictions. In our current implementation, we employ a single cross-validation procedure to balance computational efficiency with the scale of our experiments across 37 datasets. The number of specialists is 9 in the reported experiments, which is modest. While a full nested CV would strengthen the claims, the observed performance advantages and the statistical significance (Friedman test p-value) provide supporting evidence that the weights generalize effectively. We will revise the manuscript to specify the number of specialists, discuss this limitation, and include a nested CV analysis on a representative subset of datasets. revision: partial
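
The nested cross-validation that the referee requests and the authors agree to add can be sketched as follows. This is an illustrative outline under stated assumptions: three polynomial fits stand in for the specialist library, the data are synthetic, and the helper names (`fit_poly`, `learn_weights`, `nested_cv`) are hypothetical. It is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import nnls

def fit_poly(degree):
    """Hypothetical specialist: polynomial least squares of a fixed degree."""
    def fit(x_tr, y_tr):
        coefs = np.polyfit(x_tr, y_tr, degree)
        return lambda x_new: np.polyval(coefs, x_new)
    return fit

LIBRARY = [fit_poly(1), fit_poly(3), fit_poly(7)]  # stand-in specialists

def fold_ids(n, n_folds, rng):
    return rng.permutation(n) % n_folds

def learn_weights(x, y, n_folds, rng):
    """Inner loop: out-of-fold predictions on the given data, then NNLS weights."""
    folds = fold_ids(len(y), n_folds, rng)
    Z = np.zeros((len(y), len(LIBRARY)))
    for k in range(n_folds):
        out = folds == k
        for j, fit in enumerate(LIBRARY):
            Z[out, j] = fit(x[~out], y[~out])(x[out])
    weights, _ = nnls(Z, y)
    return weights

def nested_cv(x, y, outer_folds=5, inner_folds=5, seed=0):
    """Outer loop: the test fold is never seen by the specialists or by the weight fit."""
    rng = np.random.default_rng(seed)
    folds = fold_ids(len(y), outer_folds, rng)
    stack_mse, single_mse = [], []
    for k in range(outer_folds):
        test = folds == k
        weights = learn_weights(x[~test], y[~test], inner_folds, rng)
        # Refit each specialist on the full outer-training split, predict the outer test fold.
        preds = np.column_stack([fit(x[~test], y[~test])(x[test]) for fit in LIBRARY])
        stack_mse.append(np.mean((y[test] - preds @ weights) ** 2))
        single_mse.append(np.mean((y[test][:, None] - preds) ** 2, axis=0))
    return float(np.mean(stack_mse)), np.mean(single_mse, axis=0)

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 400)
y = np.sin(2 * x) + rng.normal(0, 0.2, 400)

stack, singles = nested_cv(x, y)
print(f"stacked MSE {stack:.4f}; single-specialist MSE {np.round(singles, 4)}")
```

Because the outer test fold plays no part in fitting either the specialists or the weights, the gap between `stack` and the smallest entry of `singles` is an estimate of how well the weight-learning step generalizes, which is the quantity both major comments ask for.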

Circularity Check

1 step flagged

Optimality guarantee reduces to NNLS fit on CV folds by construction

specific steps
  1. fitted input called prediction [Abstract]
    "Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, ... discovers provably optimal combination weights through cross-validated non-negative least squares."

    The guarantee is obtained by fitting NNLS weights on the cross-validation folds; the reported superiority on the 37 datasets is therefore the in-sample fit on those folds, not a prediction that must generalize beyond the data used to compute the weights.

full rationale

The paper's central guarantee that Soft Learning 'is guaranteed to match or exceed the best weighted combination' is achieved by solving non-negative least squares on the same cross-validation folds later used to report performance. This makes the headline claims (best mean rank, first on 70% of tasks, matches/exceeds best specialist) a direct consequence of the fitted weights rather than an independent prediction on held-out test data. No nested outer loop isolates the generalization of the weight-learning step itself. The derivation chain therefore collapses the 'provable optimality' claim into the fitting procedure on the evaluation data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that a fixed library of standard specialists is diverse enough for the convex combination to be useful and that cross-validation provides an unbiased estimate of combination quality.

free parameters (1)
  • combination weights
    Learned via non-negative least squares on cross-validation folds; these are the central fitted quantities.
axioms (1)
  • domain assumption: The library of specialists contains sufficiently complementary models so that a convex combination improves over the best single model.
    Invoked when claiming future-proof improvement upon adding specialists.
