Inference with non-differentiable surrogate loss in a general high-dimensional classification framework

Maureen A Smith; Muxuan Liang; Yang Ning; Ying-Qi Zhao

arxiv: 2405.11723 · v2 · submitted 2024-05-20 · 📊 stat.ME · stat.ML

Inference with non-differentiable surrogate loss in a general high-dimensional classification framework

Muxuan Liang , Yang Ning , Maureen A Smith , Ying-Qi Zhao This is my paper

Pith reviewed 2026-05-24 01:39 UTC · model grok-4.3

classification 📊 stat.ME stat.ML

keywords high-dimensional inferencesurrogate lossclassificationkernel smoothingdecorrelated scorepiecewise linear losscross-fittingconfidence intervals

0 comments

The pith

A kernel-smoothed decorrelated score enables hypothesis tests and confidence intervals for high-dimensional linear classifiers trained with non-differentiable surrogate losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of inference tools for identifying important features in high-dimensional linear decision rules when training relies on piecewise linear surrogate losses such as the hinge loss. These losses have discontinuous gradients and non-regular Hessians that break standard asymptotic arguments. The authors introduce kernel smoothing to approximate the gradient and Hessian, paired with a decorrelated score that removes the influence of high-dimensional nuisance parameters. A cross-fitted variant accommodates flexible nuisance estimates. They derive the limiting distribution of the resulting score, which directly yields valid p-values and intervals. This matters because it shifts focus from pure prediction accuracy to statistically supported statements about which variables drive the classifier.

Core claim

We propose a kernel-smoothed decorrelated score to construct hypothesis tests and interval estimators for a linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version.

What carries the argument

The kernel-smoothed decorrelated score, which applies kernel approximations to handle the discontinuous gradient and non-regular Hessian of the piecewise linear surrogate loss while decorrelating to control high-dimensional effects.

If this is right

Valid p-values and confidence intervals become available for individual coefficients in the estimated linear decision rule.
The cross-fitted version permits inference even when nuisance parameters are estimated flexibly and at high dimension.
The method applies directly to any piecewise linear surrogate loss used in penalized empirical risk minimization for classification.
Simulation and real-data results show the procedure achieves nominal coverage and power where prior approaches do not.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same kernel-smoothing device could be applied to inference problems involving other non-differentiable losses in regression or ranking tasks.
In medical or genomics applications the resulting intervals would allow formal statements about which biomarkers drive a diagnostic rule rather than only reporting predictive performance.
Choice of kernel bandwidth and order might be tuned by monitoring the finite-sample coverage of the intervals on held-out data with known ground-truth signals.

Load-bearing premise

Kernel approximations must accurately smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the piecewise linear surrogate loss.

What would settle it

Empirical coverage of the resulting confidence intervals falling well below the nominal level in repeated high-dimensional simulations that use a piecewise linear surrogate loss and realistic discontinuity patterns would falsify the claimed limiting distribution.

Figures

Figures reproduced from arXiv: 2405.11723 by Maureen A Smith, Muxuan Liang, Yang Ning, Ying-Qi Zhao.

**Figure 2.** Figure 2: Coverage results for Scenario I with the change of sample size when [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

**Figure 3.** Figure 3: Classification accuracy and estimation error for Scenario I with the change of sample size when [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 4.** Figure 4: Testing results for Scenario II with the change of sample size when [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Coverage results for Scenario II with the change of sample size when [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Value function and estimation error for Scenario II with the change of sample size when [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

read the original abstract

Penalized empirical risk minimization with a surrogate loss function is often used to learn a high-dimensional linear decision rule in classification problems. Although much of the literature focus on the generalization error, there is a lack of inference procedures for identifying the driving factors of the estimated decision rule, especially when the surrogate loss is non-differentiable. We propose a kernel-smoothed decorrelated score to construct hypothesis tests and interval estimators for a linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version in a high-dimensional setup. Simulation and real data analysis are conducted to demonstrate the validity and the superiority of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Kernel smoothing extends decorrelated scores to piecewise-linear surrogate losses in high dimensions, but the bandwidth and Hessian approximation rates look like the part that needs the most checking.

read the letter

The paper's main contribution is a kernel-smoothed decorrelated score (and its cross-fitted version) that lets you do inference on the linear coefficients after fitting a high-dimensional classifier with a non-differentiable piecewise-linear loss such as hinge loss. Most existing decorrelated-score work assumes differentiable losses, so this is a direct extension to the non-smooth case that people actually use in classification. They smooth the jump in the subgradient and approximate the non-regular Hessian, then derive a limiting normal distribution under high-dimensional asymptotics. The cross-fitting step is a sensible addition when nuisance parameters are present. That fills a real gap for post-estimation inference on SVM-type rules. The simulations and real-data example are mentioned to back it up, though the abstract gives no numbers on coverage or power. The soft spot is the one the stress test flags: whether the kernel bandwidth can be tuned so that the smoothing bias and the Hessian approximation error stay small enough relative to 1/sqrt(n) after penalization, cross-fitting, and high dimension. If the proofs only assume generic rates without explicit conditions on how bandwidth scales with p, s, or the kink locations, the claimed limiting distribution could fail in practice. The abstract states they establish the distribution, but without seeing the exact assumptions that is the load-bearing claim. This is for readers already working on high-dimensional inference for classification or on decorrelated scores. A statistician who needs valid intervals after fitting hinge-loss rules would get something concrete to try or extend. It is worth sending to a serious referee so the rate conditions and the simulation design can be checked in detail.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a kernel-smoothed decorrelated score (and its cross-fitted variant) for inference on the coefficients of a high-dimensional linear decision rule learned by penalized empirical risk minimization with a piecewise-linear surrogate loss. The method uses kernel approximations both to smooth the discontinuous subgradient near kink points and to handle the non-regular Hessian; the central theoretical claim is that this score admits a limiting normal distribution in a high-dimensional regime, enabling hypothesis tests and interval estimators. Simulations and a real-data example are presented to support the approach.

Significance. If the limiting-distribution result holds under verifiable conditions, the work would supply a practical route to post-estimation inference for non-differentiable surrogate losses in high-dimensional classification, an area where most existing literature addresses only generalization error. The explicit handling of the non-regular Hessian via kernel smoothing and the cross-fitting device for nuisance parameters are technically distinctive contributions.

major comments (1)

[Abstract / theoretical analysis] Abstract and the statement of the main theoretical result: the claim that the kernel-smoothed decorrelated score possesses a limiting distribution after high-dimensional penalization and cross-fitting rests on the kernel bandwidth and approximation parameters being chosen so that bias vanishes at the 1/sqrt(n) rate. No explicit scaling conditions relating bandwidth to dimension p, sparsity s, or the number of cross-fitting folds are stated in the abstract or the high-level description of the theorems; without these rate requirements the bias-variance balance that justifies the limiting distribution cannot be verified.

minor comments (1)

The abstract refers to 'simulation and real data analysis' but does not indicate the performance metrics (e.g., coverage, type-I error, or comparison baselines) used to claim superiority.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The single major comment identifies a presentational gap in the abstract and high-level theorem statements; we address it directly below.

read point-by-point responses

Referee: [Abstract / theoretical analysis] Abstract and the statement of the main theoretical result: the claim that the kernel-smoothed decorrelated score possesses a limiting distribution after high-dimensional penalization and cross-fitting rests on the kernel bandwidth and approximation parameters being chosen so that bias vanishes at the 1/sqrt(n) rate. No explicit scaling conditions relating bandwidth to dimension p, sparsity s, or the number of cross-fitting folds are stated in the abstract or the high-level description of the theorems; without these rate requirements the bias-variance balance that justifies the limiting distribution cannot be verified.

Authors: We agree that the abstract and the high-level description of the main theorems would be clearer if they explicitly recorded the scaling restrictions on the kernel bandwidth h_n and approximation parameters that are already required for the bias term to be o_p(n^{-1/2}). The detailed theorems in Sections 3 and 4 contain these conditions (bandwidth satisfying n h_n^2 → ∞ together with h_n = o(n^{-1/2}) and relations involving the sparsity s and the number of cross-fitting folds K), but the abstract and the introductory theorem statements do not restate them. We will revise both the abstract and the high-level theorem summaries to include the necessary rate requirements relating h_n to n, p, s, and K. This change makes the bias-variance balance verifiable directly from the high-level statements without altering any proofs or results. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is a standard asymptotic analysis of a proposed kernel-smoothed estimator

full rationale

The paper proposes a kernel-smoothed decorrelated score (and cross-fitted variant) for inference under piecewise-linear surrogate losses, adopts kernel approximations as a methodological device to handle the discontinuous gradient and non-regular Hessian, and then establishes the limiting distribution in high dimensions. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, nor does any load-bearing step rely on a self-citation chain whose content is itself unverified within the paper. The central result is an independent theoretical statement about the asymptotic behavior of the constructed score under the stated approximations and high-dimensional assumptions; it does not collapse to the inputs by the paper's own definitions or equations. This matches the default expectation of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the central claim rests on standard high-dimensional regularity conditions and the validity of kernel smoothing for the specific loss class.

axioms (2)

domain assumption High-dimensional regime with appropriate sparsity or regularity conditions on the decision rule coefficients
Invoked to obtain the limiting distribution of the score statistic.
ad hoc to paper Kernel bandwidth and approximation parameters chosen such that bias vanishes at the required rate
Necessary for the smoothed gradient and Hessian to yield the claimed asymptotic normality.

pith-pipeline@v0.9.0 · 5713 in / 1233 out tokens · 29969 ms · 2026-05-24T01:39:42.767372+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

" " arXiv preprint arXiv:

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year arxivId label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION format.arxivId arxivId empty "" " arXiv preprint arX...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in " " * FUNCTION format....

work page
[3]

L., Jordan, M

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association , 101(473):138--156

work page 2006
[4]

Bartlett, P. L. and Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. Journal of Machine Learning Research , 9(8):1823--1840

work page 2008
[5]

and Hochberg, Y

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) , 57(1):289--300

work page 1995
[6]

Blanchard, G., Bousquet, O., Massart, P., et al. (2008). Statistical performance of support vector machines. Annals of Statistics , 36(2):489--531

work page 2008
[7]

Chen, G., Zeng, D., and Kosorok, M. R. (2016). Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association , 111(516):1509--1521

work page 2016
[8]

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017). Double/debiased/neyman machine learning of treatment effects. American Economic Review , 107(5):261--65

work page 2017
[9]

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal , 21(1):C1--C68

work page 2018
[10]

and Vapnik, V

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning , 20(3):273--297

work page 1995
[11]

Dezeure, R., B \"u hlmann, P., and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap. TEST , 26(4):685--719

work page 2017
[12]

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction . Cambridge University Press

work page 2015
[13]

Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A bahadur representation of the linear support vector machine. Journal of Machine Learning Research , 9:1343--1368

work page 2008
[14]

A., and Zhao, Y.-Q

Liang, M., Choi, Y.-G., Ning, Y., Smith, M. A., and Zhao, Y.-Q. (2022). Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score. Journal of Machine Learning Research (In print)

work page 2022
[15]

Lin, Y. (2000). Some asymptotic properties of the support vector machine. University of Wisconsin, Madison

work page 2000
[16]

Lin, Y. (2004). A note on margin-based loss functions in classification. Statistics & probability letters , 68(1):73--82

work page 2004
[17]

Ma, R., Tony Cai, T., and Li, H. (2021). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association , 116(534):984--998

work page 2021
[18]

Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics , 79(1):147--168

work page 1997
[19]

and Liu, H

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Annals of Statistics , 45(1):158--195

work page 2017
[20]

and Zhao, Y.-Q

Pan, Y. and Zhao, Y.-Q. (2021). Improved doubly robust estimation in learning optimal individualized treatment rules. Journal of the American Statistical Association , 116(533):283--294

work page 2021
[21]

Peng, B., Wang, L., and Wu, Y. (2016). An error bound for l1-norm support vector machine coefficients in ultra-high dimension. Journal of Machine Learning Research , 17(1):8279--8304

work page 2016
[22]

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology , 66(5):688

work page 1974
[23]

Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association , 100(469):322--331

work page 2005
[24]

Shi, C., Song, R., Chen, Z., Li, R., et al. (2019). Linear hypothesis testing for high dimensional generalized linear models. Annals of statistics , 47(5):2671--2703

work page 2019
[25]

Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE transactions on information theory , 51(1):128--142

work page 2005
[26]

Steinwart, I., Scovel, C., et al. (2007). Fast rates for support vector machines using gaussian kernels. Annals of Statistics , 35(2):575--607

work page 2007
[27]

Van de Geer, S., B \"u hlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics , 42(3):1166--1202

work page 2014
[28]

Vert, R., Vert, J.-P., and Sch \"o lkopf, B. (2006). Consistency and convergence rates of one-class svms and related algorithms. Journal of Machine Learning Research , 7(5)

work page 2006
[29]

Wang, X., Yang, Z., Chen, X., and Liu, W. (2019). Distributed inference for linear support vector machine. Journal of machine learning research , 20(113):1--41

work page 2019
[30]

Wu, Y., Wang, L., and Fu, H. (2021). Model-assisted uniformly honest inference for optimal treatment regimes in high dimension. Journal of the American Statistical Association (In print)

work page 2021
[31]

Xue, F., Zhang, Y., Zhou, W., Fu, H., and Qu, A. (2020). Multicategory angle-based learning for estimating optimal dynamic treatment regimes with censored data. Journal of the American Statistical Association (In print)

work page 2020
[32]

Zhang, T. et al. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics , 32(1):56--85

work page 2004
[33]

Zhang, X., Wu, Y., Wang, L., and Li, R. (2016a). A consistent information criterion for support vector machines in diverging model spaces. Journal of Machine Learning Research , 17(1):466--491

work page
[34]

Zhang, X., Wu, Y., Wang, L., and Li, R. (2016b). Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 78(1):53--76

work page
[35]

J., and Kosorok, M

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning . Journal of the American Statistical Association , 107(499):1106--1118

work page 2012
[36]

B., Ning, Y., Saha, S., and Sands, B

Zhao, Y.-Q., Laber, E. B., Ning, Y., Saha, S., and Sands, B. E. (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research , 20(1):1821--1843

work page 2019
[37]

Q., Zeng, D., Laber, E

Zhao, Y. Q., Zeng, D., Laber, E. B., Song, R., Yuan, M., and Kosorok, M. R. (2014). Doubly robust learning for estimating individualized treatment with censored data . Biometrika , 102(1):151--168

work page 2014
[38]

Zhou, X., Mayer-Hamblett, N., Khan, U., and Kosorok, M. R. (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association , 112(517):169--187

work page 2017

[1] [1]

" " arXiv preprint arXiv:

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year arxivId label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION format.arxivId arxivId empty "" " arXiv preprint arX...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in " " * FUNCTION format....

work page

[3] [3]

L., Jordan, M

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association , 101(473):138--156

work page 2006

[4] [4]

Bartlett, P. L. and Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. Journal of Machine Learning Research , 9(8):1823--1840

work page 2008

[5] [5]

and Hochberg, Y

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) , 57(1):289--300

work page 1995

[6] [6]

Blanchard, G., Bousquet, O., Massart, P., et al. (2008). Statistical performance of support vector machines. Annals of Statistics , 36(2):489--531

work page 2008

[7] [7]

Chen, G., Zeng, D., and Kosorok, M. R. (2016). Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association , 111(516):1509--1521

work page 2016

[8] [8]

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017). Double/debiased/neyman machine learning of treatment effects. American Economic Review , 107(5):261--65

work page 2017

[9] [9]

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal , 21(1):C1--C68

work page 2018

[10] [10]

and Vapnik, V

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning , 20(3):273--297

work page 1995

[11] [11]

Dezeure, R., B \"u hlmann, P., and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap. TEST , 26(4):685--719

work page 2017

[12] [12]

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction . Cambridge University Press

work page 2015

[13] [13]

Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A bahadur representation of the linear support vector machine. Journal of Machine Learning Research , 9:1343--1368

work page 2008

[14] [14]

A., and Zhao, Y.-Q

Liang, M., Choi, Y.-G., Ning, Y., Smith, M. A., and Zhao, Y.-Q. (2022). Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score. Journal of Machine Learning Research (In print)

work page 2022

[15] [15]

Lin, Y. (2000). Some asymptotic properties of the support vector machine. University of Wisconsin, Madison

work page 2000

[16] [16]

Lin, Y. (2004). A note on margin-based loss functions in classification. Statistics & probability letters , 68(1):73--82

work page 2004

[17] [17]

Ma, R., Tony Cai, T., and Li, H. (2021). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association , 116(534):984--998

work page 2021

[18] [18]

Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics , 79(1):147--168

work page 1997

[19] [19]

and Liu, H

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Annals of Statistics , 45(1):158--195

work page 2017

[20] [20]

and Zhao, Y.-Q

Pan, Y. and Zhao, Y.-Q. (2021). Improved doubly robust estimation in learning optimal individualized treatment rules. Journal of the American Statistical Association , 116(533):283--294

work page 2021

[21] [21]

Peng, B., Wang, L., and Wu, Y. (2016). An error bound for l1-norm support vector machine coefficients in ultra-high dimension. Journal of Machine Learning Research , 17(1):8279--8304

work page 2016

[22] [22]

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology , 66(5):688

work page 1974

[23] [23]

Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association , 100(469):322--331

work page 2005

[24] [24]

Shi, C., Song, R., Chen, Z., Li, R., et al. (2019). Linear hypothesis testing for high dimensional generalized linear models. Annals of statistics , 47(5):2671--2703

work page 2019

[25] [25]

Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE transactions on information theory , 51(1):128--142

work page 2005

[26] [26]

Steinwart, I., Scovel, C., et al. (2007). Fast rates for support vector machines using gaussian kernels. Annals of Statistics , 35(2):575--607

work page 2007

[27] [27]

Van de Geer, S., B \"u hlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics , 42(3):1166--1202

work page 2014

[28] [28]

Vert, R., Vert, J.-P., and Sch \"o lkopf, B. (2006). Consistency and convergence rates of one-class svms and related algorithms. Journal of Machine Learning Research , 7(5)

work page 2006

[29] [29]

Wang, X., Yang, Z., Chen, X., and Liu, W. (2019). Distributed inference for linear support vector machine. Journal of machine learning research , 20(113):1--41

work page 2019

[30] [30]

Wu, Y., Wang, L., and Fu, H. (2021). Model-assisted uniformly honest inference for optimal treatment regimes in high dimension. Journal of the American Statistical Association (In print)

work page 2021

[31] [31]

Xue, F., Zhang, Y., Zhou, W., Fu, H., and Qu, A. (2020). Multicategory angle-based learning for estimating optimal dynamic treatment regimes with censored data. Journal of the American Statistical Association (In print)

work page 2020

[32] [32]

Zhang, T. et al. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics , 32(1):56--85

work page 2004

[33] [33]

Zhang, X., Wu, Y., Wang, L., and Li, R. (2016a). A consistent information criterion for support vector machines in diverging model spaces. Journal of Machine Learning Research , 17(1):466--491

work page

[34] [34]

Zhang, X., Wu, Y., Wang, L., and Li, R. (2016b). Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 78(1):53--76

work page

[35] [35]

J., and Kosorok, M

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning . Journal of the American Statistical Association , 107(499):1106--1118

work page 2012

[36] [36]

B., Ning, Y., Saha, S., and Sands, B

Zhao, Y.-Q., Laber, E. B., Ning, Y., Saha, S., and Sands, B. E. (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research , 20(1):1821--1843

work page 2019

[37] [37]

Q., Zeng, D., Laber, E

Zhao, Y. Q., Zeng, D., Laber, E. B., Song, R., Yuan, M., and Kosorok, M. R. (2014). Doubly robust learning for estimating individualized treatment with censored data . Biometrika , 102(1):151--168

work page 2014

[38] [38]

Zhou, X., Mayer-Hamblett, N., Khan, U., and Kosorok, M. R. (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association , 112(517):169--187

work page 2017