From Isotonic to Lipschitz Regression: A New Interpolative Perspective on Shape-restricted Estimation

Arun Kumar Kuchibhotla; Kenta Takatsu; Tianyu Zhang

arxiv: 2307.05732 · v5 · pith:MWPEX63Rnew · submitted 2023-07-11 · 📊 stat.ME · math.ST· stat.TH

From Isotonic to Lipschitz Regression: A New Interpolative Perspective on Shape-restricted Estimation

Kenta Takatsu , Tianyu Zhang , Arun Kumar Kuchibhotla This is my paper

Pith reviewed 2026-05-24 07:14 UTC · model grok-4.3

classification 📊 stat.ME math.STstat.TH

keywords Lipschitz regressionisotonic estimationshape-restricted estimationsample splittingnonparametric regressionmonotonic functionsadaptivityheavy-tailed errors

0 comments

The pith

Every Lipschitz function equals a monotonic function plus a linear function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that any function satisfying a Lipschitz condition can be rewritten exactly as the sum of a monotonic function and a linear function. This algebraic identity lets the authors construct regression estimators for the Lipschitz class by splitting the data once, applying an isotonic estimator to one subsample, and fitting the linear term to the other. The resulting procedure carries over the computational speed, theoretical rates, and robustness properties already known for shape-restricted estimators. The same decomposition extends to higher-order monotonicity constraints and to functions of several variables, delivering convergence under heteroscedastic and heavy-tailed noise while adapting automatically to the unknown smoothness level of the target function.

Core claim

The central claim is that the Lipschitz regression problem reduces to an isotonic problem plus a linear fit through the identity that every Lipschitz function f satisfies f(x) = g(x) + c·x for some monotonic g and constant c. A sample-splitting estimator is built by applying an existing shape-restricted estimator to one half of the data and ordinary least squares to the linear coefficient on the remaining half; the procedure inherits the methodological, theoretical, and computational features of isotonic regression while providing explicit convergence guarantees under heteroscedastic and heavy-tailed errors and automatic adaptation to the complexity of the unknown regression function. The 1D

What carries the argument

The monotonic-plus-linear decomposition of a Lipschitz function, which reduces estimation to a shape-restricted subproblem that existing isotonic algorithms can solve after one sample split.

If this is right

The estimators attain explicit convergence rates under heteroscedastic and heavy-tailed errors.
Rates adapt automatically to the unknown complexity of the regression function without tuning.
The same decomposition yields estimators for higher-order monotonicity and multivariate Lipschitz classes.
New approximation results follow directly from the decomposition for the Lipschitz and related classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar algebraic decompositions might reduce other smoothness classes to shape-restricted ones, allowing reuse of isotonic code for additional problems.
The approach could be tested for robustness when the linear term is estimated under dependence or missing observations.
Empirical comparisons on data sets known to have heavy tails would check whether the inherited robustness appears in practice.

Load-bearing premise

The sample-splitting step based on the decomposition inherits all statistical and computational advantages of the base shape-restricted estimators without introducing extra bias or new limitations under the stated error models.

What would settle it

A simulation or theoretical calculation in which the sample-split estimator exhibits strictly worse rate or larger finite-sample error than a direct Lipschitz-constrained optimizer, under the same heteroscedastic heavy-tailed design, would show the inheritance claim fails.

Figures

Figures reproduced from arXiv: 2307.05732 by Arun Kumar Kuchibhotla, Kenta Takatsu, Tianyu Zhang.

**Figure 6.1.** Figure 6.1: Estimated regression functions based on 500 observat [PITH_FULL_IMAGE:figures/full_fig_p021_6_1.png] view at source ↗

**Figure 6.2.** Figure 6.2: Average MSEs across 300 replications for different samp [PITH_FULL_IMAGE:figures/full_fig_p022_6_2.png] view at source ↗

**Figure 6.3.** Figure 6.3: Boxplot of MSEs across 300 replications for sample sizes [PITH_FULL_IMAGE:figures/full_fig_p022_6_3.png] view at source ↗

**Figure 6.4.** Figure 6.4: Average MSEs across 300 replications. The solid black line r [PITH_FULL_IMAGE:figures/full_fig_p023_6_4.png] view at source ↗

**Figure 6.5.** Figure 6.5: Boxplot of MSEs across 300 replications for sample sizes [PITH_FULL_IMAGE:figures/full_fig_p023_6_5.png] view at source ↗

**Figure 7.1.** Figure 7.1: Estimated density functions based on 500 observations [PITH_FULL_IMAGE:figures/full_fig_p025_7_1.png] view at source ↗

read the original abstract

This manuscript bridges nonparametric smoothness-based and shape-restricted estimation, which may appear as two disjoint paradigms in the field. The proposed approach is motivated by a conceptually simple observation: every Lipschitz function is a sum of a monotonic and a linear function. This principle is further generalized to the higher-order monotonicity and multivariate settings. A family of estimators is proposed based on a sample-splitting procedure, inheriting desirable methodological, theoretical, and computational properties of shape-restricted estimators. The theoretical analysis provides convergence guarantees of the estimator under heteroscedastic and heavy-tailed errors, as well as adaptivity to the unknown ``complexity" of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results and numerical studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core move is a clean decomposition turning Lipschitz regression into isotonic plus linear via sample splitting, with rates under weak noise.

read the letter

The useful thing here is the observation that any Lipschitz function decomposes as a monotone piece plus a linear term, which they turn into a sample-split estimator that borrows isotonic machinery. This extends to higher-order monotonicity and multivariate cases, and they supply approximation results plus numerics to show the idea works in practice. The theoretical part gives convergence under heteroscedastic and heavy-tailed errors plus adaptivity to the function's complexity, which is a concrete plus over purely smoothness-based approaches that often need stronger assumptions. The construction is straightforward and avoids circularity since the split separates the monotone estimation from the slope. What stands out is how little extra machinery is needed once the decomposition is in hand. Sample splitting is the obvious soft spot: it halves the data for each piece, so efficiency takes a hit relative to a joint estimator, though the paper appears to accept this trade-off for the inheritance of shape-restricted properties. The multivariate extension is claimed but likely requires more care on the ordering and the linear term. Overall the argument holds together without load-bearing gaps visible from the claims and stress-test. This is for people already working on shape-restricted or isotonic methods who want a way to handle smoothness constraints without starting from scratch. A reader who cares about practical nonparametric tools with weak error conditions will get something usable. It deserves a serious referee because the decomposition is elementary yet opens a usable bridge, and the rates are stated under realistic conditions.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that every Lipschitz function equals a monotonic function plus a linear term (with generalizations to higher-order monotonicity and the multivariate case), and uses this decomposition to construct sample-splitting estimators for Lipschitz regression. These estimators are asserted to inherit the methodological, theoretical, and computational advantages of existing shape-restricted estimators while delivering convergence rates under heteroscedastic and heavy-tailed errors, adaptivity to unknown complexity, and new approximation results, all supported by theory and numerical studies.

Significance. If the decomposition and the inheritance properties of the sample-splitting estimators hold with the stated rates, the work supplies a practical bridge between shape-restricted and smoothness-based nonparametric regression. The explicit construction f(x) = [f(x) − Lx] + Lx (and its higher-order analogues) together with the provision of reproducible numerical studies and robustness to non-Gaussian errors constitute clear strengths.

major comments (2)

[§3] §3 (estimator construction): the sample-splitting procedure is presented as inheriting the full suite of computational and theoretical advantages of isotonic estimators without new limitations, yet the analysis does not quantify the effect of the split ratio on the final rate under heavy-tailed errors; this is load-bearing for the robustness claim.
[Theorem 4.1] Theorem 4.1 (convergence under heteroscedasticity): the stated rate appears to treat the linear coefficient as known after the first split, but the paper does not supply an explicit bound on the additional error introduced when the coefficient is estimated from the second subsample; without this the adaptivity statement is incomplete.

minor comments (2)

[Abstract] The abstract asserts 'new approximation results' without indicating their form or where they appear; a one-sentence pointer would improve readability.
[§2] Notation for the Lipschitz constant L is introduced without an explicit statement of whether it is assumed known or estimated; a short clarification in §2 would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and recommendation of minor revision. The comments highlight two points where the current analysis could be strengthened for greater clarity on the robustness and adaptivity claims. We address each below and indicate the revisions we will make.

read point-by-point responses

Referee: [§3] §3 (estimator construction): the sample-splitting procedure is presented as inheriting the full suite of computational and theoretical advantages of isotonic estimators without new limitations, yet the analysis does not quantify the effect of the split ratio on the final rate under heavy-tailed errors; this is load-bearing for the robustness claim.

Authors: We agree that an explicit quantification of the split-ratio effect on the final rate (particularly the dependence on the heavy-tail parameter) would strengthen the robustness claim. The current rates are derived for a fixed split proportion (e.g., n/2), and the constants absorb the resulting factors; however, we did not spell out the explicit dependence on the split fraction. In the revision we will add a short remark (and, if space permits, a corollary) showing how the leading constant scales with the split ratio under the stated moment conditions. This is a partial revision because the main convergence statements remain unchanged. revision: partial
Referee: [Theorem 4.1] Theorem 4.1 (convergence under heteroscedasticity): the stated rate appears to treat the linear coefficient as known after the first split, but the paper does not supply an explicit bound on the additional error introduced when the coefficient is estimated from the second subsample; without this the adaptivity statement is incomplete.

Authors: The linear coefficient is estimated on the first subsample and then subtracted before applying the isotonic estimator on the second subsample. Under the heavy-tailed moment assumptions used in the paper, the estimation error of the coefficient is of order n^{-1/2} (up to log factors), which is absorbed into the leading term of the isotonic rate. Nevertheless, we concede that an explicit decomposition isolating this term would make the adaptivity argument fully rigorous. We will insert a short lemma bounding the additional error and update the proof of Theorem 4.1 accordingly; the stated rate itself is unaffected. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core claim is the explicit decomposition that every Lipschitz function equals a monotonic function plus a linear term (generalized to higher-order and multivariate cases), which holds by direct construction f(x) = [f(x) − Lx] + Lx with the bracketed term provably non-increasing under the Lipschitz condition. This is an independent mathematical fact, not a self-definition or fitted input. The sample-splitting estimators are constructed to inherit properties of shape-restricted estimators, with convergence results under stated error models; no step reduces a prediction to a fitted parameter by construction, no load-bearing self-citation chain is invoked for uniqueness, and no ansatz is smuggled via prior work. The derivation chain is self-contained and externally verifiable via standard analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the monotonic-plus-linear decomposition for Lipschitz functions and on the claim that sample splitting transfers the advantages of shape-restricted estimators; both are presented as observations rather than derived results.

axioms (1)

domain assumption Every Lipschitz function is a sum of a monotonic function and a linear function (generalized to higher-order monotonicity and multivariate cases)
This is the motivating observation stated in the abstract on which the entire estimator family is built.

pith-pipeline@v0.9.0 · 5658 in / 1333 out tokens · 31168 ms · 2026-05-24T07:14:04.325394+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

37 In step (I) we used the following bound on E[Γn]: E[Γn] = E   ( N − 1 N∑ i=1 ξ2 i )1/ 2 + ∥f0∥∞ +F   ≤ σ + ∥f0∥∞ +F since E[ξ2 i |Xi]<σ 2

holds for any f ⋄ ∈ F ε, it can be stated for any f ∈ F as follows: E[ℓ0((X,Y ); ˆf ⋄)] ≤ (1 + 2δ) E[ℓ0((X,Y );f ⋄)] +C log(N (ε, F,L 2(Pn))) ( n− 1+1/q F +n− 1F 2/δ ) + 4ε E[Γn] =⇒ E[ℓ0((X,Y ); ˆf )] − 4ε(F + ∥f0∥∞ ) ≤ (1 + 2δ) {E[ℓ0((X,Y );f )] + 4ε(∥f0∥∞ +F )} +C log(N (ε, F,L 2(Pn))) ( n− 1+1/q F +n− 1F 2/δ ) + 4ε E[Γn] =⇒ E[ℓ0((X,Y ); ˆf )] ≤ (1 + 2δ...

work page 2006
[2]

and we deﬁne a ﬁxed- L oracle function as follows: f ∗ L(x) := g∗ L(x) − Lx where g∗ L := arg min g∈C E(Y +LX − g(X))2. (41) We also recall that for each m ∈ N, we deﬁne Fm(1,L ) a set of any m-piecewise function, taking the following form: fm,L (x) := m∑ j=1 aj1(x ∈ Ij) − Lx (42) where ai < aj for i < jand {Ij}m j=1 is any non-overlapping partition of [0...

work page
[3]

Consider the regression model (1) and assume E[ξ2|Xi] ≤ σ 2 for all 1 ≤ i ≤ n

and f ∗ L be the ﬁxed-L oracle function given by (41). Consider the regression model (1) and assume E[ξ2|Xi] ≤ σ 2 for all 1 ≤ i ≤ n. Deﬁne B(L) := ∥f0∥∞ +σ 2 + 2|L|. Then, there exists constants Cε and Nε only depending on ε ∈ (0, 1) such that for any n ≥ Nε, the following holds with probability greater than 1 − ε: ∥ ˆfL − f ∗ L∥2 L2(P ) ≤ CεB(L)2 inf m∈...

work page
[4]

Since the above argument holds for arbitrary m and fm,L , we conclude the claim by taking the inﬁmum over Fm,L and m ∈ N on the right-hand side

and obtain that with probability greater than 1 − ε, ∥ ˆfL − f ∗ L∥2 L2(P ) ≤ CεB(L)2 (mσ 2 log2 {3nB(L)} n + ∥fm,L − f ∗ L∥2 L2(P ) ) for constant Cε only depending on ε. Since the above argument holds for arbitrary m and fm,L , we conclude the claim by taking the inﬁmum over Fm,L and m ∈ N on the right-hand side. C.2 Proofs of Lemma C.2 and Lemma C.3 Tw...

work page 2018
[5]

using this new measure Q(·) as follows: E [ sup f ∈F |Wf | ⏐ ⏐ ⏐ ⏐ ⏐ {(ξi,X i)}n i=1 ] ≲ ∫ ∥ξ∥∥F ∥L2(Q) 0 √ log N (∥ξ∥− 1δ, F, ∥ · ∥L2(Q))dδ = ∥ξ∥∥F ∥L2(Q) ∫ 1 0 √ log N (∥F ∥L2(Q)τ, F, ∥ · ∥L2(Q))dτ. 54 We take expectation with respect to the joint distribution {(ξi,X i)}n i=1 on both sides and obtain E [ sup f ∈F |Wf | ] ≲J(1, F) E[∥ξ∥∥F ∥L2(Q)] =J(1, F...

work page 1996
[6]

For any q ∈ F , its restriction on Ij ,q|Ij :Ij → R, is an isotonic function. 56

work page
[7]

Here, the probability measure µ is deﬁned over [0, 1] and µ ([a,b ]) is = ∫b aµ (x) for 0 ≤ a<b ≤

The function space H := {h is non-decreasing over [0 , 1] | ∥h∥L2 (µ ) ≤ δ, ∥h∥∞ ≤ B} has its corresponding envelope function H(x) := min { B,δ max {µ ([0,x ]),µ ([x, 1])}− 1/ 2 } . Here, the probability measure µ is deﬁned over [0, 1] and µ ([a,b ]) is = ∫b aµ (x) for 0 ≤ a<b ≤

work page
[8]

Furthermore, we have ∥H∥L2(µ ) ≤ Cδ √ log(B/δ ) (See Section 5.1 of Kuchibhotla and Patra (2022)). The restriction of F on each interval Ij, denoted by FIj , is a subset of the following function space: FIj ⊂ G j := { g : [0, 1] ↦→ R ⏐ ⏐ ⏐ ⏐g is non-decreasing over Ij, ∥g∥L2(Pj ) ≤ δ/P 1/ 2 X (Ij), ∥g∥∞ ≤ B } . Combining the above two facts, we know the e...

work page 2022
[9]

In this case, we perform a coordinate-descent procedure to upd ate ˆgj,L for each j = 1,...d component until the empirical risk converges

While the underlying concept behind the procedure remains identical to Algorithm 1, there are notable diﬀerences in steps 3 and 4. In this case, we perform a coordinate-descent procedure to upd ate ˆgj,L for each j = 1,...d component until the empirical risk converges. Similar to Algorithm 1, step 5 of Algorithm 2 does not require exhaustive evaluation of...

work page
[10]

Initialize estimators ˆgi(x) for i = 1 , 2, . . . , d . These can, for instance, be constant functions

work page
[11]

, n } into two disjoint subsets: I1 and I2

Randomly split {1, 2, . . . , n } into two disjoint subsets: I1 and I2

work page
[12]

, d } and L ∈ L , compute an isotonic regression estimator ˆgℓ based on ( Xi, [ℓ], Z i,L − ∑ j̸=ℓ gj(Xi, [j]))n i=1 where Zi,L := Yi + L⊤ Xi

For each ℓ ∈ { 1, 2, . . . , d } and L ∈ L , compute an isotonic regression estimator ˆgℓ based on ( Xi, [ℓ], Z i,L − ∑ j̸=ℓ gj(Xi, [j]))n i=1 where Zi,L := Yi + L⊤ Xi. For instance, one may deﬁne the LSE for each coordinate as ˆgℓ,L := arg min g∈C (1) ∑ i∈I 1 {( Zi,L − ∑ j̸=ℓ ˆgj(Xi, [j]) ) − g(Xi, [ℓ]) } 2 , iteratively obtaining an isotonic regression ...

work page
[13]

Deﬁne the resulting estimator for each L as ˆfL(x) = ∑d j=1 ˆgj,L (x) − L⊤ x

Repeat step 3 until the empirical risk associated with I1 given by ∑ i∈I 1 ( Zi,L − d∑ j=1 ˆgj,L (Xi, [j]) )2 converges. Deﬁne the resulting estimator for each L as ˆfL(x) = ∑d j=1 ˆgj,L (x) − L⊤ x

work page
[14]

Find the “optimal” vector ˆL based on {(Xi, Y i) : i ∈ I 2} by: ˆL := arg min L∈L ∑ i∈I 2 (Yi − ˆfL(Xi))2

work page
[15]

Return the ﬁnal estimator ˆfˆL(x) := ∑d j=1 ˆgj, ˆL(x) − ˆL⊤ x. 68 F.2 Oracle property of the additive estimator Below, we claim that, under the additive model, the converge nce rate of each additive component informs the convergence rate of overall procedure. This fol lows from the so-called oracle property of additive estimator, which reduces its analys...

work page 2007
[16]

We select the penalty parameter of ridge regression amon g 10 candidates

Kernel ridge regression ( KRR): The kernel function is given by K(x,z ) := 1 + min(x,z ), which corresponds to the ﬁrst-order Sobolev space (see, for insta nce, Example 12.16 of Wainwright (2019)). We select the penalty parameter of ridge regression amon g 10 candidates. Due to the computationally intensive nature of KRR, we do not explo re ﬁner choices o...

work page 2019
[17]

01, and we choose the maximum depth of each tree from the set {2, 5}

Gradient boosting machines ( GBM): The shrinkage parameter is set to 0 . 01, and we choose the maximum depth of each tree from the set {2, 5}. The total number of trees is selected from {100, 1000, 2000, 4000, 8000}. The remaining parameters are set to their default values according to the GBM library in R

work page 2000
[18]

We use the estimator implemented by R package randomForest (Liaw and Wiener , 2002)

Random forest regression ( RF): The number of trees is selected from {50, 100, 500, 1000, 5000}. We use the estimator implemented by R package randomForest (Liaw and Wiener , 2002)

work page 2002
[19]

worst case

Penalized sieve estimator with cosine basis ( Sieve): We employ 50 basis functions and select the penalization tuning parameter from (approximately) 10 0 default grids. The estimators are realized using R package Sieve (Zhang and Simon , 2022). G.2 Details on regression functions used during the numeric al studies G.2.1 Univariate cases We provide the det...

work page 2022
[20]

Similar to the univariate cases, we anticipate that the proposed method will converge essent ially at a rate of n− 2/ 3 for Scenario 1 and n− 1 for Scenario 2

and ( 54). Similar to the univariate cases, we anticipate that the proposed method will converge essent ially at a rate of n− 2/ 3 for Scenario 1 and n− 1 for Scenario 2. We discuss the justiﬁcation behind this argu ment in Section F.2. We also consider two 5-dimensional examples: • Scenario 3 (5d): f (x) := f1(x[1]) − f1(x[2]) +x3 − x4 + 1, • Scenario 4 ...

work page 2015

[1] [1]

37 In step (I) we used the following bound on E[Γn]: E[Γn] = E   ( N − 1 N∑ i=1 ξ2 i )1/ 2 + ∥f0∥∞ +F   ≤ σ + ∥f0∥∞ +F since E[ξ2 i |Xi]<σ 2

holds for any f ⋄ ∈ F ε, it can be stated for any f ∈ F as follows: E[ℓ0((X,Y ); ˆf ⋄)] ≤ (1 + 2δ) E[ℓ0((X,Y );f ⋄)] +C log(N (ε, F,L 2(Pn))) ( n− 1+1/q F +n− 1F 2/δ ) + 4ε E[Γn] =⇒ E[ℓ0((X,Y ); ˆf )] − 4ε(F + ∥f0∥∞ ) ≤ (1 + 2δ) {E[ℓ0((X,Y );f )] + 4ε(∥f0∥∞ +F )} +C log(N (ε, F,L 2(Pn))) ( n− 1+1/q F +n− 1F 2/δ ) + 4ε E[Γn] =⇒ E[ℓ0((X,Y ); ˆf )] ≤ (1 + 2δ...

work page 2006

[2] [2]

and we deﬁne a ﬁxed- L oracle function as follows: f ∗ L(x) := g∗ L(x) − Lx where g∗ L := arg min g∈C E(Y +LX − g(X))2. (41) We also recall that for each m ∈ N, we deﬁne Fm(1,L ) a set of any m-piecewise function, taking the following form: fm,L (x) := m∑ j=1 aj1(x ∈ Ij) − Lx (42) where ai < aj for i < jand {Ij}m j=1 is any non-overlapping partition of [0...

work page

[3] [3]

Consider the regression model (1) and assume E[ξ2|Xi] ≤ σ 2 for all 1 ≤ i ≤ n

and f ∗ L be the ﬁxed-L oracle function given by (41). Consider the regression model (1) and assume E[ξ2|Xi] ≤ σ 2 for all 1 ≤ i ≤ n. Deﬁne B(L) := ∥f0∥∞ +σ 2 + 2|L|. Then, there exists constants Cε and Nε only depending on ε ∈ (0, 1) such that for any n ≥ Nε, the following holds with probability greater than 1 − ε: ∥ ˆfL − f ∗ L∥2 L2(P ) ≤ CεB(L)2 inf m∈...

work page

[4] [4]

Since the above argument holds for arbitrary m and fm,L , we conclude the claim by taking the inﬁmum over Fm,L and m ∈ N on the right-hand side

and obtain that with probability greater than 1 − ε, ∥ ˆfL − f ∗ L∥2 L2(P ) ≤ CεB(L)2 (mσ 2 log2 {3nB(L)} n + ∥fm,L − f ∗ L∥2 L2(P ) ) for constant Cε only depending on ε. Since the above argument holds for arbitrary m and fm,L , we conclude the claim by taking the inﬁmum over Fm,L and m ∈ N on the right-hand side. C.2 Proofs of Lemma C.2 and Lemma C.3 Tw...

work page 2018

[5] [5]

using this new measure Q(·) as follows: E [ sup f ∈F |Wf | ⏐ ⏐ ⏐ ⏐ ⏐ {(ξi,X i)}n i=1 ] ≲ ∫ ∥ξ∥∥F ∥L2(Q) 0 √ log N (∥ξ∥− 1δ, F, ∥ · ∥L2(Q))dδ = ∥ξ∥∥F ∥L2(Q) ∫ 1 0 √ log N (∥F ∥L2(Q)τ, F, ∥ · ∥L2(Q))dτ. 54 We take expectation with respect to the joint distribution {(ξi,X i)}n i=1 on both sides and obtain E [ sup f ∈F |Wf | ] ≲J(1, F) E[∥ξ∥∥F ∥L2(Q)] =J(1, F...

work page 1996

[6] [6]

For any q ∈ F , its restriction on Ij ,q|Ij :Ij → R, is an isotonic function. 56

work page

[7] [7]

Here, the probability measure µ is deﬁned over [0, 1] and µ ([a,b ]) is = ∫b aµ (x) for 0 ≤ a<b ≤

The function space H := {h is non-decreasing over [0 , 1] | ∥h∥L2 (µ ) ≤ δ, ∥h∥∞ ≤ B} has its corresponding envelope function H(x) := min { B,δ max {µ ([0,x ]),µ ([x, 1])}− 1/ 2 } . Here, the probability measure µ is deﬁned over [0, 1] and µ ([a,b ]) is = ∫b aµ (x) for 0 ≤ a<b ≤

work page

[8] [8]

Furthermore, we have ∥H∥L2(µ ) ≤ Cδ √ log(B/δ ) (See Section 5.1 of Kuchibhotla and Patra (2022)). The restriction of F on each interval Ij, denoted by FIj , is a subset of the following function space: FIj ⊂ G j := { g : [0, 1] ↦→ R ⏐ ⏐ ⏐ ⏐g is non-decreasing over Ij, ∥g∥L2(Pj ) ≤ δ/P 1/ 2 X (Ij), ∥g∥∞ ≤ B } . Combining the above two facts, we know the e...

work page 2022

[9] [9]

In this case, we perform a coordinate-descent procedure to upd ate ˆgj,L for each j = 1,...d component until the empirical risk converges

While the underlying concept behind the procedure remains identical to Algorithm 1, there are notable diﬀerences in steps 3 and 4. In this case, we perform a coordinate-descent procedure to upd ate ˆgj,L for each j = 1,...d component until the empirical risk converges. Similar to Algorithm 1, step 5 of Algorithm 2 does not require exhaustive evaluation of...

work page

[10] [10]

Initialize estimators ˆgi(x) for i = 1 , 2, . . . , d . These can, for instance, be constant functions

work page

[11] [11]

, n } into two disjoint subsets: I1 and I2

Randomly split {1, 2, . . . , n } into two disjoint subsets: I1 and I2

work page

[12] [12]

, d } and L ∈ L , compute an isotonic regression estimator ˆgℓ based on ( Xi, [ℓ], Z i,L − ∑ j̸=ℓ gj(Xi, [j]))n i=1 where Zi,L := Yi + L⊤ Xi

For each ℓ ∈ { 1, 2, . . . , d } and L ∈ L , compute an isotonic regression estimator ˆgℓ based on ( Xi, [ℓ], Z i,L − ∑ j̸=ℓ gj(Xi, [j]))n i=1 where Zi,L := Yi + L⊤ Xi. For instance, one may deﬁne the LSE for each coordinate as ˆgℓ,L := arg min g∈C (1) ∑ i∈I 1 {( Zi,L − ∑ j̸=ℓ ˆgj(Xi, [j]) ) − g(Xi, [ℓ]) } 2 , iteratively obtaining an isotonic regression ...

work page

[13] [13]

Deﬁne the resulting estimator for each L as ˆfL(x) = ∑d j=1 ˆgj,L (x) − L⊤ x

Repeat step 3 until the empirical risk associated with I1 given by ∑ i∈I 1 ( Zi,L − d∑ j=1 ˆgj,L (Xi, [j]) )2 converges. Deﬁne the resulting estimator for each L as ˆfL(x) = ∑d j=1 ˆgj,L (x) − L⊤ x

work page

[14] [14]

Find the “optimal” vector ˆL based on {(Xi, Y i) : i ∈ I 2} by: ˆL := arg min L∈L ∑ i∈I 2 (Yi − ˆfL(Xi))2

work page

[15] [15]

Return the ﬁnal estimator ˆfˆL(x) := ∑d j=1 ˆgj, ˆL(x) − ˆL⊤ x. 68 F.2 Oracle property of the additive estimator Below, we claim that, under the additive model, the converge nce rate of each additive component informs the convergence rate of overall procedure. This fol lows from the so-called oracle property of additive estimator, which reduces its analys...

work page 2007

[16] [16]

We select the penalty parameter of ridge regression amon g 10 candidates

Kernel ridge regression ( KRR): The kernel function is given by K(x,z ) := 1 + min(x,z ), which corresponds to the ﬁrst-order Sobolev space (see, for insta nce, Example 12.16 of Wainwright (2019)). We select the penalty parameter of ridge regression amon g 10 candidates. Due to the computationally intensive nature of KRR, we do not explo re ﬁner choices o...

work page 2019

[17] [17]

01, and we choose the maximum depth of each tree from the set {2, 5}

Gradient boosting machines ( GBM): The shrinkage parameter is set to 0 . 01, and we choose the maximum depth of each tree from the set {2, 5}. The total number of trees is selected from {100, 1000, 2000, 4000, 8000}. The remaining parameters are set to their default values according to the GBM library in R

work page 2000

[18] [18]

We use the estimator implemented by R package randomForest (Liaw and Wiener , 2002)

Random forest regression ( RF): The number of trees is selected from {50, 100, 500, 1000, 5000}. We use the estimator implemented by R package randomForest (Liaw and Wiener , 2002)

work page 2002

[19] [19]

worst case

Penalized sieve estimator with cosine basis ( Sieve): We employ 50 basis functions and select the penalization tuning parameter from (approximately) 10 0 default grids. The estimators are realized using R package Sieve (Zhang and Simon , 2022). G.2 Details on regression functions used during the numeric al studies G.2.1 Univariate cases We provide the det...

work page 2022

[20] [20]

Similar to the univariate cases, we anticipate that the proposed method will converge essent ially at a rate of n− 2/ 3 for Scenario 1 and n− 1 for Scenario 2

and ( 54). Similar to the univariate cases, we anticipate that the proposed method will converge essent ially at a rate of n− 2/ 3 for Scenario 1 and n− 1 for Scenario 2. We discuss the justiﬁcation behind this argu ment in Section F.2. We also consider two 5-dimensional examples: • Scenario 3 (5d): f (x) := f1(x[1]) − f1(x[2]) +x3 − x4 + 1, • Scenario 4 ...

work page 2015