pith. sign in

arxiv: 2606.15602 · v2 · pith:ESRK5YDMnew · submitted 2026-06-14 · 📊 stat.ME

Bias-Aware External-Model-Assisted Inference in High-Dimensional Regression

Pith reviewed 2026-06-27 04:27 UTC · model grok-4.3

classification 📊 stat.ME
keywords high-dimensional regressionsemi-supervised inferencedebiased lassoexternal modelsasymptotic normalityconfidence intervalsprediction-powered inferencecovariate shift
0
0 comments X

The pith

A bias-aware shrinkage step routes external predictors into the variance of debiased high-dimensional estimators, producing shorter valid intervals than PPI or debiased Lasso.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in high-dimensional semi-supervised linear regression, an external predictor and many unlabeled samples can be used to tighten confidence intervals without sacrificing validity. It achieves this by sending the external information only into the variance term of a debiased estimator through a cross-fitted shrinkage step that automatically adapts when the external model is helpful, neutral, or harmful. The resulting procedure maintains coordinate-wise asymptotic normality and remains valid for the projection parameter even when the external model is misspecified or the outcome model is nonlinear. Simulations and six real-data examples confirm that the intervals are materially shorter than those from debiased Lasso, PPI, and PPI++ at the same unlabeled budget, and a shift-aware variant handles covariate shift.

Core claim

The Debiased External-model-Assisted Lasso (DEAL) routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. It proves coordinate-wise asymptotic normality with an adaptive variance, extends validity to the projection parameter under misspecification and nonlinear labelers, and shows that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++; a shift-aware variant preserves coverage under covariate shift.

What carries the argument

The bias-aware, cross-fitted shrinkage step that decides how much external-model information to fold into the variance estimator based on estimated bias.

If this is right

  • At fixed unlabeled budget, DEAL intervals are shorter than those from debiased Lasso, PPI, and PPI++.
  • The procedure remains valid for the projection parameter when the external model is misspecified or the labeler is nonlinear.
  • A shift-aware variant maintains coverage when the distribution of unlabeled covariates differs from the labeled data.
  • In simulations the interval lengths are between 0.49 and 0.87 times the debiased-Lasso length; in real data the median ratios range from 0.23 to 0.53.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive routing of external information into variance could be tried with other high-dimensional estimators beyond the Lasso.
  • When the external model is a large language model, DEAL offers a way to obtain shorter scientific intervals without retraining the model on the target labels.
  • The length gains may be largest when unlabeled data are cheap relative to labeled data, suggesting a practical allocation rule for future studies.

Load-bearing premise

The cross-fitted shrinkage step adapts correctly to different external-model qualities without introducing bias that the asymptotic normality argument does not capture.

What would settle it

A high-dimensional simulation in which the external predictor carries substantial bias yet the shrinkage step fails to downweight it enough, producing intervals whose empirical coverage falls below the nominal level.

Figures

Figures reproduced from arXiv: 2606.15602 by Hanxuan Ye, Hongzhe Li, Hongzhe Zhang.

Figure 1
Figure 1. Figure 1: Power adaptivity of DEAL across the twelve external-estimator qualities. Left: the variance￾balance choice Nˆ ∗ versus the external sample size nA. Centre: the median CI-length ratio of DEAL to target-only debiased Lasso (DL) on signal coordinates; horizontal reference lines mark the DL benchmark and the PPI++ benchmark at the highest external-estimator quality (nA = 3200). Right: empirical signal coverage… view at source ↗
Figure 2
Figure 2. Figure 2: Robustness of DEAL-shift-aware under covariate shift in the unlabeled covariate distribution. Left: the variance-balance choice Nˆ ∗ versus the unlabeled-design AR(1) parameter ρu. Centre: CI-length ratio against DL. Right: empirical signal coverage. Dashed lines indicate the no-shift reference for each external-estimator quality from Section 7.2. 7.4 Robustness to covariate shift in the unlabeled covariat… view at source ↗
Figure 3
Figure 3. Figure 3: DEAL inference under the oracle linear-coefficient labeler under non-linear (Hermite) truth. Left: empirical signal coverage versus the misspecification strength α. Right: CI-length ratio of each estimator to DL on Jsignal. External-estimator coefficient βˆ ext = β ⋆ exactly, nA = 3200, n0 = 400, AR(1) target with ρ0 = 0.4, p = 120, s = 6, R = 20 replications. N is chosen per replication by the variance-ba… view at source ↗
Figure 4
Figure 4. Figure 4: DEAL inference under the linearised oracle labeler across three forms of η. Top row: empirical signal coverage versus the misspecification strength α. Bottom row: CI-length ratio of each estimator to DL on Jsignal. Columns correspond to the three η specifications: Hermite, GB-shaped, and MLP-shaped. Reference lines mark the nominal coverage 0.95 (top) and the DL-parity ratio 1 (bottom). External-estimator … view at source ↗
Figure 5
Figure 5. Figure 5: DEAL inference under joint covariate shift and model misspecification. Top row: empirical signal coverage versus the unlabeled-design AR(1) parameter ρu. Bottom row: CI-length ratio of DEAL to DL on Jsignal. Columns correspond to the three η specifications: Hermite (left), GB-shaped (centre), and MLP-shaped (right). Vertical dotted line marks the no-shift cell ρu = ρ0 = 0.4. The dashed purple lines mark th… view at source ↗
read the original abstract

In high-dimensional semi-supervised linear regression, prediction-powered inference (PPI) corrects an external predictor with a rectifier estimated from the labeled data. In a linear model, however, this rectifier cancels the predictor: PPI and PPI++ reduce to ordinary least squares and can inflate variance when the predictor is close to the oracle. We propose the Debiased External-model-Assisted Lasso (DEAL), which routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. We prove coordinate-wise asymptotic normality with an adaptive variance, extend validity to the projection parameter under misspecification and nonlinear labelers, and show that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++; a shift-aware variant preserves coverage under covariate shift. In simulations, DEAL intervals are 0.49-0.87 of the debiased-Lasso length, and across six real-data applications spanning astronomy, chemistry, proteomics, and oncology, the last using a large-language-model oracle, they tighten in every case, with median length ratios of 0.23-0.53.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Debiased External-model-Assisted Lasso (DEAL) for high-dimensional semi-supervised linear regression. It routes an external predictor through a bias-aware cross-fitted shrinkage step into the variance of a debiased estimator, proving coordinate-wise asymptotic normality with an adaptive variance. The method extends validity to the projection parameter under misspecification and nonlinear labelers, claims shorter intervals than debiased Lasso, PPI, and PPI++ at fixed unlabeled budget, and includes a shift-aware variant; simulations and six real-data examples (astronomy to oncology with LLM oracle) report length ratios of 0.23-0.87.

Significance. If the asymptotic results hold, DEAL offers a principled way to adaptively incorporate external models in high-dimensional inference without the variance inflation seen in PPI when predictors are near-oracle, while preserving coverage under misspecification. Strengths include the coordinate-wise normality proof, extension to projection parameters, and consistent empirical tightening across regimes and datasets.

major comments (2)
  1. [§3.2, Theorem 1] §3.2 (bias-aware cross-fitted shrinkage): the proof of coordinate-wise asymptotic normality (Theorem 1) does not supply an explicit bound on the remainder term arising from the data-dependent shrinkage parameter. Without showing that this term is asymptotically negligible uniformly across target-only, near-oracle, and biased-but-informative regimes, residual dependence between cross-fit folds and the primary estimating equation may alter both centering and the claimed adaptive variance formula.
  2. [§4] §4 (extension to projection parameter under misspecification): the validity claim for nonlinear labelers relies on the same cross-fitted shrinkage step, yet the expansion does not address how the shrinkage adaptation interacts with the misspecification bias term; this is load-bearing for the statement that DEAL remains valid when the external model is biased but informative.
minor comments (2)
  1. [§3.1] Notation for the shrinkage parameter λ and its cross-fit estimator should be introduced with an explicit equation before its use in the variance formula.
  2. [Table 1] Table 1 (simulation length ratios): report the number of Monte Carlo replications and whether the reported intervals are averaged over coordinates or selected coordinates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2, Theorem 1] §3.2 (bias-aware cross-fitted shrinkage): the proof of coordinate-wise asymptotic normality (Theorem 1) does not supply an explicit bound on the remainder term arising from the data-dependent shrinkage parameter. Without showing that this term is asymptotically negligible uniformly across target-only, near-oracle, and biased-but-informative regimes, residual dependence between cross-fit folds and the primary estimating equation may alter both centering and the claimed adaptive variance formula.

    Authors: We agree that an explicit bound on the remainder term would strengthen the proof. In the revision we will add a supplementary lemma deriving such a bound and verifying its asymptotic negligibility uniformly across the three regimes. This will confirm that cross-fitting removes any residual dependence that could affect centering or the adaptive variance formula. revision: yes

  2. Referee: [§4] §4 (extension to projection parameter under misspecification): the validity claim for nonlinear labelers relies on the same cross-fitted shrinkage step, yet the expansion does not address how the shrinkage adaptation interacts with the misspecification bias term; this is load-bearing for the statement that DEAL remains valid when the external model is biased but informative.

    Authors: We concur that the interaction between shrinkage adaptation and the misspecification bias term should be made explicit. The revised §4 will augment the expansion to detail this interaction, thereby confirming validity for the projection parameter under nonlinear labelers and biased-but-informative external models. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper extends existing debiased Lasso and PPI methods with a new bias-aware cross-fitted shrinkage step, then claims to prove coordinate-wise asymptotic normality with an adaptive variance that holds under misspecification. No equations or steps in the abstract reduce the adaptive variance, interval lengths, or normality result to a fitted quantity defined by the same data by construction. The derivation builds on prior external methods without load-bearing self-citations or uniqueness theorems imported from the authors' own prior work; the central claims rest on stated assumptions and new proofs rather than self-referential definitions or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard high-dimensional linear model assumptions plus the novel adaptive shrinkage construction; no free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption The data follow a linear model in which the rectifier of PPI cancels the external predictor.
    Abstract states that in a linear model PPI reduces to OLS.
  • domain assumption Cross-fitting produces an adaptive shrinkage factor that remains valid across predictor-quality regimes.
    The bias-aware shrinkage step is presented as the mechanism that adapts across regimes.

pith-pipeline@v0.9.1-grok · 5759 in / 1411 out tokens · 46240 ms · 2026-06-27T04:27:39.065290+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references

  1. [1]

    Science , volume=

    Prediction-powered inference , author=. Science , volume=. 2023 , publisher=

  2. [2]

    The Journal of Machine Learning Research , volume=

    Confidence intervals and hypothesis testing for high-dimensional regression , author=. The Journal of Machine Learning Research , volume=. 2014 , publisher=

  3. [3]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2022 , publisher=

  4. [4]

    Angelopoulos, Anastasios N and Duchi, John C and Zrnic, Tijana , journal=

  5. [5]

    The Annals of Statistics , volume=

    On asymptotically optimal confidence regions and tests for high-dimensional models , author=. The Annals of Statistics , volume=. 2014 , publisher=

  6. [6]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Confidence intervals for low dimensional parameters in high dimensional linear models , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2014 , publisher=

  7. [7]

    The Annals of Statistics , volume=

    High-dimensional graphs and variable selection with the lasso , author=. The Annals of Statistics , volume=. 2006 , publisher=

  8. [8]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Stability selection , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

  9. [9]

    Normal approximations with

    Nourdin, Ivan and Peccati, Giovanni , series=. Normal approximations with. 2012 , publisher=

  10. [10]

    Walmsley, Mike and others , journal=. Galaxy

  11. [11]

    Bamford, Steven P and others , journal=

  12. [12]

    Physical content of the exact

    Perdew, John P and Levy, Mel , journal=. Physical content of the exact

  13. [13]

    Physical Chemistry Chemical Physics , volume=

    Screened hybrid density functionals for solid-state chemistry and physics , author=. Physical Chemistry Chemical Physics , volume=

  14. [14]

    Benchmarking materials property prediction methods: the

    Dunn, Alexander and Wang, Qi and Ganose, Alex and Dopp, Daniel and Jain, Anubhav , journal=. Benchmarking materials property prediction methods: the

  15. [15]

    Compensatory water effects link yearly global land

    Jung, Martin and others , journal=. Compensatory water effects link yearly global land

  16. [16]

    Nature , volume=

    Large influence of soil moisture on long-term terrestrial carbon uptake , author=. Nature , volume=

  17. [17]

    Recent pause in the growth rate of atmospheric

    Keenan, Trevor F and Prentice, I Colin and Canadell, Josep G and others , journal=. Recent pause in the growth rate of atmospheric

  18. [18]

    Integrating the evidence for a terrestrial carbon sink caused by increasing atmospheric

    Walker, Anthony P and others , journal=. Integrating the evidence for a terrestrial carbon sink caused by increasing atmospheric

  19. [19]

    Nature Climate Change , volume=

    The increasing importance of atmospheric demand for ecosystem water and carbon fluxes , author=. Nature Climate Change , volume=

  20. [20]

    Sensitivity of atmospheric

    Humphrey, Vincent and Zscheischler, Jakob and Ciais, Philippe and others , journal=. Sensitivity of atmospheric

  21. [21]

    Terrestrial carbon balance in a drier world: the effects of water availability in southwestern

    Biederman, Joel A and others , journal=. Terrestrial carbon balance in a drier world: the effects of water availability in southwestern

  22. [22]

    Proceedings of the National Academy of Sciences , volume=

    Land--atmosphere feedbacks exacerbate concurrent soil drought and atmospheric aridity , author=. Proceedings of the National Academy of Sciences , volume=

  23. [23]

    Science Advances , volume=

    Dependence of drivers affects risks associated with compound events , author=. Science Advances , volume=

  24. [24]

    Predicting carbon dioxide and energy fluxes across global

    Tramontana, Gianluca and Jung, Martin and Schwalm, Christopher R and others , journal=. Predicting carbon dioxide and energy fluxes across global

  25. [25]

    Earth System Science Data , volume=

    Upscaled diurnal cycles of land--atmosphere fluxes: a new global half-hourly data product , author=. Earth System Science Data , volume=

  26. [26]

    New perspective on spring vegetation phenology and global climate change based on

    Yang, Bao and others , journal=. New perspective on spring vegetation phenology and global climate change based on

  27. [27]

    Nature Medicine , volume=

    Plasma protein patterns as comprehensive indicators of health , author=. Nature Medicine , volume=

  28. [28]

    Nature Medicine , volume=

    Proteomic signatures improve risk prediction for common and rare diseases , author=. Nature Medicine , volume=

  29. [29]

    European Heart Journal , volume=

    Proteomic cardiovascular risk assessment in chronic kidney disease , author=. European Heart Journal , volume=

  30. [30]

    Bild, Diane E and Bluemke, David A and Burke, Gregory L and others , journal=. Multi-

  31. [31]

    Feldman, Harold I and Appel, Lawrence J and Chertow, Glenn M and others , journal=. The

  32. [32]

    Journal of the American Medical Association , volume=

    A genomic predictor of response and survival following taxane--anthracycline chemotherapy for invasive breast cancer , author=. Journal of the American Medical Association , volume=

  33. [33]

    Journal of Clinical Oncology , volume=

    Long-term prognostic risk after neoadjuvant chemotherapy associated with residual cancer burden and breast cancer subtype , author=. Journal of Clinical Oncology , volume=

  34. [34]

    Nature Medicine , volume=

    High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response , author=. Nature Medicine , volume=

  35. [35]

    Alpelisib for

    Andr. Alpelisib for. New England Journal of Medicine , volume=

  36. [36]

    Biological characterization of

    Yeh, Tony C and Marsh, Vivien and Bernat, Beth A and Ballard, Joshua and Colwell, Hillary and Evans, Rebecca J and Parry, Janet and Smith, Darnell and Brandhuber, Barbara J and Gross, Susan and others , journal=. Biological characterization of. 2007 , publisher=

  37. [37]

    Genomics of Drug Sensitivity in Cancer (

    Yang, Wanjuan and Soares, Jorge and Greninger, Patricia and Edelman, Elena J and Lightfoot, Howard and Forbes, Simon and Bindal, Nidhi and Beare, Dave and Smith, James A and Thompson, I Richard and others , journal=. Genomics of Drug Sensitivity in Cancer (. 2013 , publisher=

  38. [38]

    Proceedings of the National Academy of Sciences , volume=

    Cross-prediction-powered inference , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  39. [39]

    Proceedings of the 41st International Conference on Machine Learning , year=

    Active statistical inference , author=. Proceedings of the 41st International Conference on Machine Learning , year=

  40. [40]

    The Annals of Statistics , volume=

    Efficient and adaptive linear regression in semi-supervised settings , author=. The Annals of Statistics , volume=. 2018 , publisher=

  41. [41]

    The Annals of Statistics , volume=

    Semi-supervised inference: General theory and estimation of means , author=. The Annals of Statistics , volume=. 2019 , publisher=

  42. [42]

    Journal of the American Statistical Association , volume=

    Transfer learning under high-dimensional generalized linear models , author=. Journal of the American Statistical Association , volume=. 2023 , publisher=

  43. [43]

    Keret, Nir and Shojaie, Ali , journal=

  44. [44]

    arXiv preprint arXiv:2510.08123 , year=

    High-dimensional analysis of synthetic data selection , author=. arXiv preprint arXiv:2510.08123 , year=

  45. [45]

    Bernoulli , volume=

    Concentration inequalities and moment bounds for sample covariance operators , author=. Bernoulli , volume=. 2017 , publisher=

  46. [46]

    Biometrika , volume=

    Scaled sparse linear regression , author=. Biometrika , volume=. 2012 , publisher=

  47. [47]

    Proceedings of the National Academy of Sciences , volume=

    Methods for correcting inference based on outcomes predicted by machine learning , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

  48. [48]

    Journal of Machine Learning Research , volume=

    Revisiting inference after prediction , author=. Journal of Machine Learning Research , volume=

  49. [49]

    Physical Review Letters , volume=

    Generalized gradient approximation made simple , author=. Physical Review Letters , volume=. 1996 , publisher=