Horseshoe Predictive Inference

Percy S. Zhai; Veronika Ročková

arXiv: 2604.16661 · v1 · submitted 2026-04-17 · 🧮 math.ST · stat.ME · stat.TH

Pith reviewed 2026-05-10 06:40 UTC · model grok-4.3

classification 🧮 math.ST · stat.ME · stat.TH

keywords Horseshoe prior · predictive inference · sparse Gaussian sequence · minimax optimality · hierarchical prior · shrinkage estimation · phase transition · Bayesian prediction

The pith

The Horseshoe prior yields a predictive Bayes estimator that is exactly asymptotically minimax in the sparse Gaussian sequence model when the sparsity level is known.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines predictive inference under the Horseshoe prior in the sparse Gaussian sequence model, a setting that has received less attention than its non-sparse, finite-sample counterpart. It establishes that the predictive Bayes estimator achieves exact asymptotic minimax optimality when the sparsity level is known. A Gaussian-mixture representation of the posterior predictive density, called Horseshoe spectroscopy, shows that the predictive mechanism inherits the phase transition in the local shrinkage scale, leading to behavior akin to thresholding estimators. For unknown sparsity, a hierarchical Horseshoe prior performs adaptive switching, and under a theta-min condition the predictive risk admits an upper bound over a restricted parameter class that is sharper than the minimax rate over the full class.

Core claim

The predictive Bayes estimator under the Horseshoe prior is exactly asymptotically minimax optimal when sparsity is known. Through Horseshoe spectroscopy, the phase-transition in the local shrinkage scale is passed to the predictive mechanism. When sparsity is unknown, the hierarchical Horseshoe performs adaptive switching and attains an upper bound on predictive risk over a restricted parameter class that improves on the minimax rate for the full class, provided a theta-min condition holds.

What carries the argument

Horseshoe spectroscopy, a Gaussian-mixture representation of the posterior predictive density that transfers the phase transition from the shrinkage scale to the predictive inference step.
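A minimal sketch of what such a Gaussian-mixture representation can look like, assuming the standard Horseshoe parametrization θ | λ ~ N(0, τ²λ²) with λ ~ half-Cauchy(0, 1), one observation y ~ N(θ, 1), and a future draw with variance r. The function names, the grid discretization, and the default values of `tau`, `r`, and `n_grid` are illustrative assumptions, not the paper's construction. Conditional on λ the predictive is Gaussian, so integrating over the posterior of λ gives a mixture of Gaussians whose weights shift as |y| grows.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predictive_mixture(y, tau=0.1, r=1.0, n_grid=2000):
    """Discretised Gaussian-mixture form of the posterior predictive of a new
    draw y_new ~ N(theta, r), given one observation y ~ N(theta, 1).

    Assumed prior (standard Horseshoe, not quoted from the paper):
    theta | lam ~ N(0, tau^2 lam^2), lam ~ half-Cauchy(0, 1).
    Returns a list of (weight, mean, variance) components.
    """
    comps = []
    for k in range(n_grid):
        # lam = tan(phi) maps the half-Cauchy prior to a uniform law on
        # (0, pi/2), so a midpoint grid in phi has equal prior mass per node.
        phi = (k + 0.5) * (math.pi / 2.0) / n_grid
        s2 = (tau * math.tan(phi)) ** 2       # prior variance of theta given lam
        kappa = s2 / (1.0 + s2)               # E[theta | y, lam] = kappa * y
        w = normal_pdf(y, 0.0, 1.0 + s2)      # marginal likelihood of y given lam
        comps.append((w, kappa * y, kappa + r))
    total = sum(w for w, _, _ in comps)
    return [(w / total, m, v) for w, m, v in comps]

def predictive_density(y_new, comps):
    """Evaluate the mixture density at y_new."""
    return sum(w * normal_pdf(y_new, m, v) for w, m, v in comps)

def predictive_mean(comps):
    """Mean of the mixture, equal to the posterior mean of theta."""
    return sum(w * m for w, m, _ in comps)
```

Comparing `predictive_mean(predictive_mixture(0.5))` with `predictive_mean(predictive_mixture(8.0))` gives a rough view of the thresholding-like behavior: small observations are shrunk almost to zero, while large ones are left nearly untouched.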

If this is right

The predictive Horseshoe estimator attains exact asymptotic minimax optimality, not just the minimax rate, for known sparsity levels in sparse Gaussian sequences.
Hierarchical Horseshoe priors allow automatic adaptation to unknown sparsity without manual tuning.
Predictive risk bounds improve under theta-min conditions for signals in restricted classes.
The approach applies directly to modeling images and time series as sparse Gaussian sequences for tasks like facial recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the spectroscopy mechanism generalizes, other continuous shrinkage priors might exhibit similar phase-transition inheritance in prediction.
This could lead to better uncertainty quantification in sparse predictive settings compared to discrete mixture priors.
Testing on additional domains like genomics or finance could validate the practical gains from adaptive switching.

Load-bearing premise

The theta-min condition on the signals is required to obtain the sharper upper bound on predictive risk over the restricted parameter class when sparsity is unknown.

What would settle it

If the predictive risk under the hierarchical Horseshoe exceeds the claimed upper bound in simulations satisfying the theta-min condition, or if the posterior predictive density fails to exhibit the phase transition in local shrinkage for known sparsity, the central claims would not hold.

Figures

Figures reproduced from arXiv: 2604.16661 by Percy S. Zhai, Veronika Ročková.

**Figure 1.** An example of g(λ) under different y values. The dashed red line depicts λ₂*, the theoretical local maximum of g(λ). Note that g(λ) is the normalized posterior density π(λ | y), such that g(0) = 1.

**Figure 2.** The posterior of τ, π(τ | y), under the hierarchical Horseshoe model. Here, y is randomly generated from a fixed θ under two different settings. On the first row, θ follows Setup 1. On the second row, θ follows Setup 2. The hyperprior of τ was selected as an exponential distribution with rate n, and we set s_n = n/10 all the time. The blue dashed vertical line represents the minimum oracle calibration τ_{n,0}…

**Figure 3.** Samples from the JAFFE dataset.

**Figure 4.** Results of the Horseshoe predictive inference method on the JAFFE dataset, …

**Figure 5.** Accuracy of JAFFE facial recognition under different choices of cutoff …

**Figure 6.** Results of the Gaussian predictive inference method on the JAFFE dataset, using …

**Figure 7.** Accuracy of JAFFE facial recognition under different choices of cutoff …

**Figure 8.** Distribution of energy scores across 54 brain regions. Boxplots display the …

**Figure 9.** Anatomical topography of brain regions with significant discrepancy between ASD …

**Figure 10.** The univariate Horseshoe predictive Kullback-Leibler risk …

**Figure 11.** A comparison of maximum risk of various methods to the minimax KL risk …

**Figure 12.** Univariate risk plots for the Bi-Grid, the Dirac spike-and-slab, and the Horseshoe …

**Figure 13.** Results of the Horseshoe predictive inference method on the JAFFE dataset, …

**Figure 14.** Accuracy of JAFFE facial recognition under different choices of cutoff …

**Figure 15.** Results of the Horseshoe predictive inference method on the JAFFE dataset, …

**Figure 16.** Accuracy of JAFFE facial recognition under different choices of cutoff …

**Figure 17.** Distribution of rank-based predictive scores across 54 brain regions. Boxplots …

**Figure 18.** Anatomical topography of brain regions with significant discrepancy between …

**Figure 19.** Distribution of coverage rates across 54 brain regions. Boxplots display the …

**Figure 20.** Anatomical topography of brain regions with significant discrepancy between …

read the original abstract

Predictive inference in the sparse Gaussian sequence model has received considerably less attention than its non-sparse, finite-sample counterpart. Existing work has largely been confined to discrete mixture priors. In this paper, we study predictive inference under a widely used continuous mixture prior, the Horseshoe. We provide new theoretical results establishing exact asymptotic minimax optimality of the predictive Bayes estimator when the sparsity level is known. Furthermore, through a Gaussian-mixture representation of the posterior predictive density (which we term Horseshoe spectroscopy), the phase-transition in the local shrinkage scale is inherited by the predictive mechanism, producing behavior similar to that of previous thresholding/switching estimators. When sparsity is unknown, we adopt a fully Bayesian approach using a hierarchical Horseshoe prior and show that it performs adaptive, as opposed to manual, switching. Under a theta-min condition, the resulting predictive risk admits an upper bound over a restricted parameter class that is sharper than the minimax rate over the full class. We demonstrate the practical value of predictive Horseshoe shrinkage on data such as images and time series that can be naturally modeled as sparse Gaussian sequences. We illustrate this approach on facial recognition across varying facial expressions and study region-wise atypical brain lateralization in autism spectrum disorder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows exact asymptotic minimax optimality for Horseshoe predictive Bayes when sparsity is known, plus adaptive hierarchical behavior under a theta-min condition when sparsity is unknown.

The main takeaway is that the Horseshoe prior achieves exact asymptotic minimax optimality for predictive Bayes estimation in the sparse Gaussian sequence model when the sparsity level is known. The authors also introduce a Gaussian-mixture representation of the posterior predictive they call Horseshoe spectroscopy, which carries the local shrinkage phase transition over to prediction and produces thresholding-like behavior. When sparsity is unknown they switch to a hierarchical Horseshoe and obtain adaptive switching, with a sharper risk upper bound over the restricted class that satisfies the theta-min condition on signals.

Referee Report

1 major / 3 minor

Summary. The manuscript claims to establish exact asymptotic minimax optimality of the predictive Bayes estimator under the Horseshoe prior in the sparse Gaussian sequence model when sparsity is known. It introduces 'Horseshoe spectroscopy' as a Gaussian-mixture representation of the posterior predictive density to demonstrate inheritance of the local shrinkage phase-transition to prediction. For unknown sparsity, the hierarchical Horseshoe prior is shown to perform adaptive switching, and under a theta-min condition, the predictive risk admits a sharper upper bound over a restricted parameter class than the minimax rate over the full class. Practical value is demonstrated on image and time series data for applications like facial recognition and brain lateralization analysis in autism spectrum disorder.

Significance. Should the theoretical results be verified, this paper contributes meaningfully to the literature on Bayesian predictive inference in sparse high-dimensional settings by providing rigorous optimality guarantees for a popular continuous shrinkage prior. The Horseshoe spectroscopy offers a fresh perspective on how shrinkage mechanisms translate to predictive distributions, potentially generalizable to other priors. The adaptive results under the hierarchical prior, albeit restricted, advance understanding of fully Bayesian approaches to unknown sparsity. The real-data examples underscore the method's relevance to statistical applications in imaging and neuroscience.

major comments (1)

[Abstract and unknown sparsity section] The upper bound on predictive risk that is sharper than the minimax rate over the full class is derived under the theta-min condition on the signals. This condition is load-bearing for the adaptivity claim as it excludes arbitrarily small signals; without it, the risk may not improve. The manuscript should explicitly state whether this restriction is necessary or if extensions to the full class are possible.

minor comments (3)

[Introduction] The term 'Horseshoe spectroscopy' is coined for the Gaussian-mixture representation of the posterior predictive density; a formal definition and motivation should be provided at the first mention to aid reader comprehension.
[Applications] The modeling of images and time series as sparse Gaussian sequences in the applications could benefit from more explicit description of the data transformation steps and how the sequence model is fitted.
[Introduction] Additional references to prior work on predictive inference using discrete mixture priors in sparse models would better position the continuous Horseshoe approach.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of the manuscript. We address the single major comment below and will revise the paper accordingly to improve clarity.

Referee: [Abstract and unknown sparsity section] The upper bound on predictive risk that is sharper than the minimax rate over the full class is derived under the theta-min condition on the signals. This condition is load-bearing for the adaptivity claim as it excludes arbitrarily small signals; without it, the risk may not improve. The manuscript should explicitly state whether this restriction is necessary or if extensions to the full class are possible.

Authors: We agree that the theta-min condition plays a central role in deriving the sharper upper bound, as it ensures signals are bounded away from zero and thereby permits the hierarchical Horseshoe to achieve adaptive switching without interference from arbitrarily small nonzero components. Without this separation, the predictive risk may revert to the minimax rate over the full class. In the revised manuscript we will explicitly state that the restriction is necessary for the improved bound and that extending the result to the unrestricted class (without a theta-min condition) is an open question left for future work. This clarification will be added to both the abstract and the unknown-sparsity section.

Revision: yes

Circularity Check

0 steps flagged

No circularity: standard minimax and Bayesian derivations on Horseshoe prior

The paper derives exact asymptotic minimax optimality for the predictive Bayes estimator under known sparsity using standard theoretical techniques for continuous mixture priors in the Gaussian sequence model. The Horseshoe spectroscopy representation is introduced as a Gaussian-mixture form of the posterior predictive density to transfer local shrinkage phase transitions, but this is an analytical tool rather than a self-definitional reduction or fitted input renamed as prediction. The adaptive result for unknown sparsity employs a hierarchical Horseshoe prior with an explicit theta-min condition on signals to obtain a sharper upper bound on a restricted class; this is a stated assumption, not a circular equivalence or self-citation load-bearing step. No equations or claims reduce by construction to the paper's own inputs, prior self-citations, or renamings of known empirical patterns. The derivation chain remains self-contained against external minimax benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the standard sparse Gaussian sequence model and properties of the Horseshoe prior drawn from prior literature, plus the theta-min condition for the sharper bound.

axioms (2)

domain assumption: Observations follow the sparse Gaussian sequence model y_i = θ_i + ε_i with ε_i ~ N(0, 1).
This is the core setup stated in the abstract for all results.
domain assumption: The Horseshoe prior induces the desired shrinkage and phase-transition behavior.
Invoked throughout the theoretical development and spectroscopy representation.
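The first axiom above can be made concrete with a small simulation. This is a sketch under stated assumptions: the function name, the equal-magnitude signals, the value `signal=4.0`, and the seed are all hypothetical choices made here for illustration; only the model y_i = θ_i + ε_i with unit-variance Gaussian noise and the ratio s_n = n/10 (from the Figure 2 caption) come from the page.

```python
import random

def simulate_sparse_sequence(n=200, s_n=20, signal=4.0, seed=0):
    """Draw (theta, y) from y_i = theta_i + eps_i with eps_i ~ N(0, 1).

    Exactly s_n of the n means are nonzero; giving them a common value
    `signal` is an illustrative choice, not the paper's setup.
    """
    rng = random.Random(seed)
    support = set(rng.sample(range(n), s_n))  # positions of the nonzero means
    theta = [signal if i in support else 0.0 for i in range(n)]
    y = [t + rng.gauss(0.0, 1.0) for t in theta]
    return theta, y

theta, y = simulate_sparse_sequence()
print(sum(1 for t in theta if t != 0.0), "nonzero means out of", len(theta))
```

Keeping every nonzero mean at a fixed distance from zero also mimics, loosely, the kind of separation a theta-min condition imposes on the signals.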

invented entities (1)

Horseshoe spectroscopy (no independent evidence)
purpose: Gaussian-mixture representation of the posterior predictive density, used to study inheritance of phase transitions.
New term and representation introduced in the paper to analyze predictive behavior.

pith-pipeline@v0.9.0 · 5513 in / 1495 out tokens · 45908 ms · 2026-05-10T06:40:24.928045+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

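Assume that the prior of $\theta$ is a scale-mixture of Gaussians, $\theta \mid \lambda \sim N(0, \lambda^2)$, where the prior of $\lambda$ is $\nu(\lambda)$, then
$$\rho(\theta, \hat{p}) = \frac{\theta^2}{2r} - E_\theta \log N^{GM}_{\theta,v}(Z) + E_\theta \log N^{GM}_{\theta,1}(Z), \tag{A.1}$$
where
$$N^{GM}_{\theta,v}(Z) = \int_0^\infty \sqrt{\frac{v}{\lambda^2+v}}\, \exp\left[\frac{\lambda^2}{\lambda^2+v}\cdot\frac{(\sqrt{v}Z+\theta)^2}{2v}\right] \nu(\lambda)\, d\lambda.$$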

work page
[2]

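Proof. By Lemma 2.1 of Ročková (2023), for any prior $\pi(\theta)$,
$$\rho(\theta, \hat{p}) = \frac{\theta^2}{2r} - E_\theta \log N_{\theta,v}(Z) + E_\theta \log N_{\theta,1}(Z), \tag{A.6}$$
where $N_{\theta,v}(Z) = \int_{\mathbb{R}} \exp\left\{\frac{\mu Z}{\sqrt{v}} + \frac{\mu\theta}{v} - \frac{\mu^2}{2v}\right\} \pi(\mu)\, d\mu$, and $v^{-1} = 1 + r^{-1}$.

Specifically, if $\theta$ follows Horseshoe prior with a fixed $\tau > 0$, then
$$\rho(\theta, \hat{p}) = \frac{\theta^2}{2r} - E_\theta \log N^{HS}_{\theta,v}(Z) + E_\theta \log N^{HS}_{\theta,1}(Z), \tag{A.2}$$
where $N^{HS}_{\theta,v}(Z)$ takes any of the following equivalent forms:
$$N^{HS}_{\theta,v}(Z) = \frac{\tau}{\pi\sqrt{v}} \int_0^1 u^{-1/2}\, \frac{1}{\tau^2/v + (1-\tau^2/v)u}\, \exp\left[\frac{(\sqrt{v}Z+\theta)^2}{2v}\, u\right] du \tag{A.3}$$
$$= \frac{\tau}{\pi\sqrt{v}}\, e^{\frac{(\sqrt{v}Z+\theta)^2}{2v}} \int_0^1 (1-u)^{-1/2}\, \frac{1}{1-(1-\tau^2/v)u}\, \exp\left[-\frac{(\sqrt{v}Z+\theta)^2}{2v}\, u\right] du \tag{A.4}$$
$$= \frac{2\tau}{\pi\sqrt{v}}\, e^{(\sqrt{v}\ldots}$$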

work page 2014
[3]

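For $\tau \in (0,1)$, we may bound $D(Y_i)$ from above and $N_v(Y_i, \tilde{Y}_i)$ from below

Recall Lemma C.1 that $\tilde{g}(Y_i, \tilde{Y}_i, \theta_i, v) = \frac{\tilde{Y}_i\theta_i}{r} - \frac{\theta_i^2}{2r} - \log N_v(Y_i, \tilde{Y}_i) + \log D(Y_i)$. For $\tau \in (0,1)$, we may bound $D(Y_i)$ from above and $N_v(Y_i, \tilde{Y}_i)$ from below. Using the representation (C.1), we have $\tau^2 + (1-\tau^2)u \ge \tau^2$ for $\tau \in (0,1)$, and hence
$$\log D(Y_i) \le \log\left(\frac{\tau}{\pi}\cdot\frac{e^{Y_i^2/2}}{\tau^2}\int_0^1 u^{-1/2}\, du\right) = \frac{Y_i^2}{2} + \log\frac{1}{\tau} + \log\frac{2}{\pi}.$$
Meanwhile, note that $\tau^2/v + (1-\tau^2/v)u \le 1/v$ for $\tau \in (0,1)$...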

work page 2023
[4]

Boxplots display the distribution of these scores for the ASD group (red) and control group (blue)

Figure 17: Distribution of rank-based predictive scores across 54 brain regions (horizontal axis: rank-based predictive score; groups: ASD, Control). Boxplots display the distribution of these scores for the ASD group (red) and control group (blue). Region labels are color-coded based on significant group differences in Wilcoxon...

work page
[5]

Boxplots display the distribution of these scores for the ASD group (red) and control group (blue)

Figure 19: Distribution of coverage rates across 54 brain regions (horizontal axis: entrywise coverage rate (C); groups: ASD, Control). Boxplots display the distribution of these scores for the ASD group (red) and control group (blue). Region labels are color-coded based on significant group differences in Wilcoxon rank-sum test...

work page
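
The excerpt under [2] lists (A.3) and (A.4) as equivalent forms without showing the intermediate steps. The following is a sketch of how they can be reached from the scale-mixture form in (A.1), assuming the Horseshoe mixing density is the half-Cauchy with scale $\tau$, $\nu(\lambda) = 2\tau / \{\pi(\tau^2 + \lambda^2)\}$ for $\lambda > 0$. That parametrization is an assumption made here, since the excerpt does not state it, and the shorthand $A = (\sqrt{v}Z+\theta)^2/(2v)$ is introduced only for this derivation.

```latex
% Step 1: from (A.1) to (A.3) via u = \lambda^2 / (\lambda^2 + v).
\begin{align*}
\lambda &= \sqrt{v}\, u^{1/2} (1-u)^{-1/2}, \qquad
d\lambda = \tfrac{\sqrt{v}}{2}\, u^{-1/2} (1-u)^{-3/2}\, du, \qquad
\sqrt{\tfrac{v}{\lambda^2+v}} = (1-u)^{1/2}, \\
\nu(\lambda) &= \frac{2\tau}{\pi(\tau^2+\lambda^2)}
  = \frac{2\tau\,(1-u)}{\pi v\,\{\tau^2/v + (1-\tau^2/v)u\}}, \\
N^{GM}_{\theta,v}(Z)
  &= \int_0^1 (1-u)^{1/2}\, e^{Au}\,
     \frac{2\tau\,(1-u)}{\pi v\,\{\tau^2/v + (1-\tau^2/v)u\}}\,
     \frac{\sqrt{v}}{2}\, u^{-1/2}(1-u)^{-3/2}\, du \\
  &= \frac{\tau}{\pi\sqrt{v}} \int_0^1
     \frac{u^{-1/2}\, e^{Au}}{\tau^2/v + (1-\tau^2/v)u}\, du
  \qquad \text{(the form in (A.3))}.
\end{align*}

% Step 2: from (A.3) to (A.4) via u \mapsto 1-u.
\begin{align*}
\tau^2/v + (1-\tau^2/v)(1-u) &= 1 - (1-\tau^2/v)u, \qquad
e^{A(1-u)} = e^{A} e^{-Au}, \\
N^{HS}_{\theta,v}(Z)
  &= \frac{\tau}{\pi\sqrt{v}}\, e^{A} \int_0^1
     \frac{(1-u)^{-1/2}\, e^{-Au}}{1 - (1-\tau^2/v)u}\, du
  \qquad \text{(the form in (A.4))}.
\end{align*}
```

In Step 1 the powers of $(1-u)$ cancel ($\tfrac12 + 1 - \tfrac32 = 0$) and the constants combine as $\frac{2\tau}{\pi v}\cdot\frac{\sqrt{v}}{2} = \frac{\tau}{\pi\sqrt{v}}$, which is where the prefactor in (A.3) comes from.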
