arxiv: 2606.27688 · v1 · pith:U73TWRGXnew · submitted 2026-06-26 · 💻 cs.LG · cs.AI

Deployment-Side Adaptiveness in Multi-Horizon Volatility Forecasting

Pith reviewed 2026-06-29 04:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-horizon forecasting · volatility prediction · MIMO forecasters · inference-time rollout · deployment policies · stock volatility series · MSE and QLIKE · financial time series

The pith

Varying the inference-time rollout rule for a trained MIMO volatility forecaster often improves accuracy over standard deployment, and validation can select low-cost policies that outperform the default.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a trained multi-output forecaster for multi-horizon volatility does not define a single predictor: different rollout rules at inference time create a family of forecasts with distinct accuracy and cost profiles. Across 20 stock-volatility series, three horizons, and models from linear to PatchTST, non-default rules frequently beat standard MIMO deployment, though the best fixed rule shifts with architecture and horizon. Validation-based selection of single rules or small subsets yields low-cost gains over the default and recovers much of the benefit of larger ensembles, but rankings shift when switching from MSE to QLIKE. A reader would care because this decouples training from deployment, turning inference choices into a source of adaptiveness without retraining.

Core claim

By changing the inference-time rollout rule, the same trained MIMO forecaster induces a family of forecasts; validation-selected singletons improve over default MIMO at low cost, while small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost. Non-default rollout rules often improve over standard MIMO across the series, yet policy rankings are metric-sensitive and do not transfer uniformly from MSE to QLIKE.

What carries the argument

The family of forecasts induced by different inference-time rollout rules on a trained multi-output (MIMO) model; it turns one trained network into multiple deployable predictors that can be chosen by validation.
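As a minimal sketch of how one trained H-output model can yield several deployable forecasts, the Python below commits only the first `block` outputs of each call and rolls the state forward before calling the model again. The names `rollout` and `toy_model` are hypothetical, and the sliding-window update stands in for the fusion step, whose exact form is not given here; this is an illustration under those assumptions, not the paper's implementation.

```python
from typing import Callable, List, Sequence

# A forecaster maps a fixed-length history window to H future values.
Forecaster = Callable[[Sequence[float]], List[float]]


def rollout(f: Forecaster, state: Sequence[float], horizon: int, block: int) -> List[float]:
    """Produce `horizon` forecasts, committing `block` outputs per model call."""
    if not 1 <= block <= horizon:
        raise ValueError("block size must satisfy 1 <= block <= horizon")
    window = list(state)
    forecasts: List[float] = []
    while len(forecasts) < horizon:
        outputs = f(window)  # one call yields all H outputs
        take = min(block, horizon - len(forecasts))
        committed = list(outputs[:take])
        forecasts.extend(committed)
        # Assumed fusion step: append committed predictions and keep the
        # most recent len(state) values as the next rolled state.
        window = (window + committed)[-len(state):]
    return forecasts


def toy_model(window: Sequence[float]) -> List[float]:
    """Hypothetical 5-output model: decay from the last value toward the window mean."""
    mean = sum(window) / len(window)
    last = window[-1]
    return [mean + (last - mean) * 0.5 ** (k + 1) for k in range(5)]


history = [1.0, 2.0, 3.0, 4.0]
# block=5 is default MIMO, block=2 is block-recursive, block=1 is fully recursive.
family = {s: rollout(toy_model, history, horizon=5, block=s) for s in (1, 2, 5)}
```

Smaller blocks need more model calls for the same horizon (five calls at `block=1` versus one at `block=5` here), which is the source of the accuracy and cost trade-off across rules.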

If this is right

  • Non-default rollout rules often improve on standard MIMO deployment across the 20 volatility series, though the best fixed rule varies with architecture and horizon.
  • Validation-selected singletons deliver low-cost accuracy gains over the default.
  • Small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost.
  • Policy rankings shift when the loss switches from MSE to QLIKE.
  • Volatility forecasters need evaluation on both architecture and deployment policy.
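The selection policies in the list above can be sketched as follows: rank rules by validation MSE, deploy the top rule as a singleton, or average the top few as a small subset. The rule names, the toy numbers, and the equal-weight averaging are all assumptions made for illustration; the paper's actual combination scheme is not specified in this summary.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return fmean((p - t) ** 2 for p, t in zip(pred, target))


def rank_rules(val_preds: Dict[str, List[float]], val_target: Sequence[float]) -> List[str]:
    """Order rule names from lowest to highest validation MSE."""
    return sorted(val_preds, key=lambda rule: mse(val_preds[rule], val_target))


def subset_forecast(preds: Dict[str, List[float]], rules: Sequence[str]) -> List[float]:
    """Equal-weight average of the forecasts from the chosen rules (assumed scheme)."""
    return [fmean(values) for values in zip(*(preds[r] for r in rules))]


# Hypothetical validation forecasts from three rollout rules.
val_target = [1.0, 1.2, 0.9, 1.1]
val_preds = {
    "mimo": [1.3, 1.5, 1.2, 1.4],
    "block2": [1.1, 1.2, 1.0, 1.1],
    "recursive": [0.8, 1.1, 0.8, 1.0],
}

ranking = rank_rules(val_preds, val_target)
singleton = ranking[0]  # validation-selected single rule
subset = ranking[:2]    # small validation-selected subset

# Hypothetical test-period forecasts, combined using the selected subset.
test_preds = {
    "mimo": [1.4, 1.0, 1.3],
    "block2": [1.2, 0.9, 1.1],
    "recursive": [1.0, 0.7, 0.9],
}
combined = subset_forecast(test_preds, subset)
```

Selection uses only validation data; the test forecasts are touched only after the rule or subset is fixed, which is what keeps the comparison against default MIMO honest.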

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rollout-family idea could be tested on multi-horizon tasks outside finance, such as energy demand or traffic.
  • Dynamic selection of rules based on recent market regime might further reduce the gap to full ensembles.
  • The approach raises the question of whether deployment adaptiveness appears in other multi-output time-series models beyond volatility.
  • Live deployment on streaming market data would test whether the validation gains survive distribution shifts.

Load-bearing premise

The validation set used to select the rollout policy is representative of future unseen data, and observed performance differences arise from the rules themselves rather than from overfitting to the validation period.

What would settle it

Check whether a rollout policy chosen on the validation set still outperforms the default MIMO rule on a later test window that was never seen during policy selection.

Figures

Figures reproduced from arXiv: 2606.27688 by Riku Green, Telmo M Silva Filho, Zahraa S. Abdallah.

Figure 1: A trained 𝐻-output MIMO forecaster can be redeployed with smaller block size 𝑠 ≤ 𝐻. Top: default MIMO deployment uses all 𝐻 = 5 outputs in one shot. Middle: block-recursive deployment with 𝑠 = 2 commits the first two outputs from each call; these committed predictions are fused with the current state to form the next rolled state, which is then reused by the same forecasting function 𝑓𝜃. Bottom: fully rec…
Figure 2: Rule-level win rates against default MIMO across horizons. For each horizon, cells show the fraction of tasks on which …
Figure 3: Singleton deployment rules expose a heterogeneous accuracy–cost landscape. Each point is a fixed non-MIMO singleton …
Figure 4: Operational deployment policies improve the MSE– …
Figure 5: Transfer of MSE-selected deployment policies to …
Figure 6: Transfer of MSE wins to QLIKE by deployment …
Original abstract

In financial forecasting, predictive performance depends not only on which model is trained, but also on how the trained model is deployed. We study this issue in multi-horizon volatility forecasting. Our starting point is that a trained multi-output (MIMO) forecaster does not define a single deployable predictor: by changing the inference-time rollout rule, the same trained model induces a family of forecasts with different accuracy and cost profiles. Across 20 stock-volatility series, three forecast horizons, and architectures ranging from linear models to PatchTST, we find that non-default rollout rules often improve over standard MIMO deployment. However, the best fixed rule varies substantially across architectures and horizons, making any single static replacement unreliable. We therefore evaluate validation-based deployment policies over the induced rule family. Under the primary MSE objective, validation-selected singletons provide a low-cost improvement over default MIMO, while small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost. We also find that policy rankings are metric-sensitive: MSE-selected policies do not transfer uniformly to QLIKE, a finance-standard volatility loss. These results show that inference-time deployment is a meaningful source of adaptiveness in financial forecasting, and that trained volatility forecasters should be evaluated not only by their architecture, but also by their deployment policy.
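To illustrate why rankings under MSE need not carry over to QLIKE, the sketch below compares the two losses on a single toy observation. The QLIKE form used here, y/h − log(y/h) − 1 for realized variance y and forecast h, is one common parameterization and is an assumption; the abstract does not state which variant the paper uses. The numbers are hypothetical.

```python
import math
from statistics import fmean
from typing import Sequence


def mse(realized: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean squared error."""
    return fmean((y - h) ** 2 for y, h in zip(realized, forecast))


def qlike(realized: Sequence[float], forecast: Sequence[float]) -> float:
    """QLIKE in the assumed form y/h - log(y/h) - 1 (requires positive inputs)."""
    return fmean(y / h - math.log(y / h) - 1.0 for y, h in zip(realized, forecast))


realized = [1.0]
under = [0.5]  # forecast at half the realized value
over = [2.0]   # forecast at twice the realized value

mse_under, mse_over = mse(realized, under), mse(realized, over)
qlike_under, qlike_over = qlike(realized, under), qlike(realized, over)
```

In this toy case MSE prefers the under-forecast (smaller absolute miss) while QLIKE prefers the over-forecast, because this form of QLIKE penalizes under-prediction of variance more heavily than over-prediction by the same ratio.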

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a trained MIMO forecaster in multi-horizon volatility prediction induces a family of deployable predictors via different inference-time rollout rules; across 20 stock-volatility series, three horizons, and architectures from linear models to PatchTST, non-default rules often outperform standard MIMO, validation-selected singletons yield low-cost gains over default MIMO, and small rule subsets recover much of the benefit of larger ensembles at lower inference cost, though policy rankings are sensitive to the loss (MSE vs. QLIKE).

Significance. If the central empirical patterns hold after addressing validation representativeness, the work establishes deployment policy as a distinct, low-cost source of adaptiveness in financial forecasting that complements model architecture choices. The breadth of the evaluation (20 series, multiple horizons and architectures, held-out test comparisons) and the efficiency finding on small rule subsets are strengths that would make the result practically relevant for volatility model deployment.

major comments (2)
  1. [Abstract and validation-based deployment policies section] The headline result on validation-selected singletons and rule subsets improving over default MIMO (abstract and results) rests on the assumption that performance differences are driven by the rollout rules and that the validation window is representative of future data. In non-stationary volatility series, a single fixed validation period risks selecting rules that exploit transient regime characteristics; the manuscript does not report multiple rolling validation windows, statistical tests for distribution shift between validation and test sets, or ablation on stability of rule rankings across windows.
  2. [Empirical evaluation sections] The reported consistent empirical patterns across series, horizons, and architectures lack accompanying error bars, statistical significance tests, or exact details on data splits and train/validation/test partitioning (abstract and empirical results). This leaves moderate support for the claim that observed gains arise from the rollout rules rather than sampling variability or unaccounted data characteristics.
minor comments (2)
  1. Notation for the family of rollout rules and the induced predictors could be introduced more explicitly with a small table or diagram to aid readability.
  2. The manuscript would benefit from a brief discussion of how the 20 series were selected and any preprocessing steps for the volatility targets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important considerations for robustness in non-stationary financial time series. We address each major comment below and indicate planned revisions.

  1. Referee: [Abstract and validation-based deployment policies section] The headline result on validation-selected singletons and rule subsets improving over default MIMO rests on the assumption that performance differences are driven by the rollout rules and that the validation window is representative of future data. In non-stationary volatility series, a single fixed validation period risks selecting rules that exploit transient regime characteristics; the manuscript does not report multiple rolling validation windows, statistical tests for distribution shift between validation and test sets, or ablation on stability of rule rankings across windows.

    Authors: We agree that non-stationarity poses a risk for validation-based selection and that additional robustness checks would strengthen the claims. Our current setup uses a single validation window immediately preceding the test period, following standard practice in financial forecasting to reflect recent regimes. However, we will add an ablation using multiple rolling validation windows, report stability of rule rankings across them, and include Kolmogorov-Smirnov tests for distribution shift between validation and test sets. These will be presented in a new subsection on validation sensitivity. revision: yes

  2. Referee: [Empirical evaluation sections] The reported consistent empirical patterns across series, horizons, and architectures lack accompanying error bars, statistical significance tests, or exact details on data splits and train/validation/test partitioning. This leaves moderate support for the claim that observed gains arise from the rollout rules rather than sampling variability or unaccounted data characteristics.

    Authors: We acknowledge that the current presentation would benefit from greater statistical rigor. The manuscript already specifies the 20 series, horizons, and partitioning (70/15/15 train/validation/test split with chronological ordering), but we will expand this description with exact dates and add error bars (via bootstrap resampling over series) plus paired t-tests or Wilcoxon tests for significance of gains over default MIMO. These additions will appear in the empirical evaluation and appendix. revision: yes
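The chronological partitioning and the bootstrap over series mentioned in the responses above could look roughly like the sketch below. The function names, the percentile-interval construction, and the per-series gain values are hypothetical; only the 70/15/15 chronological split and the idea of resampling over series come from the text.

```python
import random
from statistics import fmean
from typing import List, Sequence, Tuple


def chronological_split(series: Sequence[float], train_pct: int = 70, val_pct: int = 15):
    """Split in time order (no shuffling) into train, validation, and test segments."""
    n = len(series)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return series[:train_end], series[train_end:val_end], series[val_end:]


def bootstrap_ci(gains: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean gain, resampling over series."""
    rng = random.Random(seed)
    means: List[float] = sorted(
        fmean(rng.choices(gains, k=len(gains))) for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


train, val, test = chronological_split(list(range(100)))

# Hypothetical per-series MSE gains of a selected policy over default MIMO.
gains = [0.02, 0.05, 0.01, 0.04, 0.03]
lo, hi = bootstrap_ci(gains)
```

An interval whose lower bound stays above zero would support a gain over default MIMO beyond sampling variability across series; it says nothing about stability across validation windows, which is a separate check.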

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

The paper's claims rest entirely on direct empirical comparisons of rollout rules (selected via validation) against default MIMO deployment, evaluated on held-out test data across 20 volatility series and multiple architectures. No mathematical derivation, first-principles result, or fitted-parameter renaming is presented that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation follows standard train-val-test splits and is externally falsifiable via replication on the same data splits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no explicit free parameters, axioms, or invented entities stated in the abstract; the central claim rests on standard assumptions about train-validation-test splits and model training procedures common to machine learning.

pith-pipeline@v0.9.1-grok · 5766 in / 1069 out tokens · 70562 ms · 2026-06-29T04:40:37.339089+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 3 internal anchors

  1. Torben G Andersen, Tim Bollerslev, Francis X Diebold, and Paul Labys. 2003. Modeling and forecasting realized volatility. Econometrica 71, 2 (2003), 579–625

  2. John M Bates and Clive WJ Granger. 1969. The combination of forecasts. Journal of the operational research society 20, 4 (1969), 451–468

  3. Itishree Behera, Pragyan Nanda, Soma Mitra, and Swapna Kumari. 2024. Machine learning approaches for forecasting financial market volatility. Machine learning approaches in financial analytics (2024), 431–451

  4. Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32

  5. Andrea Bucci. 2017. Forecasting realized volatility: a review. Journal of Advanced Studies in Finance (JASF) 8, 16 (2017), 94–138

  6. Kim Christensen, Mathias Siggaard, and Bezirgen Veliyev. 2023. A machine learning approach to volatility forecasting. Journal of Financial Econometrics 21, 5 (2023), 1680–1727

  7. Peter F Christoffersen and Francis X Diebold. 2000. How relevant is volatility forecasting for financial risk management? Review of Economics and Statistics 82, 1 (2000), 12–22

  8. Fabrizio Cipollini, Giulia Cruciani, Giampiero M Gallo, Alessandra Insana, Edoardo Otranto, and Fabio Spagnolo. 2026. VOLatility Archive for Realized Estimates (VOLARE). arXiv preprint arXiv:2602.19732 (2026)

  9. Thomas G Dietterich. 2000. Ensemble methods in machine learning. In International workshop on multiple classifier systems. Springer, 1–15

  10. Jeff Fleming, Chris Kirby, and Barbara Ostdiek. 2001. The economic value of volatility timing. The Journal of Finance 56, 1 (2001), 329–352

  11. Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning. PMLR, 1050–1059

  12. Sofia Giantsidi and Tarantola Claudia. 2025. Deep learning for financial forecasting: A review of recent advancements. Available at SSRN 5263710 (2025)

  13. Riku Green, Zahraa S Abdallah, et al. 2026. Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty. arXiv preprint arXiv:2606.04342 (2026)

  14. Riku Green, Zahraa S Abdallah, et al. 2026. Exposure Bias as Epistemic Underidentification in Recursive Forecasting. arXiv preprint arXiv:2606.12990 (2026)

  15. Riku Green, Huw Day, Zahraa S Abdallah, et al. 2025. Epistemic Error Decomposition for Multi-step Time Series Forecasting: Rethinking Bias-Variance in Recursive and Direct Strategies. arXiv preprint arXiv:2511.11461 (2025)

  16. Riku Green, Grant Stevens, Zahraa Abdallah, et al. 2024. Time-series classification for dynamic strategies in multi-step forecasting. arXiv preprint arXiv:2402.08373 (2024)

  17. Riku Green, Grant Stevens, Zahraa S Abdallah, and Telmo M Silva Filho. 2025. Stratify: unifying multi-step forecasting strategies. Data Mining and Knowledge Discovery 39, 5 (2025), 64

  18. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780

  19. Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. The M4 Competition: Results, findings, conclusion and way forward. International Journal of forecasting 34, 4 (2018), 802–808

  20. Ricardo P Masini, Marcelo C Medeiros, and Eduardo F Mendes. 2023. Machine learning advances for time series forecasting. Journal of economic surveys 37, 1 (2023), 76–111

  21. Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022)

  22. Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019)

  23. Mary E Thomson, Andrew C Pollock, Dilek Önkal, and M Sinan Gönül. 2019. Combining forecasts: Performance and coherence. International Journal of Forecasting 35, 2 (2019), 474–484

  24. Xiaoqian Wang, Rob J Hyndman, Feng Li, and Yanfei Kang. 2023. Forecast combinations: An over 50-year review. International Journal of Forecasting 39, 4 (2023), 1518–1547

  25. Helmut Wasserbacher and Martin Spindler. 2022. Machine learning for financial forecasting, planning and analysis: recent developments and pitfalls. Digital Finance 4, 1 (2022), 63–88

  26. Danny Wood, Tingting Mu, Andrew M Webb, Henry WJ Reeve, Mikel Lujan, and Gavin Brown. 2023. A unified theory of diversity in ensemble learning. Journal of machine learning research 24, 359 (2023), 1–49

  27. Hao Wu and David Levinson. 2021. The ensemble approach to forecasting: A review and synthesis. Transportation Research Part C: Emerging Technologies 132 (2021), 103357

A Additional Methodological Context

This appendix adds a few extra notes on the forecasting target, the evaluation losses, the baseline comparison, and the rollout construction. These detail...