Deployment-Side Adaptiveness in Multi-Horizon Volatility Forecasting

Riku Green; Telmo M Silva Filho; Zahraa S. Abdallah

arxiv: 2606.27688 · v1 · pith:U73TWRGXnew · submitted 2026-06-26 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-29 04:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multi-horizon forecasting · volatility prediction · MIMO forecasters · inference-time rollout · deployment policies · stock volatility series · MSE and QLIKE · financial time series

The pith

Varying the inference-time rollout rule for a trained MIMO volatility forecaster often improves accuracy over standard deployment, and validation can select low-cost policies that outperform the default.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a trained multi-output forecaster for multi-horizon volatility does not define a single predictor: different rollout rules at inference time create a family of forecasts with distinct accuracy and cost profiles. Across 20 stock-volatility series, three horizons, and models from linear to PatchTST, non-default rules frequently beat standard MIMO deployment, though the best fixed rule shifts with architecture and horizon. Validation-based selection of single rules or small subsets yields low-cost gains over the default and recovers much of the benefit of larger ensembles, but rankings shift when switching from MSE to QLIKE. A reader would care because this decouples training from deployment, turning inference choices into a source of adaptiveness without retraining.

Core claim

By changing the inference-time rollout rule, the same trained MIMO forecaster induces a family of forecasts; validation-selected singletons improve over default MIMO at low cost, while small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost. Non-default rollout rules often improve over standard MIMO across the series, yet policy rankings are metric-sensitive and do not transfer uniformly from MSE to QLIKE.
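The selection step in the core claim can be illustrated with a minimal sketch. It assumes each rollout rule has already produced forecasts on a validation window, and it uses a plain top-k equal-weight average as the "small subset" policy; that subset construction, the rule names, and the toy numbers are illustrative assumptions, not the paper's procedure or data.

```python
from statistics import fmean

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return fmean((p - y) ** 2 for p, y in zip(pred, target))

def select_singleton(val_preds, val_target):
    """Return the rule name with the lowest validation MSE."""
    return min(val_preds, key=lambda r: mse(val_preds[r], val_target))

def select_subset(val_preds, val_target, k):
    """Assumed subset policy: the k rules with the lowest individual validation MSE."""
    ranked = sorted(val_preds, key=lambda r: mse(val_preds[r], val_target))
    return ranked[:k]

def average(preds_by_rule, rules):
    """Equal-weight average of the forecasts produced by the given rules."""
    return [fmean(vals) for vals in zip(*(preds_by_rule[r] for r in rules))]

# Hypothetical validation forecasts for three rollout rules (toy numbers).
val_target = [1.0, 2.0, 3.0]
val_preds = {
    "mimo": [1.5, 2.5, 3.5],  # default one-shot deployment
    "s=1": [0.8, 2.1, 3.3],   # fully recursive
    "s=2": [1.2, 1.9, 2.8],   # block-recursive
}

best_rule = select_singleton(val_preds, val_target)
subset = select_subset(val_preds, val_target, k=2)
subset_mse = mse(average(val_preds, subset), val_target)
```

In this toy setup the two-rule average has lower validation MSE than the best single rule, but at roughly twice the inference cost, which is the trade-off the paper's accuracy-cost comparison is about.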

What carries the argument

The family of forecasts induced by different inference-time rollout rules on a trained multi-output (MIMO) model; it turns one trained network into multiple deployable predictors that can be chosen by validation.
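As a rough illustration of how one trained model yields several predictors, the sketch below rolls a fixed forecaster forward with block size s, following the description in the Figure 1 caption. The sliding-window fusion rule and the `toy_forecaster` stand-in for the trained model are assumptions made for the example; the paper's actual state construction may differ.

```python
from typing import Callable, List, Sequence

# A forecaster maps a lookback window to H outputs (one per horizon step).
Forecaster = Callable[[Sequence[float]], List[float]]

def block_rollout(f: Forecaster, window: Sequence[float],
                  horizon: int, s: int) -> List[float]:
    """Produce `horizon` forecasts, committing the first `s` outputs per call."""
    if not 1 <= s <= horizon:
        raise ValueError("block size s must satisfy 1 <= s <= horizon")
    state = list(window)
    lookback = len(state)
    preds: List[float] = []
    while len(preds) < horizon:
        outputs = f(state)
        take = min(s, horizon - len(preds))
        block = outputs[:take]
        preds.extend(block)
        # Assumed fusion rule: append the committed forecasts and keep
        # only the most recent `lookback` values as the next state.
        state = (state + block)[-lookback:]
    return preds

def toy_forecaster(window: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained model with H = 5 outputs:
    forecasts decay from the last value toward the window mean."""
    mean = sum(window) / len(window)
    last = window[-1]
    return [mean + (last - mean) * 0.5 ** (k + 1) for k in range(5)]

window = [1.0, 2.0, 3.0, 4.0]
H = 5
# One forecast path per block size; s = H is default MIMO, s = 1 is fully recursive.
family = {s: block_rollout(toy_forecaster, window, horizon=H, s=s)
          for s in range(1, H + 1)}
# Number of model calls per rule: ceil(H / s).
calls = {s: -(-H // s) for s in range(1, H + 1)}
```

Every rule agrees on the first step, since the first call sees the same window; the paths diverge afterwards, and smaller block sizes cost more model calls.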

If this is right

Non-default rollout rules often improve over standard MIMO deployment across the 20 volatility series, although no single fixed rule wins everywhere.
Validation-selected singletons deliver low-cost accuracy gains over the default.
Small rule subsets recover much of the benefit of larger ensembles while cutting inference cost.
Optimal policies change when the loss switches from MSE to QLIKE.
Volatility forecasters need evaluation on both architecture and deployment policy.
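The metric sensitivity in the list above is easy to reproduce on toy numbers. The sketch assumes one common form of QLIKE, y/h − log(y/h) − 1 for realized variance y and forecast h; the page does not state the exact parameterization the paper uses, and the two rules and their values are invented for illustration.

```python
import math
from statistics import fmean

def mse(forecast, realized):
    """Mean squared error."""
    return fmean((h - y) ** 2 for h, y in zip(forecast, realized))

def qlike(forecast, realized):
    """One common QLIKE form (assumed here): y/h - log(y/h) - 1.
    Requires strictly positive forecasts and realized values."""
    return fmean(y / h - math.log(y / h) - 1.0
                 for h, y in zip(forecast, realized))

# Hypothetical realized variances and two hypothetical rules.
realized = [1.0, 1.0, 4.0]
rule_a = [0.5, 0.5, 4.0]  # under-predicts on the calm days
rule_b = [1.0, 1.0, 3.0]  # misses the high-volatility day

mse_winner = "a" if mse(rule_a, realized) < mse(rule_b, realized) else "b"
qlike_winner = "a" if qlike(rule_a, realized) < qlike(rule_b, realized) else "b"
```

Here rule A wins on MSE and rule B wins on QLIKE, because QLIKE depends on the ratio of realized to forecast variance and penalizes under-prediction heavily, while MSE weights absolute errors.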

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rollout-family idea could be tested on multi-horizon tasks outside finance, such as energy demand or traffic.
Dynamic selection of rules based on recent market regime might further reduce the gap to full ensembles.
The approach raises the question of whether deployment adaptiveness appears in other multi-output time-series models beyond volatility.
Live deployment on streaming market data would test whether the validation gains survive distribution shifts.

Load-bearing premise

The validation set used to select the rollout policy is representative of future unseen data, and observed performance differences arise from the rules themselves rather than from overfitting to the validation period.

What would settle it

Check whether a rollout policy chosen on the validation set still outperforms the default MIMO rule on a later test window that was never seen during policy selection.
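A minimal sketch of that check, repeated over rolling origins, is given below. It assumes per-step losses for each rule are available in chronological order; the window lengths, rule names, and loss values are hypothetical and chosen so that the selected rule stops working after a regime change.

```python
from statistics import fmean

def rolling_check(losses, default, val_len, test_len):
    """For each rolling origin, pick the rule with the lowest mean loss on the
    validation window, then measure its gain over `default` on the next,
    unseen test window. Returns a list of (chosen_rule, test_gain)."""
    n = len(losses[default])
    outcomes = []
    start = 0
    while start + val_len + test_len <= n:
        val = slice(start, start + val_len)
        test = slice(start + val_len, start + val_len + test_len)
        chosen = min(losses, key=lambda r: fmean(losses[r][val]))
        gain = fmean(losses[default][test]) - fmean(losses[chosen][test])
        outcomes.append((chosen, gain))
        start += test_len  # advance the origin by one test window
    return outcomes

# Hypothetical per-step losses: the recursive rule is better early on,
# then worse after a regime change at step 6.
losses = {
    "mimo": [1.0] * 12,
    "s=1": [0.5] * 6 + [1.6] * 6,
}
outcomes = rolling_check(losses, default="mimo", val_len=4, test_len=2)
win_rate = fmean(1.0 if gain > 0 else 0.0 for _, gain in outcomes)
```

A positive gain means the validation-selected rule beat the default on data it never saw. In the toy data the second window shows the failure mode the load-bearing premise worries about: the rule selected on a calm validation period loses once the regime shifts.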

Figures

Figures reproduced from arXiv: 2606.27688 by Riku Green, Telmo M Silva Filho, Zahraa S. Abdallah.

**Figure 1.** A trained 𝐻-output MIMO forecaster can be redeployed with smaller block size 𝑠 ≤ 𝐻. Top: default MIMO deployment uses all 𝐻 = 5 outputs in one shot. Middle: block-recursive deployment with 𝑠 = 2 commits the first two outputs from each call; these committed predictions are fused with the current state to form the next rolled state, which is then reused by the same forecasting function 𝑓𝜃. Bottom: fully rec…

**Figure 2.** Rule-level win rates against default MIMO across horizons. For each horizon, cells show the fraction of tasks on which…

**Figure 3.** Singleton deployment rules expose a heterogeneous accuracy–cost landscape. Each point is a fixed non-MIMO singleton…

**Figure 4.** Operational deployment policies improve the MSE–…

**Figure 5.** Transfer of MSE-selected deployment policies to…

**Figure 6.** Transfer of MSE wins to QLIKE by deployment…

Original abstract

In financial forecasting, predictive performance depends not only on which model is trained, but also on how the trained model is deployed. We study this issue in multi-horizon volatility forecasting. Our starting point is that a trained multi-output (MIMO) forecaster does not define a single deployable predictor: by changing the inference-time rollout rule, the same trained model induces a family of forecasts with different accuracy and cost profiles. Across 20 stock-volatility series, three forecast horizons, and architectures ranging from linear models to PatchTST, we find that non-default rollout rules often improve over standard MIMO deployment. However, the best fixed rule varies substantially across architectures and horizons, making any single static replacement unreliable. We therefore evaluate validation-based deployment policies over the induced rule family. Under the primary MSE objective, validation-selected singletons provide a low-cost improvement over default MIMO, while small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost. We also find that policy rankings are metric-sensitive: MSE-selected policies do not transfer uniformly to QLIKE, a finance-standard volatility loss. These results show that inference-time deployment is a meaningful source of adaptiveness in financial forecasting, and that trained volatility forecasters should be evaluated not only by their architecture, but also by their deployment policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows rollout rules are a real lever for MIMO volatility models and validation selection beats default at low cost, but non-stationarity risks make the gains look split-dependent.

The main takeaway is that for trained multi-output volatility forecasters, changing the inference rollout rule creates a family of predictors with different accuracy-cost profiles, and validation can pick useful ones from that family. Across 20 series, three horizons, and models from linear to PatchTST, non-default rules often beat standard MIMO, though the best rule shifts with architecture and horizon. Validation singletons give a cheap win, small subsets capture most ensemble benefit, and MSE policies do not transfer cleanly to QLIKE.

The work is direct: it runs held-out comparisons, keeps the circularity burden low, and surfaces the metric sensitivity clearly. That part is useful for anyone running these models in practice.

The soft spot is the validation window. Volatility series shift regimes often, so a single fixed validation period can select rules that exploit transient features rather than stable rollout advantages. The abstract gives no rolling-window checks, distribution-shift tests, or error bars, which leaves the reported improvements open to the split-specific explanation. Without those, the central claim rests on moderate rather than strong evidence.

This is for practitioners who already train MIMO volatility models and want to squeeze more out of deployment without retraining. A reader focused on finance forecasting will find the cost-accuracy tradeoffs and metric mismatch worth seeing.

It deserves peer review. The experiments are concrete and the deployment angle is under-discussed, even if the robustness checks need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that a trained MIMO forecaster in multi-horizon volatility prediction induces a family of deployable predictors via different inference-time rollout rules; across 20 stock-volatility series, three horizons, and architectures from linear models to PatchTST, non-default rules often outperform standard MIMO, validation-selected singletons yield low-cost gains over default MIMO, and small rule subsets recover much of the benefit of larger ensembles at lower inference cost, though policy rankings are sensitive to the loss (MSE vs. QLIKE).

Significance. If the central empirical patterns hold after addressing validation representativeness, the work establishes deployment policy as a distinct, low-cost source of adaptiveness in financial forecasting that complements model architecture choices. The breadth of the evaluation (20 series, multiple horizons and architectures, held-out test comparisons) and the efficiency finding on small rule subsets are strengths that would make the result practically relevant for volatility model deployment.

major comments (2)

[Abstract and validation-based deployment policies section] The headline result on validation-selected singletons and rule subsets improving over default MIMO (abstract and results) rests on the assumption that performance differences are driven by the rollout rules and that the validation window is representative of future data. In non-stationary volatility series, a single fixed validation period risks selecting rules that exploit transient regime characteristics; the manuscript does not report multiple rolling validation windows, statistical tests for distribution shift between validation and test sets, or ablation on stability of rule rankings across windows.
[Empirical evaluation sections] The reported consistent empirical patterns across series, horizons, and architectures lack accompanying error bars, statistical significance tests, or exact details on data splits and train/validation/test partitioning (abstract and empirical results). This leaves moderate support for the claim that observed gains arise from the rollout rules rather than sampling variability or unaccounted data characteristics.

minor comments (2)

Notation for the family of rollout rules and the induced predictors could be introduced more explicitly with a small table or diagram to aid readability.
The manuscript would benefit from a brief discussion of how the 20 series were selected and any preprocessing steps for the volatility targets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important considerations for robustness in non-stationary financial time series. We address each major comment below and indicate planned revisions.

Referee: [Abstract and validation-based deployment policies section] The headline result on validation-selected singletons and rule subsets improving over default MIMO rests on the assumption that performance differences are driven by the rollout rules and that the validation window is representative of future data. In non-stationary volatility series, a single fixed validation period risks selecting rules that exploit transient regime characteristics; the manuscript does not report multiple rolling validation windows, statistical tests for distribution shift between validation and test sets, or ablation on stability of rule rankings across windows.

Authors: We agree that non-stationarity poses a risk for validation-based selection and that additional robustness checks would strengthen the claims. Our current setup uses a single validation window immediately preceding the test period, following standard practice in financial forecasting to reflect recent regimes. However, we will add an ablation using multiple rolling validation windows, report stability of rule rankings across them, and include Kolmogorov-Smirnov tests for distribution shift between validation and test sets. These will be presented in a new subsection on validation sensitivity. revision: yes
Referee: [Empirical evaluation sections] The reported consistent empirical patterns across series, horizons, and architectures lack accompanying error bars, statistical significance tests, or exact details on data splits and train/validation/test partitioning. This leaves moderate support for the claim that observed gains arise from the rollout rules rather than sampling variability or unaccounted data characteristics.

Authors: We acknowledge that the current presentation would benefit from greater statistical rigor. The manuscript already specifies the 20 series, horizons, and partitioning (70/15/15 train/validation/test split with chronological ordering), but we will expand this description with exact dates and add error bars (via bootstrap resampling over series) plus paired t-tests or Wilcoxon tests for significance of gains over default MIMO. These additions will appear in the empirical evaluation and appendix. revision: yes
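The error bars proposed in this simulated response could look something like the sketch below: a percentile bootstrap over series for the mean gain of a deployment policy over default MIMO. The per-series gains are invented for illustration, and the resampling scheme is an assumption about what "bootstrap resampling over series" would mean in practice.

```python
import random
from statistics import fmean

def bootstrap_ci(gains, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-series gain.
    Resamples series with replacement, so each series is one unit."""
    rng = random.Random(seed)
    n = len(gains)
    means = sorted(fmean(rng.choices(gains, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-series gains: default MIMO loss minus policy loss,
# so positive values favor the selected policy.
gains = [0.02, 0.01, -0.005, 0.03, 0.015, 0.0, 0.025, 0.01]
lo, hi = bootstrap_ci(gains)
mean_gain = fmean(gains)
```

If the interval excludes zero, the average gain is unlikely to be explained by which series happened to be sampled. It does not address dependence across time within a series, which is the separate concern raised in the first major comment.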

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

The paper's claims rest entirely on direct empirical comparisons of rollout rules (selected via validation) against default MIMO deployment, evaluated on held-out test data across 20 volatility series and multiple architectures. No mathematical derivation, first-principles result, or fitted-parameter renaming is presented that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation follows standard train-val-test splits and is externally falsifiable via replication on the same data splits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no explicit free parameters, axioms, or invented entities stated in the abstract; the central claim rests on standard assumptions about train-validation-test splits and model training procedures common to machine learning.

pith-pipeline@v0.9.1-grok · 5766 in / 1069 out tokens · 70562 ms · 2026-06-29T04:40:37.339089+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 3 internal anchors

[1] Torben G Andersen, Tim Bollerslev, Francis X Diebold, and Paul Labys. 2003. Modeling and forecasting realized volatility. Econometrica 71, 2 (2003), 579–625.
[2] John M Bates and Clive WJ Granger. 1969. The combination of forecasts. Journal of the Operational Research Society 20, 4 (1969), 451–468.
[3] Itishree Behera, Pragyan Nanda, Soma Mitra, and Swapna Kumari. 2024. Machine learning approaches for forecasting financial market volatility. Machine learning approaches in financial analytics (2024), 431–451.
[4] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[5] Andrea Bucci. 2017. Forecasting realized volatility: a review. Journal of Advanced Studies in Finance (JASF) 8, 16 (2017), 94–138.
[6] Kim Christensen, Mathias Siggaard, and Bezirgen Veliyev. 2023. A machine learning approach to volatility forecasting. Journal of Financial Econometrics 21, 5 (2023), 1680–1727.
[7] Peter F Christoffersen and Francis X Diebold. 2000. How relevant is volatility forecasting for financial risk management? Review of Economics and Statistics 82, 1 (2000), 12–22.
[8] Fabrizio Cipollini, Giulia Cruciani, Giampiero M Gallo, Alessandra Insana, Edoardo Otranto, and Fabio Spagnolo. 2026. VOLatility Archive for Realized Estimates (VOLARE). arXiv preprint arXiv:2602.19732 (2026).
[9] Thomas G Dietterich. 2000. Ensemble methods in machine learning. In International workshop on multiple classifier systems. Springer, 1–15.
[10] Jeff Fleming, Chris Kirby, and Barbara Ostdiek. 2001. The economic value of volatility timing. The Journal of Finance 56, 1 (2001), 329–352.
[11] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning. PMLR, 1050–1059.
[12] Sofia Giantsidi and Tarantola Claudia. 2025. Deep learning for financial forecasting: A review of recent advancements. Available at SSRN 5263710 (2025).
[13] Riku Green, Zahraa S Abdallah, et al. 2026. Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty. arXiv preprint arXiv:2606.04342 (2026).
[14] Riku Green, Zahraa S Abdallah, et al. 2026. Exposure Bias as Epistemic Underidentification in Recursive Forecasting. arXiv preprint arXiv:2606.12990 (2026).
[15] Riku Green, Huw Day, Zahraa S Abdallah, et al. 2025. Epistemic Error Decomposition for Multi-step Time Series Forecasting: Rethinking Bias-Variance in Recursive and Direct Strategies. arXiv preprint arXiv:2511.11461 (2025).
[16] Riku Green, Grant Stevens, Zahraa Abdallah, et al. 2024. Time-series classification for dynamic strategies in multi-step forecasting. arXiv preprint arXiv:2402.08373 (2024).
[17] Riku Green, Grant Stevens, Zahraa S Abdallah, and Telmo M Silva Filho. 2025. Stratify: unifying multi-step forecasting strategies. Data Mining and Knowledge Discovery 39, 5 (2025), 64.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[19] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting 34, 4 (2018), 802–808.
[20] Ricardo P Masini, Marcelo C Medeiros, and Eduardo F Mendes. 2023. Machine learning advances for time series forecasting. Journal of Economic Surveys 37, 1 (2023), 76–111.
[21] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022).
[22] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019).
[23] Mary E Thomson, Andrew C Pollock, Dilek Önkal, and M Sinan Gönül. 2019. Combining forecasts: Performance and coherence. International Journal of Forecasting 35, 2 (2019), 474–484.
[24] Xiaoqian Wang, Rob J Hyndman, Feng Li, and Yanfei Kang. 2023. Forecast combinations: An over 50-year review. International Journal of Forecasting 39, 4 (2023), 1518–1547.
[25] Helmut Wasserbacher and Martin Spindler. 2022. Machine learning for financial forecasting, planning and analysis: recent developments and pitfalls. Digital Finance 4, 1 (2022), 63–88.
[26] Danny Wood, Tingting Mu, Andrew M Webb, Henry WJ Reeve, Mikel Lujan, and Gavin Brown. 2023. A unified theory of diversity in ensemble learning. Journal of Machine Learning Research 24, 359 (2023), 1–49.
[27] Hao Wu and David Levinson. 2021. The ensemble approach to forecasting: A review and synthesis. Transportation Research Part C: Emerging Technologies 132 (2021), 103357.
