Benchmarking Deep Time Series Models for Equity Portfolios

Aoxin Zhang; Kwanting Leung; Yuhan Cheng

arxiv: 2606.09420 · v1 · pith:MAXC2WF3new · submitted 2026-06-08 · 🧮 math.OC · q-fin.PM


Pith reviewed 2026-06-27 15:37 UTC · model grok-4.3

classification 🧮 math.OC q-fin.PM

keywords time series forecasting · equity portfolios · benchmarking · deep learning · stochastic multi-criteria analysis · portfolio optimization · transaction costs · forecasting architectures

The pith

No single time-series architecture dominates daily equity portfolio benchmarks after costs and constraints apply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a CRSP daily-stock benchmark covering 15 deep and statistical time-series models from 2018 to 2024. It evaluates models through common-window decile portfolios followed by stochastic multi-criteria acceptability analysis and a constrained quadratic portfolio layer that enforces capacity, beta, industry, risk, leverage, and turnover limits. An entropic acceptability index derived from the SMAA prior downweights models that produce high portfolio regret. Results show no model exceeds a 0.36 rank-1 acceptability score, with TransEnc-8 highest at 0.352, while rankings shift across preferences, market states, features, and transaction costs. Constrained portfolios produce negative net Sharpe ratios at 20 basis points for every promoted model.
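To make the SMAA step concrete, here is a minimal Monte Carlo sketch of rank acceptability. It is not the paper's implementation: it assumes an additive utility, weights drawn uniformly from the simplex, and criteria already scaled so that higher is better. The function names and the toy score matrix are hypothetical.

```python
import random

def sample_simplex(k, rng):
    """Draw a weight vector uniformly from the (k-1)-simplex."""
    # Normalized i.i.d. exponentials follow Dirichlet(1, ..., 1), which is uniform on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def rank_acceptability(scores, n_draws=20000, seed=0):
    """scores[m][j] is criterion j for model m (higher is better).

    Returns acc[m][r]: the share of weight draws in which model m takes rank r (0 = best).
    """
    rng = random.Random(seed)
    n_models, n_crit = len(scores), len(scores[0])
    counts = [[0] * n_models for _ in range(n_models)]
    for _ in range(n_draws):
        w = sample_simplex(n_crit, rng)
        utility = [sum(wj * sj for wj, sj in zip(w, row)) for row in scores]
        order = sorted(range(n_models), key=lambda m: -utility[m])
        for rank, m in enumerate(order):
            counts[m][rank] += 1
    return [[c / n_draws for c in row] for row in counts]

# Hypothetical, made-up scores for three models on three criteria.
toy_scores = [
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.6],
    [0.3, 0.3, 0.3],
]
acc = rank_acceptability(toy_scores)
rank1 = [row[0] for row in acc]  # rank-1 acceptability per model
```

In this toy matrix the third model is worse than the second on every criterion, so no nonnegative weighting can place it first and its rank-1 acceptability is exactly zero. The other two models split rank 1 according to how much of the weight simplex favors each.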

Core claim

What carries the argument

The deployment-adjusted acceptability index as an entropic update from the SMAA prior, applied to decile portfolios before constrained quadratic optimization with capacity, beta, industry, risk, leverage, and turnover controls.
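The entropic update can be illustrated with a short sketch. It assumes the adjusted index takes the generic exponential-tilting form q_i ∝ p_i · exp(−c · R_i), where p is the SMAA rank-1 acceptability vector, R_i is a portfolio regret, and c is a regret-discount multiplier. The paper's exact functional form, temperature, and normalization are not given in the material on this page, so the functions and numbers below are illustrative assumptions. Under this form, q minimizes c·E_q[R] + KL(q‖p) over the simplex, which is the standard sense in which a Gibbs distribution is an entropic update of a prior.

```python
import math

def gibbs_update(prior, regret, c=1.0):
    """Exponentially tilt a prior by regret: q_i proportional to p_i * exp(-c * R_i)."""
    tilted = [p * math.exp(-c * r) for p, r in zip(prior, regret)]
    total = sum(tilted)
    return [t / total for t in tilted]

def entropic_objective(q, prior, regret, c=1.0):
    """c * E_q[regret] + KL(q || prior); the Gibbs update minimizes this on the simplex."""
    expected_regret = sum(qi * ri for qi, ri in zip(q, regret))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return c * expected_regret + kl

# Hypothetical numbers, not taken from the paper.
prior = [0.5, 0.3, 0.2]    # stand-in for SMAA rank-1 acceptabilities
regret = [0.4, 0.0, 0.1]   # stand-in for portfolio regret per model
posterior = gibbs_update(prior, regret, c=2.0)
```

With these made-up inputs the first model has the largest prior but also the largest regret, so the tilt moves the top position to the second model. Setting c = 0 returns the prior unchanged.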

If this is right

Model rankings change with investor preferences, market state, feature universe, and transaction costs.
TransEnc-8 is selected in the five-model constrained-portfolio comparison under the full protocol.
Raw return-oriented rankings can instead favor TS-RIDGE.
Broad-universe decile signals survive costs in some configurations.
Net Sharpe ratios after 20 bps costs are negative for all models in the baseline constrained QP.
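The effect of a cost level such as 20 bps on a Sharpe ratio can be sketched as follows. This is a minimal illustration, assuming a linear one-way cost charged on daily turnover, returns already in excess of the risk-free rate, and 252 trading days per year. The paper's exact cost and turnover conventions are not stated on this page, and the return series below is made up.

```python
import math
import statistics

def net_returns(gross, turnover, cost_bps):
    """Subtract a linear cost (in basis points of traded notional) from gross returns."""
    rate = cost_bps / 10_000.0
    return [g - rate * t for g, t in zip(gross, turnover)]

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a per-period excess-return series."""
    return math.sqrt(periods_per_year) * statistics.mean(returns) / statistics.stdev(returns)

def breakeven_bps(gross, turnover):
    """Cost level at which the mean net return is zero under the linear cost model."""
    return 10_000.0 * statistics.mean(gross) / statistics.mean(turnover)

# Hypothetical daily gross returns and a constant 50% daily turnover.
gross = [0.0010, -0.0005, 0.0012, 0.0003, -0.0002, 0.0008]
turnover = [0.5] * len(gross)

sharpe_gross = annualized_sharpe(net_returns(gross, turnover, 0))
sharpe_net20 = annualized_sharpe(net_returns(gross, turnover, 20))
cost_limit = breakeven_bps(gross, turnover)
```

In this toy series the breakeven cost sits between 8 and 9 bps, so the same signal has a positive gross Sharpe ratio and a negative net Sharpe ratio at 20 bps. High-turnover portfolios show the same sign flip whenever the cost level exceeds their breakeven.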

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model selection for portfolios should incorporate full construction pipelines rather than isolated accuracy metrics.
The entropic index could be tested on other multi-criteria problems such as credit or macro forecasting.
Extending the protocol to intraday data or international equities would reveal whether the no-dominance result holds outside daily U.S. stocks.

Load-bearing premise

The protocol of common-window decile portfolios, SMAA, entropic acceptability index, and constrained QP with the listed controls is sufficient to determine which models remain usable after preferences, costs, and constraints are imposed.
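For orientation, a generic constrained portfolio problem with the six listed controls could be written as below. This is a textbook-style sketch, not the paper's formulation. Every symbol is an assumption introduced here: forecast vector $\hat{\mu}$, covariance $\Sigma$, risk aversion $\gamma$, per-name capacity bounds $u_i$, beta vector $\beta$, industry indicator matrix $S$, previous weights $w_{\text{prev}}$, and the limits $\bar{b}$, $\bar{s}$, $\bar{\sigma}$, $\bar{L}$, $\bar{\tau}$.

```latex
\begin{aligned}
\max_{w \in \mathbb{R}^n} \quad & \hat{\mu}^\top w - \frac{\gamma}{2}\, w^\top \Sigma\, w \\
\text{s.t.} \quad & |w_i| \le u_i \quad (i = 1, \dots, n) && \text{(capacity)} \\
& |\beta^\top w| \le \bar{b} && \text{(beta)} \\
& |S^\top w| \le \bar{s}\,\mathbf{1} && \text{(industry)} \\
& w^\top \Sigma\, w \le \bar{\sigma}^2 && \text{(risk)} \\
& \|w\|_1 \le \bar{L} && \text{(leverage)} \\
& \|w - w_{\text{prev}}\|_1 \le \bar{\tau} && \text{(turnover)}
\end{aligned}
```

With an explicit variance cap the problem is a quadratically constrained program. Keeping risk only as the penalty term in the objective leaves a standard QP with linear constraints, because the absolute-value and $\ell_1$ constraints can be linearized with auxiliary variables. Which of these variants the paper uses is not stated in the material above.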

What would settle it

Finding one architecture that produces positive net Sharpe ratios in the constrained QP across multiple preference weightings and cost levels would show dominance where the paper reports none.

Figures

Figures reproduced from arXiv: 2606.09420 by Aoxin Zhang, Kwanting Leung, Yuhan Cheng.

**Figure 2.** SMAA rank acceptability over 15 models and 15 possible ranks.

**Figure 3.** SMAA top-3 acceptability probabilities. Top-3 acceptability concentrates around the lower-turnover transformer-encoder and recurrent configurations.

**Figure 4.** Rank-1 SMAA acceptability bands for the leading models.

**Figure 5.** Central SMAA weight vectors for selected leading models.

**Figure 6.** SMAA preference-space geometry and turnover-weight acceptability frontier.

**Figure 7.** Ranking sensitivity across transaction cost levels.

**Figure 8.** SMAA rank acceptability by market-volatility state.

**Figure 9.** Robust net Sharpe under ellipsoidal score ambiguity.

**Figure 10.** Net Sharpe ratios under wider constrained-portfolio designs.

**Figure 11.** Predict-then-optimize rank association between raw SMAA expected ranks and constrained-QP ranks across transaction cost levels.

**Figure 12.** Deployment regret in net Sharpe units under constrained-portfolio designs.

**Figure 13.** Raw 15-model decile SMAA, restricted to promoted models, and optimized five-model portfolio SMAA.

**Figure 14.** Raw and deployment-adjusted rank-1 acceptability for promoted models.

**Figure 15.** TS-RIDGE regularization path across nested feature universes.

**Figure 16.** Average absolute TS-RIDGE coefficient by feature.

**Figure 17.** Feature-block ablation heatmap for the five promoted models under the common forecast protocol. The feature-block ablation reinforces the same mechanism. Because F2 is F3 without the size, beta, and market-relative block, removing the price block weakens the nonlinear price-path channel and removing activity variables tests whether trading intensity carries the ranking. Restoring the full structured stack…

**Figure 18.** Average daily cross-sectional Spearman rank correlations among retained model prediction signals. TS-RIDGE supplies the stable full-signal linear anchor, and LSTM supplies the price-sensitive nonlinear contrast.

**Figure 19.** Promoted-set pairwise rolling-combination net Sharpe matrix. Adjacent text from Section 9 (Deployment Boundaries): daily stock sorting draws strength from universe breadth, so restricting capacity changes the ranking.

**Figure 20.** Capacity frontier for promoted models.

**Figure 21.** Breakeven transaction-cost diagnostics for promoted models. Curves report net Sharpe under one-way cost schedules for the broad decile layer and the baseline constrained-QP layer; vertical reference lines mark 0.8, 20, and 50 bps.

**Figure 22.** Design sensitivity and aggregate prediction boundary. The aggregate boundary test is sharper than the state comparison. None of the market-plus-signal specifications produces positive aggregate out-of-sample R², so strong stock-level sorting does not automatically become market-timing ability. Adjacent forecasting targets impose different constraints once the target, aggregation level, and portfolio rul…

**Figure 23.** F3 neural-model decile-return profiles used to audit the negative RankIC and positive long–short Sharpe cases.

**Figure 24.** F3 cumulative long–short returns under the same score direction as the RankIC calculation.

**Figure 25.** SMAA rank-1 Monte Carlo convergence for leading models.

**Figure 26.** Deployment-adjusted rank-1 sensitivity across dimensionless regret-discount multipliers c.

**Figure 27.** Linear and square-root transaction-cost robustness for promoted common-window portfolios.

**Figure 28.** Net Sharpe ratios under daily and weekly rebalancing. Nonlinear models have lower turnover under both rules, so the weekly change is smaller for them. The ranking under weekly rebalancing is led by TS-RIDGE, TS-OLS, and TransEnc-8.

Original abstract

Benchmarking forecasting architectures for daily equity portfolios is not just a prediction exercise. It also asks which model remains usable after preferences, costs, and portfolio constraints are imposed. We build a CRSP daily-stock benchmark for 15 deep and statistical time-series architectures over 2018–2024. The protocol combines common-window decile portfolios, stochastic multi-criteria acceptability analysis, a deployment-adjusted acceptability index, and a constrained quadratic portfolio layer with capacity, beta, industry, risk, leverage, and turnover controls. The index starts from the SMAA rank-acceptability distribution and downweights models whose criteria-level wins produce high portfolio regret; its Gibbs form is characterized as an entropic update from the SMAA prior. Empirically, no architecture dominates the raw benchmark: TransEnc-8 has the largest rank-1 acceptability, 0.352, and no model exceeds about 0.36. Rankings vary with preferences, market state, feature universe, and transaction costs. In the promoted five-model constrained-portfolio comparison, TransEnc-8 is selected throughout, while return-oriented raw rankings can favor TS-RIDGE. Broad-universe decile signals can survive costs, but the baseline constrained-QP net Sharpe at 20 bps is negative for every promoted model. The benchmark supports model selection and diagnosis rather than a standalone trading-strategy claim.
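The decile-portfolio layer mentioned in the abstract can be sketched in a few lines. This is a hedged illustration, not the authors' code. It assumes that "common-window" means evaluating every model only on dates where all models have forecasts, that bins are equal-weighted, and that the signal is traded as top bin minus bottom bin. The helper names and toy data are hypothetical.

```python
def common_dates(forecasts_by_model):
    """Dates on which every model has a forecast, in sorted order.

    forecasts_by_model maps model name -> {date: payload}; it must be non-empty.
    """
    date_sets = [set(by_date) for by_date in forecasts_by_model.values()]
    return sorted(set.intersection(*date_sets))

def decile_long_short(scores, next_returns, n_bins=10):
    """Equal-weighted top-bin minus bottom-bin return for one date.

    scores and next_returns map stock id -> forecast score / realized next-period return.
    """
    # Sort by score, breaking ties by id so the result is deterministic.
    ids = sorted(set(scores) & set(next_returns), key=lambda k: (scores[k], k))
    size = len(ids) // n_bins
    if size == 0:
        raise ValueError("need at least n_bins stocks with both a score and a return")
    bottom, top = ids[:size], ids[-size:]

    def average(group):
        return sum(next_returns[k] for k in group) / len(group)

    return average(top) - average(bottom)

# Hypothetical toy inputs.
forecasts = {
    "model_a": {"2021-01-04": {}, "2021-01-05": {}},
    "model_b": {"2021-01-05": {}, "2021-01-06": {}},
}
shared = common_dates(forecasts)

toy_scores = {f"S{i:02d}": float(i) for i in range(20)}
toy_returns = {f"S{i:02d}": 0.001 * i for i in range(20)}
spread = decile_long_short(toy_scores, toy_returns)
```

Restricting every model to the shared dates keeps the comparison from rewarding a model for skipping hard periods. In the toy data the scores rank returns perfectly, so the top-minus-bottom spread is the largest the cross-section allows.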

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a new acceptability index for time-series models in constrained equity portfolios and shows that no architecture dominates on CRSP data, with negative net Sharpe ratios at 20 bps in the baseline constrained QP.

The punchline is that the work supplies a multi-criteria benchmark protocol for 15 architectures on 2018-2024 CRSP daily data and introduces a deployment-adjusted index that starts from SMAA rank-acceptability and applies a Gibbs-style entropic downweight for portfolio regret. No model exceeds roughly 0.36 rank-1 acceptability, TransEnc-8 leads at 0.352, and the constrained QP layer with beta, industry, leverage, and turnover limits selects it consistently while raw return rankings sometimes favor TS-RIDGE.

What stands out is the explicit combination of common-window deciles, stochastic acceptability analysis, the entropic index, and the full set of capacity and risk controls. The finding that rankings move with preferences, market state, feature set, and transaction costs is straightforward and useful. The negative net Sharpe result at 20 bps for every promoted model is presented plainly as an outcome rather than hidden.

The main soft spot is that the visible material leaves the exact QP solver details, data-split handling, and index sensitivity checks opaque, so the reported numbers could still shift under small changes in the constraint set or cost assumption. The protocol is elaborate, and the visible material does not fully stress-test whether the entropic update adds enough beyond standard SMAA to justify the extra machinery. No internal contradiction appears in the claims.

This is aimed at quant researchers and portfolio teams who need a structured way to compare forecasting models once real-world constraints and costs enter the picture. A reader who wants to see how raw forecast rankings translate (or fail to translate) into usable portfolios will find concrete numbers and a reusable protocol.

It is worth sending to peer review so the implementation details and robustness checks can be examined.

Referee Report

0 major / 3 minor

Summary. The paper benchmarks 15 deep and statistical time-series architectures for daily equity portfolio construction on CRSP data (2018–2024). It deploys a multi-stage protocol that constructs common-window decile portfolios, applies stochastic multi-criteria acceptability analysis (SMAA), defines a deployment-adjusted acceptability index via an entropic (Gibbs) update from the SMAA prior, and solves a constrained quadratic program incorporating capacity, beta, industry, risk, leverage, and turnover limits. The central empirical claims are that no architecture dominates (TransEnc-8 attains the highest rank-1 acceptability of 0.352; no model exceeds ~0.36), that rankings are sensitive to preferences, market state, feature universe, and transaction costs, that TransEnc-8 is selected in the five-model constrained comparison while return-oriented rankings can favor TS-RIDGE, and that broad-universe decile signals can survive costs yet the baseline constrained-QP net Sharpe at 20 bps remains negative for every promoted model.

Significance. If the protocol and implementation details hold, the work supplies a reproducible, preference-aware evaluation framework that moves beyond raw forecast accuracy to usability under realistic portfolio constraints. Credit is due for the use of standard CRSP data, explicit multi-criteria SMAA, the explicit entropic characterization of the acceptability index, and the full set of QP controls. The negative net-Sharpe finding and the observation that no model dominates are falsifiable, practically relevant results that caution against over-reliance on any single architecture.

minor comments (3)

Abstract and §4: the exact functional form of the entropic update (Gibbs distribution) from the SMAA rank-acceptability vector to the deployment-adjusted index should be written explicitly, including the temperature parameter and any normalization, so that the index can be reproduced from the reported acceptability numbers alone.
§3.3 and Table 2: the precise definition of the five-model constrained-QP comparison (which models are included, how the capacity and turnover limits are set, and whether the same random seeds are used across architectures) needs a dedicated paragraph or pseudocode block to eliminate ambiguity in the reported selection of TransEnc-8.
Figure 3 and §5.2: the market-state and transaction-cost sensitivity plots would benefit from error bars or bootstrap intervals on the acceptability values so that the claim “rankings vary” can be assessed for statistical significance rather than visual inspection alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful summary of the manuscript, the positive evaluation of its significance, and the recommendation of minor revision. No major comments appear in the report, so we provide no point-by-point rebuttals below. Any minor suggestions will be incorporated in the revised version.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external data and stated definitions

The paper is an empirical benchmarking exercise on CRSP data using common-window decile portfolios, SMAA rank-acceptability, a defined entropic deployment-adjusted index (explicitly characterized as an update from the SMAA prior), and constrained QP optimization with explicit controls. No derivation reduces by construction to fitted parameters or self-citations; the index form is stated rather than derived from the target results, and all performance numbers (e.g., rank-1 acceptability 0.352) are computed from external market data and standard portfolio layers. The protocol is presented as a composite method whose outputs are falsifiable against the data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available for review; no explicit free parameters, axioms, or invented entities are identifiable beyond standard domain assumptions of portfolio theory. The 20 bps transaction cost level and any internal weights inside the entropic index are not detailed.

pith-pipeline@v0.9.1-grok · 5771 in / 1447 out tokens · 26610 ms · 2026-06-27T15:37:17.927679+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 43 canonical work pages

[1]

SMAA-2: Stochastic multicriteria acceptability analysis for group decision making.Operations Research, 49(3):444–454, 2001

Risto Lahdelma and Pekka Salminen. SMAA-2: Stochastic multicriteria acceptability analysis for group decision making.Operations Research, 49(3):444–454, 2001. doi: 10.1287/opre.49.3. 444.11220. URL https://doi.org/10.1287/opre.49.3.444.11220

work page doi:10.1287/opre.49.3 2001
[2]

A survey on stochastic multicriteria acceptability analysis methods.Journal of Multi-Criteria Decision Analysis, 15(1–2):1–14, 2008

Tommi Tervonen and Jos´ e Rui Figueira. A survey on stochastic multicriteria acceptability analysis methods.Journal of Multi-Criteria Decision Analysis, 15(1–2):1–14, 2008. doi: 10.1 002/mcda.407. URL https://doi.org/10.1002/mcda.407

work page doi:10.1002/mcda.407 2008
[3]

Springer, 2 edition, 2016

Salvatore Greco, Matthias Ehrgott, and Jos´ e Rui Figueira, editors.Multiple Criteria Decision Analysis: State of the Art Surveys, volume 233 ofInternational Series in Operations Research & Management Science. Springer, 2 edition, 2016. doi: 10.1007/978-1-4939-3094-4. URL https://doi.org/10.1007/978-1-4939-3094-4

work page doi:10.1007/978-1-4939-3094-4 2016
[4]

IPSSIS: An inte- grated multicriteria decision support system for equity portfolio construction and selection.Eu- ropean Journal of Operational Research, 210(2):398–409, 2011

Panos Xidonas, George Mavrotas, Constantin Zopounidis, and John Psarras. IPSSIS: An inte- grated multicriteria decision support system for equity portfolio construction and selection.Eu- ropean Journal of Operational Research, 210(2):398–409, 2011. doi: 10.1016/j.ejor.2010.08.028. URL https://doi.org/10.1016/j.ejor.2010.08.028

work page doi:10.1016/j.ejor.2010.08.028 2011
[5]

predict, then optimize

Adam N. Elmachtoub and Paul Grigas. Smart “predict, then optimize”.Management Science, 68(1):9–26, 2022. doi: 10.1287/mnsc.2020.3922. URL https://doi.org/10.1287/mnsc.2020.3922

work page doi:10.1287/mnsc.2020.3922 2022
[6]

Cambridge Univer- sity Press, Cambridge, 2006

Nicolo Cesa-Bianchi and Gabor Lugosi.Prediction, Learning, and Games. Cambridge Univer- sity Press, Cambridge, 2006

2006
[7]

Robust portfolio selection problems.Mathematics of Operations Research, 28(1):1–38, 2003

Donald Goldfarb and Garud Iyengar. Robust portfolio selection problems.Mathematics of Operations Research, 28(1):1–38, 2003. doi: 10.1287/moor.28.1.1.14260. URL https://doi.org/ 10.1287/moor.28.1.1.14260

work page doi:10.1287/moor.28.1.1.14260 2003
[8]

Demirel, I., Celik, A

Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems.Operations Research, 58(3):595–612, 2010. doi: 10.1287/opre.1090.0741. URL https://doi.org/10.1287/opre.1090.0741

work page doi:10.1287/opre.1090.0741 2010
[9]

European Journal of Operational Research , author =

Dimitris Bertsimas and Martin S. Copenhaver. Characterization of the equivalence of ro- bustification and regularization in linear and matrix regression.European Journal of Op- erational Research, 270(3):931–942, 2018. doi: 10.1016/j.ejor.2017.03.051. URL https: //doi.org/10.1016/j.ejor.2017.03.051

work page doi:10.1016/j.ejor.2017.03.051 2018
[10]

Robust wasserstein profile inference and applications to machine learning.Journal of Applied Probability, 56(3):830–857, 2019

Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and applications to machine learning.Journal of Applied Probability, 56(3):830–857, 2019. doi: 10.1017/jpr.2019.49. URL https://doi.org/10.1017/jpr.2019.49. 47

work page doi:10.1017/jpr.2019.49 2019
[11]

Regularization via mass transportation.Journal of Machine Learning Research, 20(103):1–68, 2019

Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation.Journal of Machine Learning Research, 20(103):1–68, 2019. URL https://www.jmlr.org/papers/v20/17-633.html

2019
[12]

Giorgio Costa and Garud N. Iyengar. Distributionally robust end-to-end portfolio construction. Quantitative Finance, 23(10):1465–1482, 2023. doi: 10.1080/14697688.2023.2236148. URL https://doi.org/10.1080/14697688.2023.2236148

work page doi:10.1080/14697688.2023.2236148 2023
[13]

Fr´ ed´ eric Bonnans and Alexander Shapiro.Perturbation Analysis of Optimization Problems

J. Fr´ ed´ eric Bonnans and Alexander Shapiro.Perturbation Analysis of Optimization Problems. Springer, 2000. doi: 10.1007/978-1-4612-1394-9. URL https://doi.org/10.1007/978-1-4612-1 394-9

work page doi:10.1007/978-1-4612-1394-9 2000
[14]

Elmachtoub, Paul Grigas, and Ambuj Tewari

Othman El Balghiti, Adam N. Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. InAdvances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/a70145bf8b1 73e4496b554ce57969e24-Abstract.html

2019
[15]

Topkis.Supermodularity and Complementarity

Donald M. Topkis.Supermodularity and Complementarity. Princeton University Press, 1998. URL https://press.princeton.edu/books/paperback/9780691032443/supermodularity-and-com plementarity

arXiv 1998
[16]

, Kelly , Bryan B

Shihao Gu, Bryan Kelly, and Dacheng Xiu. Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5):2223–2273, 2020. doi: 10.1093/rfs/hhaa009. URL https://doi.org/10.1093/rfs/hhaa009

work page doi:10.1093/rfs/hhaa009 2020
[17]

Deep learning in asset pricing.Management Science, 70(2):714–750, 2024

Luyang Chen, Markus Pelger, and Jason Zhu. Deep learning in asset pricing.Management Science, 70(2):714–750, 2024. doi: 10.1287/mnsc.2023.4695. URL https://doi.org/10.1287/mn sc.2023.4695

work page doi:10.1287/mnsc.2023.4695 2024
[18]

Machine learning versus economic restrictions: Evidence from stock return predictability.Management Science, 69(5):2587–2619, 2023

Doron Avramov, Si Cheng, and Lior Metzker. Machine learning versus economic restrictions: Evidence from stock return predictability.Management Science, 69(5):2587–2619, 2023. doi: 10.1287/mnsc.2022.4449. URL https://doi.org/10.1287/mnsc.2022.4449

work page doi:10.1287/mnsc.2022.4449 2023
[19]

Nogales, and Raman Uppal

Victor DeMiguel, Alberto Mart´ ın-Utrera, Francisco J. Nogales, and Raman Uppal. A transaction-cost perspective on the multitude of firm characteristics.The Review of Finan- cial Studies, 33(5):2180–2222, 2020. doi: 10.1093/rfs/hhz085. URL https://doi.org/10.1093/rf s/hhz085

work page doi:10.1093/rfs/hhz085 2020
[20]

Model comparison with transaction costs.The Journal of Finance, 78(3):1743–1775, 2023

Andrew Detzel, Robert Novy-Marx, and Mihail Velikov. Model comparison with transaction costs.The Journal of Finance, 78(3):1743–1775, 2023. doi: 10.1111/jofi.13225. URL https: //doi.org/10.1111/jofi.13225

work page doi:10.1111/jofi.13225 2023
[21]

Chen and Mihail Velikov

Andrew Y. Chen and Mihail Velikov. Zeroing in on the expected returns of anomalies.Journal of Financial and Quantitative Analysis, 58(3):968–1004, 2023. doi: 10.1017/S0022109022000874. URL https://doi.org/10.1017/S0022109022000874

work page doi:10.1017/s0022109022000874 2023
[22]

Moskowitz

Andrea Frazzini, Ronen Israel, and Tobias J. Moskowitz. Trading costs.SSRN Electronic Journal, 2018. doi: 10.2139/ssrn.3229719. URL https://doi.org/10.2139/ssrn.3229719. 48

work page doi:10.2139/ssrn.3229719 2018
[23]

Asness, Andrea Frazzini, Ronen Israel, Tobias J

Clifford S. Asness, Andrea Frazzini, Ronen Israel, Tobias J. Moskowitz, and Lasse H. Pedersen. Size matters, if you control your junk.Journal of Financial Economics, 129(3):479–509, 2018. doi: 10.1016/j.jfineco.2018.05.006. URL https://doi.org/10.1016/j.jfineco.2018.05.006

work page doi:10.1016/j.jfineco.2018.05.006 2018
[24]

Fama and Kenneth R

Eugene F. Fama and Kenneth R. French. Common risk factors in the returns on stocks and bonds.Journal of Financial Economics, 33(1):3–56, 1993. doi: 10.1016/0304-405X(93)90023-5. URL https://doi.org/10.1016/0304-405X(93)90023-5

work page doi:10.1016/0304-405x(93)90023-5 1993
[25]

Fama and Kenneth R

Eugene F. Fama and Kenneth R. French. A five-factor asset pricing model.Journal of Financial Economics, 116(1):1–22, 2015. doi: 10.1016/j.jfineco.2014.10.010. URL https://doi.org/10.101 6/j.jfineco.2014.10.010

work page doi:10.1016/j.jfineco.2014.10.010 2015
[26]

Digesting anomalies: An investment approach.The Review of Financial Studies, 28(3):650–705, 2015

Kewei Hou, Chen Xue, and Lu Zhang. Digesting anomalies: An investment approach.The Review of Financial Studies, 28(3):650–705, 2015. doi: 10.1093/rfs/hhu068. URL https: //doi.org/10.1093/rfs/hhu068

work page doi:10.1093/rfs/hhu068 2015
[27]

An augmented q-factor model with expected growth.Review of Finance, 25(1):1–41, 2021

Kewei Hou, Haitao Mo, Chen Xue, and Lu Zhang. An augmented q-factor model with expected growth.Review of Finance, 25(1):1–41, 2021. doi: 10.1093/rof/rfaa004. URL https://doi.org/ 10.1093/rof/rfaa004

work page doi:10.1093/rof/rfaa004 2021
[28]

The other side of value: The gross profitability premium.Journal of Financial Economics, 108(1):1–28, 2013

Robert Novy-Marx. The other side of value: The gross profitability premium.Journal of Financial Economics, 108(1):1–28, 2013. doi: 10.1016/j.jfineco.2013.01.003. URL https: //doi.org/10.1016/j.jfineco.2013.01.003

work page doi:10.1016/j.jfineco.2013.01.003 2013
[29]

David McLean and Jeffrey Pontiff

R. David McLean and Jeffrey Pontiff. Does academic research destroy stock return pre- dictability?The Journal of Finance, 71(1):5–32, 2016. doi: 10.1111/jofi.12365. URL https://doi.org/10.1111/jofi.12365

work page doi:10.1111/jofi.12365 2016
[30]

Replicating anomalies.The Review of Financial Studies, 33(5):2019–2133, 2020

Kewei Hou, Chen Xue, and Lu Zhang. Replicating anomalies.The Review of Financial Studies, 33(5):2019–2133, 2020. doi: 10.1093/rfs/hhy131. URL https://doi.org/10.1093/rfs/hhy131

work page doi:10.1093/rfs/hhy131 2019
[31]

Chen and Tom Zimmermann

Andrew Y. Chen and Tom Zimmermann. Open source cross-sectional asset pricing.Critical Finance Review, 11(2):207–264, 2022. doi: 10.1561/104.00000112. URL https://doi.org/10.156 1/104.00000112

work page doi:10.1561/104.00000112 2022
[32]

Webb, Rob J

Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. InProceedings of the Neural Informa- tion Processing Systems Track on Datasets and Benchmarks, 2021. URL https://openreview.n et/forum?id=I01l7rc0jcb

2021
[33]

M5 accuracy competi- tion: Results, findings, and conclusions.International Journal of Forecasting, 38(4):1346–1364,

Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. M5 accuracy competi- tion: Results, findings, and conclusions.International Journal of Forecasting, 38(4):1346–1364,
[34]

, Spiliotis, E

doi: 10.1016/j.ijforecast.2021.11.013. URL https://doi.org/10.1016/j.ijforecast.2021.11.0 13

work page doi:10.1016/j.ijforecast.2021.11.013 2021
[35]

Arik, Nicolas Loeff, and Tomas Pfister

Bryan Lim, Sercan ¨O. Arik, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting.International Journal of Forecasting, 37 (4):1748–1764, 2021. doi: 10.1016/j.ijforecast.2021.03.012. URL https://doi.org/10.1016/j.ijfo recast.2021.03.012. 49

work page doi:10.1016/j.ijforecast.2021.03.012 2021
[36]

Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB. Published as a conference paper at ICLR 2020; arXiv:1905.10437

arXiv 2020
[37]

11121–11128

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023. doi: 10.1609/aaai.v37i9.26317. URL https://doi.org/10.1609/aaai.v37i9.2 6317

work page doi:10.1609/aaai.v37i9.26317 2023
[38]

Lag-Llama: Towards foundation models for probabilistic time series forecasting.arXiv preprint arXiv:2310.08278, 2023

Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Hena Ghonia, Rishika Bhagwatkar, Arian Khorasani, Mohammad Javad Darvishi Bayazi, George Adamopoulos, Roland Riachi, Nadhir Hassen, Marin Biloˇ s, Sahil Garg, Anderson Schneider, Nicolas Chapados, Alexandre Drouin, Valentina Zantedeschi, Yuriy Nevmyvaka, and Irina Rish. Lag-Llama: Towards foundation model...

work page doi:10.48550/arxiv.2310.08278 2023
[39]

Gift-eval: A benchmark for general time series forecasting model evaluation

Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024. doi: 10.48550/arXiv.2410.10393. URL https://arxiv. org/abs/2410.10393

work page doi:10.48550/arxiv.2410.10393 2024
[40]

Andrew W. Lo. The statistics of sharpe ratios.Financial Analysts Journal, 58(4):36–52, 2002. doi: 10.2469/faj.v58.n4.2453. URL https://doi.org/10.2469/faj.v58.n4.2453

work page doi:10.2469/faj.v58.n4.2453 2002
[41]

Olivier Ledoit and Michael Wolf. Robust performance hypothesis testing with the Sharpe ratio. Journal of Empirical Finance, 15(5):850–859, 2008. doi: 10.1016/j.jempfin.2008.03.002. URL https://doi.org/10.1016/j.jempfin.2008.03.002
[42]

Whitney K. Newey and Kenneth D. West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708, 1987. URL https://www.jstor.org/stable/1913610
[43]

Halbert White. A reality check for data snooping. Econometrica, 68(5):1097–1126, 2000. doi: 10.1111/1468-0262.00152. URL https://doi.org/10.1111/1468-0262.00152
[44]

Peter R. Hansen. A test for superior predictive ability. Journal of Business & Economic Statistics, 23(4):365–380, 2005. doi: 10.1198/073500105000000063. URL https://doi.org/10.1198/073500105000000063
[45]

Joseph P. Romano and Michael Wolf. Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282, 2005. doi: 10.1111/j.1468-0262.2005.00615.x. URL https://doi.org/10.1111/j.1468-0262.2005.00615.x
[46]

Peter R. Hansen, Asger Lunde, and James M. Nason. The model confidence set. Econometrica, 79(2):453–497, 2011. doi: 10.3982/ECTA5771. URL https://doi.org/10.3982/ECTA5771
[47]

Francis X. Diebold and Roberto S. Mariano. Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253–263, 1995. doi: 10.1080/07350015.1995.10524599. URL https://doi.org/10.1080/07350015.1995.10524599
[48]

Campbell R. Harvey, Yan Liu, and Heqing Zhu. ... and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68, 2016. doi: 10.1093/rfs/hhv059. URL https://doi.org/10.1093/rfs/hhv059
[49]

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1–2):115–166, 2018. doi: 10.1007/s10107-017-1172-1. URL https://doi.org/10.1007/s10107-017-1172-1
[50]

Dimitris N. Politis and Halbert White. Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1):53–70, 2004. doi: 10.1081/ETC-120028836. URL https://doi.org/10.1081/ETC-120028836
[51]

Serhiy Kozak, Stefan Nagel, and Shrihari Santosh. Shrinking the cross-section. Journal of Financial Economics, 135(2):271–292, 2020. doi: 10.1016/j.jfineco.2019.06.008. URL https://doi.org/10.1016/j.jfineco.2019.06.008

