pith. sign in

arxiv: 2606.09420 · v1 · pith:MAXC2WF3new · submitted 2026-06-08 · 🧮 math.OC · q-fin.PM

Benchmarking Deep Time Series Models for Equity Portfolios

Pith reviewed 2026-06-27 15:37 UTC · model grok-4.3

classification 🧮 math.OC q-fin.PM
keywords time series forecastingequity portfoliosbenchmarkingdeep learningstochastic multi-criteria analysisportfolio optimizationtransaction costsforecasting architectures
0
0 comments X

The pith

No single time-series architecture dominates daily equity portfolio benchmarks after costs and constraints apply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a CRSP daily-stock benchmark covering 15 deep and statistical time-series models from 2018 to 2024. It evaluates models through common-window decile portfolios followed by stochastic multi-criteria acceptability analysis and a constrained quadratic portfolio layer that enforces capacity, beta, industry, risk, leverage, and turnover limits. An entropic acceptability index derived from the SMAA prior downweights models that produce high portfolio regret. Results show no model exceeds a 0.36 rank-1 acceptability score, with TransEnc-8 highest at 0.352, while rankings shift across preferences, market states, features, and transaction costs. Constrained portfolios produce negative net Sharpe ratios at 20 basis points for every promoted model.

Core claim

Benchmarking forecasting architectures for daily equity portfolios is not just a prediction exercise. It also asks which model remains usable after preferences, costs, and portfolio constraints are imposed. We build a CRSP daily-stock benchmark for 15 deep and statistical time-series architectures over 2018--2024. The protocol combines common-window decile portfolios, stochastic multi-criteria acceptability analysis, a deployment-adjusted acceptability index defined as an entropic update from the SMAA prior, and a constrained quadratic portfolio layer with capacity, beta, industry, risk, leverage, and turnover controls. Empirically, no architecture dominates the raw benchmark: TransEnc-8 has

What carries the argument

The deployment-adjusted acceptability index as an entropic update from the SMAA prior, applied to decile portfolios before constrained quadratic optimization with capacity, beta, industry, risk, leverage, and turnover controls.

If this is right

  • Model rankings change with investor preferences, market state, feature universe, and transaction costs.
  • TransEnc-8 is selected in the five-model constrained-portfolio comparison under the full protocol.
  • Raw return-oriented rankings can instead favor TS-RIDGE.
  • Broad-universe decile signals survive costs in some configurations.
  • Net Sharpe ratios after 20 bps costs are negative for all models in the baseline constrained QP.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model selection for portfolios should incorporate full construction pipelines rather than isolated accuracy metrics.
  • The entropic index could be tested on other multi-criteria problems such as credit or macro forecasting.
  • Extending the protocol to intraday data or international equities would reveal whether the no-dominance result holds outside daily U.S. stocks.

Load-bearing premise

The protocol of common-window decile portfolios, SMAA, entropic acceptability index, and constrained QP with the listed controls is sufficient to determine which models remain usable after preferences, costs, and constraints are imposed.

What would settle it

Finding one architecture that produces positive net Sharpe ratios in the constrained QP across multiple preference weightings and cost levels would show dominance where the paper reports none.

Figures

Figures reproduced from arXiv: 2606.09420 by Aoxin Zhang, Kwanting Leung, Yuhan Cheng.

Figure 1
Figure 1. Figure 1: Equal-weighted and value-weighted Sharpe ratios for the fifteen retained models [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SMAA rank acceptability over 15 models and 15 possible ranks. The full rank distribution in [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: SMAA top-3 acceptability probabilities. Top-3 acceptability concentrates around the lower-turnover transformer-encoder and recurrent configurations ( [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Rank-1 SMAA acceptability bands for the leading models. gross sharpe net20 sharpe vw sharpe abs ff5 alpha t minus turnover minus max drawdown bootstrap significance TS-RIDGE SMAA Central Weight Vectors 0.0 0.2 0.4 0.6 0.8 1.0 Weight LSTM TransEnc-8 [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Central SMAA weight vectors for selected leading models [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: SMAA preference-space geometry and turnover-weight acceptability frontier [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Ranking sensitivity across transaction cost levels. bps; as the cost schedule tightens, cost-sensitive preferences move acceptability toward architectures whose rankings rotate more slowly [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: SMAA rank acceptability by market-volatility state [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Robust net Sharpe under ellipsoidal score ambiguity [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Net Sharpe ratios under wider constrained-portfolio designs [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Predict-then-optimize rank association between raw SMAA expected ranks and constrained-QP ranks across transaction cost levels [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Deployment regret in net Sharpe units under constrained-portfolio designs. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Raw 15-model decile SMAA, restricted to promoted models, and optimized five-model portfolio SMAA [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Raw and deployment-adjusted rank-1 acceptability for promoted models. rows use the same full-universe forecast files as the confirmatory headline benchmark, so the shared F3 cells match [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: TS-RIDGE regularization path across nested feature universes. that information becomes valuable only when coefficient discipline prevents overreaction to unstable high-dimensional variation. 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 Average absolute coefficient mktcap_log rev_5 mom_5 ret_l1 mom_21 rsi_14 ret_l2 gap_open log_dollar_vol spread_hilo range_1 ma_21_ratio ma_5_ratio excess_sp_l1 excess_ew… view at source ↗
Figure 16
Figure 16. Figure 16: Average absolute TS-RIDGE coefficient by feature [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Feature-block ablation heatmap for the five promoted models under the common forecast protocol. The feature-block ablation reinforces the same mechanism. Because F2 is F3 without the size, beta, and market-relative block, removing the price block weakens the nonlinear price-path channel and removing activity variables tests whether trading intensity carries the ranking. Restoring the full structured stack… view at source ↗
Figure 18
Figure 18. Figure 18: Average daily cross-sectional Spearman rank correlations among retained model prediction signals. RIDGE long–short. TS-RIDGE supplies the stable full-signal linear anchor, and LSTM supplies the price-sensitive nonlinear contrast [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Promoted-set pairwise rolling-combination net Sharpe matrix. 9 Deployment Boundaries Daily stock sorting draws strength from universe breadth, so restricting capacity changes the ranking [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Capacity frontier for promoted models [PITH_FULL_IMAGE:figures/full_fig_p027_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Breakeven transaction-cost diagnostics for promoted models. Curves report net Sharpe under one-way cost schedules for the broad decile layer and the baseline constrained-QP layer; vertical reference lines mark 0.8, 20, and 50 bps [PITH_FULL_IMAGE:figures/full_fig_p027_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Design sensitivity and aggregate prediction boundary. The aggregate boundary test is sharper than the state comparison. None of the market-plus￾signal specifications produces positive aggregate out-of-sample R2 , so strong stock-level sorting does not automatically become market-timing ability. Adjacent forecasting targets impose different con￾straints once the target, aggregation level, and portfolio rul… view at source ↗
Figure 23
Figure 23. Figure 23: F3 neural-model decile-return profiles used to audit the negative RankIC and positive long–short Sharpe cases. 2021-01 2021-07 2022-01 2022-07 2023-01 2023-07 2024-01 2024-07 2025-01 Date 0.2 0.0 0.2 0.4 0.6 0.8 1.0 Cumulative D10-D1 F3 LSTM long-short cumulative (a) LSTM 2021-01 2021-07 2022-01 2022-07 2023-01 2023-07 2024-01 2024-07 2025-01 Date 0.4 0.2 0.0 0.2 0.4 Cumulative D10-D1 F3 TransEnc-8 long-s… view at source ↗
Figure 24
Figure 24. Figure 24: F3 cumulative long–short returns under the same score direction as the RankIC calculation. non-monotonicity effect rather than a long–short direction or label convention error. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: SMAA rank-1 Monte Carlo convergence for leading models [PITH_FULL_IMAGE:figures/full_fig_p038_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Deployment-adjusted rank-1 sensitivity across dimensionless regret-discount multipliers c. −2.0 −1.5 −1.0 −0.5 0.0 0.5 Net Sharpe TS-OLS TS-RIDGE Linear 20 bps LSTM Calibrated sqrt TransEnc-8 TransEnc-10 [PITH_FULL_IMAGE:figures/full_fig_p039_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Linear and square-root transaction-cost robustness for promoted common-window portfolios [PITH_FULL_IMAGE:figures/full_fig_p039_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Net Sharpe ratios under daily and weekly rebalancing. nonlinear models have lower turnover under both rules, so the weekly change is smaller for them. The ranking under weekly rebalancing is led by TS-RIDGE, TS-OLS, and TransEnc-8. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_28.png] view at source ↗
read the original abstract

Benchmarking forecasting architectures for daily equity portfolios is not just a prediction exercise. It also asks which model remains usable after preferences, costs, and portfolio constraints are imposed. We build a CRSP daily-stock benchmark for 15 deep and statistical time-series architectures over 2018--2024. The protocol combines common-window decile portfolios, stochastic multi-criteria acceptability analysis, a deployment-adjusted acceptability index, and a constrained quadratic portfolio layer with capacity, beta, industry, risk, leverage, and turnover controls. The index starts from the SMAA rank-acceptability distribution and downweights models whose criteria-level wins produce high portfolio regret; its Gibbs form is characterized as an entropic update from the SMAA prior. Empirically, no architecture dominates the raw benchmark: TransEnc-8 has the largest rank-1 acceptability, 0.352, and no model exceeds about 0.36. Rankings vary with preferences, market state, feature universe, and transaction costs. In the promoted five-model constrained-portfolio comparison, TransEnc-8 is selected throughout, while return-oriented raw rankings can favor TS-RIDGE. Broad-universe decile signals can survive costs, but the baseline constrained-QP net Sharpe at 20 bps is negative for every promoted model. The benchmark supports model selection and diagnosis rather than a standalone trading-strategy claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper benchmarks 15 deep and statistical time-series architectures for daily equity portfolio construction on CRSP data (2018–2024). It deploys a multi-stage protocol that constructs common-window decile portfolios, applies stochastic multi-criteria acceptability analysis (SMAA), defines a deployment-adjusted acceptability index via an entropic (Gibbs) update from the SMAA prior, and solves a constrained quadratic program incorporating capacity, beta, industry, risk, leverage, and turnover limits. The central empirical claims are that no architecture dominates (TransEnc-8 attains the highest rank-1 acceptability of 0.352; no model exceeds ~0.36), that rankings are sensitive to preferences, market state, feature universe, and transaction costs, that TransEnc-8 is selected in the five-model constrained comparison while return-oriented rankings can favor TS-RIDGE, and that broad-universe decile signals can survive costs yet the baseline constrained-QP net Sharpe at 20 bps remains negative for every promoted model.

Significance. If the protocol and implementation details hold, the work supplies a reproducible, preference-aware evaluation framework that moves beyond raw forecast accuracy to usability under realistic portfolio constraints. It credits the use of standard CRSP data, explicit multi-criteria SMAA, the explicit entropic characterization of the acceptability index, and the full set of QP controls. The negative net-Sharpe finding and the observation that no model dominates are falsifiable, policy-relevant results that caution against over-reliance on any single architecture.

minor comments (3)
  1. Abstract and §4: the exact functional form of the entropic update (Gibbs distribution) from the SMAA rank-acceptability vector to the deployment-adjusted index should be written explicitly, including the temperature parameter and any normalization, so that the index can be reproduced from the reported acceptability numbers alone.
  2. §3.3 and Table 2: the precise definition of the five-model constrained-QP comparison (which models are included, how the capacity and turnover limits are set, and whether the same random seeds are used across architectures) needs a dedicated paragraph or pseudocode block to eliminate ambiguity in the reported selection of TransEnc-8.
  3. Figure 3 and §5.2: the market-state and transaction-cost sensitivity plots would benefit from error bars or bootstrap intervals on the acceptability values so that the claim “rankings vary” can be assessed for statistical significance rather than visual inspection alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful summary of the manuscript, the positive evaluation of its significance, and the recommendation of minor revision. No major comments appear in the report, so we provide no point-by-point rebuttals below. Any minor suggestions will be incorporated in the revised version.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external data and stated definitions

full rationale

The paper is an empirical benchmarking exercise on CRSP data using common-window decile portfolios, SMAA rank-acceptability, a defined entropic deployment-adjusted index (explicitly characterized as an update from the SMAA prior), and constrained QP optimization with explicit controls. No derivation reduces by construction to fitted parameters or self-citations; the index form is stated rather than derived from the target results, and all performance numbers (e.g., rank-1 acceptability 0.352) are computed from external market data and standard portfolio layers. The protocol is presented as a composite method whose outputs are falsifiable against the data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available for review; no explicit free parameters, axioms, or invented entities are identifiable beyond standard domain assumptions of portfolio theory. The 20 bps transaction cost level and any internal weights inside the entropic index are not detailed.

pith-pipeline@v0.9.1-grok · 5771 in / 1447 out tokens · 26610 ms · 2026-06-27T15:37:17.927679+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 43 canonical work pages

  1. [1]

    SMAA-2: Stochastic multicriteria acceptability analysis for group decision making.Operations Research, 49(3):444–454, 2001

    Risto Lahdelma and Pekka Salminen. SMAA-2: Stochastic multicriteria acceptability analysis for group decision making.Operations Research, 49(3):444–454, 2001. doi: 10.1287/opre.49.3. 444.11220. URL https://doi.org/10.1287/opre.49.3.444.11220

  2. [2]

    A survey on stochastic multicriteria acceptability analysis methods.Journal of Multi-Criteria Decision Analysis, 15(1–2):1–14, 2008

    Tommi Tervonen and Jos´ e Rui Figueira. A survey on stochastic multicriteria acceptability analysis methods.Journal of Multi-Criteria Decision Analysis, 15(1–2):1–14, 2008. doi: 10.1 002/mcda.407. URL https://doi.org/10.1002/mcda.407

  3. [3]

    Springer, 2 edition, 2016

    Salvatore Greco, Matthias Ehrgott, and Jos´ e Rui Figueira, editors.Multiple Criteria Decision Analysis: State of the Art Surveys, volume 233 ofInternational Series in Operations Research & Management Science. Springer, 2 edition, 2016. doi: 10.1007/978-1-4939-3094-4. URL https://doi.org/10.1007/978-1-4939-3094-4

  4. [4]

    IPSSIS: An inte- grated multicriteria decision support system for equity portfolio construction and selection.Eu- ropean Journal of Operational Research, 210(2):398–409, 2011

    Panos Xidonas, George Mavrotas, Constantin Zopounidis, and John Psarras. IPSSIS: An inte- grated multicriteria decision support system for equity portfolio construction and selection.Eu- ropean Journal of Operational Research, 210(2):398–409, 2011. doi: 10.1016/j.ejor.2010.08.028. URL https://doi.org/10.1016/j.ejor.2010.08.028

  5. [5]

    predict, then optimize

    Adam N. Elmachtoub and Paul Grigas. Smart “predict, then optimize”.Management Science, 68(1):9–26, 2022. doi: 10.1287/mnsc.2020.3922. URL https://doi.org/10.1287/mnsc.2020.3922

  6. [6]

    Cambridge Univer- sity Press, Cambridge, 2006

    Nicolo Cesa-Bianchi and Gabor Lugosi.Prediction, Learning, and Games. Cambridge Univer- sity Press, Cambridge, 2006

  7. [7]

    Robust portfolio selection problems.Mathematics of Operations Research, 28(1):1–38, 2003

    Donald Goldfarb and Garud Iyengar. Robust portfolio selection problems.Mathematics of Operations Research, 28(1):1–38, 2003. doi: 10.1287/moor.28.1.1.14260. URL https://doi.org/ 10.1287/moor.28.1.1.14260

  8. [8]

    Demirel, I., Celik, A

    Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems.Operations Research, 58(3):595–612, 2010. doi: 10.1287/opre.1090.0741. URL https://doi.org/10.1287/opre.1090.0741

  9. [9]

    European Journal of Operational Research , author =

    Dimitris Bertsimas and Martin S. Copenhaver. Characterization of the equivalence of ro- bustification and regularization in linear and matrix regression.European Journal of Op- erational Research, 270(3):931–942, 2018. doi: 10.1016/j.ejor.2017.03.051. URL https: //doi.org/10.1016/j.ejor.2017.03.051

  10. [10]

    Robust wasserstein profile inference and applications to machine learning.Journal of Applied Probability, 56(3):830–857, 2019

    Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and applications to machine learning.Journal of Applied Probability, 56(3):830–857, 2019. doi: 10.1017/jpr.2019.49. URL https://doi.org/10.1017/jpr.2019.49. 47

  11. [11]

    Regularization via mass transportation.Journal of Machine Learning Research, 20(103):1–68, 2019

    Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation.Journal of Machine Learning Research, 20(103):1–68, 2019. URL https://www.jmlr.org/papers/v20/17-633.html

  12. [12]

    Giorgio Costa and Garud N. Iyengar. Distributionally robust end-to-end portfolio construction. Quantitative Finance, 23(10):1465–1482, 2023. doi: 10.1080/14697688.2023.2236148. URL https://doi.org/10.1080/14697688.2023.2236148

  13. [13]

    Fr´ ed´ eric Bonnans and Alexander Shapiro.Perturbation Analysis of Optimization Problems

    J. Fr´ ed´ eric Bonnans and Alexander Shapiro.Perturbation Analysis of Optimization Problems. Springer, 2000. doi: 10.1007/978-1-4612-1394-9. URL https://doi.org/10.1007/978-1-4612-1 394-9

  14. [14]

    Elmachtoub, Paul Grigas, and Ambuj Tewari

    Othman El Balghiti, Adam N. Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. InAdvances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/a70145bf8b1 73e4496b554ce57969e24-Abstract.html

  15. [15]

    Topkis.Supermodularity and Complementarity

    Donald M. Topkis.Supermodularity and Complementarity. Princeton University Press, 1998. URL https://press.princeton.edu/books/paperback/9780691032443/supermodularity-and-com plementarity

  16. [16]

    , Kelly , Bryan B

    Shihao Gu, Bryan Kelly, and Dacheng Xiu. Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5):2223–2273, 2020. doi: 10.1093/rfs/hhaa009. URL https://doi.org/10.1093/rfs/hhaa009

  17. [17]

    Deep learning in asset pricing.Management Science, 70(2):714–750, 2024

    Luyang Chen, Markus Pelger, and Jason Zhu. Deep learning in asset pricing.Management Science, 70(2):714–750, 2024. doi: 10.1287/mnsc.2023.4695. URL https://doi.org/10.1287/mn sc.2023.4695

  18. [18]

    Machine learning versus economic restrictions: Evidence from stock return predictability.Management Science, 69(5):2587–2619, 2023

    Doron Avramov, Si Cheng, and Lior Metzker. Machine learning versus economic restrictions: Evidence from stock return predictability.Management Science, 69(5):2587–2619, 2023. doi: 10.1287/mnsc.2022.4449. URL https://doi.org/10.1287/mnsc.2022.4449

  19. [19]

    Nogales, and Raman Uppal

    Victor DeMiguel, Alberto Mart´ ın-Utrera, Francisco J. Nogales, and Raman Uppal. A transaction-cost perspective on the multitude of firm characteristics.The Review of Finan- cial Studies, 33(5):2180–2222, 2020. doi: 10.1093/rfs/hhz085. URL https://doi.org/10.1093/rf s/hhz085

  20. [20]

    Model comparison with transaction costs.The Journal of Finance, 78(3):1743–1775, 2023

    Andrew Detzel, Robert Novy-Marx, and Mihail Velikov. Model comparison with transaction costs.The Journal of Finance, 78(3):1743–1775, 2023. doi: 10.1111/jofi.13225. URL https: //doi.org/10.1111/jofi.13225

  21. [21]

    Chen and Mihail Velikov

    Andrew Y. Chen and Mihail Velikov. Zeroing in on the expected returns of anomalies.Journal of Financial and Quantitative Analysis, 58(3):968–1004, 2023. doi: 10.1017/S0022109022000874. URL https://doi.org/10.1017/S0022109022000874

  22. [22]

    Moskowitz

    Andrea Frazzini, Ronen Israel, and Tobias J. Moskowitz. Trading costs.SSRN Electronic Journal, 2018. doi: 10.2139/ssrn.3229719. URL https://doi.org/10.2139/ssrn.3229719. 48

  23. [23]

    Asness, Andrea Frazzini, Ronen Israel, Tobias J

    Clifford S. Asness, Andrea Frazzini, Ronen Israel, Tobias J. Moskowitz, and Lasse H. Pedersen. Size matters, if you control your junk.Journal of Financial Economics, 129(3):479–509, 2018. doi: 10.1016/j.jfineco.2018.05.006. URL https://doi.org/10.1016/j.jfineco.2018.05.006

  24. [24]

    Fama and Kenneth R

    Eugene F. Fama and Kenneth R. French. Common risk factors in the returns on stocks and bonds.Journal of Financial Economics, 33(1):3–56, 1993. doi: 10.1016/0304-405X(93)90023-5. URL https://doi.org/10.1016/0304-405X(93)90023-5

  25. [25]

    Fama and Kenneth R

    Eugene F. Fama and Kenneth R. French. A five-factor asset pricing model.Journal of Financial Economics, 116(1):1–22, 2015. doi: 10.1016/j.jfineco.2014.10.010. URL https://doi.org/10.101 6/j.jfineco.2014.10.010

  26. [26]

    Digesting anomalies: An investment approach.The Review of Financial Studies, 28(3):650–705, 2015

    Kewei Hou, Chen Xue, and Lu Zhang. Digesting anomalies: An investment approach.The Review of Financial Studies, 28(3):650–705, 2015. doi: 10.1093/rfs/hhu068. URL https: //doi.org/10.1093/rfs/hhu068

  27. [27]

    An augmented q-factor model with expected growth.Review of Finance, 25(1):1–41, 2021

    Kewei Hou, Haitao Mo, Chen Xue, and Lu Zhang. An augmented q-factor model with expected growth.Review of Finance, 25(1):1–41, 2021. doi: 10.1093/rof/rfaa004. URL https://doi.org/ 10.1093/rof/rfaa004

  28. [28]

    The other side of value: The gross profitability premium.Journal of Financial Economics, 108(1):1–28, 2013

    Robert Novy-Marx. The other side of value: The gross profitability premium.Journal of Financial Economics, 108(1):1–28, 2013. doi: 10.1016/j.jfineco.2013.01.003. URL https: //doi.org/10.1016/j.jfineco.2013.01.003

  29. [29]

    David McLean and Jeffrey Pontiff

    R. David McLean and Jeffrey Pontiff. Does academic research destroy stock return pre- dictability?The Journal of Finance, 71(1):5–32, 2016. doi: 10.1111/jofi.12365. URL https://doi.org/10.1111/jofi.12365

  30. [30]

    Replicating anomalies.The Review of Financial Studies, 33(5):2019–2133, 2020

    Kewei Hou, Chen Xue, and Lu Zhang. Replicating anomalies.The Review of Financial Studies, 33(5):2019–2133, 2020. doi: 10.1093/rfs/hhy131. URL https://doi.org/10.1093/rfs/hhy131

  31. [31]

    Chen and Tom Zimmermann

    Andrew Y. Chen and Tom Zimmermann. Open source cross-sectional asset pricing.Critical Finance Review, 11(2):207–264, 2022. doi: 10.1561/104.00000112. URL https://doi.org/10.156 1/104.00000112

  32. [32]

    Webb, Rob J

    Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. InProceedings of the Neural Informa- tion Processing Systems Track on Datasets and Benchmarks, 2021. URL https://openreview.n et/forum?id=I01l7rc0jcb

  33. [33]

    M5 accuracy competi- tion: Results, findings, and conclusions.International Journal of Forecasting, 38(4):1346–1364,

    Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. M5 accuracy competi- tion: Results, findings, and conclusions.International Journal of Forecasting, 38(4):1346–1364,

  34. [34]

    , Spiliotis, E

    doi: 10.1016/j.ijforecast.2021.11.013. URL https://doi.org/10.1016/j.ijforecast.2021.11.0 13

  35. [35]

    Arik, Nicolas Loeff, and Tomas Pfister

    Bryan Lim, Sercan ¨O. Arik, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting.International Journal of Forecasting, 37 (4):1748–1764, 2021. doi: 10.1016/j.ijforecast.2021.03.012. URL https://doi.org/10.1016/j.ijfo recast.2021.03.012. 49

  36. [36]

    Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB. Published as a conference paper at ICLR 2020; arXiv:1905.10437

  37. [37]

    11121–11128

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023. doi: 10.1609/aaai.v37i9.26317. URL https://doi.org/10.1609/aaai.v37i9.2 6317

  38. [38]

    Lag-Llama: Towards foundation models for probabilistic time series forecasting.arXiv preprint arXiv:2310.08278, 2023

    Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Hena Ghonia, Rishika Bhagwatkar, Arian Khorasani, Mohammad Javad Darvishi Bayazi, George Adamopoulos, Roland Riachi, Nadhir Hassen, Marin Biloˇ s, Sahil Garg, Anderson Schneider, Nicolas Chapados, Alexandre Drouin, Valentina Zantedeschi, Yuriy Nevmyvaka, and Irina Rish. Lag-Llama: Towards foundation model...

  39. [39]

    Gift-eval: A benchmark for general time series forecasting model evaluation

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024. doi: 10.48550/arXiv.2410.10393. URL https://arxiv. org/abs/2410.10393

  40. [40]

    Andrew W. Lo. The statistics of sharpe ratios.Financial Analysts Journal, 58(4):36–52, 2002. doi: 10.2469/faj.v58.n4.2453. URL https://doi.org/10.2469/faj.v58.n4.2453

  41. [41]

    Robust performance hypothesis testing with the sharpe ratio

    Olivier Ledoit and Michael Wolf. Robust performance hypothesis testing with the sharpe ratio. Journal of Empirical Finance, 15(5):850–859, 2008. doi: 10.1016/j.jempfin.2008.03.002. URL https://doi.org/10.1016/j.jempfin.2008.03.002

  42. [42]

    Newey and Kenneth D

    Whitney K. Newey and Kenneth D. West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix.Econometrica, 55(3):703–708, 1987. URL https://www.jstor.org/stable/1913610

  43. [43]

    of” in the title, which we felt was better than the original, “on

    Halbert White. A reality check for data snooping.Econometrica, 68(5):1097–1126, 2000. doi: 10.1111/1468-0262.00152. URL https://doi.org/10.1111/1468-0262.00152

  44. [44]

    Peter R. Hansen. A test for superior predictive ability.Journal of Business & Economic Statistics, 23(4):365–380, 2005. doi: 10.1198/073500105000000063. URL https://doi.org/10.1 198/073500105000000063

  45. [45]

    Romano and Michael Wolf

    Joseph P. Romano and Michael Wolf. Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282, 2005. doi: 10.1111/j.1468-0262.2005.00615.x. URL https: //doi.org/10.1111/j.1468-0262.2005.00615.x

  46. [46]

    and Lunde, Asger and Nason, James M

    Peter R. Hansen, Asger Lunde, and James M. Nason. The model confidence set.Econometrica, 79(2):453–497, 2011. doi: 10.3982/ECTA5771. URL https://doi.org/10.3982/ECTA5771

  47. [47]

    and Mariano, Roberto S

    Francis X. Diebold and Roberto S. Mariano. Comparing predictive accuracy.Journal of Busi- ness & Economic Statistics, 13(3):253–263, 1995. doi: 10.1080/07350015.1995.10524599. URL https://doi.org/10.1080/07350015.1995.10524599. 50

  48. [48]

    Harvey, Yan Liu, and Heqing Zhu

    Campbell R. Harvey, Yan Liu, and Heqing Zhu. ... and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68, 2016. doi: 10.1093/rfs/hhv059. URL https: //doi.org/10.1093/rfs/hhv059

  49. [49]

    Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations.Math- ematical Programming, 171(1–2):115–166, 2018. doi: 10.1007/s10107-017-1172-1. URL https://doi.org/10.1007/s10107-017-1172-1

  50. [50]

    and White, Halbert , TITLE =

    Dimitris N. Politis and Halbert White. Automatic block-length selection for the dependent bootstrap.Econometric Reviews, 23(1):53–70, 2004. doi: 10.1081/ETC-120028836. URL https://doi.org/10.1081/ETC-120028836

  51. [51]

    Shrinking the cross-section.Journal of Financial Economics, 135(2):271–292, 2020

    Serhiy Kozak, Stefan Nagel, and Shrihari Santosh. Shrinking the cross-section.Journal of Financial Economics, 135(2):271–292, 2020. doi: 10.1016/j.jfineco.2019.06.008. URL https: //doi.org/10.1016/j.jfineco.2019.06.008. 51