arxiv: 2606.27282 · v2 · pith:3NG4KWKVnew · submitted 2026-06-25 · 💻 cs.LG

How Good Can Linear Models Be for Time-Series Forecasting?

Pith reviewed 2026-06-30 09:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords time-series forecasting · ridge regression · linear models · preprocessing · hyperparameter search · context length · normalization · benchmarks

The pith

Tuned Ridge regression with optimized preprocessing beats prior linear models and exceeds Transformers on six of eight time-series benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that most forecasting gains come from careful preprocessing rather than model scale. It uses Ridge regression, which admits a closed-form solution, as a transparent testbed and searches over context length, local normalization, regularization, and augmentation across eight standard benchmarks. A sympathetic reader would care because the results indicate that simple linear models, once their inputs are properly prepared, can close or reverse the gap with far larger architectures without any increase in capacity. The work also treats the discovered hyperparameters as readable diagnostics that expose dataset-specific structure. If correct, the claim reframes the usual scaling narrative around time-series forecasting.

Core claim

Ridge regression models achieve superior accuracy once context length is chosen in a series-specific and often non-monotonic way, normalization is performed over a learned trailing fraction of the lookback window rather than the whole window, regularization strength and augmentation are tuned, and the degree of cross-series parameter sharing is allowed to vary from fully shared to fully per-series. These choices let the linear models surpass earlier linear forecasters on most dataset-horizon pairs and exceed Transformer, MLP, and CNN baselines on six of the eight benchmarks while remaining fully interpretable.

What carries the argument

Ridge regression as a closed-form, weight-interpretable testbed that directly reveals the effect of each preprocessing hyperparameter.
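
To make the closed-form property concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the helper names `make_windows` and `fit_ridge`, the synthetic sine series, and the omission of an intercept term are assumptions made for illustration. The solve step is the standard ridge normal equation, W = (XᵀX + αI)⁻¹XᵀY.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Stack (lookback -> horizon) input/target pairs from a 1-D series."""
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

def fit_ridge(X, Y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha * I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Synthetic, illustrative data (not one of the paper's benchmarks).
t = np.arange(400)
series = np.sin(2 * np.pi * t / 24)

X, Y = make_windows(series, lookback=48, horizon=12)
W = fit_ridge(X, Y, alpha=1e-3)      # shape: (lookback, horizon)
pred = X @ W
mse = float(np.mean((pred - Y) ** 2))
```

Each column of `W` holds the lag weights for one forecast step, so the effect of a change in lookback or regularization can be inspected directly in the weights.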

If this is right

  • Optimal lookback length follows dataset-specific power laws that can be positive or negative and often shrinks rather than grows with forecast horizon.
  • Normalizing over a learned trailing fraction of the context window is almost always better than normalizing over the entire window.
  • Series inside the same dataset frequently disagree on the best hyperparameters, so the optimal amount of cross-series sharing ranges from none to full.
  • The fitted hyperparameters themselves act as diagnostics that surface structures larger models would otherwise bury inside their parameters.
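
The trailing-fraction normalization in the second point can be sketched as follows. The paper's exact definition is not reproduced on this page, so this is one plausible reading: the fraction `r`, the function names, and the use of mean and standard deviation over the last `ceil(r * L)` points are assumptions.

```python
import numpy as np

def trailing_normalize(window, r, eps=1e-8):
    """Normalize a lookback window using statistics from its last ceil(r * L) points."""
    L = len(window)
    k = max(1, int(np.ceil(r * L)))       # number of trailing points used for statistics
    tail = window[-k:]
    mu, sigma = tail.mean(), tail.std() + eps
    return (window - mu) / sigma, mu, sigma

def denormalize(pred, mu, sigma):
    """Map a forecast made in normalized space back to the original scale."""
    return pred * sigma + mu

# Illustrative trending window: full-window and trailing statistics differ.
window = np.arange(100, dtype=float)
normed_tail, mu_tail, sigma_tail = trailing_normalize(window, r=0.25)
normed_full, mu_full, sigma_full = trailing_normalize(window, r=1.0)
restored = denormalize(normed_tail, mu_tail, sigma_tail)
```

With `r = 1.0` the sketch reduces to whole-window normalization. On a trending series the trailing mean sits much closer to the most recent level than the full-window mean does.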

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preprocessing search could be applied to other linear or near-linear forecasters to test whether the gains generalize beyond Ridge.
  • If preprocessing dominates, then foundation-model training budgets might be reallocated toward data curation and input normalization rather than parameter count.
  • The observed non-monotonic and series-specific lookback lengths suggest that fixed-context designs common in large models may be systematically suboptimal.

Load-bearing premise

The hyperparameter search over context length, normalization, regularization, and augmentation never saw the test data, and the eight benchmarks are representative of settings where larger models could still win.

What would settle it

A single new dataset or benchmark where, after identical hyperparameter search, any Transformer, MLP, or CNN still outperforms the tuned Ridge model by a clear margin.

Figures

Figures reproduced from arXiv: 2606.27282 by Jinglue Xu, Lang Huang, Luke Darlow.

Figure 1. Per-horizon optimal context-horizon relationships for four time series. The context lengths …
Figure 2. Left: Median optimal lookback per (series, horizon) varies. Right: Adapting context per horizon yields up to +16% MSE improvement over a global baseline across these four datasets.
Figure 3. Optimal lookback $L^*$ vs. forecast horizon $H$ across 8 datasets. (a) Median lookback (log-log) with IQR bands and power-law fits $L^* \propto H^b$. (b) Fitted exponent $b$ per dataset. (c) Per-series exponent distribution within each dataset.
Figure 4. Effect of series (Top) and horizon (Bottom) grouping on forecasting accuracy across 4 datasets. Each panel shows MSE degradation (%) relative to the best group size (marked with ⋆) as a function of group size, from per-series/horizon to fully shared.
Figure 5. Forecast comparison. Our method (blue) closely tracks the ground truth (black), while the …
Figure 6. Weight magnitude $|w|$ over lag and forecast horizon. Lighter shades indicate larger magnitudes. Lag is measured from the most recent input (bottom row); white regions lie beyond each model's chosen lookback.
Figure 7. Per-series hyperparameter (local ratio $r$ and regularization $\alpha$) heatmaps, with $g_h = 48$.
Figure 8. Augmentation analysis. Left: Proportion of trials selecting frequency-domain noise, time-domain noise, or no augmentation. Right: Distribution of optimal $\sigma$ conditional on augmentation being selected, broken down by domain.
Figure 9. Channel-averaged Pearson autocorrelation …
Figure 10. Median per-channel test MSE on Ridge with universal-default preprocessing at …
Original abstract

Time-series forecasting research has been moving steadily toward larger architectures, from specialized transformers to general-purpose foundation models, on the assumption that capacity is what unlocks accuracy. We take the opposite position: most of the gap can be closed at far lower cost by tuning preprocessing rather than scaling models. We use Ridge regression as the testbed, since it has a closed-form solution and interpretable weights, which let the optimal hyperparameters be read off the search directly. We search over context length, local normalization, regularization, and augmentation on eight standard benchmarks and find three patterns. (1) Optimal lookback is strongly series-specific and often non-monotonic in forecast horizon, with fitted power-law exponents ranging from $+0.46$ on ETTm2 to $-0.19$ on Exchange and Traffic, challenging the convention that longer horizons need longer history. (2) Normalizing over a learned trailing fraction of the context, rather than its entirety, is almost universally preferred. (3) Series within the same dataset often disagree on hyperparameters; the optimal degree of cross-series sharing varies from fully shared to fully per-series. The resulting models beat prior linear forecasters on most dataset-horizon entries and exceed Transformer, MLP, and CNN baselines on six of eight benchmarks. The optimized hyperparameters also serve as a diagnostic on the data itself, revealing structures that larger models absorb silently into their learned parameters. We provide an accompanying interactive online demonstration and the code at https://sakanaai.github.io/SearchCast/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Ridge regression with tuned preprocessing (context length, trailing normalization fraction, regularization, augmentation) can outperform prior linear forecasters on most dataset-horizon pairs and exceed Transformer/MLP/CNN baselines on six of eight standard benchmarks. It reports series-specific optimal lookbacks (with power-law exponents from +0.46 to -0.19), preference for partial-context normalization, and varying degrees of cross-series hyperparameter sharing, positioning these choices as diagnostics on the data and a lower-cost alternative to scaling model capacity. Code and an interactive demo are released.

Significance. If the results are obtained without test leakage, the work shows that substantial forecasting gains are achievable via preprocessing on a closed-form linear model, challenging the emphasis on larger architectures. The explicit release of code and the demo constitute a clear reproducibility strength that allows direct inspection of the search procedure and fitted weights.

major comments (2)
  1. [Abstract; §4 (Experiments)] The abstract and experimental sections provide no description of the validation protocol (e.g., rolling-origin or strictly held-out splits) used when searching over context length, normalization window fraction, regularization strength, and augmentation parameters. Because the headline result (outperformance on 6/8 benchmarks) rests on these per-series or per-dataset choices, absence of explicit leakage safeguards is load-bearing for the central claim.
  2. [§4.3 (Results); associated tables] Table entries comparing optimized Ridge models to baselines report point estimates only; no standard errors, multiple random seeds, or Diebold-Mariano tests are mentioned. This weakens the assertion that the linear models “exceed” complex baselines, especially when hyperparameters are themselves selected on validation data.
minor comments (2)
  1. [§3 (Method)] Notation for the trailing normalization fraction is introduced without an explicit equation; a short definition (e.g., Eq. (X)) would improve clarity when the fraction is later reported per dataset.
  2. [§4.2 (Hyperparameter Analysis)] The power-law fits for optimal lookback versus horizon are presented without the underlying scatter or R² values; adding these would let readers assess how strongly the data support the reported exponents.
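
The fit and goodness-of-fit statistic requested in minor comment 2 can be sketched as an ordinary least-squares line in log-log space. The function name `fit_power_law` and the data points are hypothetical; the values are synthetic and generated from an exact power law, not taken from the paper.

```python
import numpy as np

def fit_power_law(H, L_star):
    """Fit L* = a * H^b by least squares on (log H, log L*); return a, b, R^2."""
    x, y = np.log(H), np.log(L_star)
    b, log_a = np.polyfit(x, y, 1)        # slope is the exponent b
    resid = y - (b * x + log_a)
    r2 = 1.0 - resid.var() / y.var()      # R^2 in log space
    return float(np.exp(log_a)), float(b), float(r2)

# Synthetic example: horizons and lookbacks chosen for illustration only.
H = np.array([96, 192, 336, 720], dtype=float)
L_star = 50.0 * H ** 0.46
a, b, r2 = fit_power_law(H, L_star)
```

On real search results, the scatter of (H, L*) points around the fitted line and the resulting R² would show how well a single exponent summarizes each dataset.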

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important gaps in the experimental description and statistical reporting. We address each below and will revise the manuscript accordingly.

  1. Referee: [Abstract; §4 (Experiments)] The abstract and experimental sections provide no description of the validation protocol (e.g., rolling-origin or strictly held-out splits) used when searching over context length, normalization window fraction, regularization strength, and augmentation parameters. Because the headline result (outperformance on 6/8 benchmarks) rests on these per-series or per-dataset choices, absence of explicit leakage safeguards is load-bearing for the central claim.

    Authors: We agree that an explicit description of the validation protocol is necessary. The released code implements a rolling-origin validation scheme performed exclusively on the training portion of each series, with the final test window held out and never used for hyperparameter selection or normalization statistics. No test data influences the search. We will add a dedicated subsection in §4 describing this protocol, including the exact train/validation/test split ratios and the fact that all preprocessing statistics are computed only on the training window. revision: yes

  2. Referee: [§4.3 (Results); associated tables] Table entries comparing optimized Ridge models to baselines report point estimates only; no standard errors, multiple random seeds, or Diebold-Mariano tests are mentioned. This weakens the assertion that the linear models “exceed” complex baselines, especially when hyperparameters are themselves selected on validation data.

    Authors: The current tables report single-run point estimates because Ridge regression with fixed hyperparameters is deterministic. However, we acknowledge that variability arising from hyperparameter search and any stochastic augmentation should be quantified. In the revision we will rerun the full pipeline with multiple random seeds for the augmentation component, report standard deviations across seeds, and add Diebold-Mariano tests against the strongest baseline for each dataset-horizon pair. revision: yes
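
A rolling-origin scheme of the kind described in the first response can be sketched as below. This is an assumed layout, not the released code: the function name `rolling_origin_splits`, the fold count, the block lengths, and the 1000-point series are all hypothetical. The point of the sketch is that every validation block lies inside the training portion and the test window is never indexed.

```python
import numpy as np

def rolling_origin_splits(n_train, n_folds, val_len):
    """Yield (fit_idx, val_idx); each fold fits only on data before its validation block."""
    for k in range(n_folds, 0, -1):
        val_end = n_train - (k - 1) * val_len
        val_start = val_end - val_len
        if val_start <= 0:
            raise ValueError("not enough training data for the requested folds")
        yield np.arange(0, val_start), np.arange(val_start, val_end)

# Hypothetical sizes: the last 200 points form the held-out test window.
n_total, n_test = 1000, 200
n_train = n_total - n_test
splits = list(rolling_origin_splits(n_train, n_folds=3, val_len=100))
```

Hyperparameters would be scored by averaging validation error over the folds, and normalization statistics would be computed from `fit_idx` only, so nothing at or beyond index `n_train` influences the search.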

Circularity Check

0 steps flagged

No circularity: empirical hyperparameter search on held-out benchmarks with no definitional or self-citation reduction

The paper reports results from a direct grid search over context length, normalization fraction, regularization, and augmentation for Ridge regression, then evaluates the tuned models on eight standard time-series benchmarks. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-citations by construction. The three observed patterns (series-specific lookback, trailing normalization preference, variable cross-series sharing) are post-hoc summaries of the search outcomes rather than inputs redefined as outputs. Self-citation load-bearing, ansatz smuggling, and uniqueness theorems are absent from the provided text. The performance claims rest on standard empirical evaluation rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 0 invented entities

The work rests on empirical hyperparameter optimization rather than new theoretical entities or derivations; the main inputs are standard benchmarks and the closed-form property of Ridge regression.

free parameters (4)
  • context length
    Optimized per series and forecast horizon; the fitted power-law exponents relating optimal lookback to horizon range from +0.46 (ETTm2) to -0.19 (Exchange and Traffic)
  • normalization window fraction
    Learned trailing fraction of context rather than full window
  • regularization strength
    Tuned jointly with other factors
  • augmentation parameters
    Included in the search space
axioms (2)
  • [standard math] Ridge regression admits a closed-form solution that enables direct reading of optimal hyperparameters
    Used to justify the choice of testbed
  • [domain assumption] The eight standard benchmarks are sufficient to evaluate general forecasting performance
    Central to the claim of beating baselines on most entries
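
The augmentation entry in the ledger refers to a choice among frequency-domain noise, time-domain noise, and no augmentation, with a noise scale σ. The sketch below shows one way such a choice could be implemented. The paper's actual noise model and scaling are not given on this page, so the function `augment`, the Gaussian noise, and the way σ is applied to the spectrum are assumptions.

```python
import numpy as np

def augment(window, sigma, domain, rng):
    """Return a noisy copy of `window`; `domain` is 'time', 'freq', or 'none'."""
    if domain == "none" or sigma == 0:
        return window.copy()
    if domain == "time":
        # Additive Gaussian noise on the raw samples.
        return window + rng.normal(0.0, sigma, size=window.shape)
    if domain == "freq":
        # Additive complex Gaussian noise on the real-FFT coefficients.
        spec = np.fft.rfft(window)
        noise = rng.normal(0.0, sigma, size=spec.shape) \
            + 1j * rng.normal(0.0, sigma, size=spec.shape)
        return np.fft.irfft(spec + noise, n=len(window))
    raise ValueError(f"unknown domain: {domain}")

rng = np.random.default_rng(0)
window = np.sin(2 * np.pi * np.arange(96) / 24)   # illustrative input
noisy_time = augment(window, 0.1, "time", rng)
noisy_freq = augment(window, 0.1, "freq", rng)
clean = augment(window, 0.1, "none", rng)
```

In a search, `domain` and `sigma` would be treated as two more hyperparameters and selected on validation data alongside context length, normalization fraction, and regularization strength.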


