pith. machine review for the scientific record.

arxiv: 2211.14730 · v2 · submitted 2022-11-27 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · transformer · patching · channel independence · long-term forecasting · self-supervised pretraining · multivariate series
0 comments

The pith

Splitting time series into subseries patches and processing channels independently lets a Transformer forecast long horizons more accurately than prior models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper divides each time series into short consecutive patches and feeds those patches to the Transformer as its input tokens. It further treats every variable as its own univariate series while sharing the same embedding layer and Transformer weights across all of them. This keeps local patterns intact inside each patch, quadratically reduces the computation and memory of the attention maps for a given look-back window, and lets the model attend to a longer history at the same cost. The resulting PatchTST model produces markedly lower error on long-term forecasting benchmarks than earlier Transformer designs. The same architecture also supports masked pre-training that transfers well across datasets and often beats supervised training from scratch on large collections.
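To make the patching step concrete, here is a minimal sketch, not the authors' released code, of slicing a look-back window into overlapping patches and projecting each patch to a token embedding with one shared linear layer. The module name is hypothetical; patch length 16 and stride 8 follow the paper's stated defaults, the other settings are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of patch tokenization for one univariate channel."""
    def __init__(self, patch_len=16, stride=8, d_model=128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)  # one linear projection shared by all patches

    def forward(self, x):
        # x: (batch, lookback), a single univariate series per sample
        patches = x.unfold(-1, self.patch_len, self.stride)  # (batch, num_patches, patch_len)
        return self.proj(patches)                            # (batch, num_patches, d_model) tokens

tokens = PatchEmbedding()(torch.randn(4, 336))
print(tokens.shape)  # torch.Size([4, 41, 128]); padding the end of the window would add one more patch
```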

Core claim

By segmenting multivariate time series into subseries patches that serve as input tokens and enforcing channel independence with shared embeddings and weights, the model retains local semantic information, reduces attention computation quadratically for a given look-back window, and attends to longer history, delivering significantly higher long-term forecasting accuracy than state-of-the-art Transformer baselines.
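A quick arithmetic check of that quadratic reduction, treating each raw time step as a token for a point-wise Transformer and each patch as a token for PatchTST. It assumes the commonly cited defaults of patch length 16 and stride 8, plus one extra patch from padding the end of the window, which is how 336- and 512-step look-backs map to 42- and 64-token configurations.

```python
# Back-of-the-envelope attention cost, assuming cost scales with tokens^2.
def num_patches(lookback, patch_len=16, stride=8):
    # +1 for the final patch assumed to come from padding the end of the window
    return (lookback - patch_len) // stride + 1 + 1

for L in (336, 512):
    n = num_patches(L)                      # 42 and 64 patch tokens
    ratio = (L * L) / (n * n)
    print(f"L={L}: {L * L} step-pair scores vs {n * n} patch-pair scores (~{ratio:.0f}x fewer)")
```

Under these assumptions the attention map shrinks by roughly the square of the stride, which is the sense in which a fixed look-back window becomes quadratically cheaper.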

What carries the argument

Subseries patches used as Transformer input tokens together with channel-independent processing that shares embeddings and weights across all univariate series.
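Channel independence with shared weights can be read as folding the channel dimension into the batch: every univariate series passes through the same embedding and backbone, and the per-channel forecasts are re-stacked afterwards. A minimal sketch under that reading, with hypothetical module names and a stand-in linear backbone rather than the actual patch Transformer:

```python
import torch
import torch.nn as nn

class ChannelIndependentForecaster(nn.Module):
    """Sketch: one shared backbone applied to every channel separately."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # same weights used for all channels

    def forward(self, x):
        # x: (batch, channels, lookback)
        b, m, l = x.shape
        y = self.backbone(x.reshape(b * m, l))  # each channel treated as its own sample
        return y.reshape(b, m, -1)              # (batch, channels, horizon)

# Stand-in backbone: any lookback -> horizon mapper suffices for the shape check.
model = ChannelIndependentForecaster(nn.Linear(336, 96))
print(model(torch.randn(2, 7, 336)).shape)  # torch.Size([2, 7, 96])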

Load-bearing premise

Dividing a time series into fixed-length patches keeps enough local information and does not break the cross-patch temporal dependencies needed for accurate long-range forecasts.

What would settle it

A controlled test on a dataset whose key predictive patterns cross the chosen patch boundaries, showing that the patched model loses accuracy relative to an otherwise identical non-patched Transformer given the same look-back window.
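One way to realize such a controlled test, sketched under the assumption of non-overlapping patches of length 16: generate a series whose governing dependency spans exactly one patch length, so the two values that determine the next step always fall in adjacent patches. The generator below is illustrative only; the actual comparison would train a patched and an otherwise identical non-patched model of matched capacity and look-back on this data.

```python
import numpy as np

# Illustrative stress-test series: the next value depends on two past values
# sitting exactly one patch length apart, so with non-overlapping patches of
# length 16 the governing pair always straddles a patch boundary.
rng = np.random.default_rng(0)
patch_len, n = 16, 10_000
x = np.zeros(n)
noise = 0.1 * rng.standard_normal(n)
for t in range(patch_len + 1, n):
    x[t] = 0.5 * x[t - 1] - 0.4 * x[t - patch_len - 1] + noise[t]

# Proposed test: fit a patched and a non-patched model with the same look-back
# window on x and compare long-horizon errors against a boundary-aligned control.
print(f"series length {n}, cross-boundary lag {patch_len + 1}")
```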

read the original abstract

We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.
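The masked pre-training mentioned in the abstract can be pictured as a patch-level analogue of masked autoencoding: a random subset of patch tokens is hidden and the model is trained to reconstruct the original patch values under an MSE loss on the masked positions. The sketch below is one reading of that setup, not the released training code; the mask ratio, zero-masking, and module names are assumptions.

```python
import torch
import torch.nn as nn

def masked_pretrain_step(encoder, head, patches, mask_ratio=0.4):
    """One illustrative masked-reconstruction step.

    patches: (batch, num_patches, patch_len), already tokenized series.
    encoder: maps (batch, num_patches, patch_len) -> (batch, num_patches, d_model).
    head:    maps d_model back to patch_len for reconstruction.
    """
    mask = torch.rand(patches.shape[:2]) < mask_ratio          # which patch tokens to hide
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out the masked patches
    recon = head(encoder(corrupted))                           # (batch, num_patches, patch_len)
    # MSE computed only on the masked positions, as in masked-autoencoder setups.
    return ((recon - patches)[mask] ** 2).mean()

# Shape check with stand-in modules; a real setup would use the shared patch
# embedding and Transformer encoder described above.
enc = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 128))
head = nn.Linear(128, 16)
loss = masked_pretrain_step(enc, head, torch.randn(8, 42, 16))
loss.backward()
```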

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PatchTST, a Transformer architecture for multivariate long-term time series forecasting and self-supervised representation learning. It segments each univariate series into subseries patches that serve as input tokens and adopts channel-independence so that all channels share the same embedding and Transformer weights. The design is claimed to retain local semantics within patches, quadratically reduce attention cost for a given look-back window, and enable longer history, yielding significant accuracy gains over prior SOTA Transformer baselines. The paper also reports strong fine-tuning results after masked pre-training and successful transfer across datasets.

Significance. If the empirical gains are robust, the work would be significant for time-series modeling by demonstrating that a simple patch-based tokenization (analogous to ViT) combined with channel independence can make Transformers practical for long-horizon forecasting while lowering memory and compute. The public code release and the self-supervised pre-training results further increase its utility. The central empirical claim rests on direct comparisons rather than any parameter-free derivation.

major comments (2)
  1. [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.
  2. [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.
minor comments (2)
  1. [Title / §3.1] The title refers to “64 Words” but the manuscript does not explicitly justify the default patch length of 16 (or 64) in the main text; a short paragraph in §3.1 would clarify the hyper-parameter choice.
  2. [Figures in §4] Figure captions and legends in the experimental section would benefit from explicit mention of the number of runs and whether shaded regions represent standard deviation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and the recommendation for minor revision. We have carefully considered the comments and provide point-by-point responses below. We will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.

    Authors: We thank the referee for pointing this out. The choice of a linear projection for patch embedding is motivated by its efficiency and its ability to preserve local semantics, as validated by the overall performance gains. To investigate potential information loss, we will add an ablation study in the revision that replaces the linear layer with a small MLP. We will also include results varying the patch length while keeping the look-back window fixed to confirm the robustness of our design. revision: yes

  2. Referee: [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.

    Authors: We agree that it is crucial to use identical look-back windows for fair comparison. In our experiments, the look-back window was set to the same value for PatchTST and all baselines (details are provided in the appendix, and we will move this to the main text). The ability to use longer history is an additional benefit enabled by the reduced complexity of patching, but our primary results use matched settings. We will also add an ablation that isolates the effect of patching from the channel-independent design by comparing against a channel-dependent variant. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical architecture evaluated against external baselines

full rationale

The paper introduces patching of time series into subseries tokens and channel-independent processing as explicit architectural choices, then reports forecasting accuracy via direct head-to-head experiments on standard public datasets against prior published SOTA Transformer models. No equation or claim reduces a fitted internal parameter to a 'prediction' of itself, no load-bearing premise rests on a self-citation chain, and the design motivations (local semantics, quadratic attention reduction, longer history) are stated independently of the numerical results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on the domain assumption that local patch semantics are sufficient for long-range forecasting and on standard Transformer inductive biases; no free parameters or invented entities are introduced beyond ordinary hyperparameters.

axioms (2)
  • domain assumption Segmenting a univariate time series into fixed-length patches preserves sufficient local semantic information for downstream forecasting.
    Invoked to justify the patching design in the abstract.
  • domain assumption Channel independence with shared weights is a valid modeling choice for multivariate series.
    Stated as a core component without further justification in the abstract.

pith-pipeline@v0.9.0 · 5501 in / 1226 out tokens · 40011 ms · 2026-05-13T19:12:20.078362+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.SimplicialLedger, Foundation.EightTick: simplicial_loop_tick_lower_bound, eight_tick_forces_D3 · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    segmentation of time series into subseries-level patches which are served as input tokens to Transformer; patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies

    cs.LG 2026-05 unverdicted novelty 7.0

    SeesawNet dynamically balances common and instance-specific dependencies via ASNA in temporal and channel dimensions, outperforming prior methods on non-stationary forecasting benchmarks.

  2. What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.

  3. Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

  4. FactoryBench: Evaluating Industrial Machine Understanding

    cs.AI 2026-05 unverdicted novelty 7.0

    FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.

  5. Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation

    cs.LG 2026-05 unverdicted novelty 7.0

    MELO aggregates base predictors and their multi-scale EWLS adaptations using MLpol to achieve oracle inequalities against best fixed and time-varying predictors in non-stationary settings.

  6. Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

    cs.LG 2026-05 accept novelty 7.0

    Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

  7. Discrete Prototypical Memories for Federated Time Series Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.

  8. Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

    q-bio.QM 2026-04 unverdicted novelty 7.0

    CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...

  9. What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

    cs.LG 2026-05 unverdicted novelty 6.0

    MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.

  10. Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

    cs.LG 2026-04 unverdicted novelty 6.0

    ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.

  11. CAARL: In-Context Learning for Interpretable Co-Evolving Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    CAARL decomposes co-evolving time series into autoregressive segments, builds a temporal dependency graph, serializes it into a narrative, and uses LLMs for interpretable forecasting via chain-of-thought reasoning.

  12. M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention

    cs.LG 2026-04 unverdicted novelty 6.0

    M3R improves localized rainfall nowcasting by using weather station time series as queries in multimodal attention to selectively extract precipitation patterns from radar imagery.

  13. A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography

    eess.SP 2026-04 unverdicted novelty 6.0

    TS2TC combines cross-temporal fusion generative anchor pretraining with dual-process transfer to achieve 2.49% lower RMSE than prior methods on PPG parameter estimation using only 10% labeled data.

  14. A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset

    cs.LG 2026-03 accept novelty 6.0

    AgriPriceBD dataset of 1779 daily prices released; naive persistence outperforms deep models like Informer and Time2Vec-Transformer on heterogeneous Bangladeshi commodity series with statistical validation.

  15. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 6.0

    iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.

  16. Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

    cs.AI 2026-03 unverdicted novelty 6.0

    Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.

  17. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  18. Beyond Similarity: Temporal Operator Attention for Time Series Analysis

    cs.LG 2026-05 unverdicted novelty 5.0

    Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...

  19. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  20. Risk-Aware Safe Throughput Forecasting for Starlink Networks

    eess.SY 2026-05 unverdicted novelty 5.0

    BG-CFQS provides risk-aware quantile-based forecasting for Starlink throughput that meets overestimation budgets and reduces positive errors compared to other feasible methods.

  21. TSNN: A Non-parametric and Interpretable Framework for Traffic Time Series Forecasting

    cs.LG 2026-05 unverdicted novelty 5.0

    TSNN matches time series entries to a training-derived memory bank to forecast traffic without any trainable parameters and achieves competitive accuracy on four real-world datasets.

  22. Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

    cs.LG 2026-04 unverdicted novelty 5.0

    A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...

  23. MedMamba: Recasting Mamba for Medical Time Series Classification

    eess.SP 2026-04 unverdicted novelty 5.0

    MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.

  24. Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook

    eess.SP 2026-04 accept novelty 5.0

    The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...

  25. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 5.0

    iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...

  26. Forecasting Green Skill Demand in the Automotive Industry: Evidence from Online Job Postings

    cs.LG 2026-05 unverdicted novelty 4.0

    A dataset of 204k skill mentions from Mexican auto job postings yields 274 green skills whose demand is best forecasted by transformer models like FEDformer, with current demand focused on operational sustainability a...

  27. Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather

    cs.LG 2026-04 unverdicted novelty 4.0

    A hybrid deep learning model with physics regularization and SHAP analysis achieves 1.18% MAPE on ERCOT load data and up to 40.5% better performance on extreme events than its individual branches.

  28. Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications

    eess.SY 2026-04 unverdicted novelty 4.0

    The paper benchmarks foundation models like TimesFM and Chronos against baselines on eight forecasting capabilities for power system time series.

  29. The CTLNet for Shanghai Composite Index Prediction

    q-fin.ST 2026-04 reject novelty 3.0

    CTLNet hybrid model outperforms listed baselines on Shanghai Composite Index prediction task.
