pith. machine review for the scientific record.

arxiv: 2211.14730 · v2 · submitted 2022-11-27 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · transformer · patching · channel independence · long-term forecasting · self-supervised pretraining · multivariate series
0 comments

The pith

Splitting time series into subseries patches and processing channels independently lets a Transformer forecast long horizons more accurately than prior models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper divides each time series into short consecutive patches and feeds those patches to the Transformer as its input tokens. It further treats every variable as its own univariate series while sharing the same embedding layer and Transformer weights across all of them. This keeps local patterns intact inside each patch, quadratically reduces the computation and memory of the attention maps for a given look-back window, and lets the model attend to a longer history at the same cost. The resulting PatchTST model produces markedly lower error on long-term forecasting benchmarks than earlier Transformer designs. The same architecture also supports masked pre-training that transfers well across datasets and often beats supervised training from scratch on large collections.
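To make the patching step concrete, here is a minimal sketch, not the authors' released code, of slicing a look-back window into overlapping patches and projecting each patch to a token embedding with one shared linear layer. The module name is hypothetical; patch length 16 and stride 8 follow the paper's stated defaults, the other settings are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of patch tokenization for one univariate channel."""
    def __init__(self, patch_len=16, stride=8, d_model=128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)  # one linear projection shared by all patches

    def forward(self, x):
        # x: (batch, lookback), a single univariate series per sample
        patches = x.unfold(-1, self.patch_len, self.stride)  # (batch, num_patches, patch_len)
        return self.proj(patches)                            # (batch, num_patches, d_model) tokens

tokens = PatchEmbedding()(torch.randn(4, 336))
print(tokens.shape)  # torch.Size([4, 41, 128]); padding the end of the window would add one more patch
```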

Core claim

By segmenting multivariate time series into subseries patches that serve as input tokens and enforcing channel independence with shared embeddings and weights, the model retains local semantic information, reduces attention computation quadratically for a given look-back window, and attends to longer history, delivering significantly higher long-term forecasting accuracy than state-of-the-art Transformer baselines.
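A quick arithmetic check of that quadratic reduction, treating each raw time step as a token for a point-wise Transformer and each patch as a token for PatchTST. It assumes the commonly cited defaults of patch length 16 and stride 8, plus one extra patch from padding the end of the window, which is how 336- and 512-step look-backs map to 42- and 64-token configurations.

```python
# Back-of-the-envelope attention cost, assuming cost scales with tokens^2.
def num_patches(lookback, patch_len=16, stride=8):
    # +1 for the final patch assumed to come from padding the end of the window
    return (lookback - patch_len) // stride + 1 + 1

for L in (336, 512):
    n = num_patches(L)                      # 42 and 64 patch tokens
    ratio = (L * L) / (n * n)
    print(f"L={L}: {L * L} step-pair scores vs {n * n} patch-pair scores (~{ratio:.0f}x fewer)")
```

Under these assumptions the attention map shrinks by roughly the square of the stride, which is the sense in which a fixed look-back window becomes quadratically cheaper.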

What carries the argument

Subseries patches used as Transformer input tokens together with channel-independent processing that shares embeddings and weights across all univariate series.
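Channel independence with shared weights can be read as folding the channel dimension into the batch: every univariate series passes through the same embedding and backbone, and the per-channel forecasts are re-stacked afterwards. A minimal sketch under that reading, with hypothetical module names and a stand-in linear backbone rather than the actual patch Transformer:

```python
import torch
import torch.nn as nn

class ChannelIndependentForecaster(nn.Module):
    """Sketch: one shared backbone applied to every channel separately."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # same weights used for all channels

    def forward(self, x):
        # x: (batch, channels, lookback)
        b, m, l = x.shape
        y = self.backbone(x.reshape(b * m, l))  # each channel treated as its own sample
        return y.reshape(b, m, -1)              # (batch, channels, horizon)

# Stand-in backbone: any lookback -> horizon mapper suffices for the shape check.
model = ChannelIndependentForecaster(nn.Linear(336, 96))
print(model(torch.randn(2, 7, 336)).shape)  # torch.Size([2, 7, 96])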

Load-bearing premise

Dividing a time series into fixed-length patches keeps enough local information and does not break the cross-patch temporal dependencies needed for accurate long-range forecasts.

What would settle it

A controlled test on a dataset whose key predictive patterns cross the chosen patch boundaries, showing that the patched model loses accuracy relative to an otherwise identical non-patched Transformer given the same look-back window.
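One way to realize such a controlled test, sketched under the assumption of non-overlapping patches of length 16: generate a series whose governing dependency spans exactly one patch length, so the two values that determine the next step always fall in adjacent patches. The generator below is illustrative only; the actual comparison would train a patched and an otherwise identical non-patched model of matched capacity and look-back on this data.

```python
import numpy as np

# Illustrative stress-test series: the next value depends on two past values
# sitting exactly one patch length apart, so with non-overlapping patches of
# length 16 the governing pair always straddles a patch boundary.
rng = np.random.default_rng(0)
patch_len, n = 16, 10_000
x = np.zeros(n)
noise = 0.1 * rng.standard_normal(n)
for t in range(patch_len + 1, n):
    x[t] = 0.5 * x[t - 1] - 0.4 * x[t - patch_len - 1] + noise[t]

# Proposed test: fit a patched and a non-patched model with the same look-back
# window on x and compare long-horizon errors against a boundary-aligned control.
print(f"series length {n}, cross-boundary lag {patch_len + 1}")
```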

read the original abstract

We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.
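The masked pre-training mentioned in the abstract can be pictured as a patch-level analogue of masked autoencoding: a random subset of patch tokens is hidden and the model is trained to reconstruct the original patch values under an MSE loss on the masked positions. The sketch below is one reading of that setup, not the released training code; the mask ratio, zero-masking, and module names are assumptions.

```python
import torch
import torch.nn as nn

def masked_pretrain_step(encoder, head, patches, mask_ratio=0.4):
    """One illustrative masked-reconstruction step.

    patches: (batch, num_patches, patch_len), already tokenized series.
    encoder: maps (batch, num_patches, patch_len) -> (batch, num_patches, d_model).
    head:    maps d_model back to patch_len for reconstruction.
    """
    mask = torch.rand(patches.shape[:2]) < mask_ratio          # which patch tokens to hide
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out the masked patches
    recon = head(encoder(corrupted))                           # (batch, num_patches, patch_len)
    # MSE computed only on the masked positions, as in masked-autoencoder setups.
    return ((recon - patches)[mask] ** 2).mean()

# Shape check with stand-in modules; a real setup would use the shared patch
# embedding and Transformer encoder described above.
enc = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 128))
head = nn.Linear(128, 16)
loss = masked_pretrain_step(enc, head, torch.randn(8, 42, 16))
loss.backward()
```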

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PatchTST, a Transformer architecture for multivariate long-term time series forecasting and self-supervised representation learning. It segments each univariate series into subseries patches that serve as input tokens and adopts channel-independence so that all channels share the same embedding and Transformer weights. The design is claimed to retain local semantics within patches, quadratically reduce attention cost for a given look-back window, and enable longer history, yielding significant accuracy gains over prior SOTA Transformer baselines. The paper also reports strong fine-tuning results after masked pre-training and successful transfer across datasets.

Significance. If the empirical gains are robust, the work would be significant for time-series modeling by demonstrating that a simple patch-based tokenization (analogous to ViT) combined with channel independence can make Transformers practical for long-horizon forecasting while lowering memory and compute. The public code release and the self-supervised pre-training results further increase its utility. The central empirical claim rests on direct comparisons rather than any parameter-free derivation.

major comments (2)
  1. [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.
  2. [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.
minor comments (2)
  1. [Title / §3.1] The title refers to “64 Words” but the manuscript does not explicitly justify the default patch length of 16 (or 64) in the main text; a short paragraph in §3.1 would clarify the hyper-parameter choice.
  2. [Figures in §4] Figure captions and legends in the experimental section would benefit from explicit mention of the number of runs and whether shaded regions represent standard deviation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and the recommendation for minor revision. We have carefully considered the comments and provide point-by-point responses below. We will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.

    Authors: We thank the referee for pointing this out. The choice of a linear projection for patch embedding is motivated by its efficiency and its ability to preserve local semantics, as validated by the overall performance gains. To investigate potential information loss, we will add an ablation study in the revision that replaces the linear layer with a small MLP. We will also include results varying the patch length while keeping the look-back window fixed to confirm the robustness of our design. revision: yes

  2. Referee: [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.

    Authors: We agree that it is crucial to use identical look-back windows for fair comparison. In our experiments, the look-back window was set to the same value for PatchTST and all baselines (details are provided in the appendix, and we will move this to the main text). The ability to use longer history is an additional benefit enabled by the reduced complexity of patching, but our primary results use matched settings. We will also add an ablation that isolates the effect of patching from the channel-independent design by comparing against a channel-dependent variant. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical architecture evaluated against external baselines

full rationale

The paper introduces patching of time series into subseries tokens and channel-independent processing as explicit architectural choices, then reports forecasting accuracy via direct head-to-head experiments on standard public datasets against prior published SOTA Transformer models. No equation or claim reduces a fitted internal parameter to a 'prediction' of itself, no load-bearing premise rests on a self-citation chain, and the design motivations (local semantics, quadratic attention reduction, longer history) are stated independently of the numerical results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on the domain assumption that local patch semantics are sufficient for long-range forecasting and on standard Transformer inductive biases; no free parameters or invented entities are introduced beyond ordinary hyperparameters.

axioms (2)
  • domain assumption Segmenting a univariate time series into fixed-length patches preserves sufficient local semantic information for downstream forecasting.
    Invoked to justify the patching design in the abstract.
  • domain assumption Channel independence with shared weights is a valid modeling choice for multivariate series.
    Stated as a core component without further justification in the abstract.

pith-pipeline@v0.9.0 · 5501 in / 1226 out tokens · 40011 ms · 2026-05-13T19:12:20.078362+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.SimplicialLedger, Foundation.EightTick: simplicial_loop_tick_lower_bound, eight_tick_forces_D3 · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    segmentation of time series into subseries-level patches which are served as input tokens to Transformer; patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies

    cs.LG 2026-05 unverdicted novelty 7.0

    SeesawNet dynamically balances common and instance-specific dependencies via ASNA in temporal and channel dimensions, outperforming prior methods on non-stationary forecasting benchmarks.

  2. What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.

  3. Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

  4. FactoryBench: Evaluating Industrial Machine Understanding

    cs.AI 2026-05 unverdicted novelty 7.0

    FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.

  5. Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation

    cs.LG 2026-05 unverdicted novelty 7.0

    MELO aggregates base predictors and their multi-scale EWLS adaptations using MLpol to achieve oracle inequalities against best fixed and time-varying predictors in non-stationary settings.

  6. Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

    cs.LG 2026-05 accept novelty 7.0

    Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

  7. Discrete Prototypical Memories for Federated Time Series Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.

  8. Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

    q-bio.QM 2026-04 unverdicted novelty 7.0

    CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...

  9. What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

    cs.LG 2026-05 unverdicted novelty 6.0

    MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.

  10. Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

    cs.LG 2026-04 unverdicted novelty 6.0

    ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.

  11. CAARL: In-Context Learning for Interpretable Co-Evolving Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    CAARL decomposes co-evolving time series into autoregressive segments, builds a temporal dependency graph, serializes it into a narrative, and uses LLMs for interpretable forecasting via chain-of-thought reasoning.

  12. M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention

    cs.LG 2026-04 unverdicted novelty 6.0

    M3R improves localized rainfall nowcasting by using weather station time series as queries in multimodal attention to selectively extract precipitation patterns from radar imagery.

  13. A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography

    eess.SP 2026-04 unverdicted novelty 6.0

    TS2TC combines cross-temporal fusion generative anchor pretraining with dual-process transfer to achieve 2.49% lower RMSE than prior methods on PPG parameter estimation using only 10% labeled data.

  14. A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset

    cs.LG 2026-03 accept novelty 6.0

    AgriPriceBD dataset of 1779 daily prices released; naive persistence outperforms deep models like Informer and Time2Vec-Transformer on heterogeneous Bangladeshi commodity series with statistical validation.

  15. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 6.0

    iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.

  16. Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

    cs.AI 2026-03 unverdicted novelty 6.0

    Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.

  17. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  18. Beyond Similarity: Temporal Operator Attention for Time Series Analysis

    cs.LG 2026-05 unverdicted novelty 5.0

    Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...

  19. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  20. Risk-Aware Safe Throughput Forecasting for Starlink Networks

    eess.SY 2026-05 unverdicted novelty 5.0

    BG-CFQS provides risk-aware quantile-based forecasting for Starlink throughput that meets overestimation budgets and reduces positive errors compared to other feasible methods.

  21. TSNN: A Non-parametric and Interpretable Framework for Traffic Time Series Forecasting

    cs.LG 2026-05 unverdicted novelty 5.0

    TSNN matches time series entries to a training-derived memory bank to forecast traffic without any trainable parameters and achieves competitive accuracy on four real-world datasets.

  22. Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

    cs.LG 2026-04 unverdicted novelty 5.0

    A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...

  23. MedMamba: Recasting Mamba for Medical Time Series Classification

    eess.SP 2026-04 unverdicted novelty 5.0

    MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.

  24. Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook

    eess.SP 2026-04 accept novelty 5.0

    The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...

  25. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 5.0

    iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...

  26. Forecasting Green Skill Demand in the Automotive Industry: Evidence from Online Job Postings

    cs.LG 2026-05 unverdicted novelty 4.0

    A dataset of 204k skill mentions from Mexican auto job postings yields 274 green skills whose demand is best forecasted by transformer models like FEDformer, with current demand focused on operational sustainability a...

  27. Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather

    cs.LG 2026-04 unverdicted novelty 4.0

    A hybrid deep learning model with physics regularization and SHAP analysis achieves 1.18% MAPE on ERCOT load data and up to 40.5% better performance on extreme events than its individual branches.

  28. Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications

    eess.SY 2026-04 unverdicted novelty 4.0

    The paper benchmarks foundation models like TimesFM and Chronos against baselines on eight forecasting capabilities for power system time series.

  29. The CTLNet for Shanghai Composite Index Prediction

    q-fin.ST 2026-04 reject novelty 3.0

    CTLNet hybrid model outperforms listed baselines on Shanghai Composite Index prediction task.
