A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
Pith reviewed 2026-05-13 19:12 UTC · model grok-4.3
The pith
Splitting time series into subseries patches and processing channels independently lets a Transformer forecast long horizons more accurately than prior Transformer-based models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By segmenting multivariate time series into subseries patches that serve as input tokens and enforcing channel independence with shared embeddings and weights, the model retains local semantic information, reduces attention computation quadratically for a given look-back window, and attends to longer history, delivering significantly higher long-term forecasting accuracy than state-of-the-art Transformer baselines.
What carries the argument
Subseries patches used as Transformer input tokens together with channel-independent processing that shares embeddings and weights across all univariate series.
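Channel-independent processing can be illustrated with a minimal sketch. It assumes a hypothetical per-series forecaster, `forecast_univariate`, standing in for the shared embedding and Transformer backbone; the toy model `repeat_last_value` is likewise invented for illustration and is not the authors' implementation. The point is only that one callable, and hence one set of weights, is applied to every channel separately.

```python
from typing import Callable, List

Series = List[float]


def forecast_channel_independent(
    multivariate: List[Series],
    forecast_univariate: Callable[[Series], Series],
) -> List[Series]:
    """Apply one shared univariate forecaster to every channel separately.

    `forecast_univariate` is a hypothetical stand-in for the shared
    embedding + Transformer backbone. The same callable (hence the same
    weights) is reused for all channels, and no channel sees another.
    """
    return [forecast_univariate(channel) for channel in multivariate]


def repeat_last_value(horizon: int) -> Callable[[Series], Series]:
    """Toy shared model: predict the last observed value `horizon` times."""
    return lambda series: [series[-1]] * horizon


channels = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
predictions = forecast_channel_independent(channels, repeat_last_value(2))
print(predictions)  # [[3.0, 3.0], [30.0, 30.0]]
```

In a batched implementation the same effect is usually obtained by folding the channel axis into the batch axis, which is a design choice rather than something this page confirms.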
Load-bearing premise
Dividing a time series into fixed-length patches keeps enough local information and does not break the cross-patch temporal dependencies needed for accurate long-range forecasts.
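The premise above can be made concrete with a small patching sketch. It assumes fixed-length patches taken every `stride` steps and end-padding that repeats the last value `stride` times; the function name `make_patches` and the padding rule are assumptions for illustration, not a confirmed description of the released code.

```python
from typing import List


def make_patches(series: List[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split a univariate series into fixed-length, possibly overlapping patches.

    Assumption: the series is padded at the end by repeating its last value
    `stride` times before patching, so the tail of the series is covered.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    if len(series) < patch_len:
        raise ValueError("series is shorter than one patch")
    padded = series + [series[-1]] * stride
    last_start = len(padded) - patch_len
    return [padded[i:i + patch_len] for i in range(0, last_start + 1, stride)]


series = [float(t) for t in range(10)]
patches = make_patches(series, patch_len=4, stride=2)
print(len(patches))   # 5
print(patches[0])     # [0.0, 1.0, 2.0, 3.0]
print(patches[-1])    # [8.0, 9.0, 9.0, 9.0]
```

With `stride` smaller than `patch_len` neighbouring patches overlap, so a pattern that straddles one patch boundary still appears whole in an adjacent patch; with `stride == patch_len` it does not, which is the case the premise is most exposed to.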
What would settle it
A controlled test on a dataset whose key predictive patterns cross the chosen patch boundaries, showing that the patched model loses accuracy relative to an otherwise identical non-patched Transformer given the same look-back window.
Original abstract
We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.
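The quadratic saving claimed in the abstract can be checked with a short worked example. The sketch assumes a patch length of 16, a stride of 8, look-back windows of 336 and 512, and a token count of floor((L - P) / S) + 2 after end-padding; these settings and the helper names are assumptions chosen for illustration.

```python
def num_patches(look_back: int, patch_len: int, stride: int) -> int:
    """Token count after end-padding by `stride` values (assumed formula)."""
    return (look_back - patch_len) // stride + 2


def attention_map_entries(num_tokens: int) -> int:
    """Entries in one full self-attention map over `num_tokens` tokens."""
    return num_tokens * num_tokens


for look_back in (336, 512):
    n = num_patches(look_back, patch_len=16, stride=8)
    pointwise = attention_map_entries(look_back)  # one token per time step
    patched = attention_map_entries(n)            # one token per patch
    print(look_back, n, pointwise // patched)
# 336 42 64
# 512 64 64
```

Under these assumptions the token count shrinks by roughly the stride (a factor of 8), so the attention map shrinks by roughly the stride squared (a factor of 64), and a look-back window of 512 yields the 64 tokens the title alludes to.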
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PatchTST, a Transformer architecture for multivariate long-term time series forecasting and self-supervised representation learning. It segments each univariate series into subseries patches that serve as input tokens and adopts channel-independence so that all channels share the same embedding and Transformer weights. The design is claimed to retain local semantics within patches, quadratically reduce attention cost for a given look-back window, and enable longer history, yielding significant accuracy gains over prior SOTA Transformer baselines. The paper also reports strong fine-tuning results after masked pre-training and successful transfer across datasets.
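The masked pre-training mentioned in the summary can be sketched as follows. This is a minimal illustration, assuming that a random subset of patches is zeroed out and that the reconstruction loss is a mean squared error over the masked patches only; the function names, the 40% mask ratio in the usage lines, and the seed are hypothetical.

```python
import random
from typing import List, Tuple

Patch = List[float]


def mask_patches(
    patches: List[Patch], mask_ratio: float, seed: int
) -> Tuple[List[Patch], List[int]]:
    """Zero out a random subset of patches; return the result and the masked indices."""
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(patches))
    masked_idx = sorted(rng.sample(range(len(patches)), num_masked))
    masked_set = set(masked_idx)
    corrupted = [
        [0.0] * len(p) if i in masked_set else list(p)
        for i, p in enumerate(patches)
    ]
    return corrupted, masked_idx


def masked_mse(
    reconstruction: List[Patch], target: List[Patch], masked_idx: List[int]
) -> float:
    """Mean squared error computed over the masked patches only."""
    if not masked_idx:
        raise ValueError("no masked patches to score")
    errors = [
        (r - t) ** 2
        for i in masked_idx
        for r, t in zip(reconstruction[i], target[i])
    ]
    return sum(errors) / len(errors)


patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
corrupted, masked_idx = mask_patches(patches, mask_ratio=0.4, seed=0)
# A perfect reconstruction scores zero on the masked positions.
perfect_loss = masked_mse(patches, patches, masked_idx)
```

A model would be trained to map `corrupted` back to `patches`; scoring only the masked positions keeps the objective from rewarding a trivial copy of the visible input.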
Significance. If the empirical gains are robust, the work would be significant for time-series modeling by demonstrating that a simple patch-based tokenization (analogous to ViT) combined with channel independence can make Transformers practical for long-horizon forecasting while lowering memory and compute. The public code release and the self-supervised pre-training results further increase its utility. The central empirical claim rests on direct comparisons rather than any parameter-free derivation.
major comments (2)
- [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.
- [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.
minor comments (2)
- [Title / §3.1] The title refers to “64 Words”, but the main text does not explicitly justify the default patch length of 16 or explain that the 64 counts input patches rather than a patch length; a short paragraph in §3.1 would clarify the hyper-parameter choice.
- [Figures in §4] Figure captions and legends in the experimental section would benefit from explicit mention of the number of runs and whether shaded regions represent standard deviation.
Simulated Author's Rebuttal
Thank you for the detailed review and the recommendation for minor revision. We have carefully considered the comments and provide point-by-point responses below. We will update the manuscript accordingly.
-
Referee: [§3] §3 (Patching and embedding): the linear projection that maps each patch to a token embedding is presented as sufficient to retain local semantic information, yet the subsequent self-attention operates only across patches. No analysis is given of information loss inside a patch (e.g., high-frequency or non-stationary components). An ablation replacing the linear layer with a small MLP or varying patch length while holding look-back fixed would test whether this step is load-bearing for the reported gains.
Authors: We thank the referee for pointing this out. The choice of a linear projection for patch embedding is motivated by its efficiency and its ability to preserve local semantics, as validated by the overall performance gains. To investigate potential information loss, we will add an ablation study in the revision that replaces the linear layer with a small MLP. We will also include results varying the patch length while keeping the look-back window fixed to confirm the robustness of our design. revision: yes
-
Referee: [§4] §4 (Experiments): the headline claim of “significant” improvement over SOTA Transformer models requires that baselines use identical look-back windows and that the reported advantage is not driven solely by the ability to use longer history. The manuscript should state the exact look-back lengths used for each baseline and include an ablation that isolates patching from the channel-independent design.
Authors: We agree that it is crucial to use identical look-back windows for fair comparison. In our experiments, the look-back window was set to the same value for PatchTST and all baselines (details are provided in the appendix, and we will move this to the main text). The ability to use longer history is an additional benefit enabled by the reduced complexity of patching, but our primary results use matched settings. We will also add an ablation that isolates the effect of patching from the channel-independent design by comparing against a channel-dependent variant. revision: yes
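The embedding ablation promised in the first response can be sketched as two interchangeable patch embedders. This is a toy illustration in plain Python, assuming a single hidden layer with a ReLU for the MLP variant; the function names, the activation, and the weights are hypothetical and say nothing about what the revised manuscript will use.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a weight of shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]


def embed_linear(patch: Vector, w: Matrix, b: Vector) -> Vector:
    """Baseline: a single linear projection from patch to token embedding."""
    return linear(patch, w, b)


def embed_mlp(patch: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Ablation variant: linear -> ReLU -> linear (activation is an assumption)."""
    return linear(relu(linear(patch, w1, b1)), w2, b2)


patch = [1.0, -2.0]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # patch_len 2 -> embedding dim 3
b_out = [0.0, 0.0, 0.0]
w_hidden = [[1.0, 0.0], [0.0, 1.0]]            # identity hidden layer
b_hidden = [0.0, 0.0]

linear_token = embed_linear(patch, w_out, b_out)
mlp_token = embed_mlp(patch, w_hidden, b_hidden, w_out, b_out)
print(linear_token)  # [1.0, -2.0, -1.0]
print(mlp_token)     # [1.0, 0.0, 1.0]
```

Swapping one embedder for the other while holding the patch length, stride, and look-back window fixed is what would isolate the contribution of the embedding step.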
Circularity Check
No significant circularity: empirical architecture evaluated against external baselines
The paper introduces patching of time series into subseries tokens and channel-independent processing as explicit architectural choices, then reports forecasting accuracy via direct head-to-head experiments on standard public datasets against prior published SOTA Transformer models. No equation or claim reduces a fitted internal parameter to a 'prediction' of itself, no load-bearing premise rests on a self-citation chain, and the design motivations (local semantics, quadratic attention reduction, longer history) are stated independently of the numerical results. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Segmenting a univariate time series into fixed-length patches preserves sufficient local semantic information for downstream forecasting.
- domain assumption: Channel independence with shared weights is a valid modeling choice for multivariate series.
Lean theorems connected to this paper
-
Foundation.SimplicialLedger, Foundation.EightTick: simplicial_loop_tick_lower_bound, eight_tick_forces_D3 (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
segmentation of time series into subseries-level patches which are served as input tokens to Transformer; patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
SeesawNet dynamically balances common and instance-specific dependencies via ASNA in temporal and channel dimensions, outperforming prior methods on non-stationary forecasting benchmarks.
-
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
-
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
-
FactoryBench: Evaluating Industrial Machine Understanding
FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
-
Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation
MELO aggregates base predictors and their multi-scale EWLS adaptations using MLpol to achieve oracle inequalities against best fixed and time-varying predictors in non-stationary settings.
-
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
-
Discrete Prototypical Memories for Federated Time Series Foundation Models
FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.
-
Self-Supervised Foundation Model for Calcium-imaging Population Dynamics
CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.
-
CAARL: In-Context Learning for Interpretable Co-Evolving Time Series Forecasting
CAARL decomposes co-evolving time series into autoregressive segments, builds a temporal dependency graph, serializes it into a narrative, and uses LLMs for interpretable forecasting via chain-of-thought reasoning.
-
M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
M3R improves localized rainfall nowcasting by using weather station time series as queries in multimodal attention to selectively extract precipitation patterns from radar imagery.
-
A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography
TS2TC combines cross-temporal fusion generative anchor pretraining with dual-process transfer to achieve 2.49% lower RMSE than prior methods on PPG parameter estimation using only 10% labeled data.
-
A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset
AgriPriceBD dataset of 1779 daily prices released; naive persistence outperforms deep models like Informer and Time2Vec-Transformer on heterogeneous Bangladeshi commodity series with statistical validation.
-
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
Beyond Similarity: Temporal Operator Attention for Time Series Analysis
Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Risk-Aware Safe Throughput Forecasting for Starlink Networks
BG-CFQS provides risk-aware quantile-based forecasting for Starlink throughput that meets overestimation budgets and reduces positive errors compared to other feasible methods.
-
TSNN: A Non-parametric and Interpretable Framework for Traffic Time Series Forecasting
TSNN matches time series entries to a training-derived memory bank to forecast traffic without any trainable parameters and achieves competitive accuracy on four real-world datasets.
-
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...
-
MedMamba: Recasting Mamba for Medical Time Series Classification
MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.
-
Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook
The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...
-
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...
-
Forecasting Green Skill Demand in the Automotive Industry: Evidence from Online Job Postings
A dataset of 204k skill mentions from Mexican auto job postings yields 274 green skills whose demand is best forecasted by transformer models like FEDformer, with current demand focused on operational sustainability a...
-
Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather
A hybrid deep learning model with physics regularization and SHAP analysis achieves 1.18% MAPE on ERCOT load data and up to 40.5% better performance on extreme events than its individual branches.
-
Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications
The paper benchmarks foundation models like TimesFM and Chronos against baselines on eight forecasting capabilities for power system time series.
-
The CTLNet for Shanghai Composite Index Prediction
CTLNet hybrid model outperforms listed baselines on Shanghai Composite Index prediction task.
Reference graph
Works this paper leans on
-
[1]
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271,
-
[2]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
-
[3]
Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. Advances in Neural Information Processing Systems, 33:17766–17778.
-
[4]
Razvan-Gabriel Cirstea, Chenjuan Guo, Bin Yang, Tung Kieu, Xuanyi Dong, and Shirui Pan. Triformer: Triangular, variable-specific attentions for long sequence multivariate time series forecasting. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1994–2001. doi: 10.24963/ijcai.2022/277. URL https://doi.org/10.24963/ijcai.2022/277.
-
[5]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. URL http://arxiv.org/abs/1810.04805.
-
[6]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on ...
-
[8]
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
-
[9]
Ammus: A survey of transformer-based pretrained models in natural language processing
Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, and Sivanesan Sangeetha. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542.
-
[10]
Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting
Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations.
-
[11]
DeepAR: Probabilistic forecasting with autoregressive recurrent networks
David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191.
-
[12]
Unsupervised representation learning for time series with temporal neighborhood coding
Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. In International Conference on Learning Representations.
Page header of the reviewed paper: Published as a conference paper at ICLR 2023.
-
[13]
Instance Normalization: The Missing Ingredient for Fast Stylization
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
-
[14]
Transformers in time series: A survey
Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.
-
[15]
Are transformers effective for time series forecasting?
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? arXiv preprint arXiv:2205.13504,
-
[16]
A Appendix. A.1 Experimental Details. A.1.1 Datasets. We use 8 popular multivariate datasets provided in (Wu et al.,
-
[17]
for forecasting and representation learning. Weather (footnote 3) dataset collects 21 meteorological indicators in Germany, such as humidity and air temperature. Traffic (footnote 4) dataset records the road occupancy rates from different sensors on San Francisco freeways. Electricity (footnote 5) is a dataset that describes 321 customers’ hourly electricity consumption. ILI (footnote 6) dataset col...
-
[18]
However, this can possibly lead to an under-estimation of the baselines
The reason for this difference is that Transformer-based baselines easily overfit when the look-back window is long, while DLinear tends to underfit. However, this can possibly lead to an under-estimation of the baselines. To address this issue, we re-run FEDformer, Autoformer and Informer by ourselves for six different look-back windows L ∈ {24, 48, 96, 192, 3...
-
[19]
and DeepAR (Salinas et al., 2020). However, they are demonstrated to be not as effective as Transformer-based models in long-term forecasting tasks (Zhou et al., 2021; Wu et al., 2021), thus we don’t include them in our baselines. Footnotes: (3) https://www.bgc-jena.mpg.de/wetter/ (4) https://pems.dot.ca.gov/ (5) https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagra...
-
[20]
PatchTST provides the best forecasting both in terms of scale and bias
Here, we predict 192 steps ahead on Weather and Electricity datasets and 60 steps ahead on ILI dataset. PatchTST provides the best forecasting both in terms of scale and bias. Figure 3: Visualization of 192-step forecasting on {Weather, Traffic} datasets with the look-back window L = 336 and 60-step forecasting on ILI dataset with L = 104. PatchTST (in red) ...
-
[21]
A.4.2 Varying Look-back Window. Here we provide a full benchmark of quantitative results in Table 9 for varying look-back window in supervised PatchTST/42 regarding Figure 2 in the main text. Generally speaking, our model gains performance improvement with increasing look-back window, which shows the effectiveness of our model in learning information from...
-
[22]
A full benchmark regarding Table
Models: PatchTST variants (P+CI, CI, P, Original) and FEDformer. Each cell is MSE / MAE.
Weather   P+CI            CI              P               Original        FEDformer
96        0.152 / 0.199   0.164 / 0.213   0.168 / 0.223   0.177 / 0.236   0.238 / 0.314
192       0.197 / 0.243   0.205 / 0.250   0.213 / 0.262   0.221 / 0.270   0.275 / 0.329
336       0.249 / 0.283   0.255 / 0.289   0.266 / 0.300   0.271 / 0.306   0.339 / 0.377
720       0.320 / 0.335   0.327 / 0.343   0...
-
[23]
The best results are in bold and the second best are underlined
Models: PatchTST/64 (+in), PatchTST/64 (-in), PatchTST/42 (+in), PatchTST/42 (-in), FEDformer, Autoformer, Informer. Each cell is MSE / MAE.
Weather   PatchTST/64 (+in)   PatchTST/64 (-in)   PatchTST/42 (+in)   PatchTST/42 (-in)   FEDformer       Autoformer      Informer
96        0.149 / 0.198       0.161 / 0.219       0.152 / 0.199       0.156 / 0.210       0.238 / 0.314   0.249 / 0.329   0.354 / 0.405
192       0.194 / 0.241       0.201 / 0.254       0.197 / 0.243       0.199 / 0.250       0...
-
[24]
The mean and standard deviation of the results are reported in Table
To examine the robustness of our results, we train the supervised PatchTST model with 5 different random seeds: 2019, 2020, 2021, 2022, 2023 and calculate the MSE and MAE scores with each selected seed. The mean and standard deviation of the results are reported in Table
-
[25]
We also observe that the variance is insignificant especially on large datasets while higher variance can be seen on smaller datasets.
Models: PatchTST (Fine-tuning, Lin. Prob., Sup.), DLinear, FEDformer, Autoformer, Informer. Each cell is MSE / MAE.
Weather 96: 0.144 / 0.193 (Fine-tuning), 0.158 / 0.2...
-
[26]
The best results are in bold
The best results are in bold. Models: PatchTST (Fine-tuning, Lin. Prob., Sup.), DLinear, FEDformer, Autoformer, Informer. Each cell is MSE / MAE.
Weather   Fine-tuning     Lin. Prob.      Sup.            DLinear         FEDformer       Autoformer      Informer
96        0.145 / 0.195   0.163 / 0.216   0.152 / 0.199   0.176 / 0.237   0.238 / 0.314   0.249 / 0.329   0.354 / 0.405
192       0.193 / 0.243   0.205 / 0.252   0.197 / 0.243   0.220 / 0.282   0.275 / 0.329   0.325 / 0.370   0.419 / 0.434
336       0...
-
[27]
The best results are in bold. A.6.2 Results with Different Model Parameters. To see whether PatchTST is sensitive to the choice of Transformer settings, we perform another set of experiments with varying model parameters. We vary the number of Transformer layers L = {3, 4, 5} and select the model dimension D = {128, 256} while the inner-layer of the feed forward ...
-
[28]
The model is run with supervised PatchTST/42 to forecast 96 steps
are labeled in order from 1 to 6 in the figure. The model is run with supervised PatchTST/42 to forecast 96 steps. For Traffic and Electricity datasets, we reduce the maximum number of epochs to 50 to save computational time. A.7 Channel-Independence Analysis. Intuitively, channel-mixing models should outperform the channel-independent ones since they have ...
-
[29]
Channel-mixing models show overfitting after a few initial epochs, while channel-independent models continue optimizing the loss with more training epochs. The best trained models are determined by validation data, which are approximately the lowest points in the test loss curves. It is clear that the forecasting performance of channel-independent models...
-
[30]
The channel-independent technique can improve the forecasting performance of those models generally. Although they are still not able to outperform PatchTST which is based on vanilla attention mechanism, we believe that more performance boost and computational reduction can be obtained with more advanced attention designs incorporating the channel-indepen...
-
[31]
Left Panel: Test loss vs train size
We plot the mean values and error bars with 5 different random seeds: {2019, 2020, 2021, 2022, 2023}. Left Panel: Test loss vs train size. Here, train size denotes the fraction of the training data that is used to learn the model from scratch. Channel-independence contributes to a quicker convergence as more training data is available. Right Panel: Test...