Recognition: 2 theorem links
· Lean Theorem
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
Pith reviewed 2026-05-13 18:49 UTC · model grok-4.3
The pith
Inverting the Transformer's dimensions lets attention model correlations between variables directly, reaching state-of-the-art accuracy on real-world time series forecasting tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By inverting the input so that attention operates over variate tokens (each representing the full time series of one variable) while the feed-forward network processes each variate token independently, the architecture learns variate-centric representations and multivariate correlations. This yields state-of-the-art results on real-world forecasting benchmarks, improves generalization across different variates, and enables effective use of arbitrary lookback windows.
What carries the argument
Inverted attention applied to variate tokens formed from individual series histories, paired with per-variate feed-forward networks.
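The mechanism can be sketched at shape level. The following is a minimal NumPy sketch, not the authors' implementation: all weight names and dimensions are illustrative, and layer normalization, multi-head splitting, and the projection head are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def inverted_block(x, w_embed, wq, wk, wv, w_ffn1, w_ffn2):
    """One inverted Transformer block (shape-level sketch).

    x: (N, T) -- N variates, each a lookback series of length T.
    The embedding maps each variate's full history to one D-dim token,
    attention mixes the N variate tokens, and the FFN is applied to
    each token independently.
    """
    h = x @ w_embed                    # (N, D): one token per variate
    q, k, v = h @ wq, h @ wk, h @ wv   # (N, D) each
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (N, N) variate-variate map
    h = h + attn @ v                   # attention over variates, not time steps
    h = h + np.maximum(h @ w_ffn1, 0) @ w_ffn2  # per-token FFN (ReLU)
    return h, attn

# Toy shapes: T=96 lookback, N=7 variates, D=16 token dimension.
rng = np.random.default_rng(0)
T, N, D = 96, 7, 16
x = rng.normal(size=(N, T))
shapes = [(T, D), (D, D), (D, D), (D, D), (D, 4 * D), (4 * D, D)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
h, attn = inverted_block(x, *params)
print(h.shape, attn.shape)  # (7, 16) (7, 7)
```

The structural point: the (N, N) attention map relates variates to variates, so its entries can be read as cross-variable affinities, while the lookback length T enters only through the embedding layer.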
If this is right
- Attention maps become more interpretable because they directly reflect relationships among variables.
- Longer lookback windows can be used without the quadratic growth in attention cost or the accuracy degradation seen with temporal tokens.
- The model generalizes better when the number of input variables differs across training and test sets.
- Transformers regain competitiveness with recent linear forecasters on multivariate real-world data.
Where Pith is reading between the lines
- The inversion suggests that temporal modeling can be offloaded to simpler mechanisms while attention focuses on cross-variable links.
- This dimension swap may apply to other multivariate sequential tasks where feature correlations dominate over long-range temporal ones.
- Controlled tests on synthetic data with tunable cross-variable correlation strength would show exactly when the inversion provides the largest benefit.
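The last point can be made concrete. Below is one hypothetical way to build such a synthetic benchmark; `synth_multivariate` and its mixing scheme are an illustration, not from the paper. Each variate blends a shared latent driver (weight rho) with independent noise, so the cross-variable correlation strength is a single tunable knob.

```python
import numpy as np

def synth_multivariate(n_vars, n_steps, rho, seed=0):
    """Generate multivariate series with tunable cross-variable correlation.

    Each variate mixes a shared latent signal (weight rho) with
    independent noise (weight sqrt(1 - rho^2)); the expected pairwise
    correlation between variates is roughly rho^2.
    """
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=n_steps)           # common driver across variates
    noise = rng.normal(size=(n_vars, n_steps))  # per-variate independent part
    return rho * shared + np.sqrt(1.0 - rho**2) * noise  # (n_vars, n_steps)

def mean_abs_offdiag_corr(x):
    """Average absolute correlation between distinct variates."""
    c = np.corrcoef(x)
    return np.abs(c[~np.eye(len(c), dtype=bool)]).mean()

weak = synth_multivariate(8, 2000, rho=0.1)    # near-independent variates
strong = synth_multivariate(8, 2000, rho=0.9)  # strongly coupled variates
print(mean_abs_offdiag_corr(weak) < mean_abs_offdiag_corr(strong))  # True
```

Sweeping rho and comparing an inverted model against a temporal-token baseline on such data would locate the regime where the inversion pays off.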
Load-bearing premise
That mixing multiple variates into each temporal token inherently prevents learning distinct variate representations, while simply swapping the dimensions captures the necessary correlations without losing critical temporal structure.
What would settle it
A head-to-head comparison on a dataset with strong temporal patterns but weak cross-variable correlations where iTransformer shows no accuracy gain over a standard Transformer or linear baseline.
Original abstract
The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting. Code is available at this repository: https://github.com/thuml/iTransformer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes iTransformer, which inverts the standard Transformer for time series forecasting: time points of individual series are embedded as variate tokens, self-attention is applied across variates to capture multivariate correlations, and feed-forward networks are applied per variate token to learn nonlinear temporal representations. The authors claim this yields state-of-the-art results on challenging real-world datasets while improving generalization across variates and enabling better use of arbitrary lookback windows, positioning it as a competitive backbone for forecasting.
Significance. If the empirical claims hold under rigorous verification, the work is significant because it demonstrates that a minimal, component-preserving inversion of the Transformer can mitigate key limitations (performance drop and quadratic cost with long lookback windows, fused variate embeddings) without introducing new modules. This strengthens the case for Transformer-based forecasters against linear alternatives and supplies a reproducible starting point via the linked code repository.
major comments (2)
- [Abstract] The central claim that iTransformer 'achieves state-of-the-art on challenging real-world datasets' is load-bearing yet unsupported by any description of baselines, metrics, data splits, or hyperparameter controls, preventing verification of the reported gains.
- [Proposed Method] Inversion description: relegating all temporal modeling to a position-wise FFN applied to each variate's embedded lookback vector assumes this fixed-capacity network can encode arbitrarily long-range intra-series dependencies without temporal attention or recurrence; this assumption directly supports the 'arbitrary lookback windows' claim, but the paper lacks ablations or analysis showing when it holds versus when temporal attention would be superior.
minor comments (2)
- [Abstract] The term 'arbitrary lookback windows' should be qualified with respect to memory and compute scaling to avoid overstatement.
- Notation: ensure consistent use of 'variate token' versus 'temporal token' across all sections and figures.
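The requested scaling qualification can be stated concretely. A back-of-envelope sketch (illustrative only, not from the paper): one self-attention layer scores token pairs, so the pair count is quadratic in the number of tokens, which is the lookback length T for temporal tokens but the variate count N for inverted tokens.

```python
def attn_pair_count(tokens):
    """Number of query-key pairs scored by one self-attention layer."""
    return tokens * tokens

def standard_pairs(T, N):
    # Temporal tokens: one token per time step -> quadratic in lookback T.
    return attn_pair_count(T)

def inverted_pairs(T, N):
    # Variate tokens: one token per series -> quadratic in N, flat in T.
    return attn_pair_count(N)

N = 7  # e.g. the ETT datasets have 7 variates
for T in (96, 336, 720):
    print(T, standard_pairs(T, N), inverted_pairs(T, N))
# Growing the lookback inflates the standard pair count quadratically
# while the inverted count stays fixed at N * N = 49.
```

The flip side, which the qualification should also note: the per-variate embedding still scales linearly in T, and on high-dimensional datasets (e.g. 862 variates in Traffic) the N-squared term dominates instead.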
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing where revisions are warranted and providing clarifications supported by the manuscript's experimental sections.
Point-by-point responses
-
Referee: [Abstract] The central claim that iTransformer 'achieves state-of-the-art on challenging real-world datasets' is load-bearing yet unsupported by any description of baselines, metrics, data splits, or hyperparameter controls, preventing verification of the reported gains.
Authors: We agree that the abstract would benefit from additional context to make the SOTA claim more verifiable at a glance. In the revised version, we will expand the abstract to briefly note the use of standard real-world multivariate forecasting benchmarks (e.g., ETT, Electricity, Traffic), evaluation via MSE and MAE, and comparisons against recent Transformer variants and linear models under consistent protocols. Full details on data splits, lookback/prediction lengths, and hyperparameter settings remain in Section 4 and the appendix; the abstract revision will improve accessibility without altering the paper's technical content.
Revision: yes
-
Referee: [Proposed Method] Inversion description: relegating all temporal modeling to a position-wise FFN applied to each variate's embedded lookback vector assumes this fixed-capacity network can encode arbitrarily long-range intra-series dependencies without temporal attention or recurrence; this assumption directly supports the 'arbitrary lookback windows' claim, but the paper lacks ablations or analysis showing when it holds versus when temporal attention would be superior.
Authors: The inversion design intentionally assigns multivariate correlation modeling to attention while delegating per-variate temporal representation learning to the FFN, which operates on the full lookback embedding and benefits from the increased capacity per token. This choice is empirically validated by superior performance on long lookback windows in our experiments (Section 4.3), where iTransformer maintains accuracy as sequence length grows without the quadratic cost of temporal attention. We acknowledge that dedicated ablations contrasting FFN versus temporal attention across varying lookback lengths would strengthen the analysis. We will add such experiments in the revision, including cases where temporal attention might retain an edge, to better delineate the conditions favoring the inverted architecture.
Revision: yes
Circularity Check
No significant circularity; empirical architectural proposal stands on its own
full rationale
The paper presents iTransformer as a direct repurposing of standard Transformer components (attention on inverted variate tokens, FFN per variate for temporal modeling) without any derivation chain, equations, or fitted parameters that reduce to inputs by construction. Claims of SOTA performance and better lookback utilization rest on experimental results across real-world datasets rather than self-referential definitions or load-bearing self-citations. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a way that collapses the central argument. The proposal is self-contained as an empirical alternative to temporal-token Transformers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Embedding individual series time points into variate tokens enables the attention mechanism to capture multivariate correlations effectively.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi — unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
SeesawNet dynamically balances common and instance-specific dependencies via ASNA in temporal and channel dimensions, outperforming prior methods on non-stationary forecasting benchmarks.
-
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
-
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
-
From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting
A new evaluation framework shows that blood glucose forecasting models with high overall accuracy often fail at timely hypoglycemia detection in high-risk periods and at predicting effects of changed insulin doses.
-
Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation
MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
-
Self-Supervised Foundation Model for Calcium-imaging Population Dynamics
CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...
-
XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction
XDecomposer uses set prediction and phase-query decomposition to jointly identify phases and reconstruct multiphase PXRD patterns without priors.
-
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.
-
GeoCert: Certified Geometric AI for Reliable Forecasting
GeoCert uses hyperbolic geometry to unify forecasting with physical reasoning and built-in formal certification, claiming major gains in accuracy and efficiency.
-
Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
ESFM is a single open foundation model that unifies heterogeneous Earth data sources and forecasts missing regions while preserving inter-variable physical relationships.
-
CSRA: Controlled Spectral Residual Augmentation for Robust Sepsis Prediction
CSRA applies input-adaptive spectral residual perturbations to multi-system ICU time series, trained end-to-end with the predictor and consistency losses, yielding 10.2% MSE and 3.7% MAE reductions on MIMIC-IV sepsis ...
-
M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
M3R improves localized rainfall nowcasting by using weather station time series as queries in multimodal attention to selectively extract precipitation patterns from radar imagery.
-
Frequency-aware Decomposition Learning for Sensorless Wrench Forecasting on a Vibration-rich Hydraulic Manipulator
FDN uses spectral decomposition, asymmetric heads for deterministic and probabilistic wrench components, and frequency-aware filtering to forecast high-frequency wrench from proprioception, outperforming baselines on ...
-
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
DSPR decouples statistical temporal evolution from physics-informed residual dynamics via an adaptive window for transport delays and a physics-guided dynamic graph to achieve accurate, physically plausible forecasts ...
-
UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
UniMamba integrates Mamba state-space dynamics with attention layers and transforms like FFT-Laplace to outperform prior models on multivariate time series forecasting benchmarks.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
Risk-Aware Safe Throughput Forecasting for Starlink Networks
BG-CFQS provides risk-aware quantile-based forecasting for Starlink throughput that meets overestimation budgets and reduces positive errors compared to other feasible methods.
-
A Market-Rule-Informed Neural Network for Efficient Imbalance Electricity Price Forecasting
A market-rule-informed neural network for imbalance electricity price forecasting matches generic deep learning accuracy while using substantially fewer parameters and less training time.
-
CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models
CombinationTS decomposes time-series models into modules and finds that good embeddings let simple identity encoders match complex ones, while input structural priors give better performance-stability trade-offs than ...
-
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distor...
-
MedMamba: Recasting Mamba for Medical Time Series Classification
MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.
-
Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
SentryFuse delivers modality-aware zero-shot pruning and sparse attention that improves accuracy by 12.7% on average and up to 18% under sensor dropout while cutting memory 28.2% and latency up to 1.63x across multimo...
-
Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models
Transformer models with person ID embeddings generate plausible reactive motions from paired boxing interaction data, with the simple Transformer outperforming iTransformer and Crossformer in stability and avoiding po...
-
Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis
ARIMA and naive econometric models outperform machine learning and deep learning methods for US Treasury yield curve forecasting over 47 years, except in one time block, while TimeGPT, LGBM, and RNNs lead among ML approaches.
-
The CTLNet for Shanghai Composite Index Prediction
CTLNet hybrid model outperforms listed baselines on Shanghai Composite Index prediction task.
Reference graph
Works this paper leans on
-
[1]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. https://arxiv.org/pdf/1607.06450.pdf,
-
[2]
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2,
-
[3]
Long-term forecasting with tide: Time-series dense encoder,
Abhimanyu Das, Weihao Kong, Andrew Leach, Rajat Sen, and Rose Yu. Long-term forecasting with tide: Time-series dense encoder. arXiv preprint arXiv:2304.08424,
-
[4]
Simmtm: A simple pre-training framework for masked time-series modeling
Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. Simmtm: A simple pre-training framework for masked time-series modeling. arXiv preprint arXiv:2302.00861,
-
[5]
Lu Han, Han-Jia Ye, and De-Chuan Zhan. The capacity and robustness trade-off: Revisiting the channel independent strategy for multivariate time series forecasting. arXiv preprint arXiv:2304.05206,
-
[6]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
-
[7]
Reversible instance normalization for accurate time-series forecasting against distribution shift
Published as a conference paper at ICLR 2024.
Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. ICLR,
-
[8]
Jianxin Li, Xiong Hui, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. arXiv: 2012.07436,
-
[9]
Revisiting long-term time series forecasting: An investigation on linear mapping,
Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721,
-
[10]
Scinet: time series modeling and forecasting with sample convolution and interaction
Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. Scinet: time series modeling and forecasting with sample convolution and interaction. NeurIPS, 2022a.
Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Rethinking the stationarity in time series forecasting. NeurIPS, 2022b.
Yong Liu, Cheny...
-
[11]
Are transformers effective for time series forecasting? AAAI,
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? AAAI,
-
[12]
A Implementation Details. A.1 Dataset Descriptions. We conduct experiments on 7 real-world datasets to evaluate the performance of the proposed iTransformer, including (1) ETT (Li et al.,
-
[13]
contains 7 factors of electricity transformer from July 2016 to July
-
[14]
collects the panel data of daily exchange rates from 8 countries from 1990 to
-
[15]
collects hourly road occupancy rates measured by 862 sensors of San Francisco Bay area freeways from January 2015 to December
-
[16]
(7) PEMS contains the public traffic network data in California collected by 5-minute windows
records the solar power production of 137 PV plants in 2006, which are sampled every 10 minutes. (7) PEMS contains the public traffic network data in California collected by 5-minute windows. We use the same four public subsets (PEMS03, PEMS04, PEMS07, PEMS08) adopted in SCINet (Liu et al., 2022a). Apart from the public datasets widely used as forecasting...
-
[17]
It includes 6 sub-datasets, which are divided according to diverse transaction domains. We follow the same data processing and train-validation-test set split protocol used in TimesNet (Wu et al., 2023), where the train, validation, and test datasets are strictly divided according to chronological order to make sure there are no data leakage issues. As ...
-
[18]
A.2 Implementation Details. Algorithm 1: iTransformer, overall architecture.
Require: input lookback time series X ∈ R^{T×N}; input length T; predicted length S; variate number N; token dimension D; iTransformer block number L.
1: X = X.transpose ▷ X ∈ R^{N×T}
2: ▷ A multi-layer perceptron works on the last dimension to embed series into variate tokens.
3: H...
-
[19]
Table 5: Robustness of iTransformer performance
We also report the standard deviation of iTransformer performance under five runs with different random seeds in Table 5, which shows that the performance of iTransformer is stable. Table 5: Robustness of iTransformer performance; the results are obtained from five random seeds (per-horizon MSE and MAE for ECL, ETTh2, and Exchange; table truncated).
-
[20]
D Model Efficiency. We comprehensively compare the forecasting performance, training speed, and memory footprint of the following models: iTransformer, iTransformer with our efficient training strategy, and iTransformer. Table 6: Full results of the ablation on iTransformer. We apply different components on th...
-
[21]
The results are recorded with the official model configuration and the same batch size
and TiDE (Das et al., 2023); Transformers: Transformer (Vaswani et al., 2017), PatchTST (Nie et al., 2023), and Crossformer (Zhang & Yan, 2023). The results are recorded with the official model configuration and the same batch size. In Figure 10, we compare the efficiency under two representative datasets (21 variates in Weather and 862 in Traffic) with 96...
-
[22]
We provide the Pearson correlation coefficients of each variate of the raw series by the following equation: $\rho_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$, where $x_i, y_i \in \mathbb{R}$ run through all time points of the paired variates to be correlated. All the cases have distinct multivariate correlations in the lookback and forecast window because the da...
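The coefficient quoted above is the standard Pearson formula. A small sketch (variable names illustrative) computes it directly and checks it against NumPy's built-in:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 0.7 * a + 0.3 * rng.normal(size=500)  # a correlated pair of variates
print(round(pearson(a, b), 4) == round(np.corrcoef(a, b)[0, 1], 4))  # True
```

Applied pairwise over a dataset's variates, this yields exactly the multivariate correlation maps the paper compares attention scores against.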
-
[23]
E.2 Visualization of Prediction Results. To provide a clear comparison among different models, we list supplementary prediction showcases of four representative datasets in Figures 13-16, which are given by the following models: iTransformer, PatchTST (Nie et al., 2023), DLinear (Zeng et al., 2023), Crossformer (Zhang & Yan, 2023), Autoformer (Wu et al.,...
-
[24]
Table 7: Full performance comparison between the vanilla Transformer and the proposed iTransformer
Consistent and great promotions can be achieved, indicating that the attention and feed-forward network on the inverted dimensions greatly empower Transformers in multivariate time series forecasting, leaving an instructive direction to build up the foundation model of extensive time series data. Table 7: Full performance comparison between the vanilla Tr...
-
[25]
The results demonstrate that our iTransformers framework can consistently promote these Transformer variants, and take advantage of the booming efficient attention mechanisms. Table 8: Full results of Transformers with our inverted framework. Flashformer means Transformer equipped with the hardware-accelerated FlashAttention (Dao et al., 2022). Models Tra...
-
[26]
These works strive to reveal the temporal dependency better
and Stationarization (Liu et al., 2022b) have been widely applied for the distribution shift (non-stationarity) as architecture-free techniques. These works strive to reveal the temporal dependency better. This is accomplished by layer normalization in iTransformer and still leaves further improvement for us to tackle the distribution shift. G.2 Discussi...
-
[27]
can reveal measurement-free relationships among the time points of the same variate. More advanced linear forecasters focus on structural point-wise modeling (Oreshkin et al., 2019; Liu et al., 2022a; 2023). By contrast, iTransformer is particularly good at forecasting high-dimensional time series (numerous variates with complicated correlations, which ...
-
[28]
The problem can be alleviated by expanding the receptive field
Transformer treats time series as the natural language but the time- aligned embedding may bring about risks in multi-dimensional series. The problem can be alleviated by expanding the receptive field. Although it is believed that Patching (Zhang & Yan, 2023; Nie et al.,
discussion (0)