Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity

Erpeng Qi; Peng Wang; Wei Wang; Xun Lu; Xu Zhang; Yunkai Chen; Yunzhi Wu; Zhengang Huang; Zhongya Xue

arxiv: 2606.20010 · v1 · pith:4HSLBTZNnew · submitted 2026-06-18 · 💻 cs.LG

Pith reviewed 2026-06-26 17:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords: time series forecasting, scale heterogeneity, adaptive scaling, normalization, deep learning, financial time series, forecasting models

The pith

A self-adaptive scale-handling module enables joint forecasting of time series that differ by orders of magnitude while keeping semantic meaning intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Time series forecasting research usually assumes that all series have similar value ranges. In practice, many series share temporal patterns but differ by orders of magnitude, so modeling them jointly would use the data more efficiently. Existing scaling methods fall short in two ways: global normalization compresses low-scale signals, while window-based scaling blurs the distinction between series and amplifies errors when predictions are scaled back (inverse-scaling errors). The paper introduces a self-Adaptive Scale-handling (AS) module that learns an input-specific scale factor through a neural network and then decides whether to use the learned factor or keep the original one. When added to standard forecasting models, the module improves accuracy on real fund-sales data that exhibit this scale variation.

Core claim

The self-Adaptive Scale-handling (AS) module learns adaptive scale factors tailored to each input, preserving semantic discriminability while reducing inverse-scaling errors. AS consists of Scale Calibrating (SC), which calibrates prior mean scaling factors through neural networks, and Scaling Selection (SS), which decides whether to apply calibration or retain the original factor, avoiding over-calibration.

What carries the argument

The self-Adaptive Scale-handling (AS) module, built from Scale Calibrating (SC) and Scaling Selection (SS) components that produce and gate per-input scale factors.
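
To make the data flow concrete, here is a minimal sketch of how a calibrate-then-select scale factor could wrap a base forecaster. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no equations, so the mean-absolute-value prior, the multiplicative `exp` correction, the hard 0.5 threshold, and every function name below are hypothetical. In the paper the correction and the selection decision come from trained networks; here they are plain arguments.

```python
import math


def prior_scale(window, eps=1e-8):
    """Prior factor: mean absolute value of the input window (assumed form)."""
    return sum(abs(x) for x in window) / len(window) + eps


def calibrate(prior, log_correction):
    """SC stand-in: adjust the prior multiplicatively so it stays positive."""
    return prior * math.exp(log_correction)


def select(prior, calibrated, p_use_calibrated):
    """SS stand-in: keep the original factor unless calibration is favored."""
    return calibrated if p_use_calibrated >= 0.5 else prior


def forecast_with_as(window, base_model, log_correction, p_use_calibrated):
    """Scale the input, run the base forecaster, then undo the scaling."""
    prior = prior_scale(window)
    factor = select(prior, calibrate(prior, log_correction), p_use_calibrated)
    scaled_input = [x / factor for x in window]
    scaled_forecast = base_model(scaled_input)
    return [y * factor for y in scaled_forecast], factor


def repeat_last(scaled_input, horizon=2):
    """Toy base model (hypothetical): repeat the last observed value."""
    return [scaled_input[-1]] * horizon


# Low selection score: the prior factor is retained and calibration is skipped.
forecast, factor = forecast_with_as([100.0, 200.0, 300.0], repeat_last, 0.1, 0.2)
```

The toy base model is scale-free, so the chosen factor does not change this particular forecast. With a learned, nonlinear forecaster the factor determines the range of values the network sees, which is where calibration and selection would matter.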

If this is right

Existing time series forecasting models gain measurable performance when the AS module is inserted without architectural redesign.
The module reduces the inverse-scaling errors that arise from window-based scaling methods.
Semantic discriminability between series is retained better than under global normalization.
Joint training becomes practical for collections of series that share patterns but span wide magnitude ranges.
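
The inverse-scaling error in the second point can be illustrated with a small worked example. This is a sketch under assumptions: window-based scaling is taken to mean dividing a window by its mean absolute value, and the helper names and the 0.05 error are made up for illustration.

```python
def window_scale(window, eps=1e-8):
    """Divide a window by its own mean absolute value; return values and factor."""
    factor = sum(abs(x) for x in window) / len(window) + eps
    return [x / factor for x in window], factor


def inverse_scale(values, factor):
    """Map scaled values back to original units."""
    return [v * factor for v in values]


small_scaled, small_factor = window_scale([1.0, 2.0, 3.0])
large_scaled, large_factor = window_scale([1000.0, 2000.0, 3000.0])

# After scaling, the two windows are numerically almost identical, so the
# model can no longer tell the low-scale series from the high-scale one.
# A fixed error in scaled space is multiplied by the factor on the way back.
scaled_error = 0.05
small_error = scaled_error * small_factor   # error in original units, small series
large_error = scaled_error * large_factor   # same scaled error, about 1000x larger
```

The same mistake in scaled space therefore costs far more on the high-magnitude series, and any error in the factor itself is carried through the multiplication as well.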

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-input calibration logic could be tested on other tasks that require handling inputs of widely varying magnitudes.
Replacing fixed preprocessing steps with learned selection might simplify pipelines that currently tune normalization separately per dataset.
Controlled synthetic experiments that vary only the scale spread while holding patterns fixed could isolate how much the module contributes.
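
The synthetic experiment suggested in the last point could start from a generator like the one below. It is a hypothetical sketch, not something the paper provides: the sinusoidal pattern, the log-uniform scale draw, the multiplicative noise, and the name `make_series` are all assumptions chosen for illustration.

```python
import math
import random


def make_series(n_series, length, scale_orders, period=7, noise=0.05, seed=0):
    """Series sharing one seasonal shape, with scales spread over `scale_orders` decades."""
    rng = random.Random(seed)
    pattern = [1.0 + 0.5 * math.sin(2 * math.pi * t / period) for t in range(length)]
    series, scales = [], []
    for _ in range(n_series):
        # Log-uniform scale in [1, 10 ** scale_orders].
        scale = 10.0 ** rng.uniform(0.0, scale_orders)
        scales.append(scale)
        series.append([scale * p * (1.0 + rng.gauss(0.0, noise)) for p in pattern])
    return series, scales


# Same seed, same pattern, same noise draws: only the scale spread differs.
homogeneous, _ = make_series(5, 28, scale_orders=0.0)
heterogeneous, scales = make_series(5, 28, scale_orders=4.0)
```

Training the same forecaster with and without the module on both collections would show whether its benefit grows with the scale spread.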

Load-bearing premise

Different time series share similar temporal patterns even when their numerical values differ by orders of magnitude.

What would settle it

Running the AS module inside several base forecasting models on scale-heterogeneous datasets and observing no consistent accuracy gain would falsify the central claim.

Original abstract

Current time series forecasting (TSF) research predominantly focuses on scale-homogeneous data, where different time series share similar numerical magnitude ranges. However, in real-world industrial scenarios such as financial product sales, different time series often differ by orders of magnitude (scale heterogeneity). Since these series share similar temporal patterns, joint modeling is desirable for better data utilization, yet existing scaling methods either compress low-scale signals (global normalization) or destroy semantic discriminability and amplify inverse-scaling errors (window-based scaling). This paper proposes a self-Adaptive Scale-handling (AS) module that learns adaptive scale factors tailored to each input, preserving semantic discriminability while reducing inverse-scaling errors. AS consists of Scale Calibrating (SC), which calibrates prior mean scaling factors through neural networks, and Scaling Selection (SS), which decides whether to apply calibration or retain the original factor, avoiding over-calibration. Experiments on real-world fund sales datasets from Ant Fortune and Alipay show that AS seamlessly integrates into popular TSF models and consistently improves their performance. The code and dataset are available at the link https://github.com/Meteor-Stars/ASTSF.
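
The claim that global normalization compresses low-scale signals can be checked with a few lines of arithmetic. The two toy series and the helper names below are invented for illustration; pooled z-scoring is used as one common form of global normalization, which may differ from the exact baseline in the paper.

```python
import statistics

small = [1.0, 2.0, 1.0, 2.0]               # low-scale series
large = [1000.0, 2000.0, 1000.0, 2000.0]   # same shape, 1000x larger

# Global normalization: one mean and one standard deviation for all data.
pooled = small + large
mu = statistics.fmean(pooled)
sigma = statistics.pstdev(pooled)


def global_normalize(series):
    """Z-score a series with the pooled statistics."""
    return [(x - mu) / sigma for x in series]


def value_range(series):
    """Peak-to-peak range, used here as a crude measure of signal size."""
    return max(series) - min(series)


small_range = value_range(global_normalize(small))
large_range = value_range(global_normalize(large))
```

Both series have the same shape, yet after pooled normalization the low-scale one occupies a range about 1000 times narrower, so its fluctuations contribute very little to a squared-error loss.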

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The AS module gives a practical way to adapt scales for joint forecasting on magnitude-heterogeneous series, but the shared-patterns premise lacks any supporting measurement.

The paper adds a self-adaptive scale module: a neural network calibrates mean scale factors, and a selection step then decides whether to apply the adjustment. The goal is to keep joint modeling viable when series differ by orders of magnitude while avoiding the usual problems of global or window-based scaling.

What the work does well is release both code and the actual Ant Fortune and Alipay fund sales datasets. That lets others test the integration into existing forecasters and see whether the reported gains hold. The two-part design (calibration then selection) is a straightforward engineering response to over-calibration risk.

The soft spot is the core premise. The abstract claims the series share similar temporal patterns once scale is removed, which is why joint modeling makes sense, yet no similarity numbers, DTW distances, or shape clustering results are shown. Without that check, it is possible the gains come mainly from per-input normalization rather than any cross-series pattern sharing. The abstract also gives no ablation numbers or run-to-run variance, so the value of the selection step over calibration alone stays unclear.

This paper is aimed at practitioners who already train on mixed-magnitude industrial series and want a drop-in fix. A reader focused on deployment rather than new theory would get the most out of it.

I would send it to peer review. The problem is common enough that even an incremental module with public artifacts merits a full look at the experiments.

Referee Report

2 major / 1 minor

Summary. The paper proposes a self-Adaptive Scale-handling (AS) module for time series forecasting under scale heterogeneity. It consists of Scale Calibrating (SC) via neural networks and Scaling Selection (SS) to decide on calibration, allowing joint modeling of series that differ by orders of magnitude while preserving semantic discriminability. Experiments on Ant Fortune and Alipay fund sales datasets show consistent improvements when integrated into existing TSF models; code and data are released.

Significance. If the central claims hold, the work addresses a practical gap in industrial TSF where global or window-based scaling fails on heterogeneous scales. Open-sourcing the code and dataset strengthens reproducibility. The approach could enable better data utilization in joint modeling scenarios, but its impact depends on whether the adaptive factors demonstrably exploit shared patterns rather than acting as per-series normalizers.

major comments (2)

[Abstract] The motivation for joint modeling rests on the unverified premise that scale-heterogeneous series 'share similar temporal patterns,' yet no quantitative support (e.g., cross-series DTW distances, normalized autocorrelation similarity, or shape-feature clustering after scale removal) is provided. If this premise does not hold, reported gains could be explained by per-series scaling alone.
[Abstract] The SC and SS components are described only at a high level ('learns adaptive scale factors,' 'calibrates prior mean scaling factors through neural networks,' 'decides whether to apply calibration'). Without equations, architecture diagrams, or an ablation isolating their contribution to inverse-scaling error reduction, it is impossible to assess whether they preserve semantic discriminability beyond standard normalization.
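
The shared-pattern check requested in the first major comment could look roughly like this. It is a minimal sketch, assuming per-series z-normalization as the scale-removal step and a plain, unconstrained dynamic time warping (DTW) distance with absolute-difference cost; the function names and toy series are illustrative, and a real analysis would run this over all pairs of fund series.

```python
import math
import statistics


def z_normalize(series):
    """Remove level and scale so that only the shape remains."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


small = [1.0, 2.0, 3.0, 2.0, 1.0]
large = [1000.0, 2000.0, 3000.0, 2000.0, 1000.0]

raw_distance = dtw_distance(small, large)                              # dominated by scale
shape_distance = dtw_distance(z_normalize(small), z_normalize(large))  # near zero
```

A small shape distance across many series pairs would support the premise; large distances would suggest that the gains come from per-series scaling alone.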

minor comments (1)

[Abstract] The abstract states that AS 'seamlessly integrates' into popular TSF models, but provides no details on integration points or compatibility constraints.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comments point by point below.

Referee: [Abstract] The motivation for joint modeling rests on the unverified premise that scale-heterogeneous series 'share similar temporal patterns,' yet no quantitative support (e.g., cross-series DTW distances, normalized autocorrelation similarity, or shape-feature clustering after scale removal) is provided. If this premise does not hold, reported gains could be explained by per-series scaling alone.

Authors: We agree that providing quantitative support would strengthen the motivation section. The claim is based on domain knowledge of the datasets, but to address this, we will add quantitative analyses (e.g., DTW on normalized series and feature clustering) in a new subsection of the revised manuscript to verify the shared patterns and demonstrate that the gains are not solely from per-series scaling.
revision: yes

Referee: [Abstract] The SC and SS components are described only at a high level ('learns adaptive scale factors,' 'calibrates prior mean scaling factors through neural networks,' 'decides whether to apply calibration'). Without equations, architecture diagrams, or an ablation isolating their contribution to inverse-scaling error reduction, it is impossible to assess whether they preserve semantic discriminability beyond standard normalization.

Authors: The abstract provides a high-level summary, as is standard. Detailed equations for the SC and SS modules, the neural network architectures, architecture diagrams, and ablation studies isolating their effects on inverse-scaling errors are provided in Sections 3 and 4 of the full manuscript. These demonstrate how semantic discriminability is preserved. We can add a brief reference to these in the abstract during revision if needed.
revision: partial

Circularity Check

0 steps flagged

No circularity: additive module with external validation

The paper presents the AS module (SC + SS) as an empirical architectural addition to existing TSF models. No derivation chain, equations, or self-citations are shown that reduce the claimed performance gains or 'preserving semantic discriminability' to quantities defined by the method itself. The motivation premise (shared temporal patterns across scale-heterogeneous series) is stated as an empirical observation rather than derived from the module. Experiments on external Ant Fortune/Alipay datasets provide independent falsifiability. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that series share temporal patterns across scales and on standard neural network training assumptions; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Different time series share similar temporal patterns despite scale differences.
Explicitly stated in the abstract as the justification for desiring joint modeling.

pith-pipeline@v0.9.1-grok · 5752 in / 1149 out tokens · 24360 ms · 2026-06-26T17:55:10.873458+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 2 linked inside Pith

[1]

Based on the scale (i.e., numerical magnitude level) distribution properties of the data, TSF tasks can be categorized into scale-homogeneous and scale-heterogeneous settings

Introduction. Time series forecasting (TSF) is essential in many real-world applications, including weather prediction [1], traffic flow estimation [2] and financial inventory planning [3]. Based on the scale (i.e., numerical magnitude level) distribution properties of the data, TSF tasks can be categorized into scale-homogeneous and scale-heterogeneous setting...

Pith/arXiv arXiv 2026
[2]

For scale-heterogeneous data, window-based scaling [16] offers a practical alternative by dividing each window by its local mean

Related Work. Time series normalization and scaling. Most TSF methods employ global standardization or normalization as preprocessing [6, 15], assuming data follows a roughly homogeneous distribution. For scale-heterogeneous data, window-based scaling [16] offers a practical alternative by dividing each window by its local mean. While this partially ...
[3]

Method 3.1. Problem Formulation. Given a historical input window (multivariate time series instance) $X_h = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times c}$ of length $n$, time series forecasting (TSF) tasks aim to forecast the future $m$ steps $X_f = [x_{n+1}, x_{n+2}, \ldots, x_{n+m}] \in \mathbb{R}^{m \times c}$ for all $c$ variables. In scale-heterogeneous scenarios, multiple variables exhibiting similar te...
[4]

AS”), vanilla scaling (“VS

Experiments 4.1. Experimental Settings. Dataset. We collect fund sales datasets from Ant Fortune, which is an online wealth management platform on the Alipay app. They are divided into two groups based on the holding period for comprehensive experimental evaluation, called Fund1 (66 fund sales datasets) and Fund2 (106 fund sales datasets). The sales of d...
[5]

Experiments on real-world fund sales datasets validate that: Table 2

Conclusion. This paper proposes the self-Adaptive Scale-handling (AS) module for time series forecasting under scale heterogeneity. Experiments on real-world fund sales datasets validate that: Table 2. Ablation study. ‡ denotes the full AS module (SC + SS sub-modules, using $\hat{v}_i$); † denotes using only the SC sub-module (using $v_i$). The last two rows compare WMAP...
[6]

Limitations. Our current experiments involve two variables per product (purchase and redemption volumes). When extending to more variables with greater scale diversity within a single product, the learned calibration factors may become less stable or effective, requiring further exploration of cross-variable scale coordination. Additionally, the per-samp...
[7]

Multivariate time series dataset for space weather data analytics,

Rafal A Angryk, Petrus C Martens, Berkay Aydin, Dustin Kempton, Sushant S Mahajan, Sunitha Basodi, Azim Ahmadzadeh, Xumin Cai, Soukaina Filali Boubrahimi, Shah Muhammad Hamdi, et al., “Multivariate time series dataset for space weather data analytics,” Scientific data, vol. 7, no. 1, pp. 227, 2020

2020
[8]

Freeway performance measurement system: mining loop detector data,

Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia, “Freeway performance measurement system: mining loop detector data,” Transportation Research Record, vol. 1748, no. 1, pp. 96–102, 2001

2001
[9]

Multi-period learning for financial time series forecasting,

Xu Zhang, Zhengang Huang, Yunzhi Wu, Xun Lu, Erpeng Qi, Yunkai Chen, Zhongya Xue, Qitong Wang, Peng Wang, and Wei Wang, “Multi-period learning for financial time series forecasting,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, 2025, pp. 2848–2859

2025
[10]

Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,

Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” Advances in neural information processing systems, vol. 32, 2019

2019
[11]

Lost in the non-convex loss landscape: How to fine-tune the large time series model?,

Xu Zhang, Peang Wang, and Wei Wang, “Lost in the non-convex loss landscape: How to fine-tune the large time series model?,” arXiv preprint arXiv:2606.08578, 2026

Pith/arXiv arXiv 2026
[12]

Informer: Beyond efficient transformer for long sequence time-series forecasting,

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI conference on artificial intelligence, 2021, vol. 35, pp. 11106–11115

2021
[13]

Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,

Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in International Conference on Machine Learning. PMLR, 2022, pp. 27268–27286

2022
[14]

Amortized predictability-aware training framework for time series forecasting and classification,

Xu Zhang, Peng Wang, Yichen Li, and Wei Wang, “Amortized predictability-aware training framework for time series forecasting and classification,” in Proceedings of the ACM Web Conference 2026, 2026, pp. 5624–5635

2026
[15]

Are transformers effective for time series forecasting?,

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu, “Are transformers effective for time series forecasting?,” in Proceedings of the AAAI conference on artificial intelligence, 2023, vol. 37, pp. 11121–11128

2023
[16]

Global feature enhancing and fusion framework for strain gauge status recognition,

Xu Zhang, Peng Wang, Chen Wang, Zhe Xu, Xiaohua Nie, and Wei Wang, “Global feature enhancing and fusion framework for strain gauge status recognition,” in Companion Proceedings of the ACM on Web Conference
[17]

611–620, ACM

May 2025, WWW ’25, p. 611–620, ACM

2025
[18]

Film: Frequency improved legendre memory model for long-term time series forecasting,

Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al., “Film: Frequency improved legendre memory model for long-term time series forecasting,” Advances in Neural Information Processing Systems, vol. 35, pp. 12677–12690, 2022

2022
[19]

Tsmixer: An all-mlp architecture for time series forecasting,

Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan Ö. Arik, and Tomas Pfister, “Tsmixer: An all-mlp architecture for time series forecasting,” CoRR, vol. abs/2303.06053, 2023

arXiv 2023
[20]

Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting,

Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam, “Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, Ambuj Singh, Yizhou Sun, Leman Akoglu, Dimitrios...

2023
[21]

Diff-mn: Diffusion parameterized moe-ncde for continuous time series generation with irregular observations,

Xu Zhang, Junwei Deng, Chang Xu, Hao Li, and Jiang Bian, “Diff-mn: Diffusion parameterized moe-ncde for continuous time series generation with irregular observations,” 2026

2026
[22]

Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,

Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” Advances in Neural Information Processing Systems, vol. 34, pp. 22419–22430, 2021

2021
[23]

Deepar: Probabilistic forecasting with autoregressive recurrent networks,

David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski, “Deepar: Probabilistic forecasting with autoregressive recurrent networks,” International Journal of Forecasting, vol. 36, no. 3, pp. 1181–1191, 2020

2020
[24]

Rethinking attention with performers,

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller, “Rethinking attention with performers,” in 9th International Conference on Learning Representations, ICLR 2021, Virtu...

2021
[25]

A lightweight sparse interaction network for time series forecasting,

Xu Zhang, Qitong Wang, Peng Wang, and Wei Wang, “A lightweight sparse interaction network for time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 13304–13312

2025
[26]

Semixer: Semantics enhanced mlp-mixer for multi-scale mixing and long-term time series forecasting,

Xu Zhang, Qitong Wang, Peng Wang, and Wei Wang, “Semixer: Semantics enhanced mlp-mixer for multi-scale mixing and long-term time series forecasting,” in Proceedings of the ACM Web Conference 2026, 2026, pp. 5636–5647

2026
[27]

Categorical reparameterization with gumbel-softmax,

Eric Jang, Shixiang Gu, and Ben Poole, “Categorical reparameterization with gumbel-softmax,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. 2017, OpenReview.net

2017
[28]

The concrete distribution: A continuous relaxation of discrete random variables,

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. 2017, OpenReview.net

2017
[29]

Attention is all you need,

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Sa...

2017
