pith. machine review for the scientific record.

arxiv: 2605.14551 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: no theorem link

SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords non-stationary time series · multivariate forecasting · instance normalization · attention mechanism · dependency modeling · adaptive fusion

The pith

SeesawNet adaptively balances common and instance-specific dependencies in non-stationary time series forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that instance normalization reduces distribution shifts across samples but erases instance-specific structural details needed for accurate temporal and cross-channel modeling. SeesawNet counters this with Adaptive Stationary-Nonstationary Attention that pulls shared patterns from normalized sequences and unique patterns from raw sequences, then fuses the two based on each instance's measured non-stationarity. The architecture alternates dedicated temporal and channel blocks to capture both long-range and cross-variable relations. Experiments on standard benchmarks indicate this balanced approach yields higher forecast accuracy than prior methods that either suppress or partially recover specific dependencies.
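
To make the fusion step concrete, below is a minimal PyTorch sketch of the ASNA idea as just described. It is not the authors' code: the module layout, the sigmoid gate, and the use of first-difference spread as the non-stationarity signal are illustrative assumptions; the paper's §3.2 defines the actual formulation.

```python
# Minimal sketch of the ASNA mechanism described above (illustrative only).
import torch
import torch.nn as nn

class ASNASketch(nn.Module):
    """Attend over normalized and raw views of a series, then fuse the two
    outputs with a per-instance gate driven by a non-stationarity scalar."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.common_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.specific_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(1, 1)  # non-stationarity scalar -> fusion logit

    def forward(self, x_raw: torch.Tensor) -> torch.Tensor:  # x_raw: (B, T, C)
        # Instance normalization: per-sample statistics over the time axis.
        mu = x_raw.mean(dim=1, keepdim=True)
        sd = x_raw.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x_raw - mu) / sd

        common, _ = self.common_attn(x_norm, x_norm, x_norm)    # shared patterns
        specific, _ = self.specific_attn(x_raw, x_raw, x_raw)   # instance-specific patterns

        # One scalar per instance: spread of first differences, pooled over
        # channels (an assumed proxy, not the paper's exact definition).
        ns = x_raw.diff(dim=1).std(dim=1).mean(dim=-1, keepdim=True)  # (B, 1)
        alpha = torch.sigmoid(self.gate(ns)).unsqueeze(1)             # (B, 1, 1)
        return alpha * specific + (1.0 - alpha) * common

x = torch.randn(32, 96, 8)              # 32 instances, 96 steps, 8 channels
print(ASNASketch(d_model=8)(x).shape)   # torch.Size([32, 96, 8])
```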

Core claim

SeesawNet introduces Adaptive Stationary-Nonstationary Attention (ASNA) that captures common dependencies from normalized sequences and specific dependencies from raw sequences, adaptively fusing them according to instance-level non-stationarity. Built upon ASNA, the model alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies, consistently outperforming state-of-the-art methods on multiple real-world benchmarks.
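
One plausible reading of the temporal-channel alternation is sketched below: attention with time steps as tokens, then a transpose so channels become tokens. Figure 3 in the paper defines the real architecture; the depth, head count, and residual placement here are assumptions.

```python
# Hedged sketch of the temporal/channel alternation (layout only).
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    def __init__(self, n_channels: int, seq_len: int):
        super().__init__()
        # Temporal block: tokens are time steps, embedding dim is channels.
        self.temporal = nn.MultiheadAttention(n_channels, num_heads=1, batch_first=True)
        # Channel block: tokens are channels, embedding dim is sequence length.
        self.channel = nn.MultiheadAttention(seq_len, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        t_out, _ = self.temporal(x, x, x)
        x = x + t_out                          # residual after time mixing (assumed)
        xc = x.transpose(1, 2)                 # (B, C, T): channels become tokens
        c_out, _ = self.channel(xc, xc, xc)
        return (xc + c_out).transpose(1, 2)    # back to (B, T, C)

print(AlternatingBlock(n_channels=7, seq_len=96)(torch.randn(4, 96, 7)).shape)
```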

What carries the argument

Adaptive Stationary-Nonstationary Attention (ASNA), which extracts common dependencies from instance-normalized sequences and specific dependencies from raw sequences before adaptively fusing them guided by per-instance non-stationarity.

If this is right

  • Improved accuracy on real-world multivariate forecasting tasks that exhibit varying degrees of non-stationarity.
  • Joint modeling of temporal and channel dependencies without over-smoothing instance-specific structures.
  • Dynamic per-instance adaptation that avoids one-size-fits-all normalization or de-normalization.
  • Consistent gains over methods that either fully suppress or only partially recover specific dependencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion principle could be tested on non-stationary sequences in domains such as financial returns or physiological signals where both population trends and individual deviations matter.
  • Replacing fixed instance normalization with ASNA-style adaptive fusion might reduce error accumulation in long-horizon recursive forecasting.
  • Extending the temporal-channel alternation to include explicit cross-instance attention could further stabilize performance when distribution shifts occur between training and test periods.

Load-bearing premise

Adaptive fusion of common dependencies from normalized sequences and specific dependencies from raw sequences, guided by instance-level non-stationarity, will reliably capture heterogeneity without introducing new smoothing artifacts or overfitting to training distribution shifts.

What would settle it

A controlled ablation on a benchmark dataset: force the fusion weights in ASNA to be uniform or fixed and check, across multiple forecast horizons, whether accuracy degrades to no better than using only normalized inputs or only raw inputs. If it does, the per-instance adaptivity, and not merely the dual-input design, is what carries the gains, as sketched below.
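
A minimal harness for that condition grid might look like the following. `run_forecast` is a hypothetical placeholder for the benchmark's train/evaluate loop, not a function from the paper; the horizons are the standard long-term forecasting settings, and ETTm1 is one of the benchmarks the paper visualizes.

```python
# Hypothetical ablation harness for the experiment described above.
from typing import Optional

def run_forecast(dataset: str, horizon: int, fusion: str,
                 alpha: Optional[float] = None) -> float:
    """Placeholder: train a SeesawNet variant and return test MSE.
    fusion='learned' keeps the adaptive gate; fusion='fixed' pins it at alpha."""
    raise NotImplementedError("wire this to the actual benchmark harness")

CONDITIONS = {
    "adaptive":   dict(fusion="learned"),           # full per-instance gating
    "fixed_half": dict(fusion="fixed", alpha=0.5),  # uniform fusion weight
    "norm_only":  dict(fusion="fixed", alpha=0.0),  # normalized inputs only
    "raw_only":   dict(fusion="fixed", alpha=1.0),  # raw inputs only
}

for name, cfg in CONDITIONS.items():
    for horizon in (96, 192, 336, 720):             # standard long-term horizons
        mse = run_forecast(dataset="ETTm1", horizon=horizon, **cfg)
        print(f"{name:>10s}  H={horizon:<4d}  MSE={mse:.3f}")
```

If the adaptive row beats every fixed row by more than seed-level noise, the per-instance gating is doing real work.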

Figures

Figures reproduced from arXiv: 2605.14551 by Hao Li, Liu Chong, Lu Zhang, Pengyang Wang, Yankai Chen, Yingjie Zhou.

Figure 1: Illustration of our motivation from an empirical exam…
Figure 2: The observation of blurred regime-dependent cross…
Figure 3: Overall architecture of SeesawNet with its core component Adaptive Stationary-Nonstationary Attention (ASNA).
Figure 4: Visualization of SeesawNet’s attention behavior on two samples from ETTm1. Each row presents one sample. (a) shows the…
Original abstract

Instance normalization (IN) is widely used in non-stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over-smooth instance-specific structural information that is essential for modeling temporal and cross-channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance-specific dependency based on each instance's non-stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance-specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance-level non-stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies. Extensive experiments on multiple real-world benchmarks demonstrate that SeesawNet consistently outperforms state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SeesawNet, a unified neural architecture for non-stationary multivariate time series forecasting. Its core component is Adaptive Stationary-Nonstationary Attention (ASNA), which extracts common dependencies from instance-normalized sequences and instance-specific dependencies from raw sequences, then adaptively fuses them according to per-instance non-stationarity measures. SeesawNet alternates dedicated temporal and channel modeling blocks built on ASNA to jointly capture long-range temporal and cross-variable dependencies, claiming consistent outperformance over state-of-the-art methods on multiple real-world benchmarks.

Significance. If the empirical results hold under rigorous controls, the work directly tackles a recognized tension in instance normalization for time series: the risk of over-smoothing instance-specific structure while reducing distribution shift. The ASNA fusion mechanism provides a concrete, instance-adaptive solution that could improve robustness on heterogeneous data. The alternating temporal-channel design is a standard but well-motivated choice that aligns with the dual dependency goal. The paper's framing as an architectural contribution rather than a re-derivation of prior results is internally consistent.

major comments (2)
  1. §4 (Experiments): The central claim of 'consistent outperformance' is presented without visible error bars, ablation controls, statistical significance tests, or explicit data-exclusion rules in the abstract and summary description. These elements are load-bearing for establishing that the ASNA fusion reliably improves over baselines rather than reflecting variance or selective reporting.
  2. §3.2 (ASNA): The adaptive fusion weights are listed among free parameters; the manuscript must specify their initialization, any regularization, and how the instance-level non-stationarity scalar is computed from the input, so that the mechanism does not introduce new hyperparameters that could be tuned to the evaluation sets.
minor comments (2)
  1. Abstract: The phrase 'alternates dedicated temporal and channel relationship modeling' is high-level; a one-sentence clarification of the alternation order and residual connections would improve readability without altering the technical content.
  2. §3 (Method), notation: Ensure consistent use of symbols for normalized vs. raw sequences across equations and figures; minor inconsistencies in subscripting can obscure the common/specific distinction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: §4 (Experiments): The central claim of 'consistent outperformance' is presented without visible error bars, ablation controls, statistical significance tests, or explicit data-exclusion rules in the abstract and summary description. These elements are load-bearing for establishing that the ASNA fusion reliably improves over baselines rather than reflecting variance or selective reporting.

    Authors: We agree that explicit statistical controls strengthen the claims. The full manuscript already reports mean results with standard deviations over 5 random seeds in Tables 1–4 and includes ablation studies in §4.3. To address the concern, we will add paired t-test p-values comparing SeesawNet against each baseline in the revised §4, explicitly state the data preprocessing and exclusion criteria (no test-set leakage, standard train/val/test splits) in the main text, and ensure all figures display error bars. These additions will be included in the revision. revision: yes

  2. Referee: §3.2 (ASNA): The adaptive fusion weights are listed among free parameters; the manuscript must specify their initialization, any regularization, and how the instance-level non-stationarity scalar is computed from the input, so that the mechanism does not introduce new hyperparameters that could be tuned to the evaluation sets.

    Authors: We appreciate the request for precision. The non-stationarity scalar is computed as the normalized standard deviation of first-order differences of the raw input sequence (explicit formula: σ = std(x_{t+1} - x_t) / (max(x) - min(x) + ε)). Fusion weights are initialized uniformly at 0.5 and optimized end-to-end with no extra regularization beyond the forecasting loss. We will insert a dedicated paragraph in §3.2 with these exact definitions and confirm that no additional hyperparameters are introduced or tuned on test data. This clarification will appear in the revised manuscript. revision: yes
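
To make response 2 concrete, here is the stated formula transcribed into code. The formula comes from this simulated rebuttal, not from a verified quote of the manuscript, and pooling over all channels is an added assumption.

```python
# Sketch of the rebuttal's non-stationarity scalar (channel pooling assumed).
import numpy as np

def nonstationarity_scalar(x: np.ndarray, eps: float = 1e-8) -> float:
    """sigma = std(x_{t+1} - x_t) / (max(x) - min(x) + eps) for one instance x
    of shape (T, C); std and range taken over all time steps and channels."""
    diffs = np.diff(x, axis=0)          # first-order differences along time
    return float(np.std(diffs) / (x.max() - x.min() + eps))

rng = np.random.default_rng(0)
print(nonstationarity_scalar(rng.normal(size=(96, 7))))  # one ETT-like instance
```

The paired t-tests promised in response 1 are equally mechanical. A self-contained sketch with made-up MSE values (five seeds, one baseline) shows the shape of the comparison:

```python
# Sketch of the paired test promised in response 1. MSE values are made up;
# in practice each entry is one seed's test MSE for a fixed setting.
from scipy import stats

seesaw_mse   = [0.321, 0.318, 0.325, 0.319, 0.322]   # 5 seeds (illustrative)
baseline_mse = [0.334, 0.330, 0.339, 0.331, 0.336]

t, p = stats.ttest_rel(seesaw_mse, baseline_mse)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```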

Circularity Check

0 steps flagged

No significant circularity in architectural proposal

Full rationale

The paper introduces SeesawNet as a new neural architecture whose core, Adaptive Stationary-Nonstationary Attention (ASNA), dynamically balances common dependencies from normalized sequences against instance-specific dependencies from raw sequences. No equations, derivations, or predictions reduce the claimed performance or the balancing mechanism to quantities fitted from evaluation data or from prior self-citations. The contribution is framed as an empirical architectural design validated on benchmarks; it is self-contained and carries no load-bearing circular step.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that instance normalization primarily highlights common patterns while raw sequences retain useful specific structure, plus the design choice that instance-level non-stationarity can be used as a reliable signal for adaptive fusion. No explicit free parameters beyond standard neural network weights are named in the abstract.

free parameters (1)
  • adaptive fusion weights in ASNA
    Learned parameters that decide the balance between common and specific attention outputs for each instance.
axioms (1)
  • domain assumption: Instance normalization reduces distribution shifts and highlights common patterns across samples.
    Stated as widely used and effective in the opening sentence of the abstract.
invented entities (1)
  • Adaptive Stationary-Nonstationary Attention (ASNA) [no independent evidence]
    purpose: Captures common dependencies from normalized sequences and specific dependencies from raw sequences, then fuses them adaptively.
    New module introduced as the core of SeesawNet; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5501 in / 1356 out tokens · 48100 ms · 2026-05-15T01:51:26.012761+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 2 internal anchors

  1. [Cao et al., 2025] Xin Cao, Qinghua Tao, Yingjie Zhou, Lu Zhang, Le Zhang, Dongjin Song, Dapeng Oliver Wu, and Ce Zhu. From dense to sparse: Event response for enhanced residential load forecasting. IEEE Transactions on Instrumentation and Measurement, 2025.

  2. [Chen et al., 2024] Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. arXiv preprint arXiv:2402.05956, 2024.

  3. [Fan et al., 2023] Wei Fan, Pengyang Wang, Dongkun Wang, Dongjie Wang, Yuanchun Zhou, and Yanjie Fu. Dish-TS: A general paradigm for alleviating distribution shift in time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7522–7529, 2023.

  4. [Fan et al., 2024] Wei Fan, Kun Yi, Hangting Ye, Zhiyuan Ning, Qi Zhang, and Ning An. Deep frequency derivative learning for non-stationary time series forecasting. arXiv preprint arXiv:2407.00502, 2024.

  5. [Han et al., 2024] Lu Han, Han-Jia Ye, and De-Chuan Zhan. SIN: Selective and interpretable normalization for long-term time series forecasting. In Forty-first International Conference on Machine Learning, 2024.

  6. [Kim et al., 2021] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2021.

  7. [Kim et al., 2025] HyunGi Kim, Siwon Kim, Jisoo Mok, and Sungroh Yoon. Battling the non-stationarity in time series forecasting via test-time adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 17868–17876, 2025.

  8. [Lin et al., 2025] Sida Lin, Yankai Chen, Yiyan Qi, Chenhao Ma, Bokai Cao, Yifei Zhang, Xue Liu, and Jian Guo. CSPO: Cross-market synergistic stock price movement forecasting with pseudo-volatility optimization. In Companion Proceedings of the ACM on Web Conference 2025, pages 354–363, 2025.

  9. [Liu et al., 2022] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 35:9881–9893, 2022.

  10. [Liu et al., 2023a] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.

  11. [Liu et al., 2024] Peiyuan Liu, Beiliang Wu, Yifan Hu, Naiqi Li, Tao Dai, Jigang Bao, and Shu-tao Xia. TimeBridge: Non-stationarity matters for long-term time series forecasting. arXiv preprint arXiv:2410.04442, 2024.

  12. [Ma et al., 2024] Xiang Ma, Xuemei Li, Lexin Fang, Tianlong Zhao, and Caiming Zhang. U-Mixer: An Unet-Mixer architecture with stationarity correction for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 14255–14262, 2024.

  13. [Nie et al., 2022] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.

  14. [Qiu et al., 2025] Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, and Bin Yang. DUET: Dual clustering enhanced multivariate time series forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 1185–1196, 2025.

  15. [Wang et al., 2024] Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, and Ming Jin. TimeMixer++: A general time series pattern machine for universal predictive analysis. arXiv preprint arXiv:2410.16032, 2024.

  16. [Wang et al., 2025] Hao Wang, Licheng Pan, Zhichao Chen, Degui Yang, Sen Zhang, Yifei Yang, Xinggao Liu, Haoxuan Li, and Dacheng Tao. FreDF: Learning to forecast in the frequency domain. In International Conference on Learning Representations, 2025.

  17. [Wu et al., 2021] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.

  18. [Xu et al., 2023] Zhijian Xu, Ailing Zeng, and Qiang Xu. FITS: Modeling time series with 10k parameters. arXiv preprint arXiv:2307.03756, 2023.

  19. [Ye et al., 2024] Weiwei Ye, Songgaojun Deng, Qiaosha Zou, and Ning Gui. Frequency adaptive normalization for non-stationary time series forecasting. Advances in Neural Information Processing Systems, 37:31350–31379, 2024.

  20. [Zeng et al., 2023] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.

  21. [Zhang and Yan, 2023] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations, 2023.