arxiv: 2604.16428 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI · stat.ML

Non-Stationarity in the Embedding Space of Time Series Foundation Models

Jinmyeong Choi, Brad Shook, Artur Dubrawski

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords time series foundation models · non-stationarity · embedding spaces · distribution shift · linear probes · statistical process control · persistence

The pith

Linear detectability of non-stationarity in the embedding spaces of time series foundation models degrades smoothly as shift strength weakens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how non-stationarity appears in the embedding spaces of time series foundation models, distinguishing it from general distribution shift by focusing on classical forms such as mean shifts, variance changes, linear trends, and persistence violations. Motivated by statistical process control traditions, where detecting departures from stable regimes matters for monitoring, the work uses synthetic data to test the linear accessibility of these changes. Experiments across multiple models show that strong shifts are easily detected, that detectability declines gradually as shifts weaken rather than vanishing suddenly, and that each model has its own characteristic failure patterns.

Core claim

Different forms of distributional non-stationarity (mean shifts, variance changes, and linear trends), plus temporal non-stationarity from persistence, become linearly accessible in TSFM embedding spaces under controlled conditions. This detectability degrades smoothly as shift strength decreases, and different models display distinct failure modes.

What carries the argument

Linear probes applied to embeddings from synthetic time series containing controlled mean shifts, variance changes, linear trends, and persistence violations.
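A minimal sketch of how such windows could be generated, assuming the baseline AR(1) parameters (μ = 0.5, σ = 0.06, ϕ = 0.6), the half-window shift structure, and the trend slope range Uniform(0.3s, 0.6s) · sign quoted in the appendix excerpts in the reference graph on this page. The function names, the window length, and the mean and variance magnitudes (`mean_delta`, `var_scale`) are illustrative assumptions, because the excerpts are truncated before those details. This is not the authors' generator.

```python
import random


def ar1_window(length, mu=0.5, sigma=0.06, phi=0.6, rng=None):
    """Stationary AR(1) window: x_t = mu + phi * (x_{t-1} - mu) + eps_t."""
    rng = rng or random.Random()
    # Start from the stationary distribution so the window has no burn-in transient.
    x = rng.gauss(mu, sigma / (1.0 - phi ** 2) ** 0.5)
    out = [x]
    for _ in range(length - 1):
        x = mu + phi * (x - mu) + rng.gauss(0.0, sigma)
        out.append(x)
    return out


def inject(window, kind, s, rng=None, mean_delta=0.3, var_scale=3.0):
    """Apply a shift of strength s in (0, 1] to the second half of a window.

    mean_delta and var_scale are hypothetical magnitudes, not values from the paper.
    """
    rng = rng or random.Random()
    half = len(window) // 2
    out = list(window)
    sign = rng.choice((-1.0, 1.0))
    if kind == "mean":
        for t in range(half, len(out)):
            out[t] += sign * s * mean_delta
    elif kind == "variance":
        # Scale second-half deviations around the first-half mean.
        base = sum(out[:half]) / half
        scale = 1.0 + s * (var_scale - 1.0)
        for t in range(half, len(out)):
            out[t] = base + scale * (out[t] - base)
    elif kind == "trend":
        # Assumption: alpha is the total rise of the additive ramp over the second half.
        alpha = sign * rng.uniform(0.3 * s, 0.6 * s)
        span = max(len(out) - half - 1, 1)
        for t in range(half, len(out)):
            out[t] += alpha * (t - half) / span
    elif kind != "stationary":
        raise ValueError(f"unknown shift kind: {kind}")
    return out
```

Calling `inject(ar1_window(128), "trend", s)` for each strength level `s` yields one labeled window per level; the length 128 is arbitrary here.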

If this is right

Detectability of non-stationarity in embeddings is gradual and scales with shift magnitude instead of being binary.
Each TSFM exhibits model-specific sensitivities and blind spots for particular non-stationarity types.
Classical SPC-style diagnostics for mean, variance, and trend changes can be partially recovered from embedding spaces.
Model choice for monitoring applications depends on the expected non-stationarity forms in the target data.
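For comparison with the SPC-style diagnostics mentioned in this list, here is a minimal sketch of a two-sided tabular CUSUM applied to the raw series. It is a classical baseline, not something the paper is stated to use. The function name and the defaults `k=0.5` and `h=5.0` (both in units of σ) are conventional illustrative choices, and the in-control mean and σ are assumed known.

```python
def cusum_alarm(series, target, sigma, k=0.5, h=5.0):
    """Return the first index where a two-sided tabular CUSUM exceeds h, else None."""
    upper = lower = 0.0
    for t, x in enumerate(series):
        z = (x - target) / sigma          # standardized deviation from the in-control mean
        upper = max(0.0, upper + z - k)   # accumulates evidence of an upward shift
        lower = max(0.0, lower - z - k)   # accumulates evidence of a downward shift
        if upper > h or lower > h:
            return t
    return None
```

The standard CUSUM design assumes independent observations, so on autocorrelated data such as AR(1) windows it will tend to raise more false alarms than its nominal design suggests.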

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications in anomaly detection may benefit from testing multiple TSFMs to cover different non-stationarity types.
Embeddings could be post-processed with explicit stationarity detectors to compensate for model-specific gaps.
Training objectives for future TSFMs might include explicit preservation of non-stationarity signals.

Load-bearing premise

The controlled synthetic non-stationarities sufficiently represent the forms of non-stationarity encountered in real-world time series data and linear probes are the appropriate measure of accessibility.

What would settle it

An observation of abrupt rather than smooth drops in detectability, or uniform failure modes across models, when applying the same linear probes to real time series with independently verified non-stationarities would challenge the findings.
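One hedged way to make "abrupt versus smooth" operational is to ask how much of the total decline in a detectability curve happens in a single step. The sketch below assumes `scores` is a list of Macro-F1 values ordered from the strongest shift to the weakest; the function names and the `cliff_fraction` threshold are hypothetical choices, not criteria from the paper.

```python
def largest_adjacent_drop(scores):
    """Largest fall in score between consecutive strength levels (strong to weak)."""
    return max(a - b for a, b in zip(scores, scores[1:]))


def looks_abrupt(scores, cliff_fraction=0.5):
    """Flag a curve when one step accounts for more than cliff_fraction of the total decline."""
    total = scores[0] - scores[-1]
    if total <= 0:
        return False  # no overall decline, so there is nothing to call a cliff
    return largest_adjacent_drop(scores) / total > cliff_fraction
```

A curve that loses its score evenly across levels returns `False`, while a curve that holds steady and then collapses in one step returns `True`.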

Figures

Figures reproduced from arXiv: 2604.16428 by Artur Dubrawski, Brad Shook, Jinmyeong Choi.

**Figure 2.** Macro-F1 as a function of shift strength. Strong shifts are easily detected by both TSFM em…

**Figure 3.** Temporal non-stationarity in TSFM embeddings under an AR(1) persistence sweep. (a) Chronos2…

**Figure 4.** Representative AR(1) windows with increasing persistence. As…

**Figure 5.** Representative AR(1) windows used for distributional non-stationarity experiments (…

**Figure 6.** Confusion matrices for shift-type probing at sequence length…

**Figure 7.** Confusion matrices under random persistence (…

Original abstract

Time series foundation models (TSFMs) are widely used as generic feature extractors, yet the notion of non-stationarity in their embedding spaces remains poorly understood. Recent work often conflates non-stationarity with distribution shift, blurring distinctions fundamental to classical time-series analysis and long-standing methodologies such as statistical process control (SPC). In SPC, non-stationarity signals a process leaving a stable regime - via shifts in mean, variance, or emerging trends - and detecting such departures is central to quality monitoring and change-point analysis. Motivated by this diagnostic tradition, we study how different forms of distributional non-stationarity - mean shifts, variance changes, and linear trends - become linearly accessible in TSFM embedding spaces under controlled conditions. We further examine temporal non-stationarity arising from persistence, which reflects violations of weak stationarity due to long-memory or near-unit-root behavior rather than explicit distributional shifts. By sweeping shift strength and probing multiple TSFMs, we find that embedding-space detectability of non-stationarity degrades smoothly and that different models exhibit distinct, model-specific failure modes.
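The link between persistence and weak stationarity can be made concrete with the standard moments of an AR(1) process. The block below is a textbook result rather than something taken from the paper; it assumes the baseline form $x_t = \mu + \phi(x_{t-1} - \mu) + \varepsilon_t$ with $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ quoted in the appendix excerpt in the reference graph on this page.

```latex
\begin{aligned}
\mathbb{E}[x_t] &= \mu, \\
\operatorname{Var}(x_t) &= \frac{\sigma^2}{1 - \phi^2}, \\
\operatorname{Corr}(x_t, x_{t+h}) &= \phi^{|h|}, \qquad |\phi| < 1 .
\end{aligned}
```

As $\phi \to 1$ the stationary variance diverges and the autocorrelation decays ever more slowly, so a finite window looks increasingly like a unit-root process even though no explicit distributional shift was injected. With the quoted baseline values $\sigma = 0.06$ and $\phi = 0.6$, the stationary standard deviation is $0.06 / \sqrt{1 - 0.36} = 0.075$.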

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that TSFM embeddings allow linear detection of simple non-stationarities, but that this detectability fades smoothly as shifts weaken and in model-specific ways.

The main takeaway is that non-stationarity shows up in TSFM embeddings in a graded way rather than all-or-nothing, and that different models handle different types of shift unevenly. The authors inject controlled mean shifts, variance changes, linear trends, and persistence into synthetic series, then probe the embeddings with linear classifiers to track how detectable these departures remain as shift strength is swept from strong to weak. They report smooth degradation across models and distinct failure patterns per architecture. That empirical pattern is the concrete new observation here. It connects classical SPC ideas to foundation-model embeddings without claiming the embeddings solve monitoring tasks outright. The controlled sweep and multi-model comparison give the results a structure that prior work on TSFM embeddings has not emphasized. The setup is reproducible in principle and stays within its stated scope of linear accessibility under synthetic conditions.

The soft spots are straightforward. Everything rests on four families of synthetic injections; real series often mix multiplicative effects, regime switches, or non-additive trend-noise interactions that these generators do not cover. Linear probes also miss any information that would require a non-linear readout. The abstract gives no indication that the authors tested on actual deployed datasets or compared against non-linear probes, so the claim about embedding-space detectability stays narrow.

Readers working on change-point detection or SPC-style monitoring with TSFMs will find the degradation curves and model differences useful as a starting map. The work is clear enough on its own terms to merit referee time, though any review will likely press on external validity. I would send it for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper examines how different forms of non-stationarity (mean shifts, variance changes, linear trends, and persistence) manifest as linearly accessible features in the embedding spaces of time series foundation models (TSFMs). Using controlled synthetic injections and linear probes across multiple models, it reports that detectability degrades smoothly as shift strength decreases and that models display distinct, model-specific failure modes. The work distinguishes non-stationarity from general distribution shift and draws motivation from statistical process control traditions.

Significance. If the central empirical findings hold under more general conditions, the results would clarify the diagnostic capabilities of TSFM embeddings for change-point and regime-shift detection tasks, with potential value for applications in quality monitoring and time-series analysis. The controlled sweep over shift strength and comparison across models is a positive design element that allows quantitative characterization of degradation behavior.

major comments (2)

[Experimental setup and probing methodology (as outlined in abstract)] The central claim that embedding-space detectability of non-stationarity 'degrades smoothly' and exhibits 'model-specific failure modes' rests exclusively on linear probes applied to four families of synthetic injections (mean shifts, variance changes, linear trends, persistence). This setup does not demonstrate that linear separability is an adequate proxy for accessibility, as any non-linear encoding of non-stationarity in the embeddings would remain invisible to the reported probes.
[Data generation and non-stationarity definitions] The chosen synthetic generators (additive mean/variance shifts, linear trends, persistence) do not span common real-world non-stationarities such as regime-switching, multiplicative seasonality, or non-additive trend-noise interactions. Without evidence that these synthetics are representative of the data regimes where TSFMs are deployed, the reported degradation behavior cannot be generalized to 'embedding-space detectability' in general.

minor comments (2)

Clarify the exact definition of 'linear accessibility' and the training procedure for the probes (e.g., whether probes are trained on held-out data or use the full embedding set).
Provide quantitative details on the number of models tested, the range of shift strengths, and the statistical tests used to establish 'smooth degradation' and 'distinct failure modes'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major point below, clarifying the intentional scope of our controlled study while agreeing to strengthen the manuscript with additional discussion of limitations and generalizability.

Referee: [Experimental setup and probing methodology (as outlined in abstract)] The central claim that embedding-space detectability of non-stationarity 'degrades smoothly' and exhibits 'model-specific failure modes' rests exclusively on linear probes applied to four families of synthetic injections (mean shifts, variance changes, linear trends, persistence). This setup does not demonstrate that linear separability is an adequate proxy for accessibility, as any non-linear encoding of non-stationarity in the embeddings would remain invisible to the reported probes.

Authors: We thank the referee for highlighting this distinction. Our claims are explicitly limited to linear accessibility and detectability, as stated throughout the abstract, introduction, and title. Linear probes were chosen deliberately because they provide a direct, quantifiable measure of whether non-stationarity signals are present in a linearly decodable form, aligning with the linear statistical tools central to statistical process control. We agree that this does not rule out non-linear encodings and will revise the manuscript to (i) restate the linear focus more prominently in the abstract and methods, and (ii) add a dedicated limitations paragraph acknowledging that non-linear probes could reveal additional structure and suggesting this as future work.
Revision: partial
Referee: [Data generation and non-stationarity definitions] The chosen synthetic generators (additive mean/variance shifts, linear trends, persistence) do not span common real-world non-stationarities such as regime-switching, multiplicative seasonality, or non-additive trend-noise interactions. Without evidence that these synthetics are representative of the data regimes where TSFMs are deployed, the reported degradation behavior cannot be generalized to 'embedding-space detectability' in general.

Authors: The synthetic generators were selected to isolate specific, well-defined forms of non-stationarity (mean shifts, variance changes, linear trends, and persistence) so that degradation with shift strength could be measured cleanly and model-specific failure modes identified. This controlled design follows directly from the SPC motivation in the introduction. We do not claim these cover all real-world non-stationarities, and we will revise the manuscript to (i) add an explicit scope statement in the introduction and conclusion, (ii) discuss how the chosen forms relate to (but do not exhaust) phenomena such as regime-switching, and (iii) note that broader generalization would require additional experiments on real-world datasets with mixed non-stationarities.
Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical experimental reporting

The paper conducts controlled synthetic injections of non-stationarity (mean shifts, variance changes, linear trends, persistence) into time series, extracts embeddings from several TSFMs, and measures linear probe performance as shift strength varies. All reported findings are direct experimental outcomes from these sweeps; no equations, fitted parameters, or predictions are presented that reduce by construction to the input generators or probe definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained observational reporting rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard assumptions from time-series analysis (weak stationarity definitions) and machine learning (linear probe sufficiency) but introduces none that are novel or fitted within the paper.

Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 1 internal anchor

[2]

Chronos-2: From Univariate to Universal Forecasting

URL https://arxiv.org/abs/2510.15821. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. (arXiv:2505.23719), May

[3]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning

doi: 10.48550/arXiv.2505.23719. URL http://arxiv.org/abs/2505.23719. arXiv:2505.23719 [cs]. J.D. Cryer and K.S. Chan. Time Series Analysis: With Applications in R. Springer Texts in Statistics. Springer New York,

[4]

URL http://www.jstor.org/stable/2286348

ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2286348. Wei Fan, Pengyang Wang, Dongkun Wang, Dongjie Wang, Yuanchun Zhou, and Yanjie Fu. Dish-ts: A general paradigm for alleviating distribution shift in time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7522–7529,

[5]

5 ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM) Denis Kwiatkowski, Peter CB Phillips, Peter Schmidt, and Yongcheol Shin

URL https://openreview.net/forum?id=cGDAkQo1C0p. Denis Kwiatkowski, Peter CB Phillips, Peter Schmidt, and Yongcheol Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometri...

2026
[6]

doi: 10.1016/0304-4076(92)90104-Y

ISSN 0304-4076. doi: 10.1016/0304-4076(92)90104-Y. Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, pp. 6555–6565, New York, NY, USA,

[7]

A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

Association for Computing Machinery. ISBN 9798400704901. doi: 10.1145/3637528.3671451. URL https://doi.org/10.1145/3637528.3671451. Peiyuan Liu, Beiliang Wu, Yifan Hu, Naiqi Li, Tao Dai, Jigang Bao, and Shu-Tao Xia. Timebridge: Non-stationarity matters for long-term time series forecasting. International Conference on Machine Learning,

[8]

URL http://www.jstor.org/stable/2336182

ISSN 00063444. URL http://www.jstor.org/stable/2336182. Lina Sjösten. A comparative study of the KPSS and ADF tests in terms of size and power,

[9]

Timemixer++: A general time series pattern machine for universal predictive analysis

ISSN 2835-8856. URL https://openreview.net/forum?id=QlTLkH6xRC. Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, and Ming Jin. Timemixer++: A general time series pattern machine for universal predictive analysis. arXiv preprint arXiv:2410.16032,

[10]

All experiments are conducted at the window level with sequence length L =

A DATA GENERATING PROCESS
We generate synthetic time series using a controlled AR(1) process to study how distributional and temporal non-stationarity manifest in embedding space. All experiments are conducted at the window level with sequence length L =

2026
[11]

Unless otherwise specified, we use $\phi = 0.6$

A.1 BASELINE AR(1) PROCESS
The stationary baseline is defined as an AR(1) process
$$x_t = \mu + \phi\,(x_{t-1} - \mu) + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),$$
where $\mu = 0.5$, $\sigma = 0.06$, and $|\phi| < 1$ ensures weak stationarity. Unless otherwise specified, we use $\phi = 0.6$. This baseline defines the stationary class.
A.2 DISTRIBUTIONAL SHIFT TYPES
We consider three forms of distributional non-stationarity applie...

2026
[12]

These models were selected to span diverse architectural paradigms and training objectives while enabling consistent extraction of window-level embeddings

Shift structure: Half-window change (continuous)
Trend
  Slope α: Uniform(0.3s, 0.6s) · sign
  Trend form: Additive linear ramp
Shift Strength
  Strength levels: {1.0, 0.7, 0.5, 0.35, 0.25, 0.18, 0.12, 0.08}
  Interpretation: s = 1: strongest shift, s → 0: indistinguishable

B MODELS
We evaluate three representative time series foundation models (TSFMs): Chronos2 (Ansari et al., 2025), ...

2025
[13]

is a pretrained time series foundation model designed for universal forecasting across univariate and multivariate settings (footnote 1: https://github.com/amazon-science/chronos-forecasting). It extends the Chronos family with group-attention mechanisms that enable cross-series information sharing and in-context learning. The model processes normalized input ...

2026
[14]

is a family of open-source foundation models for general-purpose time series analysis, trained via large-scale multi-dataset pretraining. It learns representations through self-supervised objectives such as masked reconstruction, enabling a single model to support forecasting, classification, anomaly detection, and imputation tasks. By reconstructing...

2026
[15]

B.5 WHY THESE TSFMS? We selected these models for three primary reasons

These features train a logistic regression model to perform shift type classification.
B.5 WHY THESE TSFMS?
We selected these models for three primary reasons. (1) Comparable embedding extraction. All three models provide encoder outputs that can be converted into fixed-length window embeddings without task-specific fine-tuning, enabling consistent represent...

2026
[16]

Longer windows consistently improve separability for all methods, reflecting the benefit of additional temporal context

C.2.1 FIXED PERSISTENCE (ϕ = 0.6)
Table 3 summarizes Macro-F1 across sequence lengths under fixed persistence (ϕ = 0.6). Longer windows consistently improve separability for all methods, reflecting the benefit of additional temporal context. Importantly, the qualitative model ranking is unchanged across L, and the weak-shift regime (s = 0.12) continues to rev...

2026
[17]

These formulations further reinforce the view of non-stationarity as a form of distribution shift that disrupts stable representation learning

distinguishes between intra-space shift, referring to temporal changes within a single representation space, and inter-space shift, describing misalignment across representations learned under different temporal regimes. These formulations further reinforce the view of non-stationarity as a form of distribution shift that disrupts stable representation lear...

2022