pith. machine review for the scientific record.

arxiv: 2604.26762 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.AI

Recognition: unknown

Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 10:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Probabilistic Transformer · time series modeling · conditional random field · mean-field variational inference · factor graph · spatial-temporal · autoregressive forecasting · ST-PT

The pith

The Probabilistic Transformer becomes a programmable factor graph for time series by equating self-attention to mean-field variational inference on a conditional random field.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds on the result that the Probabilistic Transformer's self-attention and feed-forward blocks are equivalent to mean-field variational inference on a conditional random field, which makes the architecture an explicit, editable factor graph. The authors adapt it to spatial-temporal data as ST-PT by adding a channel axis and per-step processing. They then run three studies: using the graph structure to inject time-series priors in low-data settings, conditioning generation by reprogramming factors per sample, and recasting autoregressive forecasting as successive Bayesian updates with distillation from a teacher model. Readers should care if this lets models incorporate domain knowledge structurally instead of only through data. The work positions ST-PT as a backbone that blends neural flexibility with graphical-model transparency for forecasting tasks.

Core claim

ST-PT serves as a shared backbone that lifts the PT equivalence to time series, exposing three exploitable properties: programmable topology for prior injection, factor-matrix programming for conditional generation, and MFVI iterations as principled latent AR transitions, with CRF distillation to counter error buildup.

What carries the argument

The equivalence between the Transformer's self-attention plus feed-forward block and mean-field variational inference on a conditional random field, extended to ST-PT for time series with added channel and temporal semantics.
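The load-bearing equivalence can be made concrete with a toy sketch. Assuming low-rank pairwise potentials of the form (W_q h_i)·(W_k h_j) — an illustrative parameterization, not necessarily the paper's — one mean-field sweep over the nodes reduces to exactly the softmax(QKᵀ)V shape of a self-attention layer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy CRF: n nodes, each summarized by a d-dim mean-field embedding.
# Pairwise potentials are low-rank: phi(h_i, h_j) = (W_q h_i) . (W_k h_j).
# One coordinate-ascent (mean-field) sweep updates every node by a
# softmax-normalized, value-projected sum over its neighbors -- the same
# softmax(QK^T)V computation as a self-attention layer. Shapes are illustrative.
rng = np.random.default_rng(0)
n, d = 6, 4
H = rng.normal(size=(n, d))                       # current mean-field summaries
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def mean_field_sweep(H):
    scores = (H @ W_q) @ (H @ W_k).T / np.sqrt(d)  # pairwise potential energies
    A = softmax(scores, axis=-1)                   # normalized messages per node
    return A @ (H @ W_v)                           # expected-value update

H_new = mean_field_sweep(H)
print(H_new.shape)  # (6, 4): one MFVI iteration matches one attention layer
```

The practical point is that every quantity in the sweep — the score matrix (graph topology plus potentials) and the update schedule — is an explicit object that can be inspected or edited, which is what "programmable factor graph" means here.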

If this is right

  • Injecting symbolic time-series priors via direct graph modifications improves modeling when data is scarce or noisy.
  • Conditioning the CRF factor matrices externally on a per-sample basis enables structural rather than feature-based conditional generation.
  • Converting latent-space autoregressive transitions to Bayesian posterior updates and distilling from a CRF teacher reduces cumulative errors in multi-step forecasting.
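As a hedged illustration of the first bullet: a symbolic prior can be expressed as an edge set on the factor graph, and deleting an edge amounts to an additive −∞ mask on the corresponding pairwise score before the softmax — the same mechanism the paper's appendix uses for causal and channel-independence masks. The seasonality rule and its period parameter below are hypothetical, not one of the paper's studied priors:

```python
import numpy as np

# Hypothetical prior-injection sketch: a seasonality prior that keeps only
# factor-graph edges between time steps whose lag is a multiple of the period.
# Edge deletion is realized as an additive -inf mask applied to the pairwise
# scores before the softmax; masked entries receive zero attention weight.
def seasonal_edge_mask(n_steps, period):
    lags = np.abs(np.arange(n_steps)[:, None] - np.arange(n_steps)[None, :])
    allowed = (lags % period == 0)              # keep lag-0 and seasonal links
    return np.where(allowed, 0.0, -np.inf)      # 0 keeps an edge, -inf deletes it

mask = seasonal_edge_mask(n_steps=8, period=4)
# Step 0 may pass messages only to steps at lags 0 and 4:
print(np.where(np.isfinite(mask[0]))[0])  # -> [0 4]
```

The same mask mechanism would carry any hard structural prior (causality, channel grouping, known change-points); soft priors would instead bias the scores additively rather than with −∞.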

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar adaptations could apply the programmable factor graph approach to other domains like video prediction or natural language generation.
  • This framework might support more interpretable models by allowing inspection and editing of the explicit potentials and topology.
  • Joint optimization of graph structure alongside the potentials could emerge as a new direction for architecture search in sequence models.

Load-bearing premise

The mathematical equivalence between the modified ST-PT attention blocks and mean-field variational inference on the CRF continues to hold, and the performance improvements arise from exploiting the programmable properties.

What would settle it

Demonstrating that the self-attention computations in ST-PT deviate from the corresponding mean-field updates, or finding that structural graph changes do not yield gains over baseline Transformers in data-scarce time series scenarios.
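The first falsifier can be operationalized as a small numerical harness: implement the same low-rank potentials twice — once as a batched attention layer, once as an explicit per-node mean-field update — and measure the worst-case deviation. This is a sketch under an assumed parameterization, not the paper's actual ST-PT blocks; for this parameterization the two coincide up to floating-point error, and a material deviation in the real blocks would falsify the mapping:

```python
import numpy as np

# Deviation harness (illustrative): batched attention vs. an explicit
# per-node mean-field update under identical low-rank pairwise potentials.
rng = np.random.default_rng(1)
n, d = 5, 3
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attention(H):
    s = (H @ W_q) @ (H @ W_k).T / np.sqrt(d)
    s = s - s.max(axis=-1, keepdims=True)
    A = np.exp(s) / np.exp(s).sum(axis=-1, keepdims=True)
    return A @ (H @ W_v)

def mean_field_node(H, i):
    # Messages into node i from every node, then a normalized expectation.
    s = (H[i] @ W_q) @ (H @ W_k).T / np.sqrt(d)
    s = s - s.max()
    w = np.exp(s) / np.exp(s).sum()
    return w @ (H @ W_v)

dev = max(np.abs(attention(H)[i] - mean_field_node(H, i)).max() for i in range(n))
print(dev)  # near machine epsilon here; a large value would falsify the mapping
```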

Figures

Figures reproduced from arXiv: 2604.26762 by Haoyi Wu, Kan Ren, Kewei Tu, Shuqi Gu, You Wu, Zhangzhi Xiong.

Figure 1. The model consists of the following components and designs: view at source ↗
Figure 1. Structure of ST-PT: a 2D factor graph on (channel view at source ↗
Figure 2. (a) Three graph-level mechanisms for injecting symbolic priors into ST-PT’s factor graph (RQ1, Section 4.1): view at source ↗
read the original abstract

The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT's missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF's factor matrices are the operator's potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.
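RQ3's reframing of the latent AR transition can be sketched in the simplest conjugate case. This is a hedged, hypothetical illustration — isotropic Gaussian latents with known precisions, whereas the paper's actual update operates on the CRF posterior: each forecasting step fuses the prior prediction with new evidence by precision weighting, so uncertainty accumulates in a tracked quantity instead of being discarded by an opaque MLP:

```python
import numpy as np

# Hypothetical sketch of "latent AR transition as Bayesian posterior update":
# fuse the AR prior prediction z_prior with new evidence z_obs by precision
# weighting (conjugate Gaussian case). Precisions here are illustrative.
def posterior_step(z_prior, prec_prior, z_obs, prec_obs):
    """Precision-weighted fusion; posterior precision is the sum of inputs."""
    prec_post = prec_prior + prec_obs
    z_post = (prec_prior * z_prior + prec_obs * z_obs) / prec_post
    return z_post, prec_post

z, prec = np.zeros(3), 1.0                       # initial latent belief
for obs in [np.ones(3), 2 * np.ones(3)]:         # two forecasting steps
    z, prec = posterior_step(z, prec, obs, prec_obs=1.0)

print(z, prec)  # mean moves toward the evidence; precision grows each step
```

Under this reading, the CRF-teacher distillation of RQ3 amounts to supervising the student's per-step posteriors with the teacher's latents, rather than only matching final outputs.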

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that the Probabilistic Transformer (PT) is mathematically equivalent to mean-field variational inference on a conditional random field, turning the architecture into an explicit programmable factor graph. It lifts PT to the Spatial-Temporal Probabilistic Transformer (ST-PT) by adding a channel axis and repairing per-step semantics, then uses this backbone to pose and empirically investigate three research questions: (RQ1) injecting symbolic time-series priors via structural graph modifications under data scarcity; (RQ2) per-sample conditional programming of factor matrices for structural rather than feature-level conditioning; and (RQ3) interpreting latent autoregressive transitions as principled Bayesian posterior updates with CRF-teacher distillation to mitigate cumulative error.

Significance. If the PT-to-MFVI-CRF equivalence is shown to survive the ST-PT modifications and the three studies isolate gains attributable to explicit graph topology, programmable potentials, and posterior-update semantics, the work would supply a concrete bridge between transformer architectures and probabilistic graphical models for time series, enabling more interpretable incorporation of domain priors and conditional generation.

major comments (2)
  1. [§3] The lifting of PT to ST-PT is described as adding a channel axis and repairing per-step semantics, yet the manuscript supplies no re-derivation demonstrating that the modified self-attention and feed-forward blocks continue to implement the identical mean-field variational updates on the underlying CRF factor graph. Because the three RQs and the interpretation of all empirical results rest on inheriting the programmable-factor-graph properties, this omission is load-bearing.
  2. [Empirical studies] The manuscript presents the studies (one per RQ) as demonstrating exploitation of the factor-graph properties, but provides insufficient detail on baselines, metrics, ablation controls, and quantitative effect sizes that would isolate the contribution of explicit graph topology or posterior-update semantics from ordinary modeling improvements. Without such isolation the studies cannot be read as evidence for the claimed advantages of the equivalence.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing the key quantitative outcomes of the three studies (e.g., relative error reductions or statistical significance) so that readers can immediately gauge the practical magnitude of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the technical foundations and empirical evidence. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] The lifting of PT to ST-PT is described as adding a channel axis and repairing per-step semantics, yet the manuscript supplies no re-derivation demonstrating that the modified self-attention and feed-forward blocks continue to implement the identical mean-field variational updates on the underlying CRF factor graph. Because the three RQs and the interpretation of all empirical results rest on inheriting the programmable-factor-graph properties, this omission is load-bearing.

    Authors: We agree that the manuscript would be strengthened by an explicit re-derivation confirming that the ST-PT modifications preserve the MFVI-CRF equivalence. The channel-axis addition and per-step semantic repairs are structural extensions that maintain the same variational update rules in self-attention and feed-forward blocks. In the revised version, we will add a dedicated subsection to §3 with the full re-derivation, explicitly showing that the modified blocks implement identical mean-field updates on the extended CRF factor graph. This will make the inheritance of programmable factor-graph properties transparent and directly support the three RQs. revision: yes

  2. Referee: [Empirical studies] The manuscript presents the studies (one per RQ) as demonstrating exploitation of the factor-graph properties, but provides insufficient detail on baselines, metrics, ablation controls, and quantitative effect sizes that would isolate the contribution of explicit graph topology or posterior-update semantics from ordinary modeling improvements. Without such isolation the studies cannot be read as evidence for the claimed advantages of the equivalence.

    Authors: We concur that the empirical sections require expanded controls to isolate the contributions of the factor-graph properties. In the revision, we will augment each RQ study with: (i) additional baselines including vanilla Transformers, standard probabilistic time-series models, and ablated variants without explicit graph structure; (ii) precise metric definitions and quantitative effect sizes; (iii) targeted ablations disabling graph topology modifications, per-sample factor programming, or CRF-teacher distillation; and (iv) statistical significance testing. These enhancements will more clearly attribute observed gains to the explicit topology, programmable potentials, and posterior-update semantics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The manuscript cites the PT-to-MFVI-CRF equivalence as an established result from prior work and describes lifting PT to ST-PT via explicit modifications (channel axis and per-step semantics repairs) before deriving three RQs from the assumed factor-graph properties. No equation or claim in the abstract or described structure reduces a first-principles result or prediction to its own inputs by construction, nor does any load-bearing step rely on a self-citation chain whose authors overlap with the present paper. The empirical studies are presented as probes of the inherited properties rather than re-derivations or fits that rename inputs as outputs. The chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the PT-MFVI-CRF equivalence (treated as given) and the introduction of ST-PT to handle time-series structure. No explicit free parameters or additional invented entities beyond the framework itself are named in the abstract.

axioms (1)
  • domain assumption — Transformer's self-attention plus feed-forward block equals MFVI on a CRF
    Stated as the foundational equivalence that makes the Transformer a programmable factor graph.
invented entities (1)
  • ST-PT (Spatial-Temporal Probabilistic Transformer) — no independent evidence
    purpose: Repair PT's missing channel axis and weak per-step semantics for time-series data
    Introduced in this work as the shared backbone for the three studies.

pith-pipeline@v0.9.0 · 5643 in / 1459 out tokens · 78233 ms · 2026-05-07T10:52:14.569516+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 24 canonical work pages · 5 internal anchors

  1. Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 2021.
  2. Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. STL: A seasonal-trend decomposition. Journal of Official Statistics, 6(1):3–73, 1990.
  3. Robert H Shumway and David S Stoffer. ARIMA models. In Time Series Analysis and Its Applications: With R Examples, pages 75–163. Springer, 2017.
  4. Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
  5. Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
  6. Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
  7. Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
  8. Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
  9. Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
  10. Yufa Zhou, Yixiao Wang, Surbhi Goel, and Anru R Zhang. Why do transformers fail to forecast time series in-context? arXiv preprint arXiv:2510.09776, 2025.
  11. Sai Shankar Narasimhan, Shubhankar Agarwal, Oguzhan Akcin, Sujay Sanghavi, and Sandeep Chinchali. Time Weaver: A conditional time series generation model. arXiv preprint arXiv:2403.02682, 2024.
  12. Shuqi Gu, Chuyue Li, Baoyu Jing, and Kan Ren. VerbalTS: Generating time series from texts. In Forty-second International Conference on Machine Learning, 2025.
  13. Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, and Shirui Pan. T2S: High-resolution time series generation with text-to-series diffusion models. arXiv preprint arXiv:2505.02417, 2025.
  14. Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, and Jiang Bian. Bridge: Bootstrapping text to control time-series generation via multi-agent iterative optimization and diffusion modeling. arXiv preprint arXiv:2503.02445, 2025.
  15. Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.
  16. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  17. Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
  18. Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, and John Langford. Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963, 2025.
  19. Haoyi Wu and Kewei Tu. Probabilistic transformer: A probabilistic dependency model for contextual word representation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7613–7636, 2023.
  20. Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
  21. Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  22. Natalia Koliou, Tatiana Boura, Stasinos Konstantopoulos, George Meramveliotakis, and George Kosmadakis. Comparing prior and learned time representations in transformer models of timeseries, 2024.
  23. Shaocheng Lan, Shuqi Gu, Zhangzhi Xiong, and Kan Ren. Contsg-bench: A unified benchmark for conditional time series generation. arXiv preprint arXiv:2603.04767, 2026.
  24. Binh Tang and David S Matteson. Probabilistic transformer for time series analysis. Advances in Neural Information Processing Systems, 34:23592–23608, 2021.
  25. Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.
  26. Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.
  27. Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. ETSformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.
  28. Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems, 35:12677–12690, 2022.
  29. Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O Arik, and Tomas Pfister. TSMixer: An all-MLP architecture for time series forecasting. arXiv preprint arXiv:2303.06053, 2023.
  30. Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, and Ming Jin. TimeMixer++: A general time series pattern machine for universal predictive analysis. arXiv preprint arXiv:2410.16032, 2024.
  31. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  32. Tomoharu Iwata and Atsutoshi Kumagai. Few-shot learning for time-series forecasting. arXiv preprint arXiv:2009.14379, 2020.
  33. Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. TimeGPT-1. arXiv preprint arXiv:2310.03589, 2023.
  34. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
  35. Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, et al. One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36:43322–43355, 2023.
  36. Abhimanyu Das, Matthew Faw, Rajat Sen, and Yichen Zhou. In-context fine-tuning for time-series foundation models. arXiv preprint arXiv:2410.24087, 2024.
  37. Chenqi Li, Timothy Denison, and Tingting Zhu. A survey of few-shot learning for biomedical time series. IEEE Reviews in Biomedical Engineering, 18:192–210, 2024.
  38. Zhongkai Hao, Songming Liu, Yichi Zhang, Chengyang Ying, Yao Feng, Hang Su, and Jun Zhu. Physics-informed machine learning: A survey on problems, methods and applications. arXiv preprint arXiv:2211.08064, 2022.
  39. Yiwen Zhang, Xiangning Tian, Yazhou Zhao, Chaobo Zhang, Yang Zhao, and Jie Lu. A prior-knowledge-based time series model for heat demand prediction of district heating systems. Applied Thermal Engineering, 252:123696, 2024.
  40. Xiaomin Li, Anne Hee Hiong Ngu, and Vangelis Metsis. TTS-CGAN: A transformer time-series conditional GAN for biosignal data augmentation. arXiv preprint arXiv:2206.13676, 2022.
  41. Daesoo Lee, Sara Malacarne, and Erlend Aune. Vector quantized time series generation with a bidirectional prior model. arXiv preprint arXiv:2303.04743, 2023.
  42. Aditya Shankar, Lydia Chen, Arie van Deursen, and Rihan Hai. WaveStitch: Flexible and fast conditional time series generation with diffusion models. Proceedings of the ACM on Management of Data, 3(6):1–25, 2025.
  43. Baoyu Jing, Shuqi Gu, Tianyu Chen, Zhiyu Yang, Dongsheng Li, Jingrui He, and Kan Ren. Towards editing time series. Advances in Neural Information Processing Systems, 37:37561–37593, 2024.
  44. Yongfan Lai, Jiabo Chen, Qinghao Zhao, Deyun Zhang, Yue Wang, Shijia Geng, Hongyan Li, and Shenda Hong. DiffuSETS: 12-lead ECG generation conditioned on clinical text reports and patient-specific information. Patterns, 6(10):101291, October 2025.
  45. Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
  46. David Ha, Andrew Dai, and Quoc V Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
  47. Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186, 2022.
  48. Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations, 2023.
  49. Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.