pith. machine review for the scientific record.

arxiv: 2604.26762 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.AI

Recognition: unknown

Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 10:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Probabilistic Transformer · time series modeling · conditional random field · mean-field variational inference · factor graph · spatial-temporal · autoregressive forecasting · ST-PT

The pith

The Probabilistic Transformer becomes a programmable factor graph for time series by equating self-attention to mean-field variational inference on a conditional random field.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds on the result that the Probabilistic Transformer's self-attention and feed-forward blocks are equivalent to mean-field variational inference on a conditional random field, which makes the architecture an explicit, editable factor graph. The authors adapt it to spatial-temporal data as ST-PT by adding a channel axis and per-step processing. They then run three studies: using the graph structure to inject time-series priors in low-data settings, conditioning generation by reprogramming factors per sample, and recasting autoregressive forecasting as successive Bayesian updates with distillation from a teacher model. Readers should care if this lets models incorporate domain knowledge structurally instead of only through data. The work positions ST-PT as a backbone that blends neural flexibility with graphical-model transparency for forecasting tasks.

Core claim

ST-PT serves as a shared backbone that lifts the PT equivalence to time series, exposing three exploitable properties: programmable topology for prior injection, factor-matrix programming for conditional generation, and MFVI iterations as principled latent AR transitions, with CRF distillation to counter error buildup.

What carries the argument

The equivalence between the Transformer's self-attention plus feed-forward block and mean-field variational inference on a conditional random field, extended to ST-PT for time series with added channel and temporal semantics.
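The load-bearing equivalence can be made concrete with a toy sketch. Assuming low-rank pairwise potentials of the form (W_q h_i)·(W_k h_j) — an illustrative parameterization, not necessarily the paper's — one mean-field sweep over the nodes reduces to exactly the softmax(QKᵀ)V shape of a self-attention layer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy CRF: n nodes, each summarized by a d-dim mean-field embedding.
# Pairwise potentials are low-rank: phi(h_i, h_j) = (W_q h_i) . (W_k h_j).
# One coordinate-ascent (mean-field) sweep updates every node by a
# softmax-normalized, value-projected sum over its neighbors -- the same
# softmax(QK^T)V computation as a self-attention layer. Shapes are illustrative.
rng = np.random.default_rng(0)
n, d = 6, 4
H = rng.normal(size=(n, d))                       # current mean-field summaries
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def mean_field_sweep(H):
    scores = (H @ W_q) @ (H @ W_k).T / np.sqrt(d)  # pairwise potential energies
    A = softmax(scores, axis=-1)                   # normalized messages per node
    return A @ (H @ W_v)                           # expected-value update

H_new = mean_field_sweep(H)
print(H_new.shape)  # (6, 4): one MFVI iteration matches one attention layer
```

The practical point is that every quantity in the sweep — the score matrix (graph topology plus potentials) and the update schedule — is an explicit object that can be inspected or edited, which is what "programmable factor graph" means here.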

If this is right

  • Injecting symbolic time-series priors via direct graph modifications improves modeling when data is scarce or noisy.
  • Conditioning the CRF factor matrices externally on a per-sample basis enables structural rather than feature-based conditional generation.
  • Converting latent-space autoregressive transitions to Bayesian posterior updates and distilling from a CRF teacher reduces cumulative errors in multi-step forecasting.
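As a hedged illustration of the first bullet: a symbolic prior can be expressed as an edge set on the factor graph, and deleting an edge amounts to an additive −∞ mask on the corresponding pairwise score before the softmax — the same mechanism the paper's appendix uses for causal and channel-independence masks. The seasonality rule and its period parameter below are hypothetical, not one of the paper's studied priors:

```python
import numpy as np

# Hypothetical prior-injection sketch: a seasonality prior that keeps only
# factor-graph edges between time steps whose lag is a multiple of the period.
# Edge deletion is realized as an additive -inf mask applied to the pairwise
# scores before the softmax; masked entries receive zero attention weight.
def seasonal_edge_mask(n_steps, period):
    lags = np.abs(np.arange(n_steps)[:, None] - np.arange(n_steps)[None, :])
    allowed = (lags % period == 0)              # keep lag-0 and seasonal links
    return np.where(allowed, 0.0, -np.inf)      # 0 keeps an edge, -inf deletes it

mask = seasonal_edge_mask(n_steps=8, period=4)
# Step 0 may pass messages only to steps at lags 0 and 4:
print(np.where(np.isfinite(mask[0]))[0])  # -> [0 4]
```

The same mask mechanism would carry any hard structural prior (causality, channel grouping, known change-points); soft priors would instead bias the scores additively rather than with −∞.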

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar adaptations could apply the programmable factor graph approach to other domains like video prediction or natural language generation.
  • This framework might support more interpretable models by allowing inspection and editing of the explicit potentials and topology.
  • Joint optimization of graph structure alongside the potentials could emerge as a new direction for architecture search in sequence models.

Load-bearing premise

The mathematical equivalence between the modified ST-PT attention blocks and mean-field variational inference on the CRF continues to hold, and the performance improvements arise from exploiting the programmable properties.

What would settle it

Demonstrating that the self-attention computations in ST-PT deviate from the corresponding mean-field updates, or finding that structural graph changes do not yield gains over baseline Transformers in data-scarce time series scenarios.
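The first falsifier can be operationalized as a small numerical harness: implement the same low-rank potentials twice — once as a batched attention layer, once as an explicit per-node mean-field update — and measure the worst-case deviation. This is a sketch under an assumed parameterization, not the paper's actual ST-PT blocks; for this parameterization the two coincide up to floating-point error, and a material deviation in the real blocks would falsify the mapping:

```python
import numpy as np

# Deviation harness (illustrative): batched attention vs. an explicit
# per-node mean-field update under identical low-rank pairwise potentials.
rng = np.random.default_rng(1)
n, d = 5, 3
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attention(H):
    s = (H @ W_q) @ (H @ W_k).T / np.sqrt(d)
    s = s - s.max(axis=-1, keepdims=True)
    A = np.exp(s) / np.exp(s).sum(axis=-1, keepdims=True)
    return A @ (H @ W_v)

def mean_field_node(H, i):
    # Messages into node i from every node, then a normalized expectation.
    s = (H[i] @ W_q) @ (H @ W_k).T / np.sqrt(d)
    s = s - s.max()
    w = np.exp(s) / np.exp(s).sum()
    return w @ (H @ W_v)

dev = max(np.abs(attention(H)[i] - mean_field_node(H, i)).max() for i in range(n))
print(dev)  # near machine epsilon here; a large value would falsify the mapping
```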

Figures

Figures reproduced from arXiv: 2604.26762 by Haoyi Wu, Kan Ren, Kewei Tu, Shuqi Gu, You Wu, Zhangzhi Xiong.

Figure 1. The model consists of the following components and designs: view at source ↗
Figure 1. Structure of ST-PT: a 2D factor graph on (channel view at source ↗
Figure 2. (a) Three graph-level mechanisms for injecting symbolic priors into ST-PT’s factor graph (RQ1, Section 4.1): view at source ↗
read the original abstract

The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT's missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF's factor matrices are the operator's potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.
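RQ3's reframing of the latent AR transition can be sketched in the simplest conjugate case. This is a hedged, hypothetical illustration — isotropic Gaussian latents with known precisions, whereas the paper's actual update operates on the CRF posterior: each forecasting step fuses the prior prediction with new evidence by precision weighting, so uncertainty accumulates in a tracked quantity instead of being discarded by an opaque MLP:

```python
import numpy as np

# Hypothetical sketch of "latent AR transition as Bayesian posterior update":
# fuse the AR prior prediction z_prior with new evidence z_obs by precision
# weighting (conjugate Gaussian case). Precisions here are illustrative.
def posterior_step(z_prior, prec_prior, z_obs, prec_obs):
    """Precision-weighted fusion; posterior precision is the sum of inputs."""
    prec_post = prec_prior + prec_obs
    z_post = (prec_prior * z_prior + prec_obs * z_obs) / prec_post
    return z_post, prec_post

z, prec = np.zeros(3), 1.0                       # initial latent belief
for obs in [np.ones(3), 2 * np.ones(3)]:         # two forecasting steps
    z, prec = posterior_step(z, prec, obs, prec_obs=1.0)

print(z, prec)  # mean moves toward the evidence; precision grows each step
```

Under this reading, the CRF-teacher distillation of RQ3 amounts to supervising the student's per-step posteriors with the teacher's latents, rather than only matching final outputs.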

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that the Probabilistic Transformer (PT) is mathematically equivalent to mean-field variational inference on a conditional random field, turning the architecture into an explicit programmable factor graph. It lifts PT to the Spatial-Temporal Probabilistic Transformer (ST-PT) by adding a channel axis and repairing per-step semantics, then uses this backbone to pose and empirically investigate three research questions: (RQ1) injecting symbolic time-series priors via structural graph modifications under data scarcity; (RQ2) per-sample conditional programming of factor matrices for structural rather than feature-level conditioning; and (RQ3) interpreting latent autoregressive transitions as principled Bayesian posterior updates with CRF-teacher distillation to mitigate cumulative error.

Significance. If the PT-to-MFVI-CRF equivalence is shown to survive the ST-PT modifications and the three studies isolate gains attributable to explicit graph topology, programmable potentials, and posterior-update semantics, the work would supply a concrete bridge between transformer architectures and probabilistic graphical models for time series, enabling more interpretable incorporation of domain priors and conditional generation.

major comments (2)
  1. [§3] The lifting of PT to ST-PT is described as adding a channel axis and repairing per-step semantics, yet the manuscript supplies no re-derivation demonstrating that the modified self-attention and feed-forward blocks continue to implement the identical mean-field variational updates on the underlying CRF factor graph. Because the three RQs and the interpretation of all empirical results rest on inheriting the programmable-factor-graph properties, this omission is load-bearing.
  2. [Empirical studies] The manuscript presents the studies (one per RQ) as demonstrating exploitation of the factor-graph properties, but provides insufficient detail on baselines, metrics, ablation controls, and quantitative effect sizes that would isolate the contribution of explicit graph topology or posterior-update semantics from ordinary modeling improvements. Without such isolation the studies cannot be read as evidence for the claimed advantages of the equivalence.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing the key quantitative outcomes of the three studies (e.g., relative error reductions or statistical significance) so that readers can immediately gauge the practical magnitude of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the technical foundations and empirical evidence. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] The lifting of PT to ST-PT is described as adding a channel axis and repairing per-step semantics, yet the manuscript supplies no re-derivation demonstrating that the modified self-attention and feed-forward blocks continue to implement the identical mean-field variational updates on the underlying CRF factor graph. Because the three RQs and the interpretation of all empirical results rest on inheriting the programmable-factor-graph properties, this omission is load-bearing.

    Authors: We agree that the manuscript would be strengthened by an explicit re-derivation confirming that the ST-PT modifications preserve the MFVI-CRF equivalence. The channel-axis addition and per-step semantic repairs are structural extensions that maintain the same variational update rules in self-attention and feed-forward blocks. In the revised version, we will add a dedicated subsection to §3 with the full re-derivation, explicitly showing that the modified blocks implement identical mean-field updates on the extended CRF factor graph. This will make the inheritance of programmable factor-graph properties transparent and directly support the three RQs. revision: yes

  2. Referee: [Empirical studies] The manuscript presents the studies (one per RQ) as demonstrating exploitation of the factor-graph properties, but provides insufficient detail on baselines, metrics, ablation controls, and quantitative effect sizes that would isolate the contribution of explicit graph topology or posterior-update semantics from ordinary modeling improvements. Without such isolation the studies cannot be read as evidence for the claimed advantages of the equivalence.

    Authors: We concur that the empirical sections require expanded controls to isolate the contributions of the factor-graph properties. In the revision, we will augment each RQ study with: (i) additional baselines including vanilla Transformers, standard probabilistic time-series models, and ablated variants without explicit graph structure; (ii) precise metric definitions and quantitative effect sizes; (iii) targeted ablations disabling graph topology modifications, per-sample factor programming, or CRF-teacher distillation; and (iv) statistical significance testing. These enhancements will more clearly attribute observed gains to the explicit topology, programmable potentials, and posterior-update semantics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The manuscript cites the PT-to-MFVI-CRF equivalence as an established result from prior work and describes lifting PT to ST-PT via explicit modifications (channel axis and per-step semantics repairs) before deriving three RQs from the assumed factor-graph properties. No equation or claim in the abstract or described structure reduces a first-principles result or prediction to its own inputs by construction, nor does any load-bearing step rely on a self-citation chain whose authors overlap with the present paper. The empirical studies are presented as probes of the inherited properties rather than re-derivations or fits that rename inputs as outputs. The chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the PT-MFVI-CRF equivalence (treated as given) and the introduction of ST-PT to handle time-series structure. No explicit free parameters or additional invented entities beyond the framework itself are named in the abstract.

axioms (1)
  • domain assumption — Transformer's self-attention plus feed-forward block equals MFVI on a CRF
    Stated as the foundational equivalence that makes the Transformer a programmable factor graph.
invented entities (1)
  • ST-PT (Spatial-Temporal Probabilistic Transformer) — no independent evidence
    purpose: Repair PT's missing channel axis and weak per-step semantics for time-series data
    Introduced in this work as the shared backbone for the three studies.

pith-pipeline@v0.9.0 · 5643 in / 1459 out tokens · 78233 ms · 2026-05-07T10:52:14.569516+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 24 canonical work pages · 5 internal anchors

  1. Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 2021.
  2. Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. STL: A seasonal-trend decomposition. Journal of Official Statistics, 6(1):3–73, 1990.
  3. Robert H Shumway and David S Stoffer. ARIMA models. In Time Series Analysis and Its Applications: With R Examples, pages 75–163. Springer, 2017.
  4. Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
  5. Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
  6. Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
  7. Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
  8. Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
  9. Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
  10. Yufa Zhou, Yixiao Wang, Surbhi Goel, and Anru R Zhang. Why do transformers fail to forecast time series in-context? arXiv preprint arXiv:2510.09776, 2025.
  11. Sai Shankar Narasimhan, Shubhankar Agarwal, Oguzhan Akcin, Sujay Sanghavi, and Sandeep Chinchali. Time Weaver: A conditional time series generation model. arXiv preprint arXiv:2403.02682, 2024.
  12. Shuqi Gu, Chuyue Li, Baoyu Jing, and Kan Ren. VerbalTS: Generating time series from texts. In Forty-second International Conference on Machine Learning, 2025.
  13. Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, and Shirui Pan. T2S: High-resolution time series generation with text-to-series diffusion models. arXiv preprint arXiv:2505.02417, 2025.
  14. Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, and Jiang Bian. Bridge: Bootstrapping text to control time-series generation via multi-agent iterative optimization and diffusion modeling. arXiv preprint arXiv:2503.02445, 2025.
  15. Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.
  16. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  17. Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
  18. Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, and John Langford. Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963, 2025.
  19. Haoyi Wu and Kewei Tu. Probabilistic transformer: A probabilistic dependency model for contextual word representation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7613–7636, 2023.
  20. Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
  21. Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  22. Natalia Koliou, Tatiana Boura, Stasinos Konstantopoulos, George Meramveliotakis, and George Kosmadakis. Comparing prior and learned time representations in transformer models of timeseries, 2024.
  23. Shaocheng Lan, Shuqi Gu, Zhangzhi Xiong, and Kan Ren. Contsg-bench: A unified benchmark for conditional time series generation. arXiv preprint arXiv:2603.04767, 2026.
  24. Binh Tang and David S Matteson. Probabilistic transformer for time series analysis. Advances in Neural Information Processing Systems, 34:23592–23608, 2021.
  25. Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.
  26. Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.
  27. Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. ETSformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.
  28. Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems, 35:12677–12690, 2022.
  29. Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O Arik, and Tomas Pfister. TSMixer: An all-MLP architecture for time series forecasting. arXiv preprint arXiv:2303.06053, 2023.
  30. Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, and Ming Jin. TimeMixer++: A general time series pattern machine for universal predictive analysis. arXiv preprint arXiv:2410.16032, 2024.
  31. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  32. Tomoharu Iwata and Atsutoshi Kumagai. Few-shot learning for time-series forecasting. arXiv preprint arXiv:2009.14379, 2020.
  33. Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. TimeGPT-1. arXiv preprint arXiv:2310.03589, 2023.
  34. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
  35. Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, et al. One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36:43322–43355, 2023.
  36. Abhimanyu Das, Matthew Faw, Rajat Sen, and Yichen Zhou. In-context fine-tuning for time-series foundation models. arXiv preprint arXiv:2410.24087, 2024.
  37. Chenqi Li, Timothy Denison, and Tingting Zhu. A survey of few-shot learning for biomedical time series. IEEE Reviews in Biomedical Engineering, 18:192–210, 2024.
  38. Zhongkai Hao, Songming Liu, Yichi Zhang, Chengyang Ying, Yao Feng, Hang Su, and Jun Zhu. Physics-informed machine learning: A survey on problems, methods and applications. arXiv preprint arXiv:2211.08064, 2022.
  39. Yiwen Zhang, Xiangning Tian, Yazhou Zhao, Chaobo Zhang, Yang Zhao, and Jie Lu. A prior-knowledge-based time series model for heat demand prediction of district heating systems. Applied Thermal Engineering, 252:123696, 2024.
  40. Xiaomin Li, Anne Hee Hiong Ngu, and Vangelis Metsis. TTS-CGAN: A transformer time-series conditional GAN for biosignal data augmentation. arXiv preprint arXiv:2206.13676, 2022.
  41. Daesoo Lee, Sara Malacarne, and Erlend Aune. Vector quantized time series generation with a bidirectional prior model. arXiv preprint arXiv:2303.04743, 2023.
  42. Aditya Shankar, Lydia Chen, Arie van Deursen, and Rihan Hai. WaveStitch: Flexible and fast conditional time series generation with diffusion models. Proceedings of the ACM on Management of Data, 3(6):1–25, 2025.
  43. Baoyu Jing, Shuqi Gu, Tianyu Chen, Zhiyu Yang, Dongsheng Li, Jingrui He, and Kan Ren. Towards editing time series. Advances in Neural Information Processing Systems, 37:37561–37593, 2024.
  44. Yongfan Lai, Jiabo Chen, Qinghao Zhao, Deyun Zhang, Yue Wang, Shijia Geng, Hongyan Li, and Shenda Hong. DiffuSETS: 12-lead ECG generation conditioned on clinical text reports and patient-specific information. Patterns, 6(10):101291, October 2025.
  45. Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
  46. David Ha, Andrew Dai, and Quoc V Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
  47. Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186, 2022.
  48. Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations, 2023.
  49. Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.