VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Danny Dongning Sun; Fei Ma; Hang Yu; Hongkang Zhang; Jian Xu; Shao-Lun Huang; Tongtong Feng; Xiao-Ping Zhang; Yanlong Wang; Zijian Zhang

arXiv: 2510.03244 · v2 · pith:2WHHFKQOnew · submitted 2025-09-25 · cs.LG · cs.AI · cs.CV

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

classification: cs.LG, cs.AI, cs.CV

keywords: series, time, cross-modal, forecasting, models, vfem, multivariate, visual

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.
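The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, since the abstract does not specify the details: the function names `series_to_image` and `cross_attention` are hypothetical, the image rendering is a simple per-channel min-max scaling, the attention is single-head with temporal tokens as queries and visual tokens as keys and values, and random arrays stand in for the outputs of the temporal branch and the frozen LVM.

```python
import numpy as np


def series_to_image(x):
    """Scale a (channels, time) array per channel into a uint8 grayscale image.

    Hypothetical rendering; the abstract does not describe VFEM's actual encoding.
    """
    x = np.asarray(x, dtype=np.float64)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant channels
    return np.round(255 * (x - lo) / span).astype(np.uint8)


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(temporal, visual, w_q, w_k, w_v):
    """Single-head cross-attention: temporal tokens attend to visual tokens.

    Assumed direction of attention; returns residual-fused tokens and the weights.
    """
    q = temporal @ w_q  # (n_temporal, d)
    k = visual @ w_k    # (n_visual, d)
    v = visual @ w_v    # (n_visual, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_temporal, n_visual)
    return temporal + weights @ v, weights


rng = np.random.default_rng(0)
series = rng.normal(size=(3, 8))           # 3 variables, 8 time steps
image = series_to_image(series)            # what a vision model would receive

temporal_tokens = rng.normal(size=(3, 4))  # stand-in for the temporal branch output
visual_tokens = rng.normal(size=(5, 6))    # stand-in for frozen LVM features
w_q = rng.normal(size=(4, 4))              # in this sketch, only these projections
w_k = rng.normal(size=(6, 4))              # would be trainable; the LVM producing
w_v = rng.normal(size=(6, 4))              # visual_tokens would stay frozen

fused, weights = cross_attention(temporal_tokens, visual_tokens, w_q, w_k, w_v)
print(image.shape, fused.shape, weights.shape)
```

In this sketch each row of `weights` sums to one, so every temporal token receives a convex combination of projected visual features, added back to the token through a residual connection. The residual form is a design choice of the sketch, not something the abstract confirms.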
