arxiv: 2510.03244 · v2 · pith:2WHHFKQOnew · submitted 2025-09-25 · 💻 cs.LG · cs.AI · cs.CV

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

classification 💻 cs.LG cs.AI cs.CV
keywords series time cross-modal forecasting models vfem multivariate visual

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.
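
The pipeline the abstract describes (series to image, two feature branches, cross-modal attention fusion) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the row-per-variable grayscale encoding, the patch width, the fixed random projections standing in for the frozen LVM and the temporal encoder, and the single-head attention in which temporal tokens act as queries over visual tokens. The names `series_to_image` and `cross_modal_attention` are hypothetical.

```python
import numpy as np


def series_to_image(x):
    """Map a multivariate series of shape (T, C) to a (C, T) image in [0, 1].

    Assumed encoding: one row per variable, one column per time step,
    with per-variable min-max scaling. The paper's rendering may differ.
    """
    x = np.asarray(x, dtype=np.float64)
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant channels
    return ((x - lo) / span).T


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(temporal, visual, w_q, w_k, w_v):
    """Single-head cross-attention: temporal tokens query visual tokens.

    temporal: (N_t, d), visual: (N_v, d), projections: (d, d).
    Returns the fused tokens (N_t, d) and the attention weights (N_t, N_v).
    """
    q = temporal @ w_q
    k = visual @ w_k
    v = visual @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    fused = temporal + weights @ v  # residual connection keeps the temporal features
    return fused, weights


rng = np.random.default_rng(0)
T, C, d, P = 96, 7, 16, 16  # time steps, variables, feature width, patch width (all assumed)

x = rng.standard_normal((T, C))
image = series_to_image(x)  # (C, T)

# Visual branch stand-in: cut the image into T // P patches spanning all variables,
# then apply a fixed projection that is never updated (mimicking a frozen backbone).
patches = image.reshape(C, T // P, P).transpose(1, 0, 2).reshape(T // P, C * P)
frozen_proj = rng.standard_normal((C * P, d)) / np.sqrt(C * P)
visual_tokens = patches @ frozen_proj  # (T // P, d)

# Temporal branch stand-in: one token per variable from a linear map over time.
temporal_proj = rng.standard_normal((T, d)) / np.sqrt(T)
temporal_tokens = x.T @ temporal_proj  # (C, d)

w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = cross_modal_attention(temporal_tokens, visual_tokens, w_q, w_k, w_v)

print(image.shape, visual_tokens.shape, fused.shape, weights.shape)
```

Because every visual patch in this sketch spans all variables, each per-variable temporal token can attend to features that mix channels, which is one plausible reading of how visual features could supply the cross-variable information that a channel-independent branch lacks.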
