PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

Carly Beneke; Manuel Weber

arxiv: 2504.18770 · v1 · pith:5JIVL5ZZnew · submitted 2025-04-26 · cs.CV · cs.AI · cs.LG



keywords: model, attention, data, earth, foundation, mechanism, observation, pyvit-fuse


We propose PyViT-FUSE, a foundation model for Earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We demonstrate the interpretability of the fusion mechanism by visualizing its attention scores, and we show the model's applicability to downstream tasks.
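To make the idea of attention-based band fusion concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the fusion layer. The sketch assumes each band has already been embedded into tokens on a shared patch grid (so mixed resolutions are handled upstream), and it uses a single hypothetical learned query vector to score bands per patch. The names `fuse_bands`, `band_tokens`, and `query` are illustrative only.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_bands(band_tokens, query):
    """Fuse per-band patch tokens into one token per patch.

    band_tokens: array of shape (B, P, D) with B bands, P patches, D dims.
                 Assumption: all bands share the same patch grid.
    query:       array of shape (D,), a hypothetical learned query vector.
    Returns (fused, weights) with shapes (P, D) and (B, P).
    """
    d = band_tokens.shape[-1]
    # Scaled dot-product score of every band token against the query.
    scores = band_tokens @ query / np.sqrt(d)        # (B, P)
    # Normalize over the band axis so weights sum to 1 for each patch.
    weights = softmax(scores, axis=0)                # (B, P)
    # Weighted sum over bands gives a single representation per patch.
    fused = (weights[..., None] * band_tokens).sum(axis=0)  # (P, D)
    return fused, weights

rng = np.random.default_rng(0)
query = rng.normal(size=16)

# The same function accepts any number of input bands.
tokens_3 = rng.normal(size=(3, 4, 16))   # 3 bands, 4 patches
tokens_5 = rng.normal(size=(5, 4, 16))   # 5 bands, 4 patches
fused_3, w3 = fuse_bands(tokens_3, query)
fused_5, w5 = fuse_bands(tokens_5, query)
print(fused_3.shape, fused_5.shape)
```

Because the softmax runs over the band axis, the output shape does not depend on how many bands are supplied. The per-patch weights (`w3`, `w5`) are the kind of quantity one could plot as an attention map to inspect which bands dominate in which locations.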


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation
cs.CV · 2026-06 · unverdicted · novelty 7.0

UniverSat is a ViT-style model with a universal patch encoder enabling self-supervised training on heterogeneous multimodal Earth observation data from varying resolutions and sensors.