OceanMAE: A Foundation Model for Ocean Remote Sensing

Begüm Demir; Behnood Rasti; Panagiotis Agrafiotis; Viola-Joanna Stamer

arxiv: 2604.08171 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Viola-Joanna Stamer, Panagiotis Agrafiotis, Behnood Rasti, Begüm Demir

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

keywords ocean remote sensing · masked autoencoder · self-supervised learning · marine segmentation · bathymetry estimation · foundation model · Sentinel-2 · domain adaptation

The pith

Integrating physically meaningful ocean descriptors into masked autoencoder pre-training improves downstream marine segmentation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that an ocean-specific masked autoencoder, OceanMAE, learns more useful representations when auxiliary ocean descriptors are added to standard MAE pre-training on large unlabeled Sentinel-2 data. A sympathetic reader would care because ocean remote sensing suffers from scarce labels and from models pre-trained mostly on land imagery, both of which limit accuracy on tasks such as debris detection and bathymetry. The domain-aligned approach yields its clearest gains on segmentation benchmarks while remaining competitive on regression. An ablation against a standard MAE on MARIDA further indicates that the added descriptors, rather than generic self-supervision alone, contribute to the better transfer performance.

Core claim

OceanMAE extends standard MAE pre-training by jointly encoding multispectral Sentinel-2 observations and physically meaningful ocean descriptors on the Hydro dataset, producing latent representations that transfer to a modified UNet framework and deliver stronger marine pollutant and debris segmentation on MADOS and MARIDA together with competitive bathymetry results on MagicBathyNet.

What carries the argument

The auxiliary ocean descriptors added to the masked autoencoder pre-training objective, which guide the model toward ocean-aware latent representations from unlabeled multispectral imagery.
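
A minimal Python sketch of how such descriptors could enter MAE pre-training, written under stated assumptions rather than taken from the released code. It follows the standard MAE recipe of hiding a random subset of patches and, as one possible reading of the paper's description (a linear projection joined with the class token), turns a descriptor vector into a single extra token placed next to the class token. The 75% mask ratio is the default of the original MAE and may differ from the paper's setting; the token sizes, descriptor values, weights, and the names `random_mask`, `project`, and `build_encoder_input` are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) sets, MAE-style."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def project(vector, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_encoder_input(patch_tokens, visible, descriptors, weights, cls_token):
    """Assemble the encoder sequence: [CLS, descriptor token, visible patches].

    Assumption: the descriptors become one extra token; the paper may
    instead concatenate along the feature dimension.
    """
    descriptor_token = project(descriptors, weights)
    return [cls_token, descriptor_token] + [patch_tokens[i] for i in visible]


rng = random.Random(0)
num_patches, dim = 16, 4

# Placeholder patch embeddings standing in for embedded Sentinel-2 patches.
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

# Hypothetical, already-normalized descriptor values for one image.
descriptors = [0.3, -1.2, 0.8]
weights = [[rng.gauss(0, 0.1) for _ in descriptors] for _ in range(dim)]
cls_token = [0.0] * dim

visible, masked = random_mask(num_patches, 0.75, rng)
sequence = build_encoder_input(patch_tokens, visible, descriptors, weights, cls_token)
print(len(visible), len(masked), len(sequence))
```

Only the visible patches and the two special tokens reach the encoder here; a decoder would then reconstruct the masked patches from that shorter sequence.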

If this is right

OceanMAE produces its largest accuracy gains on marine debris and pollutant segmentation tasks.
Bathymetry estimation benefits remain competitive and vary with the specific regression setup.
A controlled ablation confirms that the ocean descriptors themselves drive measurable downstream improvement over a plain MAE baseline.
The resulting representations support transfer to both segmentation and regression heads via a shared UNet-style decoder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same descriptor-injection pattern could be tested on other remote-sensing domains that possess domain-specific physical variables.
If the descriptors remain effective across different sensor resolutions, the method offers a route to build more general ocean foundation models without task-specific labels.
Public release of code and weights allows direct replication and extension on additional ocean datasets.

Load-bearing premise

The selected ocean descriptors are physically meaningful and sufficiently independent of the downstream task labels that their use in pre-training genuinely aids generalization rather than introducing dataset-specific leakage.

What would settle it

Retrain the same architecture on the Hydro dataset without the auxiliary ocean descriptors. If its segmentation metrics on the MARIDA test set match or exceed those of the full OceanMAE model, the descriptors are not the source of the gain; a clear drop would support the claim.
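
A small sketch of how the two runs could be compared, assuming mean intersection-over-union (mIoU) as the segmentation metric; which metric the paper reports is not stated on this page. The confusion matrices below are made-up toy counts that only exercise the code. They are not results from the paper, and `mean_iou`, `toy_full`, and `toy_ablated` are hypothetical names.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes that never occur (neither as truth nor as prediction) are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


# Toy two-class counts, NOT results from the paper.
toy_full = [[50, 10], [5, 35]]      # stands in for a run with descriptors
toy_ablated = [[45, 15], [10, 30]]  # stands in for a run without descriptors

gap = mean_iou(toy_full) - mean_iou(toy_ablated)
print(f"mIoU gap (full minus ablated): {gap:.4f}")
```

In a real comparison the gap would be judged against seed-to-seed variance, which is why repeated runs matter for this test.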

Figures

Figures reproduced from arXiv: 2604.08171 by Begüm Demir, Behnood Rasti, Panagiotis Agrafiotis, Viola-Joanna Stamer.

**Figure 2.** Architecture of the modified UNet for downstream ocean tasks. The …

**Figure 3.** Qualitative comparison of pollutants and sea-surface segmentation …

Original abstract

Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OceanMAE is a standard MAE with auxiliary ocean descriptors added during pre-training, delivering an ablation gain on MARIDA segmentation but without checks that rule out indirect label leakage from those descriptors.

The main thing to know is that this paper takes the masked autoencoder recipe and adds physically motivated ocean descriptors to the pre-training stage on Sentinel-2 data from the Hydro set. It then fine-tunes a UNet-style head for marine debris and pollutant segmentation on MADOS and MARIDA plus bathymetry regression on MagicBathyNet. The clearest result is the ablation showing that the extra descriptors improve segmentation over plain MAE on MARIDA, and the authors release code and weights at a public repo.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OceanMAE, a masked autoencoder pre-trained on the Hydro dataset that augments standard MAE with auxiliary ocean descriptors (the paper names bathymetry, chlorophyll level, and Secchi depth) alongside Sentinel-2 multispectral observations. The model is transferred via a modified UNet to downstream tasks: marine pollutant/debris segmentation on MADOS and MARIDA, and bathymetry regression on MagicBathyNet. The central empirical claim is that the auxiliary descriptors improve segmentation over a standard MAE baseline, as shown by an ablation on MARIDA.

Significance. If the performance gains are shown to arise from genuinely ocean-aware representations rather than leakage, the work would provide a useful domain-adapted foundation model for ocean remote sensing, where labeled data are scarce. Public release of code and weights supports reproducibility and is a clear strength.

major comments (2)

[Ablation study] Ablation study (abstract and experiments section): The reported improvement of OceanMAE over standard MAE on MARIDA does not include any quantitative check (correlation coefficients, mutual information, or per-descriptor ablation) that the auxiliary descriptors are statistically independent of the marine debris/pollutant segmentation labels. Without this, the performance gap could be explained by implicit weak supervision during pre-training rather than improved generalization.
[Methods] Methods section: The description of how auxiliary ocean descriptors are encoded, normalized, and fused into the MAE encoder/decoder (including changes to input dimensionality, positional embeddings, or the reconstruction loss) is insufficiently detailed to allow replication or to assess whether the integration is parameter-free or introduces new hyperparameters that could affect the claimed gains.
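
The independence check requested in the first major comment could be sketched as follows. This is an illustrative outline, assuming per-pixel descriptor values paired with binary labels; it is not an analysis from the paper. The sample values are invented for demonstration, and `pearson`, `quantize`, and `mutual_information` are hypothetical helper names. Mutual information is computed with a simple plug-in estimate over quantized values, which is biased for small samples.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def quantize(values, num_bins):
    """Map continuous values to equal-width bin indices in [0, num_bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # guard against constant input
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Invented example: one descriptor value and one binary label per pixel.
descriptor = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]
labels = [0, 0, 0, 1, 1, 0]

print("pearson:", pearson(descriptor, labels))
print("mutual information:", mutual_information(quantize(descriptor, 2), labels))
```

Values near zero for both statistics would suggest little direct dependence between a descriptor and the labels, while large values would point toward the weak-supervision explanation raised above.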

minor comments (2)

[Abstract] Abstract: No quantitative metrics, dataset sizes, or error bars are provided despite the claim of 'strongest gains'; adding these would strengthen the summary.
[Experiments] Evaluation protocol: Clarify the exact fine-tuning procedure, number of epochs, learning-rate schedule, and whether the same data augmentations are used for both OceanMAE and the standard MAE baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major point below and will revise the manuscript accordingly to improve clarity and strengthen the empirical claims.

Referee: [Ablation study] Ablation study (abstract and experiments section): The reported improvement of OceanMAE over standard MAE on MARIDA does not include any quantitative check (correlation coefficients, mutual information, or per-descriptor ablation) that the auxiliary descriptors are statistically independent of the marine debris/pollutant segmentation labels. Without this, the performance gap could be explained by implicit weak supervision during pre-training rather than improved generalization.

Authors: We agree this is a valid concern: without explicit independence checks, the observed gains could partly reflect correlations between the auxiliary descriptors and the downstream labels rather than purely improved generalization. Although pre-training remains fully self-supervised (no segmentation labels are used), the descriptors are derived from the same Sentinel-2 observations and could carry implicit information. In the revised manuscript we will add a dedicated analysis subsection that reports Pearson correlations and mutual information between each auxiliary descriptor and the MARIDA labels, together with per-descriptor ablation results. These additions will allow readers to assess the degree of any leakage and better attribute the performance improvements. revision: yes
Referee: [Methods] Methods section: The description of how auxiliary ocean descriptors are encoded, normalized, and fused into the MAE encoder/decoder (including changes to input dimensionality, positional embeddings, or the reconstruction loss) is insufficiently detailed to allow replication or to assess whether the integration is parameter-free or introduces new hyperparameters that could affect the claimed gains.

Authors: We acknowledge that the current methods description is too high-level for full reproducibility. In the revised version we will expand the OceanMAE architecture subsection to specify: (i) the exact normalization applied to each descriptor (z-score using Hydro dataset statistics), (ii) the encoding mechanism (concatenation as additional input channels with an adjusted linear patch embedding layer), (iii) any consequent changes to positional embeddings, and (iv) confirmation that the reconstruction loss remains the standard masked MSE with no auxiliary terms. We will also state explicitly that the only new design choice is the selection of the four descriptors; no additional hyperparameters are introduced beyond the original MAE configuration. revision: yes
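
Two of the ingredients named in the simulated response, z-score normalization and a masked MSE reconstruction loss, can be sketched in a few lines. This is a hedged illustration in plain Python, not the authors' implementation: the statistics passed to `zscore` are placeholders rather than Hydro dataset values, and the function names are hypothetical. The loss follows the standard MAE convention of averaging squared error within each patch and then over masked patches only.

```python
def zscore(values, mean, std):
    """Normalize values with precomputed dataset statistics."""
    return [(v - mean) / std for v in values]


def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only.

    pred and target are lists of patches (each a list of pixel values);
    mask[i] == 1 means patch i was hidden from the encoder.
    """
    total = 0.0
    count = 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    if count == 0:
        raise ValueError("mask selects no patches")
    return total / count


# Placeholder statistics, not values from the Hydro dataset.
normalized = zscore([2.0, 4.0, 6.0], mean=4.0, std=2.0)
print(normalized)

# Two toy patches; only the first is masked, so only it contributes.
loss = masked_mse(
    pred=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [5.0, 5.0]],
    mask=[1, 0],
)
print(loss)
```

Because visible patches are excluded from the loss, the large error on the second toy patch has no effect on the result.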

Circularity Check

0 steps flagged

No circularity: purely empirical ablation with no derivation chain

The paper presents no mathematical derivation, first-principles result, or predictive claim that reduces to its own inputs by construction. All load-bearing evidence consists of empirical ablations (OceanMAE vs. standard MAE on MARIDA) and downstream evaluations on public datasets (MADOS, MARIDA, MagicBathyNet). Pre-training incorporates auxiliary ocean descriptors by design, but the performance gap is measured externally rather than being tautological. No self-citation load-bearing steps, uniqueness theorems, or fitted parameters renamed as predictions appear. The work is self-contained as an empirical study with public code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of masked autoencoder pre-training (reconstruction of masked patches yields useful representations) and transfer learning (representations learned on unlabeled data transfer to labeled downstream tasks). The abstract introduces no free parameters or invented entities beyond the model itself; the one axiom listed below is a domain assumption that the paper invokes implicitly.

axioms (1)

domain assumption: Masked reconstruction on multispectral imagery plus auxiliary descriptors produces ocean-aware latent features that transfer to segmentation and regression.
Invoked implicitly when claiming that the pre-trained representations improve downstream performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: OceanMAE adapts the MAE framework to ocean imagery by augmenting representation learning with external oceanic variables... bathymetry, chlorophyll level, and Secchi depth... linearly projected... concatenated with the E_CLS token

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

