pith. machine review for the scientific record.

arxiv: 2605.14406 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.CV

Recognition: 2 theorem links · Lean Theorem

GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:19 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords GeoViSTA · vision-tabular transformer · geospatial embeddings · multimodal learning · cross-attention · masked autoencoding · earth observation · socioeconomic data

The pith

GeoViSTA creates transferable geospatial embeddings by jointly modeling imagery and tabular socioeconomic data with cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoViSTA, a vision-tabular transformer that learns unified representations from co-registered Earth observation imagery and tabular socioeconomic data. It uses bilateral cross-attention and geography-aware attention to align image patches with irregular census-tract tokens. A self-supervised joint masked-autoencoding objective trains the model to reconstruct missing patches and rows using context from both modalities. The resulting embeddings outperform baselines when predicting disease mortality and fire hazards in held-out regions, demonstrating the benefit of combining physical-environment data with socioeconomic context for comprehensive geospatial analysis.

Core claim

GeoViSTA is a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. It utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. The model is trained with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues.
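The abstract does not spell out the attention equations. As a rough sketch, one plausible reading of geography-aware bilateral cross-attention subtracts a distance-based bias from the cross-attention logits so each tabular token attends mostly to nearby image patches; all names and the bias form below are illustrative, not the paper's actual formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geo_cross_attention(q_tokens, kv_tokens, q_xy, kv_xy, w_q, w_k, w_v, tau=1.0):
    """One direction of bilateral cross-attention with a geography-aware bias.

    q_xy / kv_xy hold (lon, lat) coordinates per token; the distance bias
    (hypothetical form) penalizes attention to far-away tokens.
    """
    q = q_tokens @ w_q                       # (Nq, d)
    k = kv_tokens @ w_k                      # (Nkv, d)
    v = kv_tokens @ w_v                      # (Nkv, d)
    logits = q @ k.T / np.sqrt(q.shape[-1])  # (Nq, Nkv) scaled dot-product
    dist = np.linalg.norm(q_xy[:, None, :] - kv_xy[None, :, :], axis=-1)
    attn = softmax(logits - dist / tau)      # nearer tokens get more weight
    return attn @ v

rng = np.random.default_rng(0)
n_vis, n_tab, d = 16, 4, 8
vis, tab = rng.normal(size=(n_vis, d)), rng.normal(size=(n_tab, d))
vis_xy, tab_xy = rng.uniform(size=(n_vis, 2)), rng.uniform(size=(n_tab, 2))
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(6)]
tab_enriched = geo_cross_attention(tab, vis, tab_xy, vis_xy, *w[:3])  # tab ← vis
vis_enriched = geo_cross_attention(vis, tab, vis_xy, tab_xy, *w[3:])  # vis ← tab
```

Running both directions, as here, is what makes the exchange bilateral: each modality's tokens are enriched with information drawn from the other.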

What carries the argument

Bilateral cross-attention guided by geography-aware attention to align image patches with tabular tokens in a joint masked autoencoding setup.

If this is right

  • Outperforms baselines in linear probing for disease-specific mortality prediction across regions.
  • Improves prediction of fire hazard frequency in held-out areas.
  • Produces highly transferable representations for holistic geospatial inference.
  • Captures the total environment by combining natural and built features with socioeconomic covariates.
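Linear probing, the evaluation protocol behind the first bullet, freezes the pretrained embeddings and fits only a linear head on top. A minimal sketch using a ridge-regularized least-squares probe on synthetic stand-in embeddings (not the paper's data or metrics):

```python
import numpy as np

rng = np.random.default_rng(1)
# Frozen embeddings for training tracts and held-out-region tracts
# (synthetic stand-ins for GeoViSTA outputs).
z_train, z_test = rng.normal(size=(200, 32)), rng.normal(size=(50, 32))
w_true = rng.normal(size=32)
y_train = z_train @ w_true + 0.1 * rng.normal(size=200)  # e.g. a mortality rate
y_test = z_test @ w_true

# Ridge-regularized linear probe: w = (Z'Z + aI)^-1 Z'y; the encoder stays frozen.
alpha = 1e-2
w = np.linalg.solve(z_train.T @ z_train + alpha * np.eye(32), z_train.T @ y_train)
r2 = 1 - np.sum((y_test - z_test @ w) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
```

Because only the linear head is trained, held-out R² measures how much task-relevant signal the frozen embeddings already carry, which is the sense in which "transferable" is used above.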

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying similar fusion techniques could enhance models in related fields like climate impact assessment.
  • Testing on additional downstream tasks such as population density estimation would further validate the approach.
  • The method suggests that modality alignment via geography-aware attention is key for multimodal geospatial data.

Load-bearing premise

That the bilateral cross-attention and geography-aware attention effectively align the modalities so the self-supervised training yields embeddings that generalize without loss of information.

What would settle it

If the joint model does not show improved performance over single-modality baselines on the held-out mortality and fire hazard prediction tasks, the benefit of the unified embeddings would be called into question.

Figures

Figures reproduced from arXiv: 2605.14406 by Ashok Veeraraghavan, Guha Balakrishnan, Sadeer Al-Kindi, Yuhao Liu.

Figure 1
Figure 1: Vision–tabular geospatial data and reasoning. (a-b) Vision and tabular datasets have fundamentally different structures: vision data are continuous, gridded feature images, while tabular data represent features over irregular geographical regions. (c-d) Examples of geospatial vision and tabular signals. Vision signals describe apparent facets of the natural and built environment, whereas tabular datasets p… view at source ↗
Figure 2
Figure 2: Paired data and high-level GeoViSTA design. Left: We sample co-registered vision and tabular data. Each tabular row corresponds to a census tract defined by a polygon. Right: We propose GeoViSTA, a masked autoencoder framework that jointly trains over paired vision and tabular data. Bilateral cross-attention exchanges spatial and semantic information across modalities. We provide a detailed architectural d… view at source ↗
Figure 3
Figure 3: GeoViSTA architecture. We feed tabular inputs to the row encoder blocks (with column self-attention), followed by row attention blocks (with row self-attention). Vision data goes through standard ViT encoder blocks with positional embeddings. Both vision and tabular tokens receive geospatial positional encodings. This is followed by bilateral cross-attention blocks, which mix vision and tabular tokens. Aft… view at source ↗
Figure 4
Figure 4: Location and geometry encodings. (a) Both vision and tabular tokens are encoded with longitude and latitude offsets from a reference point. (b) Tabular tokens receive additional encoded geometry summaries. We devise geospatial positional encodings that capture the spatial proximity of tokens and the approximate polygon geometry of the underlying census tracts. For any coordinate p = (λ, ϕ) within a regio… view at source ↗
Figure 5
Figure 5: Spatially localized cross-attention. For four example tabular tokens, we visualize their cross-attention weights over vision tokens. The learned attention concentrates on nearby visual patches, indicating that cross-modal interactions follow local spatial structure. With encoded ViT tokens Z_vis ∈ ℝ^(N_vis × D_v) and TabT tokens Z_tab ∈ ℝ^(N_tab × D_t), we proceed with bidirectional ViT–TabT cross-attention blocks. Eac… view at source ↗
Figure 6
Figure 6: Masked auto-reconstruction validation results. Under a mask ratio of 0.5, we jointly reconstruct missing vision patches and tabular census tracts by using self- and cross-attention to exchange spatial and semantic information across the two modalities. view at source ↗
Figure 7
Figure 7: Principal components of GeoViSTA embeddings. We show the first eight components of vision-enriched tabular embeddings from GeoViSTA; percentages indicate explained variance. PC1 shows urban-rural transition, and subsequent PCs show coherent regional clustering. view at source ↗
Figure 8
Figure 8: Training data splits. view at source ↗
Figure 9
Figure 9: Detailed GeoViSTA architecture. view at source ↗
Figure 10
Figure 10: Principal components of CVI tabular inputs. view at source ↗
Figure 11
Figure 11: Principal components of AlphaEarth embeddings. view at source ↗
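Figure 4 describes location encodings built from longitude/latitude offsets relative to a reference point. A minimal sketch, assuming a standard Transformer-style sinusoidal frequency ladder over each coordinate offset (the paper's exact formulation may differ):

```python
import numpy as np

def geo_positional_encoding(lonlat, ref, dim=16):
    """Sinusoidal encoding of (longitude, latitude) offsets from a reference
    point, one frequency ladder per coordinate (illustrative form only)."""
    offsets = np.asarray(lonlat) - np.asarray(ref)        # (N, 2) lon/lat offsets
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 4) / (dim // 4)))
    angles = offsets[:, :, None] * freqs                  # (N, 2, dim//4)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(offsets), -1)                  # (N, dim)

# Two tracts near Houston, encoded against a common reference point.
pe = geo_positional_encoding([[-95.4, 29.7], [-95.2, 29.8]], ref=[-95.0, 30.0])
```

The same encoding can be added to both vision and tabular tokens, which is what lets the cross-attention in Figure 5 reason about spatial proximity across modalities.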
read the original abstract

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
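The joint masked-autoencoding objective described in the abstract can be sketched as masking a fraction of tokens in each modality and scoring reconstruction only on the masked positions. The zero "predictor" below is a placeholder for the model, kept only to show the shape of the loss:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_mask(n, ratio=0.5):
    """Boolean mask marking which of n tokens are hidden from the encoder."""
    idx = rng.permutation(n)[: int(n * ratio)]
    m = np.zeros(n, dtype=bool)
    m[idx] = True
    return m

def masked_mse(pred, target, mask):
    """Reconstruction loss computed only on masked tokens (MAE-style)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# 16 image-patch tokens and 4 census-tract row tokens (synthetic stand-ins).
vis, tab = rng.normal(size=(16, 8)), rng.normal(size=(4, 8))
m_vis, m_tab = random_mask(16), random_mask(4)
# A real model would reconstruct from visible tokens plus cross-modal context;
# zeros stand in for its predictions here.
pred_vis, pred_tab = np.zeros_like(vis), np.zeros_like(tab)
loss = masked_mse(pred_vis, vis, m_vis) + masked_mse(pred_tab, tab, m_tab)
```

Summing the two masked losses is what couples the modalities during training: reducing either term rewards the model for pulling in cues from the other modality.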

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GeoViSTA, a vision-tabular transformer architecture that learns unified geospatial embeddings from co-registered Earth observation imagery and tabular socioeconomic data. It introduces bilateral cross-attention combined with a geography-aware attention mechanism to align image patches with irregular census-tract tokens, trained via a self-supervised joint masked-autoencoding objective that reconstructs missing patches and rows using cross-modal cues. The central claim is that these embeddings yield improved linear-probing performance on downstream tasks such as disease-specific mortality prediction and fire-hazard frequency forecasting across held-out regions, demonstrating the value of jointly modeling physical and structured socioeconomic context.

Significance. If the empirical results are substantiated with rigorous baselines, ablations, and statistical controls, the work would advance multimodal geospatial foundation models by closing the modality gap between visual and tabular data, enabling more holistic representations for environmental, health, and social inference tasks.

major comments (2)
  1. [Abstract] The claim of outperforming baselines on downstream tasks is unsupported by quantitative metrics, error bars, or baseline descriptions; this support is load-bearing for the central empirical claim of improved transferable representations.
  2. [§4, §3.2] Without reported ablations on the bilateral cross-attention or the geography-aware alignment, it remains unclear whether the self-supervised objective actually mitigates modality misalignment or information loss between irregular tabular tokens and image patches.
minor comments (2)
  1. [Abstract] Consider including at least one concrete performance delta or task-specific metric so the empirical contribution is immediately verifiable.
  2. [§3.2] Notation: define the geography-aware attention mask explicitly (e.g., via an equation) to clarify how continuous spatial coordinates are mapped to irregular tabular token positions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires quantitative support and that additional ablations would strengthen the claims regarding the architecture components. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The claim of outperforming baselines on downstream tasks is unsupported by quantitative metrics, error bars, or baseline descriptions; this support is load-bearing for the central empirical claim of improved transferable representations.

    Authors: We agree that the abstract should include specific quantitative metrics to support the claims. The full manuscript reports these results in Section 4, including comparisons against unimodal vision and tabular baselines with error bars from multiple runs. In the revision, we will update the abstract to include key performance figures (e.g., relative improvements on mortality and fire-hazard tasks) and brief baseline descriptions. revision: yes

  2. Referee: [§4, §3.2] Without reported ablations on the bilateral cross-attention or the geography-aware alignment, it remains unclear whether the self-supervised objective actually mitigates modality misalignment or information loss between irregular tabular tokens and image patches.

    Authors: We acknowledge that explicit ablations isolating bilateral cross-attention and geography-aware alignment are not currently reported. While the manuscript compares against unimodal baselines and demonstrates the benefit of the joint objective, dedicated ablations would better isolate these components. We will add these ablation studies to Section 4 in the revision, reporting performance drops when each mechanism is removed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with self-supervised training

full rationale

The paper introduces GeoViSTA as a vision-tabular transformer trained end-to-end with a joint masked-autoencoding objective on co-registered imagery and tabular socioeconomic data. All central claims (unified embeddings, improved linear probing on mortality and fire hazard tasks) rest on empirical evaluation across held-out regions rather than any closed-form derivation, parameter fitting renamed as prediction, or load-bearing self-citation. No equations reduce the output to the input by construction, and the architecture choices (bilateral cross-attention, geography-aware alignment) are presented as design decisions validated by downstream performance, not as theorems derived from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard deep learning assumptions about transformer attention and the availability of aligned multimodal geospatial data; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (2)
  • domain assumption Co-registered gridded imagery and tabular census-tract data can be aligned at the patch and token level.
    Required for the bilateral cross-attention and geography-aware mechanism to function as described.
  • domain assumption A joint masked-autoencoding objective on both modalities will produce transferable embeddings for downstream geospatial tasks.
    Core premise of the self-supervised training strategy.

pith-pipeline@v0.9.0 · 5522 in / 1467 out tokens · 100090 ms · 2026-05-15T02:19:22.595715+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    Skilful precipitation nowcasting using deep generative models of radar,

    S. Ravuri, K. Lenc, M. Willson, D. Kangin, R. Lam, P. Mirowski, M. Fitzsimons, M. Athanassiadou, S. Kashem, S. Madge, R. Prudden, A. Mandhane, A. Clark, A. Brock, K. Simonyan, R. Hadsell, N. Robinson, E. Clancy, A. Arribas, and S. Mohamed, “Skilful precipitation nowcasting using deep generative models of radar,” Nature, vol. 597, no. 7878, pp. 672–677, Sep. 2021

  2. [2]

    Downscaling Extreme Precipitation With Wasserstein Regularized Diffusion,

    Y. Liu, J. Doss-Gollin, Q. Dai, A. Veeraraghavan, and G. Balakrishnan, “Downscaling Extreme Precipitation With Wasserstein Regularized Diffusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–11, 2025

  3. [3]

    Designsafe: New cyberinfrastructure for natural hazards engineering,

    E. M. Rathje, C. Dawson, J. E. Padgett, J.-P. Pinelli, D. Stanzione, A. Adair, P. Arduino, S. J. Brandenberg, T. Cockerill, C. Dey et al., “Designsafe: New cyberinfrastructure for natural hazards engineering,” Natural Hazards Review, vol. 18, no. 3, p. 06017001, 2017

  4. [4]

    Debris segmentation using post-hurricane aerial imagery,

    K. Amini, Y. Liu, J. E. Padgett, G. Balakrishnan, and A. Veeraraghavan, “Debris segmentation using post-hurricane aerial imagery,” Computer-Aided Civil and Infrastructure Engineering, vol. 40, no. 25, pp. 4116–4131

  5. [5]

    Air pollution and cardiovascular disease: JACC state-of-the-art review,

    S. Rajagopalan, S. G. Al-Kindi, and R. D. Brook, “Air pollution and cardiovascular disease: JACC state-of-the-art review,” Journal of the American College of Cardiology, vol. 72, no. 17, pp. 2054–2070, 2018

  6. [6]

    Environmental determinants of cardiovascular disease: lessons learned from air pollution,

    S. G. Al-Kindi, R. D. Brook, S. Biswal, and S. Rajagopalan, “Environmental determinants of cardiovascular disease: lessons learned from air pollution,” Nature Reviews Cardiology, vol. 17, no. 10, pp. 656–672, 2020

  7. [7]

    General Geospatial Inference with a Population Dynamics Foundation Model,

    M. Agarwal, M. Sun, C. Kamath, A. Muslim, P. Sarker, J. Paul, H. Yee, M. Sieniek, K. Jablonski, Y. Mayer, D. Fork, S. de Guia, J. McPike, A. Boulanger, T. Shekel, D. Schottlander, Y. Xiao, M. C. Manukonda, Y. Liu, N. Bulut, S. Abu-el-haija, B. Perozzi, M. Bharel, V. Nguyen, L. Barrington, N. Efron, Y. Matias, G. Corrado, K. Eswaran, S. Prabhakara, S....

  8. [8]

    Google Earth Engine: Planetary-scale geospatial analysis for everyone,

    N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, “Google Earth Engine: Planetary-scale geospatial analysis for everyone,” Remote Sensing of Environment, vol. 202, pp. 18–27, Dec. 2017

  9. [9]

    RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling,

    X. Chen, K. Feng, N. Liu, B. Ni, Y. Lu, Z. Tong, and Z. Liu, “RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling,” Advances in Neural Information Processing Systems, vol. 35, pp. 9797–9812, Dec. 2022

  10. [10]

    SEVIR: A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology,

    M. Veillette, S. Samsi, and C. Mattioli, “SEVIR: A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 22009–22019

  11. [11]

    Computer vision uncovers predictors of physical urban change,

    N. Naik, S. D. Kominers, R. Raskar, E. L. Glaeser, and C. A. Hidalgo, “Computer vision uncovers predictors of physical urban change,” Proceedings of the National Academy of Sciences, vol. 114, no. 29, pp. 7571–7576, Jul. 2017

  12. [12]

    A land use and land cover classification system for use with remote sensor data,

    J. R. Anderson, E. E. Hardy, J. T. Roach, and R. E. Witmer, “A land use and land cover classification system for use with remote sensor data,” USGS Numbered Series 964, 1976

  13. [13]

    AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data,

    C. F. Brown, M. R. Kazmierski, V. J. Pasquarella, W. J. Rucklidge, M. Samsikova, C. Zhang, E. Shelhamer, E. Lahera, O. Wiles, S. Ilyushchenko, N. Gorelick, L. L. Zhang, S. Alj, E. Schechter, S. Askay, O. Guinan, R. Moore, A. Boukouvalas, and P. Kohli, “AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse ...

  14. [14]

    SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery,

    K. Klemmer, E. Rolf, C. Robinson, L. Mackey, and M. Rußwurm, “SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery,” Apr. 2024

  15. [15]

    Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications,

    D. Szwarcman, S. Roy, P. Fraccaro, Þ. E. Gíslason, B. Blumenstiel, R. Ghosal, P. H. de Oliveira, J. L. d. S. Almeida, R. Sedona, Y. Kang, S. Chakraborty, S. Wang, C. Gomes, A. Kumar, M. Truong, D. Godwin, H. Lee, C.-Y. Hsu, A. A. Asanjan, B. Mujeci, D. Shidham, T. Keenan, P. Arevalo, W. Li, H. Alemohammad, P. Olofsson, C. Hain, R. Kennedy, B. Zadrozny, ...

  16. [16]

    OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation,

    H. Herzog, F. Bastani, Y. Zhang, G. Tseng, J. Redmon, H. Sablon, R. Park, J. Morrison, A. Buraczynski, K. Farley, J. Hansen, A. Howe, P. A. Johnson, M. Otterlee, T. Schmitt, H. Pitelka, S. Daspit, R. Ratner, C. Wilhelm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema, “OlmoEarth: Stable Latent Image Modeling for Multim...

  17. [17]

    TerraMind: Large-Scale Generative Multimodality for Earth Observation,

    J. Jakubik, F. Yang, B. Blumenstiel, E. Scheurer, R. Sedona, S. Maurogiovanni, J. Bosmans, N. Dionelis, V. Marsocci, N. Kopp, R. Ramachandran, P. Fraccaro, T. Brunschwiler, G. Cavallaro, J. Bernabe-Moreno, and N. Longépé, “TerraMind: Large-Scale Generative Multimodality for Earth Observation,” Apr. 2025

  18. [18]

    SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,

    X. Guo, J. Lao, B. Dang, Y. Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang, K. Wu, D. Hu, H. He, J. Wang, J. Chen, M. Yang, Y. Zhang, and Y. Li, “SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp....

  19. [19]

    SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery,

    Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. B. Lobell, and S. Ermon, “SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery,” in Advances in Neural Information Processing Systems, Oct. 2022

  20. [20]

    Masked Autoencoders Are Scalable Vision Learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” Dec. 2021

  21. [21]

    About the U.S. Climate Vulnerability Index,

    Environmental Defense Fund and Texas A&M University, “About the U.S. Climate Vulnerability Index,” https://climatevulnerabilityindex.org/about/, 2023, accessed: 2026-04-20

  22. [22]

    LANISTR: Multimodal learning from structured and unstructured data,

    S. Ebrahimi, S. O. Arik, Y. Dong, and T. Pfister, “LANISTR: Multimodal learning from structured and unstructured data,” arXiv:2305.16556, 2023

  23. [23]

    TIP: Tabular-image pre-training for multimodal classification with incomplete data,

    S. Du, S. Zheng, Y. Wang, W. Bai, D. P. O’Regan, and C. Qin, “TIP: Tabular-image pre-training for multimodal classification with incomplete data,” in Computer Vision – ECCV 2024, 2024, pp. 478–496

  24. [24]

    TabTransformer: Tabular Data Modeling Using Contextual Embeddings

    X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “TabTransformer: Tabular data modeling using contextual embeddings,” arXiv:2012.06678, 2020

  25. [25]

    Revisiting deep learning models for tabular data,

    Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting deep learning models for tabular data,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 18932–18943. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/9d86d83f925f2149e9edb0ac3b49229c-Abstract.html

  26. [26]

    Best of both worlds: Multimodal contrastive learning with tabular and imaging data,

    P. Hager, M. J. Menten, and D. Rueckert, “Best of both worlds: Multimodal contrastive learning with tabular and imaging data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23924–23935

  27. [27]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Jun. 2021

  28. [28]

    Train short, test long: Attention with linear biases enables input length extrapolation,

    O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=R8sQPpGCv0

  29. [29]

    A Simple Framework for Contrastive Learning of Visual Representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A Simple Framework for Contrastive Learning of Visual Representations,” Jul. 2020

  30. [30]

    Emerging Properties in Self-Supervised Vision Transformers,

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging Properties in Self-Supervised Vision Transformers,” May 2021

  31. [31]

    Scikit-learn: Machine learning in Python,

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011

  32. [32]

    National Vital Statistics System, Mortality 1999–2020 on CDC WONDER Online Database,

    Centers for Disease Control and Prevention, National Center for Health Statistics, “National Vital Statistics System, Mortality 1999–2020 on CDC WONDER Online Database,” http://wonder.cdc.gov/mcd-icd10.html, 2021, data are from the Multiple Cause of Death Files, 1999–2020, as compiled from data provided by the 57 vital statistics jurisdictions through th...

  33. [33]

    ESA Fire Climate Change Initiative (Fire_cci): MODIS Fire_cci Burned Area Pixel product, version 5.1,

    E. Chuvieco, M. L. Pettinari, J. Lizundia-Loiola, T. Storm, and M. Padilla Parellada, “ESA Fire Climate Change Initiative (Fire_cci): MODIS Fire_cci Burned Area Pixel product, version 5.1,” https://catalogue.ceda.ac.uk/uuid/58f00d8814064b79a0c49662ad3af537/, 2018, published: 2018-11-01