GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
Recognition: 2 theorem links
Pith reviewed 2026-05-15 02:19 UTC · model grok-4.3
The pith
GeoViSTA creates transferable geospatial embeddings by jointly modeling imagery and tabular socioeconomic data with cross-attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoViSTA is a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. It utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. The model is trained with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues.
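The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the token counts, feature dimension, and the bias parameters d0 and tau are toy placeholders, and the paper's learned projections, multi-head structure, and masking are omitted. The distance bias uses the form ϕ(d) = tanh((d0 − d)/τ) quoted from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def geo_cross_attention(q_tok, kv_tok, q_xy, kv_xy, d0=5.0, tau=2.0):
    """One direction of the bilateral exchange: tokens from one modality
    attend to the other, with attention logits biased by physical distance
    via tanh((d0 - d) / tau), the form quoted from the paper."""
    logits = q_tok @ kv_tok.T / np.sqrt(q_tok.shape[-1])
    # pairwise distances (km, toy units) between token locations
    dist = np.linalg.norm(q_xy[:, None, :] - kv_xy[None, :, :], axis=-1)
    logits = logits + np.tanh((d0 - dist) / tau)  # nearby cross-modal tokens get a boost
    return softmax(logits) @ kv_tok

rng = np.random.default_rng(0)
img, tab = rng.normal(size=(16, 32)), rng.normal(size=(5, 32))       # toy tokens
img_xy, tab_xy = rng.uniform(0, 10, (16, 2)), rng.uniform(0, 10, (5, 2))

# bilateral: image patches attend to tract tokens, and vice versa
img_out = geo_cross_attention(img, tab, img_xy, tab_xy)
tab_out = geo_cross_attention(tab, img, tab_xy, img_xy)
print(img_out.shape, tab_out.shape)  # (16, 32) (5, 32)
```

Each output keeps its own modality's token count while mixing in features from the other modality, which is what lets the masked-autoencoding objective draw on cross-modal cues.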
What carries the argument
Bilateral cross-attention guided by geography-aware attention to align image patches with tabular tokens in a joint masked autoencoding setup.
If this is right
- Outperforms baselines in linear probing for disease-specific mortality prediction across regions.
- Improves prediction of fire hazard frequency in held-out areas.
- Produces highly transferable representations for holistic geospatial inference.
- Captures the total environment by combining natural and built features with socioeconomic covariates.
Where Pith is reading between the lines
- Applying similar fusion techniques could enhance models in related fields like climate impact assessment.
- Testing on additional downstream tasks such as population density estimation would further validate the approach.
- The method suggests that modality alignment via geography-aware attention is key for multimodal geospatial data.
Load-bearing premise
That the bilateral cross-attention and geography-aware attention effectively align the modalities so the self-supervised training yields embeddings that generalize without loss of information.
What would settle it
If the joint model does not show improved performance over single-modality baselines on the held-out mortality and fire hazard prediction tasks, the benefit of the unified embeddings would be called into question.
Original abstract
Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
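A toy numeric sketch of the joint masked-autoencoding objective the abstract describes: hide a fraction of image patches and tabular rows, reconstruct them, and score only the hidden entries. The mean-of-visible-tokens predictor and all shapes here are stand-ins; the actual model uses learned transformer encoders and decoders with cross-modal attention.

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_indices(n, mask_ratio):
    """Boolean mask selecting a fixed fraction of tokens to hide."""
    k = max(1, int(n * mask_ratio))
    m = np.zeros(n, dtype=bool)
    m[rng.permutation(n)[:k]] = True
    return m

def joint_mae_loss(img_patches, tab_rows, mask_ratio=0.5):
    """Toy joint masked-autoencoding objective: mask image patches and
    tabular rows, predict them with a stand-in decoder (the mean of the
    visible tokens), and compute squared error only on the masked entries."""
    img_mask = mask_indices(img_patches.shape[0], mask_ratio)
    tab_mask = mask_indices(tab_rows.shape[0], mask_ratio)
    img_pred = img_patches[~img_mask].mean(axis=0, keepdims=True)
    tab_pred = tab_rows[~tab_mask].mean(axis=0, keepdims=True)
    loss_img = ((img_patches[img_mask] - img_pred) ** 2).mean()
    loss_tab = ((tab_rows[tab_mask] - tab_pred) ** 2).mean()
    return loss_img + loss_tab

# 16 image-patch tokens and 6 census-tract rows, feature dim 8 (all toy sizes)
loss = joint_mae_loss(rng.normal(size=(16, 8)), rng.normal(size=(6, 8)))
print(float(loss) >= 0.0)  # prints True
```

The single scalar loss over both modalities is what forces the encoder to recover missing patches and rows from local spatial context and cross-modal cues.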
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GeoViSTA, a vision-tabular transformer architecture that learns unified geospatial embeddings from co-registered Earth observation imagery and tabular socioeconomic data. It introduces bilateral cross-attention combined with a geography-aware attention mechanism to align image patches with irregular census-tract tokens, trained via a self-supervised joint masked-autoencoding objective that reconstructs missing patches and rows using cross-modal cues. The central claim is that these embeddings yield improved linear-probing performance on downstream tasks such as disease-specific mortality prediction and fire-hazard frequency forecasting across held-out regions, demonstrating the value of jointly modeling physical and structured socioeconomic context.
Significance. If the empirical results are substantiated with rigorous baselines, ablations, and statistical controls, the work would advance multimodal geospatial foundation models by closing the modality gap between visual and tabular data, enabling more holistic representations for environmental, health, and social inference tasks.
major comments (2)
- [Abstract] The claim of outperforming baselines on downstream tasks is unsupported by quantitative metrics, error bars, or baseline descriptions; this is load-bearing for the central empirical claim of improved transferable representations.
- [§4, §3.2] Without reported ablation results on the bilateral cross-attention or the geography-aware alignment, it remains unclear whether the self-supervised objective actually mitigates modality misalignment or information loss between irregular tabular tokens and image patches.
minor comments (2)
- [Abstract] Consider including at least one concrete performance delta or task-specific metric so the empirical contribution is immediately verifiable.
- [§3.2] Notation: define the geography-aware attention mask explicitly (e.g., via an equation) to clarify how continuous spatial coordinates are mapped to irregular tabular token positions.
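One way to make the requested definition concrete: consistent with the distance bias ϕ(d) = tanh((d0 − d)/τ) that this review quotes from the paper, the attention score between tract token i and image patch j could take an additive-bias form in the style of ALiBi [28]. This is hypothetical notation, not the authors' verbatim equation:

Attn(i, j) ∝ exp( qᵢ·kⱼ / √d + ϕ(d_ij) ),   ϕ(d) = tanh((d0 − d)/τ),

where d_ij is the physical distance in kilometers between the two token locations, d0 sets the radius inside which attention is boosted, and τ controls how sharply the boost decays with distance.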
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract requires quantitative support and that additional ablations would strengthen the claims regarding the architecture components. We will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The claim of outperforming baselines on downstream tasks is unsupported by quantitative metrics, error bars, or baseline descriptions; this is load-bearing for the central empirical claim of improved transferable representations.
Authors: We agree that the abstract should include specific quantitative metrics to support the claims. The full manuscript reports these results in Section 4, including comparisons against unimodal vision and tabular baselines with error bars from multiple runs. In the revision, we will update the abstract to include key performance figures (e.g., relative improvements on mortality and fire-hazard tasks) and brief baseline descriptions. revision: yes
-
Referee: [§4, §3.2] Without reported ablation results on the bilateral cross-attention or the geography-aware alignment, it remains unclear whether the self-supervised objective actually mitigates modality misalignment or information loss between irregular tabular tokens and image patches.
Authors: We acknowledge that explicit ablations isolating bilateral cross-attention and geography-aware alignment are not currently reported. While the manuscript compares against unimodal baselines and demonstrates the benefit of the joint objective, dedicated ablations would better isolate these components. We will add these ablation studies to Section 4 in the revision, reporting performance drops when each mechanism is removed. revision: yes
Circularity Check
No circularity: empirical architecture with self-supervised training
Full rationale
The paper introduces GeoViSTA as a vision-tabular transformer trained end-to-end with a joint masked-autoencoding objective on co-registered imagery and tabular socioeconomic data. All central claims (unified embeddings, improved linear probing on mortality and fire hazard tasks) rest on empirical evaluation across held-out regions rather than any closed-form derivation, parameter fitting renamed as prediction, or load-bearing self-citation. No equations reduce the output to the input by construction, and the architecture choices (bilateral cross-attention, geography-aware alignment) are presented as design decisions validated by downstream performance, not as theorems derived from prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Co-registered gridded imagery and tabular census-tract data can be aligned at the patch and token level.
- domain assumption A joint masked-autoencoding objective on both modalities will produce transferable embeddings for downstream geospatial tasks.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens... self-supervised joint masked-autoencoding objective"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"We bias attention by physical spatial distance in kilometers... ϕ(d) = tanh((d0 − d)/τ)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Skilful precipitation nowcasting using deep generative models of radar
  S. Ravuri, K. Lenc, M. Willson, D. Kangin, R. Lam, P. Mirowski, M. Fitzsimons, M. Athanassiadou, S. Kashem, S. Madge, R. Prudden, A. Mandhane, A. Clark, A. Brock, K. Simonyan, R. Hadsell, N. Robinson, E. Clancy, A. Arribas, and S. Mohamed, "Skilful precipitation nowcasting using deep generative models of radar," Nature, vol. 597, no. 7878, pp. 672–677, Sep. 2021.
- [2] Downscaling Extreme Precipitation With Wasserstein Regularized Diffusion
  Y. Liu, J. Doss-Gollin, Q. Dai, A. Veeraraghavan, and G. Balakrishnan, "Downscaling Extreme Precipitation With Wasserstein Regularized Diffusion," IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–11, 2025.
- [3] Designsafe: New cyberinfrastructure for natural hazards engineering
  E. M. Rathje, C. Dawson, J. E. Padgett, J.-P. Pinelli, D. Stanzione, A. Adair, P. Arduino, S. J. Brandenberg, T. Cockerill, C. Dey et al., "Designsafe: New cyberinfrastructure for natural hazards engineering," Natural Hazards Review, vol. 18, no. 3, p. 06017001, 2017.
- [4] Debris segmentation using post-hurricane aerial imagery
  K. Amini, Y. Liu, J. E. Padgett, G. Balakrishnan, and A. Veeraraghavan, "Debris segmentation using post-hurricane aerial imagery," Computer-Aided Civil and Infrastructure Engineering, vol. 40, no. 25, pp. 4116–4131.
- [5] Air pollution and cardiovascular disease: JACC state-of-the-art review
  S. Rajagopalan, S. G. Al-Kindi, and R. D. Brook, "Air pollution and cardiovascular disease: JACC state-of-the-art review," Journal of the American College of Cardiology, vol. 72, no. 17, pp. 2054–2070, 2018.
- [6] Environmental determinants of cardiovascular disease: lessons learned from air pollution
  S. G. Al-Kindi, R. D. Brook, S. Biswal, and S. Rajagopalan, "Environmental determinants of cardiovascular disease: lessons learned from air pollution," Nature Reviews Cardiology, vol. 17, no. 10, pp. 656–672, 2020.
- [7] General Geospatial Inference with a Population Dynamics Foundation Model
  M. Agarwal, M. Sun, C. Kamath, A. Muslim, P. Sarker, J. Paul, H. Yee, M. Sieniek, K. Jablonski, Y. Mayer, D. Fork, S. de Guia, J. McPike, A. Boulanger, T. Shekel, D. Schottlander, Y. Xiao, M. C. Manukonda, Y. Liu, N. Bulut, S. Abu-el-haija, B. Perozzi, M. Bharel, V. Nguyen, L. Barrington, N. Efron, Y. Matias, G. Corrado, K. Eswaran, S. Prabhakara, S....
- [8] Google Earth Engine: Planetary-scale geospatial analysis for everyone
  N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, "Google Earth Engine: Planetary-scale geospatial analysis for everyone," Remote Sensing of Environment, vol. 202, pp. 18–27, Dec. 2017.
- [9] RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling
  X. Chen, K. Feng, N. Liu, B. Ni, Y. Lu, Z. Tong, and Z. Liu, "RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling," Advances in Neural Information Processing Systems, vol. 35, pp. 9797–9812, Dec. 2022.
- [10] M. Veillette, S. Samsi, and C. Mattioli, "SEVIR: A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology," in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 22009–22019.
- [11] Computer vision uncovers predictors of physical urban change
  N. Naik, S. D. Kominers, R. Raskar, E. L. Glaeser, and C. A. Hidalgo, "Computer vision uncovers predictors of physical urban change," Proceedings of the National Academy of Sciences, vol. 114, no. 29, pp. 7571–7576, Jul. 2017.
- [12] A land use and land cover classification system for use with remote sensor data
  J. R. Anderson, E. E. Hardy, J. T. Roach, and R. E. Witmer, "A land use and land cover classification system for use with remote sensor data," USGS Numbered Series 964, 1976.
- [13] C. F. Brown, M. R. Kazmierski, V. J. Pasquarella, W. J. Rucklidge, M. Samsikova, C. Zhang, E. Shelhamer, E. Lahera, O. Wiles, S. Ilyushchenko, N. Gorelick, L. L. Zhang, S. Alj, E. Schechter, S. Askay, O. Guinan, R. Moore, A. Boukouvalas, and P. Kohli, "AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse ...
- [14] SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
  K. Klemmer, E. Rolf, C. Robinson, L. Mackey, and M. Rußwurm, "SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery," Apr. 2024.
- [15] Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications
  D. Szwarcman, S. Roy, P. Fraccaro, Þ. E. Gíslason, B. Blumenstiel, R. Ghosal, P. H. de Oliveira, J. L. d. S. Almeida, R. Sedona, Y. Kang, S. Chakraborty, S. Wang, C. Gomes, A. Kumar, M. Truong, D. Godwin, H. Lee, C.-Y. Hsu, A. A. Asanjan, B. Mujeci, D. Shidham, T. Keenan, P. Arevalo, W. Li, H. Alemohammad, P. Olofsson, C. Hain, R. Kennedy, B. Zadrozny, ...
- [16] OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
  H. Herzog, F. Bastani, Y. Zhang, G. Tseng, J. Redmon, H. Sablon, R. Park, J. Morrison, A. Buraczynski, K. Farley, J. Hansen, A. Howe, P. A. Johnson, M. Otterlee, T. Schmitt, H. Pitelka, S. Daspit, R. Ratner, C. Wilhelm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema, "OlmoEarth: Stable Latent Image Modeling for Multim...
- [17] TerraMind: Large-Scale Generative Multimodality for Earth Observation
  J. Jakubik, F. Yang, B. Blumenstiel, E. Scheurer, R. Sedona, S. Maurogiovanni, J. Bosmans, N. Dionelis, V. Marsocci, N. Kopp, R. Ramachandran, P. Fraccaro, T. Brunschwiler, G. Cavallaro, J. Bernabe-Moreno, and N. Longépé, "TerraMind: Large-Scale Generative Multimodality for Earth Observation," Apr. 2025.
- [18] X. Guo, J. Lao, B. Dang, Y. Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang, K. Wu, D. Hu, H. He, J. Wang, J. Chen, M. Yang, Y. Zhang, and Y. Li, "SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp....
- [19] SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
  Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. B. Lobell, and S. Ermon, "SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery," in Advances in Neural Information Processing Systems, Oct. 2022.
- [20] Masked Autoencoders Are Scalable Vision Learners
  K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked Autoencoders Are Scalable Vision Learners," Dec. 2021.
- [21] About the U.S. Climate Vulnerability Index
  Environmental Defense Fund and Texas A&M University, "About the U.S. Climate Vulnerability Index," https://climatevulnerabilityindex.org/about/, 2023, accessed: 2026-04-20.
- [22] S. Ebrahimi, S. O. Arik, Y. Dong, and T. Pfister, "LANISTR: Multimodal learning from structured and unstructured data," arXiv:2305.16556, 2023.
- [23] TIP: Tabular-image pre-training for multimodal classification with incomplete data
  S. Du, S. Zheng, Y. Wang, W. Bai, D. P. O'Regan, and C. Qin, "TIP: Tabular-image pre-training for multimodal classification with incomplete data," in Computer Vision – ECCV 2024, 2024, pp. 478–496.
- [24] TabTransformer: Tabular Data Modeling Using Contextual Embeddings
  X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, "TabTransformer: Tabular data modeling using contextual embeddings," arXiv:2012.06678, 2020.
- [25] Revisiting deep learning models for tabular data
  Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, "Revisiting deep learning models for tabular data," in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 18932–18943. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/9d86d83f925f2149e9edb0ac3b49229c-Abstract.html
- [26] Best of both worlds: Multimodal contrastive learning with tabular and imaging data
  P. Hager, M. J. Menten, and D. Rueckert, "Best of both worlds: Multimodal contrastive learning with tabular and imaging data," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23924–23935.
- [27] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," Jun. 2021.
- [28] Train short, test long: Attention with linear biases enables input length extrapolation
  O. Press, N. A. Smith, and M. Lewis, "Train short, test long: Attention with linear biases enables input length extrapolation," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=R8sQPpGCv0
- [29] A Simple Framework for Contrastive Learning of Visual Representations
  T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A Simple Framework for Contrastive Learning of Visual Representations," Jul. 2020.
- [30] Emerging Properties in Self-Supervised Vision Transformers
  M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging Properties in Self-Supervised Vision Transformers," May 2021.
- [31] Scikit-learn: Machine learning in Python
  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
- [32] National Vital Statistics System, Mortality 1999–2020 on CDC WONDER Online Database
  Centers for Disease Control and Prevention, National Center for Health Statistics, "National Vital Statistics System, Mortality 1999–2020 on CDC WONDER Online Database," http://wonder.cdc.gov/mcd-icd10.html, 2021, data are from the Multiple Cause of Death Files, 1999–2020, as compiled from data provided by the 57 vital statistics jurisdictions through th...
- [33] E. Chuvieco, M. L. Pettinari, J. Lizundia-Loiola, T. Storm, and M. Padilla Parellada, "ESA Fire Climate Change Initiative (Fire_cci): MODIS Fire_cci Burned Area Pixel product, version 5.1," https://catalogue.ceda.ac.uk/uuid/58f00d8814064b79a0c49662ad3af537/, 2018, published: 2018-11-01.