GeoFormer: A Lightweight Swin Transformer for Joint Building Height and Footprint Estimation from Sentinel Imagery
Pith reviewed 2026-05-16 02:45 UTC · model grok-4.3
The pith
GeoFormer uses a lightweight Swin Transformer to jointly estimate building height and footprint from Sentinel data with fewer parameters and higher accuracy than CNN baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoFormer achieves a building height RMSE of 3.19 m and competitive footprint accuracy with only 0.32 M parameters by replacing convolutional layers with windowed local attention in a multi-task framework; this outperforms the best CNN baseline (UNet) by 7.5 percent and maintains sub-3.5 m RMSE in cross-continent transfer tests without region-specific fine-tuning.
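As a rough reading of the headline numbers, the implied baseline error can be worked out under one assumption: that the 7.5 percent figure is a relative reduction in height RMSE against UNet. The source does not spell out that definition, and the UNet value below is inferred, not reported.

```latex
\[
\frac{\mathrm{RMSE}_{\mathrm{UNet}} - \mathrm{RMSE}_{\mathrm{GeoFormer}}}{\mathrm{RMSE}_{\mathrm{UNet}}} = 0.075
\quad\Longrightarrow\quad
\mathrm{RMSE}_{\mathrm{UNet}} = \frac{3.19\ \mathrm{m}}{1 - 0.075} \approx 3.45\ \mathrm{m}
\]
```

Under that assumption the absolute gap between the two models is about 0.26 m.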
What carries the argument
A lightweight Swin Transformer backbone with windowed self-attention, paired with a multi-task regression head that jointly outputs building height and footprint on a 100 m grid from fused Sentinel and DEM inputs.
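To make "windowed self-attention" concrete, here is a minimal pure-Python sketch of attention restricted to non-overlapping windows. It is not GeoFormer's implementation: the function names are hypothetical, the query, key, and value projections are taken to be the identity, and the shifted windows, multiple heads, and relative position bias of the Swin Transformer are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(tokens):
    """Scaled dot-product self-attention among the tokens of one window.

    Assumption: identity Q/K/V projections (a real model learns these).
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return attended


def windowed_self_attention(grid, window):
    """Apply attention independently inside each window x window tile.

    grid is an H x W nested list of feature vectors; H and W must be
    multiples of window. Tokens never attend across tile boundaries.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    out = [[None] * width for _ in range(height)]
    for r0 in range(0, height, window):
        for c0 in range(0, width, window):
            coords = [
                (r, c)
                for r in range(r0, r0 + window)
                for c in range(c0, c0 + window)
            ]
            attended = window_attention([grid[r][c] for r, c in coords])
            for (r, c), vec in zip(coords, attended):
                out[r][c] = vec
    return out
```

Because each token only attends within its own tile, the cost grows with the number of windows rather than with the square of the full grid size. That locality is the property the summary above contrasts with convolution.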
If this is right
- A 5 by 5 (500 m) receptive field proves optimal for scene-level building parameter retrieval.
- DEM data is indispensable for height accuracy while multispectral reflectance supplies the dominant signal for footprint prediction.
- The model’s low parameter count allows deployment on modest hardware for repeated global mapping updates.
- Cross-continent transfer without fine-tuning supports production of consistent worldwide urban morphology layers.
- Ablation results indicate that further gains are unlikely from simply enlarging the context window or model capacity.
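The 5 by 5 (500 m) context mentioned in the first item can be illustrated with a short sketch. It assumes the context window is counted in 100 m grid cells centred on the target cell; the function names and the handling of border cells are hypothetical and not taken from the paper.

```python
CELL_SIZE_M = 100  # output grid resolution stated in the source


def context_extent_m(window_cells, cell_size_m=CELL_SIZE_M):
    """Side length in metres covered by a square context window."""
    return window_cells * cell_size_m


def extract_context(grid, row, col, window_cells=5):
    """Return the window_cells x window_cells patch centred on (row, col).

    Returns None when the window would extend past the grid border
    (an assumption; padding would be another reasonable choice).
    """
    if window_cells % 2 == 0:
        raise ValueError("window_cells must be odd so the target cell is centred")
    half = window_cells // 2
    if row - half < 0 or col - half < 0:
        return None
    if row + half >= len(grid) or col + half >= len(grid[0]):
        return None
    return [
        grid_row[col - half:col + half + 1]
        for grid_row in grid[row - half:row + half + 1]
    ]
```

For example, `context_extent_m(5)` gives 500, matching the 500 m figure quoted above.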
Where Pith is reading between the lines
- The same architecture could be adapted to estimate additional urban parameters such as building volume or material type with minimal extra cost.
- Public release of the global product enables immediate integration into existing climate and disaster models that currently lack fine-scale building data.
- If Sentinel data streams continue, periodic re-runs of the model could track urban expansion and height changes over time at low computational expense.
- The efficiency advantage may extend to other remote-sensing regression tasks where labeled data are sparse but multi-modal satellite inputs are abundant.
Load-bearing premise
The geo-blocked split across 54 cities is assumed to deliver strict spatial independence plus enough morphological variety for the model to generalize globally without any further training.
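A geo-blocked split can be sketched as follows. The source does not give the block size, the blocking geometry, or the split ratio, so the 0.1 degree blocks, the 20 percent test fraction, and all names here are assumptions made for illustration only.

```python
import math
import random


def block_id(lon, lat, block_deg):
    """Index of the square lon/lat block that contains a sample."""
    return (math.floor(lon / block_deg), math.floor(lat / block_deg))


def geo_blocked_split(samples, block_deg=0.1, test_fraction=0.2, seed=0):
    """Split samples so that no spatial block contributes to both sets.

    samples is a list of dicts with "lon" and "lat" keys (hypothetical
    schema). Whole blocks, not individual samples, are assigned to test.
    """
    blocks = sorted({block_id(s["lon"], s["lat"], block_deg) for s in samples})
    if len(blocks) < 2:
        raise ValueError("need at least two blocks to form a split")
    rng = random.Random(seed)
    rng.shuffle(blocks)
    n_test = min(max(1, round(len(blocks) * test_fraction)), len(blocks) - 1)
    test_blocks = set(blocks[:n_test])
    train, test = [], []
    for s in samples:
        in_test = block_id(s["lon"], s["lat"], block_deg) in test_blocks
        (test if in_test else train).append(s)
    return train, test
```

Note that adjacent blocks still share a boundary, so a split like this removes overlap but not necessarily spatial autocorrelation. Whether it is "strict" in the sense claimed depends on details the source does not report.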
What would settle it
Repeating the evaluation on a fresh collection of cities outside the original 54 and finding that GeoFormer’s height RMSE exceeds that of a retrained UNet baseline by more than 0.2 m.
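The decision rule above can be written down as a small check. This is a sketch only: the function names are hypothetical, and the predictions would have to come from running both models on the new cities.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences (metres)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def premise_fails(geoformer_rmse, unet_rmse, margin_m=0.2):
    """True when GeoFormer is worse than the retrained UNet by more than margin_m."""
    return geoformer_rmse - unet_rmse > margin_m
```

The 0.2 m margin is the threshold named in the text; any other value passed as `margin_m` would be a different test.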
Original abstract
Building height (BH) and footprint (BF) are fundamental urban morphological parameters required by climate modelling, disaster-risk assessment, and population mapping, yet globally consistent data remain scarce. In this work, we develop GeoFormer, a lightweight Swin Transformer-based multi-task learning framework that jointly estimates BH and BF on a 100 m grid using only open-access Sentinel-1 SAR, Sentinel-2 multispectral, and DEM data. A geo-blocked data-splitting strategy enforces strict spatial independence between training and evaluation regions across 54 morphologically diverse cities. We set representative CNN baselines (ResNet, UNet, SENet) as benchmarks and thoroughly evaluate GeoFormer's prediction accuracy, computational efficiency, and spatial transferability. Results show that GeoFormer achieves a BH RMSE of 3.19 m with only 0.32 M parameters -- outperforming the best CNN baseline (UNet) by 7.5% -- indicating that windowed local attention is more effective than convolution for scene-level building-parameter retrieval. Systematic ablation on context window size, model capacity, and input modality further reveals that a 5x5 (500 m) receptive field is optimal, DEM is indispensable for height estimation, and multispectral reflectance carries the dominant predictive signal. Cross-continent transfer tests confirm BH RMSE below 3.5 m without region-specific fine-tuning. All code, model weights, and the resulting global product are publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoFormer, a lightweight Swin Transformer-based multi-task framework for joint building height (BH) and footprint (BF) estimation on a 100 m grid from Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs. It employs a geo-blocked split across 54 cities to enforce spatial independence, reports a BH RMSE of 3.19 m with 0.32 M parameters (7.5 % better than UNet), provides ablations on context window size, capacity, and modalities, and shows cross-continent transfer with RMSE below 3.5 m, while releasing all code, weights, and the global product.
Significance. If the performance and generalization claims hold, the work would be significant for delivering an efficient, publicly available model that improves upon CNN baselines for global-scale urban morphology retrieval using only open satellite data, with direct utility for climate modeling, disaster risk, and population mapping; the ablation results on receptive field and input modalities also provide useful insight into attention mechanisms for remote-sensing regression tasks.
major comments (2)
- [Data-splitting section] The claim that the geo-blocked strategy across 54 cities 'enforces strict spatial independence' is load-bearing for the cross-continent transfer results (RMSE < 3.5 m) and the interpretation that windowed attention enables global generalization; however, no quantitative validation (e.g., Earth-mover distance or nearest-neighbor similarity on morphological histograms of building density/height) is supplied to confirm the absence of leakage.
- [Results section] Performance table: the reported 7.5 % improvement over UNet and the headline BH RMSE of 3.19 m are presented without error bars, confidence intervals, or statistical significance tests, which are required to substantiate that the gain is robust rather than attributable to run-to-run variance.
minor comments (1)
- [Abstract] The joint multi-task architecture (shared backbone vs. separate heads) and the precise definition of the 100 m output grid are not stated explicitly; stating them would aid reproducibility.
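The distribution comparison requested in the first major comment could be sketched as a one-dimensional Earth Mover's Distance between binned building-height values from the training and test partitions. The binning scheme and function names are assumptions; the paper's own procedure, if any, is not described in the source.

```python
def histogram(values, lo, bin_width, n_bins):
    """Count values into n_bins equal-width bins starting at lo.

    Values outside the range are clamped into the first or last bin.
    """
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) // bin_width)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return counts


def emd_1d(hist_a, hist_b, bin_width):
    """Earth Mover's Distance between two histograms on the same bins.

    Both histograms are normalised to unit mass; for 1-D data the distance
    equals the summed absolute gap between the two cumulative distributions,
    scaled by the bin width (so the result is in the units of bin_width).
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must share the same bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    cdf_gap = 0.0
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        cdf_gap += a / total_a - b / total_b
        distance += abs(cdf_gap) * bin_width
    return distance
```

A small distance shows that the partitions have similar height distributions. It says nothing by itself about spatial leakage, so it would complement rather than replace a check on the split geometry.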
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below and will revise the paper accordingly where appropriate.
- Referee: [Data-splitting section] The claim that the geo-blocked strategy across 54 cities 'enforces strict spatial independence' is load-bearing for the cross-continent transfer results (RMSE < 3.5 m) and the interpretation that windowed attention enables global generalization; however, no quantitative validation (e.g., Earth-mover distance or nearest-neighbor similarity on morphological histograms of building density/height) is supplied to confirm the absence of leakage.
  Authors: We agree that quantitative validation would further substantiate the spatial independence claim. In the revised manuscript we will add an analysis of morphological feature distributions (building density and height histograms) between the training and test partitions, reporting Earth Mover's Distance and nearest-neighbor similarity scores. The geo-blocked split across 54 cities was constructed to eliminate any spatial overlap, but the additional metrics will provide empirical confirmation of minimal leakage.
  Revision: yes
- Referee: [Results section] Performance table: the reported 7.5 % improvement over UNet and the headline BH RMSE of 3.19 m are presented without error bars, confidence intervals, or statistical significance tests, which are required to substantiate that the gain is robust rather than attributable to run-to-run variance.
  Authors: We concur that error bars and statistical tests are necessary to demonstrate robustness. In the revised version we will report standard deviations computed over five independent training runs with different random seeds, add 95% confidence intervals to the performance table, and include paired t-test p-values comparing GeoFormer against the UNet baseline to establish that the 7.5% improvement is statistically significant.
  Revision: yes
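The statistics promised in the second response could be computed along these lines. This is a sketch using only the Python standard library; it assumes exactly five seeds, so the two-sided 95% critical value of Student's t with 4 degrees of freedom (2.776) is hard-coded. Any per-seed RMSE values passed in would be the authors' own; none are reported in the source.

```python
import math
import statistics

N_SEEDS = 5
T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 d.o.f.


def mean_std_ci95(rmse_per_seed):
    """Mean, sample standard deviation, and 95% CI over five seed runs."""
    if len(rmse_per_seed) != N_SEEDS:
        raise ValueError("this sketch hard-codes the critical value for 5 runs")
    mean = statistics.mean(rmse_per_seed)
    std = statistics.stdev(rmse_per_seed)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(N_SEEDS)
    return mean, std, (mean - half_width, mean + half_width)


def paired_t_statistic(model_a, model_b):
    """Paired t statistic for per-seed RMSE differences (a minus b)."""
    if len(model_a) != len(model_b) or len(model_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [a - b for a, b in zip(model_a, model_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


def significant_at_5_percent(t_statistic):
    """Two-sided test at the 0.05 level, valid for five paired runs only."""
    return abs(t_statistic) > T_CRIT_95_DF4
```

An exact p-value needs the t-distribution CDF, which the standard library does not provide; a statistics package would be needed for that step. Pairing also only makes sense if both models are trained with matched seeds and splits, which is an assumption here.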
Circularity Check
No circularity: empirical training and held-out geographic evaluation are self-contained
The manuscript presents GeoFormer as an empirical multi-task model (Swin-Transformer backbone with standard training on Sentinel-1/2 + DEM inputs). All headline numbers (BH RMSE 3.19 m, 7.5 % gain over UNet, cross-continent transfer < 3.5 m) are obtained by fitting on geo-blocked training folds and measuring on held-out city blocks. No derivation, uniqueness theorem, or ansatz is invoked that reduces the reported performance to fitted parameters by construction. Any citations to the original Swin Transformer paper are to an independent, externally published architecture and do not bear the load of the accuracy claims. The evaluation therefore remains falsifiable against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- context_window_size
- model_capacity
axioms (2)
- domain assumption: Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs contain sufficient signal for building height and footprint at 100 m resolution
- domain assumption: Geo-blocked splitting across 54 cities produces training and test sets that are spatially independent and morphologically representative