Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation
Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3
The pith
Explicitly coupling building height and footprint estimation via morphology guidance improves height accuracy over independent or standard multi-task approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MorphoFormer encodes cross-task coupling between building height and footprint using a BF-Guided Task Decoder that applies cross-attention from footprint morphology to gate the height branch, plus a Morphology Consistency Loss that trains a height surrogate from footprint features against ground-truth heights. On a 54-city dataset this lowers building height RMSE from 3.39 m to 3.15 m and raises R² from 0.62 to 0.67, while footprint R² stays at 0.80. Ablations at matched capacity attribute most of the 0.24 m improvement to the two mechanisms: removing either one raises height RMSE by 0.11 m.
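For readers less familiar with the two metrics quoted in the core claim, the sketch below computes RMSE and the coefficient of determination on a toy array. The function names `rmse` and `r2_score` and the sample values are illustrative assumptions; the paper's actual evaluation protocol (masking, aggregation unit) is not described in this summary.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same unit as the inputs (metres here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy heights in metres (made up for illustration, not data from the paper).
heights_true = np.array([3.0, 6.0, 9.0, 12.0])
heights_pred = np.array([4.0, 5.0, 10.0, 11.0])

print(rmse(heights_true, heights_pred))
print(r2_score(heights_true, heights_pred))
```

Because RMSE is in metres and R² is unitless, the two numbers answer different questions: the first gives the typical error size, the second gives the share of height variance explained.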
What carries the argument
MorphoFormer framework with BF-Guided Task Decoder (cross-attention gating of height branch by footprint morphology context) and Morphology Consistency Loss (supervising height-from-footprint surrogate).
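One plausible reading of "cross-attention gating of the height branch by footprint morphology context" is sketched below in NumPy. This is not the authors' implementation: the single-head attention, the sigmoid gate, the weight names (`w_q`, `w_k`, `w_v`, `w_g`), and all tensor sizes are assumptions made only to illustrate the idea.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bf_guided_gate(height_feat, bf_context, w_q, w_k, w_v, w_g):
    """Gate height tokens with a footprint-derived morphology context.

    height_feat: (N, d) height-branch tokens (queries).
    bf_context:  (M, d) footprint morphology tokens (keys and values).
    """
    q = height_feat @ w_q                                      # (N, d)
    k = bf_context @ w_k                                       # (M, d)
    v = bf_context @ w_v                                       # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)    # (N, M)
    morph = attn @ v                                           # (N, d)
    gate = sigmoid(morph @ w_g)                                # (N, d), in (0, 1)
    return height_feat * gate, gate

# Hypothetical sizes and random weights, for shape checking only.
rng = np.random.default_rng(0)
d, n_height, n_bf = 8, 16, 4
height_feat = rng.normal(size=(n_height, d))
bf_context = rng.normal(size=(n_bf, d))
w_q, w_k, w_v, w_g = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

gated, gate = bf_guided_gate(height_feat, bf_context, w_q, w_k, w_v, w_g)
print(gated.shape, float(gate.min()), float(gate.max()))
```

In this sketch the gate can only attenuate height features channel by channel; whether the paper's decoder uses a multiplicative gate, a residual path, or multi-head attention is not stated in this summary.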
Load-bearing premise
That the proposed consistency loss and guided decoder compel footprint features to capture genuine height-related morphology instead of training-set-specific correlations.
What would settle it
Observing that the height accuracy improvement vanishes when the model is tested on cities whose building morphologies differ substantially from those in the training data.
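The test described above could be organized as a per-group comparison of the RMSE gain. The sketch below assumes each test city has already been labeled `"similar"` or `"dissimilar"` to the training morphology by some external criterion; that labeling, the helper name `mean_gain_by_group`, and every number are placeholders, not results from the paper.

```python
from collections import defaultdict

def mean_gain_by_group(records):
    """Average RMSE gain (baseline minus model) per morphology group.

    records: iterable of (group, baseline_rmse, model_rmse), one per test city.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, baseline_rmse, model_rmse in records:
        sums[group] += baseline_rmse - model_rmse
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}

# Placeholder values only; they illustrate the pattern that would undermine the claim.
records = [
    ("similar", 3.4, 3.1),
    ("similar", 3.2, 3.0),
    ("dissimilar", 4.0, 4.0),
    ("dissimilar", 4.2, 4.1),
]
gains = mean_gain_by_group(records)
print(gains)
```

A gain that stays positive in the dissimilar group would support the premise; a gain that collapses toward zero there would suggest the footprint features learned training-set-specific correlations.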
Original abstract
Building height (BH) and building footprint (BF) jointly describe the vertical and horizontal extent of the built environment and are required inputs for urban climate, disaster-risk, and population-mapping models. The two parameters are coupled through floor-area-ratio (FAR) constraints, yet remote-sensing approaches typically treat them as independent regression targets. We argue that explicitly encoding this cross-task coupling is more impactful than further refining individual encoders, and propose MorphoFormer, a joint BH/BF estimation framework built around two complementary mechanisms: (i) a BF-Guided Task Decoder (BGTD) that gates the height branch via cross-attention on a footprint-derived morphology context, and (ii) a Morphology Consistency Loss (MCL) that supervises a height-from-footprint surrogate against the ground-truth BH, indirectly forcing the BF feature to encode height-correlated structure. The encoder is a single-stage Swin backbone fed by Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs, trained and evaluated on a geo-blocked split of 54 cities. Against a Swin-MTL baseline at identical receptive field, MorphoFormer reduces BH test RMSE from 3.39 to 3.15 m (R² improves from 0.62 to 0.67) with BF R² stable at 0.80. Controlled ablations at identical capacity attribute most of this 0.24 m improvement to the two proposed mechanisms: removing BGTD raises BH RMSE by 0.11 m and removing MCL raises it by 0.11 m, with the residual approximately 0.02 m falling within the noise floor of encoder-side variations. Because both mechanisms act on cross-task representations rather than pixels, the design carries no intrinsic dependence on input resolution.
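The arithmetic behind the ablation claim in the abstract can be checked directly. The snippet uses only the figures quoted there; the variable names are illustrative. Summing the two leave-one-out deltas treats the mechanisms' effects as additive, which is an assumption of this check.

```python
# Figures quoted in the abstract (metres).
baseline_rmse = 3.39   # Swin-MTL baseline
full_rmse = 3.15       # MorphoFormer with both mechanisms
bgtd_delta = 0.11      # RMSE increase when BGTD is removed
mcl_delta = 0.11       # RMSE increase when MCL is removed

total_gain = round(baseline_rmse - full_rmse, 2)
# Assumption: the two leave-one-out deltas add up without interaction.
residual = round(total_gain - (bgtd_delta + mcl_delta), 2)
relative_gain = total_gain / baseline_rmse

print(total_gain)                 # 0.24
print(residual)                   # 0.02
print(round(relative_gain, 3))    # 0.071
```

The 0.24 m gain is therefore about a 7% relative reduction in height RMSE, and the two mechanisms account for roughly 0.22 m of it under the additive reading.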
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MorphoFormer, a joint framework for building height (BH) and building footprint (BF) estimation from Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs using a single-stage Swin encoder. It introduces two mechanisms to encode floor-area-ratio coupling: a BF-Guided Task Decoder (BGTD) that applies cross-attention gating from footprint-derived morphology to the height branch, and a Morphology Consistency Loss (MCL) that supervises a height surrogate derived from footprint features against ground-truth BH. On a geo-blocked split of 54 cities, MorphoFormer reduces BH test RMSE from 3.39 m to 3.15 m (R² from 0.62 to 0.67) while holding BF R² at 0.80; controlled ablations at fixed capacity attribute most of the 0.24 m gain to BGTD and MCL.
Significance. If the results hold, the work demonstrates that explicitly modeling cross-task physical coupling can improve remote-sensing regression accuracy without increasing encoder capacity or input resolution. The geo-blocked evaluation across disjoint cities and the fixed-capacity ablations provide direct evidence that gains arise from the morphology-guided mechanisms rather than data leakage or parameter count, strengthening the case for incorporating domain constraints like FAR into multi-task architectures for urban morphology estimation.
minor comments (3)
- The experimental section should report standard deviations or results across multiple random seeds for the RMSE and R² values, as the 0.24 m improvement and the 0.11 m ablation deltas are currently presented as point estimates.
- The exact mathematical definition of the Morphology Consistency Loss (including how the height surrogate is computed from BF features) would benefit from an explicit equation in the methods section to facilitate reproduction.
- Table or text describing the data split should state the precise number of cities (or samples) allocated to training, validation, and test sets under the geo-blocked protocol.
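To make the last comment concrete, the sketch below shows one way a city-level split could be made deterministic and reportable. It assumes whole cities are assigned to a single partition; the paper's geo-blocking may instead operate on spatial blocks within cities. The function names, the 15% fractions, and the `city_XX` identifiers are hypothetical.

```python
import hashlib

def assign_split(city, val_frac=0.15, test_frac=0.15):
    """Map a city name to 'train', 'val', or 'test' via a stable hash."""
    digest = hashlib.sha256(city.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # uniform-ish value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

def split_samples(samples):
    """samples: iterable of (city, sample_id); every city lands in one split."""
    splits = {"train": [], "val": [], "test": []}
    for city, sample_id in samples:
        splits[assign_split(city)].append((city, sample_id))
    return splits

# Hypothetical inventory: 54 cities with 3 samples each.
samples = [(f"city_{i:02d}", j) for i in range(54) for j in range(3)]
splits = split_samples(samples)

for name, items in splits.items():
    n_cities = len({city for city, _ in items})
    print(name, "cities:", n_cities, "samples:", len(items))
```

Printing per-split city and sample counts in this way gives exactly the numbers the comment asks the authors to report.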
Simulated Author's Rebuttal
We thank the referee for the positive summary and significance assessment of MorphoFormer, which correctly highlights the 0.24 m BH RMSE reduction, stable BF performance, geo-blocked 54-city evaluation, and fixed-capacity ablations. The recommendation for minor revision is noted. No specific major comments were provided in the report.
Circularity Check
No significant circularity
The paper's argument consists of proposing two explicit mechanisms (BGTD cross-attention gating and MCL surrogate supervision) whose impact is measured via controlled ablations on the held-out test portion of a geo-blocked split of 54 cities. The 0.24 m BH RMSE reduction is reported as an empirical outcome on spatially disjoint test data, not as a quantity obtained by fitting a parameter and then re-predicting a related statistic from the same fit. No equations are presented that equate a claimed prediction to its own input by construction, and no load-bearing premise relies on a self-citation chain or imported uniqueness theorem. The reported gain therefore rests on held-out measurement rather than on quantities the paper itself fitted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Cross-attention on footprint-derived morphology can usefully gate height features without introducing harmful bias
- domain assumption: A height-from-footprint surrogate loss will force footprint features to encode height-correlated structure
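The second assumption concerns a loss whose exact form is not given in this summary. The sketch below shows one hypothetical instantiation: a linear head maps footprint features to a height surrogate, which is penalized against ground-truth height and added to the task losses with a weight. The linear head, the squared-error form, the weight `lambda_mcl`, and all values are assumptions for illustration.

```python
import numpy as np

def morphology_consistency_loss(bf_feat, w_surrogate, b_surrogate, bh_true):
    """Mean squared error between a height surrogate and ground-truth height.

    bf_feat: (N, d) footprint-branch features; the surrogate is a linear readout.
    """
    surrogate = bf_feat @ w_surrogate + b_surrogate   # (N,)
    return float(np.mean((surrogate - bh_true) ** 2)), surrogate

def total_loss(bh_pred, bh_true, bf_pred, bf_true, mcl, lambda_mcl=0.1):
    """Task losses plus the weighted consistency term (weight is hypothetical)."""
    l_bh = np.mean((bh_pred - bh_true) ** 2)
    l_bf = np.mean((bf_pred - bf_true) ** 2)
    return float(l_bh + l_bf + lambda_mcl * mcl)

# Toy values chosen so the result is easy to verify by hand.
bf_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_surrogate = np.array([2.0, 3.0])
b_surrogate = 1.0
bh_true = np.array([3.0, 5.0, 6.0])

mcl, surrogate = morphology_consistency_loss(bf_feat, w_surrogate, b_surrogate, bh_true)
loss = total_loss(
    bh_pred=np.array([3.0, 5.0, 7.0]), bh_true=bh_true,
    bf_pred=np.array([0.2, 0.4, 0.6]), bf_true=np.array([0.2, 0.4, 0.6]),
    mcl=mcl,
)
print(surrogate, mcl, loss)
```

In a trained network the gradient of this term flows back into the footprint features, which is the route by which the loss is supposed to make them height-aware; whether that produces transferable structure is exactly what the assumption leaves open.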