
arxiv: 2605.04731 · v1 · submitted 2026-05-06 · 💻 cs.CV

Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation

Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords: building height, building footprint, remote sensing, multi-task learning, morphology consistency, cross-attention, joint estimation, urban morphology

The pith

Explicitly coupling building height and footprint estimation via morphology guidance improves height accuracy over independent or standard multi-task approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Building height and footprint describe the vertical and horizontal extent of the built environment and are linked through floor-area-ratio constraints, yet remote-sensing methods usually estimate them separately. The paper reports that two coupling mechanisms deliver better height estimates while preserving footprint quality: a decoder that guides height prediction with footprint-derived morphology context, and a loss that supervises a height estimate derived from footprint features against ground-truth heights. This matters for applications such as urban climate modeling, disaster risk assessment, and population mapping, which rely on accurate 3D building data from satellite imagery. Ablations show that the coupling accounts for most of the gain over a baseline with the same encoder. The design uses a shared Swin backbone on Sentinel-1, Sentinel-2, and DEM inputs and is evaluated on a geo-blocked split of 54 cities.
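The floor-area-ratio link can be written out with standard urban-morphology definitions. This is a sketch under stated assumptions, not the paper's formulation: FAR is evaluated per grid cell of area A_cell rather than per lot, building i has footprint area A_i and height H_i, and the storey height h_s is taken as uniform. All symbols except λ_p (which appears in the Figure 7 caption) are introduced here for illustration.

```latex
\lambda_p = \frac{\sum_i A_i}{A_{\mathrm{cell}}},
\qquad
\bar{H} = \frac{\sum_i A_i H_i}{\sum_i A_i},
\qquad
\mathrm{FAR} = \frac{\sum_i A_i\, n_i}{A_{\mathrm{cell}}}
\approx \frac{\sum_i A_i H_i}{h_s\, A_{\mathrm{cell}}}
= \lambda_p \, \frac{\bar{H}}{h_s},
\quad n_i \approx \frac{H_i}{h_s}.
```

Under these assumptions a cap on FAR bounds the product of footprint fraction and footprint-weighted mean height, which is one way to read the claim that the two targets are not independent.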

Core claim

MorphoFormer encodes cross-task coupling between building height and footprint using a BF-Guided Task Decoder that applies cross-attention from footprint morphology to gate the height branch, plus a Morphology Consistency Loss that trains a height surrogate from footprint features against ground-truth heights. On a 54-city dataset this lowers building height RMSE from 3.39 m to 3.15 m and raises R-squared from 0.62 to 0.67, while footprint R-squared stays at 0.80. Ablations at matched capacity attribute the 0.24 m improvement mainly to the two mechanisms.
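The gating idea can be illustrated with a minimal, dependency-free sketch. It is not the paper's BGTD: the learned query, key, value, and gate projections are omitted (identity projections are assumed), and the function name `bf_guided_gate` and the toy tokens are hypothetical. Height tokens attend over footprint tokens, and the attended context is squashed by a sigmoid into a per-channel gate on the height features.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bf_guided_gate(height_tokens, footprint_tokens):
    """Gate each height token with a footprint-derived context (illustrative only).

    Assumption: identity projections stand in for learned Q/K/V/gate weights.
    """
    d = len(height_tokens[0])
    gated, gates = [], []
    for h in height_tokens:
        # Scaled dot-product scores of this height token against every footprint token.
        scores = [sum(a * b for a, b in zip(h, f)) / math.sqrt(d) for f in footprint_tokens]
        weights = softmax(scores)
        # Attention-weighted footprint context, one value per channel.
        context = [sum(w * f[j] for w, f in zip(weights, footprint_tokens)) for j in range(d)]
        # Sigmoid turns the context into a gate in (0, 1).
        gate = [1.0 / (1.0 + math.exp(-c)) for c in context]
        gates.append(gate)
        gated.append([g * x for g, x in zip(gate, h)])
    return gated, gates

# Toy inputs (hypothetical): 2 height tokens, 3 footprint tokens, 2 channels.
height_tokens = [[1.0, 0.0], [0.0, 1.0]]
footprint_tokens = [[2.0, 0.0], [0.0, -2.0], [1.0, 1.0]]
gated, gates = bf_guided_gate(height_tokens, footprint_tokens)
```

The point of the sketch is the direction of information flow: footprint features modulate the height branch, not the reverse.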

What carries the argument

MorphoFormer framework with BF-Guided Task Decoder (cross-attention gating of height branch by footprint morphology context) and Morphology Consistency Loss (supervising height-from-footprint surrogate).
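The consistency loss can be sketched in the same spirit. The paper's exact definition is not given on this page, so everything below is an assumption: a linear surrogate head on the footprint features, mean squared error for every term, and a weight `lam` on the consistency term. The names `surrogate_height` and `total_loss` are hypothetical.

```python
def mse(pred, target):
    # Mean squared error between two equal-length lists.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def surrogate_height(bf_features, weights, bias):
    # Hypothetical linear head: maps each footprint feature vector to a height in metres.
    return [sum(w * x for w, x in zip(weights, feat)) + bias for feat in bf_features]

def total_loss(bh_pred, bf_pred, bh_true, bf_true, bf_features, weights, bias, lam=0.1):
    # Ordinary multi-task regression terms for the two heads.
    task = mse(bh_pred, bh_true) + mse(bf_pred, bf_true)
    # Consistency term: height predicted from footprint features vs. ground-truth height.
    consistency = mse(surrogate_height(bf_features, weights, bias), bh_true)
    return task + lam * consistency, consistency

# Toy batch of two cells (placeholder numbers, not from the paper).
bf_features = [[0.25, 1.0], [0.5, 3.0]]
loss, consistency = total_loss(
    bh_pred=[6.0, 10.0], bf_pred=[0.25, 0.5],
    bh_true=[5.0, 10.0], bf_true=[0.25, 0.5],
    bf_features=bf_features, weights=[4.0, 2.0], bias=1.0,
)
```

Because the consistency gradient flows only through the footprint features, minimizing it pushes those features to carry height-correlated information, which is the effect the summary attributes to the loss.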

Load-bearing premise

That the proposed consistency loss and guided decoder compel footprint features to capture genuine height-related morphology instead of training-set-specific correlations.

What would settle it

Observing that the height accuracy improvement vanishes when the model is tested on cities whose building morphologies differ substantially from those in the training data.
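One way to run that check is to compare the RMSE gain of the coupled model over the baseline separately for familiar and unfamiliar morphologies. The sketch below assumes each test cell carries a group label such as "seen" or "unseen"; how to assign those labels is left open, and the function names and toy values are hypothetical placeholders, not results.

```python
import math
from collections import defaultdict

def rmse_by_group(y_true, y_pred, groups):
    # Collect squared errors per group, then take the root of each group's mean.
    sq = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        sq[g].append((t - p) ** 2)
    return {g: math.sqrt(sum(v) / len(v)) for g, v in sq.items()}

def gain_by_group(y_true, pred_baseline, pred_model, groups):
    # Positive values mean the coupled model has lower RMSE than the baseline.
    base = rmse_by_group(y_true, pred_baseline, groups)
    model = rmse_by_group(y_true, pred_model, groups)
    return {g: base[g] - model[g] for g in base}

# Toy example: the model helps on "seen" morphologies only.
y_true = [10.0, 20.0, 30.0, 40.0]
groups = ["seen", "seen", "unseen", "unseen"]
pred_baseline = [13.0, 24.0, 33.0, 44.0]
pred_model = [11.0, 21.0, 33.0, 44.0]
gain = gain_by_group(y_true, pred_baseline, pred_model, groups)
```

A gain near zero in the "unseen" group, alongside a clear gain in the "seen" group, would be the pattern that undermines the load-bearing premise.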

Figures

Figures reproduced from arXiv: 2605.04731 by HongSik Yun, JinByeong Lee, Jinzhen Han, JiSung Kim.

Figure 1: Overview of the proposed MorphoFormer framework. The BGTD module (highlighted) and the Morphology Consistency Loss (L_MCL) jointly operationalize the cross-task coupling described in Section 4.
Figure 2: A 90 × 90 input scene from the test split (San Francisco, coastline). Panels left to right: Sentinel-1 VV/VH average, Sentinel-2 RGB, Sentinel-2 NIR, SRTM DEM, and the validity mask. Dashed red lines mark cell boundaries; the regression-target centre cell is outlined in solid red.
Figure 3: GeoSplit assignment for three cities of contrasting morphology and extent. Each…
Figure 4: The (BH, BF) coupling on the training split. (a) Joint hexbin of…
Figure 5: Architecture of MorphoFormer. Encoder pipeline (Section 4.1) on the left;…
Figure 6: Predicted-vs-ground-truth hexbin densities on the test split for MorphoFormer…
Figure 7: Stratification of test-set BH RMSE by λ_p bin. (a) BH RMSE per bin for MorphoFormer (full) and the BGTD/MCL ablations; the dense-urban tail (λ_p > 0.55) is dominated by a small population of high-rise cells with large absolute residuals. (b) Per-bin BH-RMSE increase upon removing each mechanism…
Figure 8: BGTD cross-gate activation on the test split. (a) Per-sample mean gate value…
Figure 9: Behaviour of the height-from-footprint surrogate on the test split. (a) Hexbin…
Figure 10: Error origins for MorphoFormer on the test split, binned over…
Original abstract

Building height (BH) and building footprint (BF) jointly describe the vertical and horizontal extent of the built environment and are required inputs for urban climate, disaster-risk, and population-mapping models. The two parameters are coupled through floor-area-ratio (FAR) constraints, yet remote-sensing approaches typically treat them as independent regression targets. We argue that explicitly encoding this cross-task coupling is more impactful than further refining individual encoders, and propose MorphoFormer, a joint BH/BF estimation framework built around two complementary mechanisms: (i) a BF-Guided Task Decoder (BGTD) that gates the height branch via cross-attention on a footprint-derived morphology context, and (ii) a Morphology Consistency Loss (MCL) that supervises a height-from-footprint surrogate against the ground-truth BH, indirectly forcing the BF feature to encode height-correlated structure. The encoder is a single-stage Swin backbone fed by Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs, trained and evaluated on a geo-blocked split of 54 cities. Against a Swin-MTL baseline at identical receptive field, MorphoFormer reduces BH test RMSE from 3.39 to 3.15 m (R^2 improves 0.62 -> 0.67) with BF R^2 stable at 0.80. Controlled ablations at identical capacity attribute most of this 0.24 m improvement to the two proposed mechanisms: removing BGTD raises BH RMSE by 0.11 m and removing MCL raises it by 0.11 m, with the residual approximately 0.02 m falling within the noise floor of encoder-side variations. Because both mechanisms act on cross-task representations rather than pixels, the design carries no intrinsic dependence on input resolution.
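The accounting in the abstract can be checked with a few lines of arithmetic. The four input numbers come from the abstract; the variable names are illustrative, and treating the two ablation deltas as additive is the abstract's own framing, not something ablations guarantee in general (the mechanisms could interact).

```python
# Values reported in the abstract (BH test RMSE, metres).
baseline_rmse = 3.39   # Swin-MTL baseline
full_rmse = 3.15       # MorphoFormer
delta_no_bgtd = 0.11   # RMSE increase when BGTD is removed
delta_no_mcl = 0.11    # RMSE increase when MCL is removed

# Total gain and the part not explained by the two ablation deltas.
total_gain = round(baseline_rmse - full_rmse, 2)
residual = round(total_gain - delta_no_bgtd - delta_no_mcl, 2)

# Gain relative to the baseline error.
relative_gain = total_gain / baseline_rmse

print(f"total gain: {total_gain:.2f} m")
print(f"residual:   {residual:.2f} m")
print(f"relative:   {relative_gain:.1%}")
```

The total comes to 0.24 m, the residual to 0.02 m, and the relative reduction to roughly 7% of the baseline RMSE, consistent with the figures quoted above.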

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes MorphoFormer, a joint framework for building height (BH) and building footprint (BF) estimation from Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs using a single-stage Swin encoder. It introduces two mechanisms to encode floor-area-ratio coupling: a BF-Guided Task Decoder (BGTD) that applies cross-attention gating from footprint-derived morphology to the height branch, and a Morphology Consistency Loss (MCL) that supervises a height surrogate derived from footprint features against ground-truth BH. On a geo-blocked split of 54 cities, MorphoFormer reduces BH test RMSE from 3.39 m to 3.15 m (R² from 0.62 to 0.67) while holding BF R² at 0.80; controlled ablations at fixed capacity attribute most of the 0.24 m gain to BGTD and MCL.

Significance. If the results hold, the work demonstrates that explicitly modeling cross-task physical coupling can improve remote-sensing regression accuracy without increasing encoder capacity or input resolution. The geo-blocked evaluation across 54 cities and the fixed-capacity ablations provide direct evidence that gains arise from the morphology-guided mechanisms rather than data leakage or parameter count, strengthening the case for incorporating domain constraints like FAR into multi-task architectures for urban morphology estimation.

minor comments (3)
  1. The experimental section should report standard deviations or results across multiple random seeds for the RMSE and R² values, as the 0.24 m improvement and the 0.11 m ablation deltas are currently presented as point estimates.
  2. The exact mathematical definition of the Morphology Consistency Loss (including how the height surrogate is computed from BF features) would benefit from an explicit equation in the methods section to facilitate reproduction.
  3. The table or text describing the data split should state the precise number of cities (or samples) allocated to the training, validation, and test sets under the geo-blocked protocol.
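For the first comment, a minimal sketch of the requested reporting: mean and sample standard deviation per configuration across seeds, plus the gap between two configurations expressed in units of their combined spread. The run values below are toy placeholders, not results from the paper, and the function names are hypothetical. A Welch t-test would be a more principled comparison; the ratio here is only a quick sanity check.

```python
import math
import statistics

def summarize(runs):
    # Mean and sample standard deviation (n - 1 denominator); needs at least two runs.
    return statistics.mean(runs), statistics.stdev(runs)

def gap_in_noise_units(baseline_runs, model_runs):
    # Difference of means divided by the combined standard deviation of both sets.
    mean_b, std_b = summarize(baseline_runs)
    mean_m, std_m = summarize(model_runs)
    return (mean_b - mean_m) / math.sqrt(std_b ** 2 + std_m ** 2)

# Toy RMSE values for three seeds per configuration (placeholders only).
baseline_runs = [5.0, 5.2, 5.4]
model_runs = [4.0, 4.1, 4.2]

mean_b, std_b = summarize(baseline_runs)
mean_m, std_m = summarize(model_runs)
gap = gap_in_noise_units(baseline_runs, model_runs)
print(f"baseline {mean_b:.2f} ± {std_b:.2f}, model {mean_m:.2f} ± {std_m:.2f}, gap {gap:.1f} sd")
```

Reported this way, a reader can judge whether a 0.11 m ablation delta is distinguishable from seed-to-seed variation.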

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and significance assessment of MorphoFormer, which correctly highlights the 0.24 m BH RMSE reduction, stable BF performance, geo-blocked 54-city evaluation, and fixed-capacity ablations. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper's derivation consists of proposing two explicit mechanisms (BGTD cross-attention gating and MCL surrogate supervision) whose impact is measured via controlled ablations on a geo-blocked held-out test set of 54 cities. The 0.24 m BH RMSE reduction is reported as an empirical outcome on spatially disjoint test data, not as a quantity obtained by fitting a parameter and then re-predicting a related statistic from the same fit. No equations are presented that equate a claimed prediction to its own input by construction, and no load-bearing premise relies on a self-citation chain or imported uniqueness theorem. The design is therefore self-contained against external benchmarks.
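What "geo-blocked" means in practice can be sketched as follows. The paper's actual GeoSplit procedure is not described on this page, so this is only an assumed variant: grid cells are grouped into square spatial blocks within each city, and each whole block is assigned to train or test by a deterministic hash. The function names, block size, and test fraction are hypothetical.

```python
import hashlib

def block_id(city, x, y, block_size=10):
    # Cells whose grid coordinates fall in the same block_size x block_size tile share a block.
    return (city, x // block_size, y // block_size)

def split_of(block, test_fraction=0.2):
    # Deterministic pseudo-random assignment: hash the block id to a number in [0, 1).
    digest = hashlib.sha256(repr(block).encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "test" if u < test_fraction else "train"

# Assign every cell of a toy 20 x 20 city grid via its block.
assignment = {
    (x, y): split_of(block_id("toy_city", x, y))
    for x in range(20)
    for y in range(20)
}
```

Splitting by block rather than by cell keeps spatially adjacent, highly correlated cells on the same side of the split, which is what makes the held-out error a meaningful estimate.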

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions that cross-attention can transfer morphology context and that a surrogate height-from-footprint task will regularize features toward real height correlations; no new physical entities or ad-hoc constants are introduced beyond ordinary neural-network training.

axioms (2)
  • domain assumption: Cross-attention on footprint-derived morphology can usefully gate height features without introducing harmful bias.
    Invoked in the design of the BF-Guided Task Decoder.
  • domain assumption: A height-from-footprint surrogate loss will force footprint features to encode height-correlated structure.
    Core justification for the Morphology Consistency Loss.

pith-pipeline@v0.9.0 · 5637 in / 1579 out tokens · 82516 ms · 2026-05-08T17:17:07.625595+00:00 · methodology

