Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models
Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3
The pith
Flood depth is estimated geometrically from aerial images by fusing transformer segmentation masks with elevation models to determine a single water surface level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a geometric Water Surface Elevation approach in which Mask2Former segmentation masks are fused with Digital Elevation Models to identify the water-land boundary, calculate a global water surface elevation Z_water, and compute per-pixel flood depths under the principle of local hydrostatic equilibrium.
What carries the argument
The Water Surface Elevation workflow that fuses transformer-based flood masks with DEMs to locate the boundary, set a global Z_water, and derive per-pixel depths from elevation differences.
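The first step of this workflow, locating the water-land boundary in a segmentation mask, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the 4-neighbour boundary test are our assumptions, and the mask and DEM are assumed to be co-registered NumPy rasters.

```python
import numpy as np

def water_land_boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary = water pixels with at least one non-water 4-neighbour.

    `mask` is a boolean raster where True marks flooded pixels.
    """
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior if all four of its direct neighbours are water.
    all_water_neighbours = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] &
        padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return mask & ~all_water_neighbours

# Toy 5x5 scene: a 3x3 flooded square; the boundary is its 8-pixel ring.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
boundary = water_land_boundary(mask)
```

In the paper's workflow, the DEM elevations at these boundary pixels would then feed the global Z_water estimate, and per-pixel depth follows from elevation differences inside the mask.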
If this is right
- High-performance 2D segmentation directly yields 3D volumetric flood information from monocular imagery.
- The pipeline avoids the computational delay of full hydrodynamic simulations for post-disaster use.
- The method is demonstrated on the FloodNet and CRASAR-U-DROIDS datasets for practical validation.
- Per-pixel depths become available once the water-land boundary is identified from the fused data.
Where Pith is reading between the lines
- High-resolution DEMs would be essential for accurate depths in areas with steep terrain.
- The approach could be tested on time-series imagery to track how depths evolve during a flood event.
- Where flow is present the single-level assumption may need adjustment, suggesting possible hybrid use with simpler flow models.
Load-bearing premise
A single global water surface elevation can represent the entire water body while local hydrostatic equilibrium holds without major flow or wind effects.
What would settle it
Direct measurements of differing water surface heights at separate boundary points or evidence of strong currents that violate hydrostatic equilibrium would disprove the central geometric calculation.
Original abstract
Post-disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric "Water Surface Elevation" approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state-of-the-art transformer-based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water-land boundary, calculate a global water surface elevation ($Z_{water}$), and compute per-pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR-U-DROIDS datasets, demonstrating how high-performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a geometric method for estimating flood depths from monocular aerial imagery. It employs Mask2Former for 2D flood segmentation, fuses the resulting masks with DEMs to extract the water-land boundary, computes a single global water surface elevation Z_water from boundary pixel elevations, and derives per-pixel depths as Z_water minus local DEM values under the assumption of local hydrostatic equilibrium. The workflow is evaluated on the FloodNet and CRASAR-U-DROIDS datasets to show extraction of 3D volumetric information without running hydrodynamic simulations.
Significance. If validated, the approach offers a fast, parameter-free alternative to simulation-based methods for obtaining volumetric flood data from standard 2D imagery and DEMs, which could be valuable for rapid post-disaster situational awareness. The use of a state-of-the-art transformer segmentation model and the direct geometric derivation (no fitted parameters) are strengths that align with needs in computer vision for disaster applications. However, the significance is tempered by the untested flat-surface assumption and lack of depth-specific quantitative validation.
major comments (3)
- [Method] The exact procedure for computing the global Z_water from boundary pixels (e.g., mean, median, maximum, or another statistic of DEM elevations at the water-land interface) is not specified. This choice is load-bearing for all subsequent per-pixel depth values and must be stated explicitly, ideally with a formula.
- [Evaluation] The reported experiments focus on segmentation performance but provide no quantitative metrics for depth estimation accuracy (e.g., MAE, RMSE against ground-truth bathymetry or hydrodynamic reference solutions) on FloodNet or CRASAR-U-DROIDS. Without such validation, the central claim of reliable 3D volumetric extraction cannot be assessed.
- [Introduction/Method] The assumption of a single global Z_water (i.e., a perfectly level water surface under local hydrostatic equilibrium) is stated but not stress-tested. No analysis or examples address potential violations from flow-induced slopes, wind setup, or non-hydrostatic effects, which would directly invalidate the per-pixel depth formula Z_water - DEM.
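The depth metrics the referee asks for are straightforward to compute once a reference depth raster exists. A hedged sketch, with hypothetical toy rasters (the paper reports no such ground truth):

```python
import numpy as np

def depth_errors(pred: np.ndarray, truth: np.ndarray):
    """MAE and RMSE over pixels that have a valid ground-truth depth."""
    valid = ~np.isnan(truth)          # NaN marks pixels with no reference depth
    diff = pred[valid] - truth[valid]
    return float(np.abs(diff).mean()), float(np.sqrt((diff ** 2).mean()))

# Toy 2x2 depth rasters in metres; one pixel lacks a reference measurement.
pred  = np.array([[1.0, 2.0], [0.5, 0.0]])
truth = np.array([[1.2, 1.8], [0.5, np.nan]])
mae, rmse = depth_errors(pred, truth)
```

Masking invalid pixels before averaging matters here, since high-water-mark surveys and hydrodynamic references rarely cover an entire scene.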
minor comments (3)
- [Abstract] The phrase "precise 2D flood masks" should be qualified with the specific segmentation metrics (e.g., mIoU) achieved on the evaluation datasets.
- [Related Work] The related-work discussion appears limited; add references to prior geometric or DEM-fusion approaches to flood depth estimation to better contextualize the contribution.
- [Figures] Ensure all figures showing depth maps include color bars, scale bars, and quantitative error visualizations where available.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to improve clarity and address limitations.
Point-by-point responses
-
Referee: [Method] The exact procedure for computing the global Z_water from boundary pixels (e.g., mean, median, maximum, or another statistic of DEM elevations at the water-land interface) is not specified. This choice is load-bearing for all subsequent per-pixel depth values and must be stated explicitly, ideally with a formula.
Authors: We agree that the aggregation method for Z_water was not explicitly detailed. In the revised manuscript, the Method section now specifies that Z_water is computed as the median of DEM elevations at water-land boundary pixels (chosen for robustness to DEM noise and boundary misclassifications). The formula added is Z_water = median({DEM(p) | p in boundary pixels}), along with pseudocode for the full pipeline from mask to depths. revision: yes
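The median rule stated in this response can be sketched directly. This is a minimal reading of the stated formula, not the authors' released code; the zero-clip for DEM cells that sit above the water line is our assumption, added so dry high ground inside the mask cannot receive negative depth.

```python
import numpy as np

def flood_depths(dem, mask, boundary):
    """Z_water = median of DEM elevations at water-land boundary pixels;
    per-pixel depth = Z_water - DEM inside the flood mask."""
    z_water = float(np.median(dem[boundary]))
    depth = np.clip(z_water - dem, 0.0, None)  # assumption: clip cells above the line to zero
    depth[~mask] = 0.0                         # depth is only defined where water is detected
    return z_water, depth

# Toy 1x5 transect: water between two shoreline pixels at 1.0 m elevation.
dem      = np.array([[2.0, 1.0, 0.5, 1.0, 2.0]])
mask     = np.array([[False, True, True, True, False]])
boundary = np.array([[False, True, False, True, False]])
z_water, depth = flood_depths(dem, mask, boundary)
# z_water = 1.0; depth at the deepest pixel = 0.5 m
```

The median's robustness to DEM noise and boundary misclassification, as the authors argue, is the main reason to prefer it over the mean here.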
-
Referee: [Evaluation] The reported experiments focus on segmentation performance but provide no quantitative metrics for depth estimation accuracy (e.g., MAE, RMSE against ground-truth bathymetry or hydrodynamic reference solutions) on FloodNet or CRASAR-U-DROIDS. Without such validation, the central claim of reliable 3D volumetric extraction cannot be assessed.
Authors: We acknowledge the absence of quantitative depth metrics. The FloodNet and CRASAR-U-DROIDS datasets provide only 2D segmentation ground truth and lack bathymetry or depth references, precluding direct MAE/RMSE computation against hydrodynamic solutions. We have added a dedicated limitations subsection noting this constraint and have included qualitative depth map visualizations. Future extensions will target datasets with depth annotations. revision: partial
-
Referee: [Introduction/Method] The assumption of a single global Z_water (i.e., a perfectly level water surface under local hydrostatic equilibrium) is stated but not stress-tested. No analysis or examples address potential violations from flow-induced slopes, wind setup, or non-hydrostatic effects, which would directly invalidate the per-pixel depth formula Z_water - DEM.
Authors: The flat-surface assumption is foundational, and we have expanded both the Introduction and Method to discuss its applicability. For the post-disaster scenes in our datasets, water bodies are largely quiescent with limited flow over the imaged scales, supporting the local hydrostatic approximation. We added analysis of error sources (e.g., wind setup inducing <5 cm slopes over 100 m) and note that deviations would be detectable as inconsistencies at boundaries. This provides a fast baseline while acknowledging cases where full hydrodynamics would be needed. revision: yes
Circularity Check
No circularity: the depth computation applies the stated geometric principle directly to boundary data.
full rationale
The paper's core workflow extracts a water-land boundary from the Mask2Former mask fused with DEM elevations, sets a single global Z_water from those boundary values, and subtracts local DEM heights to obtain per-pixel depths under the local hydrostatic equilibrium assumption. This is a direct geometric calculation with no parameter fitting, no self-referential equations, and no load-bearing self-citations or imported ansatzes described in the abstract or method outline. The result is not equivalent to its inputs by construction; it encodes an explicit physical modeling choice whose validity can be checked against external bathymetry or hydrodynamic references.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Local hydrostatic equilibrium holds, so the water surface elevation is constant across the flooded region.
Reference graph
Works this paper leans on
- [1] S. Mungkasi and S. G. Roberts, "Validation of ANUGA hydraulic model using exact solutions to shallow water wave problems," Journal of Physics: Conference Series, vol. 423, no. 1, p. 012029, Apr. 2013. https://doi.org/10.1088/1742-6596/423/1/012029
- [2] G. W. Brunner, "HEC-RAS River Analysis System: Hydraulic Reference Manual, Version 1.0," 1995.
- [3] J.-M. Hervouet, Hydrodynamics of Free Surface Flows: Modelling with the Finite Element Method. John Wiley & Sons, 2007.
- [4] Federal Emergency Management Agency, "Hurricane Harvey Flood Depth Rasters," https://www.fema.gov/flood-maps, 2017. Flood depth grids derived from post-event high-water mark surveys and hydraulic modeling.
- [5] J. Soria-Ruiz, Y. M. Fernandez-Ordonez, and J. P. Ambrosio-Ambrosio, "Extent and depth of flooding using SAR Sentinel-1 and machine learning algorithms," in IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2023, pp. 2246–2249.
- [6] J. Soria-Ruiz, Y. M. Fernandez-Ordoñez, J. P. Ambrosio-Ambrosio, and M. A. Escalona-Maurice, "Sentinel-1 SAR and LiDAR to detect extent and depth flood using random forests machine learning," in IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2022, pp. 5113–5116.
- [7] M. Rahnemoonfar, T. Chowdhury, A. Sarkar, D. Varshney, M. Yari, and R. R. Murphy, "FloodNet: A high resolution aerial imagery dataset for post flood scene understanding," IEEE Access, vol. 9, pp. 89644–89654, 2021.
- [8] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," 2022.
- [9] T. Manzini, P. Perali, R. Karnik, and R. Murphy, "CRASAR-U-DROIDS: A large scale benchmark dataset for building alignment and damage assessment in georectified sUAS imagery," arXiv preprint arXiv:2407.17673, 2024.
- [10] M. Rahnemoonfar, T. Chowdhury, and R. Murphy, "RescueNet: A high resolution UAV semantic segmentation dataset for natural disaster damage assessment," Scientific Data, vol. 10, no. 1, p. 913, 2023.
- [11] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016.
- [12] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
- [13] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
- [14] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., "Attention U-Net: Learning where to look for the pancreas," arXiv preprint arXiv:1804.03999, 2018.
- [15] Z. Song and Y. Tuo, "Automated flood depth estimates from online traffic sign images: Explorations of a convolutional neural network-based method," Sensors, vol. 21, no. 16, p. 5614, 2021.
- [16] S. Liu, W. Zheng, X. Wang, H. Xiong, J. Cheng, C. Yong, W. Zhang, and X. Zou, "A novel depth measurement method for urban flooding based on surveillance video images and a floating ruler," Natural Hazards, vol. 119, no. 3, pp. 1967–1989, 2023.
- [17] M. Chini, R. Pelich, Y. Li, R. Hostache, J. Zhao, C. Di Mauro, and P. Matgen, "SAR-based flood mapping, where we are and future challenges," in 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, 2021, pp. 884–886.
- [18] S. Krishnan, C. Crosby, V. Nandigam, M. Phan, C. Cowart, C. Baru, and R. Arrowsmith, "OpenTopography: A services oriented architecture for community access to lidar topography," in Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications, 2011, pp. 1–8.
- [19] F. Cian, M. Marconcini, P. Ceccato, and C. Giupponi, "Flood depth estimation by means of high-resolution SAR images and LiDAR data," Natural Hazards and Earth System Sciences, vol. 18, no. 11, pp. 3063–3084, 2018.
- [20] D. Y. Hancock, J. Fischer, J. M. Lowe, W. Snapp-Childs, M. Pierce, S. Marru, J. E. Coulter, M. Vaughn, B. Beck, N. Merchant, E. Skidmore, and G. Jacobs, "Jetstream2: Accelerating cloud computing via Jetstream," in Practice and Experience in Advanced Research Computing 2021: Evolution Across All Dimensions (PEARC '21). New York, NY, USA: Association f...
- [21] T. J. Boerner, S. Deems, T. R. Furlani, S. L. Knuth, and J. Towns, "ACCESS: Advancing Innovation: NSF's Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support," in Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good (PEARC '23). New York, NY, USA: Association for Computing Machinery, 2023, p. 173...