Urban Flood Observations: A hand-labeled training and validation dataset of post-flood inundation

Ariful Islam; Beth Tellman; Hannah K. Friedrich; Jonathan Giezendanner; Rohit Mukherjee; Upmanu Lall; Venkataraman Lakshmi; Zhijie Zhang

arxiv: 2604.23066 · v2 · pith:XPRHBVKTnew · submitted 2026-04-24 · 💻 cs.CV

Rohit Mukherjee, Hannah K. Friedrich, Beth Tellman, Ariful Islam, Zhijie Zhang, Jonathan Giezendanner, Upmanu Lall, Venkataraman Lakshmi

Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords urban flooding · inundation mapping · satellite imagery · hand-labeled dataset · PlanetScope · semantic segmentation · flood observation · remote sensing

The pith

A hand-labeled dataset of 215 satellite image chips enables machine learning models to map urban flood inundation at 77.3 mean IoU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Urban Flood Observations (UFO) dataset, consisting of 215 hand-annotated 1024 by 1024 pixel chips from 3-meter resolution PlanetScope imagery across 14 global flood events between 2017 and 2021. The labels distinguish inundated areas, including floodwater and permanent water bodies, from non-inundated land in complex urban environments. A segmentation model trained using leave-one-event-out cross-validation on this data reaches a mean Intersection over Union score of 77.3, while two standard surface water products score only 44.1 and 48.1 on the same data. The dataset is released publicly to advance the creation and testing of improved methods for urban flood mapping from space, where high resolution and frequent coverage are needed but clouds and urban features complicate analysis.
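The chip geometry stated above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' tiling code: the function names, the hypothetical scene size, and the choice to drop partial edge chips are all assumptions made for illustration.

```python
def chip_footprint_km(chip_size_px: int = 1024, gsd_m: float = 3.0) -> float:
    """Side length of a square chip on the ground, in kilometres."""
    return chip_size_px * gsd_m / 1000.0


def chip_origins(scene_height_px: int, scene_width_px: int,
                 chip_size_px: int = 1024) -> list[tuple[int, int]]:
    """Top-left (row, col) of every full, non-overlapping chip in a scene.

    Assumption: partial chips at the right and bottom edges are dropped.
    """
    rows = range(0, scene_height_px - chip_size_px + 1, chip_size_px)
    cols = range(0, scene_width_px - chip_size_px + 1, chip_size_px)
    return [(r, c) for r in rows for c in cols]


side_km = chip_footprint_km()        # 1024 px * 3 m = 3072 m
origins = chip_origins(2048, 3000)   # hypothetical scene size in pixels
```

A 1024-pixel chip at 3 m resolution therefore covers 3.072 km on a side, which matches the 3.072×3.072 km figure quoted in the Figure 1 caption.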

Core claim

The central claim is that the UFO dataset of 215 globally distributed, expert-labeled image chips from 14 flood events, annotated for visible surface water in two classes, supports training of a segmentation model that achieves 77.3 mean IoU via leave-one-event-out cross-validation and shows that two widely used surface water products achieve only 44.1 and 48.1 IoU on the same chips.
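For readers unfamiliar with the metric, a minimal sketch of Intersection over Union for the 'inundated' class follows. It assumes binary masks given as nested lists, reports IoU as a percentage to match the scale of the 77.3 figure, and averages over chips. How the paper actually averages (over chips, events, or classes) is not stated in this summary, so that choice is an assumption.

```python
import math


def binary_iou(pred, label):
    """IoU (in percent) of the positive class for two same-shaped 0/1 masks."""
    intersection = 0
    union = 0
    for pred_row, label_row in zip(pred, label):
        for p, l in zip(pred_row, label_row):
            if p and l:
                intersection += 1
            if p or l:
                union += 1
    if union == 0:
        # Neither mask contains water; IoU is undefined (assumed convention).
        return math.nan
    return 100.0 * intersection / union


def mean_iou(ious):
    """Average the defined IoU values, skipping undefined (NaN) chips."""
    valid = [v for v in ious if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


# Toy 2x2 chips, purely illustrative.
chip_a = binary_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # 1 shared / 2 in union
chip_b = binary_iou([[1, 1], [1, 1]], [[1, 1], [1, 1]])  # perfect overlap
chip_c = binary_iou([[0, 0], [0, 0]], [[0, 0], [0, 0]])  # no water anywhere
overall = mean_iou([chip_a, chip_b, chip_c])
```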

What carries the argument

The UFO hand-labeled dataset of 1024x1024 PlanetScope image chips with binary 'inundated' and 'non-inundated' annotations, used to train and validate segmentation models through leave-one-event-out cross-validation.
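Leave-one-event-out cross-validation holds out every chip from one flood event per fold, so the model is always tested on an event it never saw in training. A minimal sketch of the split logic is below. The chip identifiers and the `(chip_id, event)` pairing are hypothetical, and the event names are used only as placeholders.

```python
def leave_one_event_out(chips):
    """Yield (held_out_event, train_ids, test_ids), one fold per event.

    `chips` is a list of (chip_id, event_name) pairs (hypothetical layout).
    """
    events = sorted({event for _, event in chips})
    for held_out in events:
        train_ids = [cid for cid, event in chips if event != held_out]
        test_ids = [cid for cid, event in chips if event == held_out]
        yield held_out, train_ids, test_ids


# Placeholder chips from three events.
chips = [
    ("c1", "Houston"), ("c2", "Houston"),
    ("c3", "Dhaka"),
    ("c4", "Khartoum"), ("c5", "Khartoum"),
]
folds = list(leave_one_event_out(chips))
```

With 14 events this produces 14 folds. Splitting by event rather than by chip avoids leaking scene-specific appearance between training and test sets.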

Load-bearing premise

Expert hand-labeling accurately identifies all visible surface water without significant errors from shadows, vegetation, or urban structures, and the 14 selected events with their chips capture sufficient diversity to support generalizable models.

What would settle it

The claim would be undermined if a model trained on UFO, tested against an independent collection of PlanetScope urban flood images labeled by a separate group of experts, yielded a mean IoU below 60.

Figures

Figures reproduced from arXiv: 2604.23066 by Ariful Islam, Beth Tellman, Hannah K. Friedrich, Jonathan Giezendanner, Rohit Mukherjee, Upmanu Lall, Venkataraman Lakshmi, Zhijie Zhang.

**Figure 1.** Locations of the 14 urban flood events in the UFO dataset. PlanetScope Image Processing: For each retained event, we downloaded PlanetScope 4-band surface reflectance imagery (blue, green, red, near-infrared). Each scene was subdivided into 1024×1024-pixel chips (3.072×3.072 km at 3 m resolution). To aid labelers in identifying surface water, we generated three composites per chip: (1) true-color (red, gree…

**Figure 2.** Distribution of inundated versus permanent water pixels per event, expressed as percentages of total chip area. Permanent water was estimated using the ESA WorldCover 2020 water class [41]. The Craig (Missouri, USA) event had the highest inundation fraction per label, while Bad Neuenahr-Ahrweiler (Germany) and Can Tho (Vietnam) showed relatively lower fractions.

**Figure 3.** Examples of PlanetScope true-color chips (left) and corresponding post-flood inundation labels (right). Top: Can Tho, Vietnam (2019); middle: Houston, Texas (Hurricane Harvey, 2017); bottom: Grafton, Illinois (2019).

**Figure 4.** Predictions from the UFO-trained SegFormer-B2 model. Columns: PlanetScope true-color image (left), UFO hand label (center), model prediction (right). Top to bottom: Dhaka, Bangladesh (2020); Khartoum, Sudan (2020); Kempsey, NSW, Australia (2021). Performance varied across events: IoU exceeded 86% for Khartoum (KTM; 91.6%), Grafton (GIL; 87.9%), and Can Tho (CTO; 86.8%), while it was lower for Beira (BEI; 4…

**Figure 5.** Per-event mean Intersection over Union (IoU) and Recall (Sensitivity) for the UFO-trained SegFormer-B2 under leave-one-event-out cross-validation. Error bars indicate one standard deviation across chips within each event.

**Figure 6.** PlanetScope true-color imagery (left), UFO hand labels (center), and S1-IMPACT inundation predictions (right) for two same-day acquisitions. Top: Beira, Mozambique (2019); bottom: San Pedro Sula, Honduras (2020).

**Figure 7.** PlanetScope true-color imagery (left), UFO hand labels (center), and Dynamic World predictions (right; water and flooded-vegetation classes combined) for two same-day acquisitions. Top: Craig, Missouri (2019); bottom: Santa Lucía, Uruguay (2019). Accompanying table (columns: Precision, Recall, Specificity, F1, IoU, Accuracy; mean and std for each surface water model): S1-IMPACT precision 86.2 (std 23.4), recall 46.2 (std 31.8), specificity 95.8 (std 11.4), F1 54…

**Figure 8.** Chip-level Intersection over Union (IoU) and Recall (Sensitivity) against UFO labels. Grey boxplots show the UFO-trained SegFormer-B2; blue boxplots show benchmark products. Left: Dynamic World vs. UFO model (n = 31). Right: S1-IMPACT vs. UFO model (n = 42). Boxes show median and interquartile range; points denote individual chips. All comparisons use same-day acquisitions.
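The metrics named in the Figure 7 and Figure 8 captions (precision, recall, specificity, F1, IoU, accuracy) all derive from per-pixel confusion counts against the UFO label. The sketch below shows one way to binarize a Dynamic World label map by combining its water and flooded-vegetation classes, then compute those metrics. The class indices (0 for water, 3 for flooded vegetation) follow the publicly documented Dynamic World label ordering and should be treated as an assumption here, as should the function names and the NaN convention for empty denominators.

```python
import math

# Assumed Dynamic World label indices: 0 = water, 3 = flooded_vegetation.
DW_INUNDATED_CLASSES = {0, 3}


def binarize_dynamic_world(dw_labels):
    """Map a Dynamic World class raster to a 0/1 'inundated' mask."""
    return [[1 if c in DW_INUNDATED_CLASSES else 0 for c in row]
            for row in dw_labels]


def _ratio(num, den):
    return num / den if den else math.nan


def confusion_metrics(pred, label):
    """Per-chip metrics (as fractions) for two same-shaped 0/1 masks."""
    tp = fp = tn = fn = 0
    for pred_row, label_row in zip(pred, label):
        for p, l in zip(pred_row, label_row):
            if p and l:
                tp += 1
            elif p and not l:
                fp += 1
            elif not p and l:
                fn += 1
            else:
                tn += 1
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy 2x2 example: classes 0 and 3 count as inundated, 6 (built) and 1 (trees) do not.
dw_mask = binarize_dynamic_world([[0, 3], [6, 1]])
metrics = confusion_metrics(dw_mask, [[1, 0], [0, 1]])
```

In practice the 10 m Dynamic World raster would first need resampling onto the 3 m PlanetScope grid; that step is omitted here.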

Original abstract

Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the Urban Flood Observations (UFO) dataset: 215 hand-labeled 1024×1024 pixel chips from 3 m PlanetScope imagery across 14 flood events (2017–2021). Chips are annotated into two classes—inundated (all visible surface water, including permanent bodies) and non-inundated. Utility is shown via leave-one-event-out cross-validation of a segmentation model (mean IoU 77.3) and by benchmarking two existing products (NASA IMPACT IoU 44.1; Dynamic World IoU 48.1). The dataset is released publicly.

Significance. If the hand labels prove reliable, UFO would fill a documented gap in high-resolution urban inundation training data and enable more rigorous benchmarking than current coarse products allow. The leave-one-event-out protocol and public release are concrete strengths that support reproducibility and community use.

major comments (2)

[Methods/§3] Dataset construction (Methods/§3): No labeling protocol, number of annotators, inter-annotator agreement statistics, or independent validation (higher-resolution optical/SAR or field data) is reported. Because the central claim—that UFO supplies reliable ground truth for training and benchmarking—rests on label accuracy, the absence of these details leaves open the possibility that reported IoUs partly reflect label consistency rather than true inundation detection, especially given known urban confounds (shadows, dark roofs, wet pavement).
[Results/§4] Results (§4): The leave-one-event-out mean IoU of 77.3 is presented without per-event breakdowns, confusion matrices, or error analysis stratified by urban density or event type. This makes it impossible to verify whether performance generalizes across the claimed diversity of 14 events or is driven by a subset of easier scenes.
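The inter-annotator agreement statistic requested in the first major comment could be computed along the following lines if two annotators had labeled the same chip. This is a minimal sketch using Cohen's kappa on flattened binary masks. Kappa is one common choice and is not named in the report; the function name and the toy labels are hypothetical.

```python
import math


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of 0/1 pixel labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of pixels where both annotators agree.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement from each annotator's marginal rate of 'inundated'.
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return math.nan  # both annotators used a single, identical class
    return (observed - expected) / (1 - expected)


# Two hypothetical annotators labeling the same four pixels.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the annotators agree on 3 of 4 pixels (0.75), chance agreement is 0.5, and kappa is (0.75 − 0.5) / (1 − 0.5) = 0.5.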

minor comments (2)

[Abstract] Abstract: The sentence describing the two benchmark products should explicitly name the products and their resolutions for immediate clarity.
[Dataset description] The manuscript would benefit from a table summarizing event dates, locations, and number of chips per event to allow readers to assess geographic and temporal coverage.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for their constructive comments, which identify important areas for improving the transparency and rigor of the UFO dataset description. We respond to each major comment below, indicating planned revisions where feasible.

Point-by-point responses

Referee: [Methods/§3] Dataset construction (Methods/§3): No labeling protocol, number of annotators, inter-annotator agreement statistics, or independent validation (higher-resolution optical/SAR or field data) is reported. Because the central claim—that UFO supplies reliable ground truth for training and benchmarking—rests on label accuracy, the absence of these details leaves open the possibility that reported IoUs partly reflect label consistency rather than true inundation detection, especially given known urban confounds (shadows, dark roofs, wet pavement).

Authors: We agree that the original manuscript omitted key details on annotation. The revised version will expand §3 with a full description of the labeling protocol and the number of annotators. However, inter-annotator agreement statistics were not computed during the original process, and independent validation against higher-resolution optical/SAR or field data was not performed for these retrospective events. We will explicitly note these as limitations and discuss potential impacts from urban confounds such as shadows and wet pavement.
Revision: partial
Referee: [Results/§4] Results (§4): The leave-one-event-out mean IoU of 77.3 is presented without per-event breakdowns, confusion matrices, or error analysis stratified by urban density or event type. This makes it impossible to verify whether performance generalizes across the claimed diversity of 14 events or is driven by a subset of easier scenes.

Authors: We agree that additional granularity is warranted. In the revision we will add a table of per-event IoU scores from the leave-one-event-out validation, an overall confusion matrix, and a qualitative error analysis of common failure modes. Although quantitative urban-density stratification is not available, we will group events by qualitative characteristics (e.g., flood type and setting) and report any observed performance differences to help readers assess generalization across the 14 events.
Revision: yes
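A per-event breakdown of the kind promised here can be produced from chip-level scores with a simple group-by. The sketch below is illustrative only: the event codes and IoU values are made up, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator the paper uses.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_event_summary(chip_scores):
    """Return {event: (mean IoU, std IoU)} from (event, iou) pairs.

    Assumption: sample standard deviation, reported as 0.0 for single-chip events.
    """
    by_event = defaultdict(list)
    for event, iou in chip_scores:
        by_event[event].append(iou)
    return {
        event: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for event, values in sorted(by_event.items())
    }


# Made-up chip-level IoU values for two placeholder events.
scores = [("A", 80.0), ("A", 90.0), ("B", 50.0)]
summary = per_event_summary(scores)

# Averaging over chips and averaging over events give different answers.
chip_mean = mean(iou for _, iou in scores)
event_mean = mean(m for m, _ in summary.values())
```

The gap between `chip_mean` and `event_mean` in this toy case shows why a single headline IoU needs its averaging scheme stated: events with many chips dominate a chip-level mean.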

standing simulated objections not resolved

Inter-annotator agreement statistics, as they were not calculated during dataset creation.
Independent validation with higher-resolution optical/SAR or field data, which was not available for the selected events.

Circularity Check

0 steps flagged

No circularity: empirical dataset release with standard ML validation

Full rationale

The paper describes creation of a hand-labeled dataset from PlanetScope imagery and its use to train a segmentation model under leave-one-event-out cross-validation, reporting mean IoU of 77.3 against the same labels on held-out events, plus direct evaluation of two external surface-water products on the same labels. These steps are self-contained empirical procedures: the labels function as the defined ground truth for the reported metrics, with no equations, fitted parameters renamed as predictions, self-citation load-bearing arguments, or uniqueness claims that reduce the central results to their own inputs by construction. The work contains no derivation chain that collapses into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that visual interpretation of 3m PlanetScope imagery can reliably distinguish inundated areas in urban settings and that the chosen events are representative.

axioms (1)

Domain assumption: Hand-labeling by experts provides accurate ground truth for visible surface water in satellite imagery of urban floods.
This underpins the creation of the 'inundated' and 'non-inundated' labels without reported validation metrics such as inter-annotator agreement.

pith-pipeline@v0.9.0 · 5545 in / 1396 out tokens · 43158 ms · 2026-05-08T12:11:17.900269+00:00 · methodology
