Urban Flood Observations: A hand-labeled training and validation dataset of post-flood inundation
Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3
The pith
A hand-labeled dataset of 215 satellite image chips lets machine learning models map urban flood inundation with a mean IoU of 77.3.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the UFO dataset, 215 globally distributed, expert-labeled image chips from 14 flood events annotated for visible surface water in two classes, supports training a segmentation model that reaches a mean IoU of 77.3 under leave-one-event-out cross-validation. Evaluated on the same chips, two widely used surface water products reach IoUs of only 44.1 and 48.1.
What carries the argument
The UFO hand-labeled dataset of 1024×1024 PlanetScope image chips with binary 'inundated' and 'non-inundated' annotations, used to train and validate segmentation models through leave-one-event-out cross-validation.
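A minimal sketch of how a leave-one-event-out split can be built, assuming each chip carries an event identifier. The function name `leave_one_event_out` and the toy event IDs are hypothetical and are not taken from the paper; with the 14 UFO events, the same logic would yield 14 folds.

```python
from collections import defaultdict


def leave_one_event_out(chip_events):
    """Yield (held_out_event, train_indices, test_indices), one fold per event.

    chip_events[i] is the event ID of chip i (hypothetical metadata layout).
    """
    by_event = defaultdict(list)
    for idx, event in enumerate(chip_events):
        by_event[event].append(idx)

    for held_out in sorted(by_event):
        test_idx = by_event[held_out]
        # Every chip from every other event goes into the training set,
        # so no chip from the held-out flood leaks into training.
        train_idx = sorted(
            i for event, idxs in by_event.items() if event != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy assignment (not real UFO data): 6 chips from 3 events.
chip_events = ["event_a", "event_a", "event_b", "event_c", "event_c", "event_c"]
folds = list(leave_one_event_out(chip_events))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by event rather than by chip keeps chips from the same flood, which likely share sensor conditions and scene characteristics, out of both sides of a fold at once.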
Load-bearing premise
Expert hand-labeling accurately identifies all visible surface water without significant errors from shadows, vegetation, or urban structures, and the 14 selected events with their chips capture sufficient diversity to support generalizable models.
What would settle it
Testing a model trained on UFO against an independent collection of PlanetScope urban flood images labeled by a separate group of experts yields a mean IoU below 60.
Original abstract
Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.
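For readers unfamiliar with the metric, the following is a minimal sketch of Intersection over Union for binary masks. How the paper averages to a "mean" IoU (over classes, chips, or events) is not stated in the abstract; averaging over the two classes here is an assumption, as are the function names and toy masks. The sketch returns values in [0, 1], whereas the reported figures (for example 77.3) appear to be on a 0 to 100 scale.

```python
def iou(pred, truth, positive=1):
    """IoU of one class for two equally shaped 2D masks (lists of lists).

    Returns None when the class is absent from both masks (empty union).
    """
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p_pos, t_pos = p == positive, t == positive
            inter += p_pos and t_pos
            union += p_pos or t_pos
    return inter / union if union else None


def mean_iou(pred, truth, classes=(0, 1)):
    """Average IoU over classes, skipping classes absent from both masks.

    Averaging over classes is an assumption, not the paper's stated method.
    """
    scores = [s for s in (iou(pred, truth, c) for c in classes) if s is not None]
    return sum(scores) / len(scores) if scores else None


# Toy 2x4 masks (not real data): 1 = inundated, 0 = non-inundated.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 0, 0, 0],
        [1, 1, 1, 0]]

print(iou(pred, truth, positive=1))
print(mean_iou(pred, truth))
```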
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Urban Flood Observations (UFO) dataset: 215 hand-labeled 1024×1024 pixel chips from 3 m PlanetScope imagery across 14 flood events (2017–2021). Chips are annotated into two classes—inundated (all visible surface water, including permanent bodies) and non-inundated. Utility is shown via leave-one-event-out cross-validation of a segmentation model (mean IoU 77.3) and by benchmarking two existing products (NASA IMPACT IoU 44.1; Dynamic World IoU 48.1). The dataset is released publicly.
Significance. If the hand labels prove reliable, UFO would fill a documented gap in high-resolution urban inundation training data and enable more rigorous benchmarking than current coarse products allow. The leave-one-event-out protocol and public release are concrete strengths that support reproducibility and community use.
major comments (2)
- [Methods/§3] Dataset construction (Methods/§3): No labeling protocol, number of annotators, inter-annotator agreement statistics, or independent validation (higher-resolution optical/SAR or field data) is reported. Because the central claim—that UFO supplies reliable ground truth for training and benchmarking—rests on label accuracy, the absence of these details leaves open the possibility that reported IoUs partly reflect label consistency rather than true inundation detection, especially given known urban confounds (shadows, dark roofs, wet pavement).
- [Results/§4] Results (§4): The leave-one-event-out mean IoU of 77.3 is presented without per-event breakdowns, confusion matrices, or error analysis stratified by urban density or event type. This makes it impossible to verify whether performance generalizes across the claimed diversity of 14 events or is driven by a subset of easier scenes.
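The per-event breakdown and confusion matrices requested in the second major comment could be produced along the following lines. This is a sketch under assumptions: predictions and labels are flattened to 0/1 pixel lists, the `results` layout and event names are hypothetical, and nothing here reflects the authors' actual evaluation code.

```python
def confusion_counts(pred, truth):
    """Return (tp, fp, fn, tn) for flat 0/1 sequences, with 1 = inundated."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1:
            fp += 1
        elif t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def per_event_report(results):
    """Summarise each held-out event separately.

    results maps an event ID to (pred, truth) pixel lists collected from the
    fold in which that event was held out (hypothetical layout).
    """
    report = {}
    for event, (pred, truth) in results.items():
        tp, fp, fn, tn = confusion_counts(pred, truth)
        denom = tp + fp + fn
        report[event] = {
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            # IoU of the inundated class; None if the class never appears.
            "iou_inundated": tp / denom if denom else None,
        }
    return report


# Toy pixels for two hypothetical events (not real results).
results = {
    "event_a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "event_b": ([1, 0, 0, 1], [1, 1, 0, 1]),
}
report = per_event_report(results)
for event, row in report.items():
    print(event, row)
```

Reporting the four counts per event, rather than a single pooled score, shows whether a low IoU comes from missed water (false negatives) or from confounds such as shadows labeled as water (false positives).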
minor comments (2)
- [Abstract] Abstract: The abstract names both benchmark products but states a resolution only for Dynamic World (10 m); giving the resolution of the Sentinel-1-based NASA IMPACT product as well would aid immediate clarity.
- [Dataset description] The manuscript would benefit from a table summarizing event dates, locations, and number of chips per event to allow readers to assess geographic and temporal coverage.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which identify important areas for improving the transparency and rigor of the UFO dataset description. We respond to each major comment below, indicating planned revisions where feasible.
-
Referee: [Methods/§3] Dataset construction (Methods/§3): No labeling protocol, number of annotators, inter-annotator agreement statistics, or independent validation (higher-resolution optical/SAR or field data) is reported. Because the central claim—that UFO supplies reliable ground truth for training and benchmarking—rests on label accuracy, the absence of these details leaves open the possibility that reported IoUs partly reflect label consistency rather than true inundation detection, especially given known urban confounds (shadows, dark roofs, wet pavement).
Authors: We agree that the original manuscript omitted key details on annotation. The revised version will expand §3 with a full description of the labeling protocol and the number of annotators. However, inter-annotator agreement statistics were not computed during the original process, and independent validation against higher-resolution optical/SAR or field data was not performed for these retrospective events. We will explicitly note these as limitations and discuss potential impacts from urban confounds such as shadows and wet pavement. revision: partial
-
Referee: [Results/§4] Results (§4): The leave-one-event-out mean IoU of 77.3 is presented without per-event breakdowns, confusion matrices, or error analysis stratified by urban density or event type. This makes it impossible to verify whether performance generalizes across the claimed diversity of 14 events or is driven by a subset of easier scenes.
Authors: We agree that additional granularity is warranted. In the revision we will add a table of per-event IoU scores from the leave-one-event-out validation, an overall confusion matrix, and a qualitative error analysis of common failure modes. Although quantitative urban-density stratification is not available, we will group events by qualitative characteristics (e.g., flood type and setting) and report any observed performance differences to help readers assess generalization across the 14 events. revision: yes
unresolved points (2)
- Inter-annotator agreement statistics were not calculated during dataset creation.
- Independent validation against higher-resolution optical/SAR or field data was not available for the selected events.
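For context on the missing agreement statistic, one common choice is Cohen's kappa between two annotators' labels for the same pixels. The sketch below is illustrative only: the paper reports no such computation, and the function name and toy labels are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's marginal class frequencies.
    classes = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )

    if expected == 1.0:
        # Both annotators used one and the same class for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for 8 pixels (not real annotations): 1 = inundated.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters when one class dominates, as non-inundated pixels plausibly do in many urban chips.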
Circularity Check
No circularity: empirical dataset release with standard ML validation
full rationale
The paper describes the creation of a hand-labeled dataset from PlanetScope imagery and its use to train a segmentation model under leave-one-event-out cross-validation, reporting a mean IoU of 77.3 against the same labels on held-out events. It also evaluates two external surface-water products directly against those labels. These steps are self-contained empirical procedures: the labels serve as the defined ground truth for the reported metrics, and there are no equations, fitted parameters renamed as predictions, load-bearing self-citations, or uniqueness claims that would reduce the central results to their own inputs by construction. The work contains no derivation chain that collapses into tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hand-labeling by experts provides accurate ground truth for visible surface water in satellite imagery of urban floods.