
arxiv: 2603.22768 · v1 · submitted 2026-03-24 · 💻 cs.CV

From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery

Pith reviewed 2026-05-15 01:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords structural damage detection · satellite imagery · super-resolution · vision-language models · disaster assessment · building localization · reference-free evaluation · xBD dataset

The pith

A hybrid AI pipeline that upsamples satellite images, detects buildings, and uses vision-language models to classify structural damage levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that low-resolution satellite imagery can be turned into usable semantic damage assessments by chaining super-resolution enhancement, object detection, and vision-language model reasoning. A sympathetic reader would care because faster, more interpretable damage maps after disasters could guide emergency crews toward the most affected structures without waiting for high-resolution ground surveys. The work demonstrates the approach on real post-disaster events from the xBD dataset and introduces reference-free scoring methods to evaluate the outputs when no caption ground truth exists.

Core claim

The central claim is that a three-stage pipeline produces more interpretable results than prior pipelines. A Video Restoration Transformer first upscales images from 1024x1024 to 4096x4096, a YOLOv11 detector then locates buildings in pre-disaster frames, and vision-language models finally classify the cropped regions into four semantic damage levels. The authors support this by reporting improved semantic alignment via CLIPScore and reduced bias through a multi-model jury voting procedure on the Moore Tornado and Hurricane Matthew subsets of the xBD dataset.
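
The resolution step is a 4x increase per side (4096 / 1024), or 16x in pixel count. As a minimal sketch of how a detected building box could be carried across that change and cropped, the Python below assumes boxes are given as (x_min, y_min, x_max, y_max) in the 1024x1024 frame and that the image is a plain list of pixel rows. The names `scale_box` and `crop` are hypothetical helpers for illustration; the paper does not say in which coordinate frame its detector runs.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def scale_box(box: Sequence[float], src_size: int = 1024, dst_size: int = 4096) -> Box:
    """Map a box from a src_size x src_size frame to a dst_size x dst_size frame.

    Assumption for illustration: the box was detected at the original
    resolution and must be rescaled to index into the upscaled image.
    """
    if src_size <= 0 or dst_size <= 0:
        raise ValueError("image sizes must be positive")
    factor = dst_size / src_size  # 4.0 for 1024 -> 4096
    # Scale each coordinate, then clamp it to the bounds of the target frame.
    x0, y0, x1, y1 = (min(max(round(v * factor), 0), dst_size) for v in box)
    return (x0, y0, x1, y1)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the box region out of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


if __name__ == "__main__":
    print(scale_box((10, 20, 110, 220)))
```

A real pipeline would more likely crop with an imaging library; the point here is only the coordinate arithmetic implied by the 1024-to-4096 upscaling.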

What carries the argument

The three-stage pipeline that performs super-resolution upscaling, YOLOv11-based building localization, and vision-language model semantic scoring with CLIPScore reference-free evaluation plus jury voting.

If this is right

  • Damage assessments become available from standard-resolution satellite passes without requiring new high-resolution captures.
  • First responders receive explicit severity rankings and recovery recommendations derived from the semantic analysis.
  • The jury-voting step reduces the impact of any single vision-language model's biases in safety-critical outputs.
  • Reference-free metrics such as CLIPScore allow evaluation on new disaster events where labeled captions do not exist.
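
The jury-voting step mentioned above can be pictured as a plurality vote over the four damage levels. The sketch below is an illustration under stated assumptions, not the authors' procedure: the label strings follow the no/minor/major/destroyed scale, the function name `jury_vote` is hypothetical, and breaking ties toward the more severe level is a design choice made here for a safety-critical setting.

```python
from collections import Counter
from typing import List

# Ordered from least to most severe; the ordering is used for tie-breaking.
LEVELS = ["no", "minor", "major", "destroyed"]


def jury_vote(votes: List[str]) -> str:
    """Return the damage level chosen by the most jurors.

    Assumption for illustration: ties resolve to the most severe tied level.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = set(votes) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown damage levels: {sorted(unknown)}")
    counts = Counter(votes)
    top = max(counts.values())
    # Collect tied levels in severity order, then pick the last (most severe).
    tied = [level for level in LEVELS if counts.get(level, 0) == top]
    return tied[-1]


if __name__ == "__main__":
    print(jury_vote(["minor", "major", "major"]))
```

Other tie-breaking rules (deferring to a human reviewer, for instance) would fit the same interface.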

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged approach could be tested on additional disaster types such as earthquakes or floods to check whether the resolution boost and semantic step generalize.
  • Connecting the pipeline to streaming satellite feeds might allow near-real-time damage mapping rather than post-event batch processing.
  • The jury mechanism could be extended to include human-in-the-loop overrides for the highest-severity cases.

Load-bearing premise

That vision-language model outputs scored by CLIPScore and jury voting reliably reflect actual structural damage severity, even though the evaluation is not validated against ground-truth damage labels.

What would settle it

A side-by-side comparison of the framework's four-level damage predictions against independent expert visual annotations or on-site inspection records for the same buildings in the Moore Tornado or Hurricane Matthew image sets.

Figures

Figures reproduced from arXiv: 2603.22768 by Bijay Shakya, Catherine Hoier, Khandaker Mamun Ahmed.

Figure 1: xBD dataset samples: pre-disaster images (top) and post […]
Figure 2: Workflow of the proposed multi-stage structural dam […]
Figure 3: Overview of the Multi-VLM framework for disas […]
Figure 4: Working mechanism of the VLM-As-A-Jury metric.
Figure 5: Aggregated word clouds of VLM-generated damage de […]
Abstract

Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings. In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a multi-stage framework for post-disaster structural damage assessment from satellite imagery. It combines VRT-based super-resolution (1024x1024 to 4096x4096), YOLOv11 object detection on pre-disaster images to localize buildings, and VLM-based semantic analysis of cropped regions to classify damage into four severity levels. In the absence of ground-truth captions, evaluation relies on reference-free CLIPScore for semantic alignment and a multi-VLM 'jury' voting strategy to mitigate bias. Experiments on xBD subsets (Moore Tornado, Hurricane Matthew) claim that the pipeline enhances semantic interpretation of damaged buildings and can provide recommendations to first responders.

Significance. If the reference-free metrics were shown to align with actual damage severity, the work would offer a practical advance in automated disaster response by improving interpretability of low-resolution imagery through combined super-resolution, detection, and language-based reasoning. The jury strategy is a reasonable approach for safety-critical applications. However, the current lack of validation against available ground-truth labels substantially reduces the demonstrated significance.

major comments (1)
  1. [Experiments] Experiments section: xBD provides explicit per-building ground-truth damage labels (no/minor/major/destroyed), yet the evaluation reports no quantitative alignment (accuracy, Cohen's kappa, or confusion matrix) between VLM outputs / jury votes and these labels. The claim that the framework 'enhances semantic interpretation' therefore rests on the untested assumption that CLIPScore and VLM consensus track factual damage severity; this is load-bearing for the central contribution.
minor comments (1)
  1. [Abstract] Abstract and §4: The statement that the framework 'provides helpful recommendations to first responders' is asserted without any description of how recommendations are generated from the damage scores or any example outputs.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and will incorporate revisions to strengthen the validation.

  1. Referee: [Experiments] Experiments section: xBD provides explicit per-building ground-truth damage labels (no/minor/major/destroyed), yet the evaluation reports no quantitative alignment (accuracy, Cohen's kappa, or confusion matrix) between VLM outputs / jury votes and these labels. The claim that the framework 'enhances semantic interpretation' therefore rests on the untested assumption that CLIPScore and VLM consensus track factual damage severity; this is load-bearing for the central contribution.

    Authors: We agree that direct quantitative comparison to the xBD ground-truth damage labels would provide stronger validation of the framework's semantic outputs. Our original evaluation emphasized reference-free metrics (CLIPScore and jury consensus) due to the absence of ground-truth captions for the VLM-generated descriptions, but we acknowledge that alignment with the available per-building labels (no/minor/major/destroyed) is feasible and important. In the revised manuscript, we will add accuracy, Cohen's kappa, and confusion matrix results comparing the multi-VLM jury classifications to the xBD ground-truth labels on the Moore Tornado and Hurricane Matthew subsets. This will empirically test whether the semantic interpretations track factual damage severity. revision: yes
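
The three quantities named in this response (confusion matrix, accuracy, Cohen's kappa) can be computed from paired label lists. The following is a minimal plain-Python sketch; the function names are hypothetical, and the labels in the usage block are made-up toy values, not results from the paper or from xBD.

```python
from typing import List, Sequence

LEVELS = ["no", "minor", "major", "destroyed"]


def confusion_matrix(truth: Sequence[str], pred: Sequence[str]) -> List[List[int]]:
    """Rows index the ground-truth level, columns the predicted level."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    index = {level: i for i, level in enumerate(LEVELS)}
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for t, p in zip(truth, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (observed agreement)."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def cohens_kappa(matrix: List[List[int]]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_o = accuracy(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance from the marginal label frequencies.
    p_e = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels for illustration only.
    truth = ["no", "no", "minor", "major", "destroyed", "destroyed"]
    pred = ["no", "minor", "minor", "major", "major", "destroyed"]
    m = confusion_matrix(truth, pred)
    print(m, accuracy(m), cohens_kappa(m))
```

Because damage levels are ordinal, a weighted kappa could also be reported; the unweighted form is shown here because it is the one the response names.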

Circularity Check

0 steps flagged

No circularity: pipeline assembles independent components and reference-free metrics without self-referential reduction


The paper describes a sequential pipeline (VRT super-resolution on 1024x1024 imagery to 4096x4096, YOLOv11 detection on pre-disaster images, VLM semantic labeling into four damage levels) evaluated via CLIPScore and multi-VLM jury voting. No equations, fitted parameters, or derivations are presented that reduce a claimed output to the input by construction. The justification for reference-free metrics is the explicit absence of ground-truth captions, not a loop back to the framework's own outputs. No self-citations appear in the provided text as load-bearing premises. The central claim of enhanced semantic interpretation therefore rests on external model capabilities and consensus scoring rather than any enumerated circular pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on standard assumptions in AI about model generalization to damage assessment tasks and the utility of reference-free metrics.

axioms (2)
  • domain assumption: Vision-Language Models can reliably assess structural damage severity from image crops.
    Invoked in the semantic assessment stage, whose outputs are not checked against ground-truth damage labels.
  • domain assumption: CLIPScore provides a valid proxy for semantic alignment in damage classification.
    Used for reference-free evaluation.
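
For context on the second axiom, CLIPScore as defined by Hessel et al. is a rescaled, clipped cosine similarity between an image embedding and a caption embedding: w * max(cos(c, v), 0) with w = 2.5. The sketch below computes that formula on plain vectors. The embeddings in the usage block are toy values chosen for illustration; in practice they would come from a CLIP image encoder and text encoder, which are not shown here.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def clip_score(image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)


if __name__ == "__main__":
    # Toy embeddings for illustration only.
    print(clip_score([0.2, 0.9, 0.1], [0.3, 0.8, 0.0]))
```

The score measures how well a caption matches an image in CLIP's embedding space. Whether a high score also implies a correct damage level is exactly what this axiom assumes.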



