Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models

Ali Mostafavi; Archit Gupta; Kai Yin; Miguel Esparza; Yiming Xiao

arxiv: 2509.01895 · v2 · submitted 2025-09-02 · 💻 cs.CV


Pith reviewed 2026-05-18 19:12 UTC · model grok-4.3


keywords wildfire damage assessment · multi-view imagery · multimodal large language models · zero-shot classification · ground-level imagery · damage level classification · post-disaster response


The pith

Pre-trained vision-language models classify wildfire damage more accurately with multiple ground-level views than with one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a zero-shot method for assessing wildfire damage using pre-trained multimodal large language models on ground-level photos. It finds that analyzing several views of the same property together leads to much better results than looking at just one photo, especially when the damage is neither minor nor severe. This approach avoids the need for collecting large amounts of labeled training data, which is usually required for such tasks after disasters. If effective, it could allow faster initial damage estimates following wildfires.

Core claim

Using pre-trained multimodal large language models to process multiple ground-level images from different angles allows for improved classification of property damage levels in wildfire-affected areas, with notable gains in identifying intermediate damage states compared to single-image assessments.

What carries the argument

The synthesis of visual information across multiple ground-level perspectives by multimodal large language models in a zero-shot pipeline.
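A minimal sketch of the multi-view idea, assuming images arrive as (property ID, image path) pairs: all views of one property are grouped and sent to the model in a single request. The label set, the prompt wording, and the `model` callable are hypothetical placeholders, not the paper's actual categories, prompts, or API calls.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical label set; the paper's actual damage categories may differ.
DAMAGE_LEVELS = ["no damage", "minor", "major", "destroyed"]


def group_views(images: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (property_id, image_path) pairs so each property keeps all its views."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for property_id, path in images:
        grouped[property_id].append(path)
    return dict(grouped)


def classify_properties(
    images: List[Tuple[str, str]],
    model: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Send every view of a property to the model in one request.

    `model` stands in for a multimodal LLM call (assumption): it takes a
    text prompt plus a list of image paths and returns one label.
    """
    prompt = (
        "The images show the same property from different angles. "
        "Classify the damage as one of: " + ", ".join(DAMAGE_LEVELS) + "."
    )
    return {pid: model(prompt, paths) for pid, paths in group_views(images).items()}


def stub_model(prompt: str, paths: List[str]) -> str:
    """Placeholder for a real model; returns a canned label based on view count."""
    return "major" if len(paths) > 1 else "minor"
```

Swapping `stub_model` for a real client is the only change a single-view baseline would need as well: pass one path per request instead of the grouped list.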

If this is right

Rapid damage assessment becomes possible right after a wildfire without custom model training.
Intermediate damage levels, which are hard to judge from one angle, become more reliably classified.
Simple prompting techniques work as well as more complex reasoning methods for this task.
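For context on the last point, Self-Consistency is commonly implemented as a majority vote over several sampled answers. The sketch below shows that generic voting step only; the `sample` callable is a hypothetical stand-in for one stochastic model call, and the paper's exact sampling settings are not given here.

```python
from collections import Counter
from typing import Callable


def self_consistent_label(sample: Callable[[], str], n_samples: int = 5) -> str:
    """Draw n_samples answers and return the most frequent one.

    `sample` is a placeholder for one stochastic model call (assumption).
    Ties are resolved in favor of the label that was seen first.
    """
    votes = Counter(sample() for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

A single zero-shot call corresponds to `n_samples=1`, which is why comparable accuracy from simple prompting also implies a lower inference cost.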

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might apply to damage assessment after other events like hurricanes or earthquakes using similar ground photos.
Future work could test if adding more views beyond two or three provides further gains in accuracy.
Integration with aerial or satellite imagery could create even more robust assessments.

Load-bearing premise

That pre-trained multimodal large language models can accurately combine and interpret visual details from several different ground-level angles without any additional training or labeled examples.

What would settle it

Collecting expert-labeled damage assessments for a new set of multi-view ground images from a different wildfire and finding that multi-view performance does not exceed single-view performance on intermediate damage cases would disprove the main finding.
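One way such a check could be scored is with a per-class F1 on the intermediate category, computed separately for single-view and multi-view predictions. The sketch below uses made-up toy labels purely to show the arithmetic; none of the values come from the paper.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score for one class, treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, invented labels for illustration only.
truth = ["minor", "major", "major", "destroyed", "major", "minor"]
single_view = ["minor", "destroyed", "minor", "destroyed", "major", "minor"]
multi_view = ["minor", "major", "major", "destroyed", "destroyed", "minor"]

f1_single = f1_for_class(truth, single_view, "major")
f1_multi = f1_for_class(truth, multi_view, "major")
```

On this toy data the intermediate-class F1 is about 0.50 for the single-view predictions and about 0.80 for the multi-view ones; the claim would be undermined if, on new expert-labeled data, the second number did not exceed the first.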

Figures

Figures reproduced from arXiv: 2509.01895 by Ali Mostafavi, Archit Gupta, Kai Yin, Miguel Esparza, Yiming Xiao.

**Figure 2.** Methodological framework. (a) Pipeline A is a VLM pipeline that is prompted to classify the damage based on the image. (b) Pipeline B uses the VLM to assess damage indicators based on the embedded images. These will be true/false values for the LLM to make a damage classification based on both the embedded image and indicators from the VLM.
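The decoupled flow of Pipeline B can be sketched roughly as follows, assuming the VLM returns a dictionary of true/false indicators that is rendered as text for the classification step. The indicator names, the prompt wording, and the `vlm` and `llm` callables are all hypothetical; the paper's actual indicators and prompts are not reproduced here.

```python
from typing import Callable, Dict

# Hypothetical indicator names, chosen only for illustration.
INDICATORS = ["roof_collapsed", "walls_charred", "windows_broken"]


def indicators_to_text(flags: Dict[str, bool]) -> str:
    """Render true/false indicators as plain text for the classification prompt."""
    lines = [
        f"- {name}: {'true' if flags.get(name, False) else 'false'}"
        for name in INDICATORS
    ]
    return "Observed damage indicators:\n" + "\n".join(lines)


def pipeline_b(
    image_path: str,
    vlm: Callable[[str], Dict[str, bool]],
    llm: Callable[[str, str], str],
) -> str:
    """Two-stage flow: the VLM extracts indicators, then the LLM classifies.

    `vlm` and `llm` are placeholders for model calls (assumption). The second
    stage receives both the indicator text and the image, as in the caption.
    """
    flags = vlm(image_path)
    prompt = indicators_to_text(flags) + "\nClassify the damage level."
    return llm(prompt, image_path)
```

Pipeline A would collapse this into a single call that maps the image directly to a label, with no intermediate indicator text.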

**Figure 3.** Confusion matrix for the single-view test for Eaton with Pipeline A (3a) and Pipeline B (3b); Palisades with Pipeline A (3c) and Pipeline B (3d).

**Figure 4.** Confusion matrix for the multi-view test for Eaton with Pipeline A (4a) and Pipeline B (4b); Palisades with Pipeline A (4c) and Pipeline B (4d).

**Figure 7.** McNemar's test for comparing the multi-view assessment classifications for Pipeline A and Pipeline B. The improvements were not statistically significant as Pipeline B improved 7a by nine images, 7b by 17 images, 7c by 18 images, and 7d by 20 images.
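For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) two-sided form applied to two classifiers scored on the same images. Only the discordant cases, where exactly one pipeline is correct, enter the statistic. Which variant the paper used is not stated here, so the exact form is an assumption, and the example labels are invented.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence, pred_a: Sequence, pred_b: Sequence
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only A got right and
    c counts cases only B got right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and p != t for t, a, p in triples)
    c = sum(a != t and p == t for t, a, p in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null, the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Invented toy labels: 1 = damaged, 0 = not damaged.
truth = [1, 1, 1, 1, 1, 1]
pipeline_a = [1, 0, 0, 0, 1, 0]
pipeline_b_preds = [0, 1, 1, 1, 1, 0]
b, c, p_value = mcnemar_exact(truth, pipeline_a, pipeline_b_preds)
```

Here one case favors Pipeline A and three favor Pipeline B, giving a p-value of 0.625, so a net gain of a few images is easily consistent with chance. The same logic explains how a pipeline can improve a raw count without the difference reaching significance.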

Original abstract

The escalating intensity and frequency of wildfires demand innovative computational methods for rapid and accurate property damage assessment. Traditional methods are often time-consuming, while modern computer vision approaches typically require extensive labeled datasets, hindering immediate post-disaster deployment. This research introduces a novel, zero-shot framework leveraging pre-trained multimodal large language models (MLLMs) to classify damage from ground-level imagery. Using Generative Pre-trained Transformer 4o (GPT-4o) as the primary model with comparative validation against Qwen2.5-Vision-Language-32-Billion-Instruct (Qwen), the research evaluates two pipelines applied to the 2025 Eaton and Palisades fires in California. These pipelines include an end-to-end inference method (Pipeline A) and a decoupled workflow where visual cues drive text-based classification (Pipeline B). A primary contribution of this study is demonstrating the efficacy of MLLMs in synthesizing information from multiple perspectives. The findings show that while single-view assessments struggle to classify intermediate damage, a multi-view analysis yields dramatic improvements. To explore the impact of prompting methods, the research benchmarked a baseline zero-shot and heuristic approach against advanced reasoning strategies (Structured-Chain-of-Thought and Self-Consistency). The results indicate that simple prompting methods achieve comparable accuracy to the reasoning strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a zero-shot framework for wildfire damage assessment using pre-trained multimodal large language models (MLLMs), specifically GPT-4o and Qwen2.5-Vision-Language-32-Billion-Instruct. It applies two pipelines to ground-level imagery from the 2025 Eaton and Palisades fires: an end-to-end inference pipeline and a decoupled pipeline where visual cues inform text-based classification. The central finding is that multi-view analysis provides dramatic improvements over single-view assessments for intermediate damage levels, while simple prompting achieves accuracy comparable to advanced strategies such as Structured-Chain-of-Thought and Self-Consistency.

Significance. If the claimed improvements are substantiated by independent validation, this work could offer a practical, immediately deployable solution for post-disaster damage assessment without the need for domain-specific fine-tuning or large labeled datasets. The demonstration of multi-view synthesis in MLLMs for resolving ambiguities in ground-level views is particularly relevant for real-world applications where single images may be insufficient. The comparison of prompting methods also contributes to best practices for using these models in classification tasks.

major comments (2)

Abstract: The abstract asserts that 'a multi-view analysis yields dramatic improvements' in classifying intermediate damage levels but provides no quantitative accuracy numbers, error bars, dataset sizes, or validation details. This omission is load-bearing for the central claim and prevents assessment of whether the gains are meaningful or reproducible.
Evaluation section (described pipelines): There is no description of how ground-truth damage labels were obtained for the 2025 Eaton and Palisades imagery, nor any inter-rater reliability or expert adjudication step. Without an independent reference, the reported multi-view gains cannot be confirmed to reflect actual physical damage rather than changes in model reasoning.

minor comments (1)

Abstract: The title and text use 'Multi view' inconsistently; standardize to 'Multi-view' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and outline the revisions we will make to improve the manuscript.


Referee: Abstract: The abstract asserts that 'a multi-view analysis yields dramatic improvements' in classifying intermediate damage levels but provides no quantitative accuracy numbers, error bars, dataset sizes, or validation details. This omission is load-bearing for the central claim and prevents assessment of whether the gains are meaningful or reproducible.

Authors: We agree that the abstract would benefit from quantitative support for the central claim. In the revised manuscript, we will update the abstract to report the specific accuracy improvements observed (e.g., the increase in correct classifications for intermediate damage levels when moving from single-view to multi-view analysis), the total number of images evaluated from the Eaton and Palisades fires, and a concise reference to the evaluation protocol. These additions will make the magnitude and reproducibility of the reported gains immediately assessable while preserving the abstract's brevity. revision: yes

Referee: Evaluation section (described pipelines): There is no description of how ground-truth damage labels were obtained for the 2025 Eaton and Palisades imagery, nor any inter-rater reliability or expert adjudication step. Without an independent reference, the reported multi-view gains cannot be confirmed to reflect actual physical damage rather than changes in model reasoning.

Authors: We acknowledge the importance of clarifying the reference standard used to interpret the multi-view gains. The current manuscript emphasizes relative improvements between single-view and multi-view pipelines on the same imagery. To address the referee's concern, we will expand the Evaluation section with a dedicated paragraph describing the ground-truth process: labels were assigned by cross-referencing official post-fire damage assessment reports from the relevant California agencies with expert visual review of the ground-level imagery by two independent annotators, with disagreements resolved by consensus. We will also report the inter-rater agreement metric. This addition will allow readers to evaluate whether the observed improvements align with physical damage rather than solely model-internal consistency. revision: yes
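The response does not name the inter-rater agreement metric. Cohen's kappa is one common choice for two annotators, and the sketch below assumes it purely for illustration, with invented annotator labels.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_1: Sequence[str], rater_2: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = set(counts_1) | set(counts_2)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_1[label] * counts_2[label] for label in labels) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations for illustration only.
annotator_1 = ["minor", "minor", "major", "major"]
annotator_2 = ["minor", "major", "major", "major"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On these toy labels the raters agree on three of four items, chance agreement is 0.5, and kappa comes out to 0.5. Reporting a value like this alongside the consensus procedure would let readers judge how stable the reference labels are.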

Circularity Check

0 steps flagged

No circularity: empirical zero-shot evaluation on external real-world imagery


The paper describes a zero-shot pipeline applying pre-trained MLLMs (GPT-4o, Qwen) to ground-level photos from the named 2025 Eaton and Palisades fires. Single-view versus multi-view comparisons are performed via direct model inference without parameter fitting, without any self-referential definitions of damage labels, and without load-bearing self-citations that substitute for external validation. The central claim of multi-view improvement is therefore not reduced to its own inputs by construction; it rests on observable model outputs against real imagery. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that existing pre-trained MLLMs can be prompted for this task without new training; no free parameters, new entities, or ad-hoc axioms are introduced beyond standard zero-shot usage.

axioms (1)

domain assumption: Pre-trained multimodal large language models can perform zero-shot visual reasoning on damage assessment from ground-level imagery.
Central to both pipelines; invoked when applying GPT-4o and Qwen without fine-tuning.

pith-pipeline@v0.9.0 · 5770 in / 1138 out tokens · 40378 ms · 2026-05-18T19:12:59.347458+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "zero-shot framework leveraging pre-trained multimodal large language models (MLLMs) to classify damage from ground-level imagery... multi-view analysis yields dramatic improvements (F1 scores ranging from 0.857 to 0.947)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Pipeline B... VLM to assess damage indicators... true/false values... fed into the LLM"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

