Recognition: no theorem link
WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery
Pith reviewed 2026-05-16 05:11 UTC · model grok-4.3
The pith
WildfireVLM pairs YOLOv12 detection on satellite images with multimodal LLMs to produce contextual wildfire risk assessments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WildfireVLM combines YOLOv12 detection on harmonized Landsat and GOES imagery with multimodal LLMs that convert detection outputs into contextualized risk assessments and prioritized response recommendations. The quality of the reasoning is validated by an LLM-as-judge evaluation with a shared rubric, and the system is deployed in a service-oriented architecture that supports real-time processing, visual dashboards, and long-term tracking.
What carries the argument
YOLOv12 model for detecting fire zones and smoke plumes in satellite imagery, integrated with multimodal LLMs that translate those detections into language-based risk reasoning and recommendations.
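The handoff the argument rests on, detector outputs rendered as language for an MLLM, can be sketched minimally. The `Detection` fields and the prompt wording below are hypothetical illustrations; the paper does not publish its actual prompt template.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # e.g. "smoke_plume" or "fire_zone" (hypothetical class names)
    confidence: float  # detector score in [0, 1]
    bbox: tuple        # (x_min, y_min, x_max, y_max) in pixel coordinates

def detections_to_prompt(scene_id: str, detections: list) -> str:
    """Render YOLO-style detections as a text prompt for a multimodal LLM.

    Illustrative only: shows the detection-to-language handoff, not the
    paper's template.
    """
    if not detections:
        return f"Scene {scene_id}: no fire or smoke detected. Assess residual risk."
    lines = [f"Scene {scene_id}: {len(detections)} detection(s)."]
    # List detections from most to least confident so the MLLM sees the
    # strongest evidence first.
    ranked = sorted(detections, key=lambda d: d.confidence, reverse=True)
    for i, d in enumerate(ranked, 1):
        lines.append(f"{i}. {d.label} (conf {d.confidence:.2f}) at bbox {d.bbox}")
    lines.append("Produce a contextual wildfire risk assessment and "
                 "prioritized response recommendations.")
    return "\n".join(lines)
```

The resulting string would be sent, alongside the image crop, to whatever MLLM backs the system.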
If this is right
- Enables real-time analysis of large satellite scenes for early alerts on faint smoke signals.
- Produces prioritized response recommendations that disaster managers can use directly.
- Supports visual risk dashboards and long-term tracking of wildfire events.
- Demonstrates that combining computer vision outputs with language reasoning can scale monitoring across dynamic conditions.
Where Pith is reading between the lines
- The same detection-plus-reasoning pattern could be tested on other satellite-derived hazards such as flood mapping using the same public imagery sources.
- Public release of the code and dataset opens the possibility for independent accuracy checks on imagery from different regions or sensors.
- If integrated with existing emergency-alert systems, the framework might shorten the interval between satellite observation and on-ground response.
Load-bearing premise
The assumption that an LLM-as-judge evaluation with a shared rubric provides a reliable and unbiased validation of the risk reasoning quality produced by the multimodal models.
What would settle it
A side-by-side comparison in which human disaster-management experts rate the same set of risk assessments and produce scores that differ substantially from those assigned by the LLM judge on the identical rubric.
Original abstract
Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring. The code and dataset are publicly available on GitHub at https://github.com/Ayanzadeh93/_WildfireVLM_.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildfireVLM, a framework that applies YOLOv12 to detect fire zones and smoke plumes in harmonized Landsat-8/9 and GOES-16 satellite imagery and feeds the detections into MLLMs to produce contextualized risk assessments and response recommendations. The quality of those assessments is validated via an LLM-as-judge procedure with a shared rubric, and the pipeline is deployed in a service-oriented architecture supporting real-time processing, visual dashboards, and long-term tracking. The code and dataset are released publicly.
Significance. If the detection performance and risk-reasoning quality can be independently verified, the work would demonstrate a practical integration of object detection with multimodal language reasoning for scalable wildfire monitoring, with the public code release aiding reproducibility and extension.
major comments (2)
- [Abstract and §4] Abstract and §4 (Results): no quantitative detection metrics (mAP, precision-recall curves, or confusion matrices) are reported for YOLOv12 on the Landsat/GOES dataset, leaving the central claim of effective early detection unsupported by evidence.
- [§5] §5 (Validation): the LLM-as-judge evaluation with a shared rubric is presented as the primary validation of MLLM risk reasoning, yet no inter-rater agreement with human experts, correlation analysis, or comparison against a non-LLM baseline is supplied; because the judge belongs to the same model class, the procedure risks circularity and known biases (position, verbosity, self-preference) that are not quantified or mitigated.
minor comments (2)
- [§3] The description of spectral-band harmonization and labeling protocol for the constructed dataset would benefit from additional detail on inter-annotator agreement and quality-control steps.
- [Figures] Figure captions for the risk-dashboard examples should explicitly state the input imagery source, detection thresholds, and MLLM prompt template used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that stronger quantitative evidence is needed for the detection component and that the LLM-as-judge validation requires additional safeguards against circularity. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Results): no quantitative detection metrics (mAP, precision-recall curves, or confusion matrices) are reported for YOLOv12 on the Landsat/GOES dataset, leaving the central claim of effective early detection unsupported by evidence.
Authors: We agree that the absence of standard detection metrics weakens the central claim. In the revised manuscript we will add to §4 a full quantitative evaluation of YOLOv12 on the harmonized Landsat-8/9 and GOES-16 test set, including mAP@0.5, mAP@0.5:0.95, per-class precision/recall/F1, precision-recall curves, and confusion matrices. We will also report results against two baselines (YOLOv8 and a fine-tuned Faster R-CNN) using the same train/test split. These additions will be summarized in the abstract as well. revision: yes
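The metrics promised here follow the standard matching rule: a prediction counts as a true positive when its IoU with an unmatched ground-truth box meets the threshold. A minimal single-class, single-image sketch of that rule (not the authors' evaluation code) looks like this:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy matching of predictions to ground truth at one IoU threshold.

    preds: list of (confidence, box); gts: list of boxes.
    Returns (precision, recall). Simplified version of the mAP@0.5
    matching rule: each ground-truth box can be claimed at most once,
    highest-confidence predictions first.
    """
    matched, tp = set(), 0
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Sweeping the confidence threshold over such matches yields the promised precision-recall curves, and averaging precision over that sweep yields mAP.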
Referee: [§5] §5 (Validation): the LLM-as-judge evaluation with a shared rubric is presented as the primary validation of MLLM risk reasoning, yet no inter-rater agreement with human experts, correlation analysis, or comparison against a non-LLM baseline is supplied; because the judge belongs to the same model class, the procedure risks circularity and known biases (position, verbosity, self-preference) that are not quantified or mitigated.
Authors: We acknowledge the risk of circularity and unquantified biases. In the revision we will expand §5 with: (i) a non-LLM baseline (rule-based risk scoring using detection counts and metadata) whose outputs are compared to the MLLM via the same rubric; (ii) human-expert ratings on a random subset of 150 cases, with reported Pearson correlation and Cohen’s kappa between the LLM judge and the two human raters; (iii) an ablation that varies prompt order and model temperature to quantify position and verbosity bias. We will also discuss these limitations explicitly. A full-scale human study on the entire corpus remains resource-constrained, but the proposed additions provide a concrete mitigation. revision: partial
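The agreement statistics proposed in (ii) are standard and easy to state precisely. A self-contained sketch of both, assuming the judge and human raters score the same cases on a shared ordinal rubric (the data here is illustrative, not from the paper):

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who assigned labels to the same items."""
    n = len(a)
    labels = sorted(set(a) | set(b))
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if p_chance == 1 else (p_observed - p_chance) / (1 - p_chance)

def pearson(x, y):
    """Pearson correlation of paired numeric scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)
```

Reporting kappa alongside raw agreement matters because an LLM judge that assigns the same rubric level to most cases can show high raw agreement purely by chance; kappa corrects for that.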
Circularity Check
LLM-as-judge validation of MLLM risk reasoning lacks grounding and is the weakest link in the central claim
[Abstract]
"We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric."
The sentence presents LLM-as-judge evaluation as the validation of MLLM risk reasoning quality. Because the judge belongs to the same broad class of large language models as the MLLMs whose outputs it evaluates, the procedure is not independent of the system being assessed; the claimed quality is therefore measured by a closely related component rather than by external benchmarks or human experts.
full rationale
The paper's central claim is that WildfireVLM produces high-quality risk assessments by combining YOLOv12 detection with MLLM reasoning. The only validation step offered is an LLM-as-judge procedure with a shared rubric. This step is load-bearing for the claim of quality and actionability, yet it relies on a model class closely related to the MLLMs whose outputs are being judged. No independent detection metrics (mAP, precision-recall), no human inter-rater agreement, and no non-LLM baseline are reported in the provided text, so the validation does not supply external grounding. The derivation therefore reduces the asserted quality to an internal, same-family assessment rather than an independently verified result.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption YOLOv12 is capable of detecting small and complex smoke and fire patterns in satellite imagery
- ad hoc to paper LLM-as-judge evaluation with a shared rubric accurately measures the quality of risk reasoning
Reference graph
Works this paper leans on
- [1] P. Xofis, G. Tsiourlis, and P. Konstantinidis, "A fire danger index for the early detection of areas vulnerable to wildfires in the eastern Mediterranean region," Euro-Mediterranean Journal for Environmental Integration, vol. 5, no. 2, p. 32, 2020.
- [2] D. Wang, D. Guan, S. Zhu, M. M. Kinnon, G. Geng, Q. Zhang, H. Zheng, T. Lei, S. Shao, P. Gong et al., "Economic footprint of California wildfires in 2018," Nature Sustainability, vol. 4, no. 3, pp. 252–260, 2021.
- [3] R. Xu, H. Lin, K. Lu, L. Cao, and Y. Liu, "A forest fire detection system based on ensemble learning," Forests, vol. 12, no. 2, p. 217, 2021.
- [4] J. E. Halofsky, D. L. Peterson, and B. J. Harvey, "Changing wildfire, changing forests: the effects of climate change on fire regimes and vegetation in the Pacific Northwest, USA," Fire Ecology, vol. 16, no. 1, pp. 1–26, 2020.
- [5] P. F. Hessburg, S. Charnley, A. N. Gray, T. A. Spies, D. W. Peterson, R. L. Flitcroft, K. L. Wendel, J. E. Halofsky, E. M. White, and J. Marshall, "Climate and wildfire adaptation of inland northwest US forests," Frontiers in Ecology and the Environment, vol. 20, no. 1, pp. 40–48, 2022.
- [6] A. L. Westerling and B. P. Bryant, "Climate change and wildfire in California," Climatic Change, vol. 87, pp. 231–249, 2008.
- [7] M. Goss, D. L. Swain, J. T. Abatzoglou, A. Sarhadi, C. A. Kolden, A. P. Williams, and N. S. Diffenbaugh, "Climate change is increasing the likelihood of extreme autumn wildfire conditions across California," Environmental Research Letters, vol. 15, no. 9, p. 094016, 2020.
- [8] R. Blanchi, C. Lucas, J. Leonard, and K. Finkele, "Meteorological conditions and wildfire-related house loss in Australia," International Journal of Wildland Fire, vol. 19, no. 7, pp. 914–926, 2010.
- [9] H. Liu, L. Shu, X. Liu, P. Cheng, M. Wang, and Y. Huang, "Advancements in artificial intelligence applications for forest fire prediction," Forests, vol. 16, no. 4, p. 704, 2025.
- [10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
- [11] N. M. Negash, L. Sun, C. Fan, D. Shi, and F. Wang, "Review of wildfire detection, fighting, and technologies: Future prospects and insights," in AIAA AVIATION FORUM AND ASCEND 2025, 2025, p. 3469.
- [12] P. Jin, P. Cheng, X. Liu, and Y. Huang, "From smoke to fire: A forest fire early warning and risk assessment model fusing multimodal data," Engineering Applications of Artificial Intelligence, vol. 152, p. 110848, 2025.
- [13] A. M. Fernandes, A. B. Utkin, and P. Chaves, "Automatic early detection of wildfire smoke with visible light cameras using deep learning and visual explanation," IEEE Access, vol. 10, pp. 12814–12828, 2022.
- [14] Y. Xie, B. Jiang, T. Mallick, J. D. Bergerson, J. K. Hutchison, D. R. Verner, J. Branham, M. R. Alexander, R. B. Ross, Y. Feng, L.-A. Levy, W. Su, and C. J. Taylor, "WildfireGPT: Tailored large language model for wildfire analysis," arXiv preprint arXiv:2402.07877, 2024.
- [15] L. A. O. Gonçalves, R. Ghali, and M. A. Akhloufi, "YOLO-based models for smoke and wildfire detection in ground and aerial images," Fire, vol. 7, no. 4, p. 140, 2024. [Online]. Available: https://www.mdpi.com/2571-6255/7/4/140
- [16] B. Kim and N. Muminov, "Smoke detection in UAV images using YOLOv7," Sensors, vol. 23, no. 15, p. 6701, 2023.
- [17] M. Navardi, P. Dixit, T. Manjunath, N. R. Waytowich, T. Mohsenin, and T. Oates, "Toward real-world implementation of deep reinforcement learning for vision-based autonomous drone navigation with mission," arXiv preprint arXiv:2208.06456, 2022.
- [18] A. Maillard et al., "Wildfire and smoke detection using YOLO-NAS," in IEEE Conference Proceedings, 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10585773
- [19] Yang et al., "Improved YOLOv5 for aerial smoke detection," Fire Technology, vol. 59, pp. 1–20, 2023.
- [20] Y. Tao, B. Li, P. Li, J. Qian, and L. Qi, "Improved lightweight YOLOv11 algorithm for real-time forest fire detection," Electronics, vol. 14, no. 8, p. 1508, 2025.
- [21] E. H. Alkhammash, "Multi-classification using YOLOv11 and hybrid YOLO11n-MobileNet models: A fire classes case study," Fire, vol. 8, no. 1, p. 17, 2025.
- [22] A. Elhanashi, S. Essahraui, P. Dini, and S. Saponara, "Early fire and smoke detection using deep learning: A comprehensive review of models, datasets, and challenges," Applied Sciences, vol. 15, no. 18, p. 10255, 2025.
- [23] H. Yin, Y. Yu, A. Hong, M. Hu, S. Wang, and Z. Zhang, "BGC-LiteNet: Beidou grid code embedded lightweight neural architecture for real-time UAV fire detection and localization," Scientific Reports, 2026.
- [24] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, "GeoChat: Grounded large vision-language model for remote sensing," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27831–27840.
- [25] Y. Hu, J. Yuan, C. Wen, X. Lu, and Y. Xian, "RSGPT: A remote sensing vision language model and benchmark," arXiv preprint arXiv:2307.15266, 2023.
- [26] W. Zhang, M. Cai, T. Zhang et al., "EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain," IEEE Transactions on Geoscience and Remote Sensing, 2024.
- [27] A. Ayanzadeh and T. Oates, "Floorplan2Guide: LLM-guided floorplan parsing for BLV indoor navigation," arXiv preprint arXiv:2412.18120, 2024.
- [28] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
- [29] S. Li, J. Ye et al., "LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods," arXiv preprint arXiv:2412.05579, 2024.
- [30] Y. Tian, Q. Ye, and D. Doermann, "YOLOv12: Attention-centric real-time object detectors," arXiv preprint arXiv:2502.12524, 2025.
- [31] G. Jocher et al., "YOLOv11: State-of-the-art object detection," Ultralytics, 2024. [Online]. Available: https://github.com/ultralytics/ultralytics
- [32] G. Jocher, A. Chaurasia, and J. Qiu, "YOLOv8: A real-time object detection system," arXiv preprint arXiv:2305.09972, 2023.
- [33] Deci AI, "YOLO-NAS: Neural architecture search for object detection," Technical Report, 2023. [Online]. Available: https://deci.ai/blog/yolo-nas-object-detection-foundation-model/
- [34] OpenAI, "GPT-4 technical report," 2023. [Online]. Available: https://openai.com/index/gpt-4-research/