FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle


arxiv: 2511.17171 · v6 · pith:S3B3RQO5new · submitted 2025-11-21 · 💻 cs.CV · cs.LG


Mario Markov (1), Stefan Maria Ailuro (1), Luc Van Gool (1), Konrad Schindler (2), Danda Pani Paudel (1) ((1) INSAIT, Sofia University "St. Kliment Ohridski", (2) ETH Zurich)

Pith reviewed 2026-05-25 07:13 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords wildfire risk prediction · chain-of-thought reasoning · vision-language models · raster generation · cross-continental generalization · spatial reasoning · multimodal prediction · interpretability


The pith

FireScope shows that chain-of-thought reasoning in a vision-language model improves generalization when predicting wildfire risk rasters from US training data to European events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset that pairs satellite imagery and climate data with expert risk maps in the US and real fire events in Europe. It then presents a model that generates risk rasters together with explicit language reasoning steps. The central claim is that these reasoning traces, learned through reinforcement and visual supervision, produce both higher accuracy on the held-out continent and more interpretable outputs than standard raster predictors. If correct, the work establishes that language-based reasoning can serve as a grounding mechanism for spatial generation tasks that must transfer across regions with different vegetation, climate, and land-use patterns.

Core claim

A VLM-based reasoning-to-generation framework trained on US expert risk rasters produces higher-fidelity risk maps on European wildfire events than prior methods, while its generated reasoning traces remain faithful to the visual and climatic inputs according to expert review and automated checks.

What carries the argument

The chain-of-thought oracle: it produces intermediate language reasoning traces that then condition the generation of continuous risk rasters.
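
The Figure 3 caption says the Oracle's scalar risk conditions an encoder-decoder through a FiLM mechanism. A minimal NumPy sketch of feature-wise linear modulation follows. The linear maps from the scalar to the per-channel scale and shift, the parameter names, and the tensor shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def film(features, risk, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: scale and shift every channel.

    features: (C, H, W) feature maps (hypothetical encoder output).
    risk:     scalar conditioning value, e.g. a predicted risk score.
    w_*, b_*: (C,) parameters of assumed linear maps risk -> gamma, beta.
    """
    gamma = w_gamma * risk + b_gamma  # per-channel scale, shape (C,)
    beta = w_beta * risk + b_beta     # per-channel shift, shape (C,)
    # Broadcast the per-channel terms over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy usage with made-up parameters: gamma = 0.7 and beta = 1 for all channels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8))
out = film(feats, 0.7, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4))
```

The point of the design is that a single scalar can shift the whole decoder's behaviour without being concatenated to the image as an extra channel.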

If this is right

Risk maps generated with explicit reasoning steps become directly inspectable by domain experts for missing causal factors.
The same training recipe can be applied to other spatial prediction tasks that require cross-region transfer, such as flood or drought mapping.
Models can be updated incrementally by adding new expert feedback on reasoning traces without retraining the entire raster generator.
Systematic studies of generalization become possible because the benchmark separates training geography from evaluation geography.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the reasoning traces prove reliable, they could be used to query the model about hypothetical climate scenarios, such as increased drought, without new labeled data.
The approach opens a route to hybrid systems where human experts edit the language reasoning rather than the pixel-level raster, potentially lowering the cost of model maintenance.
Success on this task suggests that similar reasoning-augmented generators could be tested on other raster outputs like land-cover classification where causal factors are also multimodal.

Load-bearing premise

The expert-defined risk rasters used for US training correctly identify the causal factors that drive wildfire risk and do not contain labeling patterns that fail to apply in Europe.

What would settle it

An ablation that removes the reasoning traces while keeping the same visual and climatic inputs. If performance on the European test set stays at the level of the full model, the claim that reasoning drives the generalization gain would be falsified; a drop to the level of non-reasoning baselines would support it.

Figures

Figures reproduced from arXiv: 2511.17171 by Mario Markov (1), Stefan Maria Ailuro (1), Luc Van Gool (1), Konrad Schindler (2), Danda Pani Paudel (1) ((1) INSAIT, Sofia University "St. Kliment Ohridski", (2) ETH Zurich).

**Figure 2.** FireScope-Bench overview. A large-scale multimodal benchmark combining satellite imagery, climate data, and expert-defined risk maps over the U.S. and Europe. It enables training on USA data and testing across Europe on real wildfire events to evaluate model generalization and reasoning in wildfire risk prediction. The benchmark includes metrics for accuracy, calibration, and interpretability.

**Figure 3.** FireScope overview. A VLM fine-tuned with GRPO learns CoT reasoning over climate and imagery to predict scalar risk (“Oracle”), which subsequently conditions Encoder–Decoder through a FiLM mechanism to generate fine-grained risk rasters, linking reasoning with spatial prediction. … collection [15] in 10 m resolution, constituting images of 1024 × 1024 pixels. Regions occluded by clouds are excluded, and each…

**Figure 4.** Ablation study. We assess the effects of more training…

**Figure 5.** Examples of failure cases when conditioning AlphaEarth on Oracle, fixed with the addition of CoT. Enabling iterative reasoning…

**Figure 6.** Error distribution of FireScope in Europe across latitudes

**Figure 7.** Tile-wise Brier score

**Figure 8.** Pixel-wise ROC AUC

**Figure 9.** Tile-wise ROC curves, pixel-wise ROC curves, tile-wise calibration curves

**Figure 10.** 35.3996°N, 98.2942°W (Oklahoma).

**Figure 11.** 45.6889°N, 118.4442°W (Oregon).

**Figure 12.** 42.1761°N, 26.161°E (Bulgaria). Fire event in 2020, pre-fire image from 2019.

**Figure 13.** 51.3168°N, 30.1658°E (Ukraine). Fire event in 2020, pre-fire image from 2019.

**Figure 14.** Visualization of U-Net FireScope’s adherence to its CoT and resulting high fidelity. After the CoT is artificially perturbed, the…
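
Figures 7 and 8 report a tile-wise Brier score and a pixel-wise ROC AUC. For readers unfamiliar with them, here is a minimal sketch of both metrics on flattened arrays. How the paper aggregates over tiles or pixels is not stated in this excerpt, so the function names and the toy inputs below are assumptions.

```python
import numpy as np

def brier_score(prob, label):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    prob = np.asarray(prob, dtype=float).ravel()
    label = np.asarray(label, dtype=float).ravel()
    return float(np.mean((prob - label) ** 2))

def roc_auc(score, label):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    score = np.asarray(score, dtype=float).ravel()
    label = np.asarray(label).ravel().astype(bool)
    pos, neg = score[label], score[~label]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]  # all positive/negative pairs
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy data: four pixels (or tiles), two burned (1) and two unburned (0).
probs = [0.9, 0.2, 0.3, 0.4]
labels = [1, 0, 1, 0]
bs = brier_score(probs, labels)   # (0.01 + 0.04 + 0.49 + 0.16) / 4
auc = roc_auc(probs, labels)      # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be the usual choice for full rasters.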

Original abstract

Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces FireScope-Bench, pairing Sentinel-2 imagery and climate data with expert-defined US wildfire risk rasters plus European wildfire events for cross-continental testing. It proposes FireScope, a VLM framework combining chain-of-thought reasoning, reinforcement learning, and visual supervision to output risk rasters together with reasoning traces. The central empirical claim is that US-trained models achieve substantial gains on European held-out events, with expert and automated validation confirming faithful, semantically meaningful reasoning traces. The work positions itself as the first to show language reasoning improving visual generation generalization, to offer a high-resolution cross-continent wildfire model, and to enable systematic cross-continental studies.

Significance. If the quantitative claims, baselines, and transferability validations hold after proper reporting, the result would be significant for multimodal spatial reasoning: it would provide concrete evidence that CoT-style language supervision can improve both accuracy and interpretability in raster prediction tasks that generalize across continents and data regimes.

major comments (3)

[Abstract] The assertion of 'substantial performance gains' when trained in the USA and tested in Europe is presented without any numeric metrics, baseline comparisons, ablation results, or statistical tests, rendering the central empirical claim impossible to evaluate.
[Abstract / §3 (dataset and evaluation)] The transferability assumption that US expert-defined risk rasters encode causal, region-agnostic drivers is load-bearing for the cross-continental claim yet unsupported; no inter-rater reliability statistics, validation against held-out fire occurrences, or ablation removing raster supervision is described.
[Abstract / Methods] The training protocol, model architecture, loss formulation, reinforcement-learning objective, and exact evaluation protocol on European events are absent, so the reported gains cannot be reproduced or stress-tested against the labeling-bias concern.

minor comments (1)

[Abstract] The three 'first' claims require a dedicated related-work section with explicit comparisons rather than an assertion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract requires strengthening with quantitative results and that methodological transparency is essential for the cross-continental claims. We will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract] The assertion of 'substantial performance gains' when trained in the USA and tested in Europe is presented without any numeric metrics, baseline comparisons, ablation results, or statistical tests, rendering the central empirical claim impossible to evaluate.

Authors: We agree the abstract should be self-contained. The experiments section reports specific metrics on European events (including comparisons to prior approaches), ablations on the reasoning components, and statistical tests. In revision we will insert the key numeric results, baseline deltas, and significance statements directly into the abstract. Revision: yes

Referee: [Abstract / §3 (dataset and evaluation)] The transferability assumption that US expert-defined risk rasters encode causal, region-agnostic drivers is load-bearing for the cross-continental claim yet unsupported; no inter-rater reliability statistics, validation against held-out fire occurrences, or ablation removing raster supervision is described.

Authors: The assumption is indeed central. The current manuscript describes expert raster construction and European event-based evaluation but does not report inter-rater statistics or the requested ablation. We will add these analyses (or explicit discussion of their absence) in §3 and the experiments section of the revision. Revision: yes

Referee: [Abstract / Methods] The training protocol, model architecture, loss formulation, reinforcement-learning objective, and exact evaluation protocol on European events are absent, so the reported gains cannot be reproduced or stress-tested against the labeling-bias concern.

Authors: The full methods section details the VLM architecture, CoT reasoning, RL objective, visual supervision losses, and European evaluation protocol. To improve accessibility we will add a concise methods summary to the abstract and ensure the European protocol is stated explicitly enough for reproduction and bias checks. Revision: yes

Circularity Check

0 steps flagged

No circularity; central claims are empirical results on held-out cross-continental data

full rationale

The supplied abstract and description contain no equations, parameter-fitting steps, or derivation chains. The headline result is framed as measured performance gains on European wildfire events after US training, which is an external empirical test rather than a quantity computed from the training statistics themselves. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way within the provided text. The expert-defined rasters are treated as input supervision, not as quantities derived inside the model. This is the normal case of a self-contained empirical paper; no reduction of any claimed prediction to its own inputs occurs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, hyperparameters, or modeling assumptions; ledger entries cannot be populated beyond the generic requirement that any VLM training rests on standard optimization assumptions.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.
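
For context, GRPO scores each sampled response relative to the other responses drawn for the same prompt. Below is a minimal sketch of that group-relative advantage, following the general formulation in DeepSeekMath [52]. The reward values are made up, the function name is hypothetical, and nothing here reflects the specific rewards used by FireScope or B-GRTO.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of four rollouts: two rewarded, two not.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group mean, no separate value network is needed, which is the main practical difference from PPO [48].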

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

Matej Arlovic, Franko Hrzic, Mitesh Patel, Tomasz Bednarz, and Josip Balen. Evaluation of synthetic data impact on fire segmentation models performance. Scientific Reports, 15(1):16759, 2025.

[2]

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495,

[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report....

[4]

Yogesh Balaji et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. preprint arXiv:2211.01324, 2022.

[5]

Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. SatlasPretrain: A large-scale dataset for remote sensing image understanding. preprint arXiv:2211.15660, 2023.

[6]

Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In European Conference on Computer Vision (ECCV), 2018.

[7]

Stephanie Bohlmann and Marko Laine. Statistical calibration of probabilistic medium-range fire weather index forecasts in Europe. Natural Hazards and Earth System Sciences, 24:4225–4235, 2024.

[8]

G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[9]

Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. AlphaEarth foundatio...

[10]

Jeremy Buch, Erich Fischer, Jorge Peña, et al. SMLFire1.0: a stochastic machine learning model for fire frequency and size distributions across the western United States. Geoscientific Model Development, 16:3407–3432, 2023.
[11]

Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. R2I-Bench: Benchmarking reasoning-driven text-to-image generation. preprint arXiv:2505.23493, 2025.

[12]

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.

[13]

Jacob Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968.

[14]

Copernicus. EFFIS burnt areas (by MODIS). https://forest-fire.emergency.copernicus.eu. Accessed 24.10.2025.

[15]

Copernicus. Sentinel-2. https://registry.opendata.aws/sentinel-2. Accessed 24.10.2025.

[16]

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. preprint arXiv:2501.12948, 2025.

[17]

Francesca Di Giuseppe, Joe McNorton, Anna Lombardi, and Fredrik Wetterhall. Global data-driven prediction of fire activity. Nature Communications, 16(1):58097, 2025.

[18]

Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[19]

T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[20]

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. preprint arXiv:1812.05905, 2019.
[21]

Linard Hoessly. On misconceptions about the Brier score in binary prediction models. preprint arXiv:2504.04906v4,

[22]

P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.

[23]

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. preprint arXiv:2107.14795, 2022.

[24]

Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, and Dahua Lin. Instruction reasoning dataset for advanced image editing. preprint arXiv:2405.11190, 2024.

[25]

Ivan Kajić et al. Evaluating numerical reasoning in text-to-image models. preprint arXiv:2406.14774, 2024.

[26]

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akhil Balsubramani, Weihua Hu, Michihiro Yasunaga, Percy Liang, Yair Carmon, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML),

[27]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. preprint arXiv:2205.11916, 2022.

[28]

Spyros Kondylatos, Ioannis Prapas, Michele Ronco, Ioannis Papoutsis, Gustau Camps-Valls, María Piles, Miguel-Ángel Fernández-Torres, and Nuno Carvalhais. Wildfire danger prediction and understanding with deep learning. Geophysical Research Letters, 49(17):e2022GL099368, 2022.

[29]

Spyros Kondylatos, Gustau Camps-Valls, and Ioannis Papoutsis. Uncertainty-aware deep learning for wildfire danger forecasting. preprint arXiv:2509.25017, 2025.

[30]

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy...
[31]

Ziqiu Lin et al. Evaluating text-to-visual generation with image-to-text models. preprint arXiv:2404.01291, 2024.

[32]

Haotian Liu, Chunyuan Li, Pengchuan Zhang, and Yong Jae Lee. MM-ReAct: Prompting ChatGPT for multimodal reasoning and action. preprint arXiv:2303.11381, 2023.

[33]

Jia Liu, Yukuan Wang, Yafeng Lu, Pengguo Zhao, Shunjiu Wang, Yu Sun, and Yu Luo. Application of remote sensing and explainable artificial intelligence for wildfire risk zoning in the mountainous region of Southwest China. Remote Sensing, 16(19):3602, 2024.

[34]

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.

[35]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. preprint arXiv:1711.05101, 2017.

[36]

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.

[37]

Joe Ramu McNorton, Francesca Di Giuseppe, Ewan Mark Pinnington, Matthew Chantry, and Chris Barnard. A global probability-of-fire (PoF) forecast. Geophysical Research Letters, 51:e2023GL107929, 2024.

[38]

Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, and Ping Luo. PhyBench: A physical commonsense benchmark for evaluating text-to-image models. preprint arXiv:2406.11802, 2024.

[39]

M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, 2015.

[40]

NASA. Data obtained from National Aeronautics and Space Administration (NASA) Langley Research Center’s Prediction of Worldwide Energy Resources (POWER), NASA Earth Science Division. Accessed 24.10.2025.
[41]

NASA. Data obtained from the POWER project’s climatology. Accessed 24.10.2025.

[42]

OpenAI. Introducing GPT-5, 2025. Accessed: Nov. 12,

[43]

Marc-André Parisien and Max A. Moritz. Environmental controls on the distribution of wildfire at multiple spatial scales. Ecological Monographs, 79(1):127–154, 2009.

[44]

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, 2018.

[45]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[46]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

[47]

J. San-Miguel-Ayanz, Ernst Schulte, Guido Schmuck, Andrea Camia, Peter Strobl, Giorgio Libertà, Cristiano Giovando, Roberto Boca, Fernando Sedano, Pieter Kempeneers, Daniel McInerney, Ceri Withmore, Sandra Oliveira, Marcos Rodrigues, Tracy Durrant, Paolo Corti, Friderike Oehler, Lara Vilar, and Giuseppe Amatulli. Comprehensive monitoring of wild...

[48]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. preprint arXiv:1707.06347, 2017.

[49]

A. Sengupta et al. Recent advances in explainable machine learning models for wildfires: From forecasting to burned area estimation. Environmental Data Science, 2025. In press.

[50]

USDA Forest Service. Wildfire risk to communities. https://wildfirerisk.org. Accessed 24.10.2025.

[51]

Dmitrii Shadrin, Svetlana Illarionova, Fedor Gubanov, Ksenia Evteeva, Maksim Mironenko, Ivan Levchunets, Roman Belousov, and Evgeny Burnaev. Wildfire spreading prediction using multimodal data and deep neural network approach. Scientific Reports, 14:2606, 2024.
[52]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihang Shao, Ziyu Wang, Yuxin Zhang, Zihan Zheng, Yao Liu, Zihan Liu, Yibo Shang, Linyang Xu, Tianyang Zhang, Lingpeng Chen, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.preprint arXiv:2402.03300, 2024. 5, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

ViperGPT: Visual inference via python execution for reasoning

D ´avid Sur´ıs, Sachit Menon, and Carl V ondrick. ViperGPT: Visual inference via python execution for reasoning. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 3

work page 2023
[54]

C. E. Van Wagner. Development and structure of the cana- dian forest fire weather index system. Technical Report 10 Forestry Technical Report 35, Canadian Forestry Service, Petawawa National Forestry Institute, Chalk River, Ontario,

work page
[55]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Self-consistency improves chain-of-thought reasoning in language models. preprint arXiv:2203.11171, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[56]

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[57]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. preprint arXiv:2201.11903, 2022.
[58]

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[59]

Zhengsen Xu, Jonathan Li, Sibo Cheng, Xue Rui, Yu Zhao, Hongjie He, Haiyan Guan, Aryan Sharma, Matthew Erxleben, Ryan Chang, and Linlin Xu. Deep learning for wildfire risk prediction: Integrating remote sensing and environmental data. ISPRS Journal of Photogrammetry and Remote Sensing, 2025. Early access.
[60]

Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for...
[61]

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Renrui Zhang, Zheng Li, Hongyang Li, Yu Qiao, and Peng Gao. Visual chain-of-thought reasoning for multimodal large language models. preprint arXiv:2309.17421, 2023.
[62]

Xinshen Zhang, Zhen Ye, and Xu Zheng. Towards omnidirectional reasoning with 360-r1: A dataset, benchmark, and GRPO-based method. preprint arXiv:2505.14197, 2025.
[63]

Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Tat-Seng Chua, and Ting Liu. Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities. preprint arXiv:2503.17979,
[64]

Miguel Ángel Torres-Vázquez, Sixto Herrera, Andrina Gincheva, Amar Halifa-Marín, Leone Cavicchia, Francesca Di Giuseppe, Juan Pedro Montávez, and Marco Turco. Enhancing seasonal fire predictions with hybrid dynamical and random forest models. Natural Hazards, 2,
[65]

FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

Supplementary Material
[66]

Detailed Metrics

In-distribution (ID). As we have ground-truth continuous risk rasters in the US, we use three metrics for evaluation:

Mean Squared Error (MSE) to quantify per-pixel prediction error:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i}^{N} (x_i - y_i)^2 \tag{5}$$

Mean Absolute Error (MAE) to quantify per-pixel prediction error:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i}^{N} |x_i - y_i| \tag{6}$$

Structural Similarity Index (SSI...
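The two per-pixel error metrics in Eqs. (5) and (6) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names `mse` and `mae` are hypothetical, rasters are assumed to be flattened into equal-length sequences, and treating `x` as the predicted raster and `y` as the ground-truth raster is an assumption (both metrics are symmetric, so the roles do not affect the result).

```python
from typing import Sequence


def mse(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared error over N pixels, as in Eq. (5)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def mae(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute error over N pixels, as in Eq. (6)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


# Hypothetical flattened 3-pixel rasters, used only for illustration.
predicted = [0.0, 0.5, 1.0]
ground_truth = [0.0, 1.0, 1.0]
print(mse(predicted, ground_truth), mae(predicted, ground_truth))
```

MSE penalizes large per-pixel deviations more strongly than MAE, which is one plausible reason to report both.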
[67]

Experiments Configurations

10.1. Models

Oracles. We select Qwen2.5-VL-7B-Instruct [3] as our Oracle for its strong open-source performance across multimodal reasoning benchmarks and efficiency to train and deploy at only 7B parameters. We train two versions of it: one with CoT reasoning as outlined in Section 10.2, and one trained with supervised fi...
[68]

Ablation results

Metrics of ablation methods are reported in Table 5.

Table 5. Metrics of ablation methods. Brier, ROC AUC, and ECE are computed on OOD wildfire events; ROC AUC (pixels) and IoU@0.5 on OOD wildfire pixels; MSE, SSIM, and MAE on ID wildfire risk rasters.

| Conditioning | Encoder | Brier↓ | ROC AUC↑ | ECE↓ | ROC AUC↑ (pixels) | IoU@0.5↑ | MSE↓ | SSIM↑ | MAE↓ |
|---|---|---|---|---|---|---|---|---|---|
| n/a | VLM-enc | 0.200 | 0.738 | 0.069 | 0.647 | 0.176 | 0.050 | 0.468 | 0.182 |
| Image only | Unet* | 0.208 | 0.699 | 0.058 | 0.619 | 0.173 | 0.019 | 0.620 | ... |
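Three of the out-of-distribution metrics named in Table 5 (Brier score, IoU@0.5, and ECE) can be illustrated with a short Python sketch. The paper's exact definitions are truncated in this excerpt, so the code below uses the standard textbook forms as an assumption: equal-width probability bins for ECE and a fixed 0.5 threshold for IoU. All function names are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and binary outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def iou_at_threshold(probs: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> float:
    """Intersection over union of the thresholded prediction and the binary mask."""
    pred = [p >= threshold for p in probs]
    inter = sum(1 for b, y in zip(pred, labels) if b and y == 1)
    union = sum(1 for b, y in zip(pred, labels) if b or y == 1)
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-weighted gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - freq)
    return ece


# Hypothetical per-pixel burn probabilities and binary burned-area labels.
probs = [0.95, 0.85, 0.25, 0.15]
labels = [1, 0, 0, 0]
print(brier_score(probs, labels),
      iou_at_threshold(probs, labels),
      expected_calibration_error(probs, labels))
```

Brier score and ECE both reward well-calibrated probabilities, while IoU@0.5 only looks at the binarized map, so a model can rank well on one group and poorly on the other.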
[69]

General-Purpose Capabilities of CoT Oracle (FireScope)

After prompting CoT Oracle to provide its reasoning and overall risk score, it may be of interest to users to acquire answers to follow-up questions or continue conversing. For this reason, we perform simple quantitative and qualitative analyses of CoT Oracle’s general-purpose capabilities. In this ...
[70]

**Vegetation Density**: The satellite image shows areas with varying shades of green, indicating different levels of vegetation density. Dense vegetation can increase fire risk due to more fuel available for fires to spread
[71]

**Climate Data**:
- **Average Temperature**: Lower temperatures generally reduce fire risk because fires require higher temperatures to ignite and sustain. However, the average temperatures in the image region are relatively low, suggesting that fire risk might be moderate but not extremely high.
- **Average Humidity**: Higher humidity reduces fire risk b...
[72]

**Slope**: The satellite image shows areas with varying elevations, which could indicate slopes. Slopes can increase fire risk as they allow fires to spread more easily uphill.

Analysis:
- **Vegetation Density**: The image shows a mix of dense and less dense vegetation. Dense vegetation increases fire risk.
- **Climate Data**:
  - **Temperature**: Low tem...
[73]

**Avoid Areas Near Vegetation**: Stay away from areas with dense vegetation, as these are more likely to catch fire and spread quickly

[74]

**Consider Elevation**: Higher elevation areas may have better air circulation, which can help dissipate heat and reduce the risk of fire

[75]

**Avoid Low-Lying Areas**: Low-lying areas can accumulate heat and moisture, making them more susceptible to fire. Recommended Area: Based on the satellite image, a good recommendation would be to build the house in the upper part of the image, specifically in the lighter green areas that appear to be less dense with vegetation. These areas are likely to ...

[76]

**Better Air Circulation**: Higher elevation areas typically experience better air circulation, which can help cool the surrounding environment and reduce the risk of heat accumulation. This can make the area more comfortable year-round
[77]

**Proximity to Water**: The presence of a body of water nearby can provide a natural buffer zone, reducing the risk of direct fire threats. Additionally, water can be used for firefighting purposes if needed
[78]

**Scenic Views**: Higher elevation areas often offer panoramic views of the landscape, which can be visually appealing and provide a sense of tranquility

[79]

**Reduced Noise Pollution**: Higher elevation areas are usually quieter than lower-lying areas, as they are farther from urban centers and industrial noise sources
[80]

**Potential for Wildlife**: Depending on the local ecosystem, higher elevation areas might attract more wildlife, offering opportunities for nature observation and enjoyment.

Disadvantages:

