pith. machine review for the scientific record.

arxiv: 2604.07675 · v1 · submitted 2026-04-09 · 💻 cs.CV


FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction


Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords wildfire spread prediction · dual-branch CNN · cross-attention · geospatial forecasting · disaster response · remote sensing · uncertainty quantification · evaluation bias

The pith

FireSenseNet separates static terrain from dynamic weather in dual CNN branches linked by attention to forecast next-day wildfire spread.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that treating fuel and terrain as one modality and weather as another, then linking them through learnable attention gates at multiple scales, produces more accurate next-day spread maps than simply stacking all inputs together. This matters because wildfire forecasts guide evacuation orders and firefighting resource placement, where even modest gains in precision can reduce loss of life and property. A systematic comparison of seven architectures on a public benchmark shows the proposed network reaching higher F1 and AUC-PR scores than a much larger transformer model. Ablation experiments isolate the contribution of the attention module, while feature-importance analysis highlights that yesterday's fire perimeter drives most predictions and that wind data adds noise at the available time resolution. The work also demonstrates that standard evaluation practices can overstate accuracy by more than forty percent.

Core claim

FireSenseNet is a dual-branch convolutional network in which one branch processes static fuel and terrain maps while the other processes time-varying meteorological fields; a Cross-Attentive Feature Interaction Module then uses learnable gates to exchange information between the branches at several encoder resolutions. On the Google Next-Day Wildfire Spread benchmark this architecture records an F1 score of 0.4176 and AUC-PR of 0.3435, exceeding the scores of all compared models including a SegFormer variant that contains 3.8 times more parameters. Channel-wise importance analysis shows the previous-day fire mask as the dominant input, while wind speed contributes little at the dataset's one-day temporal resolution.

What carries the argument

Cross-Attentive Feature Interaction Module (CAFIM), which applies learnable attention gates to fuse static fuel/terrain features with dynamic meteorological features at multiple encoder scales.
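
This page's excerpt does not reproduce CAFIM's equations, so the following is a minimal NumPy sketch of the general mechanism as the figure captions describe it: a learned spatial attention map α decides, per pixel, how much the fused feature draws from the static (fuel/terrain) branch versus the dynamic (weather) branch. The function name, shapes, and the 1×1-projection form of the gate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cafim_gate(static_feat, dynamic_feat, w, b):
    """Sketch of a cross-attentive gate: a spatial attention map
    alpha in (0, 1) blends static and dynamic features per pixel.
    `w` and `b` stand in for learned parameters; the paper's module
    operates at three encoder scales inside a CNN."""
    # Concatenate both modalities along the channel axis: (2C, H, W)
    stacked = np.concatenate([static_feat, dynamic_feat], axis=0)
    # 1x1-convolution-like projection to a single-channel logit map (H, W)
    logits = np.tensordot(w, stacked, axes=([0], [0])) + b
    alpha = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate
    # alpha -> 1 favours fuel/terrain features, alpha -> 0 favours weather
    return alpha * static_feat + (1.0 - alpha) * dynamic_feat

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
static_feat = rng.standard_normal((C, H, W))
dynamic_feat = rng.standard_normal((C, H, W))
w = rng.standard_normal(2 * C) * 0.1  # toy projection weights
fused = cafim_gate(static_feat, dynamic_feat, w, b=0.0)
print(fused.shape)  # (4, 8, 8)
```

Because the gate is a convex combination, each fused value lies between the two branch values at that pixel; the paper's Figure 7 visualizes exactly this α map at three scales.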

If this is right

  • The dual-branch design with explicit cross-attention yields a 7.1 percent relative F1 improvement over simple concatenation of inputs.
  • Previous-day fire perimeter supplies the majority of predictive signal while wind speed acts as noise under coarse temporal sampling.
  • Monte Carlo Dropout produces per-pixel uncertainty maps that can accompany the spread forecast.
  • Common evaluation practices that ignore spatial autocorrelation inflate F1 scores by more than 44 percent on this task.
  • The architecture uses fewer parameters than the strongest competing transformer while still achieving higher accuracy.
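
The Monte Carlo Dropout bullet can be made concrete with a toy sketch: dropout is left active at inference, the network is run N times on the same input, and the per-pixel mean and standard deviation of the stochastic passes become the forecast and its uncertainty map. The "model" below is a stand-in for illustration only, not FireSenseNet.

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_forward(x, drop_p=0.5):
    """Stand-in for one forward pass with dropout kept ON at test time.
    A real run would use the trained network; a fixed elementwise map
    plus a fresh inverted-dropout mask illustrates the mechanics."""
    mask = (rng.random(x.shape) > drop_p) / (1.0 - drop_p)
    logits = x * mask
    return 1.0 / (1.0 + np.exp(-logits))  # per-pixel fire probability

x = rng.standard_normal((16, 16))  # toy single-channel input patch
passes = np.stack([stochastic_forward(x) for _ in range(20)])  # N=20 passes

mean_pred = passes.mean(axis=0)  # the spread forecast map
sigma = passes.std(axis=0)       # per-pixel uncertainty, as in Figure 9
print(mean_pred.shape, sigma.shape)  # (16, 16) (16, 16)
```

In the paper's Figure 9 the same N=20 protocol concentrates σ at fire perimeter boundaries, which is what makes the map actionable for prioritizing verification.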

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of static and dynamic inputs could improve forecasts for other geospatial phenomena such as flood extent or crop yield where terrain and weather interact.
  • Uncertainty maps from Monte Carlo Dropout could be used to prioritize ground-truth collection in high-uncertainty regions for active learning loops.
  • At higher temporal resolution the wind channel might become informative, suggesting the current noise finding is resolution-dependent rather than fundamental.
  • Retraining on regional subsets could reveal whether the dominance of the fire mask persists outside the training geography.

Load-bearing premise

The Google Next-Day Wildfire Spread benchmark and its evaluation protocol reflect real-world next-day prediction difficulty, without the shortcuts that artificially boost reported scores.
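
As a toy illustration of why this premise is load-bearing, the sketch below scores one prediction under a clean protocol (unknown pixels masked and already-burning pixels excluded, one plausible reading of the paper's setup) and under the shortcut protocol of Figure 8 (previous-day fire kept in the target, unknowns relabeled non-fire). The numbers are illustrative, not the paper's 44%.

```python
import numpy as np

def f1(pred, target, mask=None):
    """Binary F1; `mask` selects which pixels count (True = evaluate)."""
    if mask is not None:
        pred, target = pred[mask], target[mask]
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy 1-D strip of pixels: 1 = fire, 0 = no fire, -1 = unknown
prev_fire = np.array([1, 1, 0, 0, 0, 0, 0, 0])
next_fire = np.array([1, 1, 1, 1, 0, 0, 0, -1])   # fire advanced two pixels
pred      = np.array([1, 1, 1, 0, 0, 0, 0, 0])    # model catches one of them

# Clean protocol: mask out unknowns and pixels that were already burning
clean = f1(pred, next_fire, mask=(next_fire != -1) & (prev_fire == 0))

# Shortcut protocol: score everywhere, unknowns relabeled as non-fire
inflated_target = np.where(next_fire == -1, 0, next_fire)
inflated = f1(pred, inflated_target)
print(round(clean, 3), round(inflated, 3))  # 0.667 0.857
```

The easy persistence pixels (already burning, trivially predicted) dominate the shortcut score, inflating F1 by roughly 29% in this tiny example; the paper reports 44-50% on the real benchmark.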

What would settle it

Retraining and testing the same architecture on a wildfire dataset that supplies hourly meteorological observations and finer spatial grids, then measuring whether wind speed regains predictive value and whether the reported F1 inflation disappears.

Figures

Figures reproduced from arXiv: 2604.07675 by Hak Han, Jae-Joon Lee, JinByeong Lee, Jinzhen Han, YeonJu Na.

Figure 1: FireSenseNet pipeline overview. Given 12-channel geospatial inputs from Day t, dual-branch encoders with modality-specific kernel sizes process fuel/terrain (4 ch, 3×3) and meteorological (8 ch, 5×5) features independently. Cross-Attentive Feature Interaction Modules (CAFIM) fuse the two branches via learned spatial attention at three encoder scales. A U-Net decoder with multi-scale skip connections produc…
Figure 2: Visualization of a representative test sample showing all 12 input channels and the ground truth fire mask. Static fuel/terrain features (top-left four panels) exhibit fine-grained spatial structure, while dynamic meteorological variables (remaining panels) vary smoothly across the patch. The ground truth (bottom-right) shows fire pixels (red), non-fire pixels (green), and unknown pixels (gray, labeled −1).
Figure 3: FireSenseNet architecture. Fuel/terrain and meteorological inputs are processed by independent encoder branches with modality-appropriate kernel sizes. At each spatial scale, a Cross-Attentive Feature Interaction Module (CAFIM) generates a spatial attention map that gates the contribution of each modality before routing to the U-Net decoder. MC Dropout before the prediction head enables uncertainty quantification.
Figure 4: Architecture spectrum: F1 score vs. model complexity. FireSenseNet achieves the highest F1 (0.4176) with only 3.0M parameters, while Transformer-based architectures cluster at the bottom despite higher parameter counts. The dual y-axis reveals an inverse relationship between F1 and reliance on Transformer components.
Figure 5: Qualitative prediction comparison on four test samples. FireSenseNet (CAFIM) produces spatially precise predictions matching ground truth geometries, while the SegFormer generates diffuse, over-spread probability maps. The Baseline CNN captures fire locations but with less boundary precision.
Figure 6: (a) CAFIM ablation: removing CAFIM and replacing it with simple concatenation reduces F1 by 7.1%. (b) Channel-wise feature importance via systematic ablation. PrevFireMask dominates (ΔF1 = −0.21); wind speed is the only channel whose removal improves performance (green bar), indicating it acts as noise at this temporal resolution.
Figure 7: CAFIM attention maps at three encoder scales for representative test samples. The learned spatial attention α determines the relative contribution of fuel features (α → 1, yellow) vs. weather features (α → 0, dark). At fine scales, CAFIM focuses sharply on fire perimeters; at coarse scales, it captures broader susceptibility patterns.
Figure 8: Clean vs. inflated evaluation protocols. Including previous-day fire in the target and relabeling unknown pixels as non-fire inflates reported F1 by 44–50%, disproportionately benefiting weaker architectures and masking genuine performance differences.
Figure 9: Monte Carlo Dropout uncertainty estimation for three test samples. From left to right: previous-day fire mask, ground truth, mean prediction (20 stochastic passes), and pixel-level uncertainty (σ). Uncertainty concentrates at fire perimeter boundaries, providing actionable information for prioritizing ground-truth verification during emergency response.
Original abstract

Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8* more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FireSenseNet, a dual-branch CNN equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that models spatially varying interactions between static fuel/terrain and dynamic meteorological inputs for next-day wildfire spread prediction. On the Google Next-Day Wildfire Spread benchmark, it reports an F1 of 0.4176 and AUC-PR of 0.3435, outperforming seven other architectures including a 3.8× larger SegFormer (F1=0.3502). Ablation studies show a 7.1% relative F1 gain from CAFIM, channel-wise importance analysis finds the prior-day fire mask dominant and wind speed noisy, Monte Carlo Dropout is used for uncertainty, and a critical analysis claims common evaluation shortcuts inflate F1 by over 44%.

Significance. If the reported metrics and comparisons were computed under the non-shortcut protocol the authors themselves identify, the work would be significant for geospatial deep learning in disaster modeling: it provides a concrete demonstration that explicit cross-modal attention improves over naive concatenation, offers a systematic head-to-head comparison across CNNs, ViTs and hybrids, and supplies actionable feature-importance insights at the dataset's coarse temporal resolution. The addition of pixel-level uncertainty quantification is also a constructive contribution.

major comments (2)
  1. [Abstract and Experimental Evaluation] The manuscript states clear performance numbers (F1=0.4176, AUC-PR=0.3435) and an ablation gain of 7.1% from CAFIM, yet provides no details on data splits, training protocol, positive/negative sampling, or the exact calculation behind the 44% F1 inflation claim. This is load-bearing for the central outperformance claim, because the paper itself flags that biased sampling, failure to mask non-burnable areas, or persistence-only labeling can inflate F1 by >44%; without explicit confirmation that the published numbers avoid these shortcuts, the attribution of gains to the dual-branch CAFIM design cannot be verified.
  2. [Results section (comparison table)] The head-to-head claim that FireSenseNet outperforms a SegFormer with 3.8× more parameters (F1 0.4176 vs 0.3502) and six other models rests on the benchmark evaluation protocol. If the protocol used is the shortcut version the authors criticize, the relative ranking and the conclusion that CAFIM provides a meaningful modeling advance are not supported by the evidence presented.
minor comments (1)
  1. [Abstract] The notation '3.8*' for the parameter ratio should be written as '3.8×' for clarity.
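
The referee's first major comment turns on split hygiene. A minimal sketch of a leakage-free split, under the assumption (not confirmed in the excerpt) that samples are grouped by fire event and divided at a temporal cutoff, with all field names illustrative:

```python
# Leakage-free split sketch: samples from the same fire event never
# straddle train/test, and an event with any pre-cutoff sample goes
# entirely to train, so test events strictly post-date training data.
samples = [
    {"id": i, "event": e, "date": d}
    for i, (e, d) in enumerate([
        ("fire_A", "2020-07-01"), ("fire_A", "2020-07-02"),
        ("fire_B", "2020-08-10"), ("fire_B", "2020-08-11"),
        ("fire_C", "2021-06-05"), ("fire_C", "2021-06-06"),
    ])
]

cutoff = "2021-01-01"  # ISO dates compare correctly as strings
train_events = {s["event"] for s in samples if s["date"] < cutoff}
test_events = {s["event"] for s in samples} - train_events

train = [s for s in samples if s["event"] in train_events]
test = [s for s in samples if s["event"] in test_events]
assert train_events.isdisjoint(test_events)  # no event straddles the split
print(sorted(train_events), sorted(test_events))  # ['fire_A', 'fire_B'] ['fire_C']
```

Grouping by event before splitting is what prevents the spatial-autocorrelation leakage the report warns about: adjacent patches of the same fire are highly correlated, so splitting at the patch level would let test pixels leak into training.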

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for experimental transparency. We have revised the manuscript to address both major points by expanding the relevant sections with the requested details and explicit protocol confirmations.

read point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The manuscript states clear performance numbers (F1=0.4176, AUC-PR=0.3435) and an ablation gain of 7.1% from CAFIM, yet provides no details on data splits, training protocol, positive/negative sampling, or the exact calculation behind the 44% F1 inflation claim. This is load-bearing for the central outperformance claim, because the paper itself flags that biased sampling, failure to mask non-burnable areas, or persistence-only labeling can inflate F1 by >44%; without explicit confirmation that the published numbers avoid these shortcuts, the attribution of gains to the dual-branch CAFIM design cannot be verified.

    Authors: We agree that these details are necessary to substantiate our claims. In the revised manuscript, we have added a new subsection under Experimental Setup that specifies: the temporal train/validation/test splits on the Google benchmark (with no future leakage), the full training protocol (optimizer, learning rate, epochs, batch size, and weighted loss for imbalance), the positive/negative sampling approach, and the exact procedure for the 44% inflation analysis (re-running baselines under biased sampling, unmasked non-burnable pixels, and persistence labeling). We explicitly confirm that all reported metrics, including the F1 of 0.4176, AUC-PR, and the 7.1% CAFIM ablation gain, were obtained under the non-shortcut protocol with non-burnable areas masked. revision: yes

  2. Referee: [Results section (comparison table)] The head-to-head claim that FireSenseNet outperforms a SegFormer with 3.8× more parameters (F1 0.4176 vs 0.3502) and six other models rests on the benchmark evaluation protocol. If the protocol used is the shortcut version the authors criticize, the relative ranking and the conclusion that CAFIM provides a meaningful modeling advance are not supported by the evidence presented.

    Authors: This is a fair point. The revised Results section and comparison table now include an explicit statement (with a footnote) that every model—including the 3.8× larger SegFormer and the other six architectures—was evaluated under the identical non-shortcut protocol we advocate: temporal splits without leakage, masking of non-burnable areas, and no biased or persistence-only sampling. We have also added implementation details for the SegFormer baseline to ensure fairness. Under this protocol the reported outperformance and the CAFIM ablation gain remain valid. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical performance claims or model design

full rationale

The paper's central claims rest on direct empirical comparisons of FireSenseNet against seven other architectures on the fixed Google Next-Day Wildfire Spread benchmark, plus ablation studies isolating the CAFIM module's contribution. These results are obtained by training and evaluating models on public data splits; they do not reduce, through the paper's own equations or definitions, to quantities that are fitted or defined only in terms of the target metrics. The critical analysis of evaluation shortcuts is presented as an independent contribution that distinguishes shortcut-inflated scores from the protocol used for the reported numbers, and no load-bearing self-citation props up the architecture or results. No self-definitional loops, fitted-input predictions, or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that separating static and dynamic inputs plus learned cross-attention yields measurable gains on this task; the model itself contains the usual neural-network free parameters learned from data.

free parameters (1)
  • learnable attention gates in CAFIM
    Parameters inside the cross-attentive module are fitted during training on the wildfire dataset.
axioms (1)
  • domain assumption Static fuel/terrain properties and dynamic meteorological conditions are fundamentally distinct and benefit from separate processing branches
    Explicitly stated in the abstract as the motivation for the dual-branch design.
invented entities (1)
  • Cross-Attentive Feature Interaction Module (CAFIM) no independent evidence
    purpose: To explicitly model spatially varying interactions between fuel and weather modalities at multiple encoder scales
    New module introduced by the authors; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5566 in / 1535 out tokens · 64813 ms · 2026-05-10T17:07:41.108239+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 3 internal anchors

  1. DeVries, T. and Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  2. Di Giuseppe, F., McNorton, J., Lombardi, A., and Wetterhall, F. (2025). Global data-driven prediction of fire activity. Nature Communications, 16(1):2918.
  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  4. Duff, T. J. and Penman, T. D. (2021). Determining the likelihood of asset destruction during wildfires: modelling house destruction with fire simulator outputs and local-scale landscape properties. Safety Science, 139:105196.
  5. Finney, M. A. (1998). FARSITE: Fire Area Simulator-model development and evaluation. U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station.
  6. Gerard, S., Zhao, Y., and Sullivan, J. (2023). WildfireSpreadTS: A dataset of multi-modal time series for wildfire spread prediction. Advances in Neural Information Processing Systems, 36:74515–74529.
  7. Hodges, J. L. and Lattimer, B. Y. (2019). Wildland fire spread modeling using convolutional neural networks. Fire Technology, 55(6):2115–2142.
  8. Huot, F., Hu, R. L., Goyal, N., Sankar, T., Ihme, M., and Chen, Y.-F. (2022). Next day wildfire spread: A machine learning dataset to predict wildfire spreading from remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 60:1–13.
  9. Iglesias, V., Balch, J. K., and Travis, W. R. (2022). US fires became larger, more frequent, and more widespread in the 2000s. Science Advances, 8(11):eabc0020.
  10. Kantarcioglu, O., Kocaman, S., and Schindler, K. (2023). Artificial neural networks for assessing forest fire susceptibility in Türkiye. Ecological Informatics, 75:102034.
  11. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
  12. Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  13. Luo, Y., Rong, S., Watts, A., and Cetin, A. E. (2026). U-Net with Hadamard transform and DCT latent spaces for next-day wildfire spread prediction. arXiv preprint arXiv:2602.11672.
  14. National Interagency Coordination Center (2024). Wildland fire summary and statistics annual report 2024. Technical report, National Interagency Coordination Center.
  15. Shadrin, D., Illarionova, S., Gubanov, F., Evteeva, K., Mironenko, M., Levchunets, I., Belousov, R., and Burnaev, E. (2024). Wildfire spreading prediction using multimodal data and deep neural network approach. Scientific Reports, 14(1):2606.
  16. Singh, H., Ang, L.-M., Paudyal, D., Acuna, M., Srivastava, P. K., and Srivastava, S. K. (2025). A comprehensive review of empirical and dynamic wildfire simulators and machine learning techniques used for the prediction of wildfire in Australia. Technology, Knowledge and Learning, 30(2):935–968.
  17. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., and Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090.