Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection

Erick Mas; Jianchao Bi; Junjie Hu; Lemeng Zhao; Rui-Yang Ju; Shunichi Koshimura; Yanbing Bai

arxiv: 2508.16739 · v2 · submitted 2025-08-22 · 💻 cs.CV

Yanbing Bai, Rui-Yang Ju, Lemeng Zhao, Junjie Hu, Jianchao Bi, Erick Mas, Shunichi Koshimura

Pith reviewed 2026-05-18 20:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords UAV wildfire monitoring · two-stage framework · policy network · adaptive compression · YOLOv8 fire detection · real-time video analysis · disaster response · computational efficiency

The pith

A two-stage UAV framework reduces computational costs for wildfire video analysis while preserving accuracy and enabling real-time fire detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to solve the problem of running heavy video analysis models on UAVs that have limited onboard computing power. They introduce a first stage that uses a policy network to decide which video clips are worth processing, with a station point mechanism that looks ahead at future frames to improve those decisions. This reduces the amount of data sent to the second stage, where an improved YOLOv8 model localizes fire sources in the selected frames. The result is lower overall computation while keeping the ability to spot fires accurately and in real time. The reported experiments on FLAME, HMDB51, and the Fire & Smoke Detection Dataset show reduced Stage 1 cost with classification accuracy maintained, and high detection accuracy with real-time inference in Stage 2.

Core claim

The paper establishes a lightweight two-stage framework for UAV wildfire video analysis. Stage 1 uses a policy network with a station point mechanism to identify and discard redundant clips, thereby lowering computational costs while operating near real time by incorporating future frame information. Stage 2 applies an improved YOLOv8 model to localize fire sources accurately and in real time only on the retained frames. Evaluations on the FLAME, HMDB51, and Fire & Smoke Detection datasets show significant cost reductions in Stage 1 with maintained classification accuracy and high detection accuracy with real-time inference in Stage 2.
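The control flow of the two stages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `keep_clip` and `detect_fire` are hypothetical stand-ins for the Stage 1 policy network and the Stage 2 detector, frames are represented by single numbers, and the box format is assumed.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Frame = float                    # stand-in for an image; a real frame would be an array
Clip = Sequence[Frame]
Box = Tuple[int, int, int, int]  # hypothetical (x1, y1, x2, y2) box format


def run_two_stage(
    clips: Iterable[Clip],
    keep_clip: Callable[[Clip], bool],
    detect_fire: Callable[[Frame], List[Box]],
) -> Tuple[List[Tuple[int, int, List[Box]]], int]:
    """Run the detector only on clips the policy keeps.

    Returns (detections, detector_calls); each detection is
    (clip_index, frame_index, boxes).
    """
    detections = []
    detector_calls = 0
    for clip_idx, clip in enumerate(clips):
        if not keep_clip(clip):      # Stage 1: discard redundant clip
            continue
        for frame_idx, frame in enumerate(clip):
            detector_calls += 1      # Stage 2 runs only on kept frames
            boxes = detect_fire(frame)
            if boxes:
                detections.append((clip_idx, frame_idx, boxes))
    return detections, detector_calls


# Toy stand-ins (assumptions for illustration only).
def toy_policy(clip: Clip) -> bool:
    return max(clip) > 0.5           # keep clips with any "bright" frame


def toy_detector(frame: Frame) -> List[Box]:
    return [(0, 0, 10, 10)] if frame > 0.8 else []


clips = [[0.1, 0.2], [0.6, 0.9], [0.0, 0.3]]
detections, calls = run_two_stage(clips, toy_policy, toy_detector)
```

In this toy run the detector is called on two of the six frames, which is the kind of saving the framework targets; the actual saving depends on how many clips the real policy discards.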

What carries the argument

The station point mechanism within the sequential policy network, which incorporates future frame information to improve the accuracy of decisions on which video clips to discard before passing them to the fire detector.
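The source says only that the station point mechanism brings future frame information into the sequential policy; it does not give the formulation. The sketch below is therefore an illustration of the general idea of a bounded lookahead, not the paper's mechanism: the keep/discard decision for a clip is delayed until `lookahead` later clips have been scored, which is one reason such a scheme would run near real time rather than strictly in real time. The function name, the max aggregation, and the threshold are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def lookahead_decisions(
    scores: Iterable[float],
    lookahead: int = 2,
    threshold: float = 0.5,
) -> Iterator[Tuple[int, bool]]:
    """Yield (clip_index, keep) using each clip's score plus up to
    `lookahead` future scores. The decision for clip t is emitted only
    after clip t + lookahead has arrived (bounded latency)."""
    window: Deque[float] = deque()
    next_index = 0
    for score in scores:
        window.append(score)
        if len(window) == lookahead + 1:
            # Keep the clip if it or any of the peeked clips looks relevant.
            yield next_index, max(window) >= threshold
            window.popleft()
            next_index += 1
    # Flush the tail, where fewer future clips are available.
    while window:
        yield next_index, max(window) >= threshold
        window.popleft()
        next_index += 1


# Hypothetical per-clip relevance scores; a fire becomes visible at clip 3.
scores = [0.1, 0.2, 0.3, 0.9, 0.2, 0.1]
decisions = list(lookahead_decisions(scores, lookahead=2))
```

Here clips 1 and 2 are kept even though their own scores are low, because the window already sees the high score at clip 3. That is the intuition for why future context could help retain onset clips.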

If this is right

Computational costs are significantly reduced in Stage 1 while classification accuracy is maintained on the FLAME and HMDB51 datasets.
Stage 2 achieves high fire source detection accuracy with real-time inference on the Fire & Smoke Detection Dataset.
The framework supports near-real-time operation suitable for onboard UAV disaster response applications.
Large models can run efficiently on UAVs with limited resources through selective processing of only relevant frames.
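The cost argument in the list above can be made concrete with a back-of-the-envelope model. This is an illustrative assumption, not a formula from the paper: every clip pays the policy cost, and only the kept fraction pays the detector cost. The numbers are made up.

```python
def expected_cost_per_clip(policy_cost: float, detector_cost: float, keep_rate: float) -> float:
    """Average compute per clip when only a fraction of clips reach Stage 2."""
    if not 0.0 <= keep_rate <= 1.0:
        raise ValueError("keep_rate must be in [0, 1]")
    return policy_cost + keep_rate * detector_cost


def break_even_keep_rate(policy_cost: float, detector_cost: float) -> float:
    """Keep rate below which gating is cheaper than always running the detector."""
    return 1.0 - policy_cost / detector_cost


# Made-up costs in arbitrary units (e.g. GFLOPs per clip).
gated = expected_cost_per_clip(policy_cost=1.0, detector_cost=10.0, keep_rate=0.3)
always_on = 10.0
```

Under this model gating pays off only while the keep rate stays below `break_even_keep_rate(...)`, so a policy that keeps almost every clip adds cost instead of saving it.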

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective clip processing strategy could extend to other long-duration UAV video tasks such as flood monitoring or search-and-rescue operations.
Joint training of the policy network and detector might further improve the balance between cost savings and detection reliability.
Real-world UAV flight tests in actual wildfire conditions would be required to validate performance beyond the laboratory datasets used.

Load-bearing premise

The policy network with the station point mechanism accurately discards redundant clips without missing frames that contain emerging or small fire sources.

What would settle it

A test video sequence in which a small or emerging fire source appears in a clip that the policy network discards as redundant, resulting in the fire going undetected by the second stage.
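One way to look for such a counterexample, sketched under assumptions: given per-clip ground-truth labels and the policy's keep/discard decisions for a video, list every fire clip that was discarded and check whether the onset clip is among them. The data layout (parallel sequences of booleans) and both function names are hypothetical; the source does not describe the paper's evaluation code.

```python
from typing import List, Sequence


def discarded_fire_clips(has_fire: Sequence[bool], kept: Sequence[bool]) -> List[int]:
    """Indices of clips that contain fire but were discarded by the policy."""
    if len(has_fire) != len(kept):
        raise ValueError("label and decision sequences must have equal length")
    return [i for i, (fire, keep) in enumerate(zip(has_fire, kept)) if fire and not keep]


def first_fire_clip_missed(has_fire: Sequence[bool], kept: Sequence[bool]) -> bool:
    """True if the earliest fire clip (the onset) was discarded."""
    for fire, keep in zip(has_fire, kept):
        if fire:
            return not keep
    return False  # no fire in this video


# Hypothetical labels for one video: fire starts at clip 2.
has_fire = [False, False, True, True, True]
kept = [False, False, False, True, True]
missed = discarded_fire_clips(has_fire, kept)
onset_missed = first_fire_clip_missed(has_fire, kept)
```

A non-empty `missed` list on a realistic ignition sequence would be exactly the failure case described above.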

Figures

Figures reproduced from arXiv: 2508.16739 by Erick Mas, Jianchao Bi, Junjie Hu, Lemeng Zhao, Rui-Yang Ju, Shunichi Koshimura, Yanbing Bai.

**Figure 1.** Pipeline of the proposed two-stage framework. In Stage 1, frame selection is performed based on a static distribution guided by [...]

**Figure 2.** Comparison between the traditional method and our method for video [...]

**Figure 3.** A representative example illustrating the two types of labels used in [...]

**Figure 4.** The pipeline illustrating the construction process of FLAME wildfire [...]

**Figure 6.** Performance comparison of frame selection scoring methods S1, S2, [...]

**Figure 7.** Detailed Precision–Recall curves for four different models across each [...]

Original abstract

Unmanned Aerial Vehicles (UAVs) have become increasingly important in disaster emergency response by facilitating aerial video analysis. Due to the limited computational resources available on UAVs, large models cannot be run efficiently for on-board analysis. To overcome this challenge, we propose a lightweight and efficient two-stage framework for wildfire monitoring and fire source detection on UAV platforms. Specifically, in Stage 1, we utilize a policy network to identify and discard redundant video clips, thereby reducing computational costs. We also introduce a station point mechanism that incorporates future frame information within the sequential policy network to improve prediction accuracy. This mechanism allows Stage 1 to operate in a near-real-time manner. In Stage 2, for frames classified as containing fire, we apply an improved YOLOv8 model to accurately localize the fire source in real-time on selected frames. We evaluate Stage 1 using the FLAME and HMDB51 datasets, and Stage 2 using the Fire & Smoke Detection Dataset. Experimental results show that our method significantly reduces computational costs while maintaining classification accuracy in Stage 1, and achieves high detection accuracy with real-time inference in Stage 2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper offers a practical two-stage UAV pipeline that prunes clips with a policy network and a station point mechanism before running an improved YOLOv8 detector, but the evidence that pruning is safe for small or emerging fires is thin.

The main thing here is a two-stage setup for onboard wildfire video work on UAVs: a policy network drops redundant clips using a station point that peeks at future frames, then an improved YOLOv8 localizes fire sources on the kept frames. It targets the real limit of compute on drones and aims for near-real-time operation without heavy models running constantly. The station point is a straightforward engineering addition to sequential decisions, and pairing it with YOLO tweaks for the detection stage gives a concrete architecture for the wildfire use case. They pick FLAME and HMDB51 for the pruning stage and a fire-smoke dataset for detection, which lines up with the application. That framing around actual hardware constraints and the split into pruning then localization is the part that feels useful.

The soft spots sit in the evaluation. The abstract claims cost reduction while holding accuracy and real-time high-accuracy detection, yet it gives no baselines, no ablation on the station point, no per-class false-negative rates, and no targeted checks on gradual ignition or small distant sources. HMDB51 action clips do not closely match the visual subtlety of smoke-obscured or early fire starts, so the accuracy numbers do not directly confirm that pruning preserves every critical sequence. If those details are missing from the full paper too, the central tradeoff stays hard to trust.

This is the kind of applied work that could interest engineers building drone systems for disaster response or remote sensing groups looking for efficient video handling. Readers who need ideas for cutting compute on limited platforms would find the framework worth reading, even if they plan to add their own tests. It deserves a serious referee because the problem is grounded and the method is a reasonable extension of existing tools, though the experiments need more precision on the pruning reliability to hold up.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a two-stage framework for efficient UAV-based wildfire video analysis. Stage 1 uses a policy network with a station point mechanism to identify and discard redundant video clips, reducing computational costs while maintaining classification accuracy, evaluated on the FLAME and HMDB51 datasets. Stage 2 applies an improved YOLOv8 model for real-time fire source localization on frames classified as containing fire, evaluated on the Fire & Smoke Detection Dataset. The abstract reports positive results on cost reduction and detection accuracy with real-time inference.

Significance. If the performance claims hold under rigorous testing, particularly the safe discarding of clips without missing emerging or small fire sources, the framework could provide a practical advance for on-board UAV wildfire monitoring by enabling efficient analysis on resource-constrained platforms while preserving detection utility.

major comments (2)

[Abstract and Experimental Results] The headline claim of significantly reducing computational costs while maintaining classification accuracy in Stage 1 depends on the policy network (with station point mechanism) having a low false-negative rate on clips containing small or emerging fire sources. However, the evaluation uses HMDB51, a generic action recognition dataset whose negative examples do not simulate subtle distant or smoke-obscured ignitions, and no per-class false-negative rates, ablation isolating the station-point contribution on onset frames, or test sets with gradual fire ignition sequences are reported.
[Stage 1 Method and Evaluation] The manuscript provides no details on baselines, error bars, exact metrics (e.g., precision/recall for fire vs. non-fire clips), or ablation studies, which prevents full assessment of whether the reported accuracy is competitive or if the cost savings preserve overall system utility for the target wildfire use case.
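The metrics requested in these comments are simple to compute once per-clip labels and predictions are available. The sketch below is a generic illustration with made-up labels, not the paper's evaluation script; the label convention (1 for fire, 0 for non-fire) and the function name are assumptions.

```python
from typing import Dict, Sequence


def fire_clip_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and false-negative rate for the fire class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fire clips kept as fire
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-fire flagged as fire
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fire clips missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "false_negative_rate": fnr}


# Made-up labels: 1 = fire clip, 0 = non-fire clip.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
metrics = fire_clip_metrics(y_true, y_pred)
```

For a safety-critical pruning stage the false-negative rate is the number to watch, since each false negative is a fire clip that never reaches the detector.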

minor comments (2)

[Abstract] The abstract refers to an 'improved YOLOv8' without specifying the modifications (e.g., architectural changes, loss functions, or training data augmentations).
[Method Description] Notation and implementation details for the station point mechanism and policy network training hyperparameters are not fully elaborated, which could hinder reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our two-stage UAV wildfire analysis framework. We address the major comments below and have revised the manuscript to improve the evaluation and clarity of Stage 1 results.

Referee: [Abstract and Experimental Results] The headline claim of significantly reducing computational costs while maintaining classification accuracy in Stage 1 depends on the policy network (with station point mechanism) having a low false-negative rate on clips containing small or emerging fire sources. However, the evaluation uses HMDB51, a generic action recognition dataset whose negative examples do not simulate subtle distant or smoke-obscured ignitions, and no per-class false-negative rates, ablation isolating the station-point contribution on onset frames, or test sets with gradual fire ignition sequences are reported.

Authors: We thank the referee for this important observation. FLAME provides wildfire-specific clips while HMDB51 is included to demonstrate generalization of the policy network beyond fire data. We acknowledge that HMDB51 negatives do not explicitly model subtle or smoke-obscured ignitions. The station point mechanism incorporates future-frame context precisely to improve detection of emerging events in sequential clips. In the revised manuscript we will add per-class false-negative rates, an ablation isolating the station-point contribution on onset frames, and a discussion of limitations regarding gradual ignition sequences, along with suggestions for future specialized test sets.
revision: partial

Referee: [Stage 1 Method and Evaluation] The manuscript provides no details on baselines, error bars, exact metrics (e.g., precision/recall for fire vs. non-fire clips), or ablation studies, which prevents full assessment of whether the reported accuracy is competitive or if the cost savings preserve overall system utility for the target wildfire use case.

Authors: We agree that these details are necessary for rigorous assessment. The revised manuscript will include comparisons against relevant baselines for the policy network, error bars computed over multiple runs, exact precision and recall for fire versus non-fire clip classification, and ablation studies on the station point mechanism and its contribution to end-to-end system utility for resource-constrained UAV wildfire monitoring.
revision: yes
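Error bars over multiple runs, as mentioned in this response, usually mean reporting the mean and sample standard deviation of a metric across training seeds. A minimal sketch using the Python standard library follows; the accuracy values are invented for illustration and are not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(values), statistics.stdev(values)


# Invented accuracies from three hypothetical training seeds.
accuracies = [0.90, 0.92, 0.94]
mean_acc, std_acc = mean_and_std(accuracies)
summary = f"{mean_acc:.3f} ± {std_acc:.3f}"
```

With only a handful of runs the spread estimate is itself noisy, so the number of seeds should be reported alongside it.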

Circularity Check

0 steps flagged

Derivation is self-contained with independent dataset evaluations

The paper's two-stage framework (policy network with station-point mechanism in Stage 1 for discarding redundant clips, followed by improved YOLOv8 in Stage 2) is evaluated on independent public datasets: FLAME and HMDB51 for Stage 1 classification accuracy, and Fire & Smoke Detection Dataset for Stage 2 detection. No equations or central claims reduce by construction to fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations. Efficiency and accuracy results are reported as empirical outcomes against external benchmarks rather than internal redefinitions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework depends on standard supervised training assumptions for the policy network and YOLOv8, plus the unverified effectiveness of the station point mechanism; several training hyperparameters are free parameters not detailed in the abstract.

free parameters (2)

Policy network training hyperparameters
Parameters controlling the policy network for clip selection are fitted during training but not specified.
YOLOv8 improvement parameters
Modifications and training settings for the improved YOLOv8 model are fitted to the fire dataset.

axioms (1)

domain assumption: The station point mechanism incorporates future frame information to improve sequential policy prediction accuracy.
Invoked to justify near-real-time operation and accuracy gains in Stage 1.


work page internal anchor Pith review Pith/arXiv arXiv 2014
[61]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition , pp. 248–255, Ieee, 2009

work page 2009
[62]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016

work page 2016
[63]

Tsm: Temporal shift module for efficient video understanding,

J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , pp. 7083–7093, 2019

work page 2019
[64]

An overview of gradient descent optimization algorithms

S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[65]

Crossformer: Transformer utilizing cross- dimension dependency for multivariate time series forecasting,

Y . Zhang and J. Yan, “Crossformer: Transformer utilizing cross- dimension dependency for multivariate time series forecasting,” in The eleventh international conference on learning representations , 2023

work page 2023
