
arxiv: 2508.16739 · v2 · submitted 2025-08-22 · 💻 cs.CV

Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection

Pith reviewed 2026-05-18 20:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV wildfire monitoring · two-stage framework · policy network · adaptive compression · YOLOv8 fire detection · real-time video analysis · disaster response · computational efficiency

The pith

A two-stage UAV framework reduces computational costs for wildfire video analysis while preserving accuracy and enabling real-time fire detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to solve the problem of running heavy video analysis models on UAVs that have limited onboard computing power. They introduce a first stage that uses a policy network to decide which video clips are worth processing, with a station point mechanism that looks ahead at future frames to improve those decisions. This reduces the amount of data sent to the second stage, where an improved YOLOv8 model finds and locates fire sources in the selected frames. The result is lower overall computation while keeping the ability to spot fires accurately and in real time. The reported experiments on the FLAME, HMDB51, and Fire & Smoke Detection datasets show that Stage 1 cuts computation without losing classification accuracy and that Stage 2 detects fire sources accurately at real-time speed.

Core claim

The paper proposes a lightweight two-stage framework for UAV wildfire video analysis. Stage 1 uses a policy network to identify and discard redundant clips, which lowers computational cost; its station point mechanism feeds future-frame information into the sequential policy to improve those decisions, and Stage 1 runs in near real time. Stage 2 applies an improved YOLOv8 model to localize fire sources accurately and in real time, only on the retained frames. Stage 1 is evaluated on the FLAME and HMDB51 datasets, where the authors report significant cost reductions with maintained classification accuracy. Stage 2 is evaluated on the Fire & Smoke Detection Dataset, where they report high detection accuracy with real-time inference.
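The control flow described in the core claim can be sketched as follows. This is a minimal illustration, not the authors' implementation: `keep_clip` stands in for the Stage 1 policy network and `detect_fire` for the Stage 2 detector, and those names, the `Clip` and `Box` types, and the toy thresholds in the demo are all assumptions introduced here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Clip = Sequence[float]            # hypothetical stand-in for a clip's frames or features
Box = Tuple[int, int, int, int]   # hypothetical (x1, y1, x2, y2) bounding box


def run_two_stage(
    clips: Iterable[Clip],
    keep_clip: Callable[[Clip], bool],
    detect_fire: Callable[[Clip], List[Box]],
) -> Tuple[List[Tuple[int, List[Box]]], int]:
    """Run the detector only on clips that the policy decides to keep.

    Returns the (clip index, boxes) pairs with at least one detection,
    plus the number of detector calls actually made.
    """
    detections: List[Tuple[int, List[Box]]] = []
    detector_calls = 0
    for index, clip in enumerate(clips):
        if not keep_clip(clip):      # Stage 1: discard redundant clips cheaply
            continue
        detector_calls += 1          # Stage 2: the expensive model runs only here
        boxes = detect_fire(clip)
        if boxes:
            detections.append((index, boxes))
    return detections, detector_calls


if __name__ == "__main__":
    # Toy data: each "clip" is a list of per-frame fire scores (an assumption).
    toy_clips = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1], [0.7, 0.95]]
    found, calls = run_two_stage(
        toy_clips,
        keep_clip=lambda clip: max(clip) > 0.5,
        detect_fire=lambda clip: [(0, 0, 10, 10)] if max(clip) > 0.9 else [],
    )
    print(found, calls)
```

The detector call count is returned alongside the detections because the framework's saving comes entirely from the clips on which Stage 2 never runs.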

What carries the argument

The argument rests on the station point mechanism inside the sequential policy network. It incorporates future-frame information so that the network decides more accurately which video clips to discard, and only the clips that survive are passed to the fire detector.
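The page does not describe how the station point mechanism works internally, so the following is only an illustration of why future context can change a keep/discard decision, not a reconstruction of the authors' design. It assumes each clip already has a scalar fire score, and the function name, the fixed `lookahead` window, and the `threshold` are hypothetical.

```python
from typing import List, Sequence


def lookahead_keep_decisions(
    scores: Sequence[float], lookahead: int, threshold: float
) -> List[bool]:
    """Keep clip t if any score among clips t..t+lookahead reaches the threshold.

    With lookahead=0 this reduces to a purely causal per-clip threshold.
    A positive lookahead means the decision for clip t must wait until
    `lookahead` later clips have arrived, which is one way a system ends
    up "near real time" rather than strictly real time.
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    decisions: List[bool] = []
    for t in range(len(scores)):
        window = scores[t : t + lookahead + 1]   # current clip plus future clips
        decisions.append(max(window) >= threshold)
    return decisions


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.6, 0.9, 0.1]   # hypothetical per-clip fire scores
    print(lookahead_keep_decisions(toy_scores, lookahead=0, threshold=0.5))
    print(lookahead_keep_decisions(toy_scores, lookahead=1, threshold=0.5))
```

In the toy run, the causal rule drops the clip at index 1, while a one-clip look-ahead keeps it because the following clip scores above the threshold. That is the kind of onset behaviour the load-bearing premise depends on.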

If this is right

  • Computational costs are significantly reduced in Stage 1 while classification accuracy is maintained on the FLAME and HMDB51 datasets.
  • Stage 2 achieves high fire source detection accuracy with real-time inference on the Fire & Smoke Detection Dataset.
  • The framework supports near-real-time operation suitable for onboard UAV disaster response applications.
  • Large models can run efficiently on UAVs with limited resources through selective processing of only relevant frames.
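The cost argument behind these points reduces to simple arithmetic, sketched below. The model assumes every clip pays the policy cost and only kept clips pay the heavy-model cost. The function names and all numbers are hypothetical; none are figures from the paper.

```python
def relative_cost(keep_ratio: float, policy_cost: float, heavy_cost: float) -> float:
    """Cost of the two-stage pipeline relative to running the heavy model on every clip.

    Assumes every clip pays `policy_cost` and only the kept fraction pays `heavy_cost`.
    Both costs must be in the same unit (for example GFLOPs per clip).
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    if policy_cost < 0 or heavy_cost <= 0:
        raise ValueError("policy_cost must be >= 0 and heavy_cost must be > 0")
    return (policy_cost + keep_ratio * heavy_cost) / heavy_cost


def break_even_keep_ratio(policy_cost: float, heavy_cost: float) -> float:
    """Keep ratio below which the two-stage pipeline is cheaper than the heavy model alone."""
    if policy_cost < 0 or heavy_cost <= 0:
        raise ValueError("policy_cost must be >= 0 and heavy_cost must be > 0")
    return 1.0 - policy_cost / heavy_cost


if __name__ == "__main__":
    # Hypothetical numbers: a 0.5-unit policy, a 10-unit heavy model, 30% of clips kept.
    print(relative_cost(keep_ratio=0.3, policy_cost=0.5, heavy_cost=10.0))
    print(break_even_keep_ratio(policy_cost=0.5, heavy_cost=10.0))
```

Under this model the policy network only pays for itself when it discards a larger fraction of clips than its own cost represents relative to the heavy model.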

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective clip processing strategy could extend to other long-duration UAV video tasks such as flood monitoring or search-and-rescue operations.
  • Joint training of the policy network and detector might further improve the balance between cost savings and detection reliability.
  • Real-world UAV flight tests in actual wildfire conditions would be required to validate performance beyond the laboratory datasets used.

Load-bearing premise

The policy network with the station point mechanism accurately discards redundant clips without missing frames that contain emerging or small fire sources.

What would settle it

A test video sequence in which a small or emerging fire source appears in a clip that the policy network discards as redundant, resulting in the fire going undetected by the second stage.
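Such a test could be scored with a check like the one below. It is a sketch under stated assumptions: per-clip ground-truth fire labels and the policy's keep decisions are available as aligned boolean sequences, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def discarded_fire_clips(has_fire: Sequence[bool], kept: Sequence[bool]) -> List[int]:
    """Indices of clips that contain fire but were discarded by the policy."""
    if len(has_fire) != len(kept):
        raise ValueError("has_fire and kept must have the same length")
    return [i for i, (fire, keep) in enumerate(zip(has_fire, kept)) if fire and not keep]


def onset_delay(has_fire: Sequence[bool], kept: Sequence[bool]) -> Optional[int]:
    """Number of clips between the first fire clip and the first fire clip that was kept.

    Returns None if the sequence has no fire clip, or if no fire clip was kept
    (in the latter case the fire never reaches the second stage).
    """
    if len(has_fire) != len(kept):
        raise ValueError("has_fire and kept must have the same length")
    fire_indices = [i for i, fire in enumerate(has_fire) if fire]
    if not fire_indices:
        return None
    onset = fire_indices[0]
    for i in fire_indices:
        if kept[i]:
            return i - onset
    return None


if __name__ == "__main__":
    # Hypothetical sequence: fire starts at clip 2, but the policy first keeps clip 3.
    truth = [False, False, True, True, True]
    decisions = [True, False, False, True, True]
    print(discarded_fire_clips(truth, decisions))
    print(onset_delay(truth, decisions))
```

A non-empty list of discarded fire clips on an ignition sequence, and especially an onset delay of `None`, would be direct evidence against the load-bearing premise.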

Figures

Figures reproduced from arXiv: 2508.16739 by Erick Mas, Jianchao Bi, Junjie Hu, Lemeng Zhao, Rui-Yang Ju, Shunichi Koshimura, Yanbing Bai.

Figure 1: Pipeline of the proposed two-stage framework. In Stage 1, frame selection is performed based on a static distribution guided by …
Figure 2: Comparison between the traditional method and our method for video …
Figure 3: A representative example illustrating the two types of labels used in …
Figure 4: The pipeline illustrating the construction process of FLAME wildfire …
Figure 6: Performance comparison of frame selection scoring methods S1, S2, …
Figure 7: Detailed Precision–Recall curves for four different models across each …
Original abstract

Unmanned Aerial Vehicles (UAVs) have become increasingly important in disaster emergency response by facilitating aerial video analysis. Due to the limited computational resources available on UAVs, large models cannot be run efficiently for on-board analysis. To overcome this challenge, we propose a lightweight and efficient two-stage framework for wildfire monitoring and fire source detection on UAV platforms. Specifically, in Stage 1, we utilize a policy network to identify and discard redundant video clips, thereby reducing computational costs. We also introduce a station point mechanism that incorporates future frame information within the sequential policy network to improve prediction accuracy. This mechanism allows Stage 1 to operate in a near-real-time manner. In Stage 2, for frames classified as containing fire, we apply an improved YOLOv8 model to accurately localize the fire source in real-time on selected frames. We evaluate Stage 1 using the FLAME and HMDB51 datasets, and Stage 2 using the Fire & Smoke Detection Dataset. Experimental results show that our method significantly reduces computational costs while maintaining classification accuracy in Stage 1, and achieves high detection accuracy with real-time inference in Stage 2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a two-stage framework for efficient UAV-based wildfire video analysis. Stage 1 uses a policy network with a station point mechanism to identify and discard redundant video clips, reducing computational costs while maintaining classification accuracy, evaluated on the FLAME and HMDB51 datasets. Stage 2 applies an improved YOLOv8 model for real-time fire source localization on frames classified as containing fire, evaluated on the Fire & Smoke Detection Dataset. The abstract reports positive results on cost reduction and detection accuracy with real-time inference.

Significance. If the performance claims hold under rigorous testing, particularly the safe discarding of clips without missing emerging or small fire sources, the framework could provide a practical advance for on-board UAV wildfire monitoring by enabling efficient analysis on resource-constrained platforms while preserving detection utility.

major comments (2)
  1. [Abstract and Experimental Results] The headline claim of significantly reducing computational costs while maintaining classification accuracy in Stage 1 depends on the policy network (with station point mechanism) having a low false-negative rate on clips containing small or emerging fire sources. However, the evaluation uses HMDB51, a generic action recognition dataset whose negative examples do not simulate subtle, distant, or smoke-obscured ignitions, and the manuscript reports no per-class false-negative rates, no ablation isolating the station-point contribution on onset frames, and no test sets with gradual fire ignition sequences.
  2. [Stage 1 Method and Evaluation] The manuscript provides no details on baselines, error bars, exact metrics (e.g., precision/recall for fire vs. non-fire clips), or ablation studies, which prevents a full assessment of whether the reported accuracy is competitive and whether the cost savings preserve overall system utility for the target wildfire use case.
minor comments (2)
  1. [Abstract] The abstract refers to an 'improved YOLOv8' without specifying the modifications (e.g., architectural changes, loss functions, or training data augmentations).
  2. [Method Description] Notation and implementation details for the station point mechanism and policy network training hyperparameters are not fully elaborated, which could hinder reproducibility.
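The clip-level metrics requested in the major comments (precision, recall, and false-negative rate for the fire class) can be computed as in the sketch below. It assumes binary labels where 1 means "fire clip" and a prediction of 1 means the policy treats the clip as fire; the function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def fire_clip_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and false-negative rate for the fire class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Ratios with an empty denominator are reported as 0.0 (a convention chosen here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_negative_rate = fn / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_negative_rate": false_negative_rate,
    }


if __name__ == "__main__":
    # Hypothetical labels for eight clips.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 0, 1, 0, 1, 0, 0, 1]
    print(fire_clip_metrics(truth, preds))
```

For a wildfire system the false-negative rate is the figure to watch, because a fire clip wrongly discarded in Stage 1 is never seen by Stage 2.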

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our two-stage UAV wildfire analysis framework. We address the major comments below and have revised the manuscript to improve the evaluation and clarity of Stage 1 results.

point-by-point responses
  1. Referee: [Abstract and Experimental Results] The headline claim of significantly reducing computational costs while maintaining classification accuracy in Stage 1 depends on the policy network (with station point mechanism) having a low false-negative rate on clips containing small or emerging fire sources. However, the evaluation uses HMDB51, a generic action recognition dataset whose negative examples do not simulate subtle, distant, or smoke-obscured ignitions, and the manuscript reports no per-class false-negative rates, no ablation isolating the station-point contribution on onset frames, and no test sets with gradual fire ignition sequences.

    Authors: We thank the referee for this important observation. FLAME provides wildfire-specific clips, while HMDB51 is included to demonstrate generalization of the policy network beyond fire data. We acknowledge that HMDB51 negatives do not explicitly model subtle or smoke-obscured ignitions. The station point mechanism incorporates future-frame context precisely to improve detection of emerging events in sequential clips. In the revised manuscript we will add per-class false-negative rates, an ablation isolating the station-point contribution on onset frames, and a discussion of limitations regarding gradual ignition sequences, along with suggestions for future specialized test sets.
    revision: partial

  2. Referee: [Stage 1 Method and Evaluation] The manuscript provides no details on baselines, error bars, exact metrics (e.g., precision/recall for fire vs. non-fire clips), or ablation studies, which prevents a full assessment of whether the reported accuracy is competitive and whether the cost savings preserve overall system utility for the target wildfire use case.

    Authors: We agree that these details are necessary for rigorous assessment. The revised manuscript will include comparisons against relevant baselines for the policy network, error bars computed over multiple runs, exact precision and recall for fire versus non-fire clip classification, and ablation studies on the station point mechanism and its contribution to end-to-end system utility for resource-constrained UAV wildfire monitoring.
    revision: yes
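Error bars over multiple runs, as promised in the second response, are commonly reported as mean plus or minus sample standard deviation. The sketch below shows that convention using Python's standard `statistics` module. The function names and the accuracy values are hypothetical, and the source does not say which convention the authors will use.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across independent runs."""
    if len(values) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return mean(values), stdev(values)


def format_with_error_bar(values: Sequence[float], digits: int = 3) -> str:
    """Render a metric as 'mean ± std' with a fixed number of decimals."""
    m, s = summarize_runs(values)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


if __name__ == "__main__":
    # Hypothetical clip-classification accuracies from three training seeds.
    accuracies = [0.90, 0.92, 0.91]
    print(format_with_error_bar(accuracies))
```

Applying the same summary to the baselines makes it possible to judge whether a gap between methods exceeds run-to-run variation.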

Circularity Check

0 steps flagged

Derivation is self-contained with independent dataset evaluations

full rationale

The paper's two-stage framework (policy network with station-point mechanism in Stage 1 for discarding redundant clips, followed by improved YOLOv8 in Stage 2) is evaluated on independent public datasets: FLAME and HMDB51 for Stage 1 classification accuracy, and Fire & Smoke Detection Dataset for Stage 2 detection. No equations or central claims reduce by construction to fitted parameters presented as predictions, self-definitional loops, or load-bearing self-citations. Efficiency and accuracy results are reported as empirical outcomes against external benchmarks rather than internal redefinitions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework depends on standard supervised training assumptions for the policy network and YOLOv8, plus the unverified effectiveness of the station point mechanism; several training hyperparameters are free parameters not detailed in the abstract.

free parameters (2)
  • Policy network training hyperparameters
    Parameters controlling the policy network for clip selection are fitted during training but not specified.
  • YOLOv8 improvement parameters
    Modifications and training settings for the improved YOLOv8 model are fitted to the fire dataset.
axioms (1)
  • domain assumption: Station point mechanism incorporates future frame information to improve sequential policy prediction accuracy.
    Invoked to justify near-real-time operation and accuracy gains in Stage 1.

