arxiv: 2604.27617 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

Pith reviewed 2026-05-07 05:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords crack classification · UAV bridge inspection · lightweight CNN · attention module · focal loss · data augmentation · structural health monitoring · real-time detection


The pith

Lightweight CNN with attention and focal loss detects bridge cracks at 825 FPS for UAV inspections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a lightweight convolutional neural network for automatic crack detection in UAV images of bridges. The framework addresses weak crack features, degraded conditions, class imbalance, and limited compute by using a compact backbone, an attention module to highlight relevant features, augmentation strategies drawn from inspection scenes, and focal loss to prioritize difficult examples. On the SDNET2018 dataset, it runs at 825 frames per second with modest model size and improves key accuracy metrics over the baseline model. A sympathetic reader would care because UAV inspections need fast onboard or near-real-time processing to be practical, and better crack detection can lead to earlier maintenance of critical infrastructure. The visualizations suggest the model learns to track cracks more precisely.

Core claim

The paper claims that combining a lightweight backbone, Convolutional Block Attention Module (CBAM), directed robust augmentation based on inspection-scene priors, and Focal Loss creates a unified framework that delivers high-speed, accurate crack classification suitable for real-time UAV bridge inspection. Tested on the SDNET2018 bridge deck dataset, the method attains an inference speed of 825 FPS using only 11.21 million parameters and 1.82 gigaflops. It raises the F1-score by 2.51 percent and recall by 3.95 percent relative to the baseline, while Grad-CAM heatmaps show the attention component directs attention along actual crack paths rather than scattered regions.

What carries the argument

The central mechanism is the AttXNet unified lightweight framework that integrates four components—a compact convolutional backbone for efficiency, CBAM to enhance channel and spatial features of weak cracks, a directed augmentation pipeline informed by UAV inspection priors to boost robustness, and focal loss to mitigate class imbalance by focusing training on hard samples.
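As a rough illustration of the last component, the sketch below implements the standard binary focal loss of Lin et al., FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The function name `focal_loss` and the defaults alpha = 0.25 and gamma = 2.0 are assumptions borrowed from the original focal loss paper; the values used in this paper are not stated in the text above.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for a single sample.

    p     : predicted probability of the positive (crack) class
    y     : ground-truth label, 1 = crack, 0 = no crack
    alpha : weight on the positive class (assumed default, not the paper's value)
    gamma : focusing parameter (assumed default, not the paper's value)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct crack prediction is down-weighted far more
# than a poorly classified one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
print(f"easy sample loss: {easy:.6f}")
print(f"hard sample loss: {hard:.6f}")
```

With gamma = 0 and alpha = 1 the expression reduces to ordinary cross-entropy on the positive class, which is a convenient sanity check when tuning the two parameters.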

If this is right

Supports real-time UAV inspections by processing images at 825 FPS, enabling ground-station assisted workflows without heavy hardware.
Low parameter count of 11.21M and 1.82G FLOPs allows deployment on resource-constrained UAV systems.
3.95% higher recall means more cracks are detected, reducing the risk of overlooking structural issues.
CBAM integration improves model focus on crack trajectories as confirmed by visualization techniques.
Provides a practical, balanced solution for accuracy, speed, and robustness in structural health monitoring.
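The headline numbers translate into a few back-of-envelope deployment figures. The sketch below is plain arithmetic on the reported values and assumes FP32 weights (4 bytes per parameter) and that the FLOP count is per image; neither assumption is confirmed by the text above, and the hardware behind the 825 FPS figure is not stated here.

```python
# Reported figures from the abstract.
FPS = 825.0                # images per second
PARAMS = 11.21e6           # parameter count
FLOPS_PER_IMAGE = 1.82e9   # assumed to be per forward pass on one image

# Amortized time per image; with batched inference this is a throughput
# figure, not necessarily the single-image latency.
latency_ms = 1000.0 / FPS

# Weight storage under the assumption of 32-bit floats.
fp32_weights_mb = PARAMS * 4 / 1e6

# Sustained compute implied by running at the reported rate.
sustained_gflops = FLOPS_PER_IMAGE * FPS / 1e9

print(f"time per image   : {latency_ms:.2f} ms")
print(f"FP32 weight size : {fp32_weights_mb:.2f} MB")
print(f"sustained compute: {sustained_gflops:.1f} GFLOP/s")
```

The implied sustained rate of roughly 1.5 TFLOP/s is a useful reminder that the FPS figure depends on the device it was measured on, so it should not be read as a property of the model alone.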

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scene-prior augmentation generalizes, similar techniques could improve defect detection in other UAV applications like power line or pipeline inspection.
Further tests on datasets with greater variation in lighting and bridge designs would test the claimed robustness.
The lightweight design opens possibilities for on-drone processing rather than relying solely on ground stations.
Combining this with multi-view or video analysis from UAV flights could enhance crack tracking over time.

Load-bearing premise

Performance gains from the directed augmentation, CBAM, and focal loss observed on the SDNET2018 dataset will transfer to real-world UAV operations with different lighting, weather, flight conditions, and bridge structures.

What would settle it

Collecting a new test set of UAV bridge images from unseen locations and conditions and measuring whether the reported improvements in F1-score and recall are maintained or significantly reduced.
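A minimal sketch of that check, assuming confusion counts are available for both the baseline and the full framework on the new test set. The function name `prf` and all counts below are hypothetical placeholders, not results from the paper. Differences are reported in percentage points; whether the paper's 2.51% and 3.95% are percentage points or relative gains is not stated in the text above.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for the crack class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts on an unseen UAV test set (illustration only).
baseline = prf(tp=80, fp=20, fn=20)
full = prf(tp=85, fp=20, fn=15)

delta_recall_pp = (full[1] - baseline[1]) * 100
delta_f1_pp = (full[2] - baseline[2]) * 100
print(f"recall gain: {delta_recall_pp:+.2f} percentage points")
print(f"F1 gain    : {delta_f1_pp:+.2f} percentage points")
```

If the gains on the unseen set shrink toward zero, the load-bearing premise above is in trouble; if they hold, the robustness claim gains real support.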

Figures

Figures reproduced from arXiv: 2604.27617 by Haisheng Li, Jiandong Wang, Kaichen Ma, Luming Yang, Weijie Li, Wei Li.

**Figure 1.** Three-dimensional comparison of FPS, test F1 …

**Figure 2.** Overview of the proposed UAV-based bridge crack detection framework. Raw bridge surface images are first …

**Figure 3.** Practical ground-station-assisted UAV bridge inspection workflow. The UAV continuously acquires high-resolution …

**Figure 4.** Comparison of resource consumption, inference efficiency, and classification performance of three backbone …

**Figure 5.** Comprehensive radar chart evaluation of three …

**Figure 7.** Comparison of F1-score and precision between …

**Figure 8.** Parameter comparison on Focal loss

**Figure 9.** Comparison of attention overhead across differ …

**Figure 10.** Training and validation loss/accuracy curves of …

**Figure 11.** Grad-CAM visualizations comparing the baseline model and the proposed AttX-Net. The attention maps …

**Figure 12.** Validation F1-score and precision across epochs

Original abstract

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at: https://github.com/skylynf/AttXNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a lightweight CNN framework for crack classification in UAV bridge inspections to address weak features, degraded conditions, class imbalance, and limited compute. The framework integrates a lightweight backbone, CBAM attention module, a directed robust augmentation strategy derived from inspection-scene priors, and Focal Loss. On the SDNET2018 bridge deck dataset, it reports 825 FPS inference, 11.21M parameters, 1.82G FLOPs, and gains of +2.51% F1-score and +3.95% recall over a baseline, with Grad-CAM visualizations indicating improved focus on crack trajectories. Source code is provided.

Significance. If the reported metrics on SDNET2018 are reproducible and the components prove effective, the work offers a practical, efficient solution for real-time UAV-assisted bridge inspection, balancing accuracy, speed, and handling of imbalance. The open-source code and concrete FPS/parameter/FLOP numbers strengthen reproducibility and applicability claims. However, the significance for 'robust' real-world deployment is limited by the absence of evidence beyond a single public dataset.

major comments (3)

[§4 (Experiments)] All quantitative results (825 FPS, 11.21M params, 1.82G FLOPs, +2.51% F1, +3.95% recall) and Grad-CAM visualizations are confined to SDNET2018 splits. No cross-dataset evaluation, no held-out UAV sequences with novel lighting/angles/bridge types, and no test of whether the inspection-scene priors transfer are reported, so extending the abstract's robustness claim to varied real-world UAV flight conditions is an unsupported extrapolation.
[§3.3 (Directed Robust Augmentation)] The augmentation strategy is explicitly 'based on inspection-scene priors' derived from the target dataset. Without an ablation that isolates its contribution from dataset-specific tuning or a transfer experiment on a different bridge dataset, it is impossible to determine whether the reported F1/recall gains are load-bearing or would generalize.
[§4.1 (Implementation Details) and §4.2 (Ablation Studies)] The baseline model architecture, training hyperparameters, and exact comparison protocol are not specified with sufficient precision to rule out post-hoc tuning. The incremental benefit of CBAM + augmentation + Focal Loss versus the backbone alone is not broken down with statistical tests or multiple runs, weakening the claim that the complete framework is responsible for the gains.

minor comments (2)

[§3.1] The backbone network is referred to as 'lightweight' but its exact topology (e.g., number of layers, specific MobileNet/EfficientNet variant) should be stated explicitly in §3.1 for reproducibility.
[Figure 11] Figure 11 (Grad-CAM) would benefit from side-by-side quantitative metrics (e.g., localization error) in addition to qualitative examples to strengthen the attention-module claim.
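One way to put a number on the Grad-CAM comparison is an intersection-over-union score between a thresholded heatmap and a pixel-level crack mask. The sketch below is only an illustration: the function name `heatmap_iou`, the 0.5 threshold, and the toy 3×3 arrays are all hypothetical, and it assumes crack masks are available, which the text above does not confirm for SDNET2018.

```python
def heatmap_iou(heatmap, mask, threshold=0.5):
    """IoU between the 'hot' region of a heatmap and a binary crack mask.

    heatmap : 2D list of floats in [0, 1] (e.g. a normalized Grad-CAM map)
    mask    : 2D list of 0/1 values marking crack pixels (assumed available)
    """
    inter = 0
    union = 0
    for h_row, m_row in zip(heatmap, mask):
        for h, m in zip(h_row, m_row):
            hot = h >= threshold
            crack = bool(m)
            inter += hot and crack   # pixel is both hot and on the crack
            union += hot or crack    # pixel is hot, on the crack, or both
    # Two empty regions are treated as a perfect match.
    return inter / union if union else 1.0


# Toy example: a diagonal crack, one focused map and one scattered map.
mask = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
focused = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.7]]
scattered = [[0.9, 0.6, 0.0], [0.7, 0.2, 0.6], [0.0, 0.1, 0.8]]

print(f"focused IoU  : {heatmap_iou(focused, mask):.3f}")
print(f"scattered IoU: {heatmap_iou(scattered, mask):.3f}")
```

Averaging such a score over the test set for the baseline and the full model would turn the qualitative "scattered versus along the crack" observation into a comparable figure.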

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight key areas for strengthening the evidence of generalization and experimental rigor. We agree that additional validation beyond SDNET2018 would better support the robustness claims and will incorporate the suggested revisions, including new experiments and clarifications, in the revised manuscript.

Point-by-point responses

Referee: All quantitative results (825 FPS, 11.21M params, 1.82G FLOPs, +2.51% F1, +3.95% recall) and Grad-CAM visualizations are confined to SDNET2018 splits. No cross-dataset evaluation, no held-out UAV sequences with novel lighting/angles/bridge types, and no test of whether the inspection-scene priors transfer are reported, so extending the abstract's robustness claim to varied real-world UAV flight conditions is an unsupported extrapolation.

Authors: We acknowledge that the current quantitative results and visualizations are limited to SDNET2018. Although this dataset includes substantial variation in lighting, angles, surface conditions, and crack appearances representative of UAV bridge inspections, we agree that cross-dataset evaluation would provide stronger support for claims of robustness to varied real-world conditions. In the revised manuscript, we will add evaluation on at least one additional public crack detection dataset to demonstrate transferability of the framework and the inspection-scene priors.
Revision: yes

Referee: The augmentation strategy is explicitly 'based on inspection-scene priors' derived from the target dataset. Without an ablation that isolates its contribution from dataset-specific tuning or a transfer experiment on a different bridge dataset, it is impossible to determine whether the reported F1/recall gains are load-bearing or would generalize.

Authors: The directed robust augmentation incorporates priors drawn from typical UAV bridge inspection characteristics (e.g., crack orientations under perspective distortion and common degradation patterns). To address the concern, we will expand the ablation studies to isolate the augmentation's contribution through controlled variants (with and without the directed strategy) and will use the planned cross-dataset experiments to test whether the observed gains transfer beyond SDNET2018-specific tuning.
Revision: yes

Referee: The baseline model architecture, training hyperparameters, and exact comparison protocol are not specified with sufficient precision to rule out post-hoc tuning. The incremental benefit of CBAM + augmentation + Focal Loss versus the backbone alone is not broken down with statistical tests or multiple runs, weakening the claim that the complete framework is responsible for the gains.

Authors: We will revise §4.1 to include complete specifications of the baseline architecture (layer dimensions and connections), all training hyperparameters (optimizer, learning rate schedule, batch size, epochs, and data splits), and the precise comparison protocol. In §4.2, we will report ablation results as means and standard deviations over multiple independent runs (minimum five seeds) and include statistical significance tests (e.g., paired t-tests) to rigorously quantify the incremental contributions of CBAM, the augmentation strategy, and Focal Loss.
Revision: yes
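The multi-seed comparison promised in the last response could look roughly like the sketch below, which computes a paired t statistic over per-seed F1 scores using only the Python standard library. The function name `paired_t` and the five F1 values per model are hypothetical placeholders, not results from the paper. The critical value 2.776 is the two-sided 5% threshold of Student's t distribution with 4 degrees of freedom.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)            # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical test F1 per seed (same five seeds for both models).
full_f1 = [0.91, 0.92, 0.90, 0.93, 0.91]
baseline_f1 = [0.89, 0.90, 0.89, 0.90, 0.88]

t_stat, df = paired_t(full_f1, baseline_f1)
T_CRIT_DF4 = 2.776                          # two-sided, alpha = 0.05, df = 4
print(f"t = {t_stat:.3f}, df = {df}, significant: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed matters here: it removes seed-to-seed variance shared by both models, which an unpaired comparison of the two means would leave in the noise term.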

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper is an empirical CV contribution proposing a lightweight CNN with CBAM attention, a directed augmentation strategy derived from inspection-scene priors, and Focal Loss. All quantitative claims (825 FPS, 11.21M params, +2.51% F1, +3.95% recall) are measured on the external public SDNET2018 benchmark against an explicitly stated baseline. No equations, first-principles derivations, or self-referential normalizations appear in the provided text. The augmentation is a fixed preprocessing step based on domain priors rather than a fitted parameter whose output is later renamed as a prediction. No self-citation chains or uniqueness theorems are invoked to justify core components. The derivation chain is therefore self-contained against external data and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

This is an applied empirical machine-learning paper whose claims rest on experimental outcomes rather than derivations. The augmentation strategy depends on domain knowledge of UAV inspection imagery, and standard CNN assumptions are used without new postulates.

free parameters (1)

training hyperparameters (learning rate, batch size, augmentation strengths)
Chosen during model development to optimize validation performance on SDNET2018; typical for CNN training and not derived from first principles.

axioms (2)

Domain assumption: Convolutional neural networks can learn discriminative features from labeled images for binary crack classification.
Standard premise underlying all deep-learning image classification work cited in the abstract.
Domain assumption: The SDNET2018 dataset distribution is sufficiently representative of real UAV bridge inspection conditions for generalization claims.
Invoked when the authors extrapolate from benchmark results to practical UAV deployment.

pith-pipeline@v0.9.0 · 5561 in / 1699 out tokens · 45617 ms · 2026-05-07T05:25:43.786227+00:00 · methodology

