pith. sign in

arxiv: 2605.26095 · v1 · pith:QUO3ME4Cnew · submitted 2026-05-25 · 💻 cs.CV

Pixel-Level Pavement Distress Assessment Using Instance Segmentation

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords pavement distressinstance segmentationMask R-CNNcrack detectionpothole detectionroad maintenancecomputer visionUWGB-StreetCrack
0
0 comments X

The pith

Mask R-CNN instance segmentation produces pixel-level masks that let aggregate crack area on pavement images match manual ground truth to within 0.006 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether instance segmentation can move pavement assessment from image-level labels or bounding boxes to exact per-pixel localization of thin, branching cracks and potholes. It trains five Mask R-CNN variants on a custom set of smartphone images collected from roadways and manually outlined with polygons for four distress classes. The strongest model reaches 84.23 percent precision and 90.04 percent recall under a bounding-box matching rule, then delivers a total predicted crack-area fraction of 2.164 percent that sits 0.006 points from the labeled fraction. This level of geometric agreement matters because maintenance crews need quantified extent, not just presence, to schedule repairs. The same setup also shows clear gains over a retrained YOLO detector on the identical data.

Core claim

A Mask R-CNN model with ResNet-101 FPN backbone, fine-tuned on the UWGB-StreetCrack dataset of polygon-annotated roadway images, localizes longitudinal, transverse, alligator cracks and potholes at the pixel level and yields an aggregate crack-area fraction of 2.164 percent that closely matches the 2.170 percent ground-truth fraction, while attaining 84.23 percent precision, 90.04 percent recall and 87.04 percent F1 under the project's bounding-box protocol.

What carries the argument

Mask R-CNN instance segmentation that outputs per-object binary masks rather than bounding boxes, applied to a vehicle-mounted smartphone image dataset with manual polygon labels for four distress types.

If this is right

  • Pixel-level masks enable direct computation of distress area instead of coarse bounding-box approximations.
  • Instance segmentation supplies a workable route to aggregate quantification on ordinary field photographs.
  • The same model family outperforms a CSPDarknet53 YOLO detector by a wide margin on the identical validation protocol.
  • Close numerical agreement on total crack fraction demonstrates that segmentation outputs can support maintenance-relevant measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resulting area totals could be fed directly into pavement-management databases to rank road segments by total distress load.
  • Processing sequential frames from a moving vehicle would allow the same masks to track how individual cracks grow between inspections.
  • Reducing reliance on single-annotator polygons would require multi-expert consensus labels or synthetic data augmentation to improve robustness.

Load-bearing premise

The human-drawn polygon annotations on the images accurately trace the true boundaries of each distress without systematic bias or inconsistency that would affect both training and the area-fraction comparison.

What would settle it

Acquire a fresh set of pavement images, obtain independent polygon annotations from multiple experts, and test whether the model's predicted aggregate crack-area fraction still lies within 0.01 percentage points of the averaged ground-truth fraction.

Figures

Figures reproduced from arXiv: 2605.26095 by Bibesh Pyakurel (University of Wisconsin - Green Bay), Kong Pheng Yang (University of Wisconsin - Green Bay), Logan Dewick (University of Wisconsin - Green Bay), M. G. Sarwar Murshed (University of Wisconsin - Green Bay), Nazim Choudhury (University of Wisconsin - Green Bay).

Figure 1
Figure 1. Figure 1: Vehicle-mounted smartphone used to collect field pavement imagery [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Polygon-based pavement distress annotation in Label Studio. The [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Representative failure cases used for qualitative analysis. Red boxes [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
read the original abstract

Automated pavement distress assessment requires more than image-level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance-relevant quantification. This paper presents a vision-based pavement distress analysis system based on Mask R-CNN instance segmentation and evaluates it on UWGB-StreetCrack, a custom field-collected roadway image dataset acquired with a vehicle-mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2-based Mask R-CNN backbone variants were considered under a consistent fine-tuning protocol. The best-performing model, Mask R-CNN with a ResNet-101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. The same model produced an aggregate predicted crack-area fraction of 2.164%, closely matching the 2.170% ground-truth crack-area fraction. To contextualize the segmentation system against a detector-oriented alternative, a CSPDarknet53-based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack-area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a Mask R-CNN instance segmentation system for pixel-level pavement distress assessment (longitudinal, transverse, alligator cracks, and potholes) on the custom UWGB-StreetCrack smartphone-collected dataset with manual polygon annotations. Five Detectron2 backbones are fine-tuned; the ResNet-101 FPN variant reports 84.23% precision, 90.04% recall, and 87.04% F1 under a project-specific bounding-box protocol, plus an aggregate predicted crack-area fraction of 2.164% that closely matches the 2.170% ground-truth fraction. A CSPDarknet53 YOLO detector is retrained for comparison, reaching 27.5% precision and 20.7% recall.

Significance. If the quantitative claims hold under proper validation, the work demonstrates that instance segmentation can deliver maintenance-relevant geometric metrics (pixel-area fractions) on field imagery where bounding-box detectors fall short, addressing a practical gap in automated pavement assessment.

major comments (3)
  1. [Abstract] Abstract: the central claim that the 2.164% vs. 2.170% crack-area fraction match demonstrates 'geometric precision necessary for maintenance-relevant quantification' is undermined because the same manual polygon annotations supply both the training targets and the reference area fraction; any consistent annotator bias in crack width, branch inclusion, or boundary placement is learned and reproduced, making the 0.006% difference uninformative without inter-annotator agreement, repeated labeling, or external reference (e.g., laser scan).
  2. [Abstract] Abstract: no validation-split details, error bars, post-hoc exclusions, or annotation inter-rater reliability are reported, so the precision/recall/F1 numbers for the ResNet-101 model cannot be assessed for robustness or generalizability.
  3. [Abstract] Abstract: the YOLO comparison (27.5% precision, 20.7% recall) is presented without confirming that the bounding-box matching protocol, image splits, and evaluation conditions were identical to those used for Mask R-CNN, preventing a controlled assessment of segmentation versus detection.
minor comments (1)
  1. The abstract refers to a 'project-specific bounding-box matching protocol' without definition or reference to a methods section; explicit description is needed for reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and indicate the revisions that will be incorporated.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the 2.164% vs. 2.170% crack-area fraction match demonstrates 'geometric precision necessary for maintenance-relevant quantification' is undermined because the same manual polygon annotations supply both the training targets and the reference area fraction; any consistent annotator bias in crack width, branch inclusion, or boundary placement is learned and reproduced, making the 0.006% difference uninformative without inter-annotator agreement, repeated labeling, or external reference (e.g., laser scan).

    Authors: We agree that the reported crack-area fraction comparison is derived from the same annotations used in training and therefore cannot independently demonstrate geometric precision free of annotator bias. We will revise the abstract to remove or substantially qualify the claim regarding 'geometric precision necessary for maintenance-relevant quantification' and add a limitations paragraph discussing reliance on single-annotator polygons. revision: yes

  2. Referee: [Abstract] Abstract: no validation-split details, error bars, post-hoc exclusions, or annotation inter-rater reliability are reported, so the precision/recall/F1 numbers for the ResNet-101 model cannot be assessed for robustness or generalizability.

    Authors: We will add explicit details on the validation split (image counts and class distribution) and state that the metrics come from a single fixed split without cross-validation or error bars. Inter-rater reliability was not computed. These clarifications will be inserted in the methods and results sections. revision: yes

  3. Referee: [Abstract] Abstract: the YOLO comparison (27.5% precision, 20.7% recall) is presented without confirming that the bounding-box matching protocol, image splits, and evaluation conditions were identical to those used for Mask R-CNN, preventing a controlled assessment of segmentation versus detection.

    Authors: The same image splits were used for both models. YOLO evaluation follows a bounding-box IoU protocol while Mask R-CNN metrics are obtained by converting predicted masks to boxes; we will explicitly document these protocol differences and confirm the shared splits in the revised methods and results sections. revision: partial

standing simulated objections not resolved
  • We cannot supply inter-annotator agreement statistics, repeated labeling, or external ground truth (e.g., laser scans) because the dataset was annotated by a single individual without such measures.

Circularity Check

0 steps flagged

No significant circularity; standard supervised evaluation on held-out data.

full rationale

The paper reports empirical performance metrics (precision, recall, F1) and an aggregate crack-area fraction comparison from Mask R-CNN trained on manually annotated UWGB-StreetCrack images and evaluated on held-out validation images. No mathematical derivation, functional form, or parameter fitting is present that reduces to its own inputs by construction. The area-fraction match is a direct comparison of model output to ground-truth labels on the test set, which is the standard non-circular outcome of supervised learning. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps for the central claims. The work is self-contained against external benchmarks (held-out image evaluation) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical computer-vision application paper. No mathematical derivation, free parameters in an analytic sense, axioms, or invented physical entities are present.

pith-pipeline@v0.9.1-grok · 5849 in / 1101 out tokens · 43641 ms · 2026-06-29T22:34:48.455365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Girshick. Mask r-cnn. In2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017

  2. [2]

    Cracktree: Automatic crack detection from pavement images.Pattern Recognition Letters, 33(3):227–238, 2012

    Qin Zou, Yu Cao, Qingquan Li, Qingzhou Mao, and Song Wang. Cracktree: Automatic crack detection from pavement images.Pattern Recognition Letters, 33(3):227–238, 2012

  3. [3]

    Automatic road crack detection using random structured forests.IEEE Transactions on Intelligent Transportation Systems, 17(12):3434–3445, 2016

    Yong Shi, Limeng Cui, Zhiquan Qi, Fan Meng, and Zhensong Chen. Automatic road crack detection using random structured forests.IEEE Transactions on Intelligent Transportation Systems, 17(12):3434–3445, 2016

  4. [4]

    Road crack detection using deep convolutional neural network

    Lei Zhang, Fan Yang, Yimin Daniel Zhang, and Ying Julie Zhu. Road crack detection using deep convolutional neural network. In2016 IEEE international conference on image processing (ICIP), pages 3708–3712. IEEE, 2016

  5. [5]

    Automatic Pavement Crack Detection Based on Structured Prediction with the Convolutional Neural Network

    Zhun Fan, Yuming Wu, Jiewei Lu, and Wenji Li. Automatic pavement crack detection based on structured prediction with the convolutional neural network.arXiv preprint arXiv:1802.02208, 2018

  6. [6]

    U-net: Con- volutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Con- volutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors,Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing

  7. [7]

    Deepcrack: Learning hierarchical convolutional features for crack detection.IEEE transactions on image processing, 28(3):1498– 1512, 2018

    Qin Zou, Zheng Zhang, Qingquan Li, Xianbiao Qi, Qian Wang, and Song Wang. Deepcrack: Learning hierarchical convolutional features for crack detection.IEEE transactions on image processing, 28(3):1498– 1512, 2018

  8. [8]

    FPCNet: Fast Pavement Crack Detection Network Based on Encoder-Decoder Architecture

    Wenjun Liu, Yuchun Huang, Ying Li, and Qi Chen. Fpcnet: Fast pave- ment crack detection network based on encoder-decoder architecture. arXiv preprint arXiv:1907.02248, 2019

  9. [9]

    Encoder–decoder network for pixel-level road crack detection in black- box images.Computer-Aided Civil and Infrastructure Engineering, 34(8):713–727, 2019

    Seongdeok Bang, Somin Park, Hongjo Kim, and Hyoungkwan Kim. Encoder–decoder network for pixel-level road crack detection in black- box images.Computer-Aided Civil and Infrastructure Engineering, 34(8):713–727, 2019

  10. [10]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

  11. [11]

    M. G. Sarwar Murshed, Keivan Bahmani, Stephanie Schuckers, and Faraz Hussain. Deep age-invariant fingerprint segmentation system. IEEE Transactions on Biometrics, Behavior, and Identity Science, 7(3):313–330, 2025

  12. [12]

    Deep learning frameworks for pavement distress classification: A comparative analysis

    Vishal Mandal, Abdul Rashid Mussah, and Yaw Adu-Gyamfi. Deep learning frameworks for pavement distress classification: A comparative analysis. pages 5577–5583, 2020

  13. [13]

    Hu, Bao L

    Guo X. Hu, Bao L. Hu, Zhong Yang, Li Huang, and Ping Li. Pavement crack detection method based on deep learning models.Wireless Communications and Mobile Computing, 2021(1):5573590, 2021

  14. [14]

    The road crack detection algorithm improved based on yolov7

    Xin Liang and Maoting Gao. The road crack detection algorithm improved based on yolov7. InProceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Secu- rity, pages 335–339, 2024

  15. [15]

    Gsbyolo: A lightweight multi-scale fusion network for road crack detection in complex environments.Scientific Reports, 15(1):26615, 2025

    Yuhao Wang, Heran Zhu, Yirong Wang, Jianping Liu, Jun Xie, Bi Zhao, and Siyue Zhao. Gsbyolo: A lightweight multi-scale fusion network for road crack detection in complex environments.Scientific Reports, 15(1):26615, 2025

  16. [16]

    Pavement crack detection and segmentation method based on improved deep learning fusion model.Mathematical Problems in Engineering, 2020(1):8515213, 2020

    Xiaoran Feng, Liyang Xiao, Wei Li, Lili Pei, Zhaoyun Sun, Zhidan Ma, Hao Shen, and Huyan Ju. Pavement crack detection and segmentation method based on improved deep learning fusion model.Mathematical Problems in Engineering, 2020(1):8515213, 2020

  17. [17]

    Automated pavement crack detection and segmentation based on two-step convolutional neural network

    Jingwei Liu, Xu Yang, Stephen Lau, Xin Wang, Sang Luo, Vincent Cheng-Siong Lee, and Ling Ding. Automated pavement crack detection and segmentation based on two-step convolutional neural network. Computer-Aided Civil and Infrastructure Engineering, 35(11):1291– 1305, 2020

  18. [18]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariha- ran, and Serge Belongie. Feature pyramid networks for object detection. In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017

  19. [19]

    Pavement crack instance segmentation using yolov7-wmf with connected feature fusion.Automation in Construction, 160:105331, 2024

    Guanting Ye, Sai Li, Manxu Zhou, Yifei Mao, Jinsheng Qu, Tieyu Shi, and Qiang Jin. Pavement crack instance segmentation using yolov7-wmf with connected feature fusion.Automation in Construction, 160:105331, 2024

  20. [20]

    Research on real- time detection algorithm for pavement cracks based on sparseinst-cdsm

    Shao-Jie Wang, Ji-Kai Zhang, and Xiao-Qi Lu. Research on real- time detection algorithm for pavement cracks based on sparseinst-cdsm. Mathematics, 11(15):3277, 2023

  21. [21]

    Distress identification manual for the long-term pavement performance program

    John S Miller, William Y Bellinger, et al. Distress identification manual for the long-term pavement performance program. Technical report, United States. Federal Highway Administration. Office of Infrastruc- ture . . . , 2003

  22. [22]

    Detectron2

    Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019

  23. [23]

    Automatic pavement crack detection based on structured prediction with the convolutional neural network, 2018

    Zhun Fan, Yuming Wu, Jiewei Lu, and Wenji Li. Automatic pavement crack detection based on structured prediction with the convolutional neural network, 2018