pith. machine review for the scientific record.

arxiv: 2604.08212 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · pavement condition assessment · instruction tuning · automated infrastructure inspection · ASTM D6433 compliance · multitask pavement analysis · domain-specific foundation models

The pith

Instruction tuning on a unified pavement dataset enables vision-language models to perform comprehensive assessments with over 20 percent gains and full standard compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether specialized instruction tuning can adapt general vision-language models to the precise demands of pavement evaluation. It builds a large dataset by merging annotations from nine existing pavement sources into 278,889 image-instruction-response pairs covering 32 task types. A model trained on this data outperforms prior systems on perception, spatial reasoning, and output generation while meeting ASTM D6433 requirements. If the approach holds, agencies could replace several narrow tools with one conversational interface that needs less expert oversight.
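For concreteness, here is a minimal sketch of what one such image-instruction-response pair could look like. The field names and example values are illustrative assumptions; the paper's exact schema is not reproduced on this page.

    # Hypothetical PaveInstruct-style record. Field names and values are
    # assumptions for illustration, not the paper's actual schema.
    example_pair = {
        "image": "images/pavement_000123.jpg",    # path to the source image (assumed)
        "source_dataset": "source_3",             # one of the nine merged collections
        "task_type": "distress_identification",   # one of the 32 task types (assumed name)
        "instruction": (
            "List all pavement distresses visible in this image and rate "
            "their severity following ASTM D6433."
        ),
        "response": (
            "Two distresses are present: a medium-severity longitudinal "
            "crack and a low-severity pothole."
        ),
    }

Each record pairs one supervision target with one task type, which is what would let a single model be tuned across all 32 tasks at once.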

Core claim

PaveGPT, the model trained on the PaveInstruct dataset, delivers improvements exceeding 20 percent in spatial grounding, reasoning, and generation tasks relative to state-of-the-art vision-language models and consistently produces outputs that satisfy ASTM D6433 pavement distress protocols.

What carries the argument

The PaveInstruct dataset, formed by unifying annotations across nine heterogeneous pavement collections into 278,889 image-instruction-response pairs spanning 32 task types, which supplies the targeted training signal for instruction tuning of vision-language models.

If this is right

  • Transportation agencies could operate a single conversational system for all routine pavement inspections instead of maintaining separate specialized tools.
  • Technical staff would require less domain expertise to obtain standardized condition ratings from image inputs.
  • The same instruction-tuning recipe could be repeated for other infrastructure domains such as bridges or railways to create comparable unified assessment interfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the unification step succeeds, similar merged datasets could accelerate adaptation of vision-language models to other regulated inspection fields that currently lack large labeled corpora.
  • The 32-task structure suggests the model may generalize to hybrid queries that combine perception with forward-looking maintenance recommendations, though this remains untested in the work.
  • Deployment would still require ongoing validation against new distress types or climate-specific degradation patterns not captured in the original nine sources.

Load-bearing premise

Merging annotations from nine different pavement datasets yields a single training collection that is consistent, representative, and unbiased enough to support reliable performance on every real-world condition and all 32 task types.
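One concrete way to stress this premise is a leave-one-source-out protocol: train on eight of the nine merged collections and evaluate on the held-out ninth. The sketch below is hypothetical; train and evaluate are placeholder callables and the source names are generic stand-ins, since the paper does not report this exact experiment.

    # Hypothetical leave-one-source-out check for domain shift across the
    # nine merged collections. `train` and `evaluate` are placeholders,
    # not the paper's code; source names are generic stand-ins.
    SOURCES = [f"source_{i}" for i in range(1, 10)]  # the nine merged datasets

    def leave_one_source_out(pairs, train, evaluate):
        """Train on eight sources and test on the ninth, for each source in turn."""
        scores = {}
        for held_out in SOURCES:
            train_split = [p for p in pairs if p["source_dataset"] != held_out]
            test_split = [p for p in pairs if p["source_dataset"] == held_out]
            model = train(train_split)
            scores[held_out] = evaluate(model, test_split)
        # A large spread across held-out sources would signal that the merged
        # dataset is less consistent or representative than the premise assumes.
        return scores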

What would settle it

A controlled test on a fresh, geographically distinct pavement dataset where the tuned model shows less than 10 percent improvement over untuned baselines or produces outputs that violate ASTM D6433 rating rules.
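As a sketch of how that settling test could be scored: compute the tuned model's relative improvement over the untuned baseline on the fresh dataset, and check every PCI output against the 0-100 scale on which ASTM D6433 expresses the Pavement Condition Index. All numbers below are placeholders, not results from the paper.

    # Hypothetical scoring of the settling test: relative improvement over an
    # untuned baseline, plus a PCI range check. All values are placeholders.

    def relative_improvement(tuned: float, baseline: float) -> float:
        """Percentage improvement of the tuned model over the baseline."""
        return 100.0 * (tuned - baseline) / baseline

    def pci_in_range(pci: float) -> bool:
        """ASTM D6433 expresses the Pavement Condition Index on a 0-100 scale."""
        return 0.0 <= pci <= 100.0

    delta = relative_improvement(tuned=0.62, baseline=0.50)  # placeholder mIoU values
    pci_outputs = [87.5, 42.0, 13.0]                         # placeholder predictions
    claim_refuted = delta < 10.0 or not all(pci_in_range(p) for p in pci_outputs)
    print(f"improvement: {delta:.1f}% - claim refuted: {claim_refuted}")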

Figures

Figures reproduced from arXiv: 2604.08212 by Anthony Dontoh, Armstrong Aboah, Blessing Agyei Kyem, Joshua Kofi Asamoah.

Figure 1. Sample annotated images across each of the individual datasets in PaveInstruct.
Figure 2. Overview of the PaveInstruct task taxonomy.
Figure 3. PaveInstruct dataset creation pipeline. To ensure accuracy and engineering relevance, instruction-response pairs pass through a multi-stage quality assurance process combining automated validation with expert human review, with each sample checked for structural integrity (required fields such as bounding boxes, severity labels, or PCI values).
Figure 4. Instruction count per source dataset.
Figure 5. Number of distress classes per dataset.
Figure 6. (Left) Global distress type frequencies. (Right) Percentage distribution of distress types.
Figure 7. Distribution of single-turn vs. multi-turn conversations.
Figure 8. Distribution of conversation turns per dialogue.
Figure 9. Distribution of answer word counts.
Figure 10. Distribution of answer formats in PaveInstruct.
Figure 11. Representative instruction-response pairs from PaveInstruct across multiple tasks.
Figure 12. PaveGPT architecture. PaveGPT is fine-tuned on the PaveInstruct training split with an autoregressive objective; during supervised fine-tuning the vision tower remains frozen while the multimodal projection MLP and the language backbone are updated (a minimal sketch of this setup follows the figure list).
Figure 13. PCI prediction Mean Absolute Error (MAE) comparison; lower values indicate better performance.
Figure 14. Zero-shot vs. fine-tuned performance on spatial grounding (mIoU).
Figure 15. Performance improvement (Δ%) from fine-tuning on PaveInstruct across spatial grounding, reasoning, and captioning tasks compared to zero-shot settings. In explanatory tasks such as pavement captioning, models trained on PaveInstruct generate more fluent, relevant, and context-sensitive outputs, with MiniCPM achieving the highest scores across all generation metrics, including BLEU-4 and ROUGE-L.
Figure 16. Heatmap of fine-tuned model performance across evaluation metrics.
Figure 17. Radar chart comparing fine-tuned VLMs across six key evaluation metrics.
Figure 18. Comparison of InternVL-3.5-8B performance before and after fine-tuning on PaveInstruct.
Figure 19. Qualitative comparison of VLM responses for the distress listing task.
Figure 20. Results on multiple-choice VQA for maintenance recommendation.
Figure 21. Qualitative comparison on PCI estimation with ASTM D6433-compliant reasoning.
Figure 22. Model predictions for the spatial grounding task; red boxes indicate localized distresses.
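The Figure 12 caption states that during supervised fine-tuning the vision tower stays frozen while the multimodal projection MLP and the language backbone are updated. Below is a minimal PyTorch sketch of that selective-freezing pattern, assuming a generic LLaVA-style module layout; the attribute names are illustrative, not PaveGPT's actual code.

    import torch

    # Sketch of the selective-freezing setup described in the Figure 12 caption:
    # vision tower frozen, multimodal projection MLP and language backbone trained.
    # The module layout is a generic LLaVA-style assumption, not PaveGPT's code.
    class PavementVLM(torch.nn.Module):
        def __init__(self, vision_dim=1024, hidden_dim=4096):
            super().__init__()
            self.vision_tower = torch.nn.Linear(vision_dim, vision_dim)    # stand-in encoder
            self.mm_projector = torch.nn.Sequential(                       # multimodal projection MLP
                torch.nn.Linear(vision_dim, hidden_dim),
                torch.nn.GELU(),
                torch.nn.Linear(hidden_dim, hidden_dim),
            )
            self.language_model = torch.nn.Linear(hidden_dim, hidden_dim)  # stand-in language backbone

    model = PavementVLM()
    for param in model.vision_tower.parameters():
        param.requires_grad = False  # the vision tower stays frozen

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # learning rate is an assumption

Freezing the encoder keeps visual features stable while the projector and language backbone learn pavement-specific vocabulary and output formats.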
original abstract

General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PaveInstruct, a dataset of 278,889 image-instruction-response pairs spanning 32 task types created by unifying annotations from nine heterogeneous pavement datasets, and PaveGPT, a vision-language foundation model trained via instruction tuning on this data. It claims that this approach transforms general VLMs into capable pavement assessment systems, yielding improvements exceeding 20% in spatial grounding, reasoning, and generation tasks relative to state-of-the-art models while producing outputs compliant with the ASTM D6433 standard, thereby enabling unified conversational tools for transportation agencies.

Significance. If the reported performance gains and standard compliance prove robust, the work would be significant for extending vision-language models to specialized engineering domains. The scale of the unified dataset and the multi-task coverage (32 types) represent a concrete step toward practical deployment of conversational AI for infrastructure inspection, potentially simplifying workflows by replacing multiple specialized systems. The suggested generalization to bridge, railway, and building assessment further broadens its potential impact.

major comments (2)
  1. [Abstract] The central claim that instruction tuning produces improvements exceeding 20% in spatial grounding, reasoning, and generation tasks is presented without any quantitative metrics, baseline model names, statistical significance tests, or error analysis. This absence leaves the performance transformation unsubstantiated and directly undermines verification of the primary result.
  2. [Dataset construction (PaveInstruct)] Merging annotations from nine sources that differ in camera geometry, lighting conditions, distress taxonomies, and labeling protocols creates unaddressed risks of label noise, domain shift, and coverage gaps across the 278,889 pairs and 32 tasks. No harmonization protocol, cross-dataset validation splits, or diversity statistics are described, which threatens the reliability of both the claimed quantitative gains and the ASTM D6433 compliance.
minor comments (1)
  1. [Abstract] The phrase 'state-of-the-art vision-language models' is used for comparison, but no specific models or references are named, reducing clarity on the evaluation setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions to the manuscript will be made.

point-by-point responses
  1. Referee: [Abstract] The central claim that instruction tuning produces improvements exceeding 20% in spatial grounding, reasoning, and generation tasks is presented without any quantitative metrics, baseline model names, statistical significance tests, or error analysis. This absence leaves the performance transformation unsubstantiated and directly undermines verification of the primary result.

    Authors: We agree that the abstract would be strengthened by including more specific quantitative information to support the central claim. The full manuscript provides detailed comparisons against state-of-the-art vision-language models in the experimental section, along with task-specific metrics and evaluation protocols. To directly address this point, we will revise the abstract to reference the primary baselines, include representative quantitative improvements, and note the evaluation methodology. revision: yes

  2. Referee: [Dataset construction (PaveInstruct)] Merging annotations from nine sources that differ in camera geometry, lighting conditions, distress taxonomies, and labeling protocols creates unaddressed risks of label noise, domain shift, and coverage gaps across the 278,889 pairs and 32 tasks. No harmonization protocol, cross-dataset validation splits, or diversity statistics are described, which threatens the reliability of both the claimed quantitative gains and the ASTM D6433 compliance.

    Authors: The referee raises a valid concern about the challenges of unifying heterogeneous data sources. The manuscript describes the unification of annotations from nine pavement datasets into PaveInstruct and alignment with the ASTM D6433 standard, but we acknowledge that additional explicit details on the harmonization process would improve clarity and address potential risks of label noise or domain shift. We will expand the dataset construction section to include a description of the harmonization protocol, diversity statistics across the source datasets, and the approach to cross-dataset validation splits. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and benchmarking on unified dataset

full rationale

The paper creates PaveInstruct by merging annotations from nine prior datasets into 278,889 pairs across 32 tasks, trains PaveGPT via instruction tuning, and reports >20% gains plus ASTM D6433 compliance by direct comparison to SOTA vision-language models. No equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. Claims rest on standard empirical evaluation rather than self-definition, self-citation, load-bearing uniqueness theorems, or renamed known results. The central results are falsifiable against external benchmarks and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the work relies on standard practices of dataset unification and instruction tuning without additional postulated mechanisms.

pith-pipeline@v0.9.0 · 5478 in / 1111 out tokens · 59444 ms · 2026-05-10T17:25:24.052694+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 40 canonical work pages · 12 internal anchors

  1. [1]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, J. Zhou, Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, arXiv preprint arXiv:2308.12966 (2023)

  2. [2]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, J. Lin, Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024)

  3. [3]

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, J. Lin, Qwen2.5-vl technical report, arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning (2023)

  5. [5]

    H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning (2023)

  6. [6]

    X. He, Y. Zhang, L. Mou, E. Xing, P. Xie, Pathvqa: 30,000+ questions for medical visual question answering, arXiv preprint arXiv:2003.10286 (2020). URL https://arxiv.org/abs/2003.10286

  7. [7]

    J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific Data 5 (1) (2018) 180251. doi:10.1038/sdata.2018.251. URL https://www.nature.com/articles/sdata2018251

  8. [8]

    T. Qian, J. Chen, L. Zhuo, Y. Jiao, Y.-G. Jiang, Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario, arXiv preprint arXiv:2305.14836 (2023). URL https://arxiv.org/abs/2305.14836

  9. [9]

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, H. Li, Drivelm: Driving with graph visual question answering, in: European Conference on Computer Vision (ECCV) 2024, 2024. URL https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06870.pdf

  10. [10]

    J. Kim, A. Rohrbach, T. Darrell, J. Canny, Z. Akata, Textual explanations for self-driving vehicles, Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  11. [11]

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, J. Gao, Llava-med: Training a large language-and-vision assistant for biomedicine in one day (2023). arXiv:2306.00890. URL https://arxiv.org/abs/2306.00890

  12. [12]

    K. Singhal, Y. Lu, C. Kahn, ..., Large language models encode clinical knowledge, arXiv preprint arXiv:2212.13138 (2022). URL https://arxiv.org/abs/2212.13138

  13. [13]

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev, Laion-5b: An open large-scale dataset for training next generation image-text models, arXiv preprint arXiv:2210.08402 (2022). URL https://arxiv.org/abs/2210.08402

  14. [14]

    P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proc. 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), 2018. URL https://aclanthology.org/P18-1238/

  15. [15]

    X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, C. L. Zitnick, Microsoft coco captions: Data collection and evaluation server, arXiv preprint arXiv:1504.00325 (2015). URL https://arxiv.org/abs/1504.00325

  16. [16]

    R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, L. Fei-Fei, Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision 123 (1) (2017) 32–73. URL https://link.springer.com/article/10.1007/s11263-016-0981-7

  17. [17]

    H. Liu, C. Li, Y. J. Lee, Llava-instruct-150k: Visual instruction tuning for large-scale multimodal models, arXiv preprint arXiv:2310.03744 (2023). URL https://arxiv.org/abs/2310.03744

  18. [18]

    S. Kulkarni, S. Singh, D. Balakrishnan, S. Sharma, S. Devunuri, S. C. R. Korlapati, Crackseg9k: A collection and benchmark for crack segmentation datasets and frameworks, in: ECCV 2022 Workshops, Springer, 2022, pp. 179–195. doi:10.1007/978-3-031-25082-8_12. URL https://arxiv.org/abs/2208.13054

  19. [19]

    Y. Liu, J. Yao, X. Lu, R. Xie, L. Li, Deepcrack: A deep hierarchical feature learning architecture for crack segmentation, Neurocomputing 338 (2019) 139–153. doi:10.1016/j.neucom.2019.01.036

  20. [20]

    L. Zhang, F. Yang, Y. D. Zhang, Y. J. Zhu, Road crack detection using deep convolutional neural network, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 3708–3712

  21. [21]

    Z. Tong, T. Ma, J. Huyan, W. Zhang, Pavementscapes: A large-scale hierarchical image dataset for asphalt pavement damage segmentation, arXiv preprint arXiv:2208.00775 (2022). URL https://arxiv.org/abs/2208.00775

  22. [22]

    M. Ren, X. Zhang, X. Zhi, Y. Wei, Z. Feng, An annotated street view image dataset for automated road damage detection, Scientific Data 11 (1) (Apr. 2024). doi:10.1038/s41597-024-03263-7. URL http://dx.doi.org/10.1038/s41597-024-03263-7

  23. [23]

    Z. Liu, W. Wu, X. Gu, B. Cui, Pavedistress: A comprehensive dataset of pavement distresses detection, Data in Brief 57 (2024) 111111. doi:10.1016/j.dib.2024.111111. URL https://www.sciencedirect.com/science/article/pii/S2352340924010734

  24. [25]

    B. A. Kyem, E. Denteh, J. K. Asamoah, A. Aboah, Pavecap: The first multimodal framework for comprehensive pavement condition assessment with dense captioning and pci estimation, arXiv preprint arXiv:2408.04110 (2024). URL https://api.semanticscholar.org/CorpusID:271768959

  25. [26]

    Y. Zhang, C. Liu, Vision-enhanced multi-modal learning framework for non-destructive pavement damage detection, Automation in Construction 177 (2025) 106389. doi:10.1016/j.autcon.2025.106389. URL https://www.sciencedirect.com/science/article/pii/S0926580525004297

  26. [27]

    X. Xiao, Y. Zhang, J. Wang, L. Zhao, Y. Wei, H. Li, Y. Li, X. Wang, S. K. Roy, H. Xu, T. Wang, Roadbench: A vision-language foundation model and benchmark for road damage understanding (2025). arXiv:2507.17353. URL https://arxiv.org/abs/2507.17353

  27. [28]

    Ç. F. Özgenel, Concrete crack images for classification (2019). doi:10.17632/5Y9WDSG2ZT.2. URL https://data.mendeley.com/datasets/5y9wdsg2zt/2

  28. [29]

    S. Dorafshan, R. Thomas, M. Maguire, Sdnet2018: An annotated image dataset for non-contact concrete crack detection using deep convolutional neural networks, Data in Brief 21 (2018). doi:10.1016/j.dib.2018.11.015

  29. [30]

    M. Eisenbach, R. Stricker, D. Seichter, K. Amende, K. Debes, M. Sesselmann, D. Ebersbach, U. Stoeckert, H.-M. Gross, How to get pavement distress detection ready for deep learning? A systematic approach, in: 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2039–2047. doi:10.1109/IJCNN.2017.7966101

  30. [31]

    R. Stricker, M. Eisenbach, M. Sesselmann, K. Debes, H.-M. Gross, Improving visual road condition assessment by extensive experiments on the extended gaps dataset, in: 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8. doi:10.1109/IJCNN.2019.8852257

  31. [32]

    H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, H. Omata, Road damage detection and classification using deep neural networks with smartphone images, Computer-Aided Civil and Infrastructure Engineering 33 (2018). doi:10.1111/mice.12387

  32. [33]

    D. Arya, H. Maeda, S. K. Ghosh, D. Toshniwal, Y. Sekimoto, Rdd2022: A multi-national image dataset for automatic road damage detection (2022). arXiv:2209.08538. URL https://arxiv.org/abs/2209.08538

  33. [34]

    H. Yang, J. Cao, J. Wan, Q. Gao, C. Liu, M. Fischer, Y. Du, Y. Li, P. Jain, D. Wu, A large-scale image repository for automated pavement distress analysis and degradation trend prediction, Scientific Data 12 (1) (Aug. 2025). doi:10.1038/s41597-025-05748-5. URL http://dx.doi.org/10.1038/s41597-025-05748-5

  34. [35]

    Q. Zou, Y. Cao, Q. Li, Q. Mao, S. Wang, Cracktree: Automatic crack detection from pavement images, Pattern Recognition Letters 33 (3) (2012) 227–238

  35. [36]

    Y. Shi, L. Cui, Z. Qi, F. Meng, Z. Chen, Automatic road crack detection using random structured forests, IEEE Transactions on Intelligent Transportation Systems 17 (12) (2016) 3434–3445

  36. [37]

    Q. Zou, Z. Zhang, Q. Li, X. Qi, Q. Wang, S. Wang, Deepcrack: Learning hierarchical convolutional features for crack detection, IEEE Transactions on Image Processing 28 (3) (2019) 1498–1512

  37. [38]

    S. Kulkarni, S. Singh, D. Balakrishnan, S. Sharma, S. Devunuri, S. C. R. Korlapati, Crackseg9k: A collection and benchmark for crack segmentation datasets and frameworks, in: European Conference on Computer Vision, Springer, 2022, pp. 179–195

  38. [39]

    S. Han, I.-H. Chung, Y. Jiang, B. Uwakweh, Pcier: Pavement condition evaluation using aerial imagery and deep learning, Geographies 3 (1) (2023) 132–142. doi:10.3390/geographies3010008. URL https://www.mdpi.com/2673-7086/3/1/8

  39. [40]

    Y. Adu-Gyamfi, B. Buttlar, E. Dave, D. Mensching, H. Majidifard, DSPS, https://dsps-1e998.web.app/data [Accessed 09-02-2025]

  40. [41]

    H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning (2023). arXiv:2304.08485. URL https://arxiv.org/abs/2304.08485

  41. [42]

    L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, D. Lin, Sharegpt4v: Improving large multi-modal models with better captions (2023). arXiv:2311.12793. URL https://arxiv.org/abs/2311.12793

  42. [43]

    L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu, X. Sun, L. Kong, Q. Liu, M3it: A large-scale dataset towards multi-modal multilingual instruction tuning, arXiv preprint arXiv:2306.04387 (2023)

  43. [44]

    M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, P. Rajpurkar, Med-flamingo: a multimodal medical few-shot learner, in: S. Hegselmann, A. Parziale, D. Shanmugam, S. Tang, M. N. Asiedu, S. Chang, T. Hartvigsen, H. Singh (Eds.), Proceedings of the 3rd Machine Learning for Health Symposium, Vol. 225 of Proceedings of M...

  44. [45]

    T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, M. F. Moens, Talk2car: Taking control of your self-driving car, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2088–2098

  45. [46]

    X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, H. Zhao, Drivevlm: The convergence of autonomous driving and large vision-language models (2024). arXiv:2402.12289. URL https://arxiv.org/abs/2402.12289

  46. [47]

    S. Lobry, D. Marcos, J. Murray, D. Tuia, Rsvqa: Visual question answering for remote sensing data, IEEE Transactions on Geoscience and Remote Sensing 58 (12) (2020) 8555–8566. doi:10.1109/TGRS.2020.2988782

  47. [48]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, A. Kalyan, Learn to explain: Multimodal reasoning via thought chains for science question answering, in: The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  48. [49]

    A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images (2016). arXiv:1603.07396

  49. [50]

    M. Mathew, D. Karatzas, C. V. Jawahar, Docvqa: A dataset for vqa on document images (2021). arXiv:2007.00398. URL https://arxiv.org/abs/2007.00398

  50. [51]

    M. Mathew, V. Bagal, R. P. Tito, D. Karatzas, E. Valveny, C. V. Jawahar, Infographicvqa (2021). arXiv:2104.12756. URL https://arxiv.org/abs/2104.12756

  51. [52]

    H. Majidifard, P. Jin, Y. Adu-Gyamfi, W. G. Buttlar, Pavement image datasets: A new benchmark dataset to classify and densify pavement distresses, Transportation Research Record 2674 (2) (2020) 328–339. doi:10.1177/0361198120907283. URL https://doi.org/10.1177/0361198120907283

  52. [53]

    J. Zhu, J. Zhong, T. Ma, X. Huang, W. Zhang, Y. Zhou, Pavement distress detection using convolutional neural networks with images captured via uav, Automation in Construction 133 (2022) 103991. doi:10.1016/j.autcon.2021.103991. URL https://www.sciencedirect.com/science/article/pii/S0926580521004428

  53. [54]

    H. Yan, J. Zhang, Uav-pdd2023: A benchmark dataset for pavement distress detection based on uav images, Data in Brief 51 (2023) 109692. doi:10.1016/j.dib.2023.109692. URL https://www.sciencedirect.com/science/article/pii/S2352340923007710

  54. [55]

    M. Sharif, G. Han, W. Liu, X. Huang, Cultivating multidisciplinary research and education on gpu infrastructure for mid-south institutions at the university of memphis: Practice and challenge (2025). arXiv:2504.14786. URL https://arxiv.org/abs/2504.14786

  55. [56]

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338

  56. [57]

    D. M. W. Powers, Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation, Journal of Machine Learning Technologies 2 (1) (2011) 37–63

  57. [58]

    V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady 10 (8) (1966) 707–710

  58. [59]

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-bench and chatbot arena, NeurIPS 2023 Datasets and Benchmarks Track (2023). arXiv:2306.05685. doi:10.48550/arXiv.2306.05685. URL https://arxiv.org/abs/2306.05685

  59. [60]

    K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. doi:10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040/

  60. [61]

    C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL https://aclanthology.org/W04-1013/

  61. [62]

    R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575. URL https://openaccess.thecvf.com/content_cvpr_2015/html/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.html