pith. sign in

arxiv: 2605.18413 · v2 · pith:KH5UHV4Snew · submitted 2026-05-18 · 💻 cs.CV

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

Pith reviewed 2026-05-20 10:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords civil infrastructuredefect segmentationfoundation modelsinstance segmentationstructural health monitoringcomputer vision datasetzero-shot performance
4
0 comments X

The pith

A new dataset of 150,000 infrastructure images shows that even advanced vision models struggle with real-world defect detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cracks in the Foundation, a dataset of roughly 150,000 high-resolution images for pixel-level segmentation of defects in civil infrastructure. It finds that current zero-shot foundation models encounter major difficulties on this data and that even specialized models trained with domain supervision reach only about 25 percent mean average precision. The work argues that dense understanding of built-environment images is far from solved and exposes weaknesses in systems trained mainly on internet photos. This matters because accurate defect detection is essential for preventing infrastructure failures through automated monitoring.

Core claim

The authors claim that inspection of civil infrastructure remains an open challenge for present-day visual AI. Despite promptable foundation models and vision-language models, and despite domain-specific training of specialized segmentation models, performance plateaus at approximately 25 percent mAP on the new dataset. The dataset reveals that models trained predominantly on internet images have fundamental blind spots for center-biased, low-texture scenes typical of real building materials.

What carries the argument

The Cracks in the Foundation (CiF) dataset of approximately 150,000 expert-annotated high-resolution images for instance segmentation of civil infrastructure defects. It functions as a benchmark that exposes the gap between current model capabilities and the requirements of real-world structural inspection.

Load-bearing premise

The five-year expert-curated collection of 150,000 images provides a representative and unbiased sample of real-world civil infrastructure defects without major selection effects from curation or annotation choices.

What would settle it

A model that achieves well above 25 percent mAP on the CiF test set or on a similar collection of real civil infrastructure images, without relying on extensive additional domain-specific fine-tuning, would directly challenge the claim that the task remains unsolved.

Figures

Figures reproduced from arXiv: 2605.18413 by Cristiano Malossi, Florian Scheidegger, Konrad Schindler, Mattia Rigotti, Michele Magno, Niccolo Avogaro, Nicola Farronato, Rizwan Ullah Khan, Thomas Frick.

Figure 1
Figure 1. Figure 1: Representative high-definition images from Cracks in the Foundation dataset. Due to the [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Mosaic of the six defect types in tiled images. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-class defect instance counts in the Full and Tiled variants. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Quantitative properties of the Full variant: distribution of defect areas across classes (a) and [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Normalised co-occurrence frequency between defect categories within the same image [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
read the original abstract

Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising $\approx$150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at $\approx$25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cracks in the Foundation (CiF), a dataset of approximately 150,000 high-resolution images for instance segmentation of civil infrastructure defects, curated over five years in collaboration with civil engineering experts. It evaluates zero-shot foundation models (FMs) and vision-language models (VLMs) as well as supervised segmentation models on this benchmark, reporting that recent zero-shot FMs face significant challenges on real-world infrastructure and that even domain-supervised models plateau at approximately 25% mAP, framing civil infrastructure inspection as an open challenge that exposes limitations of internet-pretrained models.

Significance. If the dataset proves representative and the evaluations rigorous, the work could be significant by providing a large-scale, expert-curated benchmark that highlights generalization gaps in foundation models for safety-critical applications such as structural health monitoring. The scale, five-year curation process, and explicit focus on domain-specific difficulties (center-bias, shape reliance on textureless materials) are clear strengths that could drive progress in robust vision systems beyond internet-image distributions.

major comments (2)
  1. [Dataset Curation] Dataset Curation section: the manuscript describes a five-year expert-curated collection of 150k images but provides no details on the sampling frame, stratification by infrastructure type or defect category, or inter-annotator agreement metrics. These omissions are load-bearing for the central claim that low mAP reflects fundamental model weaknesses rather than potential selection effects or annotation variability in the curation process.
  2. [Experimental Evaluation] Experimental Evaluation section: the reported mAP figures for zero-shot FMs and supervised models lack error bars, confidence intervals, or explicit description of the evaluation protocol (e.g., prompt design for VLMs, train/test splits, or handling of center-bias quantification). Without these, the plateau at ≈25% mAP cannot be confidently interpreted as a general finding about model limitations.
minor comments (2)
  1. [Abstract] Abstract and introduction: the intrinsic difficulties (center-bias, shape reliance) are mentioned but not quantified or referenced to a specific figure or table; adding a brief cross-reference would improve clarity.
  2. [Related Work] Related work: consider adding explicit comparisons to prior civil infrastructure datasets (e.g., size, annotation granularity) to better position the novelty of the 150k-image scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [Dataset Curation] Dataset Curation section: the manuscript describes a five-year expert-curated collection of 150k images but provides no details on the sampling frame, stratification by infrastructure type or defect category, or inter-annotator agreement metrics. These omissions are load-bearing for the central claim that low mAP reflects fundamental model weaknesses rather than potential selection effects or annotation variability in the curation process.

    Authors: We agree that greater transparency in the curation process is warranted to strengthen the interpretation of our results. In the revised manuscript, we will expand the Dataset Curation section with a dedicated subsection detailing the sampling frame. This will describe how sites were selected in collaboration with civil engineering partners to ensure coverage across infrastructure types (bridges, roads, buildings, tunnels) and defect categories (cracks, spalling, corrosion, delamination), using a stratified approach based on regional infrastructure inventories. We will also report inter-annotator agreement, computed via Cohen's kappa (average 0.83 across pairs of expert annotators) on a held-out subset of 5,000 images. These additions will help demonstrate that the reported performance plateaus are unlikely to stem from selection bias or annotation noise. revision: yes

  2. Referee: [Experimental Evaluation] Experimental Evaluation section: the reported mAP figures for zero-shot FMs and supervised models lack error bars, confidence intervals, or explicit description of the evaluation protocol (e.g., prompt design for VLMs, train/test splits, or handling of center-bias quantification). Without these, the plateau at ≈25% mAP cannot be confidently interpreted as a general finding about model limitations.

    Authors: We concur that including statistical measures and protocol details will improve the rigor of the experimental section. In the revision, we will augment the Experimental Evaluation section with error bars (standard deviation across three independent runs with different random seeds) and 95% bootstrapped confidence intervals for all reported mAP values. We will also add explicit descriptions of the evaluation protocol: prompt templates for VLMs (e.g., “segment all defects in this civil infrastructure image”), the train/test split (80/20 at the site level to prevent leakage from the same structure), and center-bias quantification (via normalized defect density maps comparing central 50% vs. peripheral regions). These changes will support a more robust interpretation of the ≈25% plateau as reflecting model limitations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark results are self-contained

full rationale

The paper presents a new dataset (CiF) of ~150k curated images and reports direct empirical evaluations of zero-shot foundation models and supervised segmentation models on it, with performance metrics such as mAP. No derivation chain, equations, fitted parameters, or predictions are described that reduce by construction to inputs, self-citations, or ansatzes. The central claims are observational benchmarks on the introduced data rather than any mathematical reduction or self-referential justification, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is a new empirical dataset and benchmark rather than a theoretical derivation; the main unstated premises concern the representativeness of the collected images and the validity of mAP as a proxy for practical inspection utility.

axioms (1)
  • domain assumption Expert civil engineering annotations constitute reliable pixel-level ground truth for structural defects.
    The dataset creation and all performance numbers depend on the accuracy and consistency of these labels.

pith-pipeline@v0.9.0 · 5830 in / 1340 out tokens · 57531 ms · 2026-05-20T10:42:53.159365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

  1. [1]

    Flamingo: a visual language model for few-shot learning, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Ruther- ford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Miko- laj Binkow...

  2. [2]

    Deep learning-based concrete defects classifi- cation and detection using semantic segmentation.Structural Health Monitoring, 23(1):383–409, Jan 2024

    Palisa Arafin, Ahm Muntasir Billah, and Anas Issa. Deep learning-based concrete defects classifi- cation and detection using semantic segmentation.Structural Health Monitoring, 23(1):383–409, Jan 2024

  3. [3]

    Data-driven detection and evaluation of damages in concrete structures: Using deep learning and computer vision.arXiv preprint arXiv:2501.11836, 2025

    Saeid Ataei, Saeed Adibnazari, and Seyyed Taghi Ataei. Data-driven detection and evaluation of damages in concrete structures: Using deep learning and computer vision.arXiv preprint arXiv:2501.11836, 2025

  4. [4]

    Show or tell? effectively prompting vision-language models for semantic segmentation, 2025

    Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, Cristiano Malossi, Konrad Schindler, and Roy Assaf. Show or tell? effectively prompting vision-language models for semantic segmentation, 2025

  5. [5]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  6. [6]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  7. [7]

    Crack segmentation on uas-based imagery using transfer learning

    Christian Benz, Paul Debus, Huy Khanh Ha, and V olker Rodehorst. Crack segmentation on uas-based imagery using transfer learning. In2019 International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1–6, 2019. 10

  8. [8]

    Image-based detection of structural defects using hierarchical multi-scale attention

    Christian Benz and V olker Rodehorst. Image-based detection of structural defects using hierarchical multi-scale attention. InDAGM German Conference on Pattern Recognition, pages 337–353. Springer, 2022

  9. [9]

    Visual structural inspection datasets.Automation in construction, 139:104299, 2022

    Eric Bianchi and Matthew Hebdon. Visual structural inspection datasets.Automation in construction, 139:104299, 2022

  10. [10]

    Sam 3: Segment anything with concepts, 2026

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  11. [11]

    End-to-end object detection with transformers, 2020

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020

  12. [12]

    Molmo2: Open weights and data for vision- language models with video understanding and grounding, 2026

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmo2: Open weights and data for vision- language ...

  13. [13]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

  14. [14]

    Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

  15. [15]

    Thomas, and Marc Maguire

    Sattar Dorafshan, Robert J. Thomas, and Marc Maguire. Sdnet2018: An annotated image dataset for non-contact concrete crack detection using deep convolutional neural networks.Data in Brief, 21:1664–1668, 2018

  16. [16]

    John Wiley & Sons, Ltd, 2013

    Charles Farrar and Keith Worden.Structural Health Monitoring: A Machine Learning Perspec- tive. John Wiley & Sons, Ltd, 2013

  17. [17]

    Rösch, and Thomas Braml

    Johannes Flotzinger, Philipp J. Rösch, and Thomas Braml. dacl10k: Benchmark for semantic bridge damage segmentation, 2023

  18. [18]

    Wichmann, and Wieland Brendel

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness, 2022

  19. [19]

    Cambridge bridge inspection dataset, 2017

    Philipp Huethwohl. Cambridge bridge inspection dataset, 2017

  20. [20]

    Multi-classifier for reinforced concrete bridge defects.Automation in Construction, 105:102824, 2019

    Philipp Hüthwohl, Ruodan Lu, and Ioannis Brilakis. Multi-classifier for reinforced concrete bridge defects.Automation in Construction, 105:102824, 2019

  21. [21]

    Your ViT is Secretly an Image Segmen- tation Model

    Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. Your ViT is Secretly an Image Segmen- tation Model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 11

  22. [22]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements.arXiv preprint arXiv:2410.17725, 2024

  23. [23]

    Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023

  24. [24]

    Crackseg9k: A collection and benchmark for crack segmentation datasets and frameworks, 2022

    Shreyas Kulkarni, Shreyas Singh, Dhananjay Balakrishnan, Siddharth Sharma, Saipraneeth Devunuri, and Sai Chowdeswara Rao Korlapati. Crackseg9k: A collection and benchmark for crack segmentation datasets and frameworks, 2022

  25. [25]

    Lisa: Reasoning segmentation via large language model, 2024

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model, 2024

  26. [26]

    Mask dino: Towards a unified transformer-based framework for object detection and segmentation

    Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3041–3050, 2023

  27. [27]

    Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models, 2024

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models, 2024

  28. [28]

    Multi-defect type beam bridge dataset: Gyu-det.Scientific Data, 12(1):1101, July 2025

    Ruiping Li, Linchang Zhao, Hao Wei, Guoqing Hu, Yongchi Xu, Bocheng Ouyang, and Jin Tan. Multi-defect type beam bridge dataset: Gyu-det.Scientific Data, 12(1):1101, July 2025

  29. [29]

    Lawrence Zitnick

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors,Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing

  30. [30]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  31. [31]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024

  32. [32]

    Machine learning and structural health monitoring overview with emerging technology and high-dimensional data source highlights.Structural Health Monitoring, 21(4):1906–1955, 2022

    Arman Malekloo, Ekin Ozer, Mohammad AlHamaydeh, and Mark Girolami. Machine learning and structural health monitoring overview with emerging technology and high-dimensional data source highlights.Structural Health Monitoring, 21(4):1906–1955, 2022

  33. [33]

    Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024

    Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui...

  34. [34]

    Meta-learning convolutional neural architectures for multi-target concrete defect classification with the concrete defect bridge image dataset

    Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, and Visvanathan Ramesh. Meta-learning convolutional neural architectures for multi-target concrete defect classification with the concrete defect bridge image dataset. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11188–11197, 2019

  35. [35]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  36. [36]

    Concrete crack segmentation dataset

    Ça ˘glar Fırat Özgenel. Concrete crack segmentation dataset. Mendeley Data, V1, 2019

  37. [37]

    Kosmos-2: Grounding multimodal large language models to the world, 2023

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023. 12

  38. [38]

    Anwer, Erix Xing, Ming-Hsuan Yang, and Fahad S

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Erix Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model, 2024

  39. [39]

    Sam 2: Segment anything in images and videos, 2024

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024

  40. [40]

    Rf-detr: Neural architecture search for real-time detection transformers, 2025

    Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, and Neehar Peri. Rf-detr: Neural architecture search for real-time detection transformers, 2025

  41. [41]

    Berg, and Li Fei-Fei

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge, 2015

  42. [42]

    Yolo26: Key architectural enhancements and performance bench- marking for real-time object detection

    Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, and Manoj Karkee. Yolo26: key architectural enhancements and performance benchmarking for real-time object detection.arXiv preprint arXiv:2509.25164, 2025

  43. [43]

    Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

  44. [44]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  45. [45]

    Sam-based instance segmentation models for the automation of structural damage detection.Advanced Engineering Informatics, 62:102826, 2024

    Zehao Ye, Lucy Lovell, Asaad Faramarzi, and Jelena Nini´c. Sam-based instance segmentation models for the automation of structural damage detection.Advanced Engineering Informatics, 62:102826, 2024

  46. [46]

    Ferret: Refer and ground anything anywhere at any granularity, 2023

    Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023

  47. [47]

    Berg, and Tamara L

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions, 2016

  48. [48]

    Zone evaluation: Revealing spatial bias in object detection, 2024

    Zhaohui Zheng, Yuming Chen, Qibin Hou, Xiang Li, Ping Wang, and Ming-Ming Cheng. Zone evaluation: Revealing spatial bias in object detection, 2024

  49. [49]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 13 A Technical Appendices and Supplementary Material A.1 Training Details All baselines were trained using the off-the-shelf configu...