Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

Xuezheng Chen; Zhengbo Zou

arxiv: 2508.11011 · v2 · pith:JPFAHTI7new · submitted 2025-08-14 · 💻 cs.CV


Pith reviewed 2026-05-18 22:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords construction safety inspection · vision language models · ConstructionSite 10k · zero-shot generalization · visual question answering · image captioning · visual grounding · benchmark dataset


The pith

Pre-trained vision-language models generalize to construction safety tasks in zero- and few-shot settings but still require additional training for real sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the ConstructionSite 10k dataset containing 10,000 annotated construction site images to support three linked tasks: image captioning, safety-rule violation visual question answering, and construction element visual grounding. Evaluation of current large pre-trained VLMs on this benchmark demonstrates that they already handle the tasks with notable success when given no or very few examples, indicating they can detect safety issues from images without task-specific pre-training. A reader would care because construction safety inspections remain manual, time-consuming, and inconsistent; automated support could reduce accidents if models prove reliable enough for deployment.

Core claim

The authors establish that current state-of-the-art large pre-trained VLMs exhibit notable generalization abilities when applied to construction safety inspection in zero-shot and few-shot settings, although additional training is still needed to make them applicable to actual construction sites. This conclusion rests on the new ConstructionSite 10k dataset that supplies 10,000 images with annotations for image captioning, safety rule violation VQA, and construction element visual grounding, allowing systematic testing of models on realistic site imagery.

What carries the argument

The ConstructionSite 10k dataset, a collection of 10,000 construction site images annotated for the three interconnected tasks of image captioning, safety-rule violation visual question answering, and construction element visual grounding.
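
To make the three linked annotation types concrete, here is a minimal sketch of what one per-image record might look like. The class names, field names, file path, and the pixel-coordinate box convention are all assumptions made for illustration; they are not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (assumed convention)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class SafetyViolation:
    """One safety-rule VQA answer: the violated rule, a reason, and evidence boxes."""
    rule: str
    reasoning: str
    boxes: List[BoundingBox] = field(default_factory=list)


@dataclass
class SiteImageAnnotation:
    """Hypothetical per-image record tying the three tasks together."""
    image_path: str
    caption: str = ""
    violations: List[SafetyViolation] = field(default_factory=list)
    elements: List[BoundingBox] = field(default_factory=list)

    def annotated_tasks(self) -> Set[str]:
        """Names of the tasks for which this record holds a non-empty annotation."""
        tasks = set()
        if self.caption:
            tasks.add("captioning")
        if self.violations:
            tasks.add("violation_vqa")
        if self.elements:
            tasks.add("visual_grounding")
        return tasks


# Made-up example record; none of these values come from the dataset.
record = SiteImageAnnotation(
    image_path="images/example_0001.jpg",
    caption="Two workers stand near an excavator at the edge of a trench.",
    violations=[SafetyViolation(rule="PPE", reasoning="A worker has no hard hat.")],
    elements=[BoundingBox("excavator", 40, 60, 420, 380)],
)
print(sorted(record.annotated_tasks()))
```

Keeping all three annotation types on one record reflects the "interconnected" framing: the same image can be captioned, queried for violations, and used for grounding without a join across files.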

If this is right

The dataset can serve as a public benchmark for training and testing new VLM architectures on construction safety tasks.
VLMs can already identify safety rule violations from site images without being trained directly on construction data.
Few-shot fine-tuning on the dataset offers a practical path to improve model accuracy for real-world use.
Automated image-based inspection could reduce the frequency of full manual walkthroughs while still requiring human oversight for final decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If models trained on this dataset transfer well, similar benchmarks could be built for safety inspection in manufacturing plants or transportation infrastructure.
Drone or fixed-camera systems could feed live images into these VLMs to flag violations in near real time, though integration with site-specific rules would still be required.
The gap between zero-shot performance and practical readiness suggests that hybrid systems combining pre-trained VLMs with lightweight site-specific adaptation layers may be the fastest route to usable tools.

Load-bearing premise

Performance on captioning, safety-rule VQA, and element grounding is enough to decide whether a VLM can function as an effective construction safety inspector.

What would settle it

Deploying the evaluated VLMs on an active construction site and measuring whether they miss safety violations that on-site human inspectors consistently identify.
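
One way such a field test could be scored is with a miss rate: the share of inspector-identified violations that the model failed to flag. The sketch below is only an illustration of that arithmetic; the function name and the violation identifiers are hypothetical, and it assumes violations found by inspectors and by the model can be matched by a shared identifier.

```python
from typing import Set


def miss_rate(human_found: Set[str], model_found: Set[str]) -> float:
    """Fraction of inspector-identified violations that the model did not flag."""
    if not human_found:
        raise ValueError("need at least one inspector-identified violation")
    missed = human_found - model_found
    return len(missed) / len(human_found)


# Hypothetical identifiers: inspectors found four violations, the model flagged
# two of them plus one extra item that the inspectors did not report.
inspector = {"v1", "v2", "v3", "v4"}
model = {"v1", "v3", "v9"}
print(miss_rate(inspector, model))  # 0.5
```

Extra model flags (such as `v9` here) do not affect the miss rate; a complete field study would also track them as potential false alarms.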

Figures

Figures reproduced from arXiv: 2508.11011 by Xuezheng Chen, Zhengbo Zou.

**Figure 1.** The high-level schematic diagram shows the three tasks employed in this paper. The VLM will receive the construction site image and prompts for three tasks: image captioning, safety rule violation VQA, and construction element visual grounding.

**Figure 2.** The figure presents construction site images from the dataset, each annotated with three labels arranged in a grid structure. For example, the bottom-left image is labeled as “sparse info”, “night”, and “short distance”. The images are uniformly scaled for demonstration purposes, resulting in slightly altered appearances. These images and labels aim to showcase representative samples and do not depict the …

**Figure 3.** Distribution of images in the dataset across four features. These statistics unveil the diversity of construction site imagery.

**Figure 4.** The time consumption breakdown to prepare the test set of the dataset. The times shown in the figure are approximate and include the time to create the annotation software.

**Figure 5.** The overall workflow of GPT-assisted image caption annotation.

**Figure 6.** Word count of prevalent terms across various topics used in the reference image captions. Synonyms and plural forms of nouns, such as “concrete truck” and “concrete mixers”, are consolidated. For verbs, the lemma is used to represent the lexeme.

Safety violation VQA (test set / training set): PPE 323 / 677; safety harness 25 / 59; edge protection 63 / 109; blind spot 24 / 46. Object detection (test set / training set): Excavato…

**Figure 7.** Workflow of the image captioning task. The image, system and user prompts are given to the VLMs as inputs, and the models generate an image caption as an output. The example prompts will be used to give examples to the VLMs in few-shot settings. The generated candidate caption, human-labeled reference caption or the image are then evaluated with automatic metrics.

**Figure 8.** A three-stage query and evaluation workflow for the safety rule VQA test. The images and prompts are given to the VLM in Stage 1. The VLMs generate choices, reasonings, and bounding boxes as outputs. If the VLM selects the correct violations in Stage 2, the reasonings and bounding boxes will be evaluated in Stage 3. The safety rules depicted in the figure are simplified for clarity, with solid font indicating r…

**Figure 9.** The figure displays examples of candidate captions generated by the GPT-4 model, alongside reference captions and their evaluations (all scores are in %) for the image captioning task. For all evaluation metrics except the one assessing the reference caption, higher scores indicate a “better” candidate caption. While the five types of image captioning metrics assign different scores to the same caption, th…

**Figure 10.** Examples of safety rule violation reasoning. The figure includes the candidate, reference reasoning, and the evaluations (GPT-4 Vision, LLaVA-13B).

**Figure 11.** The image displays visual grounding examples from the GPT model. Each row corresponds to a different object category: the first row shows excavators, the second row shows rebar detection, and the third row shows workers with white hard hats. The model excels in detecting larger objects but struggles with irregular shapes, such as rebar piles, and specific constraints, such as workers with white hard hats.
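
The gating described in the Figure 8 caption (reasoning and boxes are scored only when the Stage 2 choice is correct) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the function name is hypothetical, "correct" is assumed to mean an exact match between the predicted and reference sets of violated rules, and the two scorers are passed in as placeholder callables.

```python
from typing import Callable, Dict, Iterable, Optional


def evaluate_vqa_response(
    predicted_rules: Iterable[str],
    reference_rules: Iterable[str],
    score_reasoning: Callable[[], float],
    score_boxes: Callable[[], float],
) -> Dict[str, Optional[float]]:
    """Stage 2 gate followed by Stage 3 scoring for one image."""
    # Stage 2: did the model select the right violations? (exact-match assumption)
    if set(predicted_rules) != set(reference_rules):
        # Stage 3 is skipped entirely, so the scorers are never called.
        return {"choice_correct": 0.0, "reasoning": None, "boxes": None}
    # Stage 3: only now are the reasoning text and bounding boxes evaluated.
    return {
        "choice_correct": 1.0,
        "reasoning": score_reasoning(),
        "boxes": score_boxes(),
    }


# Placeholder scorers returning fixed values, for demonstration only.
result = evaluate_vqa_response(["PPE"], ["PPE"], lambda: 0.8, lambda: 0.6)
print(result)
```

Passing the scorers as callables keeps the gate explicit: an incorrect choice never triggers the more expensive reasoning or box comparison.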

Original abstract

Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a new 10k-image dataset for VLM tasks in construction safety, though the three annotation types may not fully stand in for real inspector work.


The paper introduces the ConstructionSite 10k dataset with 10,000 construction site images and annotations for three linked tasks: image captioning, safety rule violation VQA, and construction element visual grounding. That dataset is the clear new piece, since earlier work relied on much smaller supervised collections and left little room for zero-shot or few-shot testing in this domain. The authors run current VLMs on it in zero-shot and few-shot settings and note some generalization while saying more training is needed for actual sites. This gives researchers a concrete starting point they did not have before.

The release of a sizable, safety-focused benchmark with interconnected annotations is the part that holds up. It lets others train or evaluate models without building everything from scratch, and the choice of tasks at least tries to cover description, question answering, and localization in one package.

The softer spot is how well these tasks proxy actual construction safety inspection. Real work on sites usually involves sequences of observations, cross-referencing with plans or regulations, ranking risks, and coping with occlusion or change. The benchmark stays with single images and fixed question types, so results on it do not automatically translate to the claimed generalization or to the need for further training in deployment. The abstract's statement on notable generalization would carry more weight with explicit metrics, inter-annotator checks, or error breakdowns rather than a qualitative summary.

This is useful for applied computer vision groups or civil engineering researchers who need domain-specific VLM benchmarks. A reader looking for ready data to test or fine-tune models in safety contexts would get direct value from it. The dataset contribution is substantial enough to merit peer review, even if the evaluation section needs tightening on how the tasks map to field use.

I would send it to referees and ask them to check the quantitative results and the task-to-application link.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the ConstructionSite 10k dataset of 10,000 construction-site images annotated for three interconnected tasks: image captioning, safety-rule violation visual question answering (VQA), and construction-element visual grounding. It evaluates several state-of-the-art large pre-trained vision-language models in zero-shot and few-shot regimes, asserts that these models exhibit notable generalization on the proposed tasks, and concludes that further training is required before the models can be deployed as practical construction safety inspectors. The dataset is presented as an open benchmark to support training and evaluation of new VLM architectures for this domain.

Significance. If the quantitative evaluation is strengthened and the task proxies are shown to be adequate, the work would be a useful contribution by releasing the first sizable open dataset for VLM-based construction safety inspection. The empirical demonstration of zero- and few-shot behavior on captioning, VQA, and grounding tasks provides a concrete starting point for the community; credit is given for the public release of the annotated corpus, which directly addresses the data-scarcity problem noted in the abstract.

major comments (2)

[Abstract] The claim of 'notable generalization abilities in zero-shot and few-shot settings' is presented without any quantitative metrics, error bars, or statistical tests, leaving the central evaluation claim unsupported by numbers.
[Abstract, paragraph 3] The three tasks (captioning, safety-rule VQA, element grounding) are treated as sufficient to assess whether a VLM can function as an effective construction safety inspector, yet the paper does not demonstrate how these single-image tasks capture sequential observation, integration with site plans or regulations, risk prioritization, or handling of dynamic/occluded scenes that characterize real inspections.

minor comments (1)

[Section 3] The annotation protocol would benefit from explicit reporting of inter-annotator agreement scores or quality-control procedures for the VQA and grounding labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, clarifying the scope of our claims and indicating revisions where they strengthen the manuscript without altering its core contributions.


Referee: [Abstract] The claim of 'notable generalization abilities in zero-shot and few-shot settings' is presented without any quantitative metrics, error bars, or statistical tests, leaving the central evaluation claim unsupported by numbers.

Authors: We agree that the abstract's phrasing would be strengthened by direct reference to quantitative results. The full paper reports concrete metrics in Section 4 (e.g., BLEU-4 and CIDEr for captioning, accuracy for VQA, and IoU for grounding across zero- and few-shot regimes), including comparisons against random and supervised baselines. In the revised manuscript we have added a concise sentence to the abstract summarizing the key observed improvements (e.g., X-point gains in few-shot VQA accuracy) while preserving brevity; error bars and statistical details remain in the experimental section as is conventional for abstracts. revision: yes

Referee: [Abstract, paragraph 3] The three tasks (captioning, safety-rule VQA, element grounding) are treated as sufficient to assess whether a VLM can function as an effective construction safety inspector, yet the paper does not demonstrate how these single-image tasks capture sequential observation, integration with site plans or regulations, risk prioritization, or handling of dynamic/occluded scenes that characterize real inspections.

Authors: We do not claim that the three single-image tasks fully replicate operational inspections. The manuscript explicitly frames them as interconnected benchmark tasks that probe core VLM capabilities (description, violation detection, localization) and states in both the introduction and conclusion that real deployment would require further training, multi-view reasoning, and integration with site documentation. To address the referee's concern we have added a new paragraph in the revised discussion section that maps each task to specific inspection elements while openly listing the missing aspects (sequential observation, dynamic scenes, risk prioritization) as directions for future work. revision: partial
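
The simulated rebuttal names IoU as the grounding metric. For readers unfamiliar with it, the sketch below computes the standard intersection-over-union of two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` tuple layout and the function name are assumptions for illustration; the paper's exact box format and any matching threshold are not given in this review.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 4 + 4 - 1 = 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Because the union grows with any misplaced area, IoU penalizes both boxes that are too large and boxes that are offset, which is consistent with the Figure 11 remark that irregular shapes such as rebar piles are harder to ground.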

Circularity Check

0 steps flagged

No circularity: dataset creation plus evaluation of external pre-trained models


The paper introduces the ConstructionSite 10k dataset with annotations for captioning, safety-rule VQA, and element grounding, then reports zero-shot and few-shot results of existing large VLMs on those tasks. No equations, fitted parameters, or internal predictions appear; the central claims rest on empirical performance numbers obtained from models whose weights and training data are external to the paper. The three tasks are presented as a benchmark rather than derived from any self-referential construction, and no self-citation chain is invoked to justify uniqueness or force a result. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the creation of a new annotated dataset and the domain assumption that pre-trained VLMs possess transferable visual-linguistic knowledge applicable to construction safety scenes.

axioms (1)

Domain assumption: Pre-trained VLMs exhibit meaningful zero-shot and few-shot generalization on previously unseen domain-specific visual question answering and grounding tasks.
Invoked when the abstract interprets model performance on the new dataset as evidence of generalization abilities.

pith-pipeline@v0.9.0 · 5706 in / 1277 out tokens · 55765 ms · 2026-05-18T22:27:29.248449+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

General Hazard Detection
cs.CV · 2026-05 · unverdicted · novelty 5.0

Introduces CompliVision dataset and active learning framework for rule-based hazard compliance assessment using vision-language models grounded in safety standards.

Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
cs.CV · 2026-04 · unverdicted · novelty 4.0

Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.