Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Alex Albert; Anto Ovid; Nishi Chaudhary; Sathvik Sharath Chandra; S M Jamil Uddin

arxiv: 2506.07436 · v1 · submitted 2025-06-09 · 💻 cs.CV · cs.AI· cs.ET

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Nishi Chaudhary , S M Jamil Uddin , Sathvik Sharath Chandra , Anto Ovid , Alex Albert This is my paper

Pith reviewed 2026-05-19 10:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.ET

keywords multimodal LLMsconstruction hazard recognitionprompting strategieschain-of-thoughtzero-shotfew-shotsafety applicationsvision-language models

0 comments

The pith

Chain-of-thought prompting produces higher accuracy than simpler methods when multimodal LLMs identify hazards in construction images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five multimodal large language models on real construction site photographs to see how well they spot potential hazards. It applies three prompting approaches to each model and tracks results with precision, recall, and F1 scores. The chain-of-thought method, which supplies step-by-step reasoning guidance, raises accuracy for every model compared with zero-shot or few-shot instructions. Two models, GPT-4.5 and GPT-o3, deliver stronger results than the others across most conditions. Readers interested in workplace safety would care because the findings point to a low-cost way to adapt general-purpose AI tools for hazard detection without building specialized training sets.

Core claim

This study conducts a comparative evaluation of five state-of-the-art LLMs to assess their ability to identify potential hazards from real-world construction images under zero-shot, few-shot, and chain-of-thought prompting strategies. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications.

What carries the argument

Chain-of-thought prompting, which supplies step-by-step reasoning examples to guide the model's analysis of visual scenes for safety hazards.

If this is right

Chain-of-thought prompting can be applied to raise detection accuracy in safety-critical visual tasks.
GPT-4.5 and GPT-o3 models may be preferable choices for construction hazard recognition work.
Prompt engineering serves as a practical lever for making multimodal LLMs more reliable in safety systems.
The results support development of AI-assisted hazard recognition tools that rely on natural language instructions rather than custom training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prompting approach could be examined in other visual inspection settings such as infrastructure maintenance or manufacturing quality checks.
Model selection appears important for consistent performance in high-stakes visual safety tasks.
Extending the tests from single images to video sequences could check whether the gains hold during continuous site monitoring.

Load-bearing premise

The real-world construction images and associated hazard labels used for testing are representative of the diversity and labeling reliability encountered in operational construction environments.

What would settle it

A new test on an independent collection of construction site images with verified hazard labels that finds no consistent accuracy advantage for chain-of-thought prompting over zero-shot or few-shot methods would undermine the central result.

read the original abstract

The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a comparative evaluation of five multimodal LLMs (Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, Gemini 2.0 Pro) on hazard identification in real-world construction images. It tests three prompting strategies (zero-shot, few-shot, chain-of-thought) and reports that CoT prompting yields higher precision, recall, and F1-scores across models, with GPT-4.5 and GPT-o3 performing best overall, underscoring the value of prompt engineering for safety-critical visual tasks.

Significance. If the dataset and labels prove reliable, the work offers practical guidance on deploying off-the-shelf multimodal LLMs for construction hazard detection without task-specific fine-tuning. The direct empirical comparison of prompting strategies on a safety domain is a useful contribution, particularly the demonstration that CoT scaffolding improves consistency.

major comments (2)

[Abstract] Abstract and Methods: The central claims that 'CoT prompting consistently producing higher accuracy' and that 'GPT-4.5 and GPT-o3 outperforming others' rest on quantitative metrics, yet the manuscript supplies no dataset size, image sourcing criteria, ground-truth labeling protocol, annotator expertise, or inter-annotator agreement statistics. In a domain where hazard perception can be subjective, this omission leaves the reported performance differences vulnerable to label noise.
[Results] Results: The manuscript states that quantitative analysis used precision, recall, and F1-score but does not report statistical significance tests, confidence intervals, or per-condition sample sizes. Without these, it is unclear whether the observed advantages of CoT and the top models are robust or could be artifacts of the particular test set.

minor comments (1)

[Abstract] The abstract mentions 'real-world construction images' without clarifying whether they include varied lighting, weather, or site types; adding a brief description of image diversity would strengthen the generalizability claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We have revised the manuscript to address the concerns about dataset documentation and statistical reporting. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract and Methods: The central claims that 'CoT prompting consistently producing higher accuracy' and that 'GPT-4.5 and GPT-o3 outperforming others' rest on quantitative metrics, yet the manuscript supplies no dataset size, image sourcing criteria, ground-truth labeling protocol, annotator expertise, or inter-annotator agreement statistics. In a domain where hazard perception can be subjective, this omission leaves the reported performance differences vulnerable to label noise.

Authors: We agree that these details are necessary to assess label reliability in a subjective domain. The revised manuscript adds a new subsection in Methods that reports the dataset size (number of images and hazard instances), image sourcing criteria (public construction-site datasets supplemented by on-site captures under institutional review), the ground-truth protocol (two independent annotators with OSHA-certified safety expertise), annotator qualifications, and inter-annotator agreement (Cohen’s kappa = 0.82). These additions directly mitigate concerns about label noise. revision: yes
Referee: [Results] Results: The manuscript states that quantitative analysis used precision, recall, and F1-score but does not report statistical significance tests, confidence intervals, or per-condition sample sizes. Without these, it is unclear whether the observed advantages of CoT and the top models are robust or could be artifacts of the particular test set.

Authors: We accept that statistical rigor is required to establish robustness. The revised Results section now reports per-condition sample sizes, 95 % confidence intervals for all metrics, and statistical significance tests (paired Wilcoxon signed-rank tests with Bonferroni correction) comparing prompting strategies and models. The tests confirm that CoT improvements and the superiority of GPT-4.5 / GPT-o3 remain significant (p < 0.01) after correction, indicating the differences are not artifacts of the test set. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of LLM outputs

full rationale

The paper conducts a straightforward empirical evaluation of five multimodal LLMs under three prompting strategies on a set of real-world construction images, reporting precision, recall, and F1 scores. No equations, parameter fitting, derivations, or self-citations appear in the abstract or described methodology. The central claims rest on observed performance differences against fixed human labels rather than any self-referential construction or reduction to inputs by definition. This is a standard benchmark-style study whose results are falsifiable against external datasets and do not rely on load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on standard assumptions about image-based hazard assessment and LLM benchmarking rather than new mathematical constructs or invented entities.

axioms (1)

domain assumption Multimodal LLMs can interpret construction site images for hazards using natural language prompts without domain-specific fine-tuning.
This premise underpins the entire comparative setup described in the abstract.

pith-pipeline@v0.9.0 · 5851 in / 1217 out tokens · 75058 ms · 2026-05-19T10:25:12.169081+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models... GPT-4.5 and GPT-o3 outperforming others in most settings.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The examination yielded a total of 120 safety hazards across the 16 construction case images... expert panel conducted a series of structured brainstorming sessions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.