Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

arxiv: 2605.19869 · v1 · pith:D7I2N6EGnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Ananth Sriram , Neel Mokaria , Rajveer Singh This is my paper

Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords construction safety monitoringvision-language modelsprompt engineeringadversarial chain-of-thoughtPPE detectionOSHA standardshallucination mitigationvideo analysis

0 comments p. Extension

pith:D7I2N6EG Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{D7I2N6EG}

Prints a linked pith:D7I2N6EG badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A persona-scaffolded adversarial prompting method boosts precision in construction safety violation detection by 12%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a passive monitoring system for construction site safety that analyzes footage from body-worn and fixed cameras at the end of a shift. The system first detects personal protective equipment and hazards, refines the detections with segmentation, and then uses a vision-language model to verify compliance through a three-pass adversarial process. The core innovation lies in the prompt design that assigns professional personas to the model and structures the reasoning into generator, discriminator, and reconciliation steps with different rules. In an informal review on twelve development videos, this method achieved a twelve percent higher precision than basic prompting, particularly for categories where models often invent nonexistent violations. Such automation could help address the high rate of preventable worker injuries in construction by providing detailed reports without requiring live human operators.

Core claim

The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories.

What carries the argument

The persona-scaffolded adversarial chain-of-thought protocol that uses method-actor framing with professional backstories to enforce observational independence between generator, discriminator, and reconciliation passes via structural message isolation and asymmetric rules.

If this is right

The system maps violations to specific OSHA standards.
It performs REBA-inspired ergonomic risk scoring from pose keypoints.
Per-worker safety reports are produced with timestamped evidence.
The pipeline enables passive end-of-shift monitoring from POV and fixed cameras without real-time operators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the prompting technique generalizes, it could improve hallucination control in other VLM-based scene analysis tasks such as traffic or inventory monitoring.
Evaluating the method on larger datasets from multiple construction sites would test whether the precision gains persist beyond the initial development corpus.
The adversarial structure might serve as a general template for enhancing reliability in automated compliance checking systems.

Load-bearing premise

The 12% precision improvement from the persona-scaffolded adversarial protocol on the 12-video development corpus is due to the prompting method rather than video selection or reviewer expectations and will generalize to new sites and larger datasets.

What would settle it

Measuring the precision of the persona-based three-pass protocol versus single-pass prompting on a new, independent set of construction site videos from a different location to see if the 12% improvement holds.

Figures

Figures reproduced from arXiv: 2605.19869 by Ananth Sriram, Neel Mokaria, Rajveer Singh.

**Figure 2.** Figure 2: PPE detection results. (a) Worker build [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Posture analysis results. (a) Worker seg [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Pass 1 system prompt. The generator receives raw video only; no machine detection data or [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Pass 2 system prompt. The discriminator receives raw video, annotated frames, and YOLO [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Pass 3 reconciliation rules (user template, verbatim). Pass 3 omits the raw video and operates [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The 12% precision lift from persona-scaffolded adversarial CoT is reported from an informal three-author review of 12 videos with no blinding or held-out testing, so the central claim needs more evidence before it can be taken as settled.

read the letter

The main point is that the authors built an end-to-end pipeline for passive construction safety checks that runs YOLO11 for initial detection, SAM for segmentation and deduplication, and Qwen3-VL with a three-pass adversarial prompting scheme that assigns professional personas to generator, discriminator, and reconciliation roles. They map outputs to OSHA rules and add REBA-style ergonomic scoring from pose data, then release an evaluation harness. That combination is new enough in the safety-monitoring literature to be worth noting, and the structural isolation between passes is a practical way to limit information leakage in the VLM stage. The domain focus on a high-fatality industry is also straightforward and useful. The evaluation is the clear weak point. The reported 12% precision gain over single-pass prompting, especially on hallucination-prone categories, comes from an informal three-author review of the 12-video development corpus. No blinding, no inter-rater reliability numbers, no statistical tests, and no separate test set are described, so it is hard to rule out reviewer expectations or corpus-specific effects. The paper does not claim formal statistical validation, which keeps the claim modest but also limits how far the result can be trusted yet. Readers working on applied vision systems for industrial compliance or on prompt-engineering techniques for VLMs in verification tasks would get the most out of it. The architecture is concrete enough that someone with similar camera setups could try the prompting pattern and see whether it helps in their own setting. The work deserves a serious referee because the problem is real and the system description is detailed enough to reproduce and extend, even if the current results section will need expansion. I would send it for review with the main request being a stronger evaluation on larger, held-out data with proper controls.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a passive, end-of-shift construction safety monitoring pipeline that ingests video from POV body-worn and fixed cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct employing a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification. The principal contribution is the Stage 3 prompt design, which is reported to produce a 12% precision improvement over single-pass prompting (largest on hallucination-prone violation categories) based on an informal three-author review of the 12-video Ironsite development corpus. The system maps violations to OSHA standards, computes REBA-inspired ergonomic risk scores from pose keypoints, generates per-worker reports with timestamped evidence, and releases an evaluation harness.

Significance. If the reported precision gains are substantiated by rigorous, blinded evaluation on held-out data, the pipeline could provide a scalable, low-cost complement to existing construction safety monitoring, addressing the high rate of preventable fatalities in the sector. The release of the evaluation harness is a concrete strength that supports reproducibility and community follow-up.

major comments (1)

[Abstract] Abstract (principal contribution paragraph): The central empirical claim of a 12% precision improvement from the persona-scaffolded adversarial CoT protocol is based solely on an informal three-author review of 12 videos in the development corpus. No statistical tests, inter-rater reliability, blinding, formal baselines, ablation isolating the persona component from the three-pass structure, or evaluation on an independent test set are described. This makes it impossible to attribute the observed difference to the method rather than reviewer expectations or corpus-specific effects, directly undermining the principal contribution.

minor comments (1)

[General] The manuscript would benefit from explicit section headings and numbered subsections for the three pipeline stages to improve navigation and reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We agree that the evaluation of the persona-scaffolded adversarial chain-of-thought protocol requires more rigorous validation to strengthen the principal contribution. We will revise the paper accordingly.

read point-by-point responses

Referee: [Abstract] Abstract (principal contribution paragraph): The central empirical claim of a 12% precision improvement from the persona-scaffolded adversarial CoT protocol is based solely on an informal three-author review of 12 videos in the development corpus. No statistical tests, inter-rater reliability, blinding, formal baselines, ablation isolating the persona component from the three-pass structure, or evaluation on an independent test set are described. This makes it impossible to attribute the observed difference to the method rather than reviewer expectations or corpus-specific effects, directly undermining the principal contribution.

Authors: We fully acknowledge the validity of this critique. The current evaluation is indeed informal and lacks the rigorous elements mentioned, which limits the generalizability of the 12% precision gain claim. In the revised manuscript, we will perform a formal evaluation on a held-out test set consisting of additional videos not used in development. This will include blinded review by independent raters, computation of inter-rater reliability (e.g., Cohen's kappa), statistical significance testing (e.g., paired t-test or McNemar's test for precision differences), and ablations to isolate the contribution of the persona scaffolding versus the three-pass adversarial structure. We will update the abstract to reflect these changes and add a new subsection detailing the evaluation protocol, results, and limitations. The evaluation harness will be extended to support this blinded setup. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation from informal review stands independent of inputs

full rationale

The paper presents a three-stage engineering pipeline whose principal claim is an observed 12% precision lift from persona-scaffolded adversarial CoT prompting, measured via informal three-author review on the 12-video development corpus. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear. The reported gain is framed as an empirical result rather than a quantity forced by construction from the prompt design itself. No self-citations load-bear the central claim, and the system description remains self-contained without reducing any output to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions about off-the-shelf vision models and introduces no new physical entities or mathematical axioms; the prompting protocol is the main added component.

axioms (2)

domain assumption Fine-tuned YOLO11 and SAM 3 produce sufficiently accurate initial detections and segmentations for downstream VLM verification.
Invoked in the description of stages 1 and 2 without reported error rates on the target domain.
ad hoc to paper The informal three-author review provides a reliable signal of real-world precision improvement.
Central to the 12% claim but not justified with formal evaluation protocols.

pith-pipeline@v0.9.0 · 5795 in / 1545 out tokens · 52521 ms · 2026-05-20T05:32:18.753188+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The principal methodological contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement...
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three-pass adversarial chain-of-thought protocol... structural message isolation enforces observational independence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

[1]

Bureau of Labor Statistics.Census of Fatal Oc- cupational Injuries Summary, 2023. U.S. Depart- ment of Labor, 2024.https://www.bls.gov/news. release/archives/cfoi_12192024.htm

work page 2023
[2]

Top 10 Most Frequently Cited Standards for Fis- cal Year 2024

Occupational Safety and Health Administration. Top 10 Most Frequently Cited Standards for Fis- cal Year 2024. U.S. Department of Labor, 2024. https://www.osha.gov/top10citedstandards/

work page 2024
[3]

Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. Rose, and W. An. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Automation in Construction, 85:1–9, 2018

work page 2018
[4]

N. D. Nath, A. H. Behzadan, and S. G. Paal. Deep learning for site safety: Real-time detection of per- sonal protective equipment.Automation in Con- struction, 112:103085, 2020

work page 2020
[5]

Hignett and L

S. Hignett and L. McAtamney. Rapid Entire Body Assessment (REBA).Applied Ergonomics, 31(2):201–205, 2000

work page 2000
[6]

X. Yan, H. Li, A. R. Li, and H. Zhang. Wear- able IMU-based real-time motion warning system for construction workers’ musculoskeletal disorders pre- vention.Automation in Construction, 74:2–11, 2017

work page 2017
[7]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self- consistency improves chain of thought reasoning in language models. InProc. ICLR, 2023

work page 2023
[8]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rol- land, L. Gustafson, T. Xiao, S. Whitehead, A. Berg, W.-Y. Lo, P. Doll´ ar, and R. Girshick. Segment Any- thing. InProc. ICCV, 2023

work page 2023
[9]

¨Onal and E

O. ¨Onal and E. Dandıl. A video dataset for safe/unsafe worker behaviour classification in pro- duction environments.BMC Research Notes, 17:234, 2024

work page 2024
[10]

CWPV: A Working Postures of the Construc- tion Working Postures Videos dataset.Figshare, 2024.https://figshare.com/articles/dataset/ 27907818

work page 2024
[11]

C. Doyle. LLMs as Method Actors: A Model for Prompt Engineering and Architecture.arXiv preprint arXiv:2411.05778, 2024.https://doi. org/10.48550/arXiv.2411.05778

work page doi:10.48550/arxiv.2411.05778 2024
[12]

Sanjeewani, G

P. Sanjeewani, G. Neuber, J. Fitzgerald, N. Chan- drasena, S. Potums, A. Alavi, and C. Lane. Real- time personal protective equipment non-compliance recognition on AI edge cameras.Electronics, 13(15):2990, 2024

work page 2024
[13]

W. Zhao, L. Wang, Y. Li, X. Liu, Y. Zhang, B. Yan, and H. Li. A multi-scale and multi-stage human pose recognition method based on convolutional neural networks for non-wearable ergonomic evaluation. Processes, 12(11):2419, 2024

work page 2024
[14]

J. Geng, J. Zhao, and S. Li. Optimizing helmet use detection in construction sites via fuzzy logic-based state tracking.Sensors, 25(2):456, 2025

work page 2025
[15]

Aharon, R

N. Aharon, R. Orfaig, and B. Z. Bobrovsky. BoT- SORT: Robust associations multi-pedestrian track- ing.arXiv preprint arXiv:2206.14651, 2022

work page arXiv 2022
[16]

Z. Bai, P. Chen, C. Fu, S. Yin, and E. Chen. Hal- lucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Welch, E

R. Welch, E. Konuk, and K. Smith. The cost of reasoning: Chain-of-thought induces overcon- fidence in vision-language models.arXiv preprint arXiv:2603.16728, 2025

work page arXiv 2025
[18]

K. Kim, Y. Cho, and S. Zhang. Integrated safety monitoring system using RFID and computer vision for construction sites.Automation in Construction, 66:11–21, 2016

work page 2016
[19]

Awolusi, E

I. Awolusi, E. Marks, and M. Solis. Wearable tech- nology for personalized construction safety monitor- ing and health assessment.Automation in Construc- tion, 91:235–250, 2018

work page 2018
[20]

SAM 2: Segment Anything in Images and Videos

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khabsa, R. R¨ ohrbach, A. El-Nouby, A. Shiekh, W. Cheng, M. Saraf, A. Morcos, M. Paluri, and C. Feichtenhofer. SAM 2: Seg- ment Anything in Images and Videos.arXiv preprint arXiv:2408.00714, 2024. 11 A Supplementary Material: VLM System Prompts The full system prompts used in the three-pass adversar...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Bureau of Labor Statistics.Census of Fatal Oc- cupational Injuries Summary, 2023. U.S. Depart- ment of Labor, 2024.https://www.bls.gov/news. release/archives/cfoi_12192024.htm

work page 2023

[2] [2]

Top 10 Most Frequently Cited Standards for Fis- cal Year 2024

Occupational Safety and Health Administration. Top 10 Most Frequently Cited Standards for Fis- cal Year 2024. U.S. Department of Labor, 2024. https://www.osha.gov/top10citedstandards/

work page 2024

[3] [3]

Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. Rose, and W. An. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Automation in Construction, 85:1–9, 2018

work page 2018

[4] [4]

N. D. Nath, A. H. Behzadan, and S. G. Paal. Deep learning for site safety: Real-time detection of per- sonal protective equipment.Automation in Con- struction, 112:103085, 2020

work page 2020

[5] [5]

Hignett and L

S. Hignett and L. McAtamney. Rapid Entire Body Assessment (REBA).Applied Ergonomics, 31(2):201–205, 2000

work page 2000

[6] [6]

X. Yan, H. Li, A. R. Li, and H. Zhang. Wear- able IMU-based real-time motion warning system for construction workers’ musculoskeletal disorders pre- vention.Automation in Construction, 74:2–11, 2017

work page 2017

[7] [7]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self- consistency improves chain of thought reasoning in language models. InProc. ICLR, 2023

work page 2023

[8] [8]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rol- land, L. Gustafson, T. Xiao, S. Whitehead, A. Berg, W.-Y. Lo, P. Doll´ ar, and R. Girshick. Segment Any- thing. InProc. ICCV, 2023

work page 2023

[9] [9]

¨Onal and E

O. ¨Onal and E. Dandıl. A video dataset for safe/unsafe worker behaviour classification in pro- duction environments.BMC Research Notes, 17:234, 2024

work page 2024

[10] [10]

CWPV: A Working Postures of the Construc- tion Working Postures Videos dataset.Figshare, 2024.https://figshare.com/articles/dataset/ 27907818

work page 2024

[11] [11]

C. Doyle. LLMs as Method Actors: A Model for Prompt Engineering and Architecture.arXiv preprint arXiv:2411.05778, 2024.https://doi. org/10.48550/arXiv.2411.05778

work page doi:10.48550/arxiv.2411.05778 2024

[12] [12]

Sanjeewani, G

P. Sanjeewani, G. Neuber, J. Fitzgerald, N. Chan- drasena, S. Potums, A. Alavi, and C. Lane. Real- time personal protective equipment non-compliance recognition on AI edge cameras.Electronics, 13(15):2990, 2024

work page 2024

[13] [13]

W. Zhao, L. Wang, Y. Li, X. Liu, Y. Zhang, B. Yan, and H. Li. A multi-scale and multi-stage human pose recognition method based on convolutional neural networks for non-wearable ergonomic evaluation. Processes, 12(11):2419, 2024

work page 2024

[14] [14]

J. Geng, J. Zhao, and S. Li. Optimizing helmet use detection in construction sites via fuzzy logic-based state tracking.Sensors, 25(2):456, 2025

work page 2025

[15] [15]

Aharon, R

N. Aharon, R. Orfaig, and B. Z. Bobrovsky. BoT- SORT: Robust associations multi-pedestrian track- ing.arXiv preprint arXiv:2206.14651, 2022

work page arXiv 2022

[16] [16]

Z. Bai, P. Chen, C. Fu, S. Yin, and E. Chen. Hal- lucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Welch, E

R. Welch, E. Konuk, and K. Smith. The cost of reasoning: Chain-of-thought induces overcon- fidence in vision-language models.arXiv preprint arXiv:2603.16728, 2025

work page arXiv 2025

[18] [18]

K. Kim, Y. Cho, and S. Zhang. Integrated safety monitoring system using RFID and computer vision for construction sites.Automation in Construction, 66:11–21, 2016

work page 2016

[19] [19]

Awolusi, E

I. Awolusi, E. Marks, and M. Solis. Wearable tech- nology for personalized construction safety monitor- ing and health assessment.Automation in Construc- tion, 91:235–250, 2018

work page 2018

[20] [20]

SAM 2: Segment Anything in Images and Videos

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khabsa, R. R¨ ohrbach, A. El-Nouby, A. Shiekh, W. Cheng, M. Saraf, A. Morcos, M. Paluri, and C. Feichtenhofer. SAM 2: Seg- ment Anything in Images and Videos.arXiv preprint arXiv:2408.00714, 2024. 11 A Supplementary Material: VLM System Prompts The full system prompts used in the three-pass adversar...

work page internal anchor Pith review Pith/arXiv arXiv 2024