pith. sign in

arxiv: 2605.19869 · v1 · pith:D7I2N6EGnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords construction safety monitoringvision-language modelsprompt engineeringadversarial chain-of-thoughtPPE detectionOSHA standardshallucination mitigationvideo analysis
0
0 comments X p. Extension
pith:D7I2N6EG Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{D7I2N6EG}

Prints a linked pith:D7I2N6EG badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A persona-scaffolded adversarial prompting method boosts precision in construction safety violation detection by 12%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a passive monitoring system for construction site safety that analyzes footage from body-worn and fixed cameras at the end of a shift. The system first detects personal protective equipment and hazards, refines the detections with segmentation, and then uses a vision-language model to verify compliance through a three-pass adversarial process. The core innovation lies in the prompt design that assigns professional personas to the model and structures the reasoning into generator, discriminator, and reconciliation steps with different rules. In an informal review on twelve development videos, this method achieved a twelve percent higher precision than basic prompting, particularly for categories where models often invent nonexistent violations. Such automation could help address the high rate of preventable worker injuries in construction by providing detailed reports without requiring live human operators.

Core claim

The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories.

What carries the argument

The persona-scaffolded adversarial chain-of-thought protocol that uses method-actor framing with professional backstories to enforce observational independence between generator, discriminator, and reconciliation passes via structural message isolation and asymmetric rules.

If this is right

  • The system maps violations to specific OSHA standards.
  • It performs REBA-inspired ergonomic risk scoring from pose keypoints.
  • Per-worker safety reports are produced with timestamped evidence.
  • The pipeline enables passive end-of-shift monitoring from POV and fixed cameras without real-time operators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the prompting technique generalizes, it could improve hallucination control in other VLM-based scene analysis tasks such as traffic or inventory monitoring.
  • Evaluating the method on larger datasets from multiple construction sites would test whether the precision gains persist beyond the initial development corpus.
  • The adversarial structure might serve as a general template for enhancing reliability in automated compliance checking systems.

Load-bearing premise

The 12% precision improvement from the persona-scaffolded adversarial protocol on the 12-video development corpus is due to the prompting method rather than video selection or reviewer expectations and will generalize to new sites and larger datasets.

What would settle it

Measuring the precision of the persona-based three-pass protocol versus single-pass prompting on a new, independent set of construction site videos from a different location to see if the 12% improvement holds.

Figures

Figures reproduced from arXiv: 2605.19869 by Ananth Sriram, Neel Mokaria, Rajveer Singh.

Figure 1
Figure 1. Figure 1: Three-stage pipeline architecture shared by both posture/ergonomics and PPE violation detec [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: PPE detection results. (a) Worker build [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Posture analysis results. (a) Worker seg [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pass 1 system prompt. The generator receives raw video only; no machine detection data or [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pass 2 system prompt. The discriminator receives raw video, annotated frames, and YOLO [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pass 3 reconciliation rules (user template, verbatim). Pass 3 omits the raw video and operates [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
read the original abstract

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a passive, end-of-shift construction safety monitoring pipeline that ingests video from POV body-worn and fixed cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct employing a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification. The principal contribution is the Stage 3 prompt design, which is reported to produce a 12% precision improvement over single-pass prompting (largest on hallucination-prone violation categories) based on an informal three-author review of the 12-video Ironsite development corpus. The system maps violations to OSHA standards, computes REBA-inspired ergonomic risk scores from pose keypoints, generates per-worker reports with timestamped evidence, and releases an evaluation harness.

Significance. If the reported precision gains are substantiated by rigorous, blinded evaluation on held-out data, the pipeline could provide a scalable, low-cost complement to existing construction safety monitoring, addressing the high rate of preventable fatalities in the sector. The release of the evaluation harness is a concrete strength that supports reproducibility and community follow-up.

major comments (1)
  1. [Abstract] Abstract (principal contribution paragraph): The central empirical claim of a 12% precision improvement from the persona-scaffolded adversarial CoT protocol is based solely on an informal three-author review of 12 videos in the development corpus. No statistical tests, inter-rater reliability, blinding, formal baselines, ablation isolating the persona component from the three-pass structure, or evaluation on an independent test set are described. This makes it impossible to attribute the observed difference to the method rather than reviewer expectations or corpus-specific effects, directly undermining the principal contribution.
minor comments (1)
  1. [General] The manuscript would benefit from explicit section headings and numbered subsections for the three pipeline stages to improve navigation and reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We agree that the evaluation of the persona-scaffolded adversarial chain-of-thought protocol requires more rigorous validation to strengthen the principal contribution. We will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract (principal contribution paragraph): The central empirical claim of a 12% precision improvement from the persona-scaffolded adversarial CoT protocol is based solely on an informal three-author review of 12 videos in the development corpus. No statistical tests, inter-rater reliability, blinding, formal baselines, ablation isolating the persona component from the three-pass structure, or evaluation on an independent test set are described. This makes it impossible to attribute the observed difference to the method rather than reviewer expectations or corpus-specific effects, directly undermining the principal contribution.

    Authors: We fully acknowledge the validity of this critique. The current evaluation is indeed informal and lacks the rigorous elements mentioned, which limits the generalizability of the 12% precision gain claim. In the revised manuscript, we will perform a formal evaluation on a held-out test set consisting of additional videos not used in development. This will include blinded review by independent raters, computation of inter-rater reliability (e.g., Cohen's kappa), statistical significance testing (e.g., paired t-test or McNemar's test for precision differences), and ablations to isolate the contribution of the persona scaffolding versus the three-pass adversarial structure. We will update the abstract to reflect these changes and add a new subsection detailing the evaluation protocol, results, and limitations. The evaluation harness will be extended to support this blinded setup. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation from informal review stands independent of inputs

full rationale

The paper presents a three-stage engineering pipeline whose principal claim is an observed 12% precision lift from persona-scaffolded adversarial CoT prompting, measured via informal three-author review on the 12-video development corpus. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear. The reported gain is framed as an empirical result rather than a quantity forced by construction from the prompt design itself. No self-citations load-bear the central claim, and the system description remains self-contained without reducing any output to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions about off-the-shelf vision models and introduces no new physical entities or mathematical axioms; the prompting protocol is the main added component.

axioms (2)
  • domain assumption Fine-tuned YOLO11 and SAM 3 produce sufficiently accurate initial detections and segmentations for downstream VLM verification.
    Invoked in the description of stages 1 and 2 without reported error rates on the target domain.
  • ad hoc to paper The informal three-author review provides a reliable signal of real-world precision improvement.
    Central to the 12% claim but not justified with formal evaluation protocols.

pith-pipeline@v0.9.0 · 5795 in / 1545 out tokens · 52521 ms · 2026-05-20T05:32:18.753188+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

  1. [1]

    Bureau of Labor Statistics.Census of Fatal Oc- cupational Injuries Summary, 2023. U.S. Depart- ment of Labor, 2024.https://www.bls.gov/news. release/archives/cfoi_12192024.htm

  2. [2]

    Top 10 Most Frequently Cited Standards for Fis- cal Year 2024

    Occupational Safety and Health Administration. Top 10 Most Frequently Cited Standards for Fis- cal Year 2024. U.S. Department of Labor, 2024. https://www.osha.gov/top10citedstandards/

  3. [3]

    Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. Rose, and W. An. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Automation in Construction, 85:1–9, 2018

  4. [4]

    N. D. Nath, A. H. Behzadan, and S. G. Paal. Deep learning for site safety: Real-time detection of per- sonal protective equipment.Automation in Con- struction, 112:103085, 2020

  5. [5]

    Hignett and L

    S. Hignett and L. McAtamney. Rapid Entire Body Assessment (REBA).Applied Ergonomics, 31(2):201–205, 2000

  6. [6]

    X. Yan, H. Li, A. R. Li, and H. Zhang. Wear- able IMU-based real-time motion warning system for construction workers’ musculoskeletal disorders pre- vention.Automation in Construction, 74:2–11, 2017

  7. [7]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self- consistency improves chain of thought reasoning in language models. InProc. ICLR, 2023

  8. [8]

    Kirillov, E

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rol- land, L. Gustafson, T. Xiao, S. Whitehead, A. Berg, W.-Y. Lo, P. Doll´ ar, and R. Girshick. Segment Any- thing. InProc. ICCV, 2023

  9. [9]

    ¨Onal and E

    O. ¨Onal and E. Dandıl. A video dataset for safe/unsafe worker behaviour classification in pro- duction environments.BMC Research Notes, 17:234, 2024

  10. [10]

    CWPV: A Working Postures of the Construc- tion Working Postures Videos dataset.Figshare, 2024.https://figshare.com/articles/dataset/ 27907818

  11. [11]

    C. Doyle. LLMs as Method Actors: A Model for Prompt Engineering and Architecture.arXiv preprint arXiv:2411.05778, 2024.https://doi. org/10.48550/arXiv.2411.05778

  12. [12]

    Sanjeewani, G

    P. Sanjeewani, G. Neuber, J. Fitzgerald, N. Chan- drasena, S. Potums, A. Alavi, and C. Lane. Real- time personal protective equipment non-compliance recognition on AI edge cameras.Electronics, 13(15):2990, 2024

  13. [13]

    W. Zhao, L. Wang, Y. Li, X. Liu, Y. Zhang, B. Yan, and H. Li. A multi-scale and multi-stage human pose recognition method based on convolutional neural networks for non-wearable ergonomic evaluation. Processes, 12(11):2419, 2024

  14. [14]

    J. Geng, J. Zhao, and S. Li. Optimizing helmet use detection in construction sites via fuzzy logic-based state tracking.Sensors, 25(2):456, 2025

  15. [15]

    Aharon, R

    N. Aharon, R. Orfaig, and B. Z. Bobrovsky. BoT- SORT: Robust associations multi-pedestrian track- ing.arXiv preprint arXiv:2206.14651, 2022

  16. [16]

    Z. Bai, P. Chen, C. Fu, S. Yin, and E. Chen. Hal- lucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

  17. [17]

    Welch, E

    R. Welch, E. Konuk, and K. Smith. The cost of reasoning: Chain-of-thought induces overcon- fidence in vision-language models.arXiv preprint arXiv:2603.16728, 2025

  18. [18]

    K. Kim, Y. Cho, and S. Zhang. Integrated safety monitoring system using RFID and computer vision for construction sites.Automation in Construction, 66:11–21, 2016

  19. [19]

    Awolusi, E

    I. Awolusi, E. Marks, and M. Solis. Wearable tech- nology for personalized construction safety monitor- ing and health assessment.Automation in Construc- tion, 91:235–250, 2018

  20. [20]

    SAM 2: Segment Anything in Images and Videos

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khabsa, R. R¨ ohrbach, A. El-Nouby, A. Shiekh, W. Cheng, M. Saraf, A. Morcos, M. Paluri, and C. Feichtenhofer. SAM 2: Seg- ment Anything in Images and Videos.arXiv preprint arXiv:2408.00714, 2024. 11 A Supplementary Material: VLM System Prompts The full system prompts used in the three-pass adversar...