A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones

Junaid Ahmed Sifat; Md. Khalilur Rhaman; Md. Waseq Alauddin Alvi; Mohammad Irtiza Hossain; Sami Sadat; Suhail Haque Rafi

arXiv: 2508.11696 · v3 · submitted 2025-08-12 · 💻 cs.CV · cs.LG

Sami Sadat, Mohammad Irtiza Hossain, Junaid Ahmed Sifat, Suhail Haque Rafi, Md. Waseq Alauddin Alvi, Md. Khalilur Rhaman

Pith reviewed 2026-05-18 22:58 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords smoking detection · YOLO object detection · CCTV surveillance · fire safety · deep learning · edge computing · real-time monitoring

The pith

A custom YOLOv8-derived model detects smoking in fire exit zones with 78.9 percent recall and 83.7 percent mAP@50.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a real-time deep learning system to spot smoking in CCTV footage of fire exit zones, where smoking is a fire hazard. The authors test three recent YOLO models and then build a custom version from YOLOv8 with extra structures to handle low light and other surveillance difficulties. On a dataset of 8,124 images from 20 scenarios plus 2,708 low-light samples, the custom model reaches 78.90 percent recall and 83.70 percent mAP@50, and it runs at 52 to 97 milliseconds per inference on edge hardware. A reader would care because automatic detection could reduce the need for human monitoring and help enforce safety rules in places where smoking is banned.

Core claim

The paper claims that a custom object detection model derived from YOLOv8, with added structures for challenging surveillance contexts, outperforms YOLOv8, YOLOv11, and YOLOv12. Evaluated on 8,124 images from 20 different scenarios and 2,708 raw low-light samples, the model achieves a recall of 78.90 percent and an mAP@50 of 83.70 percent. The paper further reports inference times of 52 to 97 milliseconds on the Jetson Xavier NX, which it presents as evidence of real-time suitability for public safety monitoring and regulatory compliance.
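As a rough check on the real-time claim, the reported latency range can be converted to a frame rate. This is a minimal sketch that assumes one frame per inference and ignores capture, pre-processing, and post-processing overhead; the function name is illustrative and does not come from the paper.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latency range reported for the Jetson Xavier NX (52 to 97 ms per inference).
best_case_fps = latency_ms_to_fps(52)
worst_case_fps = latency_ms_to_fps(97)

print(f"{worst_case_fps:.1f} to {best_case_fps:.1f} FPS")  # prints "10.3 to 19.2 FPS"
```

Under these assumptions the reported range corresponds to roughly 10 to 19 frames per second for a single stream.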

What carries the argument

Custom YOLOv8-derived object detection model with added structures for challenging surveillance contexts

If this is right

The system supports real-time automatic monitoring of smoking violations in fire exit areas.
Edge device performance enables deployment without constant cloud connectivity.
Automatic detection aids regulatory compliance by logging potential violations.
The approach provides a base for adapting similar detection to other safety rules in public spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The model could connect to alarm systems that notify staff or authorities the moment smoking is detected.
Similar customizations might apply to spotting other restricted actions such as open flames or blocked exits in the same zones.

Load-bearing premise

The dataset of images from 20 scenarios and low-light areas represents the full range of real-world fire exit zone CCTV conditions and lets the model generalize without overfitting.

What would settle it

Running the custom model on a new collection of live CCTV videos from actual fire exit zones never seen during training and finding that recall drops well below 78.9 percent or mAP@50 falls well below 83.7 percent.
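The test described above could be scored with a short script like the following. This is a sketch only: the 5-point tolerance used to interpret "well below" is an assumption of this example, the helper names are hypothetical, and the detection counts are made-up placeholders rather than results from any experiment.

```python
REPORTED_RECALL = 0.789  # recall reported in the paper
REPORTED_MAP50 = 0.837   # mAP@50 reported in the paper


def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of real smoking events that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no ground-truth positives to score against")
    return true_positives / total


def falls_short(measured: float, reported: float, tolerance: float = 0.05) -> bool:
    """True if the measured metric is more than `tolerance` below the reported one.

    The default tolerance is an assumption; the source does not quantify "well below".
    """
    return measured < reported - tolerance


# Placeholder counts for an unseen CCTV collection (illustrative only).
new_recall = recall(true_positives=600, false_negatives=400)
print(new_recall, falls_short(new_recall, REPORTED_RECALL))  # prints "0.6 True"
```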

Original abstract

A deep learning real-time smoking detection system for CCTV surveillance of fire exit areas is proposed due to critical safety requirements. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples demonstrating low-light areas. We evaluated three advanced object detection models: YOLOv8, YOLOv11, and YOLOv12, followed by development of a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and mAP at 50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.
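The abstract reports latency as a range measured under multithreaded operation. A range like that could be collected along the following lines. This is a sketch under stated assumptions: `fake_infer` is a hypothetical stand-in that sleeps instead of running a detector, the worker count is arbitrary, and nothing here reflects how the authors actually timed their model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(frame):
    """Hypothetical stand-in for a detector call; sleeps instead of running a model."""
    time.sleep(0.005)
    return []  # a real detector would return bounding boxes here


def timed_infer(frame) -> float:
    """Run one inference and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    fake_infer(frame)
    return (time.perf_counter() - start) * 1000.0


def benchmark(frames, workers: int = 2):
    """Time every frame on a thread pool and return (fastest_ms, slowest_ms)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed_infer, frames))
    return min(latencies), max(latencies)


fastest_ms, slowest_ms = benchmark(range(8))
print(f"{fastest_ms:.1f} to {slowest_ms:.1f} ms per inference")
```

Reporting the minimum and maximum, rather than only a mean, mirrors the "52 to 97 milliseconds" form of the published figure.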

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a straightforward applied tweak of YOLOv8 for smoking detection in fire-exit CCTV, with usable edge-device numbers but thin coverage in the 20-scenario dataset.

The main takeaway is a practical engineering project that adapts YOLO models to spot smoking in fire exit zones on CCTV. The authors gathered 8,124 images across 20 scenarios plus 2,708 low-light samples, compared YOLOv8, YOLOv11, and YOLOv12, then added structures to a YOLOv8 base. Their custom version reaches 78.9 percent recall and 83.7 percent mAP@50, and they timed it on hardware like the Jetson Xavier NX at 52 to 97 milliseconds per inference with multithreading.

That device testing is the part that stands out as useful for anyone who might actually install something like this. It moves beyond pure accuracy tables and shows the system could keep up in a real surveillance setup. The comparisons are presented plainly and the numbers are easy to read. The dataset size is reasonable for a targeted task, and the focus on a safety-critical location gives the work a clear purpose.

The soft spots sit mostly with generalization. Twenty scenarios is a limited sample for claiming performance across varied environments, and without details on scenario selection, camera angles, crowd densities, or how the train and test splits were made, it is hard to rule out that the model is picking up collection-specific patterns rather than robust smoking cues. The recall under 80 percent also means that roughly one in five smoking instances would go undetected, which matters in a setting where the goal is automatic compliance. No error analysis or failure-case breakdown is mentioned, so readers cannot easily see where the model struggles.

This kind of paper is aimed at applied computer vision groups or safety-system developers who need working examples for restricted-area monitoring. Someone building similar CCTV analytics could borrow the dataset construction or the edge benchmarks. It deserves a serious referee because the experiments are concrete, the deployment angle is real, and the claims can be checked against the reported metrics, even if a revision round would likely ask for more evidence of data diversity.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a real-time deep learning system for detecting smoking in fire exit CCTV zones. It describes a dataset of 8,124 images collected from 20 scenarios plus 2,708 low-light samples, evaluates YOLOv8/v11/v12, introduces a custom YOLOv8-derived model with added structures, reports that the custom model achieves 78.90% recall and 83.70% mAP@50, and demonstrates inference on edge devices (Jetson Xavier NX at 52–97 ms per inference).

Significance. If the performance claims hold under proper validation, the work could offer a deployable tool for automated safety monitoring and regulatory compliance in restricted public spaces, with practical emphasis on low-light conditions and edge-device efficiency. The empirical focus on multiple YOLO variants and hardware testing provides a concrete baseline for surveillance applications.

major comments (2)

[Dataset and Experiments] Dataset description: the claim that the custom model delivers 'optimal object detection across varied environments' rests on 8,124 images from only 20 scenarios plus 2,708 low-light samples, yet no quantitative metrics of scenario diversity (camera angles, densities, partial occlusions, heights) or selection criteria are supplied. This directly undermines the generalization asserted in the abstract and results.
[Results and Evaluation] Evaluation protocol: no details are given on train/validation/test splits, cross-validation, or leakage prevention. Without these, the headline metrics (recall 78.90%, mAP@50 83.70%) cannot be interpreted as evidence of robustness rather than possible overfitting to the collected data, which is load-bearing for the central performance claim.

minor comments (1)

[Abstract] Abstract: 'mAP at 50' should be written consistently as mAP@50 or mAP50 to match standard object-detection notation.
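For readers less familiar with the notation, the "50" in mAP@50 refers to an intersection-over-union (IoU) threshold of 0.5: a predicted box counts as a correct detection only if its IoU with a ground-truth box is at least 0.5, and average precision is then computed at that threshold and averaged over classes. A minimal sketch of the matching rule follows; the box format `(x1, y1, x2, y2)` and the function names are assumptions of this example, not taken from the paper.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, truth_box) -> bool:
    """A prediction counts as a true positive at the '@50' threshold if IoU >= 0.5."""
    return iou(pred_box, truth_box) >= 0.5


print(is_match_at_50((0, 0, 10, 10), (0, 0, 10, 5)))    # IoU = 0.5, prints True
print(is_match_at_50((0, 0, 10, 10), (5, 5, 15, 15)))   # IoU = 25/175, prints False
```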

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned for the next version.

Referee: [Dataset and Experiments] Dataset description: the claim that the custom model delivers 'optimal object detection across varied environments' rests on 8,124 images from only 20 scenarios plus 2,708 low-light samples, yet no quantitative metrics of scenario diversity (camera angles, densities, partial occlusions, heights) or selection criteria are supplied. This directly undermines the generalization asserted in the abstract and results.

Authors: We agree that additional quantitative details on scenario diversity would strengthen the generalization claims. The current manuscript provides only the high-level count of 20 scenarios and low-light samples without metrics on camera angles, densities, occlusions, or heights, nor explicit selection criteria. In the revised manuscript we will add a table and accompanying text quantifying these attributes across the scenarios and describing the collection protocol and selection criteria. revision: yes
Referee: [Results and Evaluation] Evaluation protocol: no details are given on train/validation/test splits, cross-validation, or leakage prevention. Without these, the headline metrics (recall 78.90%, mAP@50 83.70%) cannot be interpreted as evidence of robustness rather than possible overfitting to the collected data, which is load-bearing for the central performance claim.

Authors: We acknowledge that the absence of evaluation protocol details is a limitation. The manuscript does not currently describe the train/validation/test splits, any cross-validation procedure, or leakage-prevention steps. We will revise the results section to include these specifics, stating the split ratios, confirming scenario-level separation to avoid leakage, and noting whether cross-validation was performed along with its rationale. revision: yes
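The scenario-level separation mentioned in the response above can be sketched as a grouped split, in which every image from a given scenario lands on the same side of the train/test boundary. This is an illustration under assumptions: the 20 percent test fraction, the `(image_id, scenario_id)` sample format, and the function name are all hypothetical, since the manuscript does not describe its splits.

```python
import random
from collections import defaultdict


def split_by_scenario(samples, test_fraction=0.2, seed=0):
    """Split (image_id, scenario_id) pairs so that no scenario appears in both sets.

    Returns (train_ids, test_ids). The test fraction applies to scenarios, not images.
    """
    by_scenario = defaultdict(list)
    for image_id, scenario_id in samples:
        by_scenario[scenario_id].append(image_id)

    scenarios = sorted(by_scenario)           # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(scenarios)
    n_test = max(1, round(len(scenarios) * test_fraction))
    test_scenarios = scenarios[:n_test]
    train_scenarios = scenarios[n_test:]

    train_ids = [img for s in train_scenarios for img in by_scenario[s]]
    test_ids = [img for s in test_scenarios for img in by_scenario[s]]
    return train_ids, test_ids


# Placeholder data: 20 scenarios with 5 images each (not the real dataset).
samples = [(f"s{s}_img{i}", s) for s in range(20) for i in range(5)]
train_ids, test_ids = split_by_scenario(samples)
print(len(train_ids), len(test_ids))  # prints "80 20"
```

Splitting at the image level instead would let near-duplicate frames from one camera appear in both sets, which is the leakage the referee is asking the authors to rule out.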

Circularity Check

0 steps flagged

No circularity: empirical evaluation on collected dataset with standard metrics

The paper describes dataset collection (8,124 images from 20 scenarios plus low-light samples), training of YOLOv8/11/12 and a custom variant, and direct reporting of recall/mAP on test splits. No equations, derivations, or first-principles results exist that could reduce to self-definition or fitted inputs by construction. Central performance claims rest on empirical measurement against held-out data rather than any self-citation chain or ansatz smuggled from prior work. The representativeness assumption affects generalization risk but does not create circularity in the reported results, which remain falsifiable on the described test set.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on standard deep-learning assumptions that the collected images capture sufficient environmental variability for generalization and that edge-device timing results transfer to operational CCTV deployments.

free parameters (1)

Added structures in custom YOLOv8 model
Additional components introduced for challenging surveillance contexts; their exact form and any associated hyperparameters are not specified but were presumably tuned to achieve the reported metrics.

axioms (1)

domain assumption: YOLO-family models are appropriate base architectures for real-time smoking detection in varied lighting and angles.
Invoked by selecting and modifying YOLOv8, v11, and v12 for the surveillance task.

pith-pipeline@v0.9.0 · 5725 in / 1486 out tokens · 49691 ms · 2026-05-18T22:58:58.047897+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: custom model that retains core features of YOLOv8 while introducing enhancements for low-light conditions and varied camera angles... achieved the highest recall (78.90%)... mAP@50 score (83.70%)

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: dataset of 8,124 images from 20 different scenarios... augmented by rotating them, adjusting exposure, and injecting a small amount (0.1%) of noise

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.