A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3
The pith
A two-stage framework localizes students with object detection then classifies their behavior to detect exam cheating at 0.95 accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The two-stage approach, with YOLOv8n detecting and localizing students in exam images followed by a fine-tuned RexNet-150 model classifying the cropped regions as normal or cheating, delivers 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score across 273,897 samples, marking a 13 percent gain over a 0.82 baseline for video-based detection and running at 13.9 milliseconds per sample on average.
What carries the argument
The two-stage pipeline consisting of YOLOv8n for detecting student locations in full exam-room images and RexNet-150 for classifying behavior from the resulting cropped and preprocessed image patches.
If this is right
- The framework supports large-scale use because it processes each sample in about 13.9 milliseconds.
- Ethical deployment is enabled by sending detection outcomes privately to individual students after the exam.
- Accuracy could increase by integrating audio recordings or analyzing sequences of video frames instead of single images.
Where Pith is reading between the lines
- Processing video by applying the two stages to successive frames would allow continuous monitoring without analyzing entire videos at once.
- The approach could transfer to identifying other prohibited actions in settings with fixed cameras, such as libraries or public transport.
- Using images from multiple independent sources likely improves generalization compared to training on data from one location.
Load-bearing premise
That a single still image of a student, after cropping, holds enough visual information to reliably tell cheating apart from normal exam behavior in varied room setups and without motion cues.
What would settle it
Measuring performance on a fresh set of exam images collected from an eleventh independent source or under different lighting and camera angles, where accuracy falls below 0.85, would indicate the claim does not hold broadly.
Figures
read the original abstract
Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13\% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a two-stage framework for exam cheating detection: YOLOv8n detects and localizes students in exam-room images, each crop is preprocessed and fed to a fine-tuned RexNet-150 classifier that labels the behavior as normal or cheating. The system is trained and evaluated on a compiled dataset of 273,897 samples drawn from 10 sources, reporting 0.95 accuracy, 0.94 recall, 0.96 precision and 0.95 F1-score together with 13.9 ms average inference time; these figures are presented as a 13 % improvement over a video-based baseline of 0.82 accuracy. The work also discusses ethical deployment via private per-student feedback.
Significance. If the performance numbers can be reproduced under controlled conditions, the approach would demonstrate that a lightweight, single-frame pipeline built from off-the-shelf detectors and classifiers can reach high accuracy on a large multi-source collection while remaining fast enough for real-time use. The explicit attention to private feedback is a constructive contribution to the ethics of automated monitoring. The current cross-modal baseline comparison, however, prevents the 13 % gain from being unambiguously attributed to the proposed architecture.
major comments (2)
- [Abstract] Abstract: the headline claim of a 13 % accuracy improvement (0.95 vs. 0.82) rests on a comparison between the proposed single-frame image pipeline and an explicitly video-based baseline; because the baseline can exploit temporal context unavailable to the static-image method, the reported delta cannot be attributed to the two-stage framework without a matched-modality control.
- [Abstract] Abstract / evaluation description: no information is supplied on train/validation/test splits, cross-validation procedure, or how the video baseline was reimplemented on the identical 273,897-sample collection, rendering the headline metrics and the 13 % gain unverifiable.
minor comments (2)
- The manuscript would benefit from a brief description of the image preprocessing pipeline applied to the YOLOv8n crops and any augmentation used when fine-tuning RexNet-150.
- Inclusion of a confusion matrix or per-class error analysis would help readers understand which cheating cues the single-frame classifier reliably captures and which remain ambiguous across the ten heterogeneous sources.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important issues of clarity and fair comparison in the abstract and evaluation sections. We address each point below and will revise the manuscript to incorporate clarifications and additional controls.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of a 13 % accuracy improvement (0.95 vs. 0.82) rests on a comparison between the proposed single-frame image pipeline and an explicitly video-based baseline; because the baseline can exploit temporal context unavailable to the static-image method, the reported delta cannot be attributed to the two-stage framework without a matched-modality control.
Authors: We acknowledge the validity of this observation. The 0.82 baseline is cited from prior video-based literature rather than reimplemented on our 273,897-sample collection. Our single-frame pipeline deliberately avoids temporal context to achieve real-time performance. In the revision we will qualify the comparison in the abstract, explicitly note the modality difference, and add a matched single-frame baseline (standard classifier on the same YOLOv8n crops) to better isolate the contribution of the two-stage framework. We will also reference the manuscript's existing statement that incorporating consecutive frames is a direction for future improvement. revision: yes
-
Referee: [Abstract] Abstract / evaluation description: no information is supplied on train/validation/test splits, cross-validation procedure, or how the video baseline was reimplemented on the identical 273,897-sample collection, rendering the headline metrics and the 13 % gain unverifiable.
Authors: We agree that these details were insufficiently summarized in the abstract. The full manuscript describes dataset compilation from 10 sources and the fine-tuning of RexNet-150, but we will expand the abstract and insert a dedicated 'Experimental Protocol' subsection. This will report the 70/15/15 stratified train/validation/test split, 5-fold cross-validation, and explicit clarification that the 0.82 video baseline is taken from cited prior work and was not reimplemented on our collection. The new single-frame baseline mentioned above will further support verifiability of the reported metrics. revision: yes
Circularity Check
No significant circularity; empirical metrics from held-out evaluation
full rationale
The paper presents a standard two-stage empirical ML pipeline (YOLOv8n detection followed by RexNet-150 classification) trained and evaluated on a compiled dataset of 273,897 samples. All reported metrics (0.95 accuracy, etc.) and the 13% improvement claim arise from conventional train/test splits and comparison to an external baseline, with no mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The work is self-contained as an applied engineering contribution.
Axiom & Free-Parameter Ledger
free parameters (2)
- RexNet-150 fine-tuning hyperparameters
- YOLOv8n detection threshold
axioms (2)
- domain assumption Single-frame cropped images contain sufficient visual cues to classify cheating versus normal behavior
- domain assumption The 273,897 samples compiled from 10 sources form an unbiased and representative distribution of exam-room scenes
Reference graph
Works this paper leans on
-
[1]
A cheating detection system in online examinations based on the analysis of eye-gaze and head-pose
Singh Ambi and Das Smita. A cheating detection system in online examinations based on the analysis of eye-gaze and head-pose. In THEETAS. 2022
work page 2022
-
[2]
Murat Bakirci. Real-time vehicle detection using yolov8-nano for intel- ligent transportation systems.Traitement du Signal, 41(4):1727, 2024
work page 2024
-
[3]
Analyzing the potential of rexnet- 150: A novel architecture for automobile parts classification
M Ranjith Kumar, Pv Adithiyan, G Jeevan Sendur, S Praveen Kumar, S Mahendira Kumar, and V Nikhil. Analyzing the potential of rexnet- 150: A novel architecture for automobile parts classification. In2024 3rd International Conference on Artificial Intelligence For Internet of Things (AIIoT), pages 1–6. IEEE, 2024
work page 2024
-
[4]
A visual analytics approach to facilitate the proctoring of online exams
Haotian Li, Min Xu, Yong Wang, Huan Wei, and Huamin Qu. A visual analytics approach to facilitate the proctoring of online exams. In Proceedings of the 2021 CHI conference on human factors in computing systems, pages 1–17, 2021
work page 2021
-
[5]
Jidong Luo, Guoyi Wang, Yanjiao Lei, Dong Wang, and Hongzhou Zhang. Yolov8n-pp: a lightweight pose recognition algorithm for pho- tovoltaic array cleaning robot.Journal of Real-Time Image Processing, 22(4):1–12, 2025
work page 2025
-
[6]
Reuben Moyo, Stanley Ndebvu, Michael Zimba, and Jimmy Mbelwa. A video-based detector for suspicious activity in examination with open- pose.arXiv preprint arXiv:2307.11413, 2023
-
[7]
Xi Ouyang, Shuangjie Xu, Chaoyun Zhang, Pan Zhou, Yang Yang, Guanghui Liu, and Xuelong Li. A 3d-cnn and lstm based multi-task learning architecture for action recognition.IEEE Access, 7:40757– 40770, 2019
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.