A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection

Le-Khanh Nguyen; Trong-Doanh Nguyen; Van-Truong Le

arxiv: 2604.16234 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

keywords: exam cheating detection, object detection, image classification, deep learning, two-stage framework, academic integrity, YOLO, neural network classifier

The pith

A two-stage framework localizes students with object detection then classifies their behavior to detect exam cheating at 0.95 accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a straightforward two-stage system can effectively identify cheating during exams. It first uses an object detection model to find individual students in room photographs, crops those areas, and then applies a classification model to determine if the behavior is normal or indicative of cheating. This matters because traditional methods are inefficient and some AI alternatives are overly complex or opaque, while this approach runs quickly and keeps results private to avoid public shaming. The system was trained and tested on a large collection of images gathered from ten different sources.

Core claim

The two-stage approach uses YOLOv8n to detect and localize students in exam images, then a fine-tuned RexNet-150 model to classify each cropped region as normal or cheating. On a dataset of 273,897 samples it reports 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score. The paper presents this as a 13-point gain over a 0.82 accuracy baseline from video-based detection, with an average inference time of 13.9 milliseconds per sample.
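All four reported metrics follow from the counts in a binary confusion matrix with cheating as the positive class. The sketch below shows that relationship; the class name `BinaryConfusion` and the counts are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryConfusion:
    """Counts for a binary task where 'cheating' is the positive class."""
    tp: int  # cheating predicted as cheating
    fp: int  # normal predicted as cheating
    fn: int  # cheating predicted as normal
    tn: int  # normal predicted as normal

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


# Illustrative counts only; they are NOT the paper's confusion matrix.
example = BinaryConfusion(tp=90, fp=10, fn=10, tn=190)
print(f"accuracy={example.accuracy():.3f} precision={example.precision():.3f} "
      f"recall={example.recall():.3f} f1={example.f1():.3f}")
```

Because accuracy also counts true negatives, it can sit above precision and recall for the cheating class when normal samples dominate, as they do in this toy example.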

What carries the argument

The two-stage pipeline consisting of YOLOv8n for detecting student locations in full exam-room images and RexNet-150 for classifying behavior from the resulting cropped and preprocessed image patches.
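The control flow of such a pipeline can be sketched without any model weights. In the sketch below the detector and classifier are passed in as plain callables standing in for YOLOv8n and RexNet-150; the names `Detection`, `crop`, `run_pipeline`, the stub functions, and the `min_confidence` default are all assumptions made for illustration and do not come from the paper.

```python
from typing import Callable, List, NamedTuple, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2 and y2 exclusive


class Detection(NamedTuple):
    box: Box
    confidence: float


def crop(image: Image, box: Box) -> Image:
    """Cut out the region of interest, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def run_pipeline(
    image: Image,
    detect: Callable[[Image], List[Detection]],
    classify: Callable[[Image], str],
    min_confidence: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[Box, str]]:
    """Stage 1: localize students. Stage 2: classify each cropped region."""
    results = []
    for detection in detect(image):
        if detection.confidence < min_confidence:
            continue  # drop low-confidence boxes before classification
        roi = crop(image, detection.box)
        if not roi or not roi[0]:
            continue  # box fell entirely outside the image
        results.append((detection.box, classify(roi)))
    return results


# Stand-ins for YOLOv8n and RexNet-150 so the sketch runs without model weights.
def stub_detector(image: Image) -> List[Detection]:
    return [Detection((0, 0, 2, 2), 0.9), Detection((2, 0, 4, 2), 0.3)]


def stub_classifier(roi: Image) -> str:
    mean = sum(sum(row) for row in roi) / (len(roi) * len(roi[0]))
    return "cheating" if mean > 127 else "normal"


frame = [[200, 200, 10, 10], [200, 200, 10, 10]]
labels = run_pipeline(frame, stub_detector, stub_classifier)
print(labels)
```

Keeping the two stages behind separate callables means either model can be swapped or evaluated in isolation, which is also what a matched single-frame baseline would need.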

If this is right

The framework supports large-scale use because it processes each sample in about 13.9 milliseconds.
Ethical deployment is enabled by sending detection outcomes privately to individual students after the exam.
Accuracy could increase by integrating audio recordings or analyzing sequences of video frames instead of single images.
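To put the 13.9 ms figure in deployment terms, the arithmetic below converts it to throughput. It assumes the figure covers one cropped student and that crops are processed one after another without batching; neither assumption is confirmed by the source, and the function names are illustrative.

```python
MS_PER_SAMPLE = 13.9  # average inference time reported in the abstract


def samples_per_second(ms_per_sample: float = MS_PER_SAMPLE) -> float:
    """Sequential throughput implied by a per-sample latency."""
    return 1000.0 / ms_per_sample


def seconds_per_frame(students_in_frame: int,
                      ms_per_sample: float = MS_PER_SAMPLE) -> float:
    """Time to classify every student in one frame, assuming no batching."""
    return students_in_frame * ms_per_sample / 1000.0


print(f"{samples_per_second():.1f} samples per second")
print(f"{seconds_per_frame(40):.3f} s for a hypothetical 40-student frame")
```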

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Processing video by applying the two stages to successive frames would allow continuous monitoring without analyzing entire videos at once.
The approach could transfer to identifying other prohibited actions in settings with fixed cameras, such as libraries or public transport.
Using images from multiple independent sources likely improves generalization compared to training on data from one location.
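One simple way to apply a single-frame classifier to successive frames is to smooth its per-frame labels with a sliding-window majority vote. This is an illustrative technique chosen for this sketch, not something the paper proposes; the function name `smooth_labels` and the window size are assumptions.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator


def smooth_labels(frame_labels: Iterable[str], window: int = 5) -> Iterator[str]:
    """Yield the most common label among the last `window` frames."""
    recent: Deque[str] = deque(maxlen=window)
    for label in frame_labels:
        recent.append(label)
        # most_common(1) returns the label with the highest count so far
        yield Counter(recent).most_common(1)[0][0]


raw = ["normal", "normal", "cheating", "normal",
       "cheating", "cheating", "cheating"]
smoothed = list(smooth_labels(raw, window=3))
print(smoothed)
```

The isolated "cheating" label in the third frame is suppressed, while the sustained run at the end is kept. The trade-off is a delay of roughly half a window before a real change shows up.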

Load-bearing premise

That a single still image of a student, after cropping, holds enough visual information to reliably tell cheating apart from normal exam behavior in varied room setups and without motion cues.

What would settle it

Measuring performance on a fresh set of exam images collected from an eleventh independent source or under different lighting and camera angles, where accuracy falls below 0.85, would indicate the claim does not hold broadly.
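A check of this kind amounts to computing accuracy separately for each data source and flagging any source that falls under the 0.85 threshold named above. The sketch below assumes predictions are available as `(source_id, true_label, predicted_label)` records; that record layout, the function names, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (source_id, true_label, predicted_label)


def accuracy_by_source(records: Iterable[Record]) -> Dict[str, float]:
    """Accuracy computed separately for each data source."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for source, truth, predicted in records:
        total[source] += 1
        correct[source] += int(truth == predicted)
    return {source: correct[source] / total[source] for source in total}


def sources_below(records: Iterable[Record], threshold: float = 0.85) -> List[str]:
    """Sources whose accuracy falls under the threshold."""
    scores = accuracy_by_source(records)
    return sorted(s for s, acc in scores.items() if acc < threshold)


# Made-up records for illustration; "source_11" plays the unseen source.
records = [
    ("source_01", "normal", "normal"),
    ("source_01", "cheating", "cheating"),
    ("source_11", "normal", "normal"),
    ("source_11", "cheating", "normal"),
    ("source_11", "cheating", "cheating"),
    ("source_11", "normal", "normal"),
]
print(accuracy_by_source(records))
print(sources_below(records))
```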

Figures

Figures reproduced from arXiv: 2604.16234 by Le-Khanh Nguyen, Trong-Doanh Nguyen, Van-Truong Le.

**Figure 1.** Visualization of the proposed two-stage inference pipeline. (a) The original input frame captures the exam environment. (b) YOLOv8n localizes the examinee (yellow bounding box). (c) The region of interest (ROI) is cropped and normalized. (d) RexNet-150 classifies the behavior, correctly identifying 'Cheating' with high confidence. Adjacent text from the paper: "Note: Names are extracted from the dataset URLs on Roboflow. TABLE II: Globa…"

**Figure 2.** Overview of the proposed Exam Cheating Detection (ECD) framework. The system consists of two stages: (1) object localization using YOLOv8 to extract students, and (2) behavior classification using RexNet-150 to detect cheating. Adjacent text from the paper (E. Evaluation Metrics): "To provide a comprehensive and robust assessment of our framework's performance, we employ four standard evaluation metrics derived from the confusion matrix: …"

**Figure 4.** Training and validation F1-Score curves for the RexNet-150 model over 10 epochs. The validation F1-Score peaks around Epoch 8. Adjacent text from the paper: "balanced performance with a Precision of 0.91, a Recall of 0.91, and an F1-score of 0.91. This indicates that the system is not only accurate overall but is also highly effective at its primary task of identifying cheating behaviors, even when they are subtle and infrequent in real…"

**Figure 3.** Training and validation loss curves for the RexNet-150 model over 10 epochs. The divergence after Epoch 3 indicates the onset of overfitting. Adjacent text from the paper (C. Main Quantitative Results): "The selected model from Epoch 8 was evaluated on the unseen test set, which comprises 6,895 samples (5,057 not cheating and 1,838 cheating). The overall system achieved an impressive accuracy of 95.16%, further highlighting the strong gen…"

**Figure 5.** Confusion matrix of the final RexNet-150 model on the test set.
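The captions indicate that validation F1 peaks around Epoch 8 and that the Epoch 8 model was the one evaluated, even though the loss curves diverge after Epoch 3. Selecting a checkpoint by best validation F1 can be sketched as follows; the function name and the curve values are invented for illustration and are not read off the paper's plots.

```python
from typing import Sequence


def best_epoch(val_f1_by_epoch: Sequence[float]) -> int:
    """Return the 1-based epoch with the highest validation F1.

    Ties resolve to the earliest epoch, because max() keeps the first maximum.
    """
    best_index = max(range(len(val_f1_by_epoch)),
                     key=lambda i: val_f1_by_epoch[i])
    return best_index + 1


# Hypothetical 10-epoch validation F1 curve.
val_f1 = [0.80, 0.84, 0.87, 0.88, 0.89, 0.895, 0.90, 0.91, 0.905, 0.90]
print(best_epoch(val_f1))
```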

Original abstract

Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score, a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a two-stage framework for exam cheating detection: YOLOv8n detects and localizes students in exam-room images, each crop is preprocessed and fed to a fine-tuned RexNet-150 classifier that labels the behavior as normal or cheating. The system is trained and evaluated on a compiled dataset of 273,897 samples drawn from 10 sources, reporting 0.95 accuracy, 0.94 recall, 0.96 precision and 0.95 F1-score together with 13.9 ms average inference time; these figures are presented as a 13 % improvement over a video-based baseline of 0.82 accuracy. The work also discusses ethical deployment via private per-student feedback.

Significance. If the performance numbers can be reproduced under controlled conditions, the approach would demonstrate that a lightweight, single-frame pipeline built from off-the-shelf detectors and classifiers can reach high accuracy on a large multi-source collection while remaining fast enough for real-time use. The explicit attention to private feedback is a constructive contribution to the ethics of automated monitoring. The current cross-modal baseline comparison, however, prevents the 13 % gain from being unambiguously attributed to the proposed architecture.

major comments (2)

[Abstract] The headline claim of a 13% accuracy improvement (0.95 vs. 0.82) rests on a comparison between the proposed single-frame image pipeline and an explicitly video-based baseline. Because the baseline can exploit temporal context unavailable to the static-image method, the reported delta cannot be attributed to the two-stage framework without a matched-modality control.
[Abstract / evaluation description] No information is supplied on train/validation/test splits, cross-validation procedure, or how the video baseline was reimplemented on the identical 273,897-sample collection, rendering the headline metrics and the 13% gain unverifiable.
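The size of the headline gain also depends on how it is expressed. The difference between 0.95 and 0.82 is 13 percentage points, which is a larger figure when stated relative to the baseline. A quick check of the arithmetic, with illustrative function names:

```python
def absolute_gain_points(new: float, baseline: float) -> float:
    """Difference in percentage points."""
    return (new - baseline) * 100


def relative_gain_percent(new: float, baseline: float) -> float:
    """Improvement expressed as a percentage of the baseline."""
    return (new - baseline) / baseline * 100


points = absolute_gain_points(0.95, 0.82)
relative = relative_gain_percent(0.95, 0.82)
print(f"{points:.1f} percentage points, {relative:.1f}% relative to baseline")
```

Neither way of stating the gain addresses the modality mismatch raised above; it only fixes the unit.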

minor comments (2)

The manuscript would benefit from a brief description of the image preprocessing pipeline applied to the YOLOv8n crops and any augmentation used when fine-tuning RexNet-150.
Inclusion of a confusion matrix or per-class error analysis would help readers understand which cheating cues the single-frame classifier reliably captures and which remain ambiguous across the ten heterogeneous sources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important issues of clarity and fair comparison in the abstract and evaluation sections. We address each point below and will revise the manuscript to incorporate clarifications and additional controls.

Referee: [Abstract] The headline claim of a 13% accuracy improvement (0.95 vs. 0.82) rests on a comparison between the proposed single-frame image pipeline and an explicitly video-based baseline. Because the baseline can exploit temporal context unavailable to the static-image method, the reported delta cannot be attributed to the two-stage framework without a matched-modality control.

Authors: We acknowledge the validity of this observation. The 0.82 baseline is cited from prior video-based literature rather than reimplemented on our 273,897-sample collection. Our single-frame pipeline deliberately avoids temporal context to achieve real-time performance. In the revision we will qualify the comparison in the abstract, explicitly note the modality difference, and add a matched single-frame baseline (standard classifier on the same YOLOv8n crops) to better isolate the contribution of the two-stage framework. We will also reference the manuscript's existing statement that incorporating consecutive frames is a direction for future improvement. Revision: yes.

Referee: [Abstract / evaluation description] No information is supplied on train/validation/test splits, cross-validation procedure, or how the video baseline was reimplemented on the identical 273,897-sample collection, rendering the headline metrics and the 13% gain unverifiable.

Authors: We agree that these details were insufficiently summarized in the abstract. The full manuscript describes dataset compilation from 10 sources and the fine-tuning of RexNet-150, but we will expand the abstract and insert a dedicated 'Experimental Protocol' subsection. This will report the 70/15/15 stratified train/validation/test split, 5-fold cross-validation, and explicit clarification that the 0.82 video baseline is taken from cited prior work and was not reimplemented on our collection. The new single-frame baseline mentioned above will further support verifiability of the reported metrics. Revision: yes.
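For readers unfamiliar with the term, a stratified 70/15/15 split keeps the class ratio roughly equal across the three partitions. The sketch below is a minimal pure-Python illustration of that idea. The function name, the seed, and the toy labels are assumptions, and it does not reproduce the authors' actual protocol, which this simulated rebuttal only names.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into train/val/test, preserving class ratios."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy labels with a 4:1 class imbalance, for illustration only.
toy_labels = ["normal"] * 80 + ["cheating"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

If crops from the same image or video can land in different partitions, a split like this can still leak information, so grouping by source image before splitting may be worth considering.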

Circularity Check

0 steps flagged

No significant circularity; empirical metrics from held-out evaluation

The paper presents a standard two-stage empirical ML pipeline (YOLOv8n detection followed by RexNet-150 classification) trained and evaluated on a compiled dataset of 273,897 samples. All reported metrics (0.95 accuracy, etc.) and the 13% improvement claim arise from conventional train/test splits and comparison to an external baseline, with no mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The work is self-contained as an applied engineering contribution.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard pre-trained vision models fine-tuned on a custom dataset plus several domain assumptions about image sufficiency and data representativeness; no new physical entities or mathematical axioms are introduced.

free parameters (2)

RexNet-150 fine-tuning hyperparameters
Learning rate, batch size, epochs, and augmentation choices are selected to maximize performance on the training portion of the 273k samples.
YOLOv8n detection threshold
Confidence threshold for student localization is chosen on validation data and affects downstream classification.

axioms (2)

domain assumption: Single-frame cropped images contain sufficient visual cues to classify cheating versus normal behavior
Invoked by the two-stage design that discards temporal context from video.
domain assumption: The 273,897 samples compiled from 10 sources form an unbiased and representative distribution of exam-room scenes
Required for the reported accuracy to generalize beyond the training distribution.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] Singh Ambi and Das Smita. A cheating detection system in online examinations based on the analysis of eye-gaze and head-pose. In THEETAS, 2022.

[2] Murat Bakirci. Real-time vehicle detection using yolov8-nano for intelligent transportation systems. Traitement du Signal, 41(4):1727, 2024.

[3] M Ranjith Kumar, Pv Adithiyan, G Jeevan Sendur, S Praveen Kumar, S Mahendira Kumar, and V Nikhil. Analyzing the potential of rexnet-150: A novel architecture for automobile parts classification. In 2024 3rd International Conference on Artificial Intelligence For Internet of Things (AIIoT), pages 1–6. IEEE, 2024.

[4] Haotian Li, Min Xu, Yong Wang, Huan Wei, and Huamin Qu. A visual analytics approach to facilitate the proctoring of online exams. In Proceedings of the 2021 CHI conference on human factors in computing systems, pages 1–17, 2021.

[5] Jidong Luo, Guoyi Wang, Yanjiao Lei, Dong Wang, and Hongzhou Zhang. Yolov8n-pp: a lightweight pose recognition algorithm for photovoltaic array cleaning robot. Journal of Real-Time Image Processing, 22(4):1–12, 2025.

[6] Reuben Moyo, Stanley Ndebvu, Michael Zimba, and Jimmy Mbelwa. A video-based detector for suspicious activity in examination with openpose. arXiv preprint arXiv:2307.11413, 2023.

[7] Xi Ouyang, Shuangjie Xu, Chaoyun Zhang, Pan Zhou, Yang Yang, Guanghui Liu, and Xuelong Li. A 3d-cnn and lstm based multi-task learning architecture for action recognition. IEEE Access, 7:40757–40770, 2019.
