pith. machine review for the scientific record.

arxiv: 2605.03615 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

PriorNet: Prior-Guided Engagement Estimation from Face Video

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 17:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords engagement estimation · face video analysis · prior knowledge injection · low-rank adaptation · evidential learning · affective computing · computer vision

The pith

Injecting priors at preprocessing, adaptation, and objective stages improves engagement estimation from face videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PriorNet to tackle the difficulties of estimating engagement levels from face videos, where faces are frequently missing from frames and annotations are subjective. It does this by embedding prior knowledge at three points in the processing pipeline. First, missing face detections are turned into special zero frames so the sequence stays complete. Second, a pre-trained video model is adapted efficiently without changing its base weights. Third, the training loss accounts for uncertainty in the labels. This leads to higher performance than earlier approaches on four different benchmark datasets when using each dataset's standard testing method. Readers would care because it offers a concrete way to make video-based affect analysis more reliable when perfect data is unavailable.
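To make the preprocessing prior concrete, here is a minimal sketch, assuming only that a per-frame face detector yields either a face crop or nothing; the function name, frame size, and array layout are illustrative, not drawn from the paper.

```python
# Editorial sketch, not the authors' code: keep face-detection failures in
# the sequence as explicit all-zero frames instead of dropping those frames.
import numpy as np

def build_face_sequence(detections, frame_size=(112, 112), channels=3):
    """detections: list of per-frame face crops (H x W x C arrays), with
    None wherever the detector found no face. Returns a (T, H, W, C) array
    in which every missing-face event survives as an all-zero frame."""
    h, w = frame_size
    zero_frame = np.zeros((h, w, channels), dtype=np.float32)
    frames = [zero_frame if det is None else det.astype(np.float32)
              for det in detections]
    return np.stack(frames)  # sequence length T is preserved

# A 5-frame clip in which frames 1 and 3 had no detected face:
face = np.ones((112, 112, 3))
clip = build_face_sequence([face, None, face, None, face])
assert clip.shape == (5, 112, 112, 3) and clip[1].sum() == 0
```

The point of the placeholder is that the temporal position of a detection failure stays visible to the downstream model rather than being silently spliced out.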

Core claim

PriorNet converts face detection failures into zero-frame placeholders to keep missing-face events in the input sequence, adapts a frozen Self-supervised Video Facial Affect Perceiver backbone with a Prior-guided Low-Rank Adaptation module, and trains using a Dirichlet-evidential uncertainty-weighted objective. On the EngageNet, DAiSEE, DREAMS, and PAFE datasets using native evaluation protocols, PriorNet outperforms the strongest previous methods listed for each, and ablations show the three prior types contribute complementarily to the gains.
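The adaptation stage builds on standard low-rank adaptation (LoRA, Hu et al., ref. [8] below); the abstract does not say how Prior-LoRA injects its prior, so the sketch shows only the generic frozen-backbone-plus-low-rank-update skeleton it presumably extends. Class and parameter names are illustrative.

```python
# Hedged sketch of generic LoRA; the paper's Prior-LoRA adds prior guidance
# whose form the abstract does not specify. Base weights stay frozen; only
# the low-rank factors A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen backbone
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                        # standard LoRA scaling

    def forward(self, x):
        # frozen path plus trainable low-rank update (B @ A) applied to x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only the adapter factors receive gradients:
layer = LoRALinear(nn.Linear(768, 768))
assert [n for n, p in layer.named_parameters() if p.requires_grad] == ["A", "B"]
```

Because B is initialized to zero, the adapted layer starts out exactly equal to the frozen backbone, which is what makes this style of specialization parameter-efficient and safe to attach to a pre-trained SVFAP encoder.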

What carries the argument

The three-stage prior injection in PriorNet: zero-frame placeholders for missing faces in preprocessing, Prior-LoRA for model adaptation, and a Dirichlet-evidential objective for training.
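For the objective-level prior, a common starting point is the evidential loss of Sensoy et al. (ref. [21] below), in which the network outputs non-negative evidence parameterizing a Dirichlet over class probabilities. PriorNet's uncertainty weighting is not specified in the abstract, so this minimal sketch shows only the hard-label evidential core.

```python
# Hedged sketch of a Dirichlet-evidential loss under hard labels, in the
# style of Sensoy et al.; PriorNet's uncertainty weighting is unspecified
# in the abstract and is therefore omitted here.
import torch
import torch.nn.functional as F

def evidential_nll(logits, targets, num_classes):
    """logits: (B, K) raw outputs; targets: (B,) hard class indices.
    Evidence e = softplus(logits) >= 0 gives Dirichlet parameters
    alpha = e + 1; the expected negative log-likelihood of the true class
    under that Dirichlet is digamma(alpha_0) - digamma(alpha_y)."""
    alpha = F.softplus(logits) + 1.0              # Dirichlet parameters
    alpha0 = alpha.sum(dim=-1, keepdim=True)      # Dirichlet strength
    y = F.one_hot(targets, num_classes).float()
    nll = (y * (torch.digamma(alpha0) - torch.digamma(alpha))).sum(dim=-1)
    return nll.mean()

# Usage: four engagement levels, a batch of two clips with hard labels.
loss = evidential_nll(torch.randn(2, 4), torch.tensor([0, 3]), num_classes=4)
```

The Dirichlet strength alpha_0 doubles as a per-clip uncertainty signal, which is presumably where a weighting against label subjectivity would attach.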

If this is right

  • Missing face events are explicitly represented rather than dropped, preserving sequence structure.
  • Parameter-efficient adaptation of large pre-trained models becomes feasible for this task.
  • Uncertainty in subjective engagement labels is modeled to guide training.
  • Complementary use of priors at multiple pipeline stages yields additive benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar prior injection could help other video understanding tasks facing incomplete observations.
  • Testing the framework on datasets with different demographics might reveal generalization limits.
  • Integrating these priors into end-to-end real-time systems could be a next step.

Load-bearing premise

The performance improvements result from the specific prior injections rather than from other details in the implementation or from dataset-specific effects.

What would settle it

An experiment that applies PriorNet to a new face-video engagement dataset under its native protocol and finds no improvement over the strongest baseline would challenge the central claim.

read the original abstract

Engagement estimation from face video remains challenging because facial evidence is often incomplete, labeled data are limited, and engagement annotations are subjective. We present PriorNet, a prior-guided framework that injects task-relevant priors at three stages of the pipeline: preprocessing, model adaptation, and objective design. PriorNet converts face-detection failures into explicit zero-frame placeholders so that missing-face events remain represented in the input sequence, adapts a frozen Self-supervised Video Facial Affect Perceiver (SVFAP) backbone through a Prior-guided Low-Rank Adaptation module (Prior-LoRA) for parameter-efficient specialization, and trains with a Dirichlet-evidential, uncertainty-weighted objective under hard-label supervision. We evaluate PriorNet on EngageNet, DAiSEE, DREAMS, and PAFE using each dataset's native evaluation protocol. Across these benchmarks, PriorNet improves over the strongest listed prior reference within each dataset's evaluation framing, while component ablations on EngageNet and DAiSEE indicate that the gains arise from complementary contributions of preprocessing, adaptation, and objective-level priors. These results support explicit prior injection as a useful design principle for face-video engagement estimation under the benchmark conditions studied in this work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PriorNet, a prior-guided framework for engagement estimation from face videos. It injects priors at three stages: converting face-detection failures to zero-frame placeholders during preprocessing, using Prior-LoRA to adapt a frozen SVFAP backbone, and employing a Dirichlet-evidential uncertainty-weighted objective under hard-label supervision. Evaluations on EngageNet, DAiSEE, DREAMS, and PAFE using native protocols report improvements over the strongest listed prior references, with component ablations on EngageNet and DAiSEE indicating complementary contributions from the three prior elements.

Significance. If the reported gains can be attributed specifically to the explicit prior injections rather than backbone choice or implementation details, the work provides a practical design principle for handling incomplete facial evidence and limited subjective labels in video-based engagement estimation. The parameter-efficient adaptation of a self-supervised backbone and evidential loss formulation could extend to other affect-analysis tasks with sparse annotations, supporting more robust real-world video processing pipelines.

major comments (2)
  1. [Ablation studies (Section 5)] Ablation studies on EngageNet and DAiSEE: the component ablations claim complementary gains from preprocessing, Prior-LoRA adaptation, and objective-level priors, but do not explicitly confirm that non-prior factors (SVFAP backbone, optimizer, sequence length, augmentations) are held fixed while toggling only the three prior components. This control is load-bearing for the central attribution claim.
  2. [Experiments (Section 4)] Experimental comparisons: while native evaluation protocols are followed, the manuscript does not state whether the strongest prior references were re-trained under identical regimes (same frozen SVFAP backbone, hyperparameters, and data processing) or evaluated from their originally reported numbers. Without this, deltas cannot be isolated to prior injection versus backbone or tuning differences.
minor comments (2)
  1. [Abstract] The abstract states training 'under hard-label supervision' yet employs a Dirichlet-evidential loss; a one-sentence clarification on label handling in the objective would reduce potential reader confusion.
  2. [Method] Notation for the Prior-LoRA module and zero-frame placeholder construction could be formalized with a short equation or pseudocode in the method section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our ablation controls and baseline comparisons. We address each major comment below with clarifications and revisions to improve transparency.

read point-by-point responses
  1. Referee: [Ablation studies (Section 5)] Ablation studies on EngageNet and DAiSEE: the component ablations claim complementary gains from preprocessing, Prior-LoRA adaptation, and objective-level priors, but do not explicitly confirm that non-prior factors (SVFAP backbone, optimizer, sequence length, augmentations) are held fixed while toggling only the three prior components. This control is load-bearing for the central attribution claim.

    Authors: We confirm that all non-prior factors—including the frozen SVFAP backbone, optimizer, sequence length, and augmentations—were held fixed across ablation variants, with only the three prior-injection elements toggled. We will revise Section 5 to explicitly document this control for unambiguous attribution. revision: yes

  2. Referee: [Experiments (Section 4)] Experimental comparisons: while native evaluation protocols are followed, the manuscript does not state whether the strongest prior references were re-trained under identical regimes (same frozen SVFAP backbone, hyperparameters, and data processing) or evaluated from their originally reported numbers. Without this, deltas cannot be isolated to prior injection versus backbone or tuning differences.

    Authors: Baseline numbers are taken from the originally reported results in each reference paper, following standard practice for multi-dataset comparisons where exact re-implementation under our SVFAP regime is not always feasible. We will add an explicit statement in Section 4 clarifying this. We acknowledge that this prevents fully isolating gains to prior injection alone without new controlled re-evaluations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark comparisons and ablations without reducing to self-referential fits or definitions.

full rationale

The paper presents an empirical ML framework (PriorNet) that injects priors at preprocessing, adaptation, and objective stages, then reports performance gains on four datasets under native protocols plus component ablations. No mathematical derivation, first-principles prediction, or parameter-fitting step is claimed that could reduce to its own inputs by construction. The central attribution of gains to the three prior components is supported by ablations rather than by any self-definitional equation, fitted-input-as-prediction, or load-bearing self-citation chain. Self-citations, if present in the full text, are not invoked to justify uniqueness theorems or ansatzes that would force the result. The work is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available, so no explicit free parameters, axioms, or invented entities were extracted. The approach relies on standard components from the prior literature, such as LoRA and Dirichlet distributions, without introducing new postulates.

pith-pipeline@v0.9.0 · 5497 in / 1112 out tokens · 48928 ms · 2026-05-07T17:47:30.106167+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not Every Subject Should Stay: Machine Unlearning for Noisy Engagement Recognition

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    Approximate subject-level unlearning recovers 89.3% and 92.5% of oracle performance gains on EngageNet and DAiSEE at roughly one-quarter the retraining cost in K=3 forget-set regimes.

Reference graph

Works this paper leans on

53 extracted references · 32 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. Karimah, S.N., Hasegawa, S.: Automatic engagement estimation in smart education/learning settings: a systematic review of engagement definitions, datasets, and methods. Smart Learning Environments 9(1), 31 (2022)
  2. Singh, M., et al.: Do I Have Your Attention: A Large Scale Engagement Prediction Dataset and Baselines. In: ICMI, pp. 174–182 (2023). https://doi.org/10.1145/3577190.3614164
  3. Singh, M., Sharma, G., et al.: DREAMS: Diverse Reactions of Engagement and Attention Mind States Dataset. In: ICPR, pp. 163–179 (2024)
  4. Kumar, P., Vedernikov, A., Chen, Y., Zheng, W., Li, X.: Computational Analysis of Stress, Depression and Engagement in Mental Health: A Survey (2025). https://arxiv.org/abs/2403.08824
  5. Abedi, A., Khan, S.S.: Improving State-of-the-art in Detecting Student Engagement with Resnet and TCN Hybrid Network. In: Conference on Robots and Vision, pp. 151–157 (2021)
  6. Abedi, A., Khan, S.S.: Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks (2024). https://arxiv.org/abs/2403.17175
  7. Ai, X., Sheng, V.S., Li, C., Yang, H., Cui, Z.: Class-attention video transformer for engagement prediction. Multimedia Tools and Applications 84, 29371–29390 (2025). https://doi.org/10.1007/s11042-024-20350-4
  8. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models (2021). https://arxiv.org/abs/2106.09685
  9. Whitehill, J., Serpell, Z., Lin, Y.-C., Foster, A., Movellan, J.R.: The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5(1), 86–98 (2014). https://doi.org/10.1109/TAFFC.2014.2316163
  10. Nezami, O.M., Dras, M., Hamey, L., Richards, D., Wan, S., Paris, C.: Automatic Recognition of Student Engagement using Deep Learning and Facial Expression (2019). https://arxiv.org/abs/1808.02324
  11. Monkaresi, H., Bosch, N., Calvo, R.A., D'Mello, S.K.: Automated detection of engagement using video-based estimation of facial expressions and heart rate. IEEE Transactions on Affective Computing 8(1), 15–28 (2017). https://doi.org/10.1109/TAFFC.2016.2515084
  12. Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: OpenFace 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66 (2018). https://doi.org/10.1109/FG.2018.00019
  13. Aslan, S., Alyuz, N., Tanriover, C., Mete, S.E., Okur, E., D'Mello, S.K., Arslan Esme, A.: Investigating the impact of a real-time, multimodal student engagement analytics technology in authentic classrooms. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019)
  14. Psaltis, A., Apostolakis, K.C., Dimitropoulos, K., Daras, P.: Multimodal student engagement recognition in prosocial games. IEEE Transactions on Games 10(3), 292–303 (2017)
  15. Khenkar, S., Jarraya, S.K.: Engagement detection based on analyzing micro body gestures using 3D CNN. Computers, Materials & Continua 70(2) (2022)
  16. Abedi, A., Khan, S.S.: Affect-driven ordinal engagement measurement from video. Multimedia Tools and Applications (2023). https://doi.org/10.1007/s11042-023-16345-2
  17. Vedernikov, A., et al.: TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals. In: CVPR Workshops, pp. 4723–4732 (2024)
  18. Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: AdaptFormer: Adapting vision transformers for scalable visual recognition. In: Advances in Neural Information Processing Systems, vol. 35 (2022)
  19. Dong, W., Zhang, X., Chen, B., Yan, D., Lin, Z., Yan, Q., Wang, P., Yang, Y.: Low-rank rescaled vision transformer fine-tuning: A residual design approach. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16101–16110 (2024)
  20. Kaur, A., Mustafa, A., Mehta, L., Dhall, A.: Prediction and Localization of Student Engagement in the Wild (2018). https://arxiv.org/abs/1804.00858
  21. Sensoy, M., Kaplan, L., Kandemir, M.: Evidential Deep Learning to Quantify Classification Uncertainty (2018). https://arxiv.org/abs/1806.01768
  22. Li, C., Li, K., Ou, Y., Kaplan, L.M., Jøsang, A., Cho, J.-H., Jeong, D.H., Chen, F.: Hyper Evidential Deep Learning to Quantify Composite Classification Uncertainty (2024). https://arxiv.org/abs/2404.10980
  23. Lin, J., Tao, L., Dong, M., Xu, C.: Uncertainty weighted gradients for model calibration. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15497–15507 (2025)
  24. Sun, L., Lian, Z., Wang, K., He, Y., Xu, M., Sun, H., Liu, B., Tao, J.: SVFAP: Self-supervised video facial affect perceiver. IEEE Transactions on Affective Computing (2024). https://doi.org/10.1109/TAFFC.2024.3436913
  25. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36. Springer (2016)
  26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: Single shot multibox detector. In: Computer Vision – ECCV 2016, pp. 21–37 (2016). https://doi.org/10.1007/978-3-319-46448-0_2
  27. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
  28. Dhall, A., Singh, M., Goecke, R., Gedeon, T., Zeng, D., et al.: EmotiW 2023: Emotion Recognition in the Wild Challenge, pp. 746–749 (2023)
  29. Gupta, A., D'Cunha, A., et al.: DAiSEE: Towards user engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016)
  30. Lee, T., Kim, D., Park, S., et al.: Predicting Mind-Wandering with Facial Videos in Online Lectures. In: CVPR Workshops, pp. 2103–2112 (2022)
  31. Jiang, F., Huang, C., Huang, Q., Huang, X.: Prompting Multimodal Vision-Language Models for Automated Student Engagement Prediction. SSRN working paper, posted 10 Dec 2025 (2025). https://doi.org/10.2139/ssrn.5900588
  32. Alarefah, W., Jarraya, S.K., Abuzinadah, N.: Transformer-based student engagement recognition using few-shot learning. Computers 14(3) (2025). https://doi.org/10.3390/computers14030109
  33. Selim, T., et al.: Students Engagement Level Detection in Online e-Learning Using Hybrid EfficientNetB7 Together With TCN, LSTM, and Bi-LSTM. IEEE Access 10, 99573–99583 (2022)
  34. Vedernikov, A., Kumar, P., Chen, H., Seppänen, T., Li, X.: Vision Large Language Models Are Good Noise Handlers in Engagement Analysis (2025). https://arxiv.org/abs/2511.14749
  35. Malekshahi, S., Kheyridoost, J.M., Fatemi, O.: A general model for detecting learner engagement: Implementation and evaluation. arXiv preprint arXiv:2405.04251 (2024)
  36. Shiri, F.M., Ahmadi, E., Rezaee, M., Perumal, T.: Detection of student engagement in e-learning environments using EfficientNetV2-L together with RNN-based models. Journal on Artificial Intelligence 6(1) (2024). https://doi.org/10.32604/jai.2024.048911
  37. Su, R., He, L., Luo, M.: Leveraging part-and-sensitive attention network and transformer for learner engagement detection. Alexandria Engineering Journal 107, 198–204 (2024). https://doi.org/10.1016/j.aej.2024.06.074
  38. Singh, A., Verma, N., Goyal, K., Singh, A., Kumar, P., Li, X.: VisioPhysioENet: Multimodal engagement detection using visual and physiological signals. arXiv preprint arXiv:2409.16126 (2024)
  39. Gothwal, P., Banerjee, D., Biswas, A.K.: ViBED-Net: Video based engagement detection network using face-aware and scene-aware spatiotemporal cues. arXiv preprint arXiv:2510.18016 (2025)
  40. Zhu, X., Huang, L.: Data augmented hybrid GCN transformer for student engagement recognition in e-learning. Alexandria Engineering Journal 141, 14–23 (2026). https://doi.org/10.1016/j.aej.2026.02.015
  41. Zhang, H., Xiao, X., Huang, T., Liu, S., Xia, Y., Li, J.: An novel end-to-end network for automatic student engagement recognition. In: 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), pp. 342–345 (2019). https://doi.org/10.1109/ICEIEC.2019.8784507
  42. Vedernikov, A., Sun, Z., Kykyri, V.-L., Pohjola, M., Nokia, M., Li, X.: Analyzing Participants' Engagement during Online Meetings Using Unsupervised Remote Photoplethysmography with Behavioral Features. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 389–399. IEEE Computer Society, Los Alamitos, CA, USA (2024)
  43. Sathisha, G., Subbaraya, C.K., Ravikumar, G.K.: Facial expression recognition in video using 3D-CNN deep features discrimination. In: 2024 3rd International Conference for Innovation in Technology (INOCON) (2024)
  44. Liao, J., et al.: Deep Facial Spatiotemporal Net For Engagement Prediction in Online Learning. Applied Intelligence 51(10) (2021). https://doi.org/10.1007/s10489-020-02139-8
  45. Huang, T., Mei, Y., Zhang, H., Liu, S., Yang, H.: Fine-grained engagement recognition in online learning environment. In: 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), pp. 338–341 (2019). https://doi.org/10.1109/ICEIEC.2019.8784559
  46. Li, X., Zhu, W., Li, Q., Hou, C., Zhang, Y.: Re-distributing facial features for engagement prediction with ModernTCN. Computers, Materials and Continua 81(1), 369–391 (2024). https://doi.org/10.32604/cmc.2024.054982
  47. Ma, X., Xu, M., Dong, Y., Sun, Z.: Automatic student engagement in online learning environment based on neural turing machine. International Journal of Information and Education Technology 11(3), 107–111 (2021). https://doi.org/10.18178/ijiet.2021.11.3.1497
  48. Mehta, N.K., Prasad, S.S., Saurav, S., Saini, R., Singh, S.: Three-dimensional DenseNet self-attention neural network for automatic detection of student's engagement. Applied Intelligence 52, 13803–13823 (2022). https://doi.org/10.1007/s10489-022-03200-4
  49. Hu, Y., Jiang, Z., Zhu, K.: An optimized CNN model for engagement recognition in an e-learning environment. Applied Sciences 12(16), 8007 (2022). https://doi.org/10.3390/app12168007
  50. Zhang, W.L., Jia, R.S., Wang, H., Che, C.Y., Sun, H.M.: A self-supervised learning network for student engagement recognition from facial expressions. IEEE Transactions on Circuits and Systems for Video Technology (2024)
  51. Safa, S., Abedi, A., Khan, S.S.: Supervised Contrastive Learning for Ordinal Engagement Measurement (2025). https://arxiv.org/abs/2505.20676
  52. Anh, N.T.L., Bach, N.G., Tu, N.T.T., Kamioka, E., Tan, P.X.: SVD-based mind-wandering prediction from facial videos in online learning. Journal of Imaging 10(5) (2024). https://doi.org/10.3390/jimaging10050097
  53. Bühler, B., Bozkir, E., Goldberg, P., Sümer, Ö., D'Mello, S., Gerjets, P., Trautwein, U., Kasneci, E.: From the lab to the wild: Examining generalizability of video-based mind wandering detection. International Journal of Artificial Intelligence in Education 35, 823–857 (2025). https://doi.org/10.1007/s40593-024-00412-2