pith. sign in

arxiv: 1907.03422 · v1 · pith:72QVMCVCnew · submitted 2019-07-08 · 💻 cs.CV

Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression

Pith reviewed 2026-05-25 01:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords engagement intensity regressionrank lossbootstrap aggregationLSTMmulti-instance learningfacial landmarksEmotiW 2019MOOC videos
0
0 comments X

The pith

Rank loss and bootstrap aggregation improve an LSTM multi-instance model for engagement intensity regression to third place.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that extending a prior winning multi-instance LSTM framework with facial landmark features, a rank loss that enforces distance margins between engagement category pairs, and bootstrap model ensemble produces better regression of student engagement from MOOC videos. A sympathetic reader would care because more accurate automatic engagement measurement could support better design of online education content. The full method reaches an MSE of 0.0626 on the test set and third place in the EmotiW 2019 challenge. Validation experiments discuss the contribution of each added component.

Core claim

The central claim is that facial landmark features, a rank loss regularization enforcing margins between distant and adjacent engagement categories, and bootstrap aggregation by repeated random sampling and prediction averaging together yield improved performance over the baseline multi-instance LSTM framework on the engagement intensity regression task.

What carries the argument

The rank loss that enforces a distance margin between features of distant category pairs and adjacent category pairs, together with bootstrap aggregation that randomly samples training data several times and averages the resulting model predictions.

If this is right

  • Facial landmark features supply information beyond gaze and head pose for the LSTM.
  • The rank loss improves separation of engagement levels in the learned feature space.
  • Bootstrap aggregation lowers prediction variance through repeated sampling and averaging.
  • The combined modifications extend the previous solution while keeping the multi-instance LSTM structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rank loss may transfer to other ordinal regression problems where category distances matter.
  • Without reported ablations it is unclear whether both the rank loss and bootstrap are required or if one accounts for most of the gain.
  • The same ensemble and loss additions could be tested on different base networks or non-video affective datasets.

Load-bearing premise

The added rank loss and bootstrap aggregation produce genuine generalization gains beyond the authors' prior winning solution.

What would settle it

An ablation experiment on the same data and framework in which removing the rank loss or the bootstrap aggregation produces equal or lower MSE on the test set would falsify the claim that these additions drive the reported improvement.

Figures

Figures reproduced from arXiv: 1907.03422 by Da Guo, Jianfei Yang, Kaipeng Zhang, Kai Wang, Xiaojiang Peng, Yu Qiao.

Figure 1
Figure 1. Figure 1: The system pipeline of our approach. information of body and face cannot be synchronized. We regard the body feature as an independent modality. Although the OpenFace and OpenPose features include face, head and body, it is rather limited. These features can only represent the degree of movement of different component, but the concrete actions and gaze changing patterns are neglected. More severely, high-l… view at source ↗
Figure 2
Figure 2. Figure 2: C_i is the feature center of ith engagement intensity level. δ is the margin of different engagement levels. 3.5 Bootstrap and Model Ensemble The bootstrap method [6] is a statistical technique to estimate the distribution about dataset by averaging estimates from multiple small data samples. This approach is called sampling with replacement. The key idea of the method is estimating the true distribution w… view at source ↗
read the original abstract

This paper presents our approach for the engagement intensity regression task of EmotiW 2019. The task is to predict the engagement intensity value of a student when he or she is watching an online MOOCs video in various conditions. Based on our winner solution last year, we mainly explore head features and body features with a bootstrap strategy and two novel loss functions in this paper. We maintain the framework of multi-instance learning with long short-term memory (LSTM) network, and make three contributions. First, besides of the gaze and head pose features, we explore facial landmark features in our framework. Second, inspired by the fact that engagement intensity can be ranked in values, we design a rank loss as a regularization which enforces a distance margin between the features of distant category pairs and adjacent category pairs. Third, we use the classical bootstrap aggregation method to perform model ensemble which randomly samples a certain training data by several times and then averages the model predictions. We evaluate the performance of our method and discuss the influence of each part on the validation dataset. Our methods finally win 3rd place with MSE of 0.0626 on the testing set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an approach for the EmotiW 2019 engagement intensity regression task. Building on the authors' prior winning solution, it incorporates facial landmark features alongside gaze and head pose, introduces a rank loss regularization that enforces distance margins between features of distant versus adjacent engagement categories, and applies bootstrap aggregation for ensembling within a multi-instance LSTM framework. The method is reported to achieve 3rd place with a test-set MSE of 0.0626, with component influences discussed on the validation set.

Significance. If supported by quantitative evidence, the result would demonstrate incremental, practical gains in multimodal video analysis for predicting student engagement in MOOC settings. The competition ranking provides a concrete, falsifiable benchmark, and the rank-loss idea offers a potentially generalizable regularization strategy for ordinal regression problems in affective computing.

major comments (2)
  1. [Abstract] Abstract: The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.
  2. [Results / Experiments section] Results / Experiments section: No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.
minor comments (1)
  1. [Abstract] Abstract: 'besides of the gaze' is grammatically incorrect and should read 'besides the gaze'.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support. We agree that additional numerical details would strengthen the manuscript and will incorporate them in the revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.

    Authors: The abstract states that influences are discussed on the validation set, and the full paper provides qualitative discussion of each component's contribution. However, we acknowledge the absence of explicit numerical ablation tables, deltas, or direct comparisons to the prior winning entry. We will add a dedicated ablation table with MSE values on the validation set in the revised manuscript. Direct test-set ablations against prior or ablated models are not feasible post-competition without additional submissions. revision: yes

  2. Referee: [Results / Experiments section] Results / Experiments section: No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.

    Authors: We agree that the current manuscript lacks validation curves, error bars, dataset statistics, and cross-validation details. These omissions limit independent verification of the reported gains. We will add dataset statistics, error bars on validation results, and a description of the validation procedure in the revised experiments section. revision: yes

standing simulated objections not resolved
  • Direct numerical comparisons or ablations on the hidden test set, as test labels are unavailable outside the original competition submissions.

Circularity Check

0 steps flagged

Empirical competition entry with no self-referential derivation or fitted prediction

full rationale

The paper reports a test-set MSE from an EmotiW 2019 submission that extends the authors' prior winning entry via added features, rank loss, and bootstrap ensemble. No equations, derivations, or parameter-fitting steps are described that reduce the reported MSE to a quantity defined in terms of itself. The result is measured on held-out test data and does not invoke uniqueness theorems, ansatzes smuggled via self-citation, or renaming of known results as new predictions. Self-citation of the prior solution is present but not load-bearing for any mathematical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond standard supervised learning assumptions.

axioms (1)
  • domain assumption Standard assumptions of multi-instance learning and LSTM sequence modeling hold for the video features.
    The framework is presented without explicit justification of these background modeling choices.

pith-pipeline@v0.9.0 · 5739 in / 1154 out tokens · 25440 ms · 2026-05-25T01:23:18.661150+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Brandon Amos, Bartosz Ludwiczuk, Mahadev Satyanarayanan, et al. 2016. Open- face: A general-purpose face recognition library with mobile applications. CMU School of Computer Science (2016)

  2. [2]

    Nigel Bosch, Sidney K D’Mello, Ryan S Baker, Jaclyn Ocumpaugh, Valerie Shute, Matthew Ventura, Lubin Wang, and Weinan Zhao. 2016. Detecting Student Emotions in Computer-Enabled Classrooms.. In IJCAI. 4125–4129

  3. [3]

    Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. 2018. EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (in press). ACM

  4. [4]

    Sidney K D’Mello, Scotty D Craig, and Art C Graesser. 2009. Multimethod assess- ment of affective experience and expression during deep learning. International Journal of Learning Technology 4, 3-4 (2009), 165–187

  5. [5]

    Sidney K DâĂŹMello and Arthur Graesser. 2010. Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Modeling and User-Adapted Interaction 20, 2 (2010), 147–187

  6. [6]

    B. Efron. 1979. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics 7, 1 (1979), 1–26

  7. [7]

    Jennifer A Fredricks, Phyllis C Blumenfeld, and Alison H Paris. 2004. School engagement: Potential of the concept, state of the evidence.Review of educational research 74, 1 (2004), 59–109

  8. [8]

    Benjamin S Goldberg, Robert A Sottilare, Keith W Brawner, and Heather K Holden. 2011. Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions. InInternational Conference on Affective Computing and Intelligent Interaction . Springer, 538–547

  9. [9]

    Julie A Gray and Melanie DiLoreto. 2016. The effects of student engagement, student satisfaction, and perceived learning in online learning environments. International Journal of Educational Leadership Preparation 11, 1 (2016), n1

  10. [10]

    E Joseph. 2005. Engagement tracing: using response times to model student disengagement. Artificial intelligence in education: Supporting learning through intelligent and socially informed technology 125 (2005), 88

  11. [11]

    Kenneth R Koedinger, John R Anderson, William H Hadley, and Mary A Mark

  12. [12]

    International Journal of Artificial Intelligence in Education (IJAIED) 8 (1997), 30–43

    Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education (IJAIED) 8 (1997), 30–43

  13. [13]

    Zheng Li, Jianfei Yang, Juan Zha, Chang-Dong Wang, and Weishi Zheng. 2016. Online visual tracking via correlation filter with convolutional networks. In Visual Communications and Image Processing (VCIP), 2016 . IEEE, 1–4

  14. [14]

    Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. 2019. frame attention networks for facial expression recognition in videos. arXiv:cs.CV/1907.00193

  15. [15]

    Aamir Mustafa, Amanjot Kaur, Love Mehta, and Abhinav Dhall. 2018. Pre- diction and Localization of Student Engagement in the Wild. arXiv preprint arXiv:1804.00858 (2018)

  16. [16]

    Xuesong Niu, Hu Han, Jiabei Zeng, Xuran Sun, Shiguang Shan, Yan Huang, Songfan Yang, and Xilin Chen. 2018. Automatic engagement prediction with GAP feature. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 599–603

  17. [17]

    Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. Hand key- point detection in single images using multiview bootstrapping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Vol. 2

  18. [18]

    Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. 2017. Group emotion recognition with individual facial emotion CNNs and global image based CNNs. InProceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 549–552

  19. [19]

    Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri

  20. [20]

    In Proceedings of the IEEE international conference on computer vision

    Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision . 4489–4497

  21. [21]

    Kai Wang, , Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. 2018. Cascade Attention Networks For Group Emotion Recognition with Face, Body and Image Cues. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (in press) . ACM

  22. [22]

    Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. 2019. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. arXiv preprint arXiv:1905.04075 (2019)

  23. [23]

    Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Qiao Yu. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition

  24. [24]

    Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. 2014. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 1 (2014), 86–98

  25. [25]

    Xiang Xiao, Phuong Pham, and Jingtao Wang. 2017. Dynamics of affective states during mooc learning. In International Conference on Artificial Intelligence in Education. Springer, 586–589

  26. [26]

    Jianfei Yang, Kai Wang, Xiaojiang Peng, and Yu Qiao. 2018. Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 594–598

  27. [27]

    W. Yun, D. Lee, C. Park, J. Kim, and J. Kim. 2018. Automatic Recognition of Children Engagement from Facial Video using Convolutional Neural Networks. IEEE Transactions on Affective Computing (2018), 1–1. https://doi.org/10.1109/ TAFFC.2018.2834350