Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression
Pith reviewed 2026-05-25 01:23 UTC · model grok-4.3
The pith
Rank loss and bootstrap aggregation extend an LSTM multi-instance model for engagement intensity regression, which placed third at EmotiW 2019.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that facial landmark features, a rank loss regularization enforcing margins between distant and adjacent engagement categories, and bootstrap aggregation by repeated random sampling and prediction averaging together yield improved performance over the baseline multi-instance LSTM framework on the engagement intensity regression task.
What carries the argument
The rank loss that enforces a distance margin between features of distant category pairs and adjacent category pairs, together with bootstrap aggregation that randomly samples training data several times and averages the resulting model predictions.
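The abstract does not give the rank loss in closed form, so the following is only a minimal sketch of one plausible reading: a hinge penalty that fires whenever a pair of samples from adjacent engagement categories sits farther apart in feature space than a pair from distant categories, minus a margin. The function names, the integer category indices, the Euclidean distance, and the default `margin` are all assumptions made for illustration, not the authors' formulation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_loss(features, labels, margin=1.0):
    """Hypothetical hinge-style rank regularizer.

    features: list of feature vectors (lists of floats).
    labels:   integer category indices (assumed ordinal, e.g. 0, 1, 2, 3).
    A pair is "adjacent" if its labels differ by exactly 1 and "distant"
    if they differ by more than 1. Every (adjacent, distant) combination
    is penalized when d_adjacent + margin exceeds d_distant.
    """
    adjacent, distant = [], []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(labels[i] - labels[j])
            if gap == 1:
                adjacent.append(euclidean(features[i], features[j]))
            elif gap > 1:
                distant.append(euclidean(features[i], features[j]))
    if not adjacent or not distant:
        return 0.0  # nothing to compare in this batch
    total = 0.0
    for d_adj in adjacent:
        for d_far in distant:
            total += max(0.0, d_adj - d_far + margin)
    return total / (len(adjacent) * len(distant))


# Toy batch: three one-dimensional features with categories 0, 1, 2.
print(rank_loss([[0.0], [1.0], [1.5]], [0, 1, 2], margin=1.0))  # 0.25
```

In a training loop this term would be added to the regression loss with a weighting coefficient; the source does not say how the two are balanced.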
If this is right
- Facial landmark features supply information beyond gaze and head pose for the LSTM.
- The rank loss improves separation of engagement levels in the learned feature space.
- Bootstrap aggregation lowers prediction variance through repeated sampling and averaging.
- The combined modifications extend the previous solution while keeping the multi-instance LSTM structure.
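The bootstrap aggregation step named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a one-dimensional least-squares line stands in for the multi-instance LSTM, each resample is drawn with replacement at the size of the training set (the classical bootstrap; the abstract does not specify the sampling scheme), and `fit_line`, `bagged_predict`, `n_models`, and `seed` are hypothetical names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for the real model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Degenerate resample with a single distinct x: predict the mean.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def bagged_predict(xs, ys, x_new, n_models=10, seed=0):
    """Train n_models on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Sample n training indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(bagged_predict(xs, ys, x_new=1.5, n_models=10, seed=0))
```

Averaging reduces variance only to the extent that the individual models' errors are not perfectly correlated, which is why each model sees a different resample.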
Where Pith is reading between the lines
- The rank loss may transfer to other ordinal regression problems where category distances matter.
- Without reported ablations it is unclear whether both the rank loss and bootstrap are required or if one accounts for most of the gain.
- The same ensemble and loss additions could be tested on different base networks or non-video affective datasets.
Load-bearing premise
The added rank loss and bootstrap aggregation produce genuine generalization gains beyond the authors' prior winning solution.
What would settle it
An ablation experiment on the same data and framework in which removing the rank loss or the bootstrap aggregation produces equal or lower MSE on the test set would falsify the claim that these additions drive the reported improvement.
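The settling experiment amounts to comparing MSE between the full model and each ablated variant on the same targets. A minimal sketch of that bookkeeping is below; the variant names and all numbers are toy placeholders chosen for illustration and are not results from the paper.

```python
def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def ablation_deltas(variant_preds, targets, full_key="full"):
    """MSE of each ablated variant minus MSE of the full model.

    A delta <= 0 means removing that component did not hurt, which is the
    outcome that would count against the component's contribution.
    """
    full_mse = mse(variant_preds[full_key], targets)
    return {
        name: mse(preds, targets) - full_mse
        for name, preds in variant_preds.items()
        if name != full_key
    }


# Toy placeholder predictions (NOT from the paper).
targets = [0.0, 1.0]
variants = {
    "full": [0.0, 1.0],
    "no_rank_loss": [0.5, 1.0],
    "no_bootstrap": [0.0, 1.0],
}
for name, delta in ablation_deltas(variants, targets).items():
    verdict = "component helps" if delta > 0 else "no measurable gain"
    print(name, delta, verdict)
```

A real check would also report variance across random seeds, since a small positive delta can fall within run-to-run noise.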
Original abstract
This paper presents our approach for the engagement intensity regression task of EmotiW 2019. The task is to predict the engagement intensity value of a student when he or she is watching an online MOOCs video in various conditions. Based on our winner solution last year, we mainly explore head features and body features with a bootstrap strategy and two novel loss functions in this paper. We maintain the framework of multi-instance learning with long short-term memory (LSTM) network, and make three contributions. First, besides of the gaze and head pose features, we explore facial landmark features in our framework. Second, inspired by the fact that engagement intensity can be ranked in values, we design a rank loss as a regularization which enforces a distance margin between the features of distant category pairs and adjacent category pairs. Third, we use the classical bootstrap aggregation method to perform model ensemble which randomly samples a certain training data by several times and then averages the model predictions. We evaluate the performance of our method and discuss the influence of each part on the validation dataset. Our methods finally win 3rd place with MSE of 0.0626 on the testing set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an approach for the EmotiW 2019 engagement intensity regression task. Building on the authors' prior winning solution, it incorporates facial landmark features alongside gaze and head pose, introduces a rank loss regularization that enforces distance margins between features of distant versus adjacent engagement categories, and applies bootstrap aggregation for ensembling within a multi-instance LSTM framework. The method is reported to achieve 3rd place with a test-set MSE of 0.0626, with component influences discussed on the validation set.
Significance. If supported by quantitative evidence, the result would demonstrate incremental, practical gains in multimodal video analysis for predicting student engagement in MOOC settings. The competition ranking provides a concrete, falsifiable benchmark, and the rank-loss idea offers a potentially generalizable regularization strategy for ordinal regression problems in affective computing.
major comments (2)
- [Abstract] The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.
- [Results / Experiments section] No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.
minor comments (1)
- [Abstract] 'besides of the gaze' is grammatically incorrect and should read 'besides the gaze'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger quantitative support. We agree that additional numerical details would strengthen the manuscript and will incorporate them in the revision.
- Referee: [Abstract] The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.
Authors: The abstract states that influences are discussed on the validation set, and the full paper provides qualitative discussion of each component's contribution. However, we acknowledge the absence of explicit numerical ablation tables, deltas, or direct comparisons to the prior winning entry. We will add a dedicated ablation table with MSE values on the validation set in the revised manuscript. Direct test-set ablations against prior or ablated models are not feasible post-competition without additional submissions.
Revision: yes
- Referee: [Results / Experiments section] No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.
Authors: We agree that the current manuscript lacks validation curves, error bars, dataset statistics, and cross-validation details. These omissions limit independent verification of the reported gains. We will add dataset statistics, error bars on validation results, and a description of the validation procedure in the revised experiments section.
Revision: yes
- Not addressed in the revision: direct numerical comparisons or ablations on the hidden test set, as test labels are unavailable outside the original competition submissions.
Circularity Check
Empirical competition entry with no self-referential derivation or fitted prediction
The paper reports a test-set MSE from an EmotiW 2019 submission that extends the authors' prior winning entry via added features, rank loss, and bootstrap ensemble. No equations, derivations, or parameter-fitting steps are described that reduce the reported MSE to a quantity defined in terms of itself. The result is measured on held-out test data and does not invoke uniqueness theorems, ansatzes smuggled via self-citation, or renaming of known results as new predictions. Self-citation of the prior solution is present but not load-bearing for any mathematical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of multi-instance learning and LSTM sequence modeling hold for the video features.