Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression

Da Guo; Jianfei Yang; Kaipeng Zhang; Kai Wang; Xiaojiang Peng; Yu Qiao

arxiv: 1907.03422 · v1 · pith:72QVMCVCnew · submitted 2019-07-08 · 💻 cs.CV

Pith reviewed 2026-05-25 01:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords engagement intensity regression · rank loss · bootstrap aggregation · LSTM · multi-instance learning · facial landmarks · EmotiW 2019 · MOOC videos

The pith

Adding a rank loss and bootstrap aggregation to an LSTM multi-instance model for engagement intensity regression earns third place in EmotiW 2019.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that extending a prior winning multi-instance LSTM framework with facial landmark features, a rank loss that enforces distance margins between engagement category pairs, and bootstrap model ensemble produces better regression of student engagement from MOOC videos. A sympathetic reader would care because more accurate automatic engagement measurement could support better design of online education content. The full method reaches an MSE of 0.0626 on the test set and third place in the EmotiW 2019 challenge. Validation experiments discuss the contribution of each added component.

Core claim

The central claim is that facial landmark features, a rank loss regularization enforcing margins between distant and adjacent engagement categories, and bootstrap aggregation by repeated random sampling and prediction averaging together yield improved performance over the baseline multi-instance LSTM framework on the engagement intensity regression task.

What carries the argument

The rank loss that enforces a distance margin between features of distant category pairs and adjacent category pairs, together with bootstrap aggregation that randomly samples training data several times and averages the resulting model predictions.
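To make the margin idea concrete, here is a minimal sketch of one way such a rank loss could be written. It is not the paper's formulation, which this page does not reproduce. The names `class_centers` and `rank_loss`, the hinge form, the Euclidean distance, the integer level indices, and the default `margin` are all assumptions chosen for illustration, loosely following the per-level feature centers and margin δ named in the Figure 2 caption.

```python
import math
from collections import defaultdict


def class_centers(features, labels):
    """Mean feature vector for each engagement level (hypothetical helper)."""
    groups = defaultdict(list)
    for vec, level in zip(features, labels):
        groups[level].append(vec)
    return {
        level: [sum(col) / len(vecs) for col in zip(*vecs)]
        for level, vecs in groups.items()
    }


def rank_loss(features, labels, margin=1.0):
    """Assumed hinge form: for an anchor level, the center of a more distant
    level should lie at least `margin` farther away than the center of a
    nearer level. Returns the mean penalty over all such triples."""
    centers = class_centers(features, labels)
    levels = sorted(centers)
    loss, terms = 0.0, 0
    for anchor in levels:
        for near in levels:
            for far in levels:
                if near == anchor or far == anchor:
                    continue
                # Only compare pairs where `far` is further in label space.
                if abs(far - anchor) > abs(near - anchor):
                    d_near = math.dist(centers[anchor], centers[near])
                    d_far = math.dist(centers[anchor], centers[far])
                    loss += max(0.0, d_near - d_far + margin)
                    terms += 1
    return loss / terms if terms else 0.0
```

Under these assumptions, centers that are ordered and well separated incur no penalty, while centers that collapse onto a single point pay the full margin on every triple. In training, a term like this would typically be added to the regression loss with a weighting coefficient, which is again a design choice not confirmed by the source.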

If this is right

Facial landmark features supply information beyond gaze and head pose for the LSTM.
The rank loss improves separation of engagement levels in the learned feature space.
Bootstrap aggregation lowers prediction variance through repeated sampling and averaging.
The combined modifications extend the previous solution while keeping the multi-instance LSTM structure.
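The bootstrap aggregation step in the list above can be sketched as follows. This is a toy under stated assumptions, not the authors' pipeline: `fit_line` is a hypothetical stand-in for training one LSTM regressor, and the number of models, the resample size (equal to the training set size), and the seed are arbitrary illustrative choices.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line in one dimension.
    Hypothetical stand-in for training a single regressor."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def bagged_predictor(xs, ys, n_models=10, seed=0):
    """Train `n_models` regressors on bootstrap resamples and
    return a function that averages their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Sample n indices with replacement (the bootstrap resample).
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)
```

Each model sees a slightly different resample, so their individual errors differ. Averaging the predictions is what is expected to reduce variance, at the cost of training several models instead of one.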

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rank loss may transfer to other ordinal regression problems where category distances matter.
Without reported ablations it is unclear whether both the rank loss and bootstrap are required or if one accounts for most of the gain.
The same ensemble and loss additions could be tested on different base networks or non-video affective datasets.

Load-bearing premise

The added rank loss and bootstrap aggregation produce genuine generalization gains beyond the authors' prior winning solution.

What would settle it

An ablation experiment on the same data and framework in which removing the rank loss or the bootstrap aggregation produces equal or lower MSE on the test set would falsify the claim that these additions drive the reported improvement.
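A bookkeeping sketch for such an ablation is given here, assuming predictions from each variant are already available on a labeled split. The function names, the variant keys, and the `full` default are hypothetical; no numbers from the paper are used or implied.

```python
def mse(preds, targets):
    """Mean squared error between predictions and ground-truth intensities."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def ablation_table(variant_preds, targets, full="full"):
    """Map each variant name to (MSE, MSE minus the full model's MSE).

    `variant_preds` is a dict such as
    {"full": [...], "no_rank_loss": [...], "no_bootstrap": [...]}
    where the keys are illustrative labels, not names from the paper.
    """
    scores = {name: mse(p, targets) for name, p in variant_preds.items()}
    return {name: (score, score - scores[full]) for name, score in scores.items()}
```

A positive delta for an ablated variant means that removing the component raised the error, which supports the component. A delta of zero or below is the outcome that would undercut the claim.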

Figures

Figures reproduced from arXiv: 1907.03422 by Da Guo, Jianfei Yang, Kaipeng Zhang, Kai Wang, Xiaojiang Peng, Yu Qiao.

**Figure 1.** The system pipeline of our approach.

**Figure 2.** C_i is the feature center of the i-th engagement intensity level. δ is the margin of different engagement levels.

Original abstract

This paper presents our approach for the engagement intensity regression task of EmotiW 2019. The task is to predict the engagement intensity value of a student when he or she is watching an online MOOCs video in various conditions. Based on our winner solution last year, we mainly explore head features and body features with a bootstrap strategy and two novel loss functions in this paper. We maintain the framework of multi-instance learning with long short-term memory (LSTM) network, and make three contributions. First, besides of the gaze and head pose features, we explore facial landmark features in our framework. Second, inspired by the fact that engagement intensity can be ranked in values, we design a rank loss as a regularization which enforces a distance margin between the features of distant category pairs and adjacent category pairs. Third, we use the classical bootstrap aggregation method to perform model ensemble which randomly samples a certain training data by several times and then averages the model predictions. We evaluate the performance of our method and discuss the influence of each part on the validation dataset. Our methods finally win 3rd place with MSE of 0.0626 on the testing set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is an incremental extension of the authors' own prior EmotiW winner that adds facial landmarks, a rank loss, and bootstrap but supplies no numbers showing those changes improve the score.

The paper describes the authors' 3rd-place solution for the EmotiW 2019 engagement intensity regression task on MOOC videos. They keep their earlier multi-instance LSTM setup and add three pieces: facial landmark features on top of gaze and head pose, a rank loss that enforces larger margins between distant engagement categories than adjacent ones, and bootstrap aggregation that trains several models on resampled subsets and averages the outputs. They report a test MSE of 0.0626.

The rank loss is a reasonable match for the ordinal labels, and bootstrap is a standard ensemble trick that can reduce variance in small-data settings like this one. The pipeline is practical for anyone already working on video-based engagement prediction.

The central weakness is the missing evidence. The abstract states that the influence of each component is discussed on the validation set, yet the text gives no MSE deltas, no ablation tables, and no direct comparison against the authors' own previous winning entry. Without those numbers it is impossible to tell whether the new loss or the ensemble actually moved the result or whether the score is essentially the same as before with extra machinery.

This paper is mainly useful to teams entering the same or similar affective computing competitions. A reader looking for general methods or first-principles advances will find little. I would bring it to a reading group only if the group tracks competition entries in video affect. I would not cite it. It deserves peer review because it is a clear, reproducible competition report with a defensible loss function, even if the gains from the extensions remain unquantified.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an approach for the EmotiW 2019 engagement intensity regression task. Building on the authors' prior winning solution, it incorporates facial landmark features alongside gaze and head pose, introduces a rank loss regularization that enforces distance margins between features of distant versus adjacent engagement categories, and applies bootstrap aggregation for ensembling within a multi-instance LSTM framework. The method is reported to achieve 3rd place with a test-set MSE of 0.0626, with component influences discussed on the validation set.

Significance. If supported by quantitative evidence, the result would demonstrate incremental, practical gains in multimodal video analysis for predicting student engagement in MOOC settings. The competition ranking provides a concrete, falsifiable benchmark, and the rank-loss idea offers a potentially generalizable regularization strategy for ordinal regression problems in affective computing.

major comments (2)

[Abstract] The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.

[Results / Experiments section] No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.

minor comments (1)

[Abstract] 'besides of the gaze' is grammatically incorrect and should read 'besides the gaze'.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support. We agree that additional numerical details would strengthen the manuscript and will incorporate them in the revision.

Referee: [Abstract] The central claim that the facial-landmark features, rank loss, and bootstrap ensemble produce the reported test MSE of 0.0626 is not accompanied by any numerical ablation results, MSE deltas, or statistical comparisons against the authors' own prior winning entry (or against ablated versions) on either the validation or test sets, despite the abstract stating that 'the influence of each part' is discussed.

Authors: The abstract states that influences are discussed on the validation set, and the full paper provides qualitative discussion of each component's contribution. However, we acknowledge the absence of explicit numerical ablation tables, deltas, or direct comparisons to the prior winning entry. We will add a dedicated ablation table with MSE values on the validation set in the revised manuscript. Direct test-set ablations against prior or ablated models are not feasible post-competition without additional submissions.
revision: yes

Referee: [Results / Experiments section] No validation curves, error bars, dataset statistics, or cross-validation details are supplied to allow verification of the performance numbers or to assess whether the bootstrap ensemble and rank loss yield genuine generalization improvements rather than comparable scores obtained with additional machinery.

Authors: We agree that the current manuscript lacks validation curves, error bars, dataset statistics, and cross-validation details. These omissions limit independent verification of the reported gains. We will add dataset statistics, error bars on validation results, and a description of the validation procedure in the revised experiments section.
revision: yes

standing simulated objections not resolved

Direct numerical comparisons or ablations on the hidden test set, as test labels are unavailable outside the original competition submissions.

Circularity Check

0 steps flagged

Empirical competition entry with no self-referential derivation or fitted prediction

full rationale

The paper reports a test-set MSE from an EmotiW 2019 submission that extends the authors' prior winning entry via added features, rank loss, and bootstrap ensemble. No equations, derivations, or parameter-fitting steps are described that reduce the reported MSE to a quantity defined in terms of itself. The result is measured on held-out test data and does not invoke uniqueness theorems, ansatzes smuggled via self-citation, or renaming of known results as new predictions. Self-citation of the prior solution is present but not load-bearing for any mathematical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond standard supervised learning assumptions.

axioms (1)

Domain assumption: Standard assumptions of multi-instance learning and LSTM sequence modeling hold for the video features.
Note: The framework is presented without explicit justification of these background modeling choices.

pith-pipeline@v0.9.0 · 5739 in / 1154 out tokens · 25440 ms · 2026-05-25T01:23:18.661150+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1] Brandon Amos, Bartosz Ludwiczuk, Mahadev Satyanarayanan, et al. 2016. OpenFace: A general-purpose face recognition library with mobile applications. CMU School of Computer Science (2016)

[2] Nigel Bosch, Sidney K D’Mello, Ryan S Baker, Jaclyn Ocumpaugh, Valerie Shute, Matthew Ventura, Lubin Wang, and Weinan Zhao. 2016. Detecting Student Emotions in Computer-Enabled Classrooms. In IJCAI. 4125–4129

[3] Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. 2018. EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (in press). ACM

[4] Sidney K D’Mello, Scotty D Craig, and Art C Graesser. 2009. Multimethod assessment of affective experience and expression during deep learning. International Journal of Learning Technology 4, 3-4 (2009), 165–187

[5] Sidney K D’Mello and Arthur Graesser. 2010. Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Modeling and User-Adapted Interaction 20, 2 (2010), 147–187

[6] B. Efron. 1979. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics 7, 1 (1979), 1–26

[7] Jennifer A Fredricks, Phyllis C Blumenfeld, and Alison H Paris. 2004. School engagement: Potential of the concept, state of the evidence. Review of educational research 74, 1 (2004), 59–109

[8] Benjamin S Goldberg, Robert A Sottilare, Keith W Brawner, and Heather K Holden. 2011. Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions. In International Conference on Affective Computing and Intelligent Interaction. Springer, 538–547

[9] Julie A Gray and Melanie DiLoreto. 2016. The effects of student engagement, student satisfaction, and perceived learning in online learning environments. International Journal of Educational Leadership Preparation 11, 1 (2016), n1

[10] E Joseph. 2005. Engagement tracing: using response times to model student disengagement. Artificial intelligence in education: Supporting learning through intelligent and socially informed technology 125 (2005), 88

[11–12] Kenneth R Koedinger, John R Anderson, William H Hadley, and Mary A Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education (IJAIED) 8 (1997), 30–43

[13] Zheng Li, Jianfei Yang, Juan Zha, Chang-Dong Wang, and Weishi Zheng. 2016. Online visual tracking via correlation filter with convolutional networks. In Visual Communications and Image Processing (VCIP), 2016. IEEE, 1–4

[14] Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. 2019. Frame attention networks for facial expression recognition in videos. arXiv:cs.CV/1907.00193

[15] Aamir Mustafa, Amanjot Kaur, Love Mehta, and Abhinav Dhall. 2018. Prediction and Localization of Student Engagement in the Wild. arXiv preprint arXiv:1804.00858 (2018)

[16] Xuesong Niu, Hu Han, Jiabei Zeng, Xuran Sun, Shiguang Shan, Yan Huang, Songfan Yang, and Xilin Chen. 2018. Automatic engagement prediction with GAP feature. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 599–603

[17] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. Hand keypoint detection in single images using multiview bootstrapping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2

[18] Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. 2017. Group emotion recognition with individual facial emotion CNNs and global image based CNNs. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 549–552

[19–20] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497

[21] Kai Wang, Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. 2018. Cascade Attention Networks For Group Emotion Recognition with Face, Body and Image Cues. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (in press). ACM

[22] Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. 2019. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. arXiv preprint arXiv:1905.04075 (2019)

[23] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Qiao Yu. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition

[24] Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. 2014. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 1 (2014), 86–98

[25] Xiang Xiao, Phuong Pham, and Jingtao Wang. 2017. Dynamics of affective states during mooc learning. In International Conference on Artificial Intelligence in Education. Springer, 586–589

[26] Jianfei Yang, Kai Wang, Xiaojiang Peng, and Yu Qiao. 2018. Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 594–598

[27] W. Yun, D. Lee, C. Park, J. Kim, and J. Kim. 2018. Automatic Recognition of Children Engagement from Facial Video using Convolutional Neural Networks. IEEE Transactions on Affective Computing (2018), 1–1. https://doi.org/10.1109/TAFFC.2018.2834350
