An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training
Pith reviewed 2026-05-25 05:47 UTC · model grok-4.3
The pith
A closed-loop intelligent tutoring system using multimodal analysis and three-layer feedback produces significant gains on seven presentation skill dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching. Trained on 10,360 MOOC video segments, the XGBoost backbone achieves rubric-aligned scoring with R² of 0.48-0.61, Spearman's rho of 0.69-0.78, and MAE of 0.43-0.57. In the pre-post validation study with 204 adult learners over a 30-day window, participants showed significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), and practice frequency remained strongly associated with posttest performance after controlling for baseline scores and demographics.
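The three agreement metrics quoted above (R², Spearman's rho, MAE) can be computed from paired expert and model scores. The following is a minimal sketch using only the Python standard library; the `expert` and `model` lists are hypothetical illustration data, not values from the paper, and the paper's own evaluation code is not shown in the source.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between expert ratings and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for one BARS dimension (illustration only).
expert = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0]
model = [2.5, 2.5, 3.5, 4.5, 4.0, 5.0]

print(f"R2={r_squared(expert, model):.3f}",
      f"rho={spearman_rho(expert, model):.3f}",
      f"MAE={mae(expert, model):.3f}")
```

Reporting rho next to R² is informative because rho only checks whether the model orders segments the way experts do, while R² and MAE also penalize errors in scale.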
What carries the argument
The three-layer interpretable feedback architecture that maps rubric-aligned multimodal scores to audience-perceived expressive diagnostics and retrieval-augmented conversational coaching.
If this is right
- Participants demonstrated significant improvements across all seven BARS dimensions with Cohen's d ranging from 0.39 to 0.90.
- Practice frequency showed a strong positive association with posttest performance after controlling for baseline scores and demographics.
- The XGBoost model reached rubric-aligned scoring performance levels comparable to expert ratings on the held-out MOOC segments.
- The integrated feedback architecture supports deliberate practice for performance-based competencies at scale.
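The effect sizes in the first bullet are Cohen's d values for pre-post change. The source does not state which variant of d was used, so the sketch below shows two common ones as assumptions: a version standardized by the averaged pre and post variances, and d_z, which is standardized by the standard deviation of the paired differences. The `pre` and `post` scores are hypothetical.

```python
from statistics import mean, stdev


def cohens_d_pooled(pre, post):
    """Mean change divided by the root of the averaged pre/post variances."""
    pooled_sd = ((stdev(pre) ** 2 + stdev(post) ** 2) / 2) ** 0.5
    return (mean(post) - mean(pre)) / pooled_sd


def cohens_d_z(pre, post):
    """Mean paired difference divided by the SD of the paired differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


# Hypothetical scores on one BARS dimension for five learners.
pre = [2.0, 3.0, 3.0, 4.0, 3.0]
post = [3.0, 3.0, 4.0, 4.0, 4.0]

print(f"d (pooled) = {cohens_d_pooled(pre, post):.2f}")
print(f"d_z        = {cohens_d_z(pre, post):.2f}")
```

The two variants can differ noticeably on the same data, so a reported range such as 0.39-0.90 is only comparable across studies once the variant is known.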
Where Pith is reading between the lines
- The traceable, rubric-linked feedback may increase learner acceptance compared with opaque scoring systems.
- The same architecture could be tested on related performance skills such as job-interview responses or sales pitches.
- Follow-up measurement after the 30-day window would show whether the gains transfer to live, unscripted presentations.
Load-bearing premise
The three-layer feedback architecture successfully converts rubric-aligned multimodal scores into audience-perceived expressive diagnostics that produce the observed behavioral improvements.
What would settle it
A randomized trial in which one group uses the full three-layer ITS while a matched group receives only recording practice without the diagnostic or coaching layers, then comparing pre-post changes on the seven BARS dimensions.
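One way such a two-arm comparison could be analyzed is to compute each learner's gain score (posttest minus pretest) and compare mean gains between arms with Welch's unequal-variance t statistic. This is a sketch under that assumption, not an analysis plan from the paper; the arm data are hypothetical, and a real trial would run this per BARS dimension and look up the p-value from the t distribution with the returned degrees of freedom.

```python
from statistics import mean, variance


def gain_scores(pre, post):
    """Per-learner change from pretest to posttest."""
    return [b - a for a, b in zip(pre, post)]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical scores on one BARS dimension, four learners per arm.
its_gain = gain_scores(pre=[3, 3, 2, 4], post=[4, 5, 3, 5])       # full ITS
practice_gain = gain_scores(pre=[3, 2, 4, 3], post=[3, 3, 4, 4])  # recording only

t_stat, dof = welch_t(its_gain, practice_gain)
print(f"mean gain: ITS={mean(its_gain):.2f}, practice-only={mean(practice_gain):.2f}")
print(f"Welch t={t_stat:.2f}, df={dof:.2f}")
```

Comparing gains across randomized arms removes repeated testing and task familiarity from the contrast, because both arms experience them equally.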
Original abstract
This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings (R² = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.
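The phrase "after controlling for baseline scores" refers to a partial association: the slope of posttest on practice frequency once the part explained by baseline is removed from both. A minimal sketch of that idea follows, using residualization (the Frisch-Waugh-Lovell approach) with a single control. The paper also controls for demographics, which would require a full multiple regression; all variable names and data here are hypothetical.

```python
from statistics import mean


def simple_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def residuals(x, y):
    """Part of y not explained by a linear fit on x."""
    intercept, slope = simple_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def partial_slope(outcome, predictor, control):
    """Slope of outcome on predictor after removing the control from both."""
    r_outcome = residuals(control, outcome)
    r_predictor = residuals(control, predictor)
    return simple_fit(r_predictor, r_outcome)[1]


# Hypothetical data, built so that
# posttest = 1 + 0.5 * baseline + 0.2 * practice exactly.
baseline = [2.0, 3.0, 3.0, 4.0, 5.0]
practice = [4.0, 10.0, 2.0, 8.0, 6.0]
posttest = [1 + 0.5 * b + 0.2 * p for b, p in zip(baseline, practice)]

print(f"partial slope for practice = {partial_slope(posttest, practice, baseline):.3f}")
```

A non-zero partial slope shows an association that baseline skill does not account for. It does not by itself show that practice caused the gain, since learners choose how often to practice.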
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an interpretable closed-loop Intelligent Tutoring System (ITS) for asynchronous on-camera presentation training. It operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) via a three-layer feedback architecture (rubric-aligned multimodal scoring with XGBoost, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching). The scoring model, trained on 10,360 MOOC video segments, maps facial, vocal, textual, and oculomotor features to BARS ratings with R² = 0.48-0.61, Spearman's rho = 0.69-0.78, and MAE = 0.43-0.57. A pre-post validation study with 204 adult learners over 30 days reports significant gains across all BARS dimensions (Cohen's d = 0.39-0.90) and a positive association between practice frequency and posttest performance after controlling for baselines and demographics.
Significance. If the causal attribution holds, the work supplies a concrete, traceable pipeline from multimodal analytics to behavioral change in a scalable ITS for performance skills. Credit is due for the held-out MOOC training set, the independent pre-post validation with reported effect sizes, and the explicit controls for baseline scores and demographics in the practice-frequency analysis.
major comments (1)
- [Validation study] Validation study (pre-post design with 204 learners): the reported BARS gains (d = 0.39-0.90) and the practice-frequency association are attributed to the three-layer feedback architecture, yet every participant receives the full closed-loop ITS. Without a control arm, the design cannot separate architecture-driven change from repeated testing, task familiarity, or self-selection into higher practice frequency, leaving the central claim that the architecture converts rubric-aligned scores into diagnostics that drive observable improvements untested.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the validation study design. We address the major comment below and will revise the manuscript accordingly.
- Referee: [Validation study] Validation study (pre-post design with 204 learners): the reported BARS gains (d = 0.39-0.90) and the practice-frequency association are attributed to the three-layer feedback architecture, yet every participant receives the full closed-loop ITS. Without a control arm, the design cannot separate architecture-driven change from repeated testing, task familiarity, or self-selection into higher practice frequency, leaving the central claim that the architecture converts rubric-aligned scores into diagnostics that drive observable improvements untested.
  Authors: We agree that the single-arm pre-post design limits causal attribution to the three-layer feedback architecture specifically. The reported gains and dose-response association (after controlling for baselines and demographics) demonstrate observable change in a real-world deployment, but cannot isolate effects from repeated testing, task familiarity, or self-selection. We will revise the manuscript to: (1) explicitly state this limitation in the Discussion, (2) temper language around causal claims to emphasize associations and improvements rather than architecture-driven causation, and (3) outline future randomized controlled trials as necessary next steps. This preserves the contribution of the held-out training data, effect sizes, and traceable pipeline while addressing the design gap.
  Revision: yes
Circularity Check
No circularity; scoring model and pre-post gains remain independent
The paper trains an XGBoost model on 10,360 MOOC segments and reports standard held-out metrics (R² = 0.48-0.61, rho = 0.69-0.78). It then presents a separate pre-post study (n = 204) measuring direct behavioral change via Cohen's d and regression on practice frequency. No equations, self-citations, or definitions reduce the reported skill gains to the fitted scoring parameters by construction. The three-layer feedback architecture is described but does not invoke load-bearing self-citations or rename known results as novel derivations. The central claims are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- XGBoost model parameters
axioms (1)
- domain assumption: The seven-dimensional BARS accurately measures audience-perceived presentation quality