arXiv: 2604.04482 · v1 · submitted 2026-04-06 · 💻 cs.AI

Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

Pith reviewed 2026-05-10 19:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal large language models · educational video analysis · learner interaction prediction · multimedia learning theory · concept activation vectors · cognitive load · online courses · explainable prediction

The pith

Multimodal large language model embeddings of short video segments can predict population-level learner interactions such as pausing and skipping, while mapping to interpretable concepts from multimedia learning theory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pipeline that turns educational video content into predictions of how learners will use controls like play, pause, skip, and rewind, using only the video itself rather than waiting for real usage data. Short segments are embedded by multimodal LLMs, a neural classifier is trained on 77 million interaction events from 66 courses to spot peaks, and GPT-5 labels the segments with features drawn from multimedia learning theory so that concept activation vectors can explain which instructional elements drive the predictions. A sympathetic reader would care because this offers a low-cost way for instructors to test and refine video design before release, potentially improving cognitive load management at the scale of entire online programs. The approach also claims to generalize across academic fields not seen during training.

Core claim

Classifiers built on MLLM embeddings of short video segments reliably identify temporally fine-grained peaks in learner interactions across 77 million control events from 66 online courses. These classifiers generalize to unseen academic fields, and their decisions can be interpreted through concept activation vectors built from GPT-5-coded video features. The features are drawn from multimedia learning theory on instructional design for optimal cognitive load, and the resulting vectors recover theory-relevant instructional concepts.

What carries the argument

MLLM embeddings of short video segments passed to a neural classifier, interpreted through concept activation vectors on GPT-5-coded features aligned with multimedia learning theory.

If this is right

  • Instructors gain a pre-deployment check that flags video segments likely to trigger high cognitive load before students see them.
  • The same pipeline can be applied to new courses in different subjects without retraining on their specific interaction data.
  • Model explanations surface concrete instructional concepts (such as segment length or example density) that align with established theory, enabling empirical tests of that theory at population scale.
  • Prediction works from content alone, removing the need to collect and store large volumes of learner interaction logs for every new video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated suggestions for editing problematic segments could be generated by identifying which GPT-coded features most strongly activate the interaction-peak class.
  • The method might extend to non-video educational materials if their content can be segmented and embedded similarly, though this remains untested here.
  • Longitudinal studies could check whether videos pre-screened with this pipeline actually produce lower dropout or better learning outcomes in real deployments.
  • If the embeddings encode stable cognitive signals, the same features might predict interactions in live classroom recordings or other media formats.

Load-bearing premise

MLLM embeddings of short video segments preserve usable signals of cognitive processing and instructional design quality, and GPT-5 feature coding provides an unbiased bridge to multimedia learning theory.

What would settle it

Train the same classifier on MLLM embeddings from one set of courses and test on a completely new academic field; if interaction-peak prediction accuracy drops to chance level, or if the top-activated concepts from the activation vectors show no correspondence with expert ratings of the same segments, the central claim fails.
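
The test described above amounts to a leave-one-field-out evaluation scored by AUC, where 0.5 corresponds to chance. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: `roc_auc`, `leave_one_field_out`, and the `fit` / `predict` callables are hypothetical names, and each record is assumed to be a `(field, features, label)` tuple with a binary peak label.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def leave_one_field_out(records, fit, predict):
    """Hold out each academic field in turn; return {field: AUC on that field}.

    records: iterable of (field, features, label) tuples (assumed layout).
    fit:     callable taking a list of (features, label) and returning a model.
    predict: callable taking (model, features) and returning a peak score.
    """
    by_field = defaultdict(list)
    for field, features, label in records:
        by_field[field].append((features, label))
    results = {}
    for held_out, test in by_field.items():
        # Train only on fields other than the held-out one.
        train = [ex for f, exs in by_field.items() if f != held_out for ex in exs]
        model = fit(train)
        scores = [predict(model, x) for x, _ in test]
        results[held_out] = roc_auc([y for _, y in test], scores)
    return results
```

An AUC near 0.5 on every held-out field would count against the generalization claim. The comparison with expert ratings is a separate check that this sketch does not cover.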

Figures

Figures reproduced from arXiv: 2604.04482 by Dominik Glandorf, Fares Fawzi, Tanja Käser.

Figure 1
Figure 1: Pipeline for predicting learners’ interactions with online learning videos and explaining predictions from multimedia learning theory.
… explained using AI-coded CTML features; and (RQ3) how sensitive embedding-based predictions are to such features. We evaluate our approach on a multi-year dataset of over 77 million video interactions across 1,641 videos from 66 courses. Our results show that embedding-bas…
Figure 2
Figure 2: Three modalities of video segments around t are encoded by pre-trained transformers. A neural classifier predicts if Signal_v(t) is among the top K% at timepoint t.
… linearly detrended the signal to remove global temporal trends (e.g., position-in-video effects) that are unrelated to video content. Finally, 5) we defined the signals in terms of percentile ranks: Signal_v(t) = rank_{t∈{30,...,D_v−30}}(Events_v(t…
Figure 3
Figure 3: AUC (± std, 5 seeds) for prediction of top 5% learner-video interaction moments on fields unseen during training as a measure of generalization of our classifier.
Figure 4
Figure 4: Learner-video interactions at 6,000 video moments (50% from top 5% ranks) by CTML features coded by GPT-5 (inter-rater agreement in the right panel). For example, moments without a formula have an average PausedAt_v(t) rank of 67% within their video, whereas moments with a formula have a higher average rank of 71%.
3.2 RQ2: Predictiveness of AI-coded CTML Features. We analyzed the informativeness of CTML fea…
Figure 5
Figure 5: TCAV values of CTML concepts and activations in our classifier H. Significant (*) values above 0.5 mean that the classifier was positively sensitive to the concept present in the activation space.
… embeddings clearly outperformed CTML features at larger sample sizes; CTML feature performance plateaued at medium sample sizes.
3.3 RQ3: Sensitivity of Model Predictions to CTML features. Given the association be…
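
The TCAV values in the Figure 5 caption can be read as the fraction of inputs whose classifier output increases when the activation is nudged along a concept direction, which is why 0.5 is the neutral point. The sketch below illustrates that computation under loud simplifications and is not the paper's implementation. The concept direction is taken as a normalized difference of class means rather than the normal of a trained linear separator as in the original TCAV method, the gradient is approximated by central finite differences, and `head` is a hypothetical callable standing in for the part of classifier H above the probed layer.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(concept_acts, random_acts):
    """Unit vector from random activations toward concept activations.

    Simplification: difference of means, not a trained linear classifier.
    """
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("concept and random activations have identical means")
    return [d / norm for d in diff]


def directional_derivative(head, activation, direction, eps=1e-4):
    """Central finite-difference estimate of d head / d direction."""
    plus = [a + eps * d for a, d in zip(activation, direction)]
    minus = [a - eps * d for a, d in zip(activation, direction)]
    return (head(plus) - head(minus)) / (2 * eps)


def tcav_score(head, activations, cav):
    """Fraction of activations whose output rises along the concept direction."""
    positive = sum(1 for a in activations if directional_derivative(head, a, cav) > 0)
    return positive / len(activations)
```

A score is usually only interpreted after a significance test against directions built from random splits, which is what the (*) marker in the caption refers to. That test is omitted here.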
Abstract

Learners' use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.
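
The "interaction peaks" in the abstract correspond to the labeling steps quoted in the Figure 2 excerpt: linearly detrend a per-video event signal, convert it to within-video percentile ranks, and mark the top K% as positives. Below is a rough sketch of those steps under assumptions, not the authors' preprocessing. It assumes one event count per second, drops `margin` seconds at each end (the excerpt's index set starts at 30 and ends at D_v−30), and uses a "share of values less than or equal" percentile rank because the exact rank formula is truncated in the source. All function names are hypothetical.

```python
from bisect import bisect_right


def linear_detrend(values):
    """Subtract the least-squares line fitted against the time index."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return [v - (y_mean + slope * (t - t_mean)) for t, v in enumerate(values)]


def peak_labels(events_per_second, margin=30, top_k_percent=5.0):
    """Return (second, percentile_rank, is_peak) for each retained second.

    A second is a peak when fewer than top_k_percent of the retained
    seconds have a strictly larger detrended value.
    """
    window = events_per_second[margin:len(events_per_second) - margin]
    if len(window) < 2:
        raise ValueError("video too short for the chosen margin")
    detrended = linear_detrend(window)
    ordered = sorted(detrended)
    n = len(detrended)
    labelled = []
    for offset, value in enumerate(detrended):
        count_le = bisect_right(ordered, value)  # values <= this one
        rank = count_le / n
        is_peak = (n - count_le) < n * top_k_percent / 100
        labelled.append((margin + offset, rank, is_peak))
    return labelled
```

Because ranks are computed within each video, the labels describe moments that stand out relative to the rest of the same video, not videos with many interactions overall.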

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a scalable pipeline that extracts MLLM embeddings from short educational video segments, trains a neural classifier to predict population-level learner interaction peaks (pausing, skipping, rewinding) as proxies for cognitive load, and interprets the model via GPT-5-coded instructional features and concept activation vectors grounded in multimedia learning theory. It reports evaluation on 77 million video-control events across 66 courses and claims reliable prediction, cross-field generalization, and encoding of theory-relevant concepts.

Significance. If the central claims hold with proper validation, the work could enable cost-efficient pre-screening of video designs and large-scale empirical tests of multimedia learning theory. The combination of embedding-based prediction with post-hoc CAV interpretation is a reasonable direction for explainable educational analytics, but the current manuscript provides no quantitative evidence to establish whether the approach actually outperforms simpler baselines or captures the intended cognitive signals.

major comments (3)
  1. [Abstract] The claim that 'classifiers based on MLLM embeddings reliably predict interaction peaks' and 'generalize to unseen academic fields' is unsupported because the abstract (and the provided evaluation description) reports no performance metrics, no baseline comparisons (e.g., transcript TF-IDF, visual features, or duration-only models), no error analysis, and no details on how interaction peaks were labeled or how post-hoc exclusions were avoided in the 77M-event dataset.
  2. [Interpretation / CAV analysis] The interpretation pipeline (GPT-5 feature coding + CAVs) shares the same broad model family as the MLLM embeddings used for prediction; without reported ablation against non-MLLM baselines or correlation checks between GPT-5 codes and human-coded cognitive-load ratings on the same segments, the claim that the embeddings 'encode interpretable, theory-relevant instructional concepts' risks circularity and model-specific artifacts.
  3. [Method / Evaluation] The central assumption that MLLM embeddings of short segments preserve signals of cognitive processing and instructional design quality is load-bearing for both the prediction and generalization claims, yet the manuscript provides no direct validation against established proxies (human ratings, eye-tracking, or established cognitive-load instruments) on the same video segments.
minor comments (1)
  1. [Abstract] The abstract states evaluation on 66 courses but does not specify how many courses were held out for the cross-field generalization test or whether the held-out fields were truly disjoint in topic and terminology.
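
The agreement check requested in major comment 2, comparing GPT-5 codes with human codes on the same segments, is commonly done with a chance-corrected statistic such as Cohen's kappa when the codes are categorical. This page does not say which agreement statistic the paper reports, so the choice of kappa here is an assumption, and `cohens_kappa` is a hypothetical helper for illustration only.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same non-empty item list")
    n = len(codes_a)
    observed = sum(1 for a, b in zip(codes_a, codes_b) if a == b) / n
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Agreement expected if each rater assigned categories independently
    # according to their own marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(gpt_codes, human_codes)` could be computed per feature (formula present, instructor visible, and so on), so that weakly agreed features can be flagged before they are used to build concept vectors.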

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We have addressed each major comment point by point below, making revisions to the manuscript where the concerns are valid and providing clarifications or additional analyses where appropriate. Our responses focus on strengthening the presentation of results and limitations without altering the core claims.

  1. Referee: [Abstract] The claim that 'classifiers based on MLLM embeddings reliably predict interaction peaks' and 'generalize to unseen academic fields' is unsupported because the abstract (and the provided evaluation description) reports no performance metrics, no baseline comparisons (e.g., transcript TF-IDF, visual features, or duration-only models), no error analysis, and no details on how interaction peaks were labeled or how post-hoc exclusions were avoided in the 77M-event dataset.

    Authors: We agree that the abstract would be strengthened by including key quantitative results and methodological details to support the claims. The evaluation section of the manuscript reports performance metrics (including AUC, precision, recall, and F1 scores) along with cross-field generalization results, but these were not summarized in the abstract. We have revised the abstract to include specific metrics and a brief statement on baseline performance. We have also expanded the methods section with details on peak labeling from the 77M events, preprocessing steps, and any exclusions applied. An error analysis subsection has been added to the results. revision: yes

  2. Referee: [Interpretation / CAV analysis] The interpretation pipeline (GPT-5 feature coding + CAVs) shares the same broad model family as the MLLM embeddings used for prediction; without reported ablation against non-MLLM baselines or correlation checks between GPT-5 codes and human-coded cognitive-load ratings on the same segments, the claim that the embeddings 'encode interpretable, theory-relevant instructional concepts' risks circularity and model-specific artifacts.

    Authors: This is a fair concern regarding potential circularity. The MLLM embeddings and GPT-5 coding are related but not identical; however, to strengthen the interpretation, we have added an ablation comparing CAVs from MLLM embeddings against those derived from non-MLLM features (transcript TF-IDF and basic visual descriptors). We have also included a small human validation study on a subset of segments, reporting correlations between GPT-5-coded features and independent human ratings of cognitive load. These additions are now in the revised interpretation section and reduce the risk of model-specific artifacts. revision: partial

  3. Referee: [Method / Evaluation] The central assumption that MLLM embeddings of short segments preserve signals of cognitive processing and instructional design quality is load-bearing for both the prediction and generalization claims, yet the manuscript provides no direct validation against established proxies (human ratings, eye-tracking, or established cognitive-load instruments) on the same video segments.

    Authors: We acknowledge that direct validation against human ratings, eye-tracking, or standardized cognitive-load instruments on the identical segments would provide stronger evidence. Such data was not collected in this large-scale study of 66 courses. Our current validation relies on predictive performance on held-out courses, cross-field generalization, and alignment with multimedia learning theory via the coded features. We have expanded the limitations and discussion sections to explicitly address this gap, including the rationale for using population-level interaction peaks as scalable proxies and suggestions for future targeted validations. revision: partial

standing simulated objections not resolved
  • Direct validation of the MLLM embeddings against eye-tracking data or established cognitive-load instruments on the same video segments, as this would require new data collection beyond the scope of the current 77M-event dataset.

Circularity Check

0 steps flagged

No significant circularity detected

The described pipeline trains a neural classifier on MLLM embeddings of video segments to predict real population-level interaction peaks drawn from 77 million external events across 66 courses. Evaluation includes generalization to unseen academic fields via held-out testing. GPT-5 feature coding is applied only for post-hoc CAV-based interpretation of the trained model and is not part of the predictive training loop or input definition. No equations, self-citations, or renamings reduce any claimed result to its own inputs by construction; the derivation chain remains empirically grounded in independent interaction data rather than tautological or self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Approach rests on standard assumptions about embedding quality and theory fidelity; no explicit free parameters or invented entities named beyond typical neural network weights.

free parameters (1)
  • neural classifier parameters
    Weights and hyperparameters of the classifier trained on MLLM embeddings; details not specified in abstract.
axioms (2)
  • domain assumption MLLM embeddings capture semantic and visual cues relevant to cognitive load and instructional design
    Invoked when using embeddings directly for interaction prediction.
  • domain assumption GPT-5 feature coding faithfully represents multimedia learning theory concepts without introducing LLM-specific biases
    Used as basis for CAV interpretation of model predictions.

