Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues
Pith reviewed 2026-05-15 08:23 UTC · model grok-4.3
The pith
Cognitive load cues from speaker utterances and listener gaze can classify a listener's moment-by-moment understanding states during explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In explanatory interactions, the information value and syntactic complexity of speaker utterances together with variation in listener gaze behavior covary with the listener's self-reported states of understanding, and these cues improve automatic prediction of the four states over text-only baselines.
What carries the argument
Three cognitive load-related linguistic cues: speaker utterance surprisal (information value), speaker utterance syntactic complexity, and variation in listener interactive gaze behavior.
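As a rough illustration of the first cue, surprisal is the negative log-probability of a word given its context. The abstract says information value is operationalised with surprisal but this page does not state which language model supplies the probabilities, so the add-one-smoothed bigram model below is only a stand-in. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context counts, bigram counts and the vocabulary (toy stand-in model)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens          # "<s>" marks the utterance start
        vocab.update(tokens)
        contexts.update(padded[:-1])       # every token that precedes another token
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams, vocab

def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    contexts, bigrams, vocab = model
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(p)

def mean_surprisal(tokens, model):
    """Average per-token surprisal of one utterance."""
    padded = ["<s>"] + tokens
    values = [surprisal(p, w, model) for p, w in zip(padded, padded[1:])]
    return sum(values) / len(values)

# Hypothetical toy corpus of explainer utterances (not MUNDEX data).
corpus = [
    ["you", "move", "the", "piece"],
    ["you", "draw", "a", "card"],
    ["you", "move", "a", "piece"],
]
model = train_bigram(corpus)
familiar = mean_surprisal(["you", "move", "the", "piece"], model)
scrambled = mean_surprisal(["card", "the", "you", "draw"], model)
```

An utterance whose word order matches the training data receives a lower mean surprisal than a scrambled one, which is the sense in which surprisal is read as a proxy for processing effort.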
If this is right
- Real-time monitoring of understanding becomes feasible in spoken explanatory systems without requiring explicit listener feedback.
- Dialogue agents could adapt their explanations on the fly when the cues indicate partial understanding or misunderstanding.
- The same cues may generalize to other task-oriented dialogues beyond board-game explanations.
Where Pith is reading between the lines
- If gaze variation proves robust across languages and settings, it could serve as a low-cost sensor for understanding in video-based tutoring platforms.
- The approach opens the possibility of training models that detect when an explanation has failed before the listener verbally signals confusion.
- Combining these cues with existing multimodal models might reduce the amount of labeled dialogue data needed for understanding prediction.
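To make "gaze variation" concrete, the sketch below computes two simple statistics over a window of gaze-target labels: the number of target shifts and the Shannon entropy of the target distribution. This is an assumption for illustration only. The page does not say how the paper quantifies gaze variation, and the target labels and function names here are hypothetical.

```python
import math
from collections import Counter

def gaze_shift_count(targets):
    """Number of times the gaze target changes between consecutive frames."""
    return sum(1 for a, b in zip(targets, targets[1:]) if a != b)

def gaze_entropy(targets):
    """Shannon entropy (bits) of the gaze-target distribution in the window."""
    counts = Counter(targets)
    n = len(targets)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical frame-level gaze targets for two listener windows.
steady = ["partner"] * 8
varied = ["partner", "board", "partner", "away", "board", "partner", "away", "board"]

steady_stats = (gaze_shift_count(steady), gaze_entropy(steady))
varied_stats = (gaze_shift_count(varied), gaze_entropy(varied))
```

Either statistic could be fed to a classifier as a per-window feature. Which one tracks understanding better is an empirical question this sketch does not answer.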
Load-bearing premise
Retrospective video-recall self-annotations by listeners accurately capture their actual states of understanding at each moment in the original live interaction.
What would settle it
A side-by-side comparison in which listeners provide real-time button presses or physiological markers during a new set of explanations and the model predictions are checked against those immediate measures rather than post-hoc recall.
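One way such a comparison could be scored is chance-corrected agreement between the retrospective labels and the real-time labels for the same segments. The sketch below implements Cohen's kappa in plain Python. The choice of kappa, the label abbreviations, and the toy sequences are assumptions for illustration, not part of the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two label sources.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources use a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical segment labels: U = Understanding, PU = Partial Understanding,
# NU = Non-Understanding, MU = Misunderstanding.
retrospective = ["U", "U", "PU", "NU", "U", "MU", "PU", "U"]
real_time = ["U", "PU", "PU", "NU", "U", "NU", "PU", "U"]
kappa = cohens_kappa(retrospective, real_time)
```

A kappa near 1 would support the load-bearing premise, while a low value would suggest that recall-based labels drift from what listeners report in the moment.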
Original abstract
We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that three cognitive-load-related linguistic cues—speaker utterance surprisal, syntactic complexity, and listener gaze variation—correlate with listeners' self-annotated states of understanding (Understanding, Partial Understanding, Non-Understanding, Misunderstanding) in face-to-face explanatory dialogues from the MUNDEX corpus. Statistical analyses show individual cues vary with these labels, and classification experiments using off-the-shelf classifiers plus a fine-tuned German BERT multimodal model demonstrate that adding the cues to textual features improves prediction of the four states.
Significance. If the retrospective labels accurately reflect moment-by-moment understanding, the work offers a multimodal approach to real-time comprehension monitoring with potential value for dialogue systems and educational applications. The use of naturalistic data and integration of verbal/nonverbal cues is a constructive contribution; however, the absence of label validation limits the strength of the empirical claims and their generalizability.
major comments (2)
- [Abstract] Abstract: the claim that 'prediction ... improves when the three linguistic cues are considered alongside textual features' is presented without effect sizes, baseline accuracies, cross-validation details, or error analysis, making it impossible to assess whether the reported gains are practically meaningful or statistically robust.
- [Methods / Abstract] Labeling procedure (described in Abstract and Methods): all four understanding states are derived exclusively from listeners' retrospective video-recall annotations; no inter-rater reliability, concurrent annotation protocol, physiological validation, or quantification of recall bias is reported. This label noise directly affects both the statistical cue analyses and the classification results, so the central claim that the cues predict 'states of understanding' rests on an unverified assumption.
minor comments (2)
- [Abstract] The abstract should report the number of dialogues, participants, and total annotated utterances in the MUNDEX corpus to allow readers to gauge the scale of the study.
- [Classification experiment] Clarify whether the off-the-shelf classifiers and the BERT model use the same train/test splits and feature sets when comparing 'textual features alone' versus 'textual + cues'.
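The comparison these comments ask for can be sketched as follows: score the text-only and text-plus-cues predictions on identical test folds and report the per-fold macro-F1 difference. This is a minimal sketch under stated assumptions. The labels and predictions are made-up toy values rather than results from the paper, the helper names are hypothetical, and the interleaved fold assignment is a simplification (a real study would likely use stratified folds).

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def fold_indices(n_items, n_folds):
    """Interleaved test folds; the same folds are reused for every model."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def f1_delta_per_fold(gold, pred_text, pred_text_cues, labels, n_folds):
    """Macro-F1 gain of text+cues over text-only, fold by fold."""
    deltas = []
    for idx in fold_indices(len(gold), n_folds):
        g = [gold[i] for i in idx]
        base = macro_f1(g, [pred_text[i] for i in idx], labels)
        full = macro_f1(g, [pred_text_cues[i] for i in idx], labels)
        deltas.append(full - base)
    return deltas

LABELS = ["U", "PU", "NU", "MU"]  # the four understanding states, abbreviated
# Toy data for illustration only.
gold = ["U", "U", "PU", "PU", "NU", "NU", "MU", "MU"]
pred_text = ["U", "U", "U", "PU", "NU", "U", "MU", "U"]
pred_text_cues = ["U", "U", "PU", "PU", "NU", "U", "MU", "MU"]
deltas = f1_delta_per_fold(gold, pred_text, pred_text_cues, LABELS, n_folds=2)
```

Reporting the per-fold deltas, rather than a single pooled score, makes it easier to see whether a gain is consistent across splits or driven by one fold.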
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve clarity and transparency.
- Referee: [Abstract] Abstract: the claim that 'prediction ... improves when the three linguistic cues are considered alongside textual features' is presented without effect sizes, baseline accuracies, cross-validation details, or error analysis, making it impossible to assess whether the reported gains are practically meaningful or statistically robust.
  Authors: We agree that the abstract would benefit from these details to allow proper evaluation of the improvements. In the revised version, we will expand the abstract to report baseline accuracies (text-only models), the magnitude of gains when adding the three cues (e.g., F1 deltas), the cross-validation procedure used, and a brief summary of error patterns. This will make the practical and statistical significance of the results more transparent without exceeding abstract length constraints.
  revision: yes
- Referee: [Methods / Abstract] Labeling procedure (described in Abstract and Methods): all four understanding states are derived exclusively from listeners' retrospective video-recall annotations; no inter-rater reliability, concurrent annotation protocol, physiological validation, or quantification of recall bias is reported. This label noise directly affects both the statistical cue analyses and the classification results, so the central claim that the cues predict 'states of understanding' rests on an unverified assumption.
  Authors: We partially agree and will strengthen the presentation of this limitation. The labels are self-reported via retrospective video-recall, a method chosen to capture moment-by-moment states without interrupting the natural dialogue; inter-rater reliability does not apply to self-annotations. The MUNDEX corpus contains no physiological signals, precluding validation of that type. In revision we will (i) detail the video-recall protocol more explicitly, (ii) discuss and where possible quantify recall bias, (iii) cite relevant literature on the validity of video-recall for understanding states, and (iv) add an explicit limitations paragraph. We maintain that the observed statistical correlations and classification gains still provide evidence that the cues are predictive of the self-reported states, while acknowledging the assumption inherent in the labeling.
  revision: partial
- Absence of physiological validation or concurrent annotation data, which cannot be supplied without new data collection outside the existing MUNDEX corpus.
Circularity Check
No circularity: empirical ML classification on independent annotations
The paper is a data-driven empirical study. It extracts linguistic features (surprisal, syntactic complexity, gaze variation) from the MUNDEX corpus, obtains listener state labels via retrospective video-recall annotation, performs statistical correlation tests, and trains off-the-shelf and BERT-based classifiers to predict the four states. No equations, parameter fitting, or derivations are present that would make any reported prediction equivalent to its inputs by construction. The MUNDEX corpus and annotations are treated as external observed data; no self-citation chain or uniqueness theorem is invoked to justify the core results. The classification accuracies are therefore independent of the paper's own definitions and can be falsified against the provided labels.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Retrospective video-recall self-annotations accurately reflect listeners' moment-by-moment understanding states.
Reference graph
Works this paper leans on
- [1] Introduction. Explanatory interactions are a type of everyday communicative activity in which an ‘explainer’ tries to explain something to an ‘explainee’. The explainee seeks to understand the explanation and common ground is continually built during the interaction (Clark and Brennan, 1991). In the grounding process, explainees frequently provide fe...
- [2] Background. 2.1. Linguistic Cues for Cognitive Load. Cognitive load is considered as the amount of working memory dedicated to problem solving (Sweller, 1988; Paas et al., 2003). Language comprehension, like many other daily tasks, constantly requires working memory capacity (Just and Carpenter, 1992). From a psycholinguistic perspective, cognitive load, so...
- [3] Dataset: MUNDEX Corpus. For our research, we used the MUNDEX corpus (Türk et al., 2023) involving dyadic explanations of how to play a board game (Figure 1). MUNDEX contains manual and automatic multimodal annotations (Buschmeier et al., 2025) of the acoustic signal (e.g., voice quality), textual descriptors (e.g., discourse functions), and nonverbal behaviour (e....
- [4] Methods. In Section 2.1, we discussed potential linguistic cues, indicating cognitive load during language comprehension, to analyse. Based on the survey, we choose the following three linguistic aspects in order to see how effective they are for predicting different understanding states: (i) semantic information conveyed in the utterances (measured using inf...
- [5] Results and Discussion. In our survey of previous studies, we discussed linguistic cues which may be related to the development of understanding during interaction. Here, we examine selected linguistic cues from the MUNDEX corpus and quantify the linguistic information based on the approach proposed in Section 4. We first present a statistical analysis of the indivi...
- [6] with the utterance data and then fuse the three selected linguistic cues into the model as a third classifier (see Figure 4). Based on earlier empirical findings that BERT tends to encode semantic and co-reference information (which is potentially related to understanding in our study) in the higher layers (see Tenney et al., 2019), we only used the last four ...
- [7] Conclusions and Future Work. We aimed at investigating the predictability of different understanding states of explainees in explanatory interaction using different cognitive load related verbal and non-verbal linguistic cues. This builds on previous work of Türk et al. (2024) that analysed the predictability of different understanding states based on two classes: Un...
- [8] Ethical Considerations and Limitations. The publicly available MUNDEX corpus does not contain personal data of the study participants. The corpus was collected with the protection of personal data in mind, and was approved by our institutional review board. Within the scope of this study, we consider the following three limitations which we would like to leave for futur...
- [9] Supplementary Material. Code and data are available on Zenodo: https://doi.org/10.5281/zenodo.19003190
- [10] Acknowledgements. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation): TRR 318/3 2026 – 438445824, project A02. We thank the participants and our colleagues and student assistants for their support in data collection, transcription, and annotation. We would also like to thank the anonymous LREC 2026 reviewers who provided constructi...
- [11] Bibliographical References. Jens Allwood, Stefan Kopp, Karl Grammer, Elisabeth Ahlsén, Elisabeth Oberzaucher, and Markus Koppensteiner. 2007. The analysis of embodied communicative feedback in multimodal corpora: A prerequisite for behaviour simulation. Language Resources and Evaluation, 41:255–272. Jens Allwood, Joakim Nivre, and Elisabeth Ahlsén
- [12] On the semantics and pragmatics of linguistic feedback. Journal of Semantics, 9:1–26. Agnes Axelsson and Gabriel Skantze. 2022. Multimodal user feedback during adaptive robot-human presentations. Frontiers in Computer Science, 3:741148:22. Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial behavior a...
- [13] Shared knowledge in natural conversations: Can entropy metrics shed light on information transfers? In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 213–227, Abu Dhabi, UAE. ACL. Dimitry Mindlin, Amelie S. Robrecht, Michael Morasch, and Philipp Cimiano. 2024. Measuring user understanding in dialogue-bas...
- [14] German BERT models (bert-base-german-dbmdz-uncased). Hugging Face. Jeff Mitchell, Mirella Lapata, Vera Demberg, and Frank Keller. 2010. Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 196–206, Uppsala, Sweden. ACL. Louis-Phili...
- [15] Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1):1–4. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthi...
- [16] Language Resource References. Hendrik Buschmeier, Angela Grimminger, Petra Wagner, Stefan Lazarov, Olcay Türk, and Yu Wang. 2025. MUNDEX annotations (version 0.7). Zenodo. Olcay Türk, Petra Wagner, Hendrik Buschmeier, Angela Grimminger, Yu Wang, and Stefan Lazarov
- [17] MUNDEX: A multimodal corpus for the study of the understanding of explanations. In Proceedings of the 1st International Multimodal Communication Symposium, pages 63–64, Barcelona, Spain