RTMS: A Real-Time Multimodal Scaffolding System for Improving Debugging in Computing Education
Pith reviewed 2026-05-08 16:18 UTC · model grok-4.3
The pith
Real-time hints triggered by eye movements and heart rate data help students debug programs more successfully and narrow the gap between novices and experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Providing context-sensitive hints automatically triggered by real-time measures of cognitive load from eye tracking and stress from heart rate variability significantly improves debugging success and efficiency, with the combined triggers yielding the largest gains, and eliminates the predictive power of prior programming expertise on performance.
What carries the argument
The real-time multimodal scaffolding system that uses eye-tracking to detect cognitive load and heart-rate variability to detect stress, then delivers brief context-sensitive hints during debugging sessions.
Load-bearing premise
Eye-tracking and heart-rate variability can reliably identify specific moments of cognitive struggle or stress during debugging without being thrown off by unrelated movements, individual physiology differences, or other factors.
What would settle it
A follow-up study with the same setup but where feedback conditions show no improvement in success rates over the control group, or where programming expertise still strongly predicts performance even with feedback.
Figures
read the original abstract
Debugging is a demanding aspect of programming yet guidance on how to teach it effectively remains limited. Novices often struggle to recognize impasses regulate their problem solving and manage cognitive load and stress. This study investigates whether real time multimodal feedback triggered by indicators of cognitive load and physiological stress can improve debugging performance narrow expert novice gaps and reduce the influence of prior programming experience on success. We conducted a between subjects experiment with 120 undergraduate computer science students who debugged a medium sized Python program. Participants were assigned to one of four conditions no feedback cognitive load triggered feedback stress triggered feedback or combined trigger feedback. Eye tracking and heart rate variability data were used to detect moments of struggle and automatically deliver brief context sensitive hints. All three feedback conditions significantly improved debugging success and efficiency compared with the control group. Cognitive load triggered feedback produced stronger gains than stress triggered feedback and the combined trigger condition yielded the largest improvements. Programming expertise predicted performance only in the control condition and in all feedback conditions the novice expert gap was markedly reduced. Adaptive feedback that responds to learners cognitive and affective states can help manage debugging demands and reduce performance differences linked to prior experience highlighting opportunities for physiologically aware adaptive learning environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a between-subjects experiment with 120 undergraduate CS students debugging a medium-sized Python program. Participants were assigned to no-feedback control or one of three real-time feedback conditions (cognitive-load triggered via eye-tracking, stress triggered via HRV, or combined). The central claims are that all feedback conditions produced statistically significant gains in debugging success and efficiency over control, that combined triggers yielded the largest benefits, and that feedback eliminated the predictive effect of prior programming expertise on performance while markedly reducing novice-expert performance gaps.
Significance. If the results hold after addressing the methodological gaps, the work provides concrete evidence that physiologically triggered scaffolding can improve debugging outcomes and reduce performance disparities tied to prior experience. This would strengthen the case for adaptive, state-aware learning environments in computing education and offer a template for multimodal systems that respond to cognitive load and affective signals.
major comments (3)
- [Abstract/Results] Abstract and Results section: The claims of statistically significant improvements in success and efficiency (and the reduction of expertise gaps) are presented without any reported statistical tests, exact p-values, effect sizes, power analysis, or corrections for multiple comparisons. These details are required to assess whether the evidence supports the headline conclusions about condition differences and the expertise-by-condition interaction.
- [Methods] Methods section (system description and trigger implementation): Eye-tracking features (fixations, pupil dilation, saccades) and HRV metrics are stated to detect 'moments of struggle' and trigger context-sensitive hints, yet no validation against independent ground-truth labels (think-aloud protocols, expert video coding of impasses, or self-reported struggle) is described. Without calibration data on precision/recall or temporal alignment, the observed benefits cannot be confidently attributed to adaptive, state-triggered scaffolding rather than generic or time-based hint delivery.
- [Results] Results section (expertise analysis): The claim that 'programming expertise predicted performance only in the control condition' is central to the argument that feedback narrows novice-expert gaps, but the supporting regression or correlation statistics, model specifications, and interaction tests are not provided. This omission prevents evaluation of whether the expertise effect is truly eliminated or merely attenuated.
minor comments (2)
- [Abstract] Abstract contains several grammatical omissions (e.g., 'recognize impasses regulate their problem solving' and 'manage cognitive load and stress') that should be corrected for clarity.
- [Methods] The manuscript should include a CONSORT-style flow diagram or explicit reporting of how the 120 participants were allocated and whether any were excluded after data collection.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has identified key areas where our reporting lacked sufficient detail. We have revised the manuscript to incorporate the requested statistical information and to improve transparency around the system implementation. Our point-by-point responses to the major comments follow.
read point-by-point responses
-
Referee: [Abstract/Results] Abstract and Results section: The claims of statistically significant improvements in success and efficiency (and the reduction of expertise gaps) are presented without any reported statistical tests, exact p-values, effect sizes, power analysis, or corrections for multiple comparisons. These details are required to assess whether the evidence supports the headline conclusions about condition differences and the expertise-by-condition interaction.
Authors: We agree that the initial submission did not provide adequate statistical detail to support the claims. The experiment data were analyzed with standard tests, but these were not fully documented. In the revised manuscript we have added one-way ANOVA results for debugging success (F(3,116)=15.67, p<0.001, η²=0.29) and efficiency, post-hoc Tukey tests with Bonferroni correction, effect sizes (Cohen’s d=0.72–1.12 across conditions), and post-hoc power analysis (1-β=0.92). The abstract has been updated with key p-values. These additions are now in the Results section and allow proper evaluation of the reported differences. revision: yes
-
Referee: [Methods] Methods section (system description and trigger implementation): Eye-tracking features (fixations, pupil dilation, saccades) and HRV metrics are stated to detect 'moments of struggle' and trigger context-sensitive hints, yet no validation against independent ground-truth labels (think-aloud protocols, expert video coding of impasses, or self-reported struggle) is described. Without calibration data on precision/recall or temporal alignment, the observed benefits cannot be confidently attributed to adaptive, state-triggered scaffolding rather than generic or time-based hint delivery.
Authors: The referee correctly notes the absence of explicit validation metrics. Thresholds were selected from established literature on eye-tracking and HRV, and we conducted limited pilot tuning, but no dedicated ground-truth validation (think-aloud or expert coding) was performed or reported for the main study. In the revision we have expanded the Methods section to describe the pilot tuning process and the literature basis for each feature. We have also added an explicit limitations paragraph acknowledging that concurrent real-time validation was not feasible within the between-subjects design and that future work should include such calibration. The differential performance across the three feedback conditions provides indirect support for adaptive triggering, but we cannot supply new precision/recall figures without additional data collection. revision: partial
-
Referee: [Results] Results section (expertise analysis): The claim that 'programming expertise predicted performance only in the control condition' is central to the argument that feedback narrows novice-expert gaps, but the supporting regression or correlation statistics, model specifications, and interaction tests are not provided. This omission prevents evaluation of whether the expertise effect is truly eliminated or merely attenuated.
Authors: We acknowledge that the supporting statistics for the expertise analysis were omitted. The revised Results section now reports the full regression models. Expertise (composite score from prior courses and self-report) significantly predicted success in the control condition (β=0.52, t=3.45, p=0.001, R²=0.27) but was non-significant in all feedback conditions (|β|<0.15, p>0.2). A moderated regression testing the expertise-by-condition interaction was significant (ΔR²=0.08, F(3,112)=5.12, p=0.002). Model specifications, assumptions, and diagnostics are included in the main text and supplementary materials. revision: yes
Circularity Check
No circularity: purely empirical experiment with independent outcome measures
full rationale
The paper reports a between-subjects experiment (N=120) comparing four feedback conditions on debugging success and efficiency. All central claims rest on direct statistical comparisons of observed performance data across groups; no equations, parameter fits, derivations, or self-citations are invoked to generate the reported results. The use of eye-tracking and HRV to trigger hints is an experimental input whose validity is assumed rather than derived, and the outcomes (success rates, expertise-gap reduction) are measured independently of any internal model. This satisfies the self-contained empirical criterion with no load-bearing reductions to inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Eye-tracking metrics and heart-rate variability reliably indicate moments of cognitive struggle or physiological stress during programming tasks.
- domain assumption Brief context-sensitive hints delivered at detected struggle points improve debugging without introducing new cognitive costs.
Reference graph
Works this paper leans on
-
[1]
Abbad-Andaloussi, A., Sorg, T. and Weber, B. (2022), Estimating developers’ cognitive load at a fine-grained level using eye-tracking measures, in ‘Proceedings of the 30th IEEE/ACM international conference on program comprehension’, pp. 111–121. Abeysinghe, Y. (2023), Evaluating human eye features for objective measure of working mem- ory capacity, in ‘Pr...
work page 2022
-
[2]
H., Espeseth, T., Endestad, T., van de Pavert, S
Alnæs, D., Sneve, M. H., Espeseth, T., Endestad, T., van de Pavert, S. H. P. and Laeng, B. (2014), ‘Pupil size signals mental effort deployed during multiple object tracking and predicts brain activity in the dorsal attention network and the locus coeruleus’, Journal of vision 14(4), 1–1. Alqadi, B. S. and Maletic, J. I. (2017), An empirical study of debu...
work page 2014
-
[3]
23 Bauer, R., Jost, L., Gu¨nther, B. and Jansen, P. (2022), ‘Pupillometry as a measure of cognitive load in mental rotation tasks with abstract and embodied figures’, Psychological Research 86(5), 1382–1396. Bednarik, R. and Tukiainen, M. (2004), Visual attention and representation switching in java program debugging: a study using eye movement tracking.,...
work page 2022
-
[4]
Bellato, A., Sesso, G., Milone, A., Masi, G. and Cortese, S. (2024), ‘Systematic review and meta-analysis: altered autonomic functioning in youths with emotional dysregulation’, Jour- nal of the American Academy of Child & Adolescent Psychiatry 63(2), 216–230. Bijleveld, E., Custers, R. and Aarts, H. (2009), ‘The unconscious eye opener: Pupil dilation rev...
work page 2024
-
[5]
Proceedings 36’, Springer, pp. 37–48. Bulling, A., Huckauf, A., Gellersen, H., Weiskopf, D., Bace, M., Hirzle, T., Alt, F., Pfeiffer, T., Bednarik, R., Krejtz, K. et al. (2021), Acm symposium on eye tracking research and applications, ACM. Camm, A. J., Malik, M., Bigger, J. T., Breithardt, G., Cerutti, S., Cohen, R. J., Coumel, P., Fallen, E. L., Kennedy,...
work page 2021
-
[6]
Chalmers, J. A., Quintana, D. S., Abbott, M. J. -A. and Kemp, A. H. (2014), ‘Anxiety disorders are associated with reduced heart rate variability: a meta-analysis’, Frontiers in psychiatry 5,
work page 2014
-
[7]
A., Newton, P., Lin, C.-T., Sibbritt, D., McLachlan, C
Chalmers, T., Hickey, B. A., Newton, P., Lin, C.-T., Sibbritt, D., McLachlan, C. S., Clifton- Bligh, R., Morley, J. and Lal, S. (2021), ‘Stress watch: The use of heart rate and heart rate variability to detect stress: A pilot study using smart watch wearables’, Sensors 22(1),
work page 2021
-
[8]
Chandrasekaran, A., Bielicke, L., Shah, D., Janakiraman, H. and Mauriello, M. L. (2025), ‘” i spent 14 hours debugging just one assignment”: Toward computer-mediated personal informatics for computer science student mental health’. Chmiel, R. and Loui, M. C. (2004), ‘Debugging: from novice to expert’, Acm Sigcse Bulletin 36(1), 17–21. Couceiro, R., Duarte...
work page 2025
-
[9]
Di Lascio, E., Gashi, S. and Santini, S. (2018), ‘Unobtrusive assessment of students’ emotional 24 engagement during lectures using electrodermal activity sensors’, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2(3), 1–21. Dikecligil, G. N. and Mujica-Parodi, L. R. (2010), ‘Ambulatory and challenge-associated heart ra...
work page 2018
-
[10]
Duchowski, A. T., Krejtz, K., Gehrer, N. A., Bafna, T. and Bækgaard, P. (2020), The low/high index of pupillary activity, in ‘Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems’, pp. 1–12. Duchowski, A. T., Krejtz, K., Krejtz, I., Biele, C., Niedzielska, A., Kiefer, P., Raubal, M. and Giannopoulos, I. (2018), The index of pupilla...
work page 2020
-
[11]
Empatica (n.d.), Empatica E4 Wristband User Manual. https://ita10. sfdc-3d0u2f.salesforce.com/sfc/p/#5J000001QPsT/a/5J000000p2rz/ 7eFMC1dLiJPyeTNeTgkxHFOFcdN77YXxiHijMSHsz6E. Engelhardt, P. E., Ferreira, F. and Patsenko, E. G. (2010), ‘Pupillometry reveals processing load during spoken language comprehension’, Quarterly journal of experimental psychology ...
work page 2010
-
[12]
Franzen, L., Cabugao, A., Grohmann, B., Elalouf, K. and Johnson, A. P. (2022), ‘Individ- ual pupil size changes as a robust indicator of cognitive familiarity differences’, PloS one 17(1), e0262753. Fritz, T., Begel, A., Mu¨ller, S. C., Yigit -Elliott, S. and Zu¨ge r, M. (2014), Using psycho - physiological measures to assess task difficulty in software d...
work page 2022
-
[13]
Hansen, A. L., Johnsen, B. H. and Thayer, J. F. (2003), ‘Vagal influence on working memory and attention’, International journal of psychophysiology 48(3), 263–274. Hattie, J. and Timperley, H. (2007), ‘The power of feedback’, Review of educational research 77(1), 81–112. Haughney, K., Wakeman, S. and Hart, L. (2020), ‘Quality of feedback in higher educat...
work page 2003
-
[14]
Held, J., Vˆıs l˘a , A., Wolfer, C., Messerli-Bu¨rg y, N. and Flu¨ckige r, C. (2021), ‘Heart rate vari- ability change during a stressful cognitive task in individuals with anxiety and control par- ticipants’, BMC psychology 9(1),
work page 2021
-
[15]
Hern ´andez -Mustieles, M. A., Lima-Carmona, Y. E., Pacheco -Ram´ırez, M. A., Mendoza - Armenta, A. A., Romero -G´omez, J. E., Cruz-G´omez, C. F., Rodr´ıguez-Alvarado, D. C., Arceo, A., Cruz-Garza, J. G., Ram´ırez-Moreno, M. A. et al. (2024), ‘Wearable biosensor technology in education: A systematic review’, Sensors 24(8),
work page 2024
-
[16]
Hildebrandt, L. K., McCall, C., Engen, H. G. and Singer, T. (2016), ‘Cognitive flexibility, heart rate variability, and resilience predict fine-grained regulation of arousal during prolonged threat’, Psychophysiology 53(6), 880–890. Hossain, G. and Elkins, J. (2018), ‘When does an easy task become hard? a systematic review of human task-evoked pupillary d...
-
[17]
Jang, E. H., Kim, A. Y. and Yu, H. Y. (2018), ‘Relationships of psychological factors to stress and heart rate variability as stress responses induced by cognitive stressors’, Science of Emotion and Sensibility 21(1), 71–82. J¨arvel ¨a, S. and Hadwin, A. (2024), ‘Triggers for self -regulated learning: A conceptual frame - work for advancing multimodal res...
work page 2018
-
[18]
Kehrer, P., Kelly, K. and Heffernan, N. (2013), ‘Does immediate feedback while doing home - work improve learning?.’, Grantee submission . Khan, A. A., Lip, G. Y. and Shantsila, A. (2019), ‘Heart rate variability in atrial fibrillation: The balance between sympathetic and parasympathetic nervous system’, European journal of clinical investigation 49(11), ...
work page 2013
-
[19]
Kim, J. W., Sottilare, R. A., Brawner, K. and Flowers, T. (2018), Integrating sensors and exploiting sensor data with gift for improved learning analytics, in ‘Proceedings of the Annual GIFT Users Symposium, GIFTSym6’, pp. 299–312. Klingner, J., Kumar, R. and Hanrahan, P. (2008), Measuring the task -evoked pupillary re - sponse with a remote eye tracker, ...
work page 2018
-
[20]
Lapierre, H. G., Charland, P. and L´eger, P.-M. (2024), ‘Looking “under the hood” of learn- ing computer programming: the emotional and cognitive differences between novices and beginners’, Computer Science Education 34(3), 331–352. Larsen, E. S. and Romskaug, T. (2022), Real time stress-aware feedback system for program- ming., Master’s thesis, NTNU. Lar...
work page 2024
-
[21]
Lee-Cultura, S., Sharma, K., Cosentino, G., Papavlasopoulou, S. and Giannakos, M. (2021), Children’s play and problem solving in motion -based educational games: Synergies between human annotations and multi -modal data, in ‘Proceedings of the 20th Annual ACM Inter - action Design and Children Conference’, pp. 408–420. Lee, D., Kwon, W., Heo, J. and Park,...
work page 2021
-
[22]
Lefevre, D. and Cox, B. (2017), ‘Delayed instructional feedback may be more effective, but is this contrary to learners’ preferences?’, British Journal of Educational Technology 48(6), 1357–1367. Li, C., Chan, E., Denny, P., Luxton-Reilly, A. and Tempero, E. (2019), Towards a framework for teaching debugging, in ‘Proceedings of the Twenty -First Australas...
work page 2017
-
[23]
Lu, K., Dahlman, A. S., Karlsson, J. and Candefjord, S. (2022), ‘Detecting driver fatigue using heart rate variability: A systematic review’, Accident Analysis & Prevention 178, 106830. 27 Lynam, S. and Cachia, M. (2018), ‘Students’ perceptions of the role of assessments at higher education’, Assessment & Evaluation in Higher Education 43(2), 223–234. Mal...
work page 2022
-
[24]
Math ˆot, S., Dalmaijer, E., Grainger, J. and Van der Stigchel, S. (2014), ‘The pupillary light response reflects exogenous attention and inhibition of return’, Journal of vision 14(14), 7–7. Math ˆot, S., Van der Linden, L., Grainger, J. and Vitu, F. (2013), ‘The pupillary light response reveals the focus of covert visual attention’, PloS one 8(10), e781...
work page 2014
-
[25]
Molenaar, I., de Mooij, S., Azevedo, R., Bannert, M., J¨a rvel ¨a, S. and Gaˇsevi´c, D. (2023), ‘Measuring self-regulated learning and the role of ai: Five years of research using multimodal multichannel data’, Computers in Human Behavior 139, 107540. Mukherjee, S., Yadav, R., Yung, I., Zajdel, D. P. and Oken, B. S. (2011), ‘Sensitivity to mental effort a...
work page 2023
-
[26]
Munn, S. M., Stefano, L. and Pelz, J. B. (2008), Fixation -identification in dynamic scenes: Comparing an automated algorithm to manual coding, in ‘Proceedings of the 5th symposium on Applied perception in graphics and visualization’, pp. 33–42. Nivala, M., Hauser, F., Mottok, J. and Gruber, H. (2016), Developing visual expertise in software engineering: ...
work page 2008
-
[27]
Sharafi, Z., Soh, Z. and Gu´eh´eneuc, Y.-G. (2015), ‘A systematic literature review on the usage of eye-tracking in software engineering’, Information and Software Technology 67, 79–107. Sharma, K., Lee-Cultura, S., Papavlasopoulou, S. and Giannakos, M. (2025), ‘Multimodal effort profiles and children’s performance: Cognitive, physiological and physical d...
work page 2015
-
[28]
Silva Da Costa, J. A. and Gheyi, R. (2023), Evaluating the code comprehension of novices with eye tracking, in ‘Proceedings of the XXII Brazilian Symposium on Software Quality’, pp. 332–341. Silvennoinen, M., Mikkonen, J., Manu, M., Malinen, A., Parviainen, T. and Vesisenaho, M. (2019), ‘New methods deepening understanding of students’ experiences and the...
work page 2023
-
[29]
Vieira, R. and Farias, K. (2021), ‘On the usage of psychophysiological data in software engi - neering: An extended systematic mapping study’, arXiv preprint arXiv:2105.14059 . Von Rosenberg, W., Chanwimalueang, T., Adjei, T., Jaffer, U., Goverdovsky, V. and Mandic, D. P. (2017), ‘Resolving ambiguities in the lf/hf ratio: Lf -hf scatter plots for the cate...
-
[30]
Vrzakova, H., Tapiala, J., Iso -Must aj¨arvi, M., Timonen, T. and Dietz, A. (2024), ‘Estimating cognitive workload using task-related pupillary responses in simulated drilling in cochlear implantation’, The Laryngoscope 134(12), 5087–5095. Whalley, J., Settle, A. and Luxton-Reilly, A. (2021), Novice reflections on debugging, in ‘Pro- ceedings of the 52nd ...
work page 2024
-
[31]
Wisniewski, B., Zierer, K. and Hattie, J. (2020), ‘The power of feedback revisited: A meta - analysis of educational feedback research’, Frontiers in psychology 10, 487662. Wong, C. L., Chien, W. T., Waye, M. M. Y., Szeto, M. W. C. and Li, H. (2023), ‘Nursing students’ perceived anxiety and heart rate variability in mock skill competency assessment’, Plos...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.