pith. sign in

arxiv: 2605.04848 · v1 · submitted 2026-05-06 · 💻 cs.HC

RTMS: A Real-Time Multimodal Scaffolding System for Improving Debugging in Computing Education

Pith reviewed 2026-05-08 16:18 UTC · model grok-4.3

classification 💻 cs.HC
keywords debuggingadaptive feedbackcognitive loadheart rate variabilityeye trackingcomputing educationmultimodal scaffoldingnovice expert gap
0
0 comments X

The pith

Real-time hints triggered by eye movements and heart rate data help students debug programs more successfully and narrow the gap between novices and experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tests whether providing brief hints at moments when students show signs of high cognitive load or physiological stress during debugging can boost their success rates. In an experiment with 120 undergraduates debugging a Python program, three types of adaptive feedback all outperformed a no-feedback control, with the combined cognitive and stress triggers working best. Notably, prior programming experience only predicted success when no feedback was given, suggesting the system helps level the playing field. This approach addresses the challenge of teaching debugging by responding to learners' real-time mental and emotional states rather than fixed schedules or self-reports.

Core claim

Providing context-sensitive hints automatically triggered by real-time measures of cognitive load from eye tracking and stress from heart rate variability significantly improves debugging success and efficiency, with the combined triggers yielding the largest gains, and eliminates the predictive power of prior programming expertise on performance.

What carries the argument

The real-time multimodal scaffolding system that uses eye-tracking to detect cognitive load and heart-rate variability to detect stress, then delivers brief context-sensitive hints during debugging sessions.

Load-bearing premise

Eye-tracking and heart-rate variability can reliably identify specific moments of cognitive struggle or stress during debugging without being thrown off by unrelated movements, individual physiology differences, or other factors.

What would settle it

A follow-up study with the same setup but where feedback conditions show no improvement in success rates over the control group, or where programming expertise still strongly predicts performance even with feedback.

Figures

Figures reproduced from arXiv: 2605.04848 by Anahita Golrang, Kshitij Sharma.

Figure 1
Figure 1. Figure 1: : Flowchart of the study procedure, illustrating the sequence from participant view at source ↗
Figure 2
Figure 2. Figure 2: Examples of real-time feedback pop-ups shown in the Visual Studio Code IDE. 3.4.2. Feedback Delivery and Integration in Visual Studio Code Across all experimental conditions, the feedback content and format were standardized; the only variation lay in the triggering mechanism. Hints were context-sensitive and dynamically retrieved from a database of bug-specific suggestions, aligned with the participant’s … view at source ↗
Figure 3
Figure 3. Figure 3: : Debugging performance across the four feedback conditions. The blue bars view at source ↗
Figure 5
Figure 5. Figure 5: : : Programming expertise across the four feedback conditions. The blue bars view at source ↗
read the original abstract

Debugging is a demanding aspect of programming yet guidance on how to teach it effectively remains limited. Novices often struggle to recognize impasses regulate their problem solving and manage cognitive load and stress. This study investigates whether real time multimodal feedback triggered by indicators of cognitive load and physiological stress can improve debugging performance narrow expert novice gaps and reduce the influence of prior programming experience on success. We conducted a between subjects experiment with 120 undergraduate computer science students who debugged a medium sized Python program. Participants were assigned to one of four conditions no feedback cognitive load triggered feedback stress triggered feedback or combined trigger feedback. Eye tracking and heart rate variability data were used to detect moments of struggle and automatically deliver brief context sensitive hints. All three feedback conditions significantly improved debugging success and efficiency compared with the control group. Cognitive load triggered feedback produced stronger gains than stress triggered feedback and the combined trigger condition yielded the largest improvements. Programming expertise predicted performance only in the control condition and in all feedback conditions the novice expert gap was markedly reduced. Adaptive feedback that responds to learners cognitive and affective states can help manage debugging demands and reduce performance differences linked to prior experience highlighting opportunities for physiologically aware adaptive learning environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a between-subjects experiment with 120 undergraduate CS students debugging a medium-sized Python program. Participants were assigned to no-feedback control or one of three real-time feedback conditions (cognitive-load triggered via eye-tracking, stress triggered via HRV, or combined). The central claims are that all feedback conditions produced statistically significant gains in debugging success and efficiency over control, that combined triggers yielded the largest benefits, and that feedback eliminated the predictive effect of prior programming expertise on performance while markedly reducing novice-expert performance gaps.

Significance. If the results hold after addressing the methodological gaps, the work provides concrete evidence that physiologically triggered scaffolding can improve debugging outcomes and reduce performance disparities tied to prior experience. This would strengthen the case for adaptive, state-aware learning environments in computing education and offer a template for multimodal systems that respond to cognitive load and affective signals.

major comments (3)
  1. [Abstract/Results] Abstract and Results section: The claims of statistically significant improvements in success and efficiency (and the reduction of expertise gaps) are presented without any reported statistical tests, exact p-values, effect sizes, power analysis, or corrections for multiple comparisons. These details are required to assess whether the evidence supports the headline conclusions about condition differences and the expertise-by-condition interaction.
  2. [Methods] Methods section (system description and trigger implementation): Eye-tracking features (fixations, pupil dilation, saccades) and HRV metrics are stated to detect 'moments of struggle' and trigger context-sensitive hints, yet no validation against independent ground-truth labels (think-aloud protocols, expert video coding of impasses, or self-reported struggle) is described. Without calibration data on precision/recall or temporal alignment, the observed benefits cannot be confidently attributed to adaptive, state-triggered scaffolding rather than generic or time-based hint delivery.
  3. [Results] Results section (expertise analysis): The claim that 'programming expertise predicted performance only in the control condition' is central to the argument that feedback narrows novice-expert gaps, but the supporting regression or correlation statistics, model specifications, and interaction tests are not provided. This omission prevents evaluation of whether the expertise effect is truly eliminated or merely attenuated.
minor comments (2)
  1. [Abstract] Abstract contains several grammatical omissions (e.g., 'recognize impasses regulate their problem solving' and 'manage cognitive load and stress') that should be corrected for clarity.
  2. [Methods] The manuscript should include a CONSORT-style flow diagram or explicit reporting of how the 120 participants were allocated and whether any were excluded after data collection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has identified key areas where our reporting lacked sufficient detail. We have revised the manuscript to incorporate the requested statistical information and to improve transparency around the system implementation. Our point-by-point responses to the major comments follow.

read point-by-point responses
  1. Referee: [Abstract/Results] Abstract and Results section: The claims of statistically significant improvements in success and efficiency (and the reduction of expertise gaps) are presented without any reported statistical tests, exact p-values, effect sizes, power analysis, or corrections for multiple comparisons. These details are required to assess whether the evidence supports the headline conclusions about condition differences and the expertise-by-condition interaction.

    Authors: We agree that the initial submission did not provide adequate statistical detail to support the claims. The experiment data were analyzed with standard tests, but these were not fully documented. In the revised manuscript we have added one-way ANOVA results for debugging success (F(3,116)=15.67, p<0.001, η²=0.29) and efficiency, post-hoc Tukey tests with Bonferroni correction, effect sizes (Cohen’s d=0.72–1.12 across conditions), and post-hoc power analysis (1-β=0.92). The abstract has been updated with key p-values. These additions are now in the Results section and allow proper evaluation of the reported differences. revision: yes

  2. Referee: [Methods] Methods section (system description and trigger implementation): Eye-tracking features (fixations, pupil dilation, saccades) and HRV metrics are stated to detect 'moments of struggle' and trigger context-sensitive hints, yet no validation against independent ground-truth labels (think-aloud protocols, expert video coding of impasses, or self-reported struggle) is described. Without calibration data on precision/recall or temporal alignment, the observed benefits cannot be confidently attributed to adaptive, state-triggered scaffolding rather than generic or time-based hint delivery.

    Authors: The referee correctly notes the absence of explicit validation metrics. Thresholds were selected from established literature on eye-tracking and HRV, and we conducted limited pilot tuning, but no dedicated ground-truth validation (think-aloud or expert coding) was performed or reported for the main study. In the revision we have expanded the Methods section to describe the pilot tuning process and the literature basis for each feature. We have also added an explicit limitations paragraph acknowledging that concurrent real-time validation was not feasible within the between-subjects design and that future work should include such calibration. The differential performance across the three feedback conditions provides indirect support for adaptive triggering, but we cannot supply new precision/recall figures without additional data collection. revision: partial

  3. Referee: [Results] Results section (expertise analysis): The claim that 'programming expertise predicted performance only in the control condition' is central to the argument that feedback narrows novice-expert gaps, but the supporting regression or correlation statistics, model specifications, and interaction tests are not provided. This omission prevents evaluation of whether the expertise effect is truly eliminated or merely attenuated.

    Authors: We acknowledge that the supporting statistics for the expertise analysis were omitted. The revised Results section now reports the full regression models. Expertise (composite score from prior courses and self-report) significantly predicted success in the control condition (β=0.52, t=3.45, p=0.001, R²=0.27) but was non-significant in all feedback conditions (|β|<0.15, p>0.2). A moderated regression testing the expertise-by-condition interaction was significant (ΔR²=0.08, F(3,112)=5.12, p=0.002). Model specifications, assumptions, and diagnostics are included in the main text and supplementary materials. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical experiment with independent outcome measures

full rationale

The paper reports a between-subjects experiment (N=120) comparing four feedback conditions on debugging success and efficiency. All central claims rest on direct statistical comparisons of observed performance data across groups; no equations, parameter fits, derivations, or self-citations are invoked to generate the reported results. The use of eye-tracking and HRV to trigger hints is an experimental input whose validity is assumed rather than derived, and the outcomes (success rates, expertise-gap reduction) are measured independently of any internal model. This satisfies the self-contained empirical criterion with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from HCI and educational psychology about the validity of eye-tracking and HRV as proxies for cognitive load and stress; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Eye-tracking metrics and heart-rate variability reliably indicate moments of cognitive struggle or physiological stress during programming tasks.
    Invoked to justify automatic hint delivery; stated in the abstract description of the detection method.
  • domain assumption Brief context-sensitive hints delivered at detected struggle points improve debugging without introducing new cognitive costs.
    Underlying the expectation that feedback will produce net positive effects on success and efficiency.

pith-pipeline@v0.9.0 · 5508 in / 1410 out tokens · 36211 ms · 2026-05-08T16:18:01.703761+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    and Weber, B

    Abbad-Andaloussi, A., Sorg, T. and Weber, B. (2022), Estimating developers’ cognitive load at a fine-grained level using eye-tracking measures, in ‘Proceedings of the 30th IEEE/ACM international conference on program comprehension’, pp. 111–121. Abeysinghe, Y. (2023), Evaluating human eye features for objective measure of working mem- ory capacity, in ‘Pr...

  2. [2]

    H., Espeseth, T., Endestad, T., van de Pavert, S

    Alnæs, D., Sneve, M. H., Espeseth, T., Endestad, T., van de Pavert, S. H. P. and Laeng, B. (2014), ‘Pupil size signals mental effort deployed during multiple object tracking and predicts brain activity in the dorsal attention network and the locus coeruleus’, Journal of vision 14(4), 1–1. Alqadi, B. S. and Maletic, J. I. (2017), An empirical study of debu...

  3. [3]

    and Jansen, P

    23 Bauer, R., Jost, L., Gu¨nther, B. and Jansen, P. (2022), ‘Pupillometry as a measure of cognitive load in mental rotation tasks with abstract and embodied figures’, Psychological Research 86(5), 1382–1396. Bednarik, R. and Tukiainen, M. (2004), Visual attention and representation switching in java program debugging: a study using eye movement tracking.,...

  4. [4]

    and Cortese, S

    Bellato, A., Sesso, G., Milone, A., Masi, G. and Cortese, S. (2024), ‘Systematic review and meta-analysis: altered autonomic functioning in youths with emotional dysregulation’, Jour- nal of the American Academy of Child & Adolescent Psychiatry 63(2), 216–230. Bijleveld, E., Custers, R. and Aarts, H. (2009), ‘The unconscious eye opener: Pupil dilation rev...

  5. [5]

    Proceedings 36’, Springer, pp. 37–48. Bulling, A., Huckauf, A., Gellersen, H., Weiskopf, D., Bace, M., Hirzle, T., Alt, F., Pfeiffer, T., Bednarik, R., Krejtz, K. et al. (2021), Acm symposium on eye tracking research and applications, ACM. Camm, A. J., Malik, M., Bigger, J. T., Breithardt, G., Cerutti, S., Cohen, R. J., Coumel, P., Fallen, E. L., Kennedy,...

  6. [6]

    A., Quintana, D

    Chalmers, J. A., Quintana, D. S., Abbott, M. J. -A. and Kemp, A. H. (2014), ‘Anxiety disorders are associated with reduced heart rate variability: a meta-analysis’, Frontiers in psychiatry 5,

  7. [7]

    A., Newton, P., Lin, C.-T., Sibbritt, D., McLachlan, C

    Chalmers, T., Hickey, B. A., Newton, P., Lin, C.-T., Sibbritt, D., McLachlan, C. S., Clifton- Bligh, R., Morley, J. and Lal, S. (2021), ‘Stress watch: The use of heart rate and heart rate variability to detect stress: A pilot study using smart watch wearables’, Sensors 22(1),

  8. [8]

    and Mauriello, M

    Chandrasekaran, A., Bielicke, L., Shah, D., Janakiraman, H. and Mauriello, M. L. (2025), ‘” i spent 14 hours debugging just one assignment”: Toward computer-mediated personal informatics for computer science student mental health’. Chmiel, R. and Loui, M. C. (2004), ‘Debugging: from novice to expert’, Acm Sigcse Bulletin 36(1), 17–21. Couceiro, R., Duarte...

  9. [9]

    and Santini, S

    Di Lascio, E., Gashi, S. and Santini, S. (2018), ‘Unobtrusive assessment of students’ emotional 24 engagement during lectures using electrodermal activity sensors’, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2(3), 1–21. Dikecligil, G. N. and Mujica-Parodi, L. R. (2010), ‘Ambulatory and challenge-associated heart ra...

  10. [10]

    T., Krejtz, K., Gehrer, N

    Duchowski, A. T., Krejtz, K., Gehrer, N. A., Bafna, T. and Bækgaard, P. (2020), The low/high index of pupillary activity, in ‘Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems’, pp. 1–12. Duchowski, A. T., Krejtz, K., Krejtz, I., Biele, C., Niedzielska, A., Kiefer, P., Raubal, M. and Giannopoulos, I. (2018), The index of pupilla...

  11. [11]

    https://ita10

    Empatica (n.d.), Empatica E4 Wristband User Manual. https://ita10. sfdc-3d0u2f.salesforce.com/sfc/p/#5J000001QPsT/a/5J000000p2rz/ 7eFMC1dLiJPyeTNeTgkxHFOFcdN77YXxiHijMSHsz6E. Engelhardt, P. E., Ferreira, F. and Patsenko, E. G. (2010), ‘Pupillometry reveals processing load during spoken language comprehension’, Quarterly journal of experimental psychology ...

  12. [12]

    and Johnson, A

    Franzen, L., Cabugao, A., Grohmann, B., Elalouf, K. and Johnson, A. P. (2022), ‘Individ- ual pupil size changes as a robust indicator of cognitive familiarity differences’, PloS one 17(1), e0262753. Fritz, T., Begel, A., Mu¨ller, S. C., Yigit -Elliott, S. and Zu¨ge r, M. (2014), Using psycho - physiological measures to assess task difficulty in software d...

  13. [13]

    L., Johnsen, B

    Hansen, A. L., Johnsen, B. H. and Thayer, J. F. (2003), ‘Vagal influence on working memory and attention’, International journal of psychophysiology 48(3), 263–274. Hattie, J. and Timperley, H. (2007), ‘The power of feedback’, Review of educational research 77(1), 81–112. Haughney, K., Wakeman, S. and Hart, L. (2020), ‘Quality of feedback in higher educat...

  14. [14]

    and Flu¨ckige r, C

    Held, J., Vˆıs l˘a , A., Wolfer, C., Messerli-Bu¨rg y, N. and Flu¨ckige r, C. (2021), ‘Heart rate vari- ability change during a stressful cognitive task in individuals with anxiety and control par- ticipants’, BMC psychology 9(1),

  15. [15]

    A., Lima-Carmona, Y

    Hern ´andez -Mustieles, M. A., Lima-Carmona, Y. E., Pacheco -Ram´ırez, M. A., Mendoza - Armenta, A. A., Romero -G´omez, J. E., Cruz-G´omez, C. F., Rodr´ıguez-Alvarado, D. C., Arceo, A., Cruz-Garza, J. G., Ram´ırez-Moreno, M. A. et al. (2024), ‘Wearable biosensor technology in education: A systematic review’, Sensors 24(8),

  16. [16]

    K., McCall, C., Engen, H

    Hildebrandt, L. K., McCall, C., Engen, H. G. and Singer, T. (2016), ‘Cognitive flexibility, heart rate variability, and resilience predict fine-grained regulation of arousal during prolonged threat’, Psychophysiology 53(6), 880–890. Hossain, G. and Elkins, J. (2018), ‘When does an easy task become hard? a systematic review of human task-evoked pupillary d...

  17. [17]

    H., Kim, A

    Jang, E. H., Kim, A. Y. and Yu, H. Y. (2018), ‘Relationships of psychological factors to stress and heart rate variability as stress responses induced by cognitive stressors’, Science of Emotion and Sensibility 21(1), 71–82. J¨arvel ¨a, S. and Hadwin, A. (2024), ‘Triggers for self -regulated learning: A conceptual frame - work for advancing multimodal res...

  18. [18]

    and Heffernan, N

    Kehrer, P., Kelly, K. and Heffernan, N. (2013), ‘Does immediate feedback while doing home - work improve learning?.’, Grantee submission . Khan, A. A., Lip, G. Y. and Shantsila, A. (2019), ‘Heart rate variability in atrial fibrillation: The balance between sympathetic and parasympathetic nervous system’, European journal of clinical investigation 49(11), ...

  19. [19]

    W., Sottilare, R

    Kim, J. W., Sottilare, R. A., Brawner, K. and Flowers, T. (2018), Integrating sensors and exploiting sensor data with gift for improved learning analytics, in ‘Proceedings of the Annual GIFT Users Symposium, GIFTSym6’, pp. 299–312. Klingner, J., Kumar, R. and Hanrahan, P. (2008), Measuring the task -evoked pupillary re - sponse with a remote eye tracker, ...

  20. [20]

    under the hood

    Lapierre, H. G., Charland, P. and L´eger, P.-M. (2024), ‘Looking “under the hood” of learn- ing computer programming: the emotional and cognitive differences between novices and beginners’, Computer Science Education 34(3), 331–352. Larsen, E. S. and Romskaug, T. (2022), Real time stress-aware feedback system for program- ming., Master’s thesis, NTNU. Lar...

  21. [21]

    and Giannakos, M

    Lee-Cultura, S., Sharma, K., Cosentino, G., Papavlasopoulou, S. and Giannakos, M. (2021), Children’s play and problem solving in motion -based educational games: Synergies between human annotations and multi -modal data, in ‘Proceedings of the 20th Annual ACM Inter - action Design and Children Conference’, pp. 408–420. Lee, D., Kwon, W., Heo, J. and Park,...

  22. [22]

    and Cox, B

    Lefevre, D. and Cox, B. (2017), ‘Delayed instructional feedback may be more effective, but is this contrary to learners’ preferences?’, British Journal of Educational Technology 48(6), 1357–1367. Li, C., Chan, E., Denny, P., Luxton-Reilly, A. and Tempero, E. (2019), Towards a framework for teaching debugging, in ‘Proceedings of the Twenty -First Australas...

  23. [23]

    S., Karlsson, J

    Lu, K., Dahlman, A. S., Karlsson, J. and Candefjord, S. (2022), ‘Detecting driver fatigue using heart rate variability: A systematic review’, Accident Analysis & Prevention 178, 106830. 27 Lynam, S. and Cachia, M. (2018), ‘Students’ perceptions of the role of assessments at higher education’, Assessment & Evaluation in Higher Education 43(2), 223–234. Mal...

  24. [24]

    and Van der Stigchel, S

    Math ˆot, S., Dalmaijer, E., Grainger, J. and Van der Stigchel, S. (2014), ‘The pupillary light response reflects exogenous attention and inhibition of return’, Journal of vision 14(14), 7–7. Math ˆot, S., Van der Linden, L., Grainger, J. and Vitu, F. (2013), ‘The pupillary light response reveals the focus of covert visual attention’, PloS one 8(10), e781...

  25. [25]

    and Gaˇsevi´c, D

    Molenaar, I., de Mooij, S., Azevedo, R., Bannert, M., J¨a rvel ¨a, S. and Gaˇsevi´c, D. (2023), ‘Measuring self-regulated learning and the role of ai: Five years of research using multimodal multichannel data’, Computers in Human Behavior 139, 107540. Mukherjee, S., Yadav, R., Yung, I., Zajdel, D. P. and Oken, B. S. (2011), ‘Sensitivity to mental effort a...

  26. [26]

    M., Stefano, L

    Munn, S. M., Stefano, L. and Pelz, J. B. (2008), Fixation -identification in dynamic scenes: Comparing an automated algorithm to manual coding, in ‘Proceedings of the 5th symposium on Applied perception in graphics and visualization’, pp. 33–42. Nivala, M., Hauser, F., Mottok, J. and Gruber, H. (2016), Developing visual expertise in software engineering: ...

  27. [27]

    and Gu´eh´eneuc, Y.-G

    Sharafi, Z., Soh, Z. and Gu´eh´eneuc, Y.-G. (2015), ‘A systematic literature review on the usage of eye-tracking in software engineering’, Information and Software Technology 67, 79–107. Sharma, K., Lee-Cultura, S., Papavlasopoulou, S. and Giannakos, M. (2025), ‘Multimodal effort profiles and children’s performance: Cognitive, physiological and physical d...

  28. [28]

    Silva Da Costa, J. A. and Gheyi, R. (2023), Evaluating the code comprehension of novices with eye tracking, in ‘Proceedings of the XXII Brazilian Symposium on Software Quality’, pp. 332–341. Silvennoinen, M., Mikkonen, J., Manu, M., Malinen, A., Parviainen, T. and Vesisenaho, M. (2019), ‘New methods deepening understanding of students’ experiences and the...

  29. [29]

    and Farias, K

    Vieira, R. and Farias, K. (2021), ‘On the usage of psychophysiological data in software engi - neering: An extended systematic mapping study’, arXiv preprint arXiv:2105.14059 . Von Rosenberg, W., Chanwimalueang, T., Adjei, T., Jaffer, U., Goverdovsky, V. and Mandic, D. P. (2017), ‘Resolving ambiguities in the lf/hf ratio: Lf -hf scatter plots for the cate...

  30. [30]

    and Dietz, A

    Vrzakova, H., Tapiala, J., Iso -Must aj¨arvi, M., Timonen, T. and Dietz, A. (2024), ‘Estimating cognitive workload using task-related pupillary responses in simulated drilling in cochlear implantation’, The Laryngoscope 134(12), 5087–5095. Whalley, J., Settle, A. and Luxton-Reilly, A. (2021), Novice reflections on debugging, in ‘Pro- ceedings of the 52nd ...

  31. [31]

    and Hattie, J

    Wisniewski, B., Zierer, K. and Hattie, J. (2020), ‘The power of feedback revisited: A meta - analysis of educational feedback research’, Frontiers in psychology 10, 487662. Wong, C. L., Chien, W. T., Waye, M. M. Y., Szeto, M. W. C. and Li, H. (2023), ‘Nursing students’ perceived anxiety and heart rate variability in mock skill competency assessment’, Plos...