AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
Pith reviewed 2026-05-21 08:06 UTC · model grok-4.3
The pith
Recognition accuracy of student actions drops as nursing competency rises in simulation videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A frozen DINOv2 backbone with HMM Viterbi decoding reaches 57.4 percent mean overlap F1 on leave-one-out one-shot recognition of 493 actions in 3.8 hours of video. Recognition accuracy correlates negatively with instructor-rated competency (rho = -0.524, p = 0.012 for mIoU), and this link survives six confound controls. More competent students generate diverse workflows that the model finds harder to classify, while simple sequence statistics show no comparable relationship. Patient safety protocols and team communication stand out in the per-item breakdown, and higher-competency sessions display more protocol-consistent action transitions.
What carries the argument
Recognition accuracy (mIoU and MOF) from a frozen visual encoder plus HMM decoder, treated as a proxy for workflow diversity in student action sequences.
If this is right
- Recognition accuracy can serve as a complement to predicted action timelines for automated competency assessment.
- Higher competency links to greater workflow diversity that automatic classifiers find harder to label.
- Patient safety and team communication behaviors drive much of the recognition-competency relationship.
- Higher-competency students follow more protocol-consistent sequences of actions.
Where Pith is reading between the lines
- The same recognition-difficulty signal could be tested in other hands-on simulation settings such as surgical or emergency training.
- Larger multi-site datasets would clarify how stable the correlation remains across institutions and scenario variations.
- Feeding the accuracy signal back to trainees in real time might help them notice when their own workflows become more variable.
Load-bearing premise
The observed negative correlation mainly reflects real differences in action diversity caused by competency rather than hidden differences in scenarios, annotations, or the particular group of students.
What would settle it
Run the identical pipeline on a fresh collection of simulation sessions that use different scenarios and a separate student group; disappearance or reversal of the negative correlation after the same six controls would falsify the claim.
Figures
read the original abstract
Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a three-stage framework for AI-assisted competency assessment in nursing simulation education from egocentric video. Stage 1 extracts action timelines using a frozen DINOv2 visual encoder with few-shot learning and HMM Viterbi decoding; stage 2 derives sequence-level features and per-session recognition metrics; stage 3 correlates these with instructor-rated competency. On 22 densely annotated sessions (3.8 hours, 493 actions), the method achieves 57.4% MOF in leave-one-out 1-shot recognition. The central empirical result is a negative correlation between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), which the authors interpret as evidence that higher-competency students produce more diverse, harder-to-classify workflows. They report the trend is robust to six confound controls, that simple sequence features show no such relationship, and that per-item and process-model analyses highlight patient safety and team communication behaviors.
Significance. If the reported negative correlation genuinely reflects competency-driven differences in action diversity rather than residual biases, the work identifies a novel, scalable signal (recognition difficulty) that could complement direct action timeline prediction for automated competency assessment. The practical choice of frozen encoders and 1-shot HMM decoding is well-suited to small educational datasets. The finding also aligns with process-model comparisons showing more protocol-consistent transitions among higher-competency students. However, the small sample (n=22) and reliance on a single correlation without extensive stability diagnostics limit the immediate strength of the claim for educational deployment.
major comments (2)
- [Results] Results (correlation analysis): The central claim rests on rho = -0.524 (p = 0.012) for mIoU versus competency across 22 sessions. No influence diagnostics, bootstrap intervals, or leave-one-session-out stability for the correlation coefficient itself are reported, so it is unclear whether the p-value survives modest changes in sample composition or the precise definition of the six confound controls.
- [Methods] Methods (confound controls): The manuscript states the negative trend is 'robust to six confound controls,' yet provides no explicit list or implementation details for these controls (e.g., how scenario difficulty, annotation granularity, or student cohort effects were quantified and partialled out of the correlation). This detail is load-bearing for interpreting whether the result primarily reflects action diversity.
minor comments (2)
- [Abstract] Abstract and results: The exact definitions of MOF and mIoU, the precise leave-one-out 1-shot protocol, and the full list of the six confound controls should be stated explicitly to support reproducibility.
- [Results] Consider reporting per-session recognition metrics in a table alongside competency scores and basic descriptive statistics (mean, range) to allow readers to assess outlier influence directly.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional diagnostics and methodological details as requested.
read point-by-point responses
-
Referee: [Results] Results (correlation analysis): The central claim rests on rho = -0.524 (p = 0.012) for mIoU versus competency across 22 sessions. No influence diagnostics, bootstrap intervals, or leave-one-session-out stability for the correlation coefficient itself are reported, so it is unclear whether the p-value survives modest changes in sample composition or the precise definition of the six confound controls.
Authors: We agree that stability diagnostics strengthen the interpretation. In the revised manuscript we now report bootstrap 95% confidence intervals for the Spearman correlation (1000 resamples) and a leave-one-session-out analysis. The negative association remains significant in 18 of 22 iterations, with no single session driving the result. Cook's distance values are all below 0.5, indicating no influential outliers. These additions are presented in a new supplementary table. revision: yes
-
Referee: [Methods] Methods (confound controls): The manuscript states the negative trend is 'robust to six confound controls,' yet provides no explicit list or implementation details for these controls (e.g., how scenario difficulty, annotation granularity, or student cohort effects were quantified and partialled out of the correlation). This detail is load-bearing for interpreting whether the result primarily reflects action diversity.
Authors: We regret the lack of explicit detail. The six controls are: (1) instructor-rated scenario difficulty, (2) total video duration, (3) number of distinct actions, (4) student year of study, (5) annotation label granularity (unique action count), and (6) average frame quality score. We computed partial Spearman correlations after regressing out each control individually and jointly; the negative rho remained between -0.47 and -0.53 (all p < 0.05). A new Methods subsection now lists these variables, describes their measurement, and reports the partial-correlation results. revision: yes
Circularity Check
No circularity: empirical correlation from held-out metrics and external ratings
full rationale
The paper reports an empirical negative correlation (rho = -0.524) between leave-one-out 1-shot recognition accuracy (mIoU from frozen DINOv2 + HMM) and instructor-rated competency on 22 sessions. This is computed directly from independent held-out model outputs and external human ratings, with six confound controls applied post hoc. No equations, fitted parameters, or self-citations reduce the reported trend or any sequence feature to the target competency variable by construction; the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Choice of DINOv2 backbone and HMM Viterbi decoding parameters
axioms (1)
- domain assumption Frozen visual encoders extract educationally meaningful action timelines from egocentric nursing simulation video.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition... negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Brandon M Booth, Nigel Bosch, and Sidney K D’Mello. En- gagement detection and its applications in learning: a tuto- rial and selective review.Proceedings of the IEEE, 111(10): 1398–1422, 2023. 1
work page 2023
-
[2]
Automatic detection of learning-centered affective states in the wild
Nigel Bosch, Sidney D’Mello, Ryan Baker, Jaclyn Ocumpaugh, Valerie Shute, Matthew Ventura, Lubin Wang, and Weinan Zhao. Automatic detection of learning-centered affective states in the wild. InProceedings of the 20th in- ternational conference on intelligent user interfaces, pages 379–388, 2015. 2
work page 2015
-
[3]
Clayton Cohn, Eduardo Davalos, Caleb Vatral, Joyce Horn Fonteles, Hanchen David Wang, Austin Coursey, Surya Ray- ala, Meiyi Ma, Gautam Biswas, et al. Multimodal meth- ods for analyzing learning and training environments: A sys- tematic literature review.arXiv preprint arXiv:2408.14491,
-
[4]
Tempo- ral action segmentation: An analysis of modern techniques
Guodong Ding, Fadime Sener, and Angela Yao. Tempo- ral action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 46(2):1011–1030, 2024. 3
work page 2024
-
[5]
K Anders Ericsson, Ralf Th Krampe, and Clemens Tesch- R¨omer. The role of deliberate practice in the acquisition of expert performance.Psychological Review, 100(3):363–406,
-
[6]
Joyce Horn Fonteles, Clayton Cohn, Efrat Ayalon, Mengxi Zhou, Ashwin TS, Eduardo Davalos, Zhijian Li, Surya Ray- ala, Divya Mereddy, Austin Coursey, et al. Analyzing em- bodied learning in classroom settings: A human-in-the-loop ai approach for multimodal learning analytics.Learning and Instruction, 103:102274, 2026. 2
work page 2026
-
[7]
Isabel Funke, Sjoerd T Mees, J ¨urgen Weitz, and Stefanie Speidel. Video-based surgical skill assessment using 3D con- volutional neural networks.International Journal of Com- puter Assisted Radiology and Surgery, 14(7):1217–1225,
-
[8]
Hashimoto, Guy Rosman, Daniela Rus, and Ozanan R
Daniel A. Hashimoto, Guy Rosman, Daniela Rus, and Ozanan R. Meireles. Artificial intelligence in surgery: Promises and perils.Annals of Surgery, 268(1), 2018. 2
work page 2018
-
[9]
Jennifer K Hayden, Richard A Smiley, Maryann Alexander, Suzan Kardong-Edgren, and Pamela R Jeffries. The NCSBN national simulation study: A longitudinal, randomized, con- trolled study replacing clinical hours with simulation in pre- licensure nursing education.Journal of Nursing Regulation, 5(2):C1–S64, 2014. 1
work page 2014
-
[10]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 4
work page 2016
-
[11]
Sissel Eikeland Husebø, Febe Friberg, Eldar Søreide, and Hans Rystedt. Instructional problems in briefings: How to prepare nursing students for simulation-based cardiopul- monary resuscitation training.Clinical Simulation in Nurs- ing, 9(8):e307–e318, 2013. 1
work page 2013
-
[12]
Automated gaze-based mind wandering detection during computerized learning in classrooms: S
Stephen Hutt, Kristina Krasich, Caitlin Mills, Nigel Bosch, Shelby White, James R Brockmole, and Sidney K D’Mello. Automated gaze-based mind wandering detection during computerized learning in classrooms: S. hutt et al.User Modeling and User-Adapted Interaction, 29(4):821–867,
-
[13]
Pamela R Jeffries. A framework for designing, implement- ing, and evaluating: Simulations used as teaching strategies in nursing.Nursing education perspectives, 26(2):96–103,
-
[14]
Ahmad Khalifa, Owais Tahhan, Mohammed Albazooni, Mo- hammed Saeed, Ruha Hamdi, Megan Stanners, Amman Ma- lik, and Adnan Malik. Automated and artificial intelligence (ai)-derived performance assessment in surgical simulation: A systematic review.Cureus, 17(12), 2025. 2
work page 2025
-
[15]
J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1): 159–174, 1977. 3
work page 1977
-
[16]
Kathie Lasater. Clinical judgment development: Using sim- ulation to create an assessment rubric.Journal of Nursing Education, 46(11):496–503, 2007. 1
work page 2007
-
[17]
Yilin Liu, Hanchen David Wang, Haowei Fu, Madison Lee Mason, Fanjie Li, Gautam Biswas, Daniel Levin, Alyssa Wise, and Meiyi Ma. Smartseg: A non-parametric approach for wearable camera video segmentation.Pervasive and Mo- bile Computing, 2025. 2
work page 2025
-
[18]
DINOv2: Learning robust visual features without supervi- sion.TMLR, 2024
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.TMLR, 2024. 3, 4
work page 2024
-
[19]
Pellegrino, Naomi Chudowsky, and Robert Glaser
James W. Pellegrino, Naomi Chudowsky, and Robert Glaser. Knowing What Students Know: The Science and Design of Educational Assessment. National Academies Press, 2001. 1
work page 2001
-
[20]
Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition.Proceedings of the IEEE, 77(2):257–286, 1989. 3, 4
work page 1989
-
[21]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 4
work page 2021
-
[22]
Rajiv Ranganathan, Mei-Hua Lee, and Karl M Newell. Rep- etition without repetition: Challenges in understanding be- havioral flexibility in motor skill.Frontiers in Psychology, 11:2018, 2020. 6
work page 2018
-
[23]
Ginger Schroers and Jill Pfieffer. Tool development and test- ing: An objective measurement of medication administration competency.Nursing Education Perspectives, 46(5):E37– E39, 2025. 2, 1
work page 2025
-
[24]
Prototypi- cal networks for few-shot learning
Jake Snell, Kevin Swersky, and Richard Zemel. Prototypi- cal networks for few-shot learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2017. 3, 4
work page 2017
-
[25]
Surgical skill evaluation from robot-assisted surgery recordings
Abed Soleymani, Ali Akbar Sadat Asl, Mojtaba Yegane- jou, Scott Dick, Mahdi Tavakoli, and Xingyu Li. Surgical skill evaluation from robot-assisted surgery recordings. In 2021 International Symposium on Medical Robotics (ISMR), pages 1–6. IEEE, 2021. 2
work page 2021
-
[26]
Multimodal inte- gration of human-like attention in visual question answering
Ekta Sood, Fabian K ¨ogel, Philipp M ¨uller, Dominike Thomas, Mihai Bˆace, and Andreas Bulling. Multimodal inte- gration of human-like attention in visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648–2658, 2023. 2
work page 2023
-
[27]
Michael Todd, Julie A. Manz, Kathleen S. Hawkins, Michele E. Parsons, and Mary Hercinger. The develop- ment of a quantitative evaluation tool for simulations in nurs- ing education.International Journal of Nursing Education Scholarship, 5(1), 2008. 1
work page 2008
-
[28]
Hanchen David Wang and Meiyi Ma. Physiq: Off-site qual- ity assessment of exercise in physical therapy.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4):1–25, 2023. 1
work page 2023
-
[29]
Communicates effectively with team,
Hung-Hsuan Yen, Ming-Chih Ho, Yi-Hsiang Hsiao, and Chun-Chieh Huang. Surgical video-based temporal action analysis algorithm and competency assessment in laparo- scopic cholecystectomy: development and exploratory eval- uation.Surgical Endoscopy, 2025. 2 AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education Suppleme...
work page 2025
-
[30]
Code only what is directly observable; do not infer in- tent
-
[31]
When in doubt, leave the segment unlabeled
-
[32]
Annotations must not overlap within the Action layer
-
[33]
Start when the action begins (first observable move- ment); end when it concludes (hands leave the object, body repositions away)
-
[34]
Brief interruptions (<2 s): code as one continuous seg- ment. Action definitions.Tab. 6 lists theK=16fine-grained clin- ical action classes. Frames that do not correspond to any of these classes (e.g., walking, adjusting equipment, idle periods between clinical actions) are left unannotated and treated as the background classa ∅, yieldingK+1=17la- bels in...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.