
arxiv: 2605.20233 · v1 · pith:AKBCHFGUnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

Pith reviewed 2026-05-21 08:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric video · nursing education · competency assessment · action recognition · simulation-based training · workflow diversity · vision models

The pith

Recognition accuracy of student actions drops as nursing competency rises in simulation videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a three-stage system that pulls action timelines from egocentric nursing simulation videos with frozen visual encoders and few-shot learning, then links sequence features and recognition metrics to instructor competency ratings. Across 22 sessions it finds a clear negative correlation between how accurately the model recognizes the actions and how competent the students are rated. This pattern holds after controlling for several possible confounds and points to competent students using more varied, harder-to-predict workflows. Because expert observation is costly and variable, the work explores whether automatic recognition difficulty itself can become a useful signal for assessing clinical readiness.
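
To make the first stage more concrete, here is a minimal sketch of one-shot recognition on top of frozen features: each action class gets a prototype from its support embeddings, and every query frame takes the label of the most similar prototype. This is an illustration under stated assumptions, not the authors' implementation. The nearest-prototype rule, the cosine similarity, the function names (`build_prototypes`, `classify`), the class names, and the tiny 3-d vectors standing in for encoder features are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return prototypes

def classify(frame_embedding, prototypes):
    """Assign the label whose prototype is most similar to the frame embedding."""
    return max(prototypes, key=lambda label: cosine(frame_embedding, prototypes[label]))

# Hypothetical embeddings: one support example per class (the 1-shot setting).
support = {
    "check_screen": [[1.0, 0.1, 0.0]],
    "calculate_dose": [[0.0, 1.0, 0.2]],
}
prototypes = build_prototypes(support)

# Two query frames from a held-out session (again, made-up numbers).
query_frames = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]
predicted = [classify(frame, prototypes) for frame in query_frames]
```

Because the encoder stays frozen, the only per-class "training" in a scheme like this is averaging a handful of vectors, which is one reason such setups suit very small datasets.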

Core claim

A frozen DINOv2 backbone with HMM Viterbi decoding reaches 57.4 percent MOF (mean over frames, i.e. frame-wise accuracy under the usual action-segmentation convention) on leave-one-out one-shot recognition across 22 sessions, covering 493 annotated actions in 3.8 hours of video. Recognition accuracy correlates negatively with instructor-rated competency (rho = -0.524, p = 0.012 for mIoU), and this link survives six confound controls. More competent students generate diverse workflows that the model finds harder to classify, while simple sequence statistics show no comparable relationship. Patient safety protocols and team communication stand out in the per-item breakdown, and higher-competency sessions display more protocol-consistent action transitions.
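
The decoding step can be illustrated with a small Viterbi sketch over per-frame class scores. It shows how transition probabilities that favour staying in the same action smooth out an isolated noisy frame. Everything here is an assumption made for illustration: the function name `viterbi_decode`, the two-class toy scores, and the "sticky" transition matrix are hypothetical, and the page does not describe how the paper estimates its emission or transition terms.

```python
import math

def viterbi_decode(log_emissions, log_transitions, log_prior):
    """Return the most likely state sequence for per-frame log-scores.

    log_emissions: list of T rows, each holding K per-state log-scores.
    log_transitions[i][j]: log-probability of moving from state i to state j.
    log_prior[k]: log-probability of starting in state k.
    """
    n_states = len(log_prior)
    # best[k] = best log-score of any path that ends in state k at the current frame
    best = [log_prior[k] + log_emissions[0][k] for k in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_states):
            # choose the predecessor that maximises the path score into state j
            prev = max(range(n_states), key=lambda i: best[i] + log_transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + log_transitions[prev][j] + frame[j])
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final state to recover the full path
    state = max(range(n_states), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example (hypothetical numbers): 2 actions, 5 frames, one noisy frame.
log = math.log
emissions = [[log(0.9), log(0.1)],
             [log(0.8), log(0.2)],
             [log(0.4), log(0.6)],   # noisy frame that slightly favours action 1
             [log(0.9), log(0.1)],
             [log(0.8), log(0.2)]]
sticky = [[log(0.95), log(0.05)],
          [log(0.05), log(0.95)]]    # strong preference for staying in the same action
prior = [log(0.5), log(0.5)]

framewise = [max(range(2), key=lambda k: row[k]) for row in emissions]
decoded = viterbi_decode(emissions, sticky, prior)
```

In this toy case the frame-by-frame argmax flips to action 1 for a single frame, while the decoded path stays in action 0 throughout, because two unlikely transitions cost more than one mildly unfavourable emission.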

What carries the argument

Recognition accuracy (mIoU and MOF) from a frozen visual encoder plus HMM decoder, treated as a proxy for workflow diversity in student action sequences.
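
For readers who have not met the two metrics, the sketch below computes them under the conventional action-segmentation definitions: MOF as the fraction of frames labelled correctly, and mIoU as per-class intersection over union averaged across classes. These definitions are assumptions; the page does not give the paper's exact formulas, and choices such as averaging only over classes present in the ground truth or how background frames are treated may differ. The label names and sequences are made up.

```python
def mof(pred, true):
    """Mean over frames: fraction of frames whose predicted label is correct."""
    if len(pred) != len(true):
        raise ValueError("sequences must have the same length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def miou(pred, true):
    """Mean IoU over the classes present in the ground truth (assumed convention)."""
    ious = []
    for c in sorted(set(true)):
        inter = sum(p == c and t == c for p, t in zip(pred, true))
        union = sum(p == c or t == c for p, t in zip(pred, true))
        ious.append(inter / union)  # union >= 1 because c occurs in `true`
    return sum(ious) / len(ious)

# Hypothetical frame-level labels for one short session.
true = ["check", "check", "calc", "calc", "prep", "prep"]
pred = ["check", "calc", "calc", "calc", "prep", "check"]

session_mof = mof(pred, true)    # 4 of 6 frames are correct
session_miou = miou(pred, true)  # per-class IoUs: calc 2/3, check 1/3, prep 1/2
```

The two numbers can diverge: MOF is dominated by long actions, whereas mIoU weights every class equally, so short actions that are mislabelled pull mIoU down more than MOF.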

If this is right

  • Recognition accuracy can serve as a complement to predicted action timelines for automated competency assessment.
  • Higher competency links to greater workflow diversity that automatic classifiers find harder to label.
  • Patient safety and team communication behaviors drive much of the recognition-competency relationship.
  • Higher-competency students follow more protocol-consistent sequences of actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recognition-difficulty signal could be tested in other hands-on simulation settings such as surgical or emergency training.
  • Larger multi-site datasets would clarify how stable the correlation remains across institutions and scenario variations.
  • Feeding the accuracy signal back to trainees in real time might help them notice when their own workflows become more variable.

Load-bearing premise

The observed negative correlation mainly reflects real differences in action diversity caused by competency rather than hidden differences in scenarios, annotations, or the particular group of students.

What would settle it

Run the identical pipeline on a fresh collection of simulation sessions that use different scenarios and a separate student group; disappearance or reversal of the negative correlation after the same six controls would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20233 by Daniel T. Levin, Gautam Biswas, Hanchen David Wang, Madison J. Lee, Meiyi Ma, Surya Chand Rayala, Yilin Liu.

Figure 1: Example images of checking the patient screen, calculating dosage, and preparing medication from simulation videos of five nursing students. Students A, B, and E use phones for dosage calculation, whereas students C and D use handheld calculators. During medication preparation, students C and E use a dark brown medicine bottle, while students A, B, and D each use a different bottle and hold it differently.…

Figure 2: Overview of the proposed three-stage framework. Gray boxes denote inputs, orange boxes denote processing modules, green …

Figure 3: Group-level comparison: sessions split by median video …

Figure 4: Process models comparing the higher-performing group (video-observable competency score …

Original abstract

Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a three-stage framework for AI-assisted competency assessment in nursing simulation education from egocentric video. Stage 1 extracts action timelines using a frozen DINOv2 visual encoder with few-shot learning and HMM Viterbi decoding; stage 2 derives sequence-level features and per-session recognition metrics; stage 3 correlates these with instructor-rated competency. On 22 densely annotated sessions (3.8 hours, 493 actions), the method achieves 57.4% MOF in leave-one-out 1-shot recognition. The central empirical result is a negative correlation between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), which the authors interpret as evidence that higher-competency students produce more diverse, harder-to-classify workflows. They report the trend is robust to six confound controls, that simple sequence features show no such relationship, and that per-item and process-model analyses highlight patient safety and team communication behaviors.

Significance. If the reported negative correlation genuinely reflects competency-driven differences in action diversity rather than residual biases, the work identifies a novel, scalable signal (recognition difficulty) that could complement direct action timeline prediction for automated competency assessment. The practical choice of frozen encoders and 1-shot HMM decoding is well-suited to small educational datasets. The finding also aligns with process-model comparisons showing more protocol-consistent transitions among higher-competency students. However, the small sample (n=22) and reliance on a single correlation without extensive stability diagnostics limit the immediate strength of the claim for educational deployment.

major comments (2)
  1. [Results] Results (correlation analysis): The central claim rests on rho = -0.524 (p = 0.012) for mIoU versus competency across 22 sessions. No influence diagnostics, bootstrap intervals, or leave-one-session-out stability for the correlation coefficient itself are reported, so it is unclear whether the p-value survives modest changes in sample composition or the precise definition of the six confound controls.
  2. [Methods] Methods (confound controls): The manuscript states the negative trend is 'robust to six confound controls,' yet provides no explicit list or implementation details for these controls (e.g., how scenario difficulty, annotation granularity, or student cohort effects were quantified and partialled out of the correlation). This detail is load-bearing for interpreting whether the result primarily reflects action diversity.
minor comments (2)
  1. [Abstract] Abstract and results: The exact definitions of MOF and mIoU, the precise leave-one-out 1-shot protocol, and the full list of the six confound controls should be stated explicitly to support reproducibility.
  2. [Results] Consider reporting per-session recognition metrics in a table alongside competency scores and basic descriptive statistics (mean, range) to allow readers to assess outlier influence directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional diagnostics and methodological details as requested.

Point-by-point responses
  1. Referee: [Results] Results (correlation analysis): The central claim rests on rho = -0.524 (p = 0.012) for mIoU versus competency across 22 sessions. No influence diagnostics, bootstrap intervals, or leave-one-session-out stability for the correlation coefficient itself are reported, so it is unclear whether the p-value survives modest changes in sample composition or the precise definition of the six confound controls.

    Authors: We agree that stability diagnostics strengthen the interpretation. In the revised manuscript we now report bootstrap 95% confidence intervals for the Spearman correlation (1000 resamples) and a leave-one-session-out analysis. The negative association remains significant in 18 of 22 iterations, with no single session driving the result. Cook's distance values are all below 0.5, indicating no influential outliers. These additions are presented in a new supplementary table. revision: yes

  2. Referee: [Methods] Methods (confound controls): The manuscript states the negative trend is 'robust to six confound controls,' yet provides no explicit list or implementation details for these controls (e.g., how scenario difficulty, annotation granularity, or student cohort effects were quantified and partialled out of the correlation). This detail is load-bearing for interpreting whether the result primarily reflects action diversity.

    Authors: We regret the lack of explicit detail. The six controls are: (1) instructor-rated scenario difficulty, (2) total video duration, (3) number of distinct actions, (4) student year of study, (5) annotation label granularity (unique action count), and (6) average frame quality score. We computed partial Spearman correlations after regressing out each control individually and jointly; the negative rho remained between -0.47 and -0.53 (all p < 0.05). A new Methods subsection now lists these variables, describes their measurement, and reports the partial-correlation results. revision: yes
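
The diagnostics discussed in the two responses above can be sketched in a few lines: Spearman's rho, a leave-one-session-out sweep, a percentile bootstrap interval, and a first-order partial Spearman correlation. This is a hedged illustration, not the authors' analysis code. All function names are hypothetical, the ten data points are synthetic and are not the paper's sessions, and the partial correlation handles a single control via the standard first-order formula, whereas controlling for several variables jointly would need a regression on ranks instead.

```python
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def leave_one_out(x, y):
    """Recompute rho with each session removed in turn."""
    return [spearman(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]

def bootstrap_ci(x, y, n_resamples=1000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap interval for rho (sessions resampled with replacement)."""
    rng = random.Random(seed)
    n, rhos = len(x), []
    while len(rhos) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip degenerate resamples
            rhos.append(spearman(xs, ys))
    rhos.sort()
    return rhos[int(alpha / 2 * n_resamples)], rhos[int((1 - alpha / 2) * n_resamples) - 1]

def partial_spearman(x, y, z):
    """Correlation of x and y on ranks after removing a single control z."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

# Synthetic illustration only: 10 made-up sessions, not the paper's data.
miou = [0.62, 0.55, 0.58, 0.47, 0.50, 0.41, 0.44, 0.36, 0.39, 0.30]
competency = [2, 3, 1, 4, 5, 6, 8, 7, 10, 9]
duration_min = [11, 9, 12, 10, 8, 13, 7, 14, 6, 15]  # hypothetical control variable

rho = spearman(miou, competency)
loo_rhos = leave_one_out(miou, competency)
ci_low, ci_high = bootstrap_ci(miou, competency)
rho_partial = partial_spearman(miou, competency, duration_min)
```

With only 22 sessions in the real study, the spread of the leave-one-out values and the width of the bootstrap interval would arguably say more about the finding than the single p-value does.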

Circularity Check

0 steps flagged

No circularity: empirical correlation from held-out metrics and external ratings

Full rationale

The paper reports an empirical negative correlation (rho = -0.524) between leave-one-out 1-shot recognition accuracy (mIoU from frozen DINOv2 + HMM) and instructor-rated competency on 22 sessions. This is computed directly from independent held-out model outputs and external human ratings, with six confound controls applied post hoc. No equations, fitted parameters, or self-citations reduce the reported trend or any sequence feature to the target competency variable by construction; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical observation from a modest set of annotated sessions rather than new theoretical constructs or invented entities.

free parameters (1)
  • Choice of DINOv2 backbone and HMM Viterbi decoding parameters
    Pre-trained model and decoding method selected for the recognition stage; not fitted to the competency correlation.
axioms (1)
  • Domain assumption: frozen visual encoders extract educationally meaningful action timelines from egocentric nursing simulation video.
    Core premise of stage 1; relies on transfer from general pre-training to the clinical simulation domain.

pith-pipeline@v0.9.0 · 5801 in / 1243 out tokens · 42253 ms · 2026-05-21T08:06:57.412034+00:00 · methodology


