
arXiv: 2604.23255 · v1 · submitted 2026-04-25 · 💻 cs.HC · cs.AI · cs.CY

Scalable LLM-based Coding of Dialogue in Healthcare Simulation: Balancing Coding Performance, Processing Time, and Environmental Impact

Pith reviewed 2026-05-08 07:40 UTC · model grok-4.3

classification: 💻 cs.HC, cs.AI, cs.CY
keywords: LLM coding, dialogue analysis, healthcare simulation, batch size, energy consumption, qualitative coding, team learning

The pith

Increasing batch sizes in LLM prompting for healthcare dialogue coding improves speed and reduces energy consumption but decreases accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines ways to make LLM-based coding of team dialogue practical for healthcare simulation training. It tests different prompt designs and batch sizes on over 11,000 utterances to see how they affect accuracy, speed, and environmental cost. A sympathetic reader would care because manual qualitative coding is slow and expensive, while automated systems need to deliver results quickly enough for debriefing sessions without excessive computing resources. The study finds that larger batches trade some coding quality for faster and greener processing. This points toward design choices that could make such tools viable in real educational environments where time and sustainability matter.

Core claim

Testing four prompt designs across varying batch sizes on a dataset of 11,647 utterances coded for six dialogue constructs reveals that larger batch sizes improve processing speed and lower energy use while reducing coding performance compared to smaller batches.

What carries the argument

Batch size as the variable controlling the number of utterances processed together in each LLM call, which affects the balance between accuracy, latency, and power draw.
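
To make the role of batch size concrete, here is a minimal Python sketch of how utterances might be grouped into one prompt per LLM call. The function names, the prompt wording, and the placeholder construct names are all assumptions for illustration; the paper's actual prompt designs are not reproduced here.

```python
from typing import Iterator, List


def make_batches(utterances: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size utterances."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(utterances), batch_size):
        yield utterances[start:start + batch_size]


def build_prompt(batch: List[str], constructs: List[str]) -> str:
    """Render one prompt requesting a label per numbered utterance.

    The wording is hypothetical, not taken from the paper.
    """
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(batch, start=1))
    return (
        "Label each utterance with one of: " + ", ".join(constructs) + ".\n"
        "Return one label per line, in order.\n\n" + numbered
    )


# Placeholder data: 7 utterances and 2 made-up construct names.
utterances = [f"utterance {n}" for n in range(1, 8)]
constructs = ["construct_a", "construct_b"]

prompts = [build_prompt(b, constructs) for b in make_batches(utterances, batch_size=3)]
print(len(prompts))  # number of LLM calls needed at this batch size
```

A larger `batch_size` means fewer calls for the same set of utterances, which is the lever behind the speed and energy gains; each call also carries more utterances for the model to label at once, which is where the reported accuracy cost would arise.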

If this is right

  • Larger batch sizes enable faster processing suitable for real-time debriefing in simulations.
  • Reduced energy use supports sustainable deployment of LLM tools in educational settings.
  • Smaller batches can be selected when maximum coding fidelity is required for research.
  • Practical guidance emerges for scaling dialogue analytics where timeliness and sustainability matter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If accuracy losses stay within limits acceptable for feedback, efficient batching could support real-time educational systems without high compute demands.
  • The batching approach might transfer to dialogue analysis in other collaborative learning contexts such as classrooms if similar patterns hold.
  • Different LLMs or refined prompts could be tested to reduce the accuracy penalty while keeping the speed gains.

Load-bearing premise

That the accuracy-speed-energy trade-offs seen with this dataset of 11,647 utterances, these specific prompts, and the chosen LLM will hold in other settings, and that the resulting accuracy levels remain useful for providing educational feedback.

What would settle it

Running the same experiments on a new dataset from a different healthcare simulation, or with another LLM, and finding that accuracy does not decline as batch size increases would challenge the central trade-off claim.

Figures

Figures reproduced from arXiv: 2604.23255 by Dragan Gasevic, Gloria Milena Fernandez-Nieto, Kiyoshige Garces, Linxuan Zhao, Roberto Martinez-Maldonado, Sachini Samaraweera, Vanessa Echeverria.

Figure 1: Processing time in seconds across prompt designs and batch sizes.
Figure 2: Macro-averaged F1 scores and batch sizes illustrat…
Figure 3: Energy consumption in joules across prompt designs and batch sizes.
Figure 4: Pareto front. The highlighted points depict the Pareto front for the two dimensions to optimise: Processing Time and …
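
For readers unfamiliar with the Pareto front in Figure 4, the following Python sketch shows one way to pick non-dominated configurations when one objective is minimised and another is maximised. The second objective in the caption is cut off in this page, so the `cost`/`score` pairing is an assumption, and the four data points are invented for illustration only; they are not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float   # objective to minimise (for example, seconds)
    score: float  # objective to maximise (for example, macro-F1)


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.cost <= b.cost and a.score >= b.score
    strictly_better = a.cost < b.cost or a.score > b.score
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Invented illustrative values, not taken from the paper.
configs = [
    Config("A", cost=100.0, score=0.80),
    Config("B", cost=40.0, score=0.70),
    Config("C", cost=60.0, score=0.65),  # slower and less accurate than B
    Config("D", cost=20.0, score=0.55),
]

front = pareto_front(configs)
print([c.name for c in front])
```

Configurations on the front represent the available trade-offs: moving along it buys speed only by giving up score, so the choice among them depends on how much accuracy a given debriefing use case can tolerate.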

Research shows that dialogue, the interactive process through which participants articulate their thinking, plays a central role in constructing shared understanding, coordinating action, and shaping learning outcomes in teams. Analysing dialogue content has been central to advancing team learning theory and informing the design of computer-supported collaborative learning environments, yet this progress has depended on labour-intensive qualitative coding. LLMs offer new possibilities for automating and enhancing the dialogue layer within emerging multimodal learning analytics approaches, with recent studies showing that they can approximate human coding through few-shot prompting. However, prior work has focused on replicating human coding accuracy for research purposes, rather than addressing a more educationally consequential question: how can we design prompts that allow an LLM to label team dialogue accurately and fast enough to be useful in real settings, such as in-person healthcare simulations, where results must be returned quickly and computational cost and sustainability also matter? This paper investigates how prompt design and batching strategies can be optimised to balance coding accuracy, processing time, and environmental impact in team-based healthcare simulation debriefing. Using a dataset of 11,647 utterances coded across 6 dialogue constructs, we compared 4 prompt designs across varying batch sizes, evaluating coding performance, processing time, and energy consumption, as well as the trade-offs between these metrics. Results indicate that increasing batch size improves speed and reduces energy use, but negatively impacts coding performance. Beyond demonstrating the feasibility of LLM-based qualitative analysis, this study offers practical guidance for scaling dialogue analytics in contexts where timeliness, privacy, and sustainability are critical.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports an empirical study comparing four prompt designs and varying batch sizes for LLM-based coding of team dialogue in healthcare simulation debriefings. On a fixed dataset of 11,647 utterances labeled across six dialogue constructs, it measures coding performance, processing time, and energy consumption, finding that larger batch sizes improve speed and reduce energy use while degrading coding accuracy. The work positions this as practical guidance for scalable, timely, private, and sustainable dialogue analytics in educational settings.

Significance. If the reported accuracy-speed-energy trade-offs are robust, the study usefully shifts LLM dialogue-coding research from pure accuracy replication toward deployment-relevant constraints in real-time healthcare training. It supplies concrete empirical observations on batching effects for one dataset and model family, which could inform prompt and infrastructure choices where latency and carbon cost matter.

major comments (3)
  1. [Abstract / Results] Directional claims that increasing batch size 'negatively impacts coding performance' are presented without statistical tests, confidence intervals, error bars, or exact numeric deltas (e.g., F1 or accuracy drops per batch size). This absence makes the central trade-off observation unverifiable from the reported text.
  2. [Methods / Evaluation] No human inter-rater reliability baselines (e.g., Cohen's kappa or percent agreement) are supplied for the six dialogue constructs, so the absolute and relative quality of the LLM outputs cannot be contextualized against the human coding standard the paper seeks to approximate.
  3. [Results] The paper does not specify the exact performance metric(s) used (accuracy, macro-F1, exact match, etc.) or how ground-truth labels were obtained and validated, which is load-bearing for interpreting the reported performance degradation.
minor comments (2)
  1. [Abstract] The abstract states '4 prompt designs' and '6 dialogue constructs' but does not name them; adding explicit labels or a small table would improve readability.
  2. [Methods] Energy-consumption measurement protocol (hardware, carbon-intensity assumptions, tool used) should be stated more explicitly even if relegated to an appendix.
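
The inter-rater baseline named in major comment 2 can be illustrated with a short Python sketch of Cohen's kappa for two coders. This is a generic textbook computation under the assumption that each utterance receives exactly one label per coder; the label values are placeholders, and nothing here reflects the paper's actual coding data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as perfect
        # agreement here by convention (kappa is undefined in this case).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder labels for four utterances.
coder_1 = ["x", "x", "y", "y"]
coder_2 = ["x", "y", "y", "y"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Reporting such a value per construct for human coders would give a ceiling against which LLM-versus-human agreement could be read.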

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the clarity and verifiability of our empirical results on LLM batching trade-offs. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract / Results] Directional claims that increasing batch size 'negatively impacts coding performance' are presented without statistical tests, confidence intervals, error bars, or exact numeric deltas (e.g., F1 or accuracy drops per batch size). This absence makes the central trade-off observation unverifiable from the reported text.

    Authors: We agree that the directional claims require stronger statistical support for verifiability. The full results section contains per-batch performance values, but we did not include formal tests, CIs, or deltas in the abstract or summary text. In revision we will add exact F1/accuracy deltas between batch sizes, error bars on figures, and appropriate statistical tests (e.g., repeated-measures ANOVA or paired comparisons with correction) to substantiate the observed degradation.
    Revision: yes

  2. Referee: [Methods / Evaluation] No human inter-rater reliability baselines (e.g., Cohen's kappa or percent agreement) are supplied for the six dialogue constructs, so the absolute and relative quality of the LLM outputs cannot be contextualized against the human coding standard the paper seeks to approximate.

    Authors: The ground-truth labels were produced by a single primary expert coder with secondary validation by the research team rather than independent parallel coding, which is why IRR statistics were not computed or reported. We will expand the Methods section with a full description of the labeling protocol, any available validation agreement figures, and an explicit discussion of how this affects interpretation of LLM performance relative to human standards.
    Revision: partial

  3. Referee: [Results] The paper does not specify the exact performance metric(s) used (accuracy, macro-F1, exact match, etc.) or how ground-truth labels were obtained and validated, which is load-bearing for interpreting the reported performance degradation.

    Authors: Macro-F1 was the primary metric (chosen for its suitability to the multi-construct coding task); ground-truth labels were obtained via expert manual coding of the 11,647 utterances by healthcare simulation researchers, with a validation subset reviewed for consistency. We will state the metric explicitly in Methods and Results, add details on label acquisition and validation procedures, and clarify how performance degradation was calculated.
    Revision: yes
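
Since macro-F1 is named as the primary metric, a small Python sketch of how it is commonly computed may help. It assumes a single-label setup (one construct per utterance), which may not match the paper's coding scheme, and the labels and the two "batch size" prediction lists are placeholders rather than reported results.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least 1 because
        # the label occurs in gold or pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Placeholder labels: the same gold codes scored against two hypothetical runs.
gold = ["a", "a", "b", "b"]
pred_small_batch = ["a", "a", "b", "b"]
pred_large_batch = ["a", "b", "b", "b"]

delta = macro_f1(gold, pred_large_batch) - macro_f1(gold, pred_small_batch)
print(round(delta, 4))  # negative delta means the larger batch scored lower
```

Because macro-averaging weights every label equally, a drop on a rare construct moves the score as much as a drop on a frequent one, which matters when reading per-batch deltas.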

Circularity Check

0 steps flagged

No significant circularity; purely empirical comparison


The paper reports an empirical study that compares four prompt designs across varying batch sizes on a fixed dataset of 11,647 utterances, measuring coding performance, processing time, and energy consumption directly from the experiments. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citations that reduce the central claims to prior results by construction are present. The headline findings on accuracy-speed-energy trade-offs are stated as observed outcomes for this specific setup rather than as a universal law derived from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that batch size modulates the three metrics; no free parameters are fitted to produce the result, but the work assumes LLM few-shot prompting can be tuned for practical use.

axioms (1)
  • Domain assumption: LLMs can approximate human qualitative coding of dialogue through few-shot prompting.
    Invoked in the abstract as established by recent studies, forming the basis for testing prompt and batch optimizations.



Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

  1. [1]

    In: StatPearls

    Abulebda, K., Auerbach, M., Limaiem, F.: Debriefing techniques utilized in medical simulation. In: StatPearls. StatPearls Publishing, Treasure Island, FL (2025)

  2. [2]

    IEEE Access13, 5858–5870 (2025)

    Algarni, A.M., Thayananthan, V.: Digital health: The cybersecurity for ai-based healthcare communication. IEEE Access13, 5858–5870 (2025). https://doi.org/10.1109/ACCESS.2025.3526666

  3. [3]

    McLaren, B.: Lever- aging intelligent tutoring systems to enhance project-based learning in work- force training at community colleges

    An, M., Teffera, L., Mehrvarz, M., Li, B., Bogart, C., Sakr, M., M. McLaren, B.: Lever- aging intelligent tutoring systems to enhance project-based learning in work- force training at community colleges. In: Ferreira Mello, R., Rummel, N., Jivet, I., Pishtari, G., Ruipérez Valiente, J.A. (eds.) Technology Enhanced Learning for In- clusive and Equitable Qu...

  4. [4]

    In: Proceedings of the Eleventh ACM Conference on Learning @ Scale

    Barno, E., Albaladejo-González, M., Reich, J.: Scaling generated feedback for novice teachers by sustaining teacher educators’ expertise: A design to train llms with teacher educator endorsement of generated feedback. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale. p. 412–416. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10....

  5. [5]

    Berthelot, A., Caron, E., Jay, M., Lefèvre, L.: Understanding the environmen- tal impact of generative ai services. Commun. ACM68(7), 46–53 (Jun 2025). https://doi.org/10.1145/3725984

  6. [6]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

    Cheng, Z., Kasai, J., Yu, T.: Batch Prompting: Efficient Inference with Large Language Model APIs. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. pp. 792–810. Association for Computational Linguistics, Singapore (2023). https://doi.org/10.18653/v1/2023.emnlp-industry.74

  7. [7]

    In: Proceedings of the 2nd Workshop on Sustainable Computer Systems

    Chien, A.A., Lin, L., Nguyen, H., Rao, V., Sharma, T., Wijayawardana, R.: Reducing the carbon impact of generative ai inference (today and in 2035). In: Proceedings of the 2nd Workshop on Sustainable Computer Systems. HotCarbon ’23, ACM, New York, NY, USA (2023). https://doi.org/10.1145/3604930.3605705

  8. [8]

    Educational Technology & Society21(2), 273–290 (2018)

    Choi, S.P.M., Lam, S.S., Li, K.C., Wong, B.T.M.: Learning analytics at low cost: At- risk student prediction with clicker data and systematic proactive interventions. Educational Technology & Society21(2), 273–290 (2018)

  9. [9]

    International Journal of Social Research Methodology15(6), 523–543 (2012)

    Crowston, K., Allen, E.E., Heckman, R.: Using natural language processing tech- nology for qualitative data analysis. International Journal of Social Research Methodology15(6), 523–543 (2012)

  10. [10]

    Devlin, M.-W

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirec- tional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

  11. [11]

    Medical Teacher31(7), e287–e294 (Jan 2009)

    Dieckmann, P., Molin Friis, S., Lippert, A., Østergaard, D.: The art and science of debriefing in simulation: Ideal and practice. Medical Teacher31(7), e287–e294 (Jan 2009). https://doi.org/10.1080/01421590902866218

  12. [12]

    In: 2024 IEEE 15th International Green and Sustainable Computing Conference (IGSC)

    Ding, Y., Shi, T.: Sustainable llm serving: Environmental implications, chal- lenges, and opportunities : Invited paper. In: 2024 IEEE 15th International Green and Sustainable Computing Conference (IGSC). pp. 37–38 (2024). https://doi.org/10.1109/IGSC64514.2024.00016

  13. [13]

    In: Proceedings of the 14th Learning Analytics and Knowledge Conference

    Echeverria, V., Yan, L., Zhao, L., Abel, S., Alfredo, R., Dix, S., Jaggard, H., Wother- spoon, R., Osborne, A., Buckingham Shum, S., Gasevic, D., Martinez-Maldonado, R.: TeamSlides: A Multimodal Teamwork Analytics Dashboard for Teacher-guided Reflection in a Physical Learning Space. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. ...

  14. [14]

    In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems

    Echeverria, V., Zhao, L., Alfredo, R., Milesi, M.E., Jin, Y., Abel, S., Fan, J.X., Yan, L., Dix, S., Wotherspoon, R., Li, X., Jaggard, H.A., Osborne, A., Buckingham Shum, S., Gasevic, D., Martinez-Maldonado, R.: TeamVision: An AI-powered Learning Analytics System for Supporting Reflection in Team-based Healthcare Simulation. In: Proceedings of the 2025 CH...

  15. [15]

    Measuring the environmental impact of delivering AI at Google Scale.arXiv preprint arXiv:2508.15734, 2025

    Elsworth, C., Huang, K., Patterson, D., Schneider, I., Sedivy, R., Goodman, S., Townsend, B., Ranganathan, P., Dean, J., Vahdat, A., Gomes, B., Manyika, J.: Measuring the environmental impact of delivering AI at Google Scale (Aug 2025). https://doi.org/10.48550/arXiv.2508.15734

  16. [16]

    Educational Technology & Society28(4), 166–182 (October 2025)

    Erdoğdu, F., Kara, M., Gökoğlu, S., Telci, S.: Trends and insights in cscl research from the emergence to the present: A review through bibliometric and latent dirichlet allocation analyses. Educational Technology & Society28(4), 166–182 (October 2025)

  17. [17]

    Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 2(2), 115–125 (2007)

    Fanning, R.M., Gaba, D.M.: The role of debriefing in simulation-based learning. Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 2(2), 115–125 (2007). https://doi.org/10.1097/SIH.0b013e3180315539

  18. [18]

    Educational Psychologist48(1), 9–24 (2013)

    Fransen, J., Weinberger, A., Kirschner, P.A.: Team effectiveness and team development in cscl. Educational Psychologist48(1), 9–24 (2013). https://doi.org/10.1080/00461520.2012.747947

  19. [19]

    In: Proceedings of the 14th Learning Analytics and Knowledge Conference

    Garg, R., Han, J., Cheng, Y., Fang, Z., Swiecki, Z.: Automated discourse analysis via generative artificial intelligence. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. p. 814–820. LAK ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3636555.3636879

  20. [20]

    TechTrends59(1), 64–71 (2015)

    Gašević, D., Dawson, S., Siemens, G.: Let’s not forget: Learning analytics are about learning. TechTrends59(1), 64–71 (2015). https://doi.org/10.1007/s11528- 014-0822-x

  21. [21]

    Transactions of the Association for Computational Linguistics11, 351–366 (Apr 2023)

    Gekhman, Z., Oved, N., Keller, O., Szpektor, I., Reichart, R.: On the robust- ness of dialogue history representation in conversational question answer- ing: A comprehensive study and a new prompt-based method. Transactions of the Association for Computational Linguistics11, 351–366 (Apr 2023). https://doi.org/10.1162/tacl_a_00549

  22. [22]

    doi: 10.1038/s41586-025-09422-z

    Guo, D., Yang, D., Zhang, H., Song, J., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature645(8081), 633–638 (Sep 2025). https://doi.org/10.1038/s41586-025-09422-z

  23. [23]

    In: Cress, U., Rosé, C., Wise, A.F., Oshima, J

    Hmelo-Silver, C.E., Jeong, H.: An overview of cscl methods. In: Cress, U., Rosé, C., Wise, A.F., Oshima, J. (eds.) International Handbook of Computer-Supported Col- laborative Learning, Computer-Supported Collaborative Learning Series, vol. 19. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65291-3_4

  24. [24]

    In: Proceedings of the Eleventh ACM Conference on Learning @ Scale

    Hutt, S., Hieb, G.: Scaling up mastery learning with generative ai: Explor- ing how generative ai can assist in the generation and evaluation of mas- tery quiz questions. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale. p. 310–314. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3657604.3664699

  25. [25]

    In: Proc

    Inie, N., Falk, J., Selvan, R.: How co2stly is chi? the carbon footprint of generative ai in hci research and what we should do about it. In: Proc. of the CHI Conference on Human Factors in Computing Systems (CHI ’25). pp. 1–29. ACM, New York, NY, USA (2025)

  26. [26]

    International Journal of Computer-Supported Collaborative Learning9(3), 305–334 (2014)

    Jeong, H., Hmelo-Silver, C.E., Yu, Y.: An examination of cscl methodological practices and the influence of theoretical frameworks 2005–2009. International Journal of Computer-Supported Collaborative Learning9(3), 305–334 (2014). https://doi.org/10.1007/s11412-014-9198-3

  27. [27]

    Proceedings of the VLDB Endowment18(7), 2172–2184 (Mar 2025)

    Ji, Z., Wang, X., Luo, Z., Xie, Z., Zhang, M.: Optimized Batch Prompting for Cost-Effective LLMs. Proceedings of the VLDB Endowment18(7), 2172–2184 (Mar 2025). https://doi.org/10.14778/3734839.3734853

  28. [28]

    In: Proceedings of the Eleventh ACM Conference on Learning @ Scale

    Jin, Y., Yu, J.: Optimizing mentor-student communication using llm-based auto- mated labeling information states. In: Proceedings of the Eleventh ACM Confer- ence on Learning @ Scale. p. 284–288. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3657604.3664691

  29. [29]

    Richard Landis and Gary G

    Landis, J.R., Koch, G.G.: The Measurement of Observer Agreement for Categorical Data. Biometrics33(1), 159 (Mar 1977). https://doi.org/10.2307/2529310

  30. [30]

    In: 2025 7th International Conference on Computer Science and Technologies in Education (CSTE)

    Li, M., Qin, W., Tang, Z., Fang, X., He, T., Cao, X.: Automating ssrl detec- tion in asynchronous ocl via llms. In: 2025 7th International Conference on Computer Science and Technologies in Education (CSTE). pp. 548–551 (2025). https://doi.org/10.1109/CSTE64638.2025.11092245

  31. [31]

    Future in Educational Researchn/a(n/a) (2025)

    Liao, J., Sun, F., Liu, Y., Hu, Y.: Deepseek in education: Exploring the transfor- mative potential of ai-driven educational intelligence. Future in Educational Researchn/a(n/a) (2025). https://doi.org/10.1002/fer3.70022

  32. [32]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transac- tions of the Association for Computational Linguistics12, 157–173 (02 2024). https://doi.org/10.1162/tacl_a_00638

  33. [33]

    Journal of Learning Analytics12(1), 169–185 (2025)

    Liu, X., Zambrano, A.F., Baker, R.S., Barany, A., Ocumpaugh, J., Zhang, J., Pankiewicz, M., Nasiar, N., Wei, Z.: Qualitative coding with gpt-4: Where it works better. Journal of Learning Analytics12(1), 169–185 (2025). https://doi.org/10.18608/jla.2025.8575

  34. [34]

    ACM Trans

    Martinez-Maldonado, R., Echeverria, V., Fernandez-Nieto, G., Yan, L., Zhao, L., Alfredo, R., Li, X., Dix, S., Jaggard, H., Wotherspoon, R., Osborne, A., Shum, S.B., Gašević, D.: Lessons learnt from a multimodal learning analytics de- ployment in-the-wild. ACM Trans. Comput.-Hum. Interact.31(1) (Nov 2023). https://doi.org/10.1145/3622784

  35. [35]

    Computers in Human Behavior 71, 327–342 (2017)

    Martinez-Maldonado, R., Goodyear, P., Carvalho, L., Thompson, K., Hernandez- Leo, D., Dimitriadis, Y., Prieto, L.P., Wardak, D.: Supporting collaborative design activity in a multi-user digital design ecology. Computers in Human Behavior 71, 327–342 (2017). https://doi.org/10.1016/j.chb.2017.01.055

  36. [36]

    In: Proceedings of the Seventh International Learning Analytics & Knowledge Conference

    Martinez-Maldonado, R., Power, T., Hayes, C., Abdiprano, A., Vo, T., Axisa, C., Buckingham Shum, S.: Analytics meet patient manikins: challenges in an authen- tic small-group healthcare simulation classroom. In: Proceedings of the Seventh International Learning Analytics & Knowledge Conference. p. 90–94. LAK ’17, ACM, New York, NY, USA (2017). https://doi...

  37. [37]

    Journal of Computer Assisted Learning36(5), 741–762 (2020)

    Martinez-Maldonado, R., Schulte, J., Echeverria, V., Gopalan, Y., Shum, S.B.: Where is the teacher? digital analytics for classroom proxemics. Journal of Computer Assisted Learning36(5), 741–762 (2020). https://doi.org/10.1111/jcal.12444

  38. [38]

    Biochemia medica22(3), 276–282 (2012).https://doi.org/10.11613/BM.2012.031

    McHugh, M.L.: Interrater reliability: The kappa statistic. Biochemia Medica22(3), 276–282 (2012). https://doi.org/10.11613/BM.2012.031

  39. [39]

    In: Proceedings of the Twelfth ACM Confer- ence on Learning @ Scale

    Mehta, S., Srivastava, N., Liu, X., Vanacore, K., Baker, R.S.: Do mooc conver- sations matter? investigating the role of social presence and course-relevant discussion in career advancement. In: Proceedings of the Twelfth ACM Confer- ence on Learning @ Scale. p. 236–240. L@S ’25, ACM, New York, NY, USA (2025). https://doi.org/10.1145/3698205.3733930 L@S ’...

  40. [40]

    Journal of Nursing Management17(2), 247–255 (Mar 2009)

    Miller, K., Riley, W., Davis, S.: Identifying key nursing and team behaviours to achieve high reliability. Journal of Nursing Management17(2), 247–255 (Mar 2009). https://doi.org/10.1111/j.1365-2834.2009.00978.x

  41. [41]

    In: Proceed- ings of the Eleventh ACM Conference on Learning @ Scale

    Moore, S., Schmucker, R., Mitchell, T., Stamper, J.: Automated generation and tagging of knowledge components from multiple-choice questions. In: Proceed- ings of the Eleventh ACM Conference on Learning @ Scale. p. 122–133. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3657604.3662030

  42. [42]

    In: Proceedings of the 13th International Conference on, Intelligent Systems Application to Power Systems

    Ngatchou, P., Zarei, A., El-Sharkawi, A.: Pareto multi objective op- timization. In: Proceedings of the 13th International Conference on, Intelligent Systems Application to Power Systems. pp. 84–91 (2005). https://doi.org/10.1109/ISAP.2005.1599245

  43. [43]

    In: Proceedings of the Eleventh ACM Conference on Learning @ Scale

    Nguyen, H., Stott, N., Allan, V.: Comparing feedback from large language mod- els and instructors: Teaching computer science at scale. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale. p. 335–339. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3657604.3664660

  44. [44]

    In: Proceedings of the Twelfth ACM Conference on Learning @ Scale

    Nie, A., Chandak, Y., Suzara, M., Malik, A., Woodrow, J., Peng, M., Sahami, M., Brunskill, E., Piech, C.: The gpt surprise: Offering large language model chat in a massive coding class reduced engagement but may increase adopters’ exam performances. In: Proceedings of the Twelfth ACM Conference on Learning @ Scale. p. 376–380. L@S ’25, ACM, New York, NY, ...

  45. [45]

    In: Proceedings of the Tenth ACM Conference on Learning @ Scale

    Ouhaichi, H., Spikol, D., Vogel, B.: Rethinking mmla: Design considerations for multimodal learning analytics systems. In: Proceedings of the Tenth ACM Conference on Learning @ Scale. p. 354–359. L@S ’23, ACM, New York, NY, USA (2023). https://doi.org/10.1145/3573051.3596186

  46. [46]

    In: Proceedings of the Third (2016) ACM Conference on Learning @ Scale

    Papathoma, T., Ferguson, R., Littlejohn, A., Coe, A.: Making the production of learning at scale more open and flexible. In: Proceedings of the Third (2016) ACM Conference on Learning @ Scale. p. 273–276. L@S ’16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2876034.2893432

  47. [47]

    In: Proceedings of the 40th International Conference on Machine Learning

    Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23 (2023)

  48. [48]

    In: Henriksen, K., Battles, J.B., Keyes, M.A., Grady, M.L

    Riley, W., Hansen, H., Gürses, A.P., Davis, S., Miller, K., Priester, R.: The nature, characteristics and patterns of perinatal critical events teams. In: Henriksen, K., Battles, J.B., Keyes, M.A., Grady, M.L. (eds.) Advances in Patient Safety: New Directions and Alternative Approaches (Vol. 3: Performance and Tools). Agency for Healthcare Research and Qu...

  49. [49]

    Journal of Continuing Education in the Health Professions32(4), 243–254 (2012)

    Rosen, M.A., Hunt, E.A., Pronovost, P.J., Federowicz, M.A., Weaver, S.J.: In Situ Simulation in Continuing Education for the Health Care Professions: A Systematic Review. Journal of Continuing Education in the Health Professions32(4), 243–254 (2012). https://doi.org/10.1002/chp.21152

  50. [50]

    American Psychologist73(4), 593–600 (2018)

    Salas, E., Reyes, D.L., McDaniel, S.H.: The science of teamwork: Progress, re- flections, and the road ahead. American Psychologist73(4), 593–600 (2018). https://doi.org/10.1037/amp0000334

  51. [51]

    In: Pro- ceedings of the 16th International Learning Analytics & Knowledge Conference (LAK ’26)

    Samaraweera, S., Zhao, L., Echeverria, V., Alfredo, R., Chen, G., Davis, J., Leonny, S., Sevenhuysen, S., Connell, C., Gasevic, D., Martinez-Maldonado, R., Dhar- maratne, A.: From formal learning to professional practice: Automated llm-based coding and visualisation of team dialogue in in-situ healthcare simulation. In: Pro- ceedings of the 16th Internati...

  52. [52]

    Communication & Medicine13(1), 1–7 (2017)

    Sarangi, S.: Editorial: Team work and team talk as distributed and coordinated action in healthcare delivery. Communication & Medicine13(1), 1–7 (2017). https://doi.org/10.1558/cam.32569

  53. [53]

    JMIR Medical Informatics12, e55318 (Apr 2024)

    Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S., Wang, Y.: An Empirical Evaluation of Prompting Strategies for Large Language Mod- els in Zero-Shot Clinical Natural Language Processing: Algorithm Develop- ment and Validation Study. JMIR Medical Informatics12, e55318 (Apr 2024). https://doi.org/10.2196/55318

  54. [54]

    Southwell, R., Pugh, S., E. Margaret Perkoff, Clevenger, C., Bush, J., Lieber, R., Ward, W., Foltz, P., D’Mello, S.: Challenges and Feasibility of Auto- matic Speech Recognition for Modeling Student Collaborative Discourse in Classrooms. In: Mitrovic, A., Bosch, N. (eds.) Proceedings of the 15th International Conference on Educational Data Mining. Zenodo ...

  55. [55]

    Stadler, W.: Multicriteria Optimization in Engineering and in the Sciences, vol. 37. Springer Science & Business Media (1988)

  56. [56]

    Stahl, G.: Group Cognition: Computer Support for Building Collaborative Knowledge. MIT Press, Cambridge, MA (2006). https://doi.org/10.7551/mitpress/3372.001.0001

  57. [57]

    Journal of Medical Internet Research 27, e58744 (Feb 2025)

    Stenseth, H.V., Steindal, S.A., Solberg, M.T., Ølnes, M.A., Sørensen, A.L., Strandell-Laine, C., Olaussen, C., Farsjø Aure, C., Pedersen, I., Zlamal, J., Gue Martini, J., Bresolin, P., Linnerud, S.C.W., Nes, A.A.G.: Simulation-Based Learning Supported by Technology to Enhance Critical Thinking in Nursing Students: Scoping Review. Journal of Medical Int...

  58. [58]

    Sun, R., Marshall, D.C., Sykes, M.C., Maruthappu, M., Shalhoub, J.: The impact of improving teamwork on patient outcomes in surgery: A systematic review. International Journal of Surgery 53, 171–177 (2018). https://doi.org/10.1016/j.ijsu.2018.03.044

  59. [59]

    Tscholl, D.W., Ebensperger, M., Rahrisch, A., Wang, H., Heckel, H., Thomasius, M., Kaserer, A., Grande, B., Seelandt, J.C., Kolbe, M.: Generative AI in simulation debriefings: An exploratory study using the Team-FIRST framework and qualitative feedback from simulation experts and learners. Advances in Simulation (Jan 2026). https://doi.org/10.11...

  60. [60]

    Wang, D., Chen, G.: Evaluating the use of bert and llama to analyse classroom dialogue for teachers’ learning of dialogic pedagogy. British Journal of Educational Technology 56(6), 2671–2704 (2025). https://doi.org/10.1111/bjet.13604

  61. [61]

    Wang, D., Tao, Y., Chen, G.: Artificial intelligence in classroom discourse: A systematic review of the past decade. International Journal of Educational Research 123, 102275 (2024)

  62. [62]

    Wang, D., Yang, C., Chen, G.: Using lora to fine-tune large language models for analyzing collaborative argumentation in classrooms. In: Proceedings of the Twelfth ACM Conference on Learning @ Scale. p. 207–211. L@S ’25, ACM, New York, NY, USA (2025). https://doi.org/10.1145/3698205.3733924

  63. [63]

    White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt. In: Proceedings of the 30th Conference on Pattern Languages of Programs. PLoP ’23, The Hillside Group, USA (2023)

  64. [64]

    Yan, L., Gašević, D., Echeverria, V., Zhao, L., Jin, Y., Li, X., Martinez-Maldonado, R.: In Sync or Out of Sync? Understanding Stress and Learning Performance in Collaborative Healthcare Simulations through Physiological Synchrony and Arousal. International Journal of Artificial Intelligence in Education 35(4), 2421–2452 (Dec 2025). https://doi.org/10.100...

  65. [65]

    Yang, H., Alozie, N., Rachmatullah, A.: Collaboration at scale: Exploring member role changing patterns in collaborative science problem-solving tasks. In: Proceedings of the Ninth ACM Conference on Learning @ Scale. p. 309–312. L@S ’22, ACM, New York, NY, USA (2022). https://doi.org/10.1145/3491140.3528319

  66. [66]

    You, J., Chung, J.W., Chowdhury, M.: Zeus: Understanding and optimizing GPU energy consumption of DNN training. In: USENIX NSDI (2023)

  67. [67]

    Zambrano, A.F., Liu, X., Barany, A., Baker, R.S., Kim, J., Nasiar, N.: From ncoder to chatgpt: From automated coding to refining human coding. In: Arastoopour Irgens, G., Knight, S. (eds.) Advances in Quantitative Ethnography, Communications in Computer and Information Science, vol. 1895. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-47014-1_32

  68. [68]

    Zhang, A.G., Tang, X., Oney, S., Chen, Y.: Cflow: Supporting semantic flow analysis of students’ code in programming problems at scale. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale. p. 188–199. L@S ’24, ACM, New York, NY, USA (2024). https://doi.org/10.1145/3657604.3662025

  69. [69]

    Zhao, L., Echeverria, V., Swiecki, Z., Yan, L., Alfredo, R., Li, X., Gasevic, D., Martinez-Maldonado, R.: Epistemic network analysis for end-users: Closing the loop in the context of multimodal analytics for collaborative team learning. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. p. 90–100. LAK ’24, ACM, New York, NY, USA ...

  70. [70]

    Zhao, L., Gašević, D., Swiecki, Z., Li, Y., Lin, J., Sha, L., Yan, L., Alfredo, R., Li, X., Martinez-Maldonado, R.: Towards automated transcribing and coding of embodied teamwork communication through multimodal learning analytics. British Journal of Educational Technology 55(4), 1673–1702 (Jul 2024). https://doi.org/10.1111/bjet.13476

  71. [71]

    Zhao, L., Swiecki, Z., Gasevic, D., Yan, L., Dix, S., Jaggard, H., Wotherspoon, R., Osborne, A., Li, X., Alfredo, R., Martinez-Maldonado, R.: Mets: Multimodal learning analytics of embodied teamwork learning. In: LAK23: 13th International Learning Analytics and Knowledge Conference. p. 186–196. LAK2023, ACM, New York, NY, USA (2023). https://doi.org/10.11...

  72. [72]

    Zhu, Y., Gu, J.C., Sikora, C., Ko, H., Liu, Y., Lin, C.C., Shu, L., Luo, L., Meng, L., Liu, B., Chen, J.: Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection (2024). https://doi.org/10.48550/ARXIV.2405.16178

Received 16 February 2026