pith. machine review for the scientific record.

arxiv: 2604.19598 · v2 · submitted 2026-04-21 · 💻 cs.CL · cs.AI

Recognition: unknown

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · exercise prescriptions · output consistency · repeated generation · semantic similarity · clinical AI · model evaluation · temperature zero

The pith

Three large language models generate exercise prescriptions with fundamentally different consistency patterns even under identical temperature-zero settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether three LLMs produce stable exercise prescriptions when the same six clinical scenarios are each prompted twenty times. It measures the semantic similarity of outputs, how often text is repeated verbatim, the stability of FITT (frequency, intensity, time, type) elements, and the presence of safety language. GPT-4.1 yields high similarity through varied phrasing, Gemini reaches similar scores mainly by repeating the same text, and Claude shows lower similarity. Safety language appears at maximum levels in every model and so does not distinguish them. A reader should care because exercise advice for patients requires reliable content rather than surface-level sameness or rote duplication.

Core claim

Under temperature=0, GPT-4.1 produced 100% unique outputs with a mean semantic similarity of 0.955; Gemini 2.5 Flash produced only 27.5% unique outputs while reaching 0.950 similarity through repetition; and Claude Sonnet 4.6 scored 0.903 similarity. Safety expressions reached ceiling levels across all three models. Together, these patterns demonstrate that identical decoding settings produce distinct consistency profiles that single-output evaluation cannot detect.

What carries the argument

Repeated generation protocol under fixed temperature=0, tracking semantic similarity, output uniqueness rate, FITT classification stability, and safety expression across 360 total outputs.
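The protocol's two load-bearing metrics, uniqueness rate and mean pairwise semantic similarity, can be sketched in a few lines. This is a dependency-free illustration, not the paper's pipeline: the paper uses SBERT embeddings (Reimers & Gurevych, 2019), while the bag-of-words embedder below is a stand-in so the sketch runs without external models.

```python
from collections import Counter
from itertools import combinations
import math

def embed(text):
    # Stand-in for a sentence embedder; the paper uses SBERT.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def consistency_profile(outputs):
    """Uniqueness rate and mean pairwise similarity for one scenario-model cell."""
    unique_rate = len(set(outputs)) / len(outputs)
    embs = [embed(o) for o in outputs]
    sims = [cosine(a, b) for a, b in combinations(embs, 2)]
    return unique_rate, sum(sims) / len(sims)

# A duplicating model and a paraphrasing model can reach similarly high
# mean similarity with very different uniqueness rates:
dup = ["walk 30 min daily at moderate intensity"] * 20
para = [f"walk 30 min daily at moderate intensity (plan {i})" for i in range(20)]
print(consistency_profile(dup))   # low uniqueness, perfect similarity
print(consistency_profile(para))  # full uniqueness, high similarity
```

The demo at the bottom reproduces the paper's central contrast in miniature: identical texts score a 5% uniqueness rate with similarity 1.0, while twenty paraphrases score 100% uniqueness with similarity still close to 1.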

If this is right

  • Model selection for LLM-based exercise prescription tools must account for repeated-generation behavior rather than single-output quality alone.
  • Single-prompt evaluations cannot detect whether high semantic similarity arises from stable reasoning or from text duplication.
  • Safety expression metrics reach ceiling levels and provide no differentiation between models.
  • Output consistency under repetition should be treated as a core requirement for reliable clinical deployment of such systems.
  • Model choice in this domain functions as a clinical decision with direct implications for patient-facing advice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation pipelines for medical LLMs should routinely include repeated sampling to surface hidden repetition or drift.
  • The distinct profiles may reflect differences in training or alignment that could be diagnosed through targeted ablation studies on other clinical text tasks.
  • Parallel use of multiple models and cross-checking their outputs could serve as a practical safeguard when deploying any one model for exercise planning.
  • The same repeated-generation test could be applied to other medical generation domains such as dietary plans or rehabilitation protocols to check generalizability.

Load-bearing premise

That the chosen semantic similarity metric and uniqueness rate accurately reflect clinical reliability of the generated exercise prescriptions rather than merely stylistic differences.

What would settle it

If expert clinicians reviewing the full set of repeated outputs judge them clinically equivalent in safety and appropriateness across all three models, the claim that these consistency profiles affect deployment reliability would be undermined.

Figures

Figures reproduced from arXiv: 2604.19598 by Kihyuk Lee.

Figure 1
Figure 1: Study design overview. Six clinical scenarios were used, comprising three cases from a previous study (Choi et al., 2026) and three additional cases introduced in Lee et al. (2026), covering a range of prescription contexts. [figure image omitted]
original abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
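The inter-model difference is reported as a Kruskal–Wallis test (H = 458.41, p < .001). As a reference point for how that statistic is built, here is a minimal H computation. It omits the tie correction and p-value that a full implementation (e.g. `scipy.stats.kruskal`) provides, so it is a sketch of the statistic, not a reproduction of the paper's analysis.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, ignoring ties (illustrative only)."""
    # Pool all observations, rank them, and sum ranks per group.
    pooled = sorted((value, gi) for gi, group in enumerate(groups)
                    for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    return 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# Fully separated similarity-score groups (hypothetical values, not the
# paper's data) maximize H; overlapping groups pull it toward zero.
print(kruskal_h([0.90, 0.91, 0.92], [0.94, 0.95, 0.96], [0.97, 0.98, 0.99]))
```

With 120 similarity scores per model rather than three, well-separated distributions drive H into the hundreds, which is the regime the abstract reports.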

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically compares the consistency of exercise prescription outputs generated by three LLMs (GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash) under identical temperature=0 settings. For six clinical scenarios each generated 20 times (360 outputs total), it reports mean semantic similarity scores (GPT-4.1: 0.955; Gemini: 0.950; Claude: 0.903) with significant inter-model differences (H=458.41, p<.001), alongside uniqueness rates (GPT-4.1: 100% unique; Gemini: 27.5% unique) and ceiling-level safety expression across models. The central claim is that these metrics reveal fundamentally different generative behaviors not detectable in single-output evaluations, implying that repeated-generation consistency should be a core criterion for clinical deployment and that model selection is a clinical decision.

Significance. If the quantitative distinctions hold under full methodological disclosure, the work provides a useful empirical demonstration that fixed decoding parameters can produce divergent consistency profiles across LLMs, with value for AI safety and reliability research in healthcare applications. It correctly distinguishes semantic stability from output duplication and reports clear statistical separation, strengthening the case for repeated-sampling protocols over single-shot assessments.

major comments (2)
  1. [Abstract] Abstract: The conclusion that 'model selection constitutes a clinical rather than merely technical decision' and that repeated-generation consistency 'should be treated as a core criterion for reliable deployment' is not supported by the presented evidence, as the study reports no expert clinical review, outcome validation, or assessment of whether the observed differences in uniqueness or semantic content affect prescription safety, efficacy, or patient suitability.
  2. [Abstract] Abstract and results: The distinction between GPT-4.1 (high similarity with 100% unique outputs) and Gemini (high similarity from 27.5% unique outputs due to repetition) is load-bearing for the claim that single-output evaluations miss key behaviors, yet the manuscript provides no details on the embedding model, similarity threshold, or exact uniqueness detection method (e.g., string match vs. semantic), preventing assessment of whether these metrics capture clinically relevant consistency rather than stylistic templating.
minor comments (2)
  1. The model names (GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash) should be verified against current official designations and version dates for reproducibility.
  2. [Abstract] The abstract mentions analysis across four dimensions (semantic similarity, output reproducibility, FITT classification, and safety expression) but reports quantitative results primarily on similarity and uniqueness; a brief summary of FITT findings would improve completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript to improve methodological transparency and moderate interpretive claims where the evidence is limited.

point-by-point responses
  1. Referee: [Abstract] Abstract: The conclusion that 'model selection constitutes a clinical rather than merely technical decision' and that repeated-generation consistency 'should be treated as a core criterion for reliable deployment' is not supported by the presented evidence, as the study reports no expert clinical review, outcome validation, or assessment of whether the observed differences in uniqueness or semantic content affect prescription safety, efficacy, or patient suitability.

    Authors: We agree that the original abstract language overstated the direct clinical implications. The study demonstrates measurable differences in generative consistency under repeated sampling but does not include expert review, outcome data, or validation of clinical impact. We have revised the abstract to state that the observed profiles 'suggest that model selection may carry clinical implications' and that repeated-generation consistency 'merits consideration in deployment decisions,' rather than asserting it as a core criterion. We have also added an explicit limitations paragraph noting the absence of clinical validation. revision: yes

  2. Referee: [Abstract] Abstract and results: The distinction between GPT-4.1 (high similarity with 100% unique outputs) and Gemini (high similarity from 27.5% unique outputs due to repetition) is load-bearing for the claim that single-output evaluations miss key behaviors, yet the manuscript provides no details on the embedding model, similarity threshold, or exact uniqueness detection method (e.g., string match vs. semantic), preventing assessment of whether these metrics capture clinically relevant consistency rather than stylistic templating.

    Authors: The referee is correct that the original manuscript omitted key methodological details required to evaluate the metrics. We have added a new subsection in the Methods section describing the embedding model, the cosine similarity threshold and its rationale, and the precise uniqueness detection procedure (combining normalized string matching with semantic checks). These additions clarify how semantic stability was distinguished from textual duplication and allow readers to assess the clinical relevance of the reported differences. revision: yes
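A normalized exact-match check of the kind the rebuttal describes might look like the sketch below. The function names and the normalization steps (whitespace collapsing, lowercasing) are assumptions for illustration, not the paper's actual code, and the semantic-check half of the described procedure is omitted.

```python
import re

def normalize(text: str) -> str:
    # Collapse whitespace and case so trivially reformatted outputs
    # count as duplicates (assumed preprocessing, not the paper's rule).
    return re.sub(r"\s+", " ", text.strip().lower())

def uniqueness_rate(outputs: list[str]) -> float:
    """Share of outputs that remain distinct after normalization."""
    return len({normalize(o) for o in outputs}) / len(outputs)

outputs = ["Walk  30 min/day.", "walk 30 min/day.", "Cycle 20 min/day."]
print(uniqueness_rate(outputs))  # 2 distinct texts out of 3
```

The design choice matters for the headline numbers: a looser normalization (or a semantic near-duplicate check) would push Gemini's 27.5% uniqueness rate even lower, while a strict byte-level comparison gives the most conservative duplicate count.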

Circularity Check

0 steps flagged

No circularity: purely empirical repeated-generation comparison

full rationale

The paper reports an empirical study that generates 360 exercise-prescription outputs (20 repetitions per scenario across 6 scenarios and 3 models) under temperature=0, then measures semantic similarity, uniqueness rate, FITT classification, and safety expression directly from those outputs. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. All central claims (e.g., GPT-4.1 100% unique vs. Gemini 27.5% unique despite similar mean similarity) rest on observed data rather than any reduction to prior inputs by construction. This is the expected non-finding for a straightforward empirical comparison study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new theoretical constructs or fitted parameters. It relies on standard LLM generation parameters and established metrics for text similarity and classification.

axioms (2)
  • domain assumption Semantic similarity via embedding-based metrics accurately captures equivalence of clinical exercise prescriptions
    Central to comparing outputs across models and interpreting high similarity scores.
  • standard math Temperature=0 setting produces deterministic outputs without stochastic variation
    Standard assumption for controlled repeated generation experiments in LLMs.

pith-pipeline@v0.9.0 · 5559 in / 1305 out tokens · 39164 ms · 2026-05-10T03:17:55.540079+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 37 canonical work pages · 3 internal anchors

  1. [1]

    Introduction The rapid advancement of large language models (LLMs) has opened new possibilities in healthcare and health management (Raza et al., 2024; Meng et al., 2024). LLMs can generate contextually appropriate text based on user prompts, and their potential applications have been discussed across a range of domains including clinical consultation su...

  2. [2]


    Materials and Methods 2.1. Study Design This study employed an experimental observational design to compare repeated generation consistency of exercise prescription outputs across three LLMs. Identical clinical scenarios and prompts were submitted to each model under controlled conditions, and intra-model consistency and inter-model differences were quant...

  3. [3]

    Output reproducibility was evaluated by exact match comparison of preprocessed texts within each scenario-model condition, with unique output counts and proportions (%) calculated by scenario and model. 2.7. Statistical Analysis Descriptive statistics are presented as mean ± standard deviation. Non-parametric tests were applied throughout given the smal...

  4. [4]

    Results 3.1 Intra-model Semantic Consistency (RQ1) Overall mean semantic similarity was highest for GPT-4.1 (Mean = 0.955, SD = 0.028), followed by Gemini-2.5-Flash (Mean = 0.950, SD = 0.070) and Claude-Sonnet-4.6 (Mean = 0.903, SD = 0.071). All three models showed significant variation in consistency across scenarios (all p < 0.001), with detailed pairw...

  5. [5]

    Discussion Previous studies on LLM-based exercise prescription have largely focused on single-model evaluations, and cross-model differences in output characteristics have not been adequately examined. Although a previous study confirmed the intra-model consistency of Gemini under repeated generation conditions (Lee et al., 2026), whether similar patte...

  6. [6]

    Conclusion The present study confirmed that LLM-generated exercise prescription outputs differ markedly across models in terms of semantic consistency and output reproducibility, even under identical conditions. While GPT-4.1 achieved both textual diversity and semantic consistency, the output repetition observed in Gemini-2.5-Flash and the semantic v...

  7. [7]

    Akrimi, S., Schwensfeier, L., Düking, P., Kreutz, T., & Brinkmann, C. (2025). ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus: Assessment of their safety and other quality criteria by coaching experts. Sports, 13(4), 92. https://doi.org/10.3390/sports13040092

  8. [8]

    American College of Sports Medicine. (2024). ACSM's guidelines for exercise testing and prescription (12th ed.). Wolters Kluwer

  9. [9]

    Atil, B., Aykent, S., Chittams, A., Fu, L., Passonneau, R. J., Radcliffe, E., Rajagopal, G. R., Sloan, A., Tudrej, T., Ture, F., Wu, Z., Xu, L., & Baldwin, B. (2025). Non-determinism of “deterministic” LLM system settings in hosted environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems (pp. 135–148). Association for ...

  10. [10]

    Aydin, S., Karabacak, M., Vlachos, V., & Margetis, K. (2024). Large language models in patient education: A scoping review of applications in medicine. Frontiers in Medicine, 11, 1477898. https://doi.org/10.3389/fmed.2024.1477898

  11. [11]

    Bishop, D. J., Beck, B., Biddle, S. J. H., Denay, K. L., Ferri, A., Gibala, M. J., Headley, S., Jones, A. M., Jung, M., Lee, M. J.-C., Moholdt, T., Newton, R. U., Nimphius, S., Pescatello, L. S., Saner, N. J., & Tzarimas, C. (2025). Physical activity and exercise intensity terminology: A joint ACSM expert statement and ESSA consensus statement. Medicine &...

  12. [12]

    Choi, M., Park, J., Lee, M., Beom, J., Jung, S. Y., & Lee, K. (2026). AI-generated exercise prescriptions for at-risk populations: Safety and feasibility of a large language model assessed by expert evaluation. Journal of Clinical Medicine, 15(6). https://doi.org/10.3390/jcm15062457

  14. [14]

    Croxford, E., Gao, Y., First, E., Pellegrino, N., Schnier, M., Caskey, J., Oguss, M., Wills, G., Chen, G., Dligach, D., Churpek, M. M., Mayampurath, A., Liao, F., Goswami, C., Wong, K. K., Patterson, B. W., & Afshar, M. (2025). Evaluating clinical AI summaries with large language models as judges. npj Digital Medicine, 8. https://doi.org/10.1038/s41746-025-01648-5

  16. [16]

    Currier, B. S., D’Souza, A. C., Fiatarone Singh, M. A., Lowisz, C. V., Rawson, E. S., Schoenfeld, B. J., Smith-Ryan, A. E., Steen, J. P., Thomas, G. A., Triplett, N. T., Washington, T. A., Werner, T. J., & Phillips, S. M. (2026). American College of Sports Medicine position stand: Resistance training prescription for muscle function, hypertrophy, and phy...

  17. [17]

    Dergaa, I., Ben Saad, H., El Omri, A., Glenn, J. M., Clark, C. C. T., Washif, J. A., Guelmami, N., Hammouda, O., Al-Horani, R. A., Reynoso-Sánchez, L. F., Romdhani, M., Paineiras-Domingos, L. L., Vancini, R. L., Taheri, M., Mataruna-Dos-Santos, L. J., Trabelsi, K., Chtourou, H., Zghibi, M., Eken, Ö., Swed, S., Ben Aissa, M., Shawki, H. H., El-Seedi, H. R...

  18. [18]

    Düking, P., Sperlich, B., Voigt, L., Van Hooren, B., Zanini, M., & Zinner, C. (2024). ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information. Journal of Sports Science and Medicine, 23, 56–65.

  19. [19]

    Enichen, E. J., Young, C. C., & Frates, E. P. (2025). The potential of AI to create personalized exercise plans. Health Promotion Practice. Advance online publication. https://doi.org/10.1177/15248399251394695

  20. [20]

    Festa, R. R., Jofré-Saldía, E., Candia, A. A., Monsalves-Álvarez, M., Flores-Opazo, M., Peñailillo, L., Marzuca-Nassr, G. N., Aguilar-Farias, N., Fritz-Silva, N., & Cancino-Lopez, J. (2023). Next steps to advance general physical activity recommendations towards physical exercise prescription: A narrative review. BMJ Open Sport & Exercise Medicine, 9, e00...

  21. [21]

    Garber, C. E., Blissmer, B., Deschenes, M. R., Franklin, B. A., Lamonte, M. J., Lee, I.-M., Nieman, D. C., & Swain, D. P. (2011). American College of Sports Medicine position stand: Quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults. Medicine & Science in ...

  22. [22]

    He, Z., Wang, J., Zhang, B., & Li, Y. (2026). Knowledge-grounded large language model for personalized sports training plan generation. Scientific Reports, 16, 6793. https://doi.org/10.1038/s41598-026-37075-z

  23. [23]

    Kim, B., Kang, J., Jung, Y. J., & Ahn, J. (2026). Generative and large-scale artificial intelligence in exercise and sports medicine: A narrative review. The Asian Journal of Kinesiology, 28(1), 58–72. https://doi.org/10.15758/ajk.2026.28.1.58

  24. [24]

    Kim, J. H. (2026). Automated prescription of therapeutic exercise for shoulder impingement syndrome using literature-driven rule generation architecture. Musculoskeletal Science and Practice, 76, 103520. https://doi.org/10.1016/j.msksp.2026.103520

  25. [25]

    Lai, X., Chen, J., Lai, Y., Huang, S., Cai, Y., Sun, Z., Wang, X., Pan, K., Gao, Q., & Huang, C. (2025a). Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review. JMIR Medical Informatics, 13, e59309. https://doi.org/10.2196/59309

  26. [26]

    Lai, X., Lai, Y., Chen, J., Huang, S., Gao, Q., & Huang, C. (2025b). Evaluation strategies for large language model-based models in exercise and health coaching: Scoping review. Journal of Medical Internet Research, 27, e79217. https://doi.org/10.2196/79217

  27. [27]

    Lai, X., Lai, Y., Chen, J., Huang, S., Gao, Q., & Huang, C. (2026). An AI-assisted adaptive boolean rubric for exercise prescription evaluation: A pilot validation study. International Journal of Medical Informatics, 207, 106202. https://doi.org/10.1016/j.ijmedinf.2025.106202

  28. [28]

    Lee, K. (2026). Consistency of AI-generated exercise prescriptions: A repeated generation study using a large language model. arXiv preprint arXiv:2604.11287. https://arxiv.org/abs/2604.11287

  29. [29]

    Li, G., Li, H., Su, Y., Li, Y., Jiang, S., & Zhang, G. (2025). GPT-4 as a virtual fitness coach: A case study assessing its effectiveness in providing weight loss and fitness guidance. BMC Public Health, 25, 2466. https://doi.org/10.1186/s12889-025-23666-6

  30. [30]

    Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., & Liu, Y. (2024). LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579. https://arxiv.org/abs/2412.05579

  31. [31]

    Meng, X., Yan, X., Zhang, K., Liu, D., Cui, X., Yang, Y., Zhang, M., Cao, C., Wang, J., Wang, X., Gao, J., Wang, Y.-G.-S., Ji, J.-M., Qiu, Z., Li, M., Qian, C., Guo, T., Ma, S., Wang, Z., . . . Tang, Y.-D. (2024). The application of large language models in medicine: A scoping review. iScience, 27, 109713. https://doi.org/10.1016/j.isci.2024.109713

  32. [32]

    Nduka, T. C., Ndakotsu, A., Nriagu, V. C., Karikalan, S., Abdulkareem, L., Omede, F. O., & Bob-Manuel, T. (2025). AI-generated diet and exercise recommendations for cardiovascular health compared to established cardiology society guidelines. Cureus, 17(8), e90968. https://doi.org/10.7759/cureus.90968

  33. [33]

    Negra, Y., Sammoud, S., Bouguezzi, R., Markov, A., Capranica, L., Müller, P., & Chaabene, H. (2026). Effects of a ChatGPT-generated eccentric training programme on speed, change of direction, agility, and jumping performance in U14 tennis players: A non-randomised controlled study. Journal of Sports Sciences. Advance online publication. https://doi.org/1...

  34. [34]

    Philuek, P., Kusump, S., Sathianpoonsook, T., Jansupom, C., Sawanyawisuth, P., Sawanyawisuth, K., & Chainarong, A. (2025). The effects of chat GPT generated exercise program in healthy overweight young adults: A pilot study. Journal of Human Sport and Exercise, 20, 169–179. https://doi.org/10.14198/jhse.2025.201.15

  35. [35]

    Puce, L., Bragazzi, N. L., Currà, A., & Trompetto, C. (2025). Harnessing generative artificial intelligence for exercise and training prescription: Applications and implications in sports and physical activity—A systematic literature review. Applied Sciences, 15(7), 3497. https://doi.org/10.3390/app15073497

  36. [36]

    Raza, M. M., Venkatesh, K. P., & Kvedar, J. C. (2024). Generative AI and large language models in health care: Pathways to implementation. npj Digital Medicine, 7, 62. https://doi.org/10.1038/s41746-023-00988-4

  37. [37]

    Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). https://doi.org/10.18653/v1/D19-1410

  38. [38]

    Schoenfeld, B. J., Grgic, J., Van Every, D. W., & Plotkin, D. L. (2021). Loading recommendations for muscle strength, hypertrophy, and local endurance: A re-examination of the repetition continuum. Sports, 9(2), 32. https://doi.org/10.3390/sports9020032

  39. [39]

    Schütze, K., Shehatha, R., Beer, K., Needham, M., Smith, T., Bagg, M., Doverty, A., & Cooper, I. (2026). Evaluating ChatGPT’s advice and recommendations regarding exercise for people with inclusion body myositis. Neuromuscular Disorders, 62, 106418. https://doi.org/10.1016/j.nmd.2026.106418

  40. [40]

    Shin, D., Hsieh, G., & Kim, Y. H. (2025). PlanFitting: Personalized exercise planning with large language model-driven conversational agent. In Proceedings of the 7th ACM Conference on Conversational User Interfaces (CUI ’25). https://doi.org/10.1145/3719160.3736607

  41. [41]

    Shyr, C., Ren, B., Hsu, C.-Y., Yan, C., Tinker, R. J., Cassini, T. A., Hamid, R., Wright, A., Bastarache, L., Peterson, J. F., Malin, B. A., & Xu, H. (2025). A statistical framework for evaluating the repeatability and reproducibility of large language models. medRxiv. https://doi.org/10.1101/2025.08.06.25333170

  42. [42]

    Song, Y., Wang, G., Li, S., & Lin, B. Y. (2024). The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism. arXiv preprint arXiv:2407.10457. https://arxiv.org/abs/2407.10457

  43. [43]

    Washif, J., Pagaduan, J., James, C., Dergaa, I., & Beaven, C. (2024). Artificial intelligence in sport: Exploring the potential of using ChatGPT in resistance training prescription. Biology of Sport, 41(2), 209–220. https://doi.org/10.5114/biolsport.2024.132987

  44. [44]

    Wataoka, K., Takahashi, T., & Ri, R. (2024). Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819. https://arxiv.org/abs/2410.21819

  45. [45]

    Zaleski, A. L., Berkowsky, R., Craig, K. J. T., & Pescatello, L. S. (2024). Comprehensiveness, accuracy, and readability of exercise recommendations provided by an AI-based chatbot: Mixed methods study. JMIR Medical Education, 10, e51308. https://doi.org/10.2196/51308

  46. [46]

    Zhang, Y.-F., & Liu, X.-Q. (2024). Using ChatGPT to promote college students’ participation in physical activities and its effect on mental health. World Journal of Psychiatry, 14, 330–341. https://doi.org/10.5498/wjp.v14.i2.330

  47. [47]

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2306.05685

  48. [48]

    Supplementary Material: Prompt Template The following prompt template was used to generate exercise prescriptions for all six clinical scenarios. The placeholder [CLINICAL CASE] was replaced with the corresponding clinical case description for each scenario. INSTRUCTION_TEMPLATE = """Based on [CLINICAL CASE], please develop a 12-week exercise program. Ensure that the plan adhere...

  49. [49]

    Clinical Case Descriptions Used as Input The following clinical case descriptions were substituted into the [CLINICAL CASE] placeholder of the prompt template above. All cases are hypothetical and were constructed without the use of real patient information. Case 1. Type 2 Diabetes Mellitus + Obesity Participant Profile Male, 55 years old, 7-year history ...

  50. [50]

    FITT Structural Classification 1.1 Prompt Used for FITT Classification (Claude Sonnet 4.6) The following prompt was submitted to Claude Sonnet 4.6 (Anthropic) for each of the 120 preprocessed outputs. The placeholder [EXERCISE PRESCRIPTION TEXT] was replaced with the corresponding output text. FITT_PROMPT = """Classify the FITT components based on the ini...

  51. [51]

    Safety Expression Consistency Evaluation 2.1 Prompt Used for Safety Evaluation (Claude Sonnet 4.6) The following prompt was submitted to Claude Sonnet 4.6 (Anthropic) for binary inclusion assessment of safety-related expressions in each output. SAFETY_PROMPT = """Evaluate the presence or absence of safety-related expressions in the following exercise pres...

  52. [52]

    Preprocessing Prompt 3.1 Prompt Used for Output Preprocessing (Claude Sonnet 4.6) Prior to SBERT-based semantic similarity analysis, a standardized preprocessing prompt was applied to all 120 raw outputs using Claude Sonnet 4.6 to extract only the exercise prescription body, excluding formatting elements such as greetings, closing remarks, tables, and bul...