arxiv: 2502.07963 · v4 · submitted 2025-02-11 · 💻 cs.CL · cs.AI

Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

Pith reviewed 2026-05-23 03:10 UTC · model grok-4.3

keywords: large language models · spin · medical literature · bias susceptibility · evidence synthesis · prompting · LLM evaluation · abstracts

The pith

Large language models are more susceptible to spin in medical abstracts than humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical abstracts often present equivocal trial results in an overly positive way, a practice known as spin. This paper tests whether LLMs, which are increasingly used to summarize medical evidence, interpret results differently when spin is present than when it is absent. An evaluation across 22 models shows that they are more affected by spin than human readers and may carry it into the summaries they generate. The models can nevertheless detect spin, and prompts can limit its influence on their outputs. This matters because biased LLM outputs could affect how evidence reaches clinicians and patients.

Core claim

We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

What carries the argument

Direct comparison of LLM and human answers on question-answering and summarization tasks using original versus spun versions of the same medical trial abstracts.
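
The comparison described above can be sketched as a paired difference: each reader scores the spun and unspun versions of the same abstract, and the spin effect is the mean of (spun minus unspun). This is a minimal sketch rather than the paper's analysis code; the function name, the 0-10 rating scale, and all scores below are invented for illustration, and the interval uses a simple normal approximation.

```python
import math
import statistics


def spin_effect(spun_scores, unspun_scores):
    """Mean paired difference (spun minus unspun) with an approximate 95% CI."""
    if len(spun_scores) != len(unspun_scores):
        raise ValueError("scores must be paired per abstract")
    diffs = [s - u for s, u in zip(spun_scores, unspun_scores)]
    mean = statistics.mean(diffs)
    # Standard error of the mean difference; 1.96 is the normal 95% quantile.
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = 1.96 * sem
    return mean, (mean - half_width, mean + half_width)


# Hypothetical ratings for the same five abstracts (invented 0-10 scale).
human_spun, human_unspun = [5, 6, 4, 5, 6], [4, 5, 4, 5, 5]
llm_spun, llm_unspun = [8, 9, 7, 8, 9], [4, 5, 3, 5, 4]

human_mean, human_ci = spin_effect(human_spun, human_unspun)
llm_mean, llm_ci = spin_effect(llm_spun, llm_unspun)
print(f"human spin effect: {human_mean:.2f}, 95% CI {human_ci}")
print(f"LLM spin effect:   {llm_mean:.2f}, 95% CI {llm_ci}")
```

A larger positive value means the reader rated the spun version as more favorable than the unspun one; the claim under test is that this value is larger for LLMs than for human experts.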

Load-bearing premise

The chosen abstracts and the particular spin changes applied to them represent the spin that LLMs will meet in real medical literature.

What would settle it

A larger study of naturally occurring spin across a broader sample of published abstracts, in which LLMs match or exceed human resistance to spin, would undermine the central finding.

Figures

Figures reproduced from arXiv: 2502.07963 by Byron C. Wallace, Hye Sun Yun, Iain J. Marshall, Junyi Jessy Li, Karen Y.C. Zhang, Ramez Kouzy.

Figure 1. Authors of medical articles sometimes spin their reporting of trial results. We find that LLMs are susceptible to this when “reading” medical abstracts, more so than human experts.
Figure 2. Spin detection task accuracies for all LLMs. The average accuracy of all models was 0.67 (solid red vertical line), well above the random baseline (gray dashed vertical line). That said, this plot shows considerable variance across models with respect to their spin detection capabilities.
Figure 3. Average mean differences of scores from LLMs for all 5 interpretation questions compared to human experts. Error bars indicate 95% confidence intervals. A positive mean difference indicates that LLMs interpreted the spun abstract as showing more favorable treatment results while the negative mean difference indicates unspun abstracts to be more favorable. This plot suggests that LLMs, in general, erroneous…
Figure 4. Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the treatment effects (benefit of treatment) when abstracts contain ‘spin’. In comparison with human experts (0.71), all LLMs were more susceptible to spin. AlpaCare 7B and Olmo2 Instruct 13B were the most susceptible to spin.
Figure 5. Average mean differences of scores from Claude 3.5 Sonnet, GPT-4o Mini, and OpenBioLLM 70B interpreting simplified versions of abstracts with and without spin generated by 22 LLMs. The error bars indicate 95% confidence intervals. This plot shows that simplified spun abstracts generated by LLMs also exhibit spin.
Figure 6. Average mean differences of scores across all LLMs using different prompting strategies for 5 interpretation questions compared to human experts. The error bars indicate 95% confidence intervals. This plot shows that mitigation strategies such as adding additional information on the presence or absence of spin or jointly prompting the model to detect and then interpret can reduce the effect of over-inflati…
Figure 7. Mean differences of all five interpretation questions from top 6 LLMs in spin detection accuracy compared to human experts. Error bars represent 95% confidence intervals. Positive mean differences indicate that LLMs interpreted spun abstracts as showing more favorable treatment results, while negative mean differences suggest that unspun abstracts were perceived as more favorable. This plot highlights tha…
Figure 8. Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the rigor of study, when abstracts contain ‘spin’. In comparison with human experts (-0.59), LLMs show slightly greater susceptibility to spin.
Figure 9. Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the importance of study, when abstracts contain ‘spin’. In comparison with human experts (-0.38), most LLMs show greater susceptibility to spin.
Figure 10. Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the interest in full-text, when abstracts contain ‘spin’. In comparison with human experts (0.77), most LLMs show greater susceptibility to spin.
Figure 11. Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the interest in another trial, when abstracts contain ‘spin’. In comparison with human experts (0.64), most LLMs show greater susceptibility to spin.
Figure 12. Mean differences for the treatment benefit question between the “baseline” and “detect + interpret” approaches for each LLM. The “baseline” score is shown in black, while the “detect + interpret” score is in orange. LLMs are ordered from top to bottom based on their spin detection performance, with the best-performing model at the top and the worst at the bottom. Regardless of the original spin detection…

Original abstract

Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of 22 LLMs on question-answering and summarization tasks using original versus spin-manipulated medical abstracts. It compares LLM outputs to human baselines and concludes that LLMs are across the board more susceptible to spin than humans, may propagate spin into generated plain-language summaries, yet remain capable of recognizing spin and can be prompted to mitigate its effects.

Significance. If the central empirical result holds after appropriate controls, the work is significant for medical AI applications because LLMs are already used to synthesize published evidence; greater susceptibility could systematically bias downstream clinical interpretations. The positive finding that targeted prompting reduces the effect supplies a concrete, immediately usable mitigation strategy. The study also supplies a reusable testbed of spun abstracts and human baselines that future work can extend.

major comments (2)
  1. [§4 and §3.2] §4 (Results) and §3.2 (Prompting regime): the headline claim that LLMs are 'across the board more susceptible to spin than humans' rests on accuracy/bias differences between original and spun conditions. The manuscript simultaneously reports that LLMs 'are generally capable of recognizing spin' when explicitly prompted; without an explicit test showing that the susceptibility gap persists after controlling for instruction-following ability, prompt length, and lexical/factual drift introduced by the spin edits, the observed difference could be an artifact of weaker instruction adherence rather than spin-specific vulnerability.
  2. [§3.1] §3.1 (Abstract selection and spin manipulation): the weakest assumption is that the chosen abstracts and the specific spin edits are representative of real-world medical literature. No quantitative justification (e.g., distribution of spin types, journal impact, or year range) is supplied to support generalizability of the susceptibility finding to the broader corpus LLMs will encounter.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'across the board' is imprecise; the results section should state the range of model families, sizes, and training regimes for which the susceptibility ordering holds.
  2. [Results figures/tables] Table or figure captions (wherever the human-LLM comparison is presented): include exact sample sizes of abstracts, number of human raters, and the statistical test plus effect size used for the 'more susceptible' claim.
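
The reporting requested in the last minor comment (a statistical test plus an effect size for the "more susceptible" claim) could look like the sketch below. It is an illustration under stated assumptions, not the authors' method: the per-abstract gaps are invented, the effect size is the paired-samples Cohen's d_z, and the test is a sign-flip permutation test chosen here because it needs no distributional assumptions.

```python
import random
import statistics


def cohens_dz(diffs):
    """Paired-samples effect size: mean difference over the SD of differences."""
    return statistics.mean(diffs) / statistics.stdev(diffs)


def sign_flip_p_value(diffs, n_resamples=10_000, seed=0):
    """Two-sided p-value for 'mean difference is zero' via random sign flips."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-abstract gaps: (LLM spin effect) minus (human spin effect).
gaps = [3, 3, 4, 3, 4, 2, 3, 5]

print(f"n abstracts = {len(gaps)}")
print(f"Cohen's d_z = {cohens_dz(gaps):.2f}")
print(f"sign-flip p = {sign_flip_p_value(gaps):.4f}")
```

Reporting the sample size, the effect size, and the p-value together in each caption would let readers judge both the magnitude and the reliability of the human-LLM gap.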

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we indicate the changes to be made in the revised version.

Point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Results) and §3.2 (Prompting regime): the headline claim that LLMs are 'across the board more susceptible to spin than humans' rests on accuracy/bias differences between original and spun conditions. The manuscript simultaneously reports that LLMs 'are generally capable of recognizing spin' when explicitly prompted; without an explicit test showing that the susceptibility gap persists after controlling for instruction-following ability, prompt length, and lexical/factual drift introduced by the spin edits, the observed difference could be an artifact of weaker instruction adherence rather than spin-specific vulnerability.

    Authors: We agree that distinguishing spin-specific vulnerability from general instruction-following differences is important. Our primary experiments use consistent prompting across LLMs and humans without spin-specific instructions, mirroring typical usage. The recognition capability is demonstrated in a separate prompting condition. To strengthen the claim, we will add a control experiment in the revision where we normalize for instruction adherence by using a standardized instruction-following prompt and measure the remaining gap. This will clarify whether the susceptibility is spin-specific. revision: partial

  2. Referee: [§3.1] §3.1 (Abstract selection and spin manipulation): the weakest assumption is that the chosen abstracts and the specific spin edits are representative of real-world medical literature. No quantitative justification (e.g., distribution of spin types, journal impact, or year range) is supplied to support generalizability of the susceptibility finding to the broader corpus LLMs will encounter.

    Authors: The selection of abstracts was based on a curated set of medical trial abstracts from recent publications, with spin manipulations designed to reflect common types identified in prior literature on spin in medical abstracts. While we did not provide a full quantitative distribution in the original submission, the spin types used (e.g., overstatement of efficacy, omission of limitations) are drawn from established taxonomies in the field. In the revision, we will include additional details on the selection criteria, including the range of journals and years, and a breakdown of spin types to better support generalizability. revision: yes
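
The two prompting conditions discussed in the responses above (a baseline interpretation prompt with no mention of spin, and a joint condition that asks the model to detect spin before interpreting) can be sketched as plain prompt builders. The templates, function names, and the 0-10 answer scale are hypothetical and do not reproduce the paper's wording; only the treatment-benefit question is taken from the paper's interpretation questions.

```python
# Hypothetical prompt templates; the paper's exact prompts are not reproduced here.
INTERPRET_QUESTION = (
    "Based on this abstract, do you think treatment A would be "
    "beneficial to patients?"
)
# Assumed answer format, invented for this sketch.
ANSWER_FORMAT = "Answer with a single integer from 0 to 10."


def baseline_prompt(abstract: str) -> str:
    """Interpretation only: the prompt contains no spin-related instruction."""
    return f"Abstract:\n{abstract}\n\n{INTERPRET_QUESTION}\n{ANSWER_FORMAT}"


def detect_then_interpret_prompt(abstract: str) -> str:
    """Joint condition: ask for a spin judgment first, then the interpretation."""
    return (
        f"Abstract:\n{abstract}\n\n"
        "Step 1: Does this abstract contain spin, that is, reporting that "
        "presents the results more favorably than the data support? "
        "Answer yes or no.\n"
        f"Step 2: {INTERPRET_QUESTION}\n{ANSWER_FORMAT}"
    )


example = "Treatment A did not differ from placebo on the primary outcome."
print(baseline_prompt(example))
print()
print(detect_then_interpret_prompt(example))
```

Keeping the interpretation question and answer format identical across both builders means any change in scores can be attributed to the added detection step rather than to other differences in wording.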

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only

Full rationale

This is a purely empirical study involving LLM evaluations on spun vs. original medical abstracts, with direct comparisons to human baselines. No derivations, equations, fitted parameters, self-citations as load-bearing premises, or renamings of results are present. All claims rest on external data collection and human comparisons rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on the domain assumption that spin effects documented in human readers transfer to LLMs and that the chosen tasks measure the same construct. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Spin in abstracts influences interpretation of trial results
    Invoked in the motivation section drawing on prior medical literature on human susceptibility.




Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 6 internal anchors

  1. [1]

    Paper plain: Making medical research papers approachable to healthcare consumers with natural language processing

    Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A Hearst, Andrew Head, and Kyle Lo. Paper plain: Making medical research papers approachable to healthcare consumers with natural language processing. ACM Transactions on Computer-Human Interaction, 30 0 (5): 0 1--38, 2023

  2. [2]

    Evaluation of spin within abstracts in obesity randomized clinical trials: a cross-sectional review

    Jennifer Austin, Christopher Smith, Kavita Natarajan, Mousumi Som, Cole Wayant, and Matt Vassar. Evaluation of spin within abstracts in obesity randomized clinical trials: a cross-sectional review. Clinical obesity, 9 0 (2): 0 e12292, 2019

  3. [3]

    Patient perception of plain-language medical notes generated using artificial intelligence software: pilot mixed-methods study

    Sandeep Bala, Angela Keniston, Marisha Burden, et al. Patient perception of plain-language medical notes generated using artificial intelligence software: pilot mixed-methods study. JMIR formative research, 4 0 (6): 0 e16670, 2020

  4. [4]

    Family physicians' use of medical abstracts to guide decision making: style or substance? The Journal of the American Board of Family Practice, 14 0 (6): 0 437--442, 2001

    Henry C Barry, Mark H Ebell, Allen F Shaughnessy, David C Slawson, and Fern Nietzke. Family physicians' use of medical abstracts to guide decision making: style or substance? The Journal of the American Board of Family Practice, 14 0 (6): 0 437--442, 2001

  5. [5]

    Publication bias: a problem in interpreting medical data

    Colin B Begg and Jesse A Berlin. Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society Series A: Statistics in Society, 151 0 (3): 0 419--445, 1988

  6. [6]

    S ci BERT : A pretrained language model for scientific text

    Iz Beltagy, Kyle Lo, and Arman Cohan. S ci BERT : A pretrained language model for scientific text. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615--36...

  7. [7]

    The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals

    Otavio Berwanger, Rodrigo A Ribeiro, Alessandro Finkelsztejn, Marcelo Watanabe, Erica A Suzumura, Bruce B Duncan, Phillip J Devereaux, and Deborah Cook. The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals. Journal of clinical epidemiology, 62 0 (4): 0 387--392, 2009

  8. [8]

    Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes

    Isabelle Boutron, Susan Dutton, Philippe Ravaud, and Douglas G Altman. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. Jama, 303 0 (20): 0 2058--2064, 2010

  9. [9]

    Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the spiin randomized controlled trial

    Isabelle Boutron, Douglas G Altman, Sally Hopewell, Francisco Vera-Badillo, Ian Tannock, and Philippe Ravaud. Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the spiin randomized controlled trial. Journal of Clinical Oncology, 32 0 (36): 0 4120--4126, 2014

  10. [10]

    Isabelle Boutron, Romana Haneef, Am \'e lie Yavchitz, Gabriel Baron, John Novack, Ivan Oransky, Gary Schwitzer, and Philippe Ravaud. Three randomized controlled trials evaluating the impact of “spin” in health news stories reporting studies of pharmacologic treatments on patients’/caregivers’ interpretation of treatment benefit. BMC medicine, 17: 0 1--10, 2019

  11. [11]

    Ahrq health literacy universal precautions toolkit, 2015

    AGBJ Brega, J Barnard, NM Mabachi, B Weiss, D DeWalt, C Brach, M Cifuentes, K Albright, and D West. Ahrq health literacy universal precautions toolkit, 2015

  12. [12]

    ‘spin’in published biomedical literature: a methodological systematic review

    Kellia Chiu, Quinn Grundy, and Lisa Bero. ‘spin’in published biomedical literature: a methodological systematic review. PLoS Biology, 15 0 (9): 0 e2002173, 2017

  13. [13]

    Do physicians judge a study by its cover?: An investigation of journal attribution bias

    Dimitri A Christakis, Sanjay Saint, Somnath Saha, Joann G Elmore, Deborah E Welsh, Paul Baker, and Thomas D Koepsell. Do physicians judge a study by its cover?: An investigation of journal attribution bias. Journal of clinical epidemiology, 53 0 (8): 0 773--778, 2000

  14. [14]

    Med42-v2: A suite of clinical llms

    Cl \'e ment Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. Med42-v2: A suite of clinical llms. arXiv preprint arXiv:2408.06142, 2024

  15. [15]

    Open to the public: paywalls and the public rationale for open access medical research publishing

    Suzanne Day, Stuart Rennie, Danyang Luo, and Joseph D Tucker. Open to the public: paywalls and the public rationale for open access medical research publishing. Research involvement and engagement, 6: 0 1--7, 2020

  16. [16]

    Paragraph-level simplification of medical texts

    Ashwin Devaraj, Byron C Wallace, Iain J Marshall, and Junyi Jessy Li. Paragraph-level simplification of medical texts. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2021, page 4972. NIH Public Access, 2021

  17. [17]

    Evaluating factuality in text simplification

    Ashwin Devaraj, William Sheffield, Byron C Wallace, and Junyi Jessy Li. Evaluating factuality in text simplification. In Proceedings of the conference of the Association for Computational Linguistics (ACL), volume 2022, page 7331, 2022

  18. [18]

    Catalogue of bias: publication bias

    Nicholas J DeVito and Ben Goldacre. Catalogue of bias: publication bias. BMJ Evidence-Based Medicine, 24 0 (2): 0 53--54, 2019

  19. [19]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  20. [20]

    Publication bias: the problem that won't go away

    K Dickersin and Y I Min. Publication bias: the problem that won't go away. Ann. N. Y. Acad. Sci., 703 0 (1): 0 135--46; discussion 146--8, December 1993

  21. [21]

    The existence of publication bias and risk factors for its occurrence

    Kay Dickersin. The existence of publication bias and risk factors for its occurrence. Jama, 263 0 (10): 0 1385--1389, 1990

  22. [22]

    Publication bias in clinical research

    Phillipa J Easterbrook, Ramana Gopalan, JA Berlin, and David R Matthews. Publication bias in clinical research. The Lancet, 337 0 (8746): 0 867--872, 1991

  23. [23]

    Leveraging large language models for zero-shot lay summarisation in biomedicine and beyond

    Tomas Goldsack, Carolina Scarton, and Chenghua Lin. Leveraging large language models for zero-shot lay summarisation in biomedicine and beyond. arXiv preprint arXiv:2501.05224, 2025

  24. [24]

    Consort for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration

    Sally Hopewell, Mike Clarke, David Moher, Elizabeth Wager, Philippa Middleton, Douglas G Altman, Kenneth F Schulz, and Consort Group. Consort for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration. PLoS medicine, 5 0 (1): 0 e20, 2008

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  26. [26]

    M ath P rompter: Mathematical reasoning using large language models

    Shima Imani, Liang Du, and Harsh Shrivastava. M ath P rompter: Mathematical reasoning using large language models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37--42, Toronto, Canada, July 2023. Associat...

  27. [27]

    Understanding pubmed user search behavior through log analysis

    Rezarta Islamaj Dogan, G Craig Murray, Aur \'e lie N \'e v \'e ol, and Zhiyong Lu. Understanding pubmed user search behavior through log analysis. Database, 2009: 0 bap018, 2009

  28. [28]

    Chatgpt makes medicine easy to swallow: an exploratory case study on simplified radiology reports

    Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa St \"u ber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Oliver Sabel, Jens Ricke, et al. Chatgpt makes medicine easy to swallow: an exploratory case study on simplified radiology reports. European radiology, 34 0 (5): 0 2817--2825, 2024

  29. [29]

    Evaluation of spin in abstracts of papers in psychiatry and psychology journals

    Samuel Jellison, Will Roberts, Aaron Bowers, Tyler Combs, Jason Beaman, Cole Wayant, and Matt Vassar. Evaluation of spin in abstracts of papers in psychiatry and psychology journals. BMJ evidence-based medicine, 25 0 (5): 0 178--181, 2020

  30. [30]

    Daniel P Jeong, Saurabh Garg, Zachary Chase Lipton, and Michael Oberst. Medical adaptation of large language and vision-language models: Are we making progress? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12143--12170, Miami, Florida, USA, Nove...

  31. [31]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  32. [32]

    Multilingual simplification of medical texts

    Sebastian Joseph, Kathryn Kazanas, Keziah Reina, Vishnesh J Ramanathan, Wei Xu, Byron C Wallace, and Junyi Jessy Li. Multilingual simplification of medical texts. arXiv preprint arXiv:2305.12532, 2023

  33. [33]

    F act PICO : Factuality evaluation for plain language summarization of medical evidence

    Sebastian Joseph, Lily Chen, Jan Trienes, Hannah G \"o ke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. F act PICO : Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volu...

  34. [34]

    Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review

    Muhammad Shahzeb Khan, Noman Lateef, Tariq Jamal Siddiqi, Karim Abdur Rehman, Saed Alnaimat, Safi U Khan, Haris Riaz, M Hassan Murad, John Mandrola, Rami Doukky, et al. Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review. JAMA network open, 2 0 (...

  35. [35]

    On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications

    Anna Koroleva and Patrick Paroubek. On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications. In Language and Technology Conference, pages 304--317. Springer, 2017

  36. [36]

    Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts)

    Anna Koroleva and Patrick Paroubek. Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts). In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

  37. [37]

    Despin: a prototype system for detecting spin in biomedical publications

    Anna Koroleva, Sanjay Kamath, Patrick MM Bossuyt, and Patrick Paroubek. Despin: a prototype system for detecting spin in biomedical publications. In roceedings of the BioNLP 2020 workshop, pages 49--59. Association for Computational Linguistics, 2020

  38. [38]

    The health literacy of america's adults: Results from the 2003 national assessment of adult literacy

    Mark Kutner, Elizabeth Greenburg, Ying Jin, and Christine Paulsen. The health literacy of america's adults: Results from the 2003 national assessment of adult literacy. nces 2006-483. National Center for education statistics, 2006

  39. [39]

    Biomistral: A collection of open-source pretrained large language models for medical domains

    Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024

  40. [40]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36 0 (4): 0 1234--1240, 2020

  41. [41]

    Suzanne Lockyer, Rob Hodgson, Jo C Dumville, and Nicky Cullum. "spin" in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials, 14: 0 1--10, 2013

  42. [42]

    Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine

    Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023

  43. [43]

    A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital

    Alvin Marcelo, Alex Gavino, Iris Thiele Isip-Tan, Leilanie Apostol-Nicodemus, Faith Joan Mesa-Gaerlan, Paul Nimrod Firaza, John Francis Faustorilla, Fiona M Callaghan, and Paul Fontelo. A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital. BMJ Evi...

  44. [44]

    What is readability and why should content editors care about it

    Lisa Marchand. What is readability and why should content editors care about it. Center for Plain Language. https://centerforplainlanguage. org/what-isreadability, 2017

  45. [45]

    Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section

    Sylvain Mathieu, Bruno Giraudeau, Martin Soubrier, and Philippe Ravaud. Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section. Joint Bone Spine, 79 0 (3): 0 262--267, 2012

  46. [46]

    Introducing meta llama 3: The most capable openly available llm to date

    AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2024

  47. [47]

    Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis

    Ross Nowlin, Alexis Wirtz, David Wenger, Ryan Ottwell, Courtney Cook, Wade Arthur, Brigitte Sallee, Jarad Levin, Micah Hartwell, Drew Wright, et al. Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis. JMIR dermatology, 5(1): e33996, 2022

  48. [48]

    Simply put: A guide for creating easy-to-understand materials

    US Department of Health and Human Services, et al. Simply put: A guide for creating easy-to-understand materials. 2009

  49. [49]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656, 2024

  50. [50]

    A survey of automated methods for biomedical text simplification

    Brian Ondov, Kush Attal, and Dina Demner-Fushman. A survey of automated methods for biomedical text simplification. Journal of the American Medical Informatics Association, 29(11): 1976--1988, 2022

  51. [51]

    Models, 2025

    OpenAI. Models, 2025. URL https://platform.openai.com/docs/models/gpt-3-5-turbo. Accessed: 2025-01-17

  52. [52]

    GPT-4o mini: Advancing cost-efficient intelligence, 2024

    OpenAI. GPT-4o mini: Advancing cost-efficient intelligence. URL: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024

  53. [53]

    CK Osborne, J Pippen, SE Jones, LM Parker, M Ellis, S Come, SZ Gertler, JT May, G Burton, I Dimery, et al. Double-blind, randomized trial comparing the efficacy and tolerability of fulvestrant versus anastrozole in postmenopausal women with advanced breast cancer progressing on prior endocrine therapy: results of a north american trial. Journal of Clinica...

  54. [54]

    Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

    Ankit Pal and Malaikannan Sankarasubbu. Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

  55. [55]

    Assessing ai simplification of medical texts: readability and content fidelity

    Bryce Picton, Saman Andalib, Aidin Spina, Brandon Camp, Sean S Solomon, Jason Liang, Patrick M Chen, Jefferson W Chen, Frank P Hsu, and Michael Y Oh. Assessing ai simplification of medical texts: readability and content fidelity. International Journal of Medical Informatics, 195: 105743, 2025

  56. [56]

    The state of oa: a large-scale analysis of the prevalence and impact of open access articles

    Heather Piwowar, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. The state of oa: a large-scale analysis of the prevalence and impact of open access articles. PeerJ, 6: e4375, 2018

  57. [57]

    Malignant: how bad policy and bad evidence harm people with Cancer

    Vinayak K Prasad. Malignant: how bad policy and bad evidence harm people with Cancer. JHU Press, 2020

  58. [58]

    Reasoning with language model prompting: A survey

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368--...

  59. [59]

    Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions

    Riaz Qureshi, Kevin Naaman, Nicolas G Quan, Evan Mayo-Wilson, Matthew J Page, Victoria Cornelius, Roger Chou, Isabelle Boutron, Su Golder, Lisa Bero, et al. Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions. Annals of internal medicine, 177(8): 1089--1098, 2024

  60. [60]

    Evaluation of spin in the abstracts of emergency medicine randomized controlled trials

    Victoria Reynolds-Vaughn, Jonathan Riddle, Jamin Brown, Michael Schiesel, Cole Wayant, and Matt Vassar. Evaluation of spin in the abstracts of emergency medicine randomized controlled trials. Annals of emergency medicine, 75(3): 423--431, 2020

  61. [61]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995): 468--475, 2024

  62. [62]

    Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)

    Chantal Shaib, Millicent Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, and Byron Wallace. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 13...

  63. [63]

    Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature

    Elise Smith, Stefanie Haustein, Philippe Mongeon, Fei Shu, Valéry Ridde, and Vincent Larivière. Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature. Health Research Policy and Systems, 15: 1--10, 2017

  64. [64]

    Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study

    Naichuan Su, Michiel W Van Der Linden, Clovis M Faggion Jr, and Geert JMG Van Der Heijden. Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study. Caries Research, 57(5-6): 553--562, 2023

  65. [65]

    Exaggerations and caveats in press releases and health-related science news

    Petroc Sumner, Solveiga Vivian-Griffiths, Jacky Boivin, Andrew Williams, Lewis Bott, Rachel Adams, Christos A Venetis, Leanne Whelan, Bethan Hughes, and Christopher D Chambers. Exaggerations and caveats in press releases and health-related science news. PloS one, 11(12): e0168217, 2016

  66. [66]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  67. [67]

    Large language models in medicine

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8): 1930--1940, 2023

  68. [68]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  69. [69]

    Evaluation of spin in oncology clinical trials

    C Wayant, D Margalski, K Vaughn, and M Vassar. Evaluation of spin in oncology clinical trials. Critical Reviews in Oncology/Hematology, 144: 102821, 2019

  70. [70]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837, 2022

  71. [71]

    Health literacy and patient safety: Help patients understand

    Barry D Weiss. Health literacy and patient safety: Help patients understand. Manual for clinicians. American Medical Association Foundation, 2007

  72. [72]

    A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity

    Amélie Yavchitz, Philippe Ravaud, Douglas G Altman, David Moher, Asbjørn Hróbjartsson, Toby Lasserson, and Isabelle Boutron. A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity. Journal of clinical epidemiology, 75: 56--65, 2016

  73. [73]

    Alpacare: Instruction-tuned large language models for medical application

    Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558, 2023