Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

Byron C. Wallace; Hye Sun Yun; Iain J. Marshall; Junyi Jessy Li; Karen Y.C. Zhang; Ramez Kouzy

arxiv: 2502.07963 · v4 · submitted 2025-02-11 · 💻 cs.CL · cs.AI

Hye Sun Yun, Karen Y.C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

Pith reviewed 2026-05-23 03:10 UTC · model grok-4.3


keywords large language models · spin · medical literature · bias susceptibility · evidence synthesis · prompting · LLM evaluation · abstracts


The pith

Large language models are more susceptible to spin in medical abstracts than humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical abstracts often present equivocal trial results in an overly positive way, known as spin. This paper tests if LLMs, increasingly used to summarize medical evidence, interpret results differently when spin is present compared to when it is not. Evaluation across 22 models shows they are more affected by spin than human readers and may include it in their own generated summaries. The models can detect spin and respond to prompts that limit its influence on outputs. This matters because biased LLM outputs could affect how evidence reaches clinicians and patients.

Core claim

We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

What carries the argument

Direct comparison of LLM and human answers on question-answering and summarization tasks using original versus spun versions of the same medical trial abstracts.

Load-bearing premise

The chosen abstracts and the particular spin changes applied to them represent the spin that LLMs will meet in real medical literature.

What would settle it

A larger study using naturally occurring spin in a broader sample of published abstracts where LLMs match or beat human resistance to spin would undermine the central finding.

Figures

Figures reproduced from arXiv: 2502.07963 by Byron C. Wallace, Hye Sun Yun, Iain J. Marshall, Junyi Jessy Li, Karen Y.C. Zhang, Ramez Kouzy.

**Figure 1.** Authors of medical articles sometimes spin their reporting of trial results. We find that LLMs are susceptible to this when “reading” medical abstracts, more so than human experts.

**Figure 2.** Spin detection task accuracies for all LLMs. The average accuracy of all models was 0.67 (solid red vertical line), well above the random baseline (gray dashed vertical line). That said, this plot shows considerable variance across models with respect to their spin detection capabilities.

**Figure 3.** Average mean differences of scores from LLMs for all 5 interpretation questions compared to human experts. Error bars indicate 95% confidence intervals. A positive mean difference indicates that LLMs interpreted the spun abstract as showing more favorable treatment results while the negative mean difference indicates unspun abstracts to be more favorable. This plot suggests that LLMs, in general, erroneous…

**Figure 4.** Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the treatment effects (benefit of treatment) when abstracts contain ‘spin’. In comparison with human experts (0.71), all LLMs were more susceptible to spin. AlpaCare 7B and Olmo2 Instruct 13B were the most susceptible to spin.

**Figure 5.** Average mean differences of scores from Claude 3.5 Sonnet, GPT-4o Mini, and OpenBioLLM 70B interpreting simplified versions of abstracts with and without spin generated by 22 LLMs. The error bars indicate 95% confidence intervals. This plot shows that simplified spun abstracts generated by LLMs also exhibit spin.

**Figure 6.** Average mean differences of scores across all LLMs using different prompting strategies for 5 interpretation questions compared to human experts. The error bars indicate 95% confidence intervals. This plot shows that mitigation strategies such as adding additional information on the presence or absence of spin or jointly prompting the model to detect and then interpret can reduce the effect of over-inflati…

**Figure 7.** Mean differences of all five interpretation questions from top 6 LLMs in spin detection accuracy compared to human experts. Error bars represent 95% confidence intervals. Positive mean differences indicate that LLMs interpreted spun abstracts as showing more favorable treatment results, while negative mean differences suggest that unspun abstracts were perceived as more favorable. This plot highlights tha…

**Figure 8.** Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the rigor of study when abstracts contain ‘spin’. In comparison with human experts (-0.59), LLMs show slightly greater susceptibility to spin.

**Figure 9.** Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the importance of study when abstracts contain ‘spin’. In comparison with human experts (-0.38), most LLMs show greater susceptibility to spin.

**Figure 10.** Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the interest in full-text when abstracts contain ‘spin’. In comparison with human experts (0.77), most LLMs show greater susceptibility to spin.

**Figure 11.** Coefficients from linear regression models with 95% CI for each LLM showing how much different LLMs overestimate the interest in another trial when abstracts contain ‘spin’. In comparison with human experts (0.64), most LLMs show greater susceptibility to spin.

**Figure 12.** Mean differences for the treatment benefit question between the “baseline” and “detect + interpret” approaches for each LLM. The “baseline” score is shown in black, while the “detect + interpret” score is in orange. LLMs are ordered from top to bottom based on their spin detection performance, with the best-performing model at the top and the worst at the bottom. Regardless of the original spin detection…
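Several captions above report coefficients from a linear regression of an interpretation score on the presence or absence of spin, with 95% confidence intervals. The sketch below illustrates that kind of model in its simplest form, score = b0 + b1 · spin with a 0/1 indicator, where b1 equals the difference between the spun and unspun group means. It is an assumption-laden illustration, not the authors' analysis: the function name `ols_binary`, the example ratings, and the normal-approximation interval are all hypothetical, and the paper's actual model may include further terms.

```python
from math import sqrt
from statistics import mean


def ols_binary(scores, spin):
    """Fit score = b0 + b1 * spin by ordinary least squares.

    `spin` holds 0 (unspun) or 1 (spun) for each rating.
    Returns (b0, b1, ci_low, ci_high) with an approximate 95% CI for b1.
    """
    n = len(scores)
    x_bar, y_bar = mean(spin), mean(scores)
    sxx = sum((x - x_bar) ** 2 for x in spin)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(spin, scores))
    b1 = sxy / sxx                      # slope: effect of spin on the score
    b0 = y_bar - b1 * x_bar             # intercept: mean unspun score
    residuals = [y - (b0 + b1 * x) for x, y in zip(spin, scores)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    se = sqrt(sigma2 / sxx)             # standard error of the slope
    z = 1.96                            # normal approximation (assumption);
                                        # a t quantile is more exact for small n
    return b0, b1, b1 - z * se, b1 + z * se


# Hypothetical ratings on an assumed numeric scale, not data from the paper.
unspun = [3, 4, 2, 5, 3, 4]
spun = [6, 7, 5, 8, 6, 7]
b0, b1, low, high = ols_binary(unspun + spun, [0] * len(unspun) + [1] * len(spun))
print(f"b1 = {b1:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With a single binary predictor the slope and the mean difference coincide, which is why the mean-difference plots and the regression-coefficient plots can be read on the same footing.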

Original abstract

Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs show higher susceptibility to spin in medical abstracts than humans and can propagate it in summaries, but the gap may partly reflect weaker instruction following rather than spin-specific issues.


The core finding is that 22 LLMs were more affected by spin in medical trial abstracts than human readers, both in direct interpretation and when generating plain-language summaries. Prompting the models to watch for spin reduced the effect. This is the first direct test of the issue on current LLMs, which matters because these models are already being used to scan and summarize medical evidence.

The authors build on existing human spin studies and add checks for both propagation and mitigation, which is a reasonable extension. They also note that the models can recognize spin when explicitly asked, which separates the susceptibility result from a blanket claim that the models cannot handle the concept at all. That distinction is useful.

The main soft spot is the lack of visible controls for whether the performance drop is truly about spin or just about the spun versions being harder to parse in general. The stress-test concern lands: if prompt length, lexical changes, or factual drift from the edits were not matched or measured separately, the human-LLM gap could be an artifact of instruction adherence rather than a spin-specific vulnerability. Sample sizes, exact statistical tests, and how the spin manipulations were constructed are not in the abstract, so the size and robustness of the effect are still unclear. The abstracts chosen and the tasks used also need to be shown to match real deployment conditions.

This paper is aimed at people working on LLM tools for evidence synthesis or clinical decision support. It raises a practical risk worth checking. The question is timely and the basic setup is sound enough that it should go to peer review rather than desk rejection, though referees will need to press on the methods and controls.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of 22 LLMs on question-answering and summarization tasks using original versus spin-manipulated medical abstracts. It compares LLM outputs to human baselines and concludes that LLMs are across the board more susceptible to spin than humans, may propagate spin into generated plain-language summaries, yet remain capable of recognizing spin and can be prompted to mitigate its effects.

Significance. If the central empirical result holds after appropriate controls, the work is significant for medical AI applications because LLMs are already used to synthesize published evidence; greater susceptibility could systematically bias downstream clinical interpretations. The positive finding that targeted prompting reduces the effect supplies a concrete, immediately usable mitigation strategy. The study also supplies a reusable testbed of spun abstracts and human baselines that future work can extend.

major comments (2)

[§4 and §3.2] §4 (Results) and §3.2 (Prompting regime): the headline claim that LLMs are 'across the board more susceptible to spin than humans' rests on accuracy/bias differences between original and spun conditions. The manuscript simultaneously reports that LLMs 'are generally capable of recognizing spin' when explicitly prompted; without an explicit test showing that the susceptibility gap persists after controlling for instruction-following ability, prompt length, and lexical/factual drift introduced by the spin edits, the observed difference could be an artifact of weaker instruction adherence rather than spin-specific vulnerability.
[§3.1] §3.1 (Abstract selection and spin manipulation): the weakest assumption is that the chosen abstracts and the specific spin edits are representative of real-world medical literature. No quantitative justification (e.g., distribution of spin types, journal impact, or year range) is supplied to support generalizability of the susceptibility finding to the broader corpus LLMs will encounter.

minor comments (2)

[Abstract] Abstract: the phrase 'across the board' is imprecise; the results section should state the range of model families, sizes, and training regimes for which the susceptibility ordering holds.
[Results figures/tables] Table or figure captions (wherever the human-LLM comparison is presented): include exact sample sizes of abstracts, number of human raters, and the statistical test plus effect size used for the 'more susceptible' claim.
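The first major comment asks whether prompt length and lexical drift between the original and spun abstracts were measured. A minimal sketch of such a check follows, assuming paired plain-text versions of each abstract. The names `tokenize` and `surface_drift` and the two metrics (token length ratio, Jaccard overlap of vocabularies) are illustrative choices rather than anything the paper describes, and they cover only surface covariates, not instruction-following ability or factual drift.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def surface_drift(original, spun):
    """Compare a spun abstract with its unspun counterpart.

    Returns the token length ratio (spun / original) and the Jaccard
    overlap of the two vocabularies. Values near 1.0 on both suggest the
    spin edit changed little beyond the targeted wording.
    """
    a, b = tokenize(original), tokenize(spun)
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    length_ratio = len(b) / len(a) if a else float("inf")
    return {"length_ratio": length_ratio, "jaccard": jaccard}


# Hypothetical sentence pair, not taken from the study's abstracts.
print(surface_drift("Treatment A was not better than B",
                    "Treatment A was promising"))
```

Reporting these two numbers per abstract pair, or adding them as covariates, would be one way to show that the human-LLM gap is not driven by length or vocabulary differences alone.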

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we indicate the changes to be made in the revised version.


Referee: [§4 and §3.2] §4 (Results) and §3.2 (Prompting regime): the headline claim that LLMs are 'across the board more susceptible to spin than humans' rests on accuracy/bias differences between original and spun conditions. The manuscript simultaneously reports that LLMs 'are generally capable of recognizing spin' when explicitly prompted; without an explicit test showing that the susceptibility gap persists after controlling for instruction-following ability, prompt length, and lexical/factual drift introduced by the spin edits, the observed difference could be an artifact of weaker instruction adherence rather than spin-specific vulnerability.

Authors: We agree that distinguishing spin-specific vulnerability from general instruction-following differences is important. Our primary experiments use consistent prompting across LLMs and humans without spin-specific instructions, mirroring typical usage. The recognition capability is demonstrated in a separate prompting condition. To strengthen the claim, we will add a control experiment in the revision where we normalize for instruction adherence by using a standardized instruction-following prompt and measure the remaining gap. This will clarify whether the susceptibility is spin-specific.

revision: partial

Referee: [§3.1] §3.1 (Abstract selection and spin manipulation): the weakest assumption is that the chosen abstracts and the specific spin edits are representative of real-world medical literature. No quantitative justification (e.g., distribution of spin types, journal impact, or year range) is supplied to support generalizability of the susceptibility finding to the broader corpus LLMs will encounter.

Authors: The selection of abstracts was based on a curated set of medical trial abstracts from recent publications, with spin manipulations designed to reflect common types identified in prior literature on spin in medical abstracts. While we did not provide a full quantitative distribution in the original submission, the spin types used (e.g., overstatement of efficacy, omission of limitations) are drawn from established taxonomies in the field. In the revision, we will include additional details on the selection criteria, including the range of journals and years, and a breakdown of spin types to better support generalizability.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only


This is a purely empirical study involving LLM evaluations on spun vs. original medical abstracts, with direct comparisons to human baselines. No derivations, equations, fitted parameters, self-citations as load-bearing premises, or renamings of results are present. All claims rest on external data collection and human comparisons rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the domain assumption that spin effects documented in human readers transfer to LLMs and that the chosen tasks measure the same construct. No free parameters or invented entities are introduced.

axioms (1)

domain assumption Spin in abstracts influences interpretation of trial results
Invoked in the motivation section drawing on prior medical literature on human susceptibility.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We evaluated 22 LLMs … more susceptible to spin than humans … can be prompted … to mitigate spin’s impact

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: linear regression … β1k · (presence or absence of spin)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 6 internal anchors

[1]

Paper plain: Making medical research papers approachable to healthcare consumers with natural language processing

Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A Hearst, Andrew Head, and Kyle Lo. Paper plain: Making medical research papers approachable to healthcare consumers with natural language processing. ACM Transactions on Computer-Human Interaction, 30 0 (5): 0 1--38, 2023

work page 2023
[2]

Evaluation of spin within abstracts in obesity randomized clinical trials: a cross-sectional review

Jennifer Austin, Christopher Smith, Kavita Natarajan, Mousumi Som, Cole Wayant, and Matt Vassar. Evaluation of spin within abstracts in obesity randomized clinical trials: a cross-sectional review. Clinical obesity, 9 0 (2): 0 e12292, 2019

work page 2019
[3]

Patient perception of plain-language medical notes generated using artificial intelligence software: pilot mixed-methods study

Sandeep Bala, Angela Keniston, Marisha Burden, et al. Patient perception of plain-language medical notes generated using artificial intelligence software: pilot mixed-methods study. JMIR formative research, 4 0 (6): 0 e16670, 2020

work page 2020
[4]

Family physicians' use of medical abstracts to guide decision making: style or substance? The Journal of the American Board of Family Practice, 14 0 (6): 0 437--442, 2001

Henry C Barry, Mark H Ebell, Allen F Shaughnessy, David C Slawson, and Fern Nietzke. Family physicians' use of medical abstracts to guide decision making: style or substance? The Journal of the American Board of Family Practice, 14 0 (6): 0 437--442, 2001

work page 2001
[5]

Publication bias: a problem in interpreting medical data

Colin B Begg and Jesse A Berlin. Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society Series A: Statistics in Society, 151 0 (3): 0 419--445, 1988

work page 1988
[6]

S ci BERT : A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. S ci BERT : A pretrained language model for scientific text. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615--36...

work page doi:10.18653/v1/d19-1371 2019
[7]

The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals

Otavio Berwanger, Rodrigo A Ribeiro, Alessandro Finkelsztejn, Marcelo Watanabe, Erica A Suzumura, Bruce B Duncan, Phillip J Devereaux, and Deborah Cook. The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals. Journal of clinical epidemiology, 62 0 (4): 0 387--392, 2009

work page 2009
[8]

Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes

Isabelle Boutron, Susan Dutton, Philippe Ravaud, and Douglas G Altman. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. Jama, 303 0 (20): 0 2058--2064, 2010

work page 2058
[9]

Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the spiin randomized controlled trial

Isabelle Boutron, Douglas G Altman, Sally Hopewell, Francisco Vera-Badillo, Ian Tannock, and Philippe Ravaud. Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the spiin randomized controlled trial. Journal of Clinical Oncology, 32 0 (36): 0 4120--4126, 2014

work page 2014
[10]

Isabelle Boutron, Romana Haneef, Am \'e lie Yavchitz, Gabriel Baron, John Novack, Ivan Oransky, Gary Schwitzer, and Philippe Ravaud. Three randomized controlled trials evaluating the impact of “spin” in health news stories reporting studies of pharmacologic treatments on patients’/caregivers’ interpretation of treatment benefit. BMC medicine, 17: 0 1--10, 2019

work page 2019
[11]

Ahrq health literacy universal precautions toolkit, 2015

AGBJ Brega, J Barnard, NM Mabachi, B Weiss, D DeWalt, C Brach, M Cifuentes, K Albright, and D West. Ahrq health literacy universal precautions toolkit, 2015

work page 2015
[12]

‘spin’in published biomedical literature: a methodological systematic review

Kellia Chiu, Quinn Grundy, and Lisa Bero. ‘spin’in published biomedical literature: a methodological systematic review. PLoS Biology, 15 0 (9): 0 e2002173, 2017

work page 2017
[13]

Do physicians judge a study by its cover?: An investigation of journal attribution bias

Dimitri A Christakis, Sanjay Saint, Somnath Saha, Joann G Elmore, Deborah E Welsh, Paul Baker, and Thomas D Koepsell. Do physicians judge a study by its cover?: An investigation of journal attribution bias. Journal of clinical epidemiology, 53 0 (8): 0 773--778, 2000

work page 2000
[14]

Med42-v2: A suite of clinical llms

Cl \'e ment Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. Med42-v2: A suite of clinical llms. arXiv preprint arXiv:2408.06142, 2024

work page arXiv 2024
[15]

Open to the public: paywalls and the public rationale for open access medical research publishing

Suzanne Day, Stuart Rennie, Danyang Luo, and Joseph D Tucker. Open to the public: paywalls and the public rationale for open access medical research publishing. Research involvement and engagement, 6: 0 1--7, 2020

work page 2020
[16]

Paragraph-level simplification of medical texts

Ashwin Devaraj, Byron C Wallace, Iain J Marshall, and Junyi Jessy Li. Paragraph-level simplification of medical texts. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2021, page 4972. NIH Public Access, 2021

work page 2021
[17]

Evaluating factuality in text simplification

Ashwin Devaraj, William Sheffield, Byron C Wallace, and Junyi Jessy Li. Evaluating factuality in text simplification. In Proceedings of the conference of the Association for Computational Linguistics (ACL), volume 2022, page 7331, 2022

work page 2022
[18]

Catalogue of bias: publication bias

Nicholas J DeVito and Ben Goldacre. Catalogue of bias: publication bias. BMJ Evidence-Based Medicine, 24 0 (2): 0 53--54, 2019

work page 2019
[19]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Publication bias: the problem that won't go away

K Dickersin and Y I Min. Publication bias: the problem that won't go away. Ann. N. Y. Acad. Sci., 703 0 (1): 0 135--46; discussion 146--8, December 1993

work page 1993
[21]

The existence of publication bias and risk factors for its occurrence

Kay Dickersin. The existence of publication bias and risk factors for its occurrence. Jama, 263 0 (10): 0 1385--1389, 1990

work page 1990
[22]

Publication bias in clinical research

Phillipa J Easterbrook, Ramana Gopalan, JA Berlin, and David R Matthews. Publication bias in clinical research. The Lancet, 337 0 (8746): 0 867--872, 1991

work page 1991
[23]

Leveraging large language models for zero-shot lay summarisation in biomedicine and beyond

Tomas Goldsack, Carolina Scarton, and Chenghua Lin. Leveraging large language models for zero-shot lay summarisation in biomedicine and beyond. arXiv preprint arXiv:2501.05224, 2025

work page arXiv 2025
[24]

Consort for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration

Sally Hopewell, Mike Clarke, David Moher, Elizabeth Wager, Philippa Middleton, Douglas G Altman, Kenneth F Schulz, and Consort Group. Consort for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration. PLoS medicine, 5 0 (1): 0 e20, 2008

work page 2008
[25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

M ath P rompter: Mathematical reasoning using large language models

Shima Imani, Liang Du, and Harsh Shrivastava. M ath P rompter: Mathematical reasoning using large language models. In Sunayana Sitaram, Beata Beigman Klebanov, and Jason D Williams, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37--42, Toronto, Canada, July 2023. Associat...

work page doi:10.18653/v1/2023.acl-industry.4 2023
[27]

Understanding pubmed user search behavior through log analysis

Rezarta Islamaj Dogan, G Craig Murray, Aur \'e lie N \'e v \'e ol, and Zhiyong Lu. Understanding pubmed user search behavior through log analysis. Database, 2009: 0 bap018, 2009

work page 2009
[28]

Chatgpt makes medicine easy to swallow: an exploratory case study on simplified radiology reports

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa St \"u ber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Oliver Sabel, Jens Ricke, et al. Chatgpt makes medicine easy to swallow: an exploratory case study on simplified radiology reports. European radiology, 34 0 (5): 0 2817--2825, 2024

work page 2024
[29]

Evaluation of spin in abstracts of papers in psychiatry and psychology journals

Samuel Jellison, Will Roberts, Aaron Bowers, Tyler Combs, Jason Beaman, Cole Wayant, and Matt Vassar. Evaluation of spin in abstracts of papers in psychiatry and psychology journals. BMJ evidence-based medicine, 25 0 (5): 0 178--181, 2020

work page 2020
[30]

Daniel P Jeong, Saurabh Garg, Zachary Chase Lipton, and Michael Oberst. Medical adaptation of large language and vision-language models: Are we making progress? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12143--12170, Miami, Florida, USA, Nove...

work page doi:10.18653/v1/2024.emnlp-main.677 2024
[31]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Multilingual simplification of medical texts

Sebastian Joseph, Kathryn Kazanas, Keziah Reina, Vishnesh J Ramanathan, Wei Xu, Byron C Wallace, and Junyi Jessy Li. Multilingual simplification of medical texts. arXiv preprint arXiv:2305.12532, 2023

work page arXiv 2023
[33]

F act PICO : Factuality evaluation for plain language summarization of medical evidence

Sebastian Joseph, Lily Chen, Jan Trienes, Hannah G \"o ke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. F act PICO : Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volu...

work page doi:10.18653/v1/2024.acl-long.459 2024
[34]

Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review

Muhammad Shahzeb Khan, Noman Lateef, Tariq Jamal Siddiqi, Karim Abdur Rehman, Saed Alnaimat, Safi U Khan, Haris Riaz, M Hassan Murad, John Mandrola, Rami Doukky, et al. Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review. JAMA network open, 2 0 (...

work page 2019
[35]

On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications

Anna Koroleva and Patrick Paroubek. On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications. In Language and Technology Conference, pages 304--317. Springer, 2017

work page 2017
[36]

Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts)

Anna Koroleva and Patrick Paroubek. Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts). In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

work page 2018
[37]

Despin: a prototype system for detecting spin in biomedical publications

Anna Koroleva, Sanjay Kamath, Patrick MM Bossuyt, and Patrick Paroubek. Despin: a prototype system for detecting spin in biomedical publications. In roceedings of the BioNLP 2020 workshop, pages 49--59. Association for Computational Linguistics, 2020

work page 2020
[38]

The health literacy of america's adults: Results from the 2003 national assessment of adult literacy

Mark Kutner, Elizabeth Greenburg, Ying Jin, and Christine Paulsen. The health literacy of America's adults: Results from the 2003 National Assessment of Adult Literacy. NCES 2006-483. National Center for Education Statistics, 2006

[39]

Biomistral: A collection of open-source pretrained large language models for medical domains

Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024

[40]

Biobert: a pre-trained biomedical language representation model for biomedical text mining

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234--1240, 2020

[41]

Suzanne Lockyer, Rob Hodgson, Jo C Dumville, and Nicky Cullum. "Spin" in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials, 14:1--10, 2013

[42]

Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine

Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023

[43]

A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital

Alvin Marcelo, Alex Gavino, Iris Thiele Isip-Tan, Leilanie Apostol-Nicodemus, Faith Joan Mesa-Gaerlan, Paul Nimrod Firaza, John Francis Faustorilla, Fiona M Callaghan, and Paul Fontelo. A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital. BMJ Evi...

[44]

What is readability and why should content editors care about it

Lisa Marchand. What is readability and why should content editors care about it. Center for Plain Language. https://centerforplainlanguage.org/what-isreadability, 2017

[45]

Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section

Sylvain Mathieu, Bruno Giraudeau, Martin Soubrier, and Philippe Ravaud. Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section. Joint Bone Spine, 79(3):262--267, 2012

[46]

Introducing meta llama 3: The most capable openly available llm to date

Meta AI. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2024

[47]

Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis

Ross Nowlin, Alexis Wirtz, David Wenger, Ryan Ottwell, Courtney Cook, Wade Arthur, Brigitte Sallee, Jarad Levin, Micah Hartwell, Drew Wright, et al. Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis. JMIR dermatology, 5(1):e33996, 2022

[48]

Simply put: A guide for creating easy-to-understand materials

US Department of Health and Human Services et al. Simply put: A guide for creating easy-to-understand materials. 2009

[49]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656, 2024

[50]

A survey of automated methods for biomedical text simplification

Brian Ondov, Kush Attal, and Dina Demner-Fushman. A survey of automated methods for biomedical text simplification. Journal of the American Medical Informatics Association, 29(11):1976--1988, 2022

[51]

Models, 2025

OpenAI. Models, 2025. URL https://platform.openai.com/docs/models/gpt-3-5-turbo. Accessed: 2025-01-17

[52]

GPT-4o mini: Advancing cost-efficient intelligence, 2024

OpenAI. GPT-4o mini: Advancing cost-efficient intelligence, 2024. URL: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024

[53]

CK Osborne, J Pippen, SE Jones, LM Parker, M Ellis, S Come, SZ Gertler, JT May, G Burton, I Dimery, et al. Double-blind, randomized trial comparing the efficacy and tolerability of fulvestrant versus anastrozole in postmenopausal women with advanced breast cancer progressing on prior endocrine therapy: results of a north american trial. Journal of Clinica...

[54]

Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

Ankit Pal and Malaikannan Sankarasubbu. Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

[55]

Assessing ai simplification of medical texts: readability and content fidelity

Bryce Picton, Saman Andalib, Aidin Spina, Brandon Camp, Sean S Solomon, Jason Liang, Patrick M Chen, Jefferson W Chen, Frank P Hsu, and Michael Y Oh. Assessing ai simplification of medical texts: readability and content fidelity. International Journal of Medical Informatics, 195:105743, 2025

[56]

The state of oa: a large-scale analysis of the prevalence and impact of open access articles

Heather Piwowar, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. The state of oa: a large-scale analysis of the prevalence and impact of open access articles. PeerJ, 6:e4375, 2018

[57]

Malignant: how bad policy and bad evidence harm people with Cancer

Vinayak K Prasad. Malignant: how bad policy and bad evidence harm people with Cancer. JHU Press, 2020

[58]

Reasoning with language model prompting: A survey

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368--...

doi:10.18653/v1/2023.acl-long.294
[59]

Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions

Riaz Qureshi, Kevin Naaman, Nicolas G Quan, Evan Mayo-Wilson, Matthew J Page, Victoria Cornelius, Roger Chou, Isabelle Boutron, Su Golder, Lisa Bero, et al. Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions. Annals of internal medicine, 177(8):1089--1098, 2024

[60]

Evaluation of spin in the abstracts of emergency medicine randomized controlled trials

Victoria Reynolds-Vaughn, Jonathan Riddle, Jamin Brown, Michael Schiesel, Cole Wayant, and Matt Vassar. Evaluation of spin in the abstracts of emergency medicine randomized controlled trials. Annals of emergency medicine, 75(3):423--431, 2020

[61]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468--475, 2024

[62]

Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)

Chantal Shaib, Millicent Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, and Byron Wallace. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 13...

doi:10.18653/v1/2023.acl-short.119
[63]

Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature

Elise Smith, Stefanie Haustein, Philippe Mongeon, Fei Shu, Valéry Ridde, and Vincent Larivière. Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature. Health Research Policy and Systems, 15:1--10, 2017

[64]

Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study

Naichuan Su, Michiel W Van Der Linden, Clovis M Faggion Jr, and Geert JMG Van Der Heijden. Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study. Caries Research, 57(5-6):553--562, 2023

[65]

Exaggerations and caveats in press releases and health-related science news

Petroc Sumner, Solveiga Vivian-Griffiths, Jacky Boivin, Andrew Williams, Lewis Bott, Rachel Adams, Christos A Venetis, Leanne Whelan, Bethan Hughes, and Christopher D Chambers. Exaggerations and caveats in press releases and health-related science news. PloS one, 11(12):e0168217, 2016

[66]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

[67]

Large language models in medicine

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930--1940, 2023

[68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[69]

Evaluation of spin in oncology clinical trials

C Wayant, D Margalski, K Vaughn, and M Vassar. Evaluation of spin in oncology clinical trials. Critical Reviews in Oncology/Hematology, 144:102821, 2019

[70]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837, 2022

[71]

Health literacy and patient safety: Help patients understand

Barry D Weiss. Health literacy and patient safety: Help patients understand. Manual for clinicians. American Medical Association Foundation, 2007

[72]

A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity

Amélie Yavchitz, Philippe Ravaud, Douglas G Altman, David Moher, Asbjørn Hróbjartsson, Toby Lasserson, and Isabelle Boutron. A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity. Journal of clinical epidemiology, 75:56--65, 2016

[73]

Alpacare: Instruction-tuned large language models for medical application

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558, 2023


[33] [33]

F act PICO : Factuality evaluation for plain language summarization of medical evidence

Sebastian Joseph, Lily Chen, Jan Trienes, Hannah G \"o ke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. F act PICO : Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volu...

work page doi:10.18653/v1/2024.acl-long.459 2024

[34] [34]

Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review

Muhammad Shahzeb Khan, Noman Lateef, Tariq Jamal Siddiqi, Karim Abdur Rehman, Saed Alnaimat, Safi U Khan, Haris Riaz, M Hassan Murad, John Mandrola, Rami Doukky, et al. Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: a systematic review. JAMA network open, 2 0 (...

work page 2019

[35] [35]

On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications

Anna Koroleva and Patrick Paroubek. On the contribution of specific entity detection in comparative constructions to automatic spin detection in biomedical scientific publications. In Language and Technology Conference, pages 304--317. Springer, 2017

work page 2017

[36] [36]

Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts)

Anna Koroleva and Patrick Paroubek. Annotating spin in biomedical scientific publications: the case of random controlled trials (rcts). In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

work page 2018

[37] [37]

Despin: a prototype system for detecting spin in biomedical publications

Anna Koroleva, Sanjay Kamath, Patrick MM Bossuyt, and Patrick Paroubek. Despin: a prototype system for detecting spin in biomedical publications. In roceedings of the BioNLP 2020 workshop, pages 49--59. Association for Computational Linguistics, 2020

work page 2020

[38] [38]

The health literacy of america's adults: Results from the 2003 national assessment of adult literacy

Mark Kutner, Elizabeth Greenburg, Ying Jin, and Christine Paulsen. The health literacy of america's adults: Results from the 2003 national assessment of adult literacy. nces 2006-483. National Center for education statistics, 2006

work page 2003

[39] [39]

Biomistral: A collection of open-source pretrained large language models for medical domains

Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024

work page arXiv 2024

[40] [40]

Biobert: a pre-trained biomedical language representation model for biomedical text mining

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36 0 (4): 0 1234--1240, 2020

work page 2020

[41] [41]

Suzanne Lockyer, Rob Hodgson, Jo C Dumville, and Nicky Cullum. "spin" in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials, 14: 0 1--10, 2013

work page 2013

[42] [42]

Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine

Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023

work page arXiv 2023

[43] [43]

A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital

Alvin Marcelo, Alex Gavino, Iris Thiele Isip-Tan, Leilanie Apostol-Nicodemus, Faith Joan Mesa-Gaerlan, Paul Nimrod Firaza, John Francis Faustorilla, Fiona M Callaghan, and Paul Fontelo. A comparison of the accuracy of clinical decisions based on full-text articles and on journal abstracts alone: a study among residents in a tertiary care hospital. BMJ Evi...

work page 2013

[44] [44]

What is readability and why should content editors care about it

Lisa Marchand. What is readability and why should content editors care about it. Center for Plain Language. https://centerforplainlanguage. org/what-isreadability, 2017

work page 2017

[45] [45]

Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section

Sylvain Mathieu, Bruno Giraudeau, Martin Soubrier, and Philippe Ravaud. Misleading abstract conclusions in randomized controlled trials in rheumatology: comparison of the abstract conclusions and the results section. Joint Bone Spine, 79 0 (3): 0 262--267, 2012

work page 2012

[46] [46]

Introducing meta llama 3: The most capable openly available llm to date

AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2024

work page 2024

[47] [47]

Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis

Ross Nowlin, Alexis Wirtz, David Wenger, Ryan Ottwell, Courtney Cook, Wade Arthur, Brigitte Sallee, Jarad Levin, Micah Hartwell, Drew Wright, et al. Spin in abstracts of systematic reviews and meta-analyses of melanoma therapies: Cross-sectional analysis. JMIR dermatology, 5 0 (1): 0 e33996, 2022

work page 2022

[48] [48]

Simply put: A guide for creating easy-to-understand materials

US Department of Health, Human Services, et al. Simply put: A guide for creating easy-to-understand materials. 2009

work page 2009

[49] [49]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

A survey of automated methods for biomedical text simplification

Brian Ondov, Kush Attal, and Dina Demner-Fushman. A survey of automated methods for biomedical text simplification. Journal of the American Medical Informatics Association, 29 0 (11): 0 1976--1988, 2022

work page 1976

[51] [51]

Models, 2025

OpenAI. Models, 2025. URL https://platform.openai.com/docs/models/gpt-3-5-turbo. Accessed: 2025-01-17

work page 2025

[52] [52]

4o mini: Advancing cost-efficient intelligence, 2024

Gpt OpenAI. 4o mini: Advancing cost-efficient intelligence, 2024. URL: https://openai. com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024

work page 2024

[53] [53]

CK Osborne, J Pippen, SE Jones, LM Parker, M Ellis, S Come, SZ Gertler, JT May, G Burton, I Dimery, et al. Double-blind, randomized trial comparing the efficacy and tolerability of fulvestrant versus anastrozole in postmenopausal women with advanced breast cancer progressing on prior endocrine therapy: results of a north american trial. Journal of Clinica...

work page 2002

[54] [54]

Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

Malaikannan Sankarasubbu Ankit Pal and Malaikannan Sankarasubbu. Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024

work page 2024

[55] [55]

Assessing ai simplification of medical texts: readability and content fidelity

Bryce Picton, Saman Andalib, Aidin Spina, Brandon Camp, Sean S Solomon, Jason Liang, Patrick M Chen, Jefferson W Chen, Frank P Hsu, and Michael Y Oh. Assessing ai simplification of medical texts: readability and content fidelity. International Journal of Medical Informatics, 195: 0 105743, 2025

work page 2025

[56] [56]

The state of oa: a large-scale analysis of the prevalence and impact of open access articles

Heather Piwowar, Jason Priem, Vincent Larivi \`e re, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. The state of oa: a large-scale analysis of the prevalence and impact of open access articles. PeerJ, 6: 0 e4375, 2018

work page 2018

[57] [57]

Malignant: how bad policy and bad evidence harm people with Cancer

Vinayak K Prasad. Malignant: how bad policy and bad evidence harm people with Cancer. JHU Press, 2020

work page 2020

[58] [58]

Reasoning with language model prompting: A survey

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368--...

work page doi:10.18653/v1/2023.acl-long.294 2023

[59] [59]

Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions

Riaz Qureshi, Kevin Naaman, Nicolas G Quan, Evan Mayo-Wilson, Matthew J Page, Victoria Cornelius, Roger Chou, Isabelle Boutron, Su Golder, Lisa Bero, et al. Development and evaluation of a framework for identifying and addressing spin for harms in systematic reviews of interventions. Annals of internal medicine, 177(8):1089--1098, 2024

work page 2024

[60] [60]

Evaluation of spin in the abstracts of emergency medicine randomized controlled trials

Victoria Reynolds-Vaughn, Jonathan Riddle, Jamin Brown, Michael Schiesel, Cole Wayant, and Matt Vassar. Evaluation of spin in the abstracts of emergency medicine randomized controlled trials. Annals of emergency medicine, 75(3):423--431, 2020

work page 2020

[61] [61]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468--475, 2024

work page 2024

[62] [62]

Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)

Chantal Shaib, Millicent Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, and Byron Wallace. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 13...

work page doi:10.18653/v1/2023.acl-short.119 2023

[63] [63]

Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature

Elise Smith, Stefanie Haustein, Philippe Mongeon, Fei Shu, Valéry Ridde, and Vincent Larivière. Knowledge sharing in global health research--the impact, uptake and cost of open access to scholarly literature. Health Research Policy and Systems, 15:1--10, 2017

work page 2017

[64] [64]

Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study

Naichuan Su, Michiel W Van Der Linden, Clovis M Faggion Jr, and Geert JMG Van Der Heijden. Assessment of spin in the abstracts of randomized controlled trials in dental caries with statistically nonsignificant results for primary outcomes: A methodological study. Caries Research, 57(5-6):553--562, 2023

work page 2023

[65] [65]

Exaggerations and caveats in press releases and health-related science news

Petroc Sumner, Solveiga Vivian-Griffiths, Jacky Boivin, Andrew Williams, Lewis Bott, Rachel Adams, Christos A Venetis, Leanne Whelan, Bethan Hughes, and Christopher D Chambers. Exaggerations and caveats in press releases and health-related science news. PloS one, 11(12):e0168217, 2016

work page 2016

[66] [66]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page arXiv 2024

[67] [67]

Large language models in medicine

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930--1940, 2023

work page 2023

[68] [68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page arXiv 2023

[69] [69]

Evaluation of spin in oncology clinical trials

C Wayant, D Margalski, K Vaughn, and M Vassar. Evaluation of spin in oncology clinical trials. Critical Reviews in Oncology/Hematology, 144:102821, 2019

work page 2019

[70] [70]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837, 2022

work page 2022

[71] [71]

Health literacy and patient safety: Help patients understand

Barry D Weiss. Health literacy and patient safety: Help patients understand. Manual for clinicians. American Medical Association Foundation, 2007

work page 2007

[72] [72]

A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity

Amélie Yavchitz, Philippe Ravaud, Douglas G Altman, David Moher, Asbjørn Hróbjartsson, Toby Lasserson, and Isabelle Boutron. A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity. Journal of clinical epidemiology, 75:56--65, 2016

work page 2016

[73] [73]

Alpacare: Instruction-tuned large language models for medical application

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558, 2023

work page arXiv 2023