Making Knowledge Accessible: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification

Aikaterini Melliou; Lian Zhang; P. Bilha Githinji; Peiwu Qin; Zeming Liang

arXiv: 2511.05080 · v4 · submitted 2025-11-07 · 💻 cs.CL

Pith reviewed 2026-05-18 00:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords: biomedical text simplification · large language models · readability metrics · BERTScore · discourse fidelity · Mistral · QWen

The pith

Mistral improves the readability of biomedical texts while preserving discourse fidelity at levels statistically comparable to those of humans, unlike QWen, which shows a disconnect in balancing the two.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how two large language models simplify biomedical text to meet public demand for accessible information. It finds that Mistral applies a careful, tempered approach to changing words that boosts multiple readability scores yet keeps overall meaning close to the original. QWen also raises readability but does not align those gains as tightly with preserved accuracy. The authors further show that many of the 21 metrics they tracked overlap strongly, pointing to simpler ways to judge future simplification work. A reader would care because reliable, meaning-preserving simplification could let non-experts understand medical content without distortion.

Core claim

Mistral exhibits a tempered lexical simplification approach that consistently enhances readability across multiple metrics while preserving discourse fidelity (BERTScore: 0.91, statistically comparable to that of humans). In comparison, QWen also attains enhanced readability and a reasonable BERTScore of 0.89, but shows a disconnect in balancing readability and accuracy. Additionally, a comprehensive correlation analysis of a suite of 21 metrics confirms strong functional redundancies among the metrics and informs adaptation requirements.
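The readability half of this claim rests on surface formulas. As a rough illustration only (the paper's exact metric suite and tooling are not given in this summary), the sketch below computes three classic formulas, Flesch Reading Ease, Flesch-Kincaid Grade Level, and the Automated Readability Index, using their published coefficients and a crude vowel-group syllable counter. The helper names and the example sentences are hypothetical; dedicated libraries such as textstat use more careful tokenization and syllable rules, so absolute values will differ.

```python
import re

# Vowel groups approximate syllables; this is a heuristic, not a dictionary lookup.
_VOWEL_GROUPS = re.compile(r"[aeiouy]+")
_WORDS = re.compile(r"[A-Za-z0-9']+")


def count_syllables(word: str) -> int:
    """Approximate syllables: count vowel groups, drop a silent final 'e', floor at 1."""
    w = word.lower()
    n = len(_VOWEL_GROUPS.findall(w))
    if n > 1 and w.endswith("e") and not w.endswith("le"):
        n -= 1
    return max(n, 1)


def readability(text: str) -> dict[str, float]:
    """Return three classic readability scores for a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = _WORDS.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    n_sent, n_words = len(sentences), len(words)
    n_syll = sum(count_syllables(w) for w in words)
    n_chars = sum(sum(ch.isalnum() for ch in w) for w in words)  # letters and digits only
    wps = n_words / n_sent  # words per sentence
    spw = n_syll / n_words  # syllables per word
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,  # higher = easier
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,  # US school grade
        "ari": 4.71 * (n_chars / n_words) + 0.5 * wps - 21.43,  # US school grade
    }


if __name__ == "__main__":
    # Hypothetical source/simplification pair, for illustration only.
    source = "Hypertension is associated with elevated cardiovascular morbidity and mortality."
    simple = "High blood pressure raises the risk of heart disease and death."
    for label, text in (("source", source), ("simplified", simple)):
        print(label, {k: round(v, 1) for k, v in readability(text).items()})
```

A simplified version should score a lower grade level and a higher Reading Ease than its source. Neither number says anything about whether the medical content survived, which is why these formulas are paired with BERTScore.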

What carries the argument

The distinct operational strategies each model uses to trade off lexical simplification against discourse preservation, tracked through readability metrics and BERTScore and benchmarked against human baselines.

If this is right

Models like Mistral can be selected for public-facing biomedical applications where both access and accuracy matter.
Fewer than 21 metrics may suffice for judging simplification quality because many are functionally redundant.
Instruction-tuned models may favor fidelity-preserving strategies, while reasoning-augmented ones may prioritize readability gains at some cost to fidelity.
Adaptation of these models for biomedical use should target the specific readability-accuracy balance observed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The strategy difference may trace to Mistral being instruction-tuned and QWen being reasoning-augmented, suggesting that tuning type shapes simplification behavior.
Evaluating the same models on authentic patient education materials could show whether the reported trade-offs hold outside the test set.
The metric redundancies open a path to build lighter, more targeted evaluation suites for future text-simplification studies.

Load-bearing premise

That the chosen readability metrics and BERTScore together fully capture the intended trade-off without missing important aspects of biomedical accuracy, and that the test texts represent typical real-world biomedical content.

What would settle it

A set of expert human ratings on critical medical facts retained or lost in the simplified outputs, or performance measured on a new collection of real patient-facing biomedical queries.

Figures

Figures reproduced from arXiv: 2511.05080 by Aikaterini Melliou, Lian Zhang, P. Bilha Githinji, Peiwu Qin, Zeming Liang.

**Figure 2.** Hypothesis test results. In the µLLM vs. µhuman rows, mean values are shown, with darker shading where p-value > 0.05. In the µ1 - µ2 rows, the difference between means is shown, and darker shading highlights results with p-value > 0.05.
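The shading rule above treats a model as comparable to humans when the test fails to reject equal means (p > 0.05). The caption does not say which test produced these p-values, so the sketch below swaps in a two-sided permutation test on the difference of means, which needs only the standard library. The function names and score lists are hypothetical per-text BERTScore values, not the paper's data. A p-value above 0.05 means no difference was detected, which is weaker than demonstrated equivalence.

```python
import random
from statistics import mean


def permutation_pvalue(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means between samples a and b."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of texts into two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


def comparable_to_human(llm_scores, human_scores, alpha=0.05):
    """Apply the Figure 2 convention: 'comparable' when p > alpha."""
    p = permutation_pvalue(llm_scores, human_scores)
    return p > alpha, p


if __name__ == "__main__":
    # Hypothetical per-text BERTScores, for illustration only.
    human = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92]
    model_a = [0.91, 0.90, 0.92, 0.91, 0.93, 0.90]
    model_b = [0.86, 0.87, 0.85, 0.88, 0.86, 0.87]
    for name, scores in (("model_a", model_a), ("model_b", model_b)):
        ok, p = comparable_to_human(scores, human)
        print(f"{name}: mean={mean(scores):.3f} p={p:.4f} comparable={ok}")
```

Because LLM and human simplifications come from the same source texts, a paired test on per-text differences would be a natural alternative; the unpaired version is shown only because it mirrors the "µ1 - µ2" framing.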

**Figure 3.** Correlations between metrics.

**Figure 4.** LLMs’ self-reported rationale for changes made.

Original abstract

The growing public demand for accessible biomedical information calls for scalable text simplification. While large language models (LLMs) offer solutions, they too struggle with balancing improved readability against preservation of meaning. This report empirically compares how two LLMs, the instruction-tuned Mistral-Small 3 24B and the reasoning-augmented QWen2.5 32B, navigate this trade-off in biomedical text simplification, benchmarked against human performance. Our analysis highlights how each model applies distinct operational strategies when simplifying biomedical text. Mistral exhibits a tempered lexical simplification approach that consistently enhances readability across multiple metrics while preserving discourse fidelity (BERTScore: 0.91, statistically comparable to that of humans). In comparison, QWen also attains enhanced readability performance and a reasonable BERTScore of 0.89, but presents a disconnect in balancing readability and accuracy. Additionally, a comprehensive correlation analysis of a suite of 21 metrics confirms strong functional redundancies in metrics and informs adaptation requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

Mistral retains higher semantic overlap than QWen on biomedical simplification, but the evaluation does not directly check factual medical accuracy.

The paper's core observation is that Mistral-Small 3 produces simplifications with higher BERTScore (0.91) and steadier readability gains than QWen2.5 (0.89), while both beat or match human baselines on several surface metrics. The authors also map correlations across 21 metrics and note redundancies that could guide future work. This comparison of two recent models in this specific domain is the main new evidence they add to existing LLM evaluation studies.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically compares instruction-tuned Mistral-Small 3 24B and reasoning-augmented QWen2.5 32B on biomedical text simplification, benchmarked against human performance. It claims Mistral applies a tempered lexical simplification strategy that improves readability across multiple metrics while preserving discourse fidelity (BERTScore 0.91, statistically comparable to humans), whereas QWen attains readability gains but shows a disconnect in balancing readability and accuracy (BERTScore 0.89). A correlation analysis across 21 metrics is used to identify functional redundancies.

Significance. If the empirical distinctions hold under more rigorous validation, the work is significant for documenting divergent LLM strategies in a high-stakes domain, with potential to inform model selection or prompting for accessible biomedical communication. The 21-metric correlation analysis is a clear strength, as it directly addresses metric redundancy and could support more efficient evaluation protocols in future simplification research.

major comments (3)

Abstract and Results: The central claim that Mistral preserves discourse fidelity (BERTScore 0.91, statistically comparable to humans) while QWen exhibits a readability-accuracy disconnect (BERTScore 0.89) rests on BERTScore as a proxy, yet the manuscript provides no domain-specific factuality validation, expert error annotation for omitted facts or altered causal relations, or comparison against biomedical reference standards; this is load-bearing because BERTScore captures contextual embedding overlap rather than factual accuracy in technical text.
Methods/Experimental Setup: Dataset details (source texts, size, selection criteria for biomedical content), exact prompting templates, and the statistical tests or error bars supporting the 'statistically comparable' claim are absent; these omissions prevent assessment of whether the reported metric differences reflect genuine strategy distinctions or surface-level lexical changes.
Results (correlation analysis): While the 21-metric analysis is a positive contribution, the manuscript does not report how the metrics were selected (in particular, whether the subset was chosen post hoc) or whether the observed redundancies affect the interpretation of each model's readability-accuracy trade-off.
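The first major comment turns on what BERTScore actually measures. In the original BERTScore formulation, every token of the candidate and the reference is embedded with a pretrained contextual encoder, each token is greedily matched to its most cosine-similar counterpart, and precision, recall, and F1 are averaged over those matches (optionally with idf weighting and baseline rescaling). The sketch below reproduces only the matching arithmetic over hand-written toy vectors standing in for encoder outputs. The vectors, the function names, and the premise that "lowers" and "raises" embed almost identically are hypothetical; the point is that a factual inversion can still score high when the swapped token sits close in embedding space.

```python
import math

Vec = list[float]


def cosine(u: Vec, v: Vec) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def bertscore_prf(cand: list[Vec], ref: list[Vec]) -> tuple[float, float, float]:
    """Greedy-matching precision/recall/F1 over token embeddings (no idf, no rescaling)."""
    sim = [[cosine(c, r) for r in ref] for c in cand]
    # Precision: each candidate token matched to its closest reference token.
    precision = sum(max(row) for row in sim) / len(cand)
    # Recall: each reference token matched to its closest candidate token.
    recall = sum(max(sim[i][j] for i in range(len(cand))) for j in range(len(ref))) / len(ref)
    if precision + recall == 0:
        return precision, recall, 0.0  # guard against a degenerate F1 denominator
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 3-d "embeddings" for: drug / lowers / risk (hypothetical values).
    reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.0, 0.0, 1.0]]
    # Same sentence with "raises" instead of "lowers": meaning flips, vector barely moves.
    flipped = [[1.0, 0.0, 0.0], [0.0, 1.0, -0.2], [0.0, 0.0, 1.0]]
    print("identical:", bertscore_prf(reference, reference))
    print("flipped:  ", bertscore_prf(flipped, reference))
```

Real BERTScore values depend on the encoder, the layer, and whether baseline rescaling is applied, so absolute numbers such as 0.91 and 0.89 are only comparable within one configuration.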

minor comments (2)

Abstract: The phrase 'a suite of 21 metrics' should be accompanied by at least a high-level categorization (e.g., lexical, syntactic, semantic) to orient readers before the full correlation results.
Throughout: Define all abbreviations (e.g., BERTScore) on first use and ensure figure captions explicitly state what each panel compares (model vs. human vs. original).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the work.

Referee: Abstract and Results: The central claim that Mistral preserves discourse fidelity (BERTScore 0.91, statistically comparable to humans) while QWen exhibits a readability-accuracy disconnect (BERTScore 0.89) rests on BERTScore as a proxy, yet the manuscript provides no domain-specific factuality validation, expert error annotation for omitted facts or altered causal relations, or comparison against biomedical reference standards; this is load-bearing because BERTScore captures contextual embedding overlap rather than factual accuracy in technical text.

Authors: We agree that BERTScore functions as a semantic similarity proxy rather than a direct factuality measure and does not detect omitted facts or altered causal relations. Our central contribution is the documentation of divergent operational strategies through a multi-metric profile, where BERTScore is used alongside readability metrics to benchmark against human performance. In the revision we will add an explicit limitations subsection acknowledging this proxy limitation, include qualitative examples illustrating content preservation differences, and note that full expert factuality annotation lies outside the current scope. We maintain that the observed metric distinctions still offer useful guidance for model selection in biomedical simplification. revision: partial
Referee: Methods/Experimental Setup: Dataset details (source texts, size, selection criteria for biomedical content), exact prompting templates, and the statistical tests or error bars supporting the 'statistically comparable' claim are absent; these omissions prevent assessment of whether the reported metric differences reflect genuine strategy distinctions or surface-level lexical changes.

Authors: We will revise the Methods section to explicitly state the source corpus (PubMed abstracts), the exact number of texts, and the selection criteria for biomedical relevance. Exact prompting templates will be moved from supplementary material into the main text. The statistical comparability claim is based on a two-sample t-test; we will report the precise test, p-value, and add error bars to all relevant figures and tables. revision: yes
Referee: Results (correlation analysis): While the 21-metric analysis is a positive contribution, the manuscript does not report how the subset of metrics was chosen post-hoc or whether the observed redundancies affect the interpretation of the readability-accuracy trade-off for each model.

Authors: The 21 metrics were pre-selected from the standard set used in prior simplification literature to span readability, semantic fidelity, and lexical dimensions. We will add a dedicated paragraph in the Results section describing this selection rationale and explicitly discuss how the observed redundancies (e.g., between multiple readability formulas) should qualify interpretation of the readability-accuracy trade-off, ensuring readers do not over-weight correlated measures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics and human baselines are independent

The paper reports direct empirical measurements of LLM-generated simplifications against human references using off-the-shelf metrics (BERTScore, readability scores) and a correlation analysis of 21 metrics. No equations, fitted parameters, or self-citations are used to derive the central claims; the reported differences (e.g., Mistral BERTScore 0.91 vs. QWen 0.89) are computed from model outputs on the test set and compared to external human baselines. The analysis is therefore self-contained against standard benchmarks and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical comparison study that relies on off-the-shelf LLMs and standard NLP metrics; no new mathematical axioms, free parameters fitted to the target result, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5490 in / 1209 out tokens · 48349 ms · 2026-05-18T00:24:29.769957+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

[1]

Shared Decision Making Interventions: Theoretical and Empirical Evidence with Im- plications for Health Literacy

Stacey Dawn, Hill Sophie, McCaffery Kirsten, Boland Laura, Lewis Krystina B., and Horvat Lidia. Shared Decision Making Interventions: Theoretical and Empirical Evidence with Im- plications for Health Literacy. InStudies in Health Technology and Informatics. IOS Press, 9 An Architectural Advantage of The Instruction-Tuned LLM in Containing The Readability-...

work page
[2]

URL https://www.medra.org/servlet/aliasResolver?alias= iospressISBN&isbn=978-1-61499-789-4&spage= 263&doi=10.3233/978-1-61499-790-0-263

doi:10.3233/978-1-61499-790-0-263. URL https://www.medra.org/servlet/aliasResolver?alias= iospressISBN&isbn=978-1-61499-789-4&spage= 263&doi=10.3233/978-1-61499-790-0-263

work page doi:10.3233/978-1-61499-790-0-263
[3]

A guide for policy and deci- sion makers on health literacy policies.Euro- pean Journal of Public Health, 34(Supplement_3): ckae144.787, November 2024

A Schlacher. A guide for policy and deci- sion makers on health literacy policies.Euro- pean Journal of Public Health, 34(Supplement_3): ckae144.787, November 2024. ISSN 1101- 1262, 1464-360X. doi:10.1093/eurpub/ckae144.787. URL https://academic.oup.com/eurpub/article/doi/10. 1093/eurpub/ckae144.787/7844567

work page doi:10.1093/eurpub/ckae144.787 2024
[4]

Vishala Mishra and Joseph P. Dexter. Compar- ison of Readability of Official Public Health Information About COVID-19 on Websites of International Agencies and the Governments of 15 Countries.JAMA Network Open, 3(8): e2018033, August 2020. ISSN 2574-3805. doi:10.1001/jamanetworkopen.2020.18033. URL https://jamanetwork.com/journals/ jamanetworkopen/fullart...

work page doi:10.1001/jamanetworkopen.2020.18033 2020
[5]

Prevalence of Health Misinformation on Social Me- dia: Systematic Review.Journal of Medical Internet Research, 23(1):e17187, January 2021

Victor Suarez-Lledo and Javier Alvarez-Galvez. Prevalence of Health Misinformation on Social Me- dia: Systematic Review.Journal of Medical Internet Research, 23(1):e17187, January 2021. ISSN 1438-

work page 2021
[6]

URL http://www.jmir.org/ 2021/1/e17187/

doi:10.2196/17187. URL http://www.jmir.org/ 2021/1/e17187/

work page doi:10.2196/17187 2021
[8]

Overview of the BioLay- Summ 2024 Shared Task on the Lay Summarization of Biomedical Research Articles

Tomas Goldsack, Carolina Scarton, Matthew Shard- low, and Chenghua Lin. Overview of the BioLay- Summ 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. In Dina Demner- Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, and Junichi Tsujii, editors,Proceedings of the 23rd Workshop on Biomedical Natural Lan- guage Processing, pages...

work page doi:10.18653/v1/2024.bionlp-1.10 2024
[9]

Chen, Freya Gulamali, and Shalmali Joshi

Monica Agrawal, Irene Y . Chen, Freya Gulamali, and Shalmali Joshi. The evaluation illusion of large language models in medicine.npj Digital Medicine, 8(1):1–4, October 2025. ISSN 2398-

work page 2025
[10]

Chen, Freya Gula- mali, and Shalmali Joshi

doi:10.1038/s41746-025-01963-x. URL https: //www.nature.com/articles/s41746-025-01963-x

work page doi:10.1038/s41746-025-01963-x
[11]

Lessons from the TREC Plain Language Adaptation of Biomed- ical Abstracts (PLABA) track, 2025

Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, and Dina Demner-Fushman. Lessons from the TREC Plain Language Adaptation of Biomed- ical Abstracts (PLABA) track, 2025. URL https: //arxiv.org/abs/2507.14096

work page arXiv 2025
[12]

Plain Language Adaptations of Biomedical Text Us- ing LLMs: Comparision of Evaluation Metrics

Primoz Kocbek, Leon Kopitar, and Gregor Stiglic. Plain Language Adaptations of Biomedical Text Us- ing LLMs: Comparision of Evaluation Metrics. In Mowafa S. Househ, Zain Ul Abideen Tariq, Mah- mood Al-Zubaidi, Uzair Shah, and Elaine Huesing, editors,Studies in Health Technology and Informat- ics. IOS Press, August 2025. ISBN 9781643686080. doi:10.3233/SHT...

work page doi:10.3233/shti250946 2025
[14]

Mistral Small 3 | Mistral AI

Mistral AI Team. Mistral Small 3 | Mistral AI. URL https://mistral.ai/news/mistral-small-3

work page
[15]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayi- heng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Explor- ing the Landscape of Automatic Text Summa- rization: A Comprehensive Survey.IEEE Ac- cess, 11:109819–109840, 2023

Bilal Khan, Zohaib Ali Shah, Muhammad Us- man, Inayat Khan, and Badam Niazi. Explor- ing the Landscape of Automatic Text Summa- rization: A Comprehensive Survey.IEEE Ac- cess, 11:109819–109840, 2023. ISSN 2169-3536. doi:10.1109/ACCESS.2023.3322188. URL https: //ieeexplore.ieee.org/document/10272614/

work page doi:10.1109/access.2023.3322188 2023
[17]

On Faithfulness and Fac- tuality in Abstractive Summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Fac- tuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th An- nual Meeting of the Association for Computa- tional Linguistics, pages 1906–1919, Online, July

work page 1906
[18]

On Faithfulness and Factuality in Abstractive Summarization

Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.173. URL https:// aclanthology.org/2020.acl-main.173/

work page doi:10.18653/v1/2020.acl-main.173 2020
[19]

Xuanxin Wu and Yuki Arase. An In-depth Evaluation of Large Language Models in Sentence Simplifica- tion with Error-based Human Assessment.ACM Transactions on Intelligent Systems and Technology, page 3744744, June 2025. ISSN 2157-6904, 2157-

work page 2025
[20]

URL https://dl.acm.org/ doi/10.1145/3744744

doi:10.1145/3744744. URL https://dl.acm.org/ doi/10.1145/3744744

work page doi:10.1145/3744744
[21]

Jiageng Wu, Xian Wu, Zhaopeng Qiu, Minghui Li, Shixu Lin, Yingying Zhang, Yefeng Zheng, Changzheng Yuan, and Jie Yang. Large language 10 An Architectural Advantage of The Instruction-Tuned LLM in Containing The Readability-Accuracy Tension models leverage external knowledge to extend clin- ical insight beyond language boundaries.Jour- nal of the American ...

work page doi:10.1093/jamia/ocae079 2054
[22]

In: Proc

Huu Tan Mai, Cuong Xuan Chu, and Heiko Paulheim. Do LLMs Really Adapt to Domains? An Ontology Learning Perspective. In Gianluca Demartini, Katja Hose, Maribel Acosta, Matteo Palmonari, Gong Cheng, Hala Skaf-Molli, Nicolas Ferranti, Daniel Hernández, and Aidan Hogan, editors,The Semantic Web – ISWC 2024, volume 15231, pages 126–143. Springer Nature Switzer...

work page doi:10.1007/978- 2024
[23]

Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study, May 2024

Hui Feng, Francesco Ronzano, Jude LaFleur, Matthew Garber, Rodrigo De Oliveira, Kathryn Rough, Katharine Roth, Jay Nanavati, Khaldoun Zine El Abidine, and Christina Mack. Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study, May 2024. URL http://medrxiv. org/lookup/doi/10.1101/2...

work page doi:10.1101/2024.05.17.24307411 2024
[24]

In- Context Meta LoRA Generation

Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, and Jingcai Guo. In- Context Meta LoRA Generation. InProceedings of the Thirty-ThirdInternational Joint Conference on Artificial Intelligence, pages 6138–6146, Jeju, South Korea, August 2024. Intern...

work page doi:10.24963/ijcai.2025/683 2024
[25]

MEDVOC: V ocabulary Adapta- tion for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and Niloy Ganguly. MEDVOC: V ocabulary Adapta- tion for Fine-tuning Pre-trained Language Models on Medical Text Summarization. volume 7, pages 6180– 6188, August 2024. doi:10.24963/ijcai.2024/683. URL https://www.ijcai.org/proceedings/2024/683

work page doi:10.24963/ijcai.2024/683 2024
[26]

Unveil- ing the Generalization Power of Fine-Tuned Large Language Models

Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng-Ann Heng, and Wai Lam. Unveil- ing the Generalization Power of Fine-Tuned Large Language Models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Papers), pages 884–899, Mexico City, Mexico,

work page 2024
[27]

doi:10.18653/v1/2024.naacl-long.51

Association for Computational Linguistics. doi:10.18653/v1/2024.naacl-long.51. URL https: //aclanthology.org/2024.naacl-long.51

work page doi:10.18653/v1/2024.naacl-long.51 2024
[28]

Biomedi- cal text readability after hypernym substitution with fine-tuned large language models.PLOS Digital Health, 3(4):e0000489, April 2024

Karl Swanson, Shuhan He, Josh Calvano, David Chen, Talar Telvizian, Lawrence Jiang, Paul Chong, Jacob Schwell, Gin Mak, and Jarone Lee. Biomedi- cal text readability after hypernym substitution with fine-tuned large language models.PLOS Digital Health, 3(4):e0000489, April 2024. ISSN 2767-

work page 2024
[29]

URL https://dx.plos.org/10.1371/journal.pdig.0000489

doi:10.1371/journal.pdig.0000489. URL https://dx.plos.org/10.1371/journal.pdig.0000489

work page doi:10.1371/journal.pdig.0000489
[30]

Salahaldin Alamleh, Dorsa Mavedatnia, Gizelle Fran- cis, Trung Le, Joel Davies, Vincent Lin, and John J.W. Lee. Readability, Reliability, and Quality Analysis of Internet-Based Patient Education Materials and Large Language Models on Meniere’s Disease. Journal of Otolaryngology - Head & Neck Surgery, 54:19160216251360651, July 2025. ISSN 1916- 0216, 1916-...

work page doi:10.1177/19160216251360651 2025
[31]

Hanauer, Kai Zheng, and Danny T.Y

Tzu-Chun Wu, Hanniel Shih, Anunita Nattam, Himaja Chintalapalli, David A. Hanauer, Kai Zheng, and Danny T.Y . Wu. Readability As- sessment and Comparison of Large Language Model-Generated Summaries of Trial Descriptions on ClinicalTrials.gov. In Mowafa S. Househ, Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, and Elaine Huesing, editors,Stud- ies ...

work page doi:10.3233/shti250982 2025
[32]

Dorfner, Amin Dada, Felix Busch, Mar- cus R

Felix J. Dorfner, Amin Dada, Felix Busch, Mar- cus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, and Keno K. Bressem. Biomedi- cal Large Languages Models Seem not to be Supe- rior to Generalist Models on Unseen Medical Data, August 2024. URL http://arxiv.org/abs/2408.13833. arXiv:2408.13833

work page arXiv 2024
[33]

Life and death of colloidal bonds control the rate-dependent rheology of gels

Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vip- ina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, and Hua Xu. Benchmarking large language models for biomedical natural language processing ...

work page doi:10.1038/s41467- 2025
[34]

Jung, P.R

Kush Attal, Brian Ondov, and Dina Demner- Fushman. A dataset for plain language adaptation of biomedical abstracts.Scientific Data, 10(1):8, Jan- uary 2023. ISSN 2052-4463. doi:10.1038/s41597- 022-01920-3. URL https://www.nature.com/articles/ s41597-022-01920-3

work page doi:10.1038/s41597- 2023
[35]

Heidi Cramm, Janet Breimer, Lydia Lee, Julie Burch, Valerie Ashford, and Mike Schaub. Best practices for writing effective lay summaries.Journal of Mil- itary, Veteran and Family Health, 3(1):7–20, April 11 An Architectural Advantage of The Instruction-Tuned LLM in Containing The Readability-Accuracy Tension

work page
[36]

doi:10.3138/jmvfh.3.1.004

ISSN 2368-7924. doi:10.3138/jmvfh.3.1.004. URL https://utppublishing.com/doi/10.3138/jmvfh.3. 1.004

work page doi:10.3138/jmvfh.3.1.004
[37]

A Critical Look at Meta-evaluating Summarisa- tion Evaluation Metrics

Xiang Dai, Sarvnaz Karimi, and Biaoyan Fang. A Critical Look at Meta-evaluating Summarisa- tion Evaluation Metrics. InFindings of the As- sociation for Computational Linguistics: EMNLP 2024, pages 14795–14808, Miami, Florida, USA,

work page 2024
[38]

doi:10.18653/v1/2024.findings-emnlp.869

Association for Computational Linguistics. doi:10.18653/v1/2024.findings-emnlp.869. URL https://aclanthology.org/2024.findings-emnlp.869

work page doi:10.18653/v1/2024.findings-emnlp.869 2024
[39]

Evaluating the Demand for Integrative Medicine Practices in Breast and Gy- necological Cancer Patients.Breast Care, 14 (1):35–40, 2019

Nikolas Schuerger, Evelyn Klein, Alexander Hapfelmeier, Marion Kiechle, Christine Brambs, and Daniela Paepke. Evaluating the Demand for Integrative Medicine Practices in Breast and Gy- necological Cancer Patients.Breast Care, 14 (1):35–40, 2019. ISSN 1661-3791, 1661-3805. doi:10.1159/000492235. URL https://karger.com/ article/doi/10.1159/000492235

work page doi:10.1159/000492235 2019
[40]

Kessler, and Ras- mus Hoffmann

Miriam Trübner, Alexander Patzina, Judith Lehmann, Benno Brinkhaus, Christian S. Kessler, and Ras- mus Hoffmann. Health information-seeking behavior among users of traditional, complementary and in- tegrative medicine (TCIM).BMC Complementary Medicine and Therapies, 25(1):111, March 2025. ISSN 2662-7671. doi:10.1186/s12906-025-04843-9. URL https://doi.org...

work page doi:10.1186/s12906-025-04843-9 2025
[41]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe- view.net, 2020. URL https://openreview.net/forum? id=SkeHuCVFDr

work page 2020
[42]

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July

work page
[43]

URL https://aclanthology.org/W04-1013/

Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

work page
[44]

A Call for Clarity in Reporting BLEU Scores

Matt Post. A Call for Clarity in Reporting BLEU Scores. In Ond ˇrej Bojar, Rajen Chatterjee, Chris- tian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Au- rélie Névéol, Mariana Neves, Matt Post, Lucia Spe- cia, Marco Turchi, and Karin Verspoor, editors,Pro- ceedings of...

work page doi:10.18653/v1/w18-6319 2018
[45]

Optimizing Statistical Machine Translation for Text Sim- plification.Transactions of the Association for Computational Linguistics, 4:401–415, 2016

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing Statistical Machine Translation for Text Sim- plification.Transactions of the Association for Computational Linguistics, 4:401–415, 2016. doi:10.1162/tacl_a_00107. URL https://aclanthology. org/Q16-1029/

work page doi:10.1162/tacl_a_00107 2016
[46]

Harry Mc Laughlin

G. Harry Mc Laughlin. Smog grading-a new read- ability formula.Journal of Reading, 12(8):639–646,

work page
[47]

URL http://www.jstor.org/ stable/40011226

ISSN 00224103. URL http://www.jstor.org/ stable/40011226

work page arXiv
[48]

N.Y ., rev ed


[1]

Stacey Dawn, Hill Sophie, McCaffery Kirsten, Boland Laura, Lewis Krystina B., and Horvat Lidia. Shared Decision Making Interventions: Theoretical and Empirical Evidence with Implications for Health Literacy. In Studies in Health Technology and Informatics. IOS Press, ... doi:10.3233/978-1-61499-790-0-263. URL https://www.medra.org/servlet/aliasResolver?alias=iospressISBN&isbn=978-1-61499-789-4&spage=263&doi=10.3233/978-1-61499-790-0-263

[3]

A Schlacher. A guide for policy and decision makers on health literacy policies. European Journal of Public Health, 34(Supplement_3):ckae144.787, November 2024. ISSN 1101-1262, 1464-360X. doi:10.1093/eurpub/ckae144.787. URL https://academic.oup.com/eurpub/article/doi/10.1093/eurpub/ckae144.787/7844567

[4]

Vishala Mishra and Joseph P. Dexter. Comparison of Readability of Official Public Health Information About COVID-19 on Websites of International Agencies and the Governments of 15 Countries. JAMA Network Open, 3(8):e2018033, August 2020. ISSN 2574-3805. doi:10.1001/jamanetworkopen.2020.18033. URL https://jamanetwork.com/journals/jamanetworkopen/fullart...

[5]

Victor Suarez-Lledo and Javier Alvarez-Galvez. Prevalence of Health Misinformation on Social Media: Systematic Review. Journal of Medical Internet Research, 23(1):e17187, January 2021. ISSN 1438-... doi:10.2196/17187. URL http://www.jmir.org/2021/1/e17187/

[8]

Tomas Goldsack, Carolina Scarton, Matthew Shardlow, and Chenghua Lin. Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, and Junichi Tsujii, editors, Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages... doi:10.18653/v1/2024.bionlp-1.10

[9]

Monica Agrawal, Irene Y. Chen, Freya Gulamali, and Shalmali Joshi. The evaluation illusion of large language models in medicine. npj Digital Medicine, 8(1):1–4, October 2025. ISSN 2398-... doi:10.1038/s41746-025-01963-x. URL https://www.nature.com/articles/s41746-025-01963-x

[11]

Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, and Dina Demner-Fushman. Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track, 2025. URL https://arxiv.org/abs/2507.14096

[12]

Primoz Kocbek, Leon Kopitar, and Gregor Stiglic. Plain Language Adaptations of Biomedical Text Using LLMs: Comparision of Evaluation Metrics. In Mowafa S. Househ, Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, and Elaine Huesing, editors, Studies in Health Technology and Informatics. IOS Press, August 2025. ISBN 9781643686080. doi:10.3233/SHTI250946 ...

[14]

Mistral AI Team. Mistral Small 3 | Mistral AI. URL https://mistral.ai/news/mistral-small-3

[15]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu... Qwen2.5 Technical Report. arXiv, 2024.

[16]

Bilal Khan, Zohaib Ali Shah, Muhammad Usman, Inayat Khan, and Badam Niazi. Exploring the Landscape of Automatic Text Summarization: A Comprehensive Survey. IEEE Access, 11:109819–109840, 2023. ISSN 2169-3536. doi:10.1109/ACCESS.2023.3322188. URL https://ieeexplore.ieee.org/document/10272614/

[17]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July ... Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173/

[19]

Xuanxin Wu and Yuki Arase. An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment. ACM Transactions on Intelligent Systems and Technology, page 3744744, June 2025. ISSN 2157-6904, 2157-... doi:10.1145/3744744. URL https://dl.acm.org/doi/10.1145/3744744

[21]

Jiageng Wu, Xian Wu, Zhaopeng Qiu, Minghui Li, Shixu Lin, Yingying Zhang, Yefeng Zheng, Changzheng Yuan, and Jie Yang. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. Journal of the American ... doi:10.1093/jamia/ocae079

[22]

Huu Tan Mai, Cuong Xuan Chu, and Heiko Paulheim. Do LLMs Really Adapt to Domains? An Ontology Learning Perspective. In Gianluca Demartini, Katja Hose, Maribel Acosta, Matteo Palmonari, Gong Cheng, Hala Skaf-Molli, Nicolas Ferranti, Daniel Hernández, and Aidan Hogan, editors, The Semantic Web – ISWC 2024, volume 15231, pages 126–143. Springer Nature Switzer...

[23]

Hui Feng, Francesco Ronzano, Jude LaFleur, Matthew Garber, Rodrigo De Oliveira, Kathryn Rough, Katharine Roth, Jay Nanavati, Khaldoun Zine El Abidine, and Christina Mack. Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study, May 2024. URL http://medrxiv.org/lookup/doi/10.1101/2... doi:10.1101/2024.05.17.24307411

[24]

Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, and Jingcai Guo. In-Context Meta LoRA Generation. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 6138–6146, Jeju, South Korea, August 2024. Intern...

[25]

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, and Niloy Ganguly. MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization. volume 7, pages 6180–6188, August 2024. doi:10.24963/ijcai.2024/683. URL https://www.ijcai.org/proceedings/2024/683

[26]

Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng-Ann Heng, and Wai Lam. Unveiling the Generalization Power of Fine-Tuned Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 884–899, Mexico City, Mexico, ... Association for Computational Linguistics. doi:10.18653/v1/2024.naacl-long.51. URL https://aclanthology.org/2024.naacl-long.51

[28]

Karl Swanson, Shuhan He, Josh Calvano, David Chen, Talar Telvizian, Lawrence Jiang, Paul Chong, Jacob Schwell, Gin Mak, and Jarone Lee. Biomedical text readability after hypernym substitution with fine-tuned large language models. PLOS Digital Health, 3(4):e0000489, April 2024. ISSN 2767-... doi:10.1371/journal.pdig.0000489. URL https://dx.plos.org/10.1371/journal.pdig.0000489

[30]

Salahaldin Alamleh, Dorsa Mavedatnia, Gizelle Francis, Trung Le, Joel Davies, Vincent Lin, and John J.W. Lee. Readability, Reliability, and Quality Analysis of Internet-Based Patient Education Materials and Large Language Models on Meniere’s Disease. Journal of Otolaryngology - Head & Neck Surgery, 54:19160216251360651, July 2025. ISSN 1916-0216, 1916-... doi:10.1177/19160216251360651

[31]

Tzu-Chun Wu, Hanniel Shih, Anunita Nattam, Himaja Chintalapalli, David A. Hanauer, Kai Zheng, and Danny T.Y. Wu. Readability Assessment and Comparison of Large Language Model-Generated Summaries of Trial Descriptions on ClinicalTrials.gov. In Mowafa S. Househ, Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, and Elaine Huesing, editors, Studies ... doi:10.3233/shti250982

[32]

Felix J. Dorfner, Amin Dada, Felix Busch, Marcus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, and Keno K. Bressem. Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data, August 2024. URL http://arxiv.org/abs/2408.13833. arXiv:2408.13833

[33]

Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, and Hua Xu. Benchmarking large language models for biomedical natural language processing ...

[34]

Kush Attal, Brian Ondov, and Dina Demner-Fushman. A dataset for plain language adaptation of biomedical abstracts. Scientific Data, 10(1):8, January 2023. ISSN 2052-4463. doi:10.1038/s41597-022-01920-3. URL https://www.nature.com/articles/s41597-022-01920-3

[35]

Heidi Cramm, Janet Breimer, Lydia Lee, Julie Burch, Valerie Ashford, and Mike Schaub. Best practices for writing effective lay summaries. Journal of Military, Veteran and Family Health, 3(1):7–20, April ... ISSN 2368-7924. doi:10.3138/jmvfh.3.1.004. URL https://utppublishing.com/doi/10.3138/jmvfh.3.1.004

[37]

Xiang Dai, Sarvnaz Karimi, and Biaoyan Fang. A Critical Look at Meta-evaluating Summarisation Evaluation Metrics. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14795–14808, Miami, Florida, USA, ... Association for Computational Linguistics. doi:10.18653/v1/2024.findings-emnlp.869. URL https://aclanthology.org/2024.findings-emnlp.869

[39]

Nikolas Schuerger, Evelyn Klein, Alexander Hapfelmeier, Marion Kiechle, Christine Brambs, and Daniela Paepke. Evaluating the Demand for Integrative Medicine Practices in Breast and Gynecological Cancer Patients. Breast Care, 14(1):35–40, 2019. ISSN 1661-3791, 1661-3805. doi:10.1159/000492235. URL https://karger.com/article/doi/10.1159/000492235

[40]

Miriam Trübner, Alexander Patzina, Judith Lehmann, Benno Brinkhaus, Christian S. Kessler, and Rasmus Hoffmann. Health information-seeking behavior among users of traditional, complementary and integrative medicine (TCIM). BMC Complementary Medicine and Therapies, 25(1):111, March 2025. ISSN 2662-7671. doi:10.1186/s12906-025-04843-9. URL https://doi.org...

[41]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr

[42]

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July ... Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

[44]

Matt Post. A Call for Clarity in Reporting BLEU Scores. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor, editors, Proceedings of... doi:10.18653/v1/w18-6319

[45]

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016. doi:10.1162/tacl_a_00107. URL https://aclanthology.org/Q16-1029/

[46]

G. Harry McLaughlin. SMOG grading - a new readability formula. Journal of Reading, 12(8):639–646, ... ISSN 00224103. URL http://www.jstor.org/stable/40011226

[48]

Robert Gunning. The technique of clear writing. N.Y., rev. ed. ISBN 9787000014190. OCLC: 1260373335

[49]

E. A. Smith and R. J. Senter. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories (U.S.), pages 1–14, May 1967

[50]

J. Peter Kincaid, Richard Braby, and John E. Mears. Electronic authoring and delivery of technical information. Journal of Instructional Development, 11(2):8–13, June 1988. ISSN 0162-2641. doi:10.1007/BF02904998. URL http://link.springer.com/10.1007/BF02904998

[51]

George R. Klare, Paul P. Rowe, M. Gregory St. John, and Lawrence M. Stolurow. Automation of the Flesch Reading Ease Readability Formula, with Various Options. Reading Research Quarterly, 4(4):550, 1969. ISSN 00340553. doi:10.2307/747070. URL https://www.jstor.org/stable/747070?origin=crossref

[52]

huggingface/evaluate, November 2025. URL https://github.com/huggingface/evaluate. original-date: 2022-03-30T15:08:26Z

[53]

textstat/textstat, November 2025. URL https://github.com/textstat/textstat. original-date: 2014-06-18T10:54:08Z.

A Methodological details

A.1 Metric Properties

Table A1: The suite of metrics in the evaluation. A. Foundational/supplementary metrics Co...

**Your transformation operations work at a sentence level**. 8- You must operate at a sentence level. 9- For instance, a title text is already a sentence, while an abstract or a paragraph of text is not. A paragraph of text has multiple sentences, so you **MUST split paragraphs into a list of sentences fi...

**For each sentence, consider the following possible transformations** that might realise simpler sentences, that are easy to read and understand for a layman. 15- You may split a sentence into 2 or more sentences as part of the simplification transform. For instance, in the case of long complex sente...

... "Aim:" can be added to an objective sentence, while ...