
arxiv: 2511.05080 · v4 · submitted 2025-11-07 · 💻 cs.CL

Making Knowledge Accessible: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification

Pith reviewed 2026-05-18 00:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords biomedical text simplification · large language models · readability metrics · BERTScore · discourse fidelity · Mistral · QWen

The pith

Mistral improves the readability of biomedical texts while preserving discourse fidelity at a level statistically comparable to that of humans; QWen also improves readability but does not balance it as well against accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how two large language models simplify biomedical text to meet public demand for accessible information. It finds that Mistral applies a careful, tempered approach to changing words that boosts multiple readability scores yet keeps overall meaning close to the original. QWen also raises readability but does not align those gains as tightly with preserved accuracy. The authors further show that many of the 21 metrics they tracked overlap strongly, pointing to simpler ways to judge future simplification work. A reader would care because reliable, meaning-preserving simplification could let non-experts understand medical content without distortion.

Core claim

Mistral exhibits a tempered lexical simplification approach that consistently enhances readability across multiple metrics while preserving discourse fidelity (BERTScore: 0.91, statistically comparable to that of humans). In comparison, QWen also attains enhanced readability performance and a reasonable BERTScore of 0.89, but presents a disconnect in balancing between readability and accuracy. Additionally, a comprehensive correlation analysis of a suite of 21 metrics confirms strong functional redundancies in metrics and informs adaptation requirements.
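To make "readability across multiple metrics" concrete, here is a minimal sketch of one classic readability formula, Flesch Reading Ease. It is an assumption that this formula resembles the metrics the paper tracked; this summary does not list them. The syllable counter is a rough vowel-group heuristic chosen for illustration, and both example sentences are invented, not taken from the paper's test set.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Invented example pair: a technical sentence and a lay rewrite.
original = "Myocardial infarction necessitates immediate pharmacological intervention."
simplified = "A heart attack needs fast treatment with medicine."

print(flesch_reading_ease(original))
print(flesch_reading_ease(simplified))
```

A score like this rewards shorter words and sentences only; it says nothing about whether the rewrite kept the meaning, which is why the paper pairs readability scores with BERTScore.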

What carries the argument

The distinct operational strategies each model uses to trade off lexical simplification against discourse preservation, tracked through readability metrics and BERTScore against human baselines.

If this is right

  • Models like Mistral can be selected for public-facing biomedical applications where both access and accuracy matter.
  • Fewer than 21 metrics may suffice for judging simplification quality because many are functionally redundant.
  • Instruction-tuned models may favor fidelity-preserving strategies while reasoning-augmented ones prioritize other gains.
  • Adaptation of these models for biomedical use should target the specific readability-accuracy balance observed.
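
The point about redundant metrics can be illustrated with a small sketch: compute pairwise Pearson correlations between per-text metric scores and flag pairs above a cutoff. The metric names, the scores, and the 0.9 threshold are all hypothetical; the paper's actual procedure and values are not given in this summary.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def redundant_pairs(scores, threshold=0.9):
    """Return (metric_a, metric_b, r) for every pair with |r| >= threshold."""
    names = sorted(scores)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(scores[a], scores[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical per-text scores for three metrics over five texts.
toy_scores = {
    "grade_level_a": [12.1, 10.4, 9.8, 14.2, 11.0],
    "grade_level_b": [13.0, 11.1, 10.5, 15.3, 11.9],
    "semantic_similarity": [0.91, 0.88, 0.92, 0.90, 0.89],
}

for a, b, r in redundant_pairs(toy_scores):
    print(f"{a} and {b} are strongly correlated (r = {r:.3f})")
```

In this toy data the two grade-level metrics move together and would be candidates for merging, while the similarity metric carries separate information.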

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The strategy difference may trace to Mistral being instruction-tuned versus QWen being reasoning-augmented, suggesting tuning type shapes simplification behavior.
  • Evaluating the same models on authentic patient education materials could show whether the reported trade-offs hold outside the test set.
  • The metric redundancies open a path to build lighter, more targeted evaluation suites for future text-simplification studies.

Load-bearing premise

The chosen readability metrics and BERTScore together capture the intended trade-off without missing important aspects of biomedical accuracy, and the test texts are representative of typical real-world biomedical content.

What would settle it

A set of expert human ratings on critical medical facts retained or lost in the simplified outputs, or performance measured on a new collection of real patient-facing biomedical queries.

Figures

Figures reproduced from arXiv: 2511.05080 by Aikaterini Melliou, Lian Zhang, P. Bilha Githinji, Peiwu Qin, Zeming Liang.

Figure 1: Average performance and underlying data distributions.
Figure 2: Hypothesis test results. For the µLLM vs. µhuman rows, mean values are shown, with darker shading where the p-value > 0.05. For the µ1 − µ2 rows, the difference between means is shown, with darker shading where the p-value > 0.05.
Figure 3: Correlations between metrics.
Figure 4: LLMs' self-reported rationale for changes made.
Original abstract

The growing public demand for accessible biomedical information calls for scalable text simplification. While large language models (LLMs) offer solutions, they too struggle to balance improved readability against preservation of meaning. This report empirically compares how two LLMs (the instruction-tuned Mistral-Small 3 24B and the reasoning-augmented QWen2.5 32B) navigate this trade-off in biomedical text simplification, benchmarked against human performance. Our analysis highlights how each model applies distinct operational strategies when simplifying biomedical text. Mistral exhibits a tempered lexical simplification approach that consistently enhances readability across multiple metrics while preserving discourse fidelity (BERTScore: 0.91, statistically comparable to that of humans). In comparison, QWen also attains enhanced readability performance and a reasonable BERTScore of 0.89, but presents a disconnect in balancing between readability and accuracy. Additionally, a comprehensive correlation analysis of a suite of 21 metrics confirms strong functional redundancies in metrics and informs adaptation requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically compares instruction-tuned Mistral-Small 3 24B and reasoning-augmented QWen2.5 32B on biomedical text simplification, benchmarked against human performance. It claims Mistral applies a tempered lexical simplification strategy that improves readability across multiple metrics while preserving discourse fidelity (BERTScore 0.91, statistically comparable to humans), whereas QWen attains readability gains but shows a disconnect in balancing readability and accuracy (BERTScore 0.89). A correlation analysis across 21 metrics is used to identify functional redundancies.

Significance. If the empirical distinctions hold under more rigorous validation, the work is significant for documenting divergent LLM strategies in a high-stakes domain, with potential to inform model selection or prompting for accessible biomedical communication. The 21-metric correlation analysis is a clear strength, as it directly addresses metric redundancy and could support more efficient evaluation protocols in future simplification research.

major comments (3)
  1. Abstract and Results: The central claim that Mistral preserves discourse fidelity (BERTScore 0.91, statistically comparable to humans) while QWen exhibits a readability-accuracy disconnect (BERTScore 0.89) rests on BERTScore as a proxy, yet the manuscript provides no domain-specific factuality validation, expert error annotation for omitted facts or altered causal relations, or comparison against biomedical reference standards; this is load-bearing because BERTScore captures contextual embedding overlap rather than factual accuracy in technical text.
  2. Methods/Experimental Setup: Dataset details (source texts, size, selection criteria for biomedical content), exact prompting templates, and the statistical tests or error bars supporting the 'statistically comparable' claim are absent; these omissions prevent assessment of whether the reported metric differences reflect genuine strategy distinctions or surface-level lexical changes.
  3. Results (correlation analysis): While the 21-metric analysis is a positive contribution, the manuscript does not report how the subset of metrics was chosen post-hoc or whether the observed redundancies affect the interpretation of the readability-accuracy trade-off for each model.
minor comments (2)
  1. Abstract: The phrase 'a suite of 21 metrics' should be accompanied by at least a high-level categorization (e.g., lexical, syntactic, semantic) to orient readers before the full correlation results.
  2. Throughout: Define all abbreviations (e.g., BERTScore) on first use and ensure figure captions explicitly state what each panel compares (model vs. human vs. original).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the work.

  1. Referee: Abstract and Results: The central claim that Mistral preserves discourse fidelity (BERTScore 0.91, statistically comparable to humans) while QWen exhibits a readability-accuracy disconnect (BERTScore 0.89) rests on BERTScore as a proxy, yet the manuscript provides no domain-specific factuality validation, expert error annotation for omitted facts or altered causal relations, or comparison against biomedical reference standards; this is load-bearing because BERTScore captures contextual embedding overlap rather than factual accuracy in technical text.

    Authors: We agree that BERTScore functions as a semantic similarity proxy rather than a direct factuality measure and does not detect omitted facts or altered causal relations. Our central contribution is the documentation of divergent operational strategies through a multi-metric profile, where BERTScore is used alongside readability metrics to benchmark against human performance. In the revision we will add an explicit limitations subsection acknowledging this proxy limitation, include qualitative examples illustrating content preservation differences, and note that full expert factuality annotation lies outside the current scope. We maintain that the observed metric distinctions still offer useful guidance for model selection in biomedical simplification. revision: partial

  2. Referee: Methods/Experimental Setup: Dataset details (source texts, size, selection criteria for biomedical content), exact prompting templates, and the statistical tests or error bars supporting the 'statistically comparable' claim are absent; these omissions prevent assessment of whether the reported metric differences reflect genuine strategy distinctions or surface-level lexical changes.

    Authors: We will revise the Methods section to explicitly state the source corpus (PubMed abstracts), the exact number of texts, and the selection criteria for biomedical relevance. Exact prompting templates will be moved from supplementary material into the main text. The statistical comparability claim is based on a two-sample t-test; we will report the precise test, p-value, and add error bars to all relevant figures and tables. revision: yes

  3. Referee: Results (correlation analysis): While the 21-metric analysis is a positive contribution, the manuscript does not report how the subset of metrics was chosen post-hoc or whether the observed redundancies affect the interpretation of the readability-accuracy trade-off for each model.

    Authors: The 21 metrics were pre-selected from the standard set used in prior simplification literature to span readability, semantic fidelity, and lexical dimensions. We will add a dedicated paragraph in the Results section describing this selection rationale and explicitly discuss how the observed redundancies (e.g., between multiple readability formulas) should qualify interpretation of the readability-accuracy trade-off, ensuring readers do not over-weight correlated measures. revision: yes
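
The two-sample t-test named in response 2 can be sketched as follows. This is an assumption-laden illustration: it uses Welch's unequal-variance form, which the rebuttal does not specify, and the score lists are invented. It returns only the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs the t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    se2 = se2_a + se2_b
    if se2 == 0:
        raise ValueError("both samples are constant; the statistic is undefined")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Invented per-text BERTScore values for a model and for human rewrites.
model_scores = [0.91, 0.90, 0.92, 0.91, 0.90, 0.92]
human_scores = [0.92, 0.90, 0.91, 0.93, 0.90, 0.91]

t_stat, dof = welch_t(model_scores, human_scores)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A p-value above 0.05 from such a test means the difference in means was not detected, which is weaker than showing the two are equivalent; that distinction matters when reading "statistically comparable".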

Circularity Check

0 steps flagged

No circularity: empirical metrics and human baselines are independent


The paper reports direct empirical measurements of LLM-generated simplifications against human references using off-the-shelf metrics (BERTScore, readability scores) and a correlation analysis of 21 metrics. No equations, fitted parameters, or self-citations are used to derive the central claims; the reported differences (e.g., Mistral BERTScore 0.91 vs. QWen 0.89) are computed from model outputs on the test set and compared to external human baselines. The analysis is therefore self-contained against standard benchmarks and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical comparison study that relies on off-the-shelf LLMs and standard NLP metrics; no new mathematical axioms, free parameters fitted to the target result, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5490 in / 1209 out tokens · 48349 ms · 2026-05-18T00:24:29.769957+00:00 · methodology



    ** For each sentence , consider the f ol lo wi ng possible t r a n s f o r m a t i o n s ** that might realise simpler sentences , that are easy to read and u n d e r s t a n d for a layman . 15- You may split a sentence into 2 or more se nt enc es as part of the s i m p l i f i c a t i o n tr ans fo rm . For instance , in the case of long complex s en te...