pith. machine review for the scientific record.

arxiv: 2604.06505 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords biomedical conclusion generation · structured abstracts · PubMed dataset · LLM benchmarking · evidence-to-conclusion reasoning · scientific reasoning · NLP benchmark · biomedical AI

The pith

MedConclusion offers 5.7 million PubMed structured abstracts for benchmarking how LLMs generate biomedical conclusions from structured evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedConclusion, a dataset of 5.7M structured abstracts from PubMed in which the non-conclusion sections of each abstract are paired with the original author-written conclusion. This creates a large resource for training and testing models on inferring scientific conclusions from biomedical evidence. Journal-level metadata on biomedical categories and SJR scores enables subgroup analysis within specific domains. Evaluations of diverse LLMs indicate that conclusion generation is behaviorally distinct from summary writing, that top models score similarly under current automatic metrics, and that the choice of judge model can substantially shift absolute scores.

Core claim

MedConclusion is a large-scale dataset of 5.7M PubMed structured abstracts in which each instance consists of the non-conclusion sections paired with the author-written conclusion, supplemented with journal-level metadata such as biomedical category and SJR, providing naturally occurring supervision for evidence-to-conclusion reasoning.
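To make the pairing concrete, here is a minimal sketch of what a single instance could look like in code. The field names and types are editorial assumptions, not the released schema (which ships with the dataset on the project's GitHub); the PMID and title come from an appendix example in the paper, and the remaining values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class MedConclusionInstance:
    """Hypothetical shape of one MedConclusion instance (field names are assumptions)."""
    pmid: str
    title: str
    journal: str
    sjr: float                 # journal-level SJR score
    categories: list[str]      # biomedical subject categories from the journal metadata
    sections: dict[str, str]   # non-conclusion sections, e.g. BACKGROUND, METHODS, RESULTS
    conclusion: str            # author-written conclusion: the supervision target


# Illustrative instance; section and conclusion text elided.
example = MedConclusionInstance(
    pmid="21401313",
    title="Inositol phosphoglycan P-type in infants of preeclamptic mothers.",
    journal="The Journal of Maternal-Fetal & Neonatal Medicine",
    sjr=0.0,                   # placeholder value
    categories=["..."],
    sections={"OBJECTIVE": "...", "METHODS": "...", "RESULTS": "..."},
    conclusion="...",
)
```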

What carries the argument

The MedConclusion dataset, which pairs non-conclusion abstract sections with author conclusions from PubMed and includes metadata for domain subgroup analysis.

If this is right

  • Generating conclusions is behaviorally distinct from generating summaries for LLMs (see the prompt sketch after this list).
  • Strong LLMs remain closely clustered when scored with current automatic metrics.
  • LLM-as-a-judge evaluations can substantially shift absolute scores depending on the judge model.
  • Subgroup analysis across biomedical domains and journal impact is enabled by the included metadata.
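The conclusion-versus-summary contrast is a prompting difference, but this page does not reproduce the paper's templates. The sketch below shows one plausible way to frame the two settings over the same evidence; the wording is an editorial assumption, not the authors' actual prompts.

```python
def build_prompt(sections: dict[str, str], mode: str = "conclusion") -> str:
    """Assemble a prompt from the non-conclusion sections of a structured abstract.

    mode="conclusion" asks the model to infer the CONCLUSIONS section;
    mode="summary" asks for a plain summary of the same evidence.
    Both templates are illustrative guesses, not the paper's templates.
    """
    evidence = "\n".join(f"{name}: {text}" for name, text in sections.items())
    if mode == "conclusion":
        instruction = (
            "Write the CONCLUSIONS section of this biomedical abstract, "
            "stating only what the reported evidence supports."
        )
    elif mode == "summary":
        instruction = "Write a concise summary of this biomedical abstract."
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{instruction}\n\n{evidence}"


# Usage: the same evidence under the two prompting regimes.
sections = {"BACKGROUND": "...", "METHODS": "...", "RESULTS": "..."}
conclusion_prompt = build_prompt(sections, mode="conclusion")
summary_prompt = build_prompt(sections, mode="summary")
```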

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models could be fine-tuned on this data to improve their ability to synthesize biomedical evidence into conclusions.
  • The clustering of strong models suggests that new, more discriminative metrics may be required for this task.
  • Variability in judge scores points to the need for standardized evaluation protocols in scientific reasoning benchmarks (a minimal judging harness is sketched after this list).
  • Such datasets might eventually help in automating parts of scientific literature synthesis.
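On the judge-variability point above, a minimal harness could average scores from several judge models over the same generations and compare the per-judge means. `call_judge`, the judge names, and the 1-10 scale are placeholders; this is not the paper's evaluation code.

```python
import statistics

JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical judge identities


def call_judge(judge: str, evidence: str, generated: str, reference: str) -> float:
    """Placeholder: prompt `judge` to grade `generated` against the evidence and the
    author-written reference on a 1-10 scale, then parse the number from its reply.
    Swap in a real LLM client here; no specific API is assumed."""
    raise NotImplementedError


def per_judge_means(examples: list[dict]) -> dict[str, float]:
    """Mean score per judge over the same generations, exposing judge-identity shifts."""
    means = {}
    for judge in JUDGES:
        scores = [
            call_judge(judge, ex["evidence"], ex["generated"], ex["reference"])
            for ex in examples
        ]
        means[judge] = statistics.mean(scores)
    return means

# A wide spread between the highest and lowest per-judge mean would reproduce the
# observation that judge identity substantially shifts absolute scores.
```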

Load-bearing premise

Author-written conclusions in PubMed abstracts are appropriate and unbiased targets for model training and evaluation.

What would settle it

A study that finds systematic logical gaps, unsupported claims, or external information in a significant portion of the author conclusions relative to their evidence sections would undermine the benchmark.
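Such an audit could be sketched as follows, under assumptions: split each author conclusion into sentences and test whether each is entailed by the non-conclusion sections. `entailment_score` stands in for any NLI model or LLM verifier, and the 0.5 threshold and regex sentence splitter are illustrative choices, not a validated protocol.

```python
import re


def entailment_score(evidence: str, claim: str) -> float:
    """Placeholder: probability that `claim` is entailed by `evidence`.
    Any NLI cross-encoder or LLM-based verifier could sit behind this."""
    raise NotImplementedError


def audit_instance(sections: dict[str, str], conclusion: str, threshold: float = 0.5) -> dict:
    """Flag conclusion sentences that the evidence sections do not appear to support."""
    evidence = " ".join(sections.values())
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", conclusion) if s.strip()]
    unsupported = [s for s in sentences if entailment_score(evidence, s) < threshold]
    return {
        "n_sentences": len(sentences),
        "n_unsupported": len(unsupported),
        "unsupported": unsupported,
    }

# Aggregating n_unsupported / n_sentences over a random sample of MedConclusion would
# give the kind of quantitative fidelity estimate the load-bearing premise calls for.
```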

Figures

Figures reproduced from arXiv: 2604.06505 by Jiahui Cai, Mengyu Wang, Ruizhi Qian, Weiyue Li, Yan Luo, Yi Li, Yongce Li, Yunfan Long.

Figure 1: Overview of MedConclusion and the evaluation pipeline.
Figure 2: Scatter plots of journal-level SJR scores versus evaluation metrics.
Figure 3: Radar chart comparison of the top 5 and bottom 5 biomedical categories.
Figure 4: Publication-year density of abstracts in MedConclusion from 2000–2025.
Figure 5: Distribution statistics of MedConclusion. (a) Top 10 subject categories by proportion.
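Figures 2, 3, and 5 slice results by journal impact and biomedical category. A minimal pandas sketch of that kind of subgroup view follows; the column names and toy values are assumptions about a per-instance results table, not the paper's analysis code.

```python
import pandas as pd

# Hypothetical per-instance results table: one row per generated conclusion.
df = pd.DataFrame({
    "category": ["Oncology", "Cardiology", "Oncology", "Nursing"],
    "sjr": [2.1, 0.8, 3.4, 0.5],
    "metric": [0.71, 0.64, 0.75, 0.60],  # any automatic score for the generated conclusion
})

# Mean score per biomedical category (the radar-chart style view in Figure 3).
by_category = df.groupby("category")["metric"].mean().sort_values(ascending=False)

# Score versus journal impact, binned into SJR quartiles (the scatter view in Figure 2).
df["sjr_quartile"] = pd.qcut(df["sjr"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
by_impact = df.groupby("sjr_quartile", observed=True)["metric"].mean()

print(by_category)
print(by_impact)
```
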
read the original abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedConclusion, a dataset of 5.7M PubMed structured abstracts that pairs the non-conclusion sections of each abstract with the original author-written conclusion as naturally occurring supervision. It includes journal-level metadata (biomedical category, SJR) for subgroup analysis. As an initial study, the authors benchmark diverse LLMs under conclusion and summary prompting, scoring outputs with reference-based metrics and LLM-as-a-judge, and report that conclusion generation is behaviorally distinct from summarization, that strong models cluster under automatic metrics, and that judge identity affects absolute scores. The dataset and code are released publicly.

Significance. If the author-written conclusions prove to be faithful, evidence-derived targets, MedConclusion would constitute a large-scale, reusable resource for studying evidence-to-conclusion reasoning in biomedicine, supporting both training and evaluation of LLMs for scientific inference tasks. The open release of the full dataset and code, together with metadata enabling domain-specific analyses, is a clear strength that facilitates follow-on work. The reported empirical findings on prompting differences and judge sensitivity are useful observations, though their interpretive weight depends on the soundness of the underlying targets.

major comments (2)
  1. [Dataset construction] Dataset construction section: The manuscript provides insufficient detail on filtering criteria applied to the 5.7M PubMed abstracts, train/validation/test splits, deduplication, or any bias-handling procedures. These omissions directly affect the interpretability of the LLM benchmarking results and the claim that the resource isolates evidence-to-conclusion reasoning.
  2. [Introduction] Introduction and abstract: The central claim that MedConclusion supports 'scientific evidence-to-conclusion reasoning' rests on the assumption that author-written conclusions are appropriate, unbiased targets derivable from the supplied non-conclusion sections. Biomedical abstracts frequently contain extrapolations, clinical implications, or publication-driven claims not strictly supported by the results; without a quantitative audit (e.g., percentage of conclusions containing unsupported statements) or fidelity analysis, this assumption remains unverified and load-bearing for the reusable-resource contribution.
minor comments (2)
  1. [Experiments] The distinction between 'conclusion prompting' and 'summary prompting' settings is mentioned but not illustrated with example prompts or templates; providing these would improve replicability of the reported behavioral differences.
  2. [Results] The abstract states that 'strong models remain closely clustered under current automatic metrics'; a brief discussion of why this clustering occurs (e.g., metric saturation) would help readers interpret the practical significance of the finding.
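For context on the clustering observation, here is a minimal reference-based scoring sketch using the bert-score package; BERTScore is one metric such benchmarks commonly report, though the paper's exact metric suite is not reproduced on this page, and the strings below are toy examples.

```python
from bert_score import score  # pip install bert-score

# Toy generated conclusions from two hypothetical models against the same reference.
references = ["Walnut intake modestly improved the measured outcome in this trial."]
model_a = ["Walnut intake modestly improved the outcome in this trial."]
model_b = ["In this trial, walnut consumption was associated with a modest improvement."]

# BERTScore F1 against the author-written conclusion; near-identical scores across
# strong models would illustrate the saturation / clustering concern.
_, _, f1_a = score(model_a, references, lang="en")
_, _, f1_b = score(model_b, references, lang="en")
print(f"model A F1: {f1_a.mean().item():.4f}  model B F1: {f1_b.mean().item():.4f}")
```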

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our work. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: The manuscript provides insufficient detail on filtering criteria applied to the 5.7M PubMed abstracts, train/validation/test splits, deduplication, or any bias-handling procedures. These omissions directly affect the interpretability of the LLM benchmarking results and the claim that the resource isolates evidence-to-conclusion reasoning.

    Authors: We agree that expanded details on dataset construction will improve interpretability. In the revised manuscript, we will augment the Dataset Construction section with explicit descriptions of the filtering criteria applied to select the 5.7M structured PubMed abstracts, the rationale and ratios for the train/validation/test splits, deduplication methods employed, and any bias-mitigation steps such as ensuring representation across biomedical categories and journal impact levels. These additions will directly support the benchmarking claims. revision: yes

  2. Referee: [Introduction] Introduction and abstract: The central claim that MedConclusion supports 'scientific evidence-to-conclusion reasoning' rests on the assumption that author-written conclusions are appropriate, unbiased targets derivable from the supplied non-conclusion sections. Biomedical abstracts frequently contain extrapolations, clinical implications, or publication-driven claims not strictly supported by the results; without a quantitative audit (e.g., percentage of conclusions containing unsupported statements) or fidelity analysis, this assumption remains unverified and load-bearing for the reusable-resource contribution.

    Authors: We acknowledge that author-written conclusions may include extrapolations or implications not strictly entailed by the results sections, as is common in biomedical literature. The dataset is designed around these naturally occurring author targets to study conclusion generation in practice. While the original manuscript did not include a quantitative fidelity audit, we will revise the Introduction to qualify the central claim and add a dedicated Limitations section that discusses this assumption, provides illustrative examples, and notes how the paired data can enable future fidelity analyses. This preserves the resource's utility for evidence-to-conclusion tasks while addressing the concern. revision: partial

Circularity Check

0 steps flagged

No circularity: dataset curation and empirical benchmarking are self-contained.

full rationale

The paper constructs MedConclusion by extracting non-conclusion sections from PubMed structured abstracts and pairing them with the original author-written conclusions as supervision targets. It then reports standard reference-based metrics and LLM-as-a-judge scores on diverse models under two prompting regimes. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear; the central claim that the resource enables study of evidence-to-conclusion reasoning rests directly on the curation process and external evaluation protocols rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that PubMed structured abstracts provide sufficient evidence and that author conclusions are valid supervision targets; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: PubMed structured abstracts contain sufficient evidence in non-conclusion sections to support conclusion generation
    The task definition and dataset construction rely on this premise for all instances.

pith-pipeline@v0.9.0 · 5511 in / 1241 out tokens · 89674 ms · 2026-05-10T18:33:06.241009+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 18 canonical work pages · 7 internal anchors
