pith. machine review for the scientific record.

arxiv: 2604.06505 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords biomedical conclusion generation · structured abstracts · PubMed dataset · LLM benchmarking · evidence-to-conclusion reasoning · scientific reasoning · NLP benchmark · biomedical AI

The pith

MedConclusion offers 5.7 million PubMed structured abstracts for benchmarking how LLMs generate biomedical conclusions from structured evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedConclusion, a dataset of 5.7M structured abstracts from PubMed in which the non-conclusion sections of each abstract are paired with the original author-written conclusion. This creates a large resource for training and testing models on inferring scientific conclusions from biomedical evidence. Journal-level metadata on biomedical categories and SJR scores enables subgroup analysis within specific domains. Evaluations of diverse LLMs indicate that conclusion generation is behaviorally distinct from summary writing, that top models score similarly under current automatic metrics, and that the choice of judge model can substantially shift absolute scores.

Core claim

MedConclusion is a large-scale dataset of 5.7M PubMed structured abstracts in which each instance consists of the non-conclusion sections paired with the author-written conclusion, supplemented with journal-level metadata such as biomedical category and SJR, providing naturally occurring supervision for evidence-to-conclusion reasoning.
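To make the pairing concrete, here is a minimal sketch of what a single instance could look like in code. The field names and types are editorial assumptions, not the released schema (which ships with the dataset on the project's GitHub); the PMID and title come from an appendix example in the paper, and the remaining values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class MedConclusionInstance:
    """Hypothetical shape of one MedConclusion instance (field names are assumptions)."""
    pmid: str
    title: str
    journal: str
    sjr: float                 # journal-level SJR score
    categories: list[str]      # biomedical subject categories from the journal metadata
    sections: dict[str, str]   # non-conclusion sections, e.g. BACKGROUND, METHODS, RESULTS
    conclusion: str            # author-written conclusion: the supervision target


# Illustrative instance; section and conclusion text elided.
example = MedConclusionInstance(
    pmid="21401313",
    title="Inositol phosphoglycan P-type in infants of preeclamptic mothers.",
    journal="The Journal of Maternal-Fetal & Neonatal Medicine",
    sjr=0.0,                   # placeholder value
    categories=["..."],
    sections={"OBJECTIVE": "...", "METHODS": "...", "RESULTS": "..."},
    conclusion="...",
)
```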

What carries the argument

The MedConclusion dataset, which pairs non-conclusion abstract sections with author conclusions from PubMed and includes metadata for domain subgroup analysis.

If this is right

  • Generating conclusions is behaviorally distinct from generating summaries for LLMs (see the prompt sketch after this list).
  • Strong LLMs remain closely clustered when scored with current automatic metrics.
  • LLM-as-a-judge evaluations can substantially shift absolute scores depending on the judge model.
  • Subgroup analysis across biomedical domains and journal impact is enabled by the included metadata.
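The conclusion-versus-summary contrast is a prompting difference, but this page does not reproduce the paper's templates. The sketch below shows one plausible way to frame the two settings over the same evidence; the wording is an editorial assumption, not the authors' actual prompts.

```python
def build_prompt(sections: dict[str, str], mode: str = "conclusion") -> str:
    """Assemble a prompt from the non-conclusion sections of a structured abstract.

    mode="conclusion" asks the model to infer the CONCLUSIONS section;
    mode="summary" asks for a plain summary of the same evidence.
    Both templates are illustrative guesses, not the paper's templates.
    """
    evidence = "\n".join(f"{name}: {text}" for name, text in sections.items())
    if mode == "conclusion":
        instruction = (
            "Write the CONCLUSIONS section of this biomedical abstract, "
            "stating only what the reported evidence supports."
        )
    elif mode == "summary":
        instruction = "Write a concise summary of this biomedical abstract."
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{instruction}\n\n{evidence}"


# Usage: the same evidence under the two prompting regimes.
sections = {"BACKGROUND": "...", "METHODS": "...", "RESULTS": "..."}
conclusion_prompt = build_prompt(sections, mode="conclusion")
summary_prompt = build_prompt(sections, mode="summary")
```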

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models could be fine-tuned on this data to improve their ability to synthesize biomedical evidence into conclusions.
  • The clustering of strong models suggests that new, more discriminative metrics may be required for this task.
  • Variability in judge scores points to the need for standardized evaluation protocols in scientific reasoning benchmarks (a minimal judging harness is sketched after this list).
  • Such datasets might eventually help in automating parts of scientific literature synthesis.
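On the judge-variability point above, a minimal harness could average scores from several judge models over the same generations and compare the per-judge means. `call_judge`, the judge names, and the 1-10 scale are placeholders; this is not the paper's evaluation code.

```python
import statistics

JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical judge identities


def call_judge(judge: str, evidence: str, generated: str, reference: str) -> float:
    """Placeholder: prompt `judge` to grade `generated` against the evidence and the
    author-written reference on a 1-10 scale, then parse the number from its reply.
    Swap in a real LLM client here; no specific API is assumed."""
    raise NotImplementedError


def per_judge_means(examples: list[dict]) -> dict[str, float]:
    """Mean score per judge over the same generations, exposing judge-identity shifts."""
    means = {}
    for judge in JUDGES:
        scores = [
            call_judge(judge, ex["evidence"], ex["generated"], ex["reference"])
            for ex in examples
        ]
        means[judge] = statistics.mean(scores)
    return means

# A wide spread between the highest and lowest per-judge mean would reproduce the
# observation that judge identity substantially shifts absolute scores.
```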

Load-bearing premise

Author-written conclusions in PubMed abstracts are appropriate and unbiased targets for model training and evaluation.

What would settle it

A study that finds systematic logical gaps, unsupported claims, or external information in a significant portion of the author conclusions relative to their evidence sections would undermine the benchmark.
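Such an audit could be sketched as follows, under assumptions: split each author conclusion into sentences and test whether each is entailed by the non-conclusion sections. `entailment_score` stands in for any NLI model or LLM verifier, and the 0.5 threshold and regex sentence splitter are illustrative choices, not a validated protocol.

```python
import re


def entailment_score(evidence: str, claim: str) -> float:
    """Placeholder: probability that `claim` is entailed by `evidence`.
    Any NLI cross-encoder or LLM-based verifier could sit behind this."""
    raise NotImplementedError


def audit_instance(sections: dict[str, str], conclusion: str, threshold: float = 0.5) -> dict:
    """Flag conclusion sentences that the evidence sections do not appear to support."""
    evidence = " ".join(sections.values())
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", conclusion) if s.strip()]
    unsupported = [s for s in sentences if entailment_score(evidence, s) < threshold]
    return {
        "n_sentences": len(sentences),
        "n_unsupported": len(unsupported),
        "unsupported": unsupported,
    }

# Aggregating n_unsupported / n_sentences over a random sample of MedConclusion would
# give the kind of quantitative fidelity estimate the load-bearing premise calls for.
```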

Figures

Figures reproduced from arXiv: 2604.06505 by Jiahui Cai, Mengyu Wang, Ruizhi Qian, Weiyue Li, Yan Luo, Yi Li, Yongce Li, Yunfan Long.

Figure 1: Overview of MedConclusion and the evaluation pipeline.
Figure 2: Scatter plots of journal-level SJR scores versus evaluation metrics.
Figure 3: Radar chart comparison of the top 5 and bottom 5 biomedical categories.
Figure 4: Publication-year density of abstracts in MedConclusion from 2000–2025.
Figure 5: Distribution statistics of MedConclusion. (a) Top 10 subject categories by proportion.
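Figures 2, 3, and 5 slice results by journal impact and biomedical category. A minimal pandas sketch of that kind of subgroup view follows; the column names and toy values are assumptions about a per-instance results table, not the paper's analysis code.

```python
import pandas as pd

# Hypothetical per-instance results table: one row per generated conclusion.
df = pd.DataFrame({
    "category": ["Oncology", "Cardiology", "Oncology", "Nursing"],
    "sjr": [2.1, 0.8, 3.4, 0.5],
    "metric": [0.71, 0.64, 0.75, 0.60],  # any automatic score for the generated conclusion
})

# Mean score per biomedical category (the radar-chart style view in Figure 3).
by_category = df.groupby("category")["metric"].mean().sort_values(ascending=False)

# Score versus journal impact, binned into SJR quartiles (the scatter view in Figure 2).
df["sjr_quartile"] = pd.qcut(df["sjr"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
by_impact = df.groupby("sjr_quartile", observed=True)["metric"].mean()

print(by_category)
print(by_impact)
```
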
read the original abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedConclusion, a dataset of 5.7M PubMed structured abstracts that pairs the non-conclusion sections of each abstract with the original author-written conclusion as naturally occurring supervision. It includes journal-level metadata (biomedical category, SJR) for subgroup analysis. As an initial study, the authors benchmark diverse LLMs under conclusion and summary prompting, scoring outputs with reference-based metrics and LLM-as-a-judge, and report that conclusion generation is behaviorally distinct from summarization, that strong models cluster under automatic metrics, and that judge identity affects absolute scores. The dataset and code are released publicly.

Significance. If the author-written conclusions prove to be faithful, evidence-derived targets, MedConclusion would constitute a large-scale, reusable resource for studying evidence-to-conclusion reasoning in biomedicine, supporting both training and evaluation of LLMs for scientific inference tasks. The open release of the full dataset and code, together with metadata enabling domain-specific analyses, is a clear strength that facilitates follow-on work. The reported empirical findings on prompting differences and judge sensitivity are useful observations, though their interpretive weight depends on the soundness of the underlying targets.

major comments (2)
  1. [Dataset construction] Dataset construction section: The manuscript provides insufficient detail on filtering criteria applied to the 5.7M PubMed abstracts, train/validation/test splits, deduplication, or any bias-handling procedures. These omissions directly affect the interpretability of the LLM benchmarking results and the claim that the resource isolates evidence-to-conclusion reasoning.
  2. [Introduction] Introduction and abstract: The central claim that MedConclusion supports 'scientific evidence-to-conclusion reasoning' rests on the assumption that author-written conclusions are appropriate, unbiased targets derivable from the supplied non-conclusion sections. Biomedical abstracts frequently contain extrapolations, clinical implications, or publication-driven claims not strictly supported by the results; without a quantitative audit (e.g., percentage of conclusions containing unsupported statements) or fidelity analysis, this assumption remains unverified and load-bearing for the reusable-resource contribution.
minor comments (2)
  1. [Experiments] The distinction between 'conclusion prompting' and 'summary prompting' settings is mentioned but not illustrated with example prompts or templates; providing these would improve replicability of the reported behavioral differences.
  2. [Results] The abstract states that 'strong models remain closely clustered under current automatic metrics'; a brief discussion of why this clustering occurs (e.g., metric saturation) would help readers interpret the practical significance of the finding.
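For context on the clustering observation, here is a minimal reference-based scoring sketch using the bert-score package; BERTScore is one metric such benchmarks commonly report, though the paper's exact metric suite is not reproduced on this page, and the strings below are toy examples.

```python
from bert_score import score  # pip install bert-score

# Toy generated conclusions from two hypothetical models against the same reference.
references = ["Walnut intake modestly improved the measured outcome in this trial."]
model_a = ["Walnut intake modestly improved the outcome in this trial."]
model_b = ["In this trial, walnut consumption was associated with a modest improvement."]

# BERTScore F1 against the author-written conclusion; near-identical scores across
# strong models would illustrate the saturation / clustering concern.
_, _, f1_a = score(model_a, references, lang="en")
_, _, f1_b = score(model_b, references, lang="en")
print(f"model A F1: {f1_a.mean().item():.4f}  model B F1: {f1_b.mean().item():.4f}")
```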

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our work. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: The manuscript provides insufficient detail on filtering criteria applied to the 5.7M PubMed abstracts, train/validation/test splits, deduplication, or any bias-handling procedures. These omissions directly affect the interpretability of the LLM benchmarking results and the claim that the resource isolates evidence-to-conclusion reasoning.

    Authors: We agree that expanded details on dataset construction will improve interpretability. In the revised manuscript, we will augment the Dataset Construction section with explicit descriptions of the filtering criteria applied to select the 5.7M structured PubMed abstracts, the rationale and ratios for the train/validation/test splits, deduplication methods employed, and any bias-mitigation steps such as ensuring representation across biomedical categories and journal impact levels. These additions will directly support the benchmarking claims. revision: yes

  2. Referee: [Introduction] Introduction and abstract: The central claim that MedConclusion supports 'scientific evidence-to-conclusion reasoning' rests on the assumption that author-written conclusions are appropriate, unbiased targets derivable from the supplied non-conclusion sections. Biomedical abstracts frequently contain extrapolations, clinical implications, or publication-driven claims not strictly supported by the results; without a quantitative audit (e.g., percentage of conclusions containing unsupported statements) or fidelity analysis, this assumption remains unverified and load-bearing for the reusable-resource contribution.

    Authors: We acknowledge that author-written conclusions may include extrapolations or implications not strictly entailed by the results sections, as is common in biomedical literature. The dataset is designed around these naturally occurring author targets to study conclusion generation in practice. While the original manuscript did not include a quantitative fidelity audit, we will revise the Introduction to qualify the central claim and add a dedicated Limitations section that discusses this assumption, provides illustrative examples, and notes how the paired data can enable future fidelity analyses. This preserves the resource's utility for evidence-to-conclusion tasks while addressing the concern. revision: partial

Circularity Check

0 steps flagged

No circularity: dataset curation and empirical benchmarking are self-contained.

full rationale

The paper constructs MedConclusion by extracting non-conclusion sections from PubMed structured abstracts and pairing them with the original author-written conclusions as supervision targets. It then reports standard reference-based metrics and LLM-as-a-judge scores on diverse models under two prompting regimes. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear; the central claim that the resource enables study of evidence-to-conclusion reasoning rests directly on the curation process and external evaluation protocols rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that PubMed structured abstracts provide sufficient evidence and that author conclusions are valid supervision targets; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: PubMed structured abstracts contain sufficient evidence in non-conclusion sections to support conclusion generation
    The task definition and dataset construction rely on this premise for all instances.

pith-pipeline@v0.9.0 · 5511 in / 1241 out tokens · 89674 ms · 2026-05-10T18:33:06.241009+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 18 canonical work pages · 7 internal anchors
