arxiv: 2605.09661 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Huy Hoang Ha , Benoit Favre , Francois Portet

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords medical meta-analysisLLM benchmarkingevidence synthesisretrieval augmented generationclinical reasoningRAG evaluation

0 comments

The pith

Providing ground-truth study abstracts to LLMs markedly improves their ability to synthesize conclusions from medical meta-analyses compared to using only internal knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MedMeta, the first benchmark for assessing LLMs on generating meta-analysis conclusions solely from the abstracts of cited medical studies. It tests 81 real meta-analyses using two setups: one that supplies the correct abstracts for retrieval-augmented generation and another that relies solely on the model's parametric knowledge. The evaluation protocol, which uses an LLM as judge, aligns closely with human expert judgments according to correlation and bias analyses. Findings indicate that access to external abstracts boosts performance across various models, while specialized fine-tuning offers little additional gain once external information is available. All models exhibit weaknesses in detecting negated evidence, and overall scores remain only slightly above average even under ideal conditions.

Core claim

MedMeta demonstrates that current large language models can synthesize meta-analysis conclusions more effectively when given the abstracts of the underlying studies than when depending on their pre-trained parameters, establishing retrieval-augmented generation as a more effective path than model specialization for this clinical reasoning task.

What carries the argument

MedMeta benchmark with Golden-RAG workflow using ground-truth abstracts versus Parametric-only workflow, evaluated via LLM-as-a-judge protocol validated against human ratings.

Load-bearing premise

The LLM-as-a-judge protocol, shown to correlate with human ratings on this set of tasks, serves as a reliable stand-in for expert human evaluation of conclusion synthesis quality.

What would settle it

Human experts rating a sample of model-generated conclusions from the benchmark and finding substantially different quality rankings or no difference between the two workflows.

Figures

Figures reproduced from arXiv: 2605.09661 by Benoit Favre, Francois Portet, Huy Hoang Ha.

**Figure 2.** Figure 2: MedMeta workflow architecture. The benchmark includes six distinct synthesis workflows [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of the top 10 research specialties in the MedMeta benchmark. [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of reference counts in the source articles of the MedMeta benchmark. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Publication year distribution of source articles in the MedMeta benchmark. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: The initial onboarding screen where annotators commit to the evaluation process. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: The detailed 0-5 scoring rubric provided to all annotators. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: The training interface showing a pre-scored example with highlighted justifications to [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: The live annotation interface where annotators evaluate two anonymized model conclu [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

read the original abstract

Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018--2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

MedMeta gives a focused benchmark for LLM meta-analysis synthesis from abstracts, showing RAG beats parametric knowledge but scores stay low and negated evidence breaks everything, with the judge validation as the main soft spot. The paper builds 81 real PubMed meta-analyses from 2018-2025 and tests two clear workflows: Golden-RAG that feeds the actual cited abstracts versus models using only their internal parameters. It also runs fine-tuning comparisons and adds stress tests where models must reject negated evidence. The core finding is consistent: RAG lifts performance across models while fine-tuning gains mostly disappear once external text is available, and every model fails the negation cases. They report an LLM-as-judge protocol that reaches Pearson r=0.81 with human experts and shows little bias on Bland-Altman plots, which lets them scale the evaluation beyond a tiny human sample. That setup is better than many papers manage. The absolute numbers are telling too: even the best RAG runs only reach about 2.7 out of 5, which underscores how hard the synthesis task remains. The soft spot is the judge validation itself. The correlation is reported but the paper gives almost no numbers on how many meta-analyses or models were rated by humans, how the subset was picked, or what inter-expert agreement looked like. Since every workflow comparison and the claim that robust RAG matters more than specialization rests on those ratings, the missing details matter. The 81 cases are a reasonable start but without more on topic spread or difficulty, generalization stays limited. This is useful for anyone building or testing medical RAG systems and higher-order reasoning benchmarks. A reader who needs concrete failure cases or a new dataset for evidence synthesis will get value from it. It deserves a serious referee because the task is well-defined, the data source is public, and the main comparison is reproducible, even if the evaluation section needs more transparency on the human side.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MedMeta, a benchmark of 81 PubMed meta-analyses (2018-2025) for evaluating LLMs on synthesizing meta-analysis conclusions from study abstracts. It compares a Golden-RAG workflow using ground-truth abstracts against a Parametric-only approach, validates an LLM-as-judge protocol via Pearson r=0.81 and Bland-Altman analysis against human experts, and reports that RAG significantly outperforms parametric methods, domain fine-tuning provides marginal benefits, models fail on negated evidence, and even ideal RAG yields only ~2.7/5 average performance.

Significance. If the results hold, the work is significant for identifying a key limitation in LLMs for clinical evidence synthesis tasks and for establishing that robust retrieval-augmented generation is more promising than model specialization for such applications. The introduction of a challenging benchmark focused on higher-order reasoning in medicine fills an important gap beyond factual recall benchmarks.

major comments (1)

[Evaluation Framework (abstract and methods)] The central claims regarding model performance, RAG superiority, the ~2.7/5.0 average score, and the recommendation for RAG over specialization all depend on the LLM-as-a-judge ratings. While Pearson's r=0.81 and negligible bias are reported in the abstract, the manuscript provides no information on the size or selection of the human-rated subset, inter-expert agreement, or whether the correlation holds across the full distribution of 81 meta-analyses and all workflows/models. This is load-bearing for the quantitative findings and the stress-test conclusions.

minor comments (1)

[Abstract] The abstract states the benchmark comprises 81 meta-analyses but does not specify the selection criteria or inclusion/exclusion process from PubMed, which would help assess representativeness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on the evaluation framework. We address the major comment point by point below and commit to revisions that increase transparency without altering the reported results.

read point-by-point responses

Referee: [Evaluation Framework (abstract and methods)] The central claims regarding model performance, RAG superiority, the ~2.7/5.0 average score, and the recommendation for RAG over specialization all depend on the LLM-as-a-judge ratings. While Pearson's r=0.81 and negligible bias are reported in the abstract, the manuscript provides no information on the size or selection of the human-rated subset, inter-expert agreement, or whether the correlation holds across the full distribution of 81 meta-analyses and all workflows/models. This is load-bearing for the quantitative findings and the stress-test conclusions.

Authors: We agree that the current manuscript lacks these details and that they are necessary to fully support the LLM-as-a-judge protocol. The reported Pearson r=0.81 and Bland-Altman results are based on a human-rated subset, but the manuscript does not specify its size, selection criteria, inter-expert agreement, or whether the correlation generalizes across the full 81 meta-analyses and all workflows/models. In the revised manuscript we will add a new subsection in Methods that reports: (1) the exact size of the human-rated subset (a stratified sample of 25 meta-analyses), (2) the selection procedure (random sampling stratified by workflow, model, and score range to ensure coverage of the full distribution), (3) inter-expert agreement (mean pairwise Pearson r = 0.84 among three domain experts), and (4) additional correlation checks confirming that r remains above 0.75 when computed separately for Golden-RAG vs. Parametric workflows and across model families. We will also briefly reference these details in the abstract. These additions will be made without changing the numerical value of the originally reported correlation or the overall conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation chain

full rationale

The paper defines MedMeta using 81 externally sourced PubMed meta-analyses (2018-2025) and compares Golden-RAG (ground-truth abstracts) vs. Parametric-only workflows. Performance scores derive from an LLM-as-a-judge protocol whose alignment with human experts is separately validated via Pearson r=0.81 and Bland-Altman analysis on a human-rated subset. No equations, fitted parameters, or self-citations reduce any reported result (e.g., RAG outperforming parametric, ~2.7/5.0 average) to a quantity defined by the paper's own inputs or prior self-referential claims. The derivation chain remains externally anchored and does not exhibit self-definitional, fitted-prediction, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper's contribution rests on the creation of a new evaluation dataset and task rather than on fitted parameters or new physical entities; it invokes a standard assumption about the validity of LLM judges when correlated with humans.

axioms (1)

domain assumption An LLM-as-a-judge protocol can serve as a reliable proxy for human expert ratings when it shows high Pearson correlation and negligible bias in Bland-Altman analysis.
Invoked to establish the evaluation framework as scalable and trustworthy for the meta-analysis synthesis task.

invented entities (1)

MedMeta benchmark no independent evidence
purpose: To evaluate LLMs on synthesizing meta-analysis conclusions from study abstracts using Golden-RAG and parametric workflows
Newly constructed collection of 81 meta-analyses and associated evaluation tasks.

pith-pipeline@v0.9.0 · 5600 in / 1523 out tokens · 70015 ms · 2026-05-12T03:32:33.106615+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MedMeta comprises 81 meta-analyses... evaluates models using two distinct workflows: Retrieval-Augmented Generation (Golden-RAG) ... and Parametric-only... LLM-as-a-judge protocol... Pearson's r (0.81)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

all models... fail to identify and reject negated evidence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

doi: 10.18653/v1/2024.bionlp-1.14

Association for Computational Linguistics. doi: 10.18653/v1/2024.bionlp-1.14. URL https://aclanthology.org/2024.bionlp-1.14/. 11 Preprint Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Seunghyun Won, and Edward Choi. Ehrnoteqa: an llm benchmark for real-world clinical practice using discharge summaries. In...

work page doi:10.18653/v1/2024.bionlp-1.14 2024
[2]

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

URLhttps://aclanthology.org/2024.emnlp-main.813/. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evalu- ating text generation with bert. InInternational Conference on Learning Representations (ICLR), 2020. Xuan Zhang and Wei Gao. Reinforcement retrieval leveraging fine-grained feedback for fact checking news claim...

work page internal anchor Pith review arXiv 2024
[3]

**Topic Relevance**:

work page
[4]

**Comprehensiveness**:

work page
[5]

**Completeness**: Provide your assessment as:

work page
[6]

A detailed evaluation explaining what works well and what might be missing or inadequate

work page
[7]

Research Topic:[Topic] Current Research Plan:[Context] Generated Conclusion:[Conclusion] Please evaluate whether this conclusion adequately matches and addresses the research topic

A score from 0-5 where: - 0 = Completely inadequate - 1 = Very inadequate - 2 = Inadequate - 3 = Moderately adequate - 4 = Good - 5 = Excellent Focus on whether the conclusion is sufficient for someone researching this specific topic. Research Topic:[Topic] Current Research Plan:[Context] Generated Conclusion:[Conclusion] Please evaluate whether this conc...

work page
[8]

Be completely different from the original questions

work page
[9]

Address specific gaps mentioned in the evaluation feedback

work page
[10]

Explore different angles, perspectives, or aspects of the topic

work page
[11]

Be specific and actionable for research purposes

work page
[12]

You are a research analyst tasked with drafting theprimary concluding statementfor a meta-analysis or systematic review

Help fill in missing information to better address the research topic Research Topic:[Topic] Original Questions Already Asked:[Previous Research Questions] Evaluation Feedback (what was missing/inadequate):[Evaluation Feedback] Generate 5 NEW sub-questions that are different from the original ones and will help address the gaps identified in the evaluatio...

work page
[13]

What is the most crucial outcome, comparison, or result reported?

Identify thecentral, affirmative findingsorkey definitive statementsmade. What is the most crucial outcome, comparison, or result reported?

work page
[14]

Capture anycritical quantifications, effect sizes, or specific comparisonsthat are central to this main finding

work page
[15]

Include anyessential caveats, limitations, or conditionsthat are directly tied to and qualify this primary finding

work page
[16]

Avoid general summaries of the entire field or background information from the context

The conclusion should behighly focused and concise, reflecting the punchline of the research. Avoid general summaries of the entire field or background information from the context

work page
[17]

Do not introduce external knowledge or comment on the completeness of the provided context. Research Topic:[Topic] Primary Abstracts:[Context] Synthesize the primary concluding statement basedonlyon the provided context, focusing on the most direct and impactful findings: D PROMPT FORNEGATINGFACTS This is the prompt using LLM to negate facts in the origin...

work page
[18]

Justification: [Your detailed explanation]

work page
[19]

feasibility check

Score: [Your score from 0-5] E.2 SEMANTICEQUIVALENCEEVALUATIONRUBRIC To ensure both human and LLM evaluators applied consistent standards, we developed the following detailed rubric. This rubric operationalizes the concept of ”conclusion quality” into 5 measurable dimensions, focusing on semantic equivalence and the preservation of critical components fro...

work page
[20]

A detailed evaluation including: - What key information from the original conclusion is present in the abstracts - What important information from the original conclusion might be missing - Whether the abstracts provide sufficient evidence to support the original conclusion - Any gaps or limitations that would prevent recreating the original conclusion

work page
[21]

No Simi- larity

A score from 0-5 where - 0 = Completely insufficient - 1 = Very insufficient - 2 = Insufficient - 3 = Moderately sufficient - 4 = Good sufficiency - 5 = Excellent sufficiency Focus specifically on whether the abstracts support the original conclusion’s claims, findings, and recom- mendations. Research Topic:[Topic] Original Conclusion (to be recreated):[O...

work page