Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

2026 · cs.IR · arXiv 2603.22327

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert-annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics, enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67. Estimated costs vary substantially, by up to 96 times across evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.

representative citing papers

AI Coding Agents Can Reproduce Social Science Findings

cs.CL · 2026-06-09 · conditional · novelty 7.0

A new benchmark shows AI coding agents reproduce many social science findings from provided materials, outperforming prior agent benchmarks, while highlighting prompt sensitivity.

citing papers explorer

Showing 1 of 1 citing paper.

AI Coding Agents Can Reproduce Social Science Findings

cs.CL · 2026-06-09 · conditional · none · ref 7 · internal anchor

A new benchmark shows AI coding agents reproduce many social science findings from provided materials, outperforming prior agent benchmarks, while highlighting prompt sensitivity.
