LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews
Pith reviewed 2026-05-17 22:07 UTC · model grok-4.3
The pith
Standard metrics for LLM literature screening in systematic reviews can hide major losses of relevant evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.
What carries the argument
The Weighted Matthews Correlation Coefficient (WMCC), which extends standard MCC by adding a weight parameter to reflect higher costs for false negatives than false positives.
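To make the role of the weight concrete, here is a minimal Python sketch. The `mcc` function is the standard Matthews Correlation Coefficient. The `weighted_mcc_sketch` function is an assumption made purely for illustration: it scales the false-negative count by w before applying the MCC formula. The page describes WMCC only in prose, so the paper's actual definition may differ.

```python
from math import sqrt


def mcc(tp: float, fp: float, tn: float, fn: float) -> float:
    """Standard Matthews Correlation Coefficient for a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def weighted_mcc_sketch(tp: float, fp: float, tn: float, fn: float,
                        w: float = 10.0) -> float:
    """Hypothetical weighted variant (ASSUMPTION, not the paper's formula).

    Each false negative is counted w times, then standard MCC is applied.
    With w = 1 this reduces to plain MCC.
    """
    return mcc(tp, fp, tn, w * fn)


if __name__ == "__main__":
    # Hypothetical confusion matrix, for illustration only.
    tp, fp, tn, fn = 90, 10, 90, 10
    print("MCC:", mcc(tp, fp, tn, fn))
    print("weighted sketch, w=10:", weighted_mcc_sketch(tp, fp, tn, fn, w=10.0))
```

Under this assumed construction, a model with any false negatives scores lower as w grows, which is the qualitative behavior the summary attributes to WMCC.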
If this is right
- Evaluators must report Lost Evidence as a primary outcome alongside other metrics.
- Model rankings should combine WMCC and MCC rather than rely on either alone.
- Every published study must display the full confusion matrix.
- Unclassifiable LLM outputs must be counted as positives that require human review.
- Studies intended to guide actual SR practice require leakage-aware designs and non-LLM baselines.
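The reporting points in the list above can be sketched in a few lines of Python. This is a rough illustration, not the paper's pipeline: the function names, the "include"/"exclude" output convention, and the reading of Lost Evidence as FN / (TP + FN) (the share of relevant articles the model rejects) are all assumptions introduced here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def parse_decision(raw: str) -> Tuple[bool, bool]:
    """Map a raw LLM reply to (predicted_relevant, needs_human_review).

    ASSUMPTION: replies are expected to start with "include" or "exclude".
    Anything else is unclassifiable and is treated as a positive that a
    human must review, so it can never silently drop a relevant article.
    """
    text = raw.strip().lower()
    if text.startswith("include"):
        return True, False
    if text.startswith("exclude"):
        return False, False
    return True, True


def confusion_matrix(raw_outputs: Iterable[str],
                     gold_relevant: Iterable[bool]) -> Dict[str, int]:
    """Count TP, FP, TN and FN so that the full matrix can be reported."""
    counts: Counter = Counter()
    for raw, relevant in zip(raw_outputs, gold_relevant):
        predicted, _ = parse_decision(raw)
        if predicted:
            counts["tp" if relevant else "fp"] += 1
        else:
            counts["fn" if relevant else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}


def lost_evidence(cm: Dict[str, int]) -> float:
    """Share of relevant articles rejected by the model (assumed definition)."""
    relevant = cm["tp"] + cm["fn"]
    return cm["fn"] / relevant if relevant else 0.0


if __name__ == "__main__":
    # Hypothetical replies and gold labels, for illustration only.
    outputs = ["include", "Exclude.", "I cannot decide", "exclude", ""]
    gold = [True, True, True, False, False]
    cm = confusion_matrix(outputs, gold)
    print(cm, lost_evidence(cm))
```

Routing unparseable replies to the positive class trades some extra human workload for a guarantee that formatting failures do not show up as lost evidence.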
Where Pith is reading between the lines
- The WMCC approach could be adapted to other high-stakes screening tasks where missing positives carries heavy downstream costs.
- Empirical studies could test whether domain-specific calibration of the weight improves performance over the conservative default.
- Prospective trials that apply these recommendations during live systematic reviews would provide stronger evidence than retrospective reanalyses.
Load-bearing premise
That a single default weight parameter can adequately represent asymmetric misclassification costs across different domains, study types, and review contexts without domain-specific calibration.
What would settle it
An independent evaluation on a new collection of systematic reviews in which the LLM ranked highest by WMCC loses substantially more relevant evidence than the one ranked highest by standard MCC under realistic screening costs.
Original abstract
Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT: practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies, differentiated by study type (retrospective benchmarking vs deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses, the largest covering 9 LLMs × 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9,695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default. Conclusions: SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reviews 29 studies on LLM-based literature screening for systematic reviews (SRs), identifies shortcomings in standard metrics (e.g., only 10% report MCC, none price false-negative costs), proposes a Weighted Matthews Correlation Coefficient (WMCC) that incorporates asymmetric misclassification costs via a weight w, and offers differentiated recommendations for retrospective benchmarking versus deployment studies. These are grounded in reanalyses of three SE corpora (largest: 9 LLMs × 24 studies, 34,528 articles) showing MCC/WMCC ranking disagreements in 55% of cases and substantial Lost Evidence differences (e.g., 63.3% vs 5.8% in one 9,695-article study), plus a sensitivity analysis supporting w=10 as a conservative default (median crossover ~2.7).
Significance. If the central recommendations hold, the work would meaningfully improve evaluation practices for LLM screening in SRs by reducing the risk of overlooking relevant evidence and encouraging fuller reporting (full confusion matrices, leakage-aware designs). The concrete reanalysis on a large SE corpus, explicit comparison of MCC vs WMCC outcomes, and sensitivity analysis for the weight parameter are notable strengths that provide empirical grounding beyond the motivating biomedical example.
major comments (2)
- [Results (sensitivity analysis)] Results, sensitivity analysis paragraph: the selection of default w=10 as conservative rests on a crossover analysis performed on the same SE reanalysis data (median ~2.7, all <7); this introduces modest circularity and leaves the central recommendation to use WMCC for ranking vulnerable when cost ratios differ across biomedical vs SE domains or study types, as no per-domain bounds or expert-elicited ratios are provided.
- [Method] Method and Abstract: limited detail is given on exact data exclusion rules, inclusion criteria for the 29 papers, and processing steps for the Delgado-Chaves et al. (2025) motivating dataset; this reduces the ability to assess the robustness of the reported metric-extraction statistics (e.g., 10% MCC reporting rate) that underpin the recommendations.
minor comments (2)
- [Method (WMCC proposal)] The WMCC formula should be presented as an explicit equation (with clear definition of the weight w and how it modifies the standard MCC) rather than described only in prose, to aid reproducibility.
- [Abstract] Abstract: 'w~=2.7' should be written as 'w ≈ 2.7' for standard mathematical notation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the scope and robustness of our recommendations. We address each major comment below, making revisions where they strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Results (sensitivity analysis)] Results, sensitivity analysis paragraph: the selection of default w=10 as conservative rests on a crossover analysis performed on the same SE reanalysis data (median ~2.7, all <7); this introduces modest circularity and leaves the central recommendation to use WMCC for ranking vulnerable when cost ratios differ across biomedical vs SE domains or study types, as no per-domain bounds or expert-elicited ratios are provided.
  Authors: We acknowledge the modest circularity in deriving the default w=10 from crossover points observed in the SE reanalysis corpus. However, the analysis shows consistently low thresholds (median ~2.7, maximum <7), which supports w=10 as a conservative choice that remains robust even under higher cost asymmetries within the tested data. We agree that cost ratios may vary across biomedical versus SE domains and will revise the discussion section to explicitly note this limitation, recommend that future studies perform domain-specific sensitivity analyses, and clarify that w=10 serves as a practical starting point rather than a universal optimum. No per-domain expert elicitation was conducted in this work, as our focus was on demonstrating the impact of WMCC via large-scale reanalysis.
  Revision: partial
- Referee: Method and Abstract: limited detail is given on exact data exclusion rules, inclusion criteria for the 29 papers, and processing steps for the Delgado-Chaves et al. (2025) motivating dataset; this reduces the ability to assess the robustness of the reported metric-extraction statistics (e.g., 10% MCC reporting rate) that underpin the recommendations.
  Authors: We agree that additional methodological transparency is warranted. We will expand the Methods section (and update the Abstract if space permits) to specify the exact search strategy, inclusion and exclusion criteria applied to identify the 29 papers, the data extraction protocol for reported metrics, and the precise processing steps used for the Delgado-Chaves et al. (2025) dataset, including how unclassifiable outputs were handled. This revision will allow readers to better evaluate the reproducibility of the 10% MCC reporting statistic and related findings.
  Revision: yes
Circularity Check
WMCC default w=10 chosen via sensitivity analysis on the same evaluation data
specific steps
- fitted input called prediction [Results (sensitivity analysis paragraph)]
  "Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default."
The scalar weight w that defines WMCC is set to the value 10 on the basis of a crossover analysis performed on the largest SE reanalysis (9 LLMs × 24 studies, 34,528 articles). That same corpus is then used to demonstrate that WMCC rankings differ from MCC rankings in 55% of studies and that the WMCC-best model loses only 5.8% evidence versus 43.9% for the MCC-best model. Consequently the recommended default parameter is statistically informed by the very data on which its advantage is claimed.
full rationale
The paper's central recommendation to use WMCC (with fixed w=10) for ranking LLMs rests on a sensitivity analysis that identifies median crossover at w≈2.7 using the identical 34,528-article SE reanalysis corpus where WMCC is shown to outperform MCC and reduce Lost Evidence. This creates modest circularity because the conservative default is calibrated directly on the data used to validate the metric's superiority, rather than on independent cost-elicitation or external benchmarks. The paper still supplies independent content via the full confusion-matrix reporting requirement, the 29-paper review, and explicit MCC comparisons, preventing a higher score. No self-citation chains or definitional loops were identified.
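The crossover idea discussed above can be illustrated with a small Python sweep. Everything here is a sketch under stated assumptions: "crossover" is read as the smallest weight at which the top-ranked model differs from the one plain MCC (w = 1) prefers, the weighted score scales false negatives by w before applying MCC (a hypothetical construction, not the paper's formula), and the confusion matrices are invented.

```python
from math import sqrt
from typing import Dict, Optional, Tuple

Matrix = Tuple[int, int, int, int]  # (tp, fp, tn, fn)


def weighted_score(cm: Matrix, w: float) -> float:
    """MCC with false negatives scaled by w (ASSUMPTION, see lead-in)."""
    tp, fp, tn, fn = cm
    fn_w = w * fn
    denom = sqrt((tp + fp) * (tp + fn_w) * (tn + fp) * (tn + fn_w))
    return 0.0 if denom == 0 else (tp * tn - fp * fn_w) / denom


def best_model(models: Dict[str, Matrix], w: float) -> str:
    """Name of the model with the highest weighted score at weight w."""
    return max(models, key=lambda name: weighted_score(models[name], w))


def crossover_weight(models: Dict[str, Matrix], w_max: float = 10.0,
                     step: float = 0.1) -> Optional[float]:
    """Smallest grid weight above 1 at which the best model changes, if any."""
    baseline = best_model(models, 1.0)  # ranking under plain MCC
    n_steps = int(round((w_max - 1.0) / step))
    for i in range(1, n_steps + 1):
        w = 1.0 + i * step
        if best_model(models, w) != baseline:
            return round(w, 10)
    return None  # ranking is stable over the whole grid


if __name__ == "__main__":
    # Hypothetical models screening 1,000 articles, 100 of them relevant.
    models = {
        "model_a": (40, 5, 895, 60),   # few false positives, many misses
        "model_b": (95, 200, 700, 5),  # many false positives, few misses
    }
    print("best at w=1:", best_model(models, 1.0))
    print("best at w=10:", best_model(models, 10.0))
    print("crossover weight:", crossover_weight(models))
```

A sweep like this only describes the data it is run on, which is the circularity concern: a default chosen from crossovers observed on one corpus is not thereby validated on another.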
Axiom & Free-Parameter Ledger
free parameters (1)
- weight w = 10
axioms (1)
- domain assumption: Misclassification costs in literature screening are asymmetric, with missing a relevant paper (false negative) substantially more costly than including an irrelevant one.
Reference graph
Works this paper leans on
- [1] A. Huotala, M. Kuutila, P. Ralph, M. Mäntylä, The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews, in: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE’24), ACM, New York, NY, USA, 2024, pp. 262–271
- [2]
- [3] K. R. Felizardo, M. S. Lima, et al., ChatGPT application in Systematic Literature Reviews in Software Engineering: an evaluation of its accuracy to support the selection activity, in: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM’24), ACM, New York, NY, USA, 2024, pp. 25–36
- [4] F. M. Delgado-Chaves, M. J. Jennings, et al., Transforming literature screening: The emerging role of large language models in systematic reviews, Proceedings of the National Academy of Sciences 122 (2025) e2411962122
- [5] F. Dennstädt, J. Zink, P. M. Putora, J. Hastings, N. Cihoric, Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain, Syst Rev 13 (2024) 158
- [6] Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, K. Hadfield, Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages, Res Synth Methods 15 (2024) 616–626
- [7] F. Trad, R. Yammine, J. Charafeddine, M. Chakhtoura, M. Rahme, G. El-Hajj Fuleihan, A. Chehab, Streamlining systematic reviews with large language models using prompt engineering and retrieval augmented generation, BMC Medical Research Methodology 25 (2025) 130
- [8] R. Sanghera, A. J. Thirunavukarasu, M. El Khoury, J. O’Logbon, Y. Chen, A. Watt, M. Mahmood, H. Butt, G. Nishimura, A. A. S. Soltan, High-performance automated abstract screening with large language model ensembles, Journal of the American Medical Informatics Association 32 (2025) 893–904
- [9] D. Scherbakov, N. Hubig, V. Jansari, A. Bakumenko, L. A. Lenert, The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review, Journal of the American Medical Informatics Association 32 (2025) 1071–1086
- [10] S. Wang, H. Scells, S. Zhuang, M. Potthast, B. Koopman, G. Zuccon, Zero-shot generative large language models for systematic review screening automation, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 403–420
- [11] T. Oami, Y. Okada, T.-a. Nakada, Performance of a Large Language Model in Screening Citations, JAMA Network Open 7 (2024) e2420496–e2420496
- [12] J. K. Kim, M. Rickard, P. Dangle, N. Batra, M. Chua, A. Khondker, K. Szymanski, R. Misseri, A. Lorenzo, Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool, Journal of Medical Artificial Intelligence (2025)
- [13] E. Sandner, L. Fontana, K. Kothari, A. Henriques, I. Jakovljevic, A. Simniceanu, A. Wagner, C. Gütl, Evaluating large language models for literature screening: A systematic review of sensitivity and workload reduction, in: Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA, INSTICC, SciTePress,...
- [14] D. M. W. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, International Journal of Machine Learning Technology 2 (2011). URL: https://api.semanticscholar.org/CorpusID:3770261
- [15] T. Byrt, J. Bishop, J. B. Carlin, Bias, prevalence and kappa, Journal of Clinical Epidemiology 46 (1993) 423–429
- [16] G. Chen, P. Faris, B. Hemmelgarn, R. L. Walker, H. Quan, Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa, BMC Medical Research Methodology 9 (2009) 1–8
- [17] B. Matthews, Comparison of the predicted and observed secondary structure of t4 phage lysozyme, Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (1975) 442–451
- [18]
- [19] B. Kitchenham, L. Madeyski, D. Budgen, SEGRESS: Software Engineering Guidelines for REporting Secondary Studies, IEEE Transactions on Software Engineering 49 (2023) 1273–1298. doi:TSE.2022.3174092
- [20] B. Kitchenham, D. Budgen, P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, CRC Press, 2016
- [21] T. Woelfle, J. Hirt, et al., Benchmarking human–AI collaboration for common evidence appraisal tools, Journal of Clinical Epidemiology 175 (2024) 111533
- [22] O. Akinseloyin, X. Jiang, V. Palade, A question-answering framework for automated abstract screening using large language models, Journal of the American Medical Informatics Association 31 (2024) 1939–1952
- [23]
- [24] X. Cai, Y. Geng, Y. Du, B. Westerman, D. Wang, C. Ma, J. J. G. Vallejo, Utilizing chatgpt to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation, medRxiv (2023) 2023–09
- [25] C. Cao, J. Sang, R. Arora, R. Kloosterman, M. Cecere, J. Gorla, R. Saleh, D. Chen, I. Drennan, B. Teja, et al., Prompting is all you need: LLMs for systematic review screening, medRxiv (2024) 2024–06
- [26] P. Castillo-Segura, C. Alario-Hoyos, C. D. Kloos, C. F. Panadero, Leveraging the potential of generative ai to accelerate systematic literature reviews: an example in the area of educational technology, in: 2023 World Engineering Education Forum-Global Engineering Deans Council (WEEF-GEDC), IEEE, 2023, pp. 1–8
- [27] S. Datta, K. Lee, H. Paek, M. Mojarad, V. Prabhu, J. Zhang, E. Foley, J. Glasgow, C. Liston, Y. Zheng, et al., Msr103 optimizing systematic literature reviews in endometrial cancer: Leveraging ai for real-time article screening and data extraction in clinical trials, Value in Health 27 (2024) S279
- [28] J. Du, E. Soysal, D. Wang, L. He, B. Lin, J. Wang, F. J. Manion, Y. Li, E. Wu, L. Yao, Machine learning models for abstract screening task-a systematic literature review application for health economics and outcome research, BMC Medical Research Methodology 24 (2024) 108
- [29] O. K. Gargari, M. H. Mahmoudi, M. Hajisafarali, R. Samiee, Enhancing title and abstract screening for systematic reviews with gpt-3.5 turbo, BMJ Evidence-based Medicine 29 (2024) 69–70
- [30] E. Guo, M. Gupta, J. Deng, Y.-J. Park, M. Paget, C. Naugler, Automated paper screening for clinical reviews using large language models: data analysis study, Journal of Medical Internet Research 26 (2024) e48996
- [31]
- [32] R. Kaur, P. Rai, S. Attri, G. Kaur, B. Singh, Msr15 revolutionizing systematic literature reviews: harnessing the power of large language model (gpt-4) for enhanced research synthesis, Value in Health 27 (2024) S262
- [33] M. Li, J. Sun, X. Tan, Evaluating the effectiveness of large language models in abstract screening: a comparative analysis, Systematic Reviews 13 (2024) 219
- [34] Y. Lin, J. Li, H. Xiao, L. Zheng, Y. Xiao, H. Song, J. Fan, D. Xiao, D. Ai, T. Fu, et al., Automatic literature screening using the pajo deep-learning model for clinical practice guidelines, BMC Medical Informatics and Decision Making 23 (2023) 247
- [35] P. Rai, R. Kaur, S. Pandey, S. Attri, G. Kaur, B. Singh, Msr59 advancing systematic literature reviews: The integration of ai-powered nlp models in data collection processes, Value in Health 27 (2024) S270
- [36] A. Robinson, W. Thorne, B. P. Wu, A. Pandor, M. Essat, M. Stevenson, X. Song, Bio-sieve: exploring instruction tuning large language models for systematic review automation, arXiv preprint arXiv:2308.06610 (2023)
- [37]
- [38] S. Spillias, P. Tuohy, M. Andreotta, R. Annand-Jones, F. Boschetti, C. Cvitanovic, J. Duggan, E. A. Fulton, D. B. Karcher, C. Paris, et al., Human-ai collaboration to identify literature for evidence synthesis, Cell Reports Sustainability 1 (2024)
- [39] E. Syriani, I. David, G. Kumar, Assessing the ability of ChatGPT to screen articles for systematic reviews, arXiv preprint arXiv:2307.06464 (2023)
- [40] E. Syriani, I. David, G. Kumar, Screening articles for systematic reviews with ChatGPT, Journal of Computer Languages 80 (2024) 101287
- [41] V.-T. Tran, G. Gartlehner, S. Yaacoub, I. Boutron, L. Schwingshackl, J. Stadelmaier, I. Sommer, F. Aboulayeh, S. Afach, J. Meerpohl, et al., Sensitivity, specificity and avoidable workload of using a large language models for title and abstract screening in systematic reviews and meta-analyses, medRxiv (2023) 2023–12
- [42] D. Wilkins, Automated title and abstract screening for scoping reviews using the gpt-4 large language model, arXiv preprint arXiv:2311.07918 (2023)