arxiv: 2511.12635 · v2 · submitted 2025-11-16 · 💻 cs.SE · cs.AI · cs.LG

LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Pith reviewed 2026-05-17 22:07 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords large language models · literature screening · systematic reviews · performance evaluation · Matthews correlation coefficient · cost-sensitive metrics · lost evidence

The pith

Standard metrics for LLM literature screening in systematic reviews can hide major losses of relevant evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that conventional performance measures like accuracy mislead when applied to LLM-based screening for systematic reviews because they ignore both the severe cost of missing relevant studies and the extreme class imbalance. It introduces a Weighted Matthews Correlation Coefficient that folds in those asymmetric costs and shows on large reanalyses that this measure selects models retaining far more relevant papers than accuracy or unweighted MCC. The authors further insist that every evaluation must publish the complete confusion matrix, treat undecided LLM outputs as positives needing human review, and remain leakage-aware with appropriate baselines. A reader cares because systematic reviews drive decisions across medicine, policy, and science, and flawed screening evaluations can quietly discard key evidence or inflate workloads.

Core claim

The paper claims that SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.

What carries the argument

The Weighted Matthews Correlation Coefficient (WMCC), which extends standard MCC by adding a weight parameter to reflect higher costs for false negatives than false positives.
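The idea of folding a cost weight into MCC can be illustrated with a minimal sketch. The standard MCC function below follows the usual confusion-matrix definition. The weighted variant is an assumption made for illustration only: it counts every actually relevant paper `w` times, so a false negative weighs `w` times as much as a false positive. The paper's own WMCC formula is not reproduced on this page and may differ. The function name `weighted_mcc_sketch` and the two screeners' counts are hypothetical.

```python
import math

def mcc(tp, fp, fn, tn):
    """Standard Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def weighted_mcc_sketch(tp, fp, fn, tn, w=10.0):
    """Illustrative cost-weighted MCC (an assumption, not the paper's formula).

    Every actual positive is counted w times, so TP and FN are scaled by w.
    With w = 1 this reduces to the standard MCC.
    """
    return mcc(w * tp, fp, w * fn, tn)

# Hypothetical screeners on 1,000 papers, 100 of them relevant.
# Tuples are (TP, FP, FN, TN).
cautious = (90, 300, 10, 600)   # misses 10 relevant papers, flags many irrelevant ones
strict = (40, 20, 60, 880)      # misses 60 relevant papers, flags few irrelevant ones

for name, cm in (("cautious", cautious), ("strict", strict)):
    print(name, round(mcc(*cm), 3), round(weighted_mcc_sketch(*cm, w=10), 3))
```

In this toy example the unweighted MCC prefers the strict screener, while the weighted sketch with `w=10` prefers the cautious one, which is the kind of ranking reversal the page describes.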

If this is right

  • Evaluators must report Lost Evidence as a primary outcome alongside other metrics.
  • Model rankings should combine WMCC and MCC rather than rely on either alone.
  • Every published study must display the full confusion matrix.
  • Unclassifiable LLM outputs must be counted as positives that require human review.
  • Studies intended to guide actual SR practice require leakage-aware designs and non-LLM baselines.
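The reporting rules in this list can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: any LLM output other than an explicit "exclude" (including undecided or empty outputs) is counted as a positive that goes to human review, and Lost Evidence is read as the share of relevant papers that were wrongly excluded, FN / (TP + FN). That reading is inferred from the abstract's wording rather than quoted from the paper. The label strings and sample data are hypothetical.

```python
from collections import Counter

def confusion_matrix(truth, outputs):
    """Full confusion matrix; anything that is not an explicit "exclude"
    (e.g. "unsure" or an empty string) is treated as a positive."""
    counts = Counter()
    for is_relevant, output in zip(truth, outputs):
        flagged = output.strip().lower() != "exclude"
        if is_relevant:
            counts["TP" if flagged else "FN"] += 1
        else:
            counts["FP" if flagged else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

def lost_evidence(cm):
    """Assumed definition: fraction of relevant papers that were excluded."""
    relevant = cm["TP"] + cm["FN"]
    return cm["FN"] / relevant if relevant else 0.0

# Hypothetical gold labels (True = relevant) and raw LLM outputs.
truth = [True, True, True, False, False, False, False, False]
outputs = ["include", "exclude", "unsure", "exclude",
           "Exclude", "include", "", "exclude"]

cm = confusion_matrix(truth, outputs)
print(cm, lost_evidence(cm))
```

Publishing all four cells, rather than a single derived score, lets readers recompute any metric they prefer, including Lost Evidence.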

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The WMCC approach could be adapted to other high-stakes screening tasks where missing positives carries heavy downstream costs.
  • Empirical studies could test whether domain-specific calibration of the weight improves performance over the conservative default.
  • Prospective trials that apply these recommendations during live systematic reviews would provide stronger evidence than retrospective reanalyses.

Load-bearing premise

That a single default weight parameter can adequately represent asymmetric misclassification costs across different domains, study types, and review contexts without domain-specific calibration.

What would settle it

An independent evaluation on a new collection of systematic reviews in which the LLM ranked highest by WMCC loses substantially more relevant evidence than the one ranked highest by standard MCC under realistic screening costs.

Figures

Figures reproduced from arXiv: 2511.12635 by Barbara Kitchenham, Lech Madeyski, Martin Shepperd.

Figure 1: Lost Evidence per Model for three SRs (the median of Lost Evidence is presented as a point) and the min/max show…
Figure 2: Distribution of evaluation metrics used across 27 papers analyzing Gen-AI tools for systematic review screening.
Figure 3: Adoption of good practices.
    … classifications (e.g., classifications made by two LLMs, or a classification made by a single human researcher and an LLM). In addition, the PABAK variant does not seem to be a useful metric in any circumstances. 4. Failure to address costs as well as benefits: None of the five papers that explicitly claimed to measure workload savings, nor the Sandner SR [13] that explicitly inv…
Figure 4: Subsampling stability with 100 to 500 observations.
Original abstract

Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT: practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies, differentiated by study type (retrospective benchmarking vs deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses, the largest covering 9 LLMs × 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9,695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default. Conclusions: SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.
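The crossover idea behind the sensitivity analysis can be sketched as a sweep over the weight: find the smallest `w` at which the weighted metric stops agreeing with unweighted MCC about the best model. This is a rough illustration, not the authors' procedure. The weighting scheme (scaling TP and FN by `w`) is an assumption, and the model names and confusion-matrix counts are hypothetical.

```python
import math

def mcc(tp, fp, fn, tn):
    """Standard MCC; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def weighted_mcc_sketch(cm, w):
    """Assumed weighting: each actual positive counts w times."""
    tp, fp, fn, tn = cm
    return mcc(w * tp, fp, w * fn, tn)

def best_model(models, w):
    """Name of the model with the highest weighted score at weight w."""
    return max(models, key=lambda name: weighted_mcc_sketch(models[name], w))

def crossover_weight(models, weights):
    """Smallest weight in the grid whose winner differs from the w = 1 winner."""
    baseline = best_model(models, 1)
    for w in weights:
        if best_model(models, w) != baseline:
            return w
    return None  # ranking never changes on this grid

# Hypothetical (TP, FP, FN, TN) counts for two screeners on one review.
models = {
    "cautious": (90, 300, 10, 600),
    "strict": (40, 20, 60, 880),
}

crossover = crossover_weight(models, range(1, 21))
print("MCC-best:", best_model(models, 1), "| crossover weight:", crossover)
```

If the crossover for a given review sits well below the chosen default, the ranking at that default is insensitive to modest changes in the assumed cost ratio, which is the sense in which a high default is described as conservative.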

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reviews 29 studies on LLM-based literature screening for systematic reviews (SRs), identifies shortcomings in standard metrics (e.g., only 10% report MCC, none price false-negative costs), proposes a Weighted Matthews Correlation Coefficient (WMCC) that incorporates asymmetric misclassification costs via a weight w, and offers differentiated recommendations for retrospective benchmarking versus deployment studies. These are grounded in reanalyses of three SE corpora (largest: 9 LLMs × 24 studies, 34,528 articles) showing MCC/WMCC ranking disagreements in 55% of cases and substantial Lost Evidence differences (e.g., 63.3% vs 5.8% in one 9,695-article study), plus a sensitivity analysis supporting w=10 as a conservative default (median crossover ~2.7).

Significance. If the central recommendations hold, the work would meaningfully improve evaluation practices for LLM screening in SRs by reducing the risk of overlooking relevant evidence and encouraging fuller reporting (full confusion matrices, leakage-aware designs). The concrete reanalysis on a large SE corpus, explicit comparison of MCC vs WMCC outcomes, and sensitivity analysis for the weight parameter are notable strengths that provide empirical grounding beyond the motivating biomedical example.

major comments (2)
  1. [Results (sensitivity analysis)] Results, sensitivity analysis paragraph: the selection of default w=10 as conservative rests on a crossover analysis performed on the same SE reanalysis data (median ~2.7, all <7); this introduces modest circularity and leaves the central recommendation to use WMCC for ranking vulnerable when cost ratios differ across biomedical vs SE domains or study types, as no per-domain bounds or expert-elicited ratios are provided.
  2. [Method] Method and Abstract: limited detail is given on exact data exclusion rules, inclusion criteria for the 29 papers, and processing steps for the Delgado-Chaves et al. (2025) motivating dataset; this reduces the ability to assess the robustness of the reported metric-extraction statistics (e.g., 10% MCC reporting rate) that underpin the recommendations.
minor comments (2)
  1. [Method (WMCC proposal)] The WMCC formula should be presented as an explicit equation (with clear definition of the weight w and how it modifies the standard MCC) rather than described only in prose, to aid reproducibility.
  2. [Abstract] Abstract: 'w~=2.7' should be written as 'w ≈ 2.7' for standard mathematical notation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and robustness of our recommendations. We address each major comment below, making revisions where they strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Results (sensitivity analysis)] Results, sensitivity analysis paragraph: the selection of default w=10 as conservative rests on a crossover analysis performed on the same SE reanalysis data (median ~2.7, all <7); this introduces modest circularity and leaves the central recommendation to use WMCC for ranking vulnerable when cost ratios differ across biomedical vs SE domains or study types, as no per-domain bounds or expert-elicited ratios are provided.

    Authors: We acknowledge the modest circularity in deriving the default w=10 from crossover points observed in the SE reanalysis corpus. However, the analysis shows consistently low thresholds (median ~2.7, maximum <7), which supports w=10 as a conservative choice that remains robust even under higher cost asymmetries within the tested data. We agree that cost ratios may vary across biomedical versus SE domains and will revise the discussion section to explicitly note this limitation, recommend that future studies perform domain-specific sensitivity analyses, and clarify that w=10 serves as a practical starting point rather than a universal optimum. No per-domain expert elicitation was conducted in this work, as our focus was on demonstrating the impact of WMCC via large-scale reanalysis.

    Revision: partial

  2. Referee: Method and Abstract: limited detail is given on exact data exclusion rules, inclusion criteria for the 29 papers, and processing steps for the Delgado-Chaves et al. (2025) motivating dataset; this reduces the ability to assess the robustness of the reported metric-extraction statistics (e.g., 10% MCC reporting rate) that underpin the recommendations.

    Authors: We agree that additional methodological transparency is warranted. We will expand the Methods section (and update the Abstract if space permits) to specify the exact search strategy, inclusion and exclusion criteria applied to identify the 29 papers, the data extraction protocol for reported metrics, and the precise processing steps used for the Delgado-Chaves et al. (2025) dataset, including how unclassifiable outputs were handled. This revision will allow readers to better evaluate the reproducibility of the 10% MCC reporting statistic and related findings.

    Revision: yes

Circularity Check

1 step flagged

WMCC default w=10 chosen via sensitivity analysis on the same evaluation data

specific steps
  1. fitted input called prediction [Results (sensitivity analysis paragraph)]
    "Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default."

    The scalar weight w that defines WMCC is set to the value 10 on the basis of a crossover analysis performed on the largest SE reanalysis (9 LLMs × 24 studies, 34,528 articles). That same corpus is then used to demonstrate that WMCC rankings differ from MCC rankings in 55% of studies and that the WMCC-best model loses only 5.8% evidence versus 43.9% for the MCC-best model. Consequently the recommended default parameter is statistically informed by the very data on which its advantage is claimed.

full rationale

The paper's central recommendation to use WMCC (with fixed w=10) for ranking LLMs rests on a sensitivity analysis that identifies median crossover at w≈2.7 using the identical 34,528-article SE reanalysis corpus where WMCC is shown to outperform MCC and reduce Lost Evidence. This creates modest circularity because the conservative default is calibrated directly on the data used to validate the metric's superiority, rather than on independent cost-elicitation or external benchmarks. The paper still supplies independent content via the full confusion-matrix reporting requirement, the 29-paper review, and explicit MCC comparisons, preventing a higher score. No self-citation chains or definitional loops were identified.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that false-negative costs dominate in screening and on the choice of a single weight w; no new entities are postulated and the reanalyses provide external grounding beyond the motivating example.

free parameters (1)
  • weight w = 10
    Single scalar that encodes the relative cost of false negatives versus false positives; default set to 10 after sensitivity analysis on reanalysis data.
axioms (1)
  • domain assumption: Misclassification costs in literature screening are asymmetric, with missing a relevant paper (false negative) substantially more costly than including an irrelevant one.
    Invoked to justify WMCC and the priority on Lost Evidence; appears in the objective and results sections.
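
The asymmetry stated in this axiom can be written as a simple linear cost model. The symbols $C$, $c_{FP}$ and $c_{FN}$ below are illustrative assumptions, not notation taken from the paper, and the reading of Lost Evidence as the false-negative share of relevant papers is inferred from the abstract's wording.

```latex
C = c_{FP}\,\mathrm{FP} + c_{FN}\,\mathrm{FN},
\qquad
w = \frac{c_{FN}}{c_{FP}}
\;\Longrightarrow\;
\frac{C}{c_{FP}} = \mathrm{FP} + w\,\mathrm{FN},
\qquad
\text{Lost Evidence} = \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}}
```

Under this reading, w = 10 means one missed relevant paper is priced the same as ten irrelevant papers sent to a human reviewer.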

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    A. Huotala, M. Kuutila, P. Ralph, M. Mäntylä, The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews, in: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE’24), ACM, New York, NY, USA, 2024, pp. 262–271

  2. [2]

    L. Thode, U. Iftikhar, D. Mendez, Exploring the use of LLMs for the selection phase in systematic literature studies, Information and Software Technology 184 (2025) 107757

  3. [3]

    K. R. Felizardo, M. S. Lima, et al., ChatGPT application in Systematic Literature Reviews in Software Engineering: an evaluation of its accuracy to support the selection activity, in: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM’24), ACM, New York, NY, USA, 2024, pp. 25–36

  4. [4]

    F. M. Delgado-Chaves, M. J. Jennings, et al., Transforming literature screening: The emerging role of large language models in systematic reviews, Proceedings of the National Academy of Sciences 122 (2025) e2411962122

  5. [5]

    F. Dennstädt, J. Zink, P. M. Putora, J. Hastings, N. Cihoric, Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain, Syst Rev 13 (2024) 158

  6. [6]

    Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, K. Hadfield, Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages, Res Synth Methods 15 (2024) 616–626

  7. [7]

    F. Trad, R. Yammine, J. Charafeddine, M. Chakhtoura, M. Rahme, G. El-Hajj Fuleihan, A. Chehab, Streamlining systematic reviews with large language models using prompt engineering and retrieval augmented generation, BMC Medical Research Methodology 25 (2025) 130

  8. [8]

    R. Sanghera, A. J. Thirunavukarasu, M. El Khoury, J. O’Logbon, Y. Chen, A. Watt, M. Mahmood, H. Butt, G. Nishimura, A. A. S. Soltan, High-performance automated abstract screening with large language model ensembles, Journal of the American Medical Informatics Association 32 (2025) 893–904

  9. [9]

    D. Scherbakov, N. Hubig, V. Jansari, A. Bakumenko, L. A. Lenert, The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review, Journal of the American Medical Informatics Association 32 (2025) 1071–1086

  10. [10]

    S. Wang, H. Scells, S. Zhuang, M. Potthast, B. Koopman, G. Zuccon, Zero-shot generative large language models for systematic review screening automation, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 403–420

  11. [11]

    T. Oami, Y. Okada, T.-a. Nakada, Performance of a Large Language Model in Screening Citations, JAMA Network Open 7 (2024) e2420496–e2420496

  12. [12]

    J. K. Kim, M. Rickard, P. Dangle, N. Batra, M. Chua, A. Khondker, K. Szymanski, R. Misseri, A. Lorenzo, Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool, Journal of Medical Artificial Intelligence (2025)

  13. [13]

    E. Sandner, L. Fontana, K. Kothari, A. Henriques, I. Jakovljevic, A. Simniceanu, A. Wagner, C. Gütl, Evaluating large language models for literature screening: A systematic review of sensitivity and workload reduction, in: Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA, INSTICC, SciTePress,...

  14. [14]

    D. M. W. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, International Journal of Machine Learning Technology 2 (2011). URL: https://api.semanticscholar.org/CorpusID:3770261

  15. [15]

    T. Byrt, J. Bishop, J. B. Carlin, Bias, prevalence and kappa, Journal of Clinical Epidemiology 46 (1993) 423–429

  16. [16]

    G. Chen, P. Faris, B. Hemmelgarn, R. L. Walker, H. Quan, Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa, BMC Medical Research Methodology 9 (2009) 1–8

  17. [17]

    B. Matthews, Comparison of the predicted and observed secondary structure of t4 phage lysozyme, Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (1975) 442–451

  18. [18]

    A. Luque, A. Carrasco, A. Martín, A. de Las Heras, The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognition 91 (2019) 216–231

  19. [19]

    B. Kitchenham, L. Madeyski, D. Budgen, SEGRESS: Software Engineering Guidelines for REporting Secondary Studies, IEEE Transactions on Software Engineering 49 (2023) 1273–1298. doi:TSE.2022.3174092

  20. [20]

    B. Kitchenham, D. Budgen, P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, CRC Press, 2016

  21. [21]

    T. Woelfle, J. Hirt, et al., Benchmarking human–AI collaboration for common evidence appraisal tools, Journal of Clinical Epidemiology 175 (2024) 111533

  22. [22]

    O. Akinseloyin, X. Jiang, V. Palade, A question-answering framework for automated abstract screening using large language models, Journal of the American Medical Informatics Association 31 (2024) 1939–1952

  23. [23]

    S. Attri, R. Kaur, B. Singh, P. Rai, Msr57 transforming systematic literature reviews: unleashing the potential of gpt-4: a cutting-edge large language model, to elevate research synthesis, Value in Health 27 (2024) S270

  24. [24]

    X. Cai, Y. Geng, Y. Du, B. Westerman, D. Wang, C. Ma, J. J. G. Vallejo, Utilizing chatgpt to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation, medRxiv (2023) 2023–09

  25. [25]

    C. Cao, J. Sang, R. Arora, R. Kloosterman, M. Cecere, J. Gorla, R. Saleh, D. Chen, I. Drennan, B. Teja, et al., Prompting is all you need: LLMs for systematic review screening, medRxiv (2024) 2024–06

  26. [26]

    P. Castillo-Segura, C. Alario-Hoyos, C. D. Kloos, C. F. Panadero, Leveraging the potential of generative ai to accelerate systematic literature reviews: an example in the area of educational technology, in: 2023 World Engineering Education Forum-Global Engineering Deans Council (WEEF-GEDC), IEEE, 2023, pp. 1–8

  27. [27]

    S. Datta, K. Lee, H. Paek, M. Mojarad, V. Prabhu, J. Zhang, E. Foley, J. Glasgow, C. Liston, Y. Zheng, et al., Msr103 optimizing systematic literature reviews in endometrial cancer: Leveraging ai for real-time article screening and data extraction in clinical trials, Value in Health 27 (2024) S279

  28. [28]

    J. Du, E. Soysal, D. Wang, L. He, B. Lin, J. Wang, F. J. Manion, Y. Li, E. Wu, L. Yao, Machine learning models for abstract screening task-a systematic literature review application for health economics and outcome research, BMC Medical Research Methodology 24 (2024) 108

  29. [29]

    O. K. Gargari, M. H. Mahmoudi, M. Hajisafarali, R. Samiee, Enhancing title and abstract screening for systematic reviews with gpt-3.5 turbo, BMJ Evidence-based Medicine 29 (2024) 69–70

  30. [30]

    E. Guo, M. Gupta, J. Deng, Y.-J. Park, M. Paget, C. Naugler, Automated paper screening for clinical reviews using large language models: data analysis study, Journal of Medical Internet Research 26 (2024) e48996

  31. [31]

    M. Issaiy, H. Ghanaati, S. Kolahi, M. Shakiba, A. H. Jalali, D. Zarei, S. Kazemian, M. A. Avanaki, K. Firouznia, Methodological insights into chatgpt’s screening performance in systematic reviews, BMC medical research methodology 24 (2024) 78

  32. [32]

    R. Kaur, P. Rai, S. Attri, G. Kaur, B. Singh, Msr15 revolutionizing systematic literature reviews: harnessing the power of large language model (gpt-4) for enhanced research synthesis, Value in Health 27 (2024) S262

  33. [33]

    M. Li, J. Sun, X. Tan, Evaluating the effectiveness of large language models in abstract screening: a comparative analysis, Systematic Reviews 13 (2024) 219

  34. [34]

    Y. Lin, J. Li, H. Xiao, L. Zheng, Y. Xiao, H. Song, J. Fan, D. Xiao, D. Ai, T. Fu, et al., Automatic literature screening using the pajo deep-learning model for clinical practice guidelines, BMC Medical Informatics and Decision Making 23 (2023) 247

  35. [35]

    P. Rai, R. Kaur, S. Pandey, S. Attri, G. Kaur, B. Singh, Msr59 advancing systematic literature reviews: The integration of ai-powered nlp models in data collection processes, Value in Health 27 (2024) S270

  36. [36]

    A. Robinson, W. Thorne, B. P. Wu, A. Pandor, M. Essat, M. Stevenson, X. Song, Bio-sieve: exploring instruction tuning large language models for systematic review automation, arXiv preprint arXiv:2308.06610 (2023)

  37. [37]

    J. Royer, E. Wu, R. Ayyagari, S. Parravano, U. Pathare, M. Kisielinska, Msr131 prospects for automation of systemic literature reviews (slrs) with artificial intelligence and natural language processing, Value in Health 26 (2023) S418

  38. [38]

    S. Spillias, P. Tuohy, M. Andreotta, R. Annand-Jones, F. Boschetti, C. Cvitanovic, J. Duggan, E. A. Fulton, D. B. Karcher, C. Paris, et al., Human-ai collaboration to identify literature for evidence synthesis, Cell Reports Sustainability 1 (2024)

  39. [39]

    E. Syriani, I. David, G. Kumar, Assessing the ability of ChatGPT to screen articles for systematic reviews, arXiv preprint arXiv:2307.06464 (2023)

  40. [40]

    E. Syriani, I. David, G. Kumar, Screening articles for systematic reviews with ChatGPT, Journal of Computer Languages 80 (2024) 101287

  41. [41]

    V.-T. Tran, G. Gartlehner, S. Yaacoub, I. Boutron, L. Schwingshackl, J. Stadelmaier, I. Sommer, F. Aboulayeh, S. Afach, J. Meerpohl, et al., Sensitivity, specificity and avoidable workload of using a large language models for title and abstract screening in systematic reviews and meta-analyses, medRxiv (2023) 2023–12

  42. [42]

    D. Wilkins, Automated title and abstract screening for scoping reviews using the gpt-4 large language model, arXiv preprint arXiv:2311.07918 (2023)