
arxiv: 2510.06708 · v3 · submitted 2025-10-08 · 💻 cs.SE · cs.AI

AISysRev -- LLM-based Tool for Title-abstract Screening

Pith reviewed 2026-05-18 09:36 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM screening · systematic reviews · title-abstract screening · human-AI collaboration · boundary cases · paper classification · AISysRev

The pith

An LLM screening tool classifies papers into easy and boundary cases to focus human effort on uncertain ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AISysRev, a containerized web application that uses large language models to perform title-abstract screening for systematic reviews. Users upload a CSV of papers, set inclusion and exclusion criteria, and choose from remote or local models that run in parallel for fast throughput. A qualitative trial on 137 papers produced a four-way classification: easy includes, easy excludes, boundary includes, and boundary excludes. The boundary cases are where the models tend to err, so the tool is presented as a way to cut workload while still routing uncertain papers to human reviewers.

Core claim

AISysRev accepts CSV files of titles and abstracts, applies user-specified criteria through zero-shot or few-shot prompting with models such as Gemini, Claude, Mistral, ChatGPT, or local OpenAI-compatible models, and provides interfaces that display LLM outputs as guidance for manual review. In the 137-paper trial the outputs fall into four categories—Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes—where boundary cases are prone to LLM errors, demonstrating that LLMs can reduce the volume of assessment but do not replace human judgment.
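The input side of this workflow can be illustrated with a minimal sketch that reads a CSV of titles and abstracts and builds either a zero-shot or a few-shot screening prompt. The column names (`title`, `abstract`), the prompt wording, and the helper names `load_papers` and `build_prompt` are assumptions made for this example; they are not taken from the AISysRev code base.

```python
import csv
import io


def load_papers(csv_text):
    """Parse CSV text with (assumed) 'title' and 'abstract' columns."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def build_prompt(paper, inclusion, exclusion, examples=()):
    """Build a screening prompt; it is zero-shot when `examples` is empty."""
    lines = ["You are screening papers for a systematic review.",
             "Inclusion criteria:"]
    lines += [f"- {criterion}" for criterion in inclusion]
    lines.append("Exclusion criteria:")
    lines += [f"- {criterion}" for criterion in exclusion]
    # Few-shot mode: prepend already-labelled papers as worked examples.
    for example_paper, decision in examples:
        lines.append(f"Example title: {example_paper['title']}")
        lines.append(f"Example abstract: {example_paper['abstract']}")
        lines.append(f"Example decision: {decision}")
    lines.append(f"Title: {paper['title']}")
    lines.append(f"Abstract: {paper['abstract']}")
    lines.append("Answer with INCLUDE or EXCLUDE.")
    return "\n".join(lines)


# Invented sample data, used only to exercise the helpers.
SAMPLE_CSV = (
    "title,abstract\n"
    '"Deadlines and defects","We study time pressure in software teams."\n'
)
INCLUSION = ["Studies time pressure in software engineering"]
EXCLUSION = ["Not peer reviewed"]

papers = load_papers(SAMPLE_CSV)
zero_shot = build_prompt(papers[0], INCLUSION, EXCLUSION)
few_shot = build_prompt(papers[0], INCLUSION, EXCLUSION,
                        examples=[(papers[0], "INCLUDE")])
print(zero_shot)
```

The only difference between the two modes in this sketch is whether labelled examples are placed before the paper under review; how the actual tool stores or selects such examples is not described in the source.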

What carries the argument

The four-category classification (Easy Includes, Easy Excludes, Boundary Includes, Boundary Excludes) that separates papers the LLM handles reliably from those that require human intervention.
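The paper does not give an operational definition of a boundary case, so the following is only one hypothetical way such a split could be made. It assumes the model returns an inclusion probability between 0 and 1 and treats anything within a chosen `margin` of 0.5 as a boundary case. Both the probability output and the `margin` parameter are assumptions for illustration, not features confirmed for AISysRev.

```python
def categorize(include_probability, margin=0.2):
    """Map a (hypothetical) inclusion probability to one of four categories.

    Papers whose probability lies within `margin` of 0.5 are treated as
    boundary cases; all others are treated as easy cases.
    """
    if not 0.0 <= include_probability <= 1.0:
        raise ValueError("include_probability must be in [0, 1]")
    decision = "Includes" if include_probability >= 0.5 else "Excludes"
    confident = abs(include_probability - 0.5) >= margin
    prefix = "Easy" if confident else "Boundary"
    return f"{prefix} {decision}"


for p in (0.95, 0.55, 0.45, 0.05):
    print(p, categorize(p))
```

Under this assumed rule, only the papers labelled `Boundary Includes` or `Boundary Excludes` would be routed to a human reviewer.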

If this is right

  • Screening can proceed at 100 to 300 papers per minute depending on the chosen model.
  • Human reviewers can direct attention to the boundary cases flagged by the tool.
  • The same interface supports both fully automated runs and assisted manual review.
  • Any model compatible with the OpenAI SDK or hosted locally can be swapped in without changing the workflow.
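The throughput figure in the first bullet comes from parallelized LLM calls. A minimal sketch of that pattern, using a thread pool and a stand-in function instead of a real model call, might look as follows. `fake_llm_screen`, `screen_in_parallel`, and `minutes_needed` are hypothetical names, and the keyword rule inside the stub is purely a placeholder for a real request to a remote or local model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_screen(paper):
    """Stand-in for an LLM call (hypothetical); a real tool would send a prompt."""
    time.sleep(0.01)  # simulate network latency
    matched = "time pressure" in paper["abstract"].lower()
    return "INCLUDE" if matched else "EXCLUDE"


def screen_in_parallel(papers, screen_fn, max_workers=8):
    """Run screen_fn over all papers concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(screen_fn, papers))


def minutes_needed(n_papers, papers_per_minute):
    """Wall-clock minutes to screen n_papers at a given throughput."""
    return n_papers / papers_per_minute


# Invented sample data, used only to exercise the helpers.
papers = [
    {"abstract": "Time pressure in agile teams"},
    {"abstract": "A neural model for detecting technical debt"},
]
decisions = screen_in_parallel(papers, fake_llm_screen)
print(decisions)
print(minutes_needed(137, 100), minutes_needed(137, 300))
```

At the reported range of 100 to 300 papers per minute, the 137 papers of the trial would correspond to roughly 0.46 to 1.37 minutes of screening time.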

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The four-category framework could be used in other research fields to measure where LLMs are reliable for literature screening.
  • Tool guidance that highlights boundary cases might reduce reviewer fatigue on large projects.
  • Iterative use of human corrections on boundary cases could be fed back to refine prompts or fine-tune models.

Load-bearing premise

The pattern of easy and boundary cases seen in the 137-paper trial will appear and be recognizable in other sets of papers, criteria, and models.

What would settle it

A new screening run on a different collection of papers in which error rates are no higher for the boundary category than for the easy categories, or in which human reviewers cannot reliably identify the boundary cases from the tool output.
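The comparison described above can be sketched as a per-group error rate, assuming each paper has a category from the tool, an LLM decision, and a human reference decision. The record layout and the function name `error_rate_by_group` are assumptions, and the toy records are invented for illustration; they are not data from the paper.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the LLM error rate for easy and boundary cases separately.

    Each record is (category, llm_decision, human_decision), where category
    is one of the four labels such as "Easy Includes" or "Boundary Excludes".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, llm_decision, human_decision in records:
        group = "boundary" if category.startswith("Boundary") else "easy"
        totals[group] += 1
        errors[group] += int(llm_decision != human_decision)
    return {group: errors[group] / totals[group] for group in totals}


# Invented toy records, not results from the 137-paper trial.
toy_records = [
    ("Easy Includes", "include", "include"),
    ("Easy Excludes", "exclude", "exclude"),
    ("Boundary Includes", "include", "exclude"),
    ("Boundary Excludes", "exclude", "exclude"),
]
rates = error_rate_by_group(toy_records)
print(rates)
```

If a new run produced a boundary rate no higher than the easy rate, that would count against the premise that boundary cases are where the models tend to err.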

Figures

Figures reproduced from arXiv: 2510.06708 by Aleksi Huotala, Miikka Kuutila, Mika Mäntylä, Olli-Pekka Turtio, Simo Sipilä.

Figure 1: The AISysRev tool architecture.

Excerpt from the adjacent paper text: … literature review (SLR) on time pressure in software engineering. The screening trials have been part of our effort to update our prior SLR [11]. Thus we have evaluated papers published after the SLR was completed. This has allowed us to evaluate the tool in an actual context. The trial users (the second and fourth author) are topic experts who participated in conducting the…
Original abstract

Conducting systematic reviews is laborious. In the screening or study selection phase, the number of papers can be overwhelming. Recent research has demonstrated that large language models (LLMs) can perform title-abstract screening and support humans in the task. To this end, we developed AISysRev, an LLM-based screening tool implemented as a containerized web application. The tool accepts CSV files containing paper titles and abstracts. Users specify inclusion and exclusion criteria. Multiple different LLMs can be used, such as Gemini, Claude, Mistral or ChatGPT via OpenRouter. We also support locally hosted models and any model compatible with the OpenAI SDK. AISysRev implements both zero-shot and few-shot prompting, and also allows for manual screening through interfaces that display LLM results as guidance for human reviewers. LLM calls are parallelized, meaning screening speed is typically between 100 to 300 papers per minute, depending on the model and the host. To demonstrate the tool's use in practice, we conducted a qualitative trial study with 137 papers using the tool. Our findings indicate that papers can be classified into four categories: Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes. The Boundary cases, where LLMs are prone to errors, highlight the need for human intervention. While LLMs do not replace human judgment in systematic reviews, they can reduce the burden of assessing large volumes of scientific literature.

Video: https://www.youtube.com/watch?v=HeblemlgnAQ
Tool: https://github.com/EvoTestOps/AISysRev

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents AISysRev, a containerized web application for LLM-assisted title-abstract screening in systematic reviews. The tool accepts CSV inputs of titles and abstracts, allows users to specify inclusion/exclusion criteria, supports multiple LLMs (Gemini, Claude, Mistral, ChatGPT via OpenRouter, local models, and OpenAI SDK-compatible ones), implements zero-shot and few-shot prompting, and provides manual screening interfaces that display LLM results as guidance. A qualitative trial on 137 papers is used to illustrate that papers fall into four categories (Easy Includes, Easy Excludes, Boundary Includes, Boundary Excludes), with the conclusion that LLMs can reduce screening burden while boundary cases require human intervention.

Significance. The concrete implementation details, support for diverse models and prompting strategies, and parallelized screening (100-300 papers per minute) represent practical contributions to evidence synthesis workflows. If the four-category classification and boundary detection prove reproducible across datasets and models, the work could support hybrid human-LLM screening protocols that direct effort to uncertain cases. However, the absence of quantitative metrics (accuracy, inter-rater agreement, error rates, or category counts) limits the strength of claims about burden reduction and the reliability of the proposed guidance framework.

major comments (1)
  1. [Qualitative trial description (and abstract)] Qualitative trial on 137 papers: The central claim that papers classify into Easy Includes/Excludes and Boundary Includes/Excludes, with boundary cases highlighting the need for human intervention, rests on a single qualitative observation. No counts per category are reported, no operational definition of 'boundary' (e.g., confidence threshold, prompt disagreement, or uncertainty signal) is provided, and no human-LLM agreement metrics or comparison to a gold-standard screening decision are given. This makes the classification appear post-hoc and weakens support for the practical recommendation that LLMs reduce burden while reliably flagging oversight needs.
minor comments (2)
  1. [Discussion or Conclusion] The manuscript would benefit from a dedicated limitations section that explicitly discusses generalizability beyond the tested models, criteria, and paper set.
  2. [Tool features and implementation] Consider clarifying in the tool description whether the few-shot examples are user-provided or system-suggested, and how they are stored or reused across sessions.
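The human-LLM agreement metric the referee asks for could, for example, be Cohen's kappa, defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below computes it for two lists of decisions. The function name and the toy labels are illustrative assumptions; the paper itself reports no such figures.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy decisions, not data from the trial.
llm_decisions = ["include", "include", "exclude", "exclude"]
human_decisions = ["include", "exclude", "exclude", "exclude"]
kappa = cohens_kappa(llm_decisions, human_decisions)
print(kappa)
```

Reporting such a value per category, alongside the counts per category, would address the referee's concern that the four-way split currently rests on qualitative observation alone.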

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review of our manuscript on AISysRev. We address the major comment below, focusing on the qualitative trial and its presentation.

Point-by-point responses
  1. Referee: Qualitative trial on 137 papers: The central claim that papers classify into four categories: Easy Includes/Excludes and Boundary Includes/Excludes, with boundary cases highlighting the need for human intervention, rests on a single qualitative observation. No counts per category are reported, no operational definition of 'boundary' (e.g., confidence threshold, prompt disagreement, or uncertainty signal) is provided, and no human-LLM agreement metrics or comparison to a gold-standard screening decision are given. This makes the classification appear post-hoc and weakens support for the practical recommendation that LLMs reduce burden while reliably flagging oversight needs.

    Authors: We thank the referee for this observation. The four-category classification emerged directly from our use of the AISysRev interface while screening the 137 papers: Easy Includes and Easy Excludes were cases where the LLM output aligned clearly with the provided criteria, while Boundary Includes and Boundary Excludes were those requiring human review due to ambiguity in the LLM suggestion or partial match with criteria. This was not a post-hoc invention but an observed pattern during the trial. We agree that the manuscript would benefit from greater clarity on this point. We will revise the relevant section (and abstract) to provide an operational description of how boundary cases were identified in practice (based on LLM output characteristics and human overrides in the tool), to state explicitly that the trial is illustrative rather than a formal evaluation, and to moderate claims about burden reduction to reflect the qualitative nature of the demonstration. We did not collect gold-standard labels or compute agreement metrics because the study focus was tool functionality, not performance benchmarking.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; tool report and qualitative trial are self-contained

Full rationale

The paper presents the AISysRev tool implementation and reports observations from one qualitative trial on 137 papers, classifying results into four categories as an empirical finding. No equations, derivations, fitted parameters, predictions, or self-citation chains exist that reduce any claim to its inputs by construction. The four-category observation is stated directly from the trial without mathematical reduction or load-bearing reliance on prior self-authored results. This is a standard tool-development report with independent content and no circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are used. The work rests on the domain assumption that LLMs can be prompted to perform binary classification of scientific abstracts and that human reviewers can usefully interpret the model's outputs on boundary cases.

axioms (1)
  • domain assumption LLMs prompted with inclusion/exclusion criteria can produce useful screening decisions for title-abstract pairs.
    Stated implicitly in the tool design and the qualitative trial description.

pith-pipeline@v0.9.0 · 5838 in / 1455 out tokens · 37670 ms · 2026-05-18T09:36:32.036956+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening

    cs.DL 2026-04 accept novelty 7.0

    A Chrome extension provides no-code, serverless AI-assisted title and abstract screening for systematic reviews by integrating LLMs and ML active learning with Google Sheets.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 1 Pith paper

  1. [1]

    Bearman, M., Smith, C. D., Carbone, A., Slade, S., Baik, C., Hughes-Warrington, M., and Neumann, D. L. Systematic review methodology in higher education. Higher Education Research & Development 31, 5 (2012), 625–640

  2. [2]

    Clark, J., Glasziou, P., Del Mar, C., Bannach-Brown, A., Stehlik, P., and Scott, A. M. A full systematic review was completed in 2 weeks using automation tools: a case study. Journal of clinical epidemiology 121 (2020), 81–90

  3. [3]

    Felizardo, K. R., Deizepe, A., Coutinho, D., Gomes, G., Meireles, M., Gerosa, M., and Steinmacher, I. On the difficulties of conducting and replicating systematic literature reviews studies using llms in software engineering. In 2025 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE) (2025), IEEE, ...

  4. [4]

    Felizardo, K. R., Lima, M. S., Deizepe, A., Conte, T. U., and Steinmacher, I. ChatGPT application in Systematic Literature Reviews in Software Engineering: An evaluation of its accuracy to support the selection activity. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA, 2024), E...

  5. [5]

    Garritty, C., Gartlehner, G., Nussbaumer-Streit, B., King, V. J., Hamel, C., Kamel, C., Affengruber, L., and Stevens, A. Cochrane rapid reviews methods group offers evidence-informed guidance to conduct rapid reviews. Journal of clinical epidemiology 130 (2021), 13–22

  6. [6]

    Huotala, A., Kuutila, M., and Mäntylä, M. SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (Oct. 2025), ESEM ’25, IEEE, pp. 1–12

  7. [7]

    Huotala, A., Kuutila, M., Ralph, P., and Mäntylä, M. The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (New York, NY, USA, July 18, 2024), EASE ’24, Association for Computing Machinery, pp. 262–271

  8. [8]

    Huotala, A., Kuutila, M., Turtio, O.-P., and Mäntylä, M. Dataset for: AISysRev - LLM-based tool for Title-abstract Screening, Oct. 2025. Available at https://doi.org/10.5281/zenodo.17208539

  9. [9]

    Kersten, R., Harms, J., Liket, K., and Maas, K. Small firms, large impact? a systematic review of the sme finance literature. World development 97 (2017), 330–348

  10. [10]

    Kitchenham, B. A., Dyba, T., and Jorgensen, M. Evidence-based software engineering. In Proceedings. 26th International Conference on Software Engineering (2004), IEEE, pp. 273–281

  11. [11]

    Kuutila, M., Mäntylä, M., Farooq, U., and Claes, M. Time pressure in software engineering: A systematic review. Information and Software Technology 121 (2020), 106257

  12. [12]

    Marshall, I. J., and Wallace, B. C. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic reviews 8, 1 (2019), 163

  13. [13]

    Meade, M. O., and Richardson, W. S. Selecting and appraising studies for a systematic review. Annals of internal medicine 127, 7 (1997), 531–537

  14. [14]

    Öncü, S., Torun, F., and Ülkü, H. H. Ai-powered standardised patients: evaluating chatgpt-4o’s impact on clinical case management in intern physicians. BMC Medical Education 25, 1 (2025), 278

  15. [15]

    Petersen, K., and Gerken, J. M. On the road to interactive llm-based systematic mapping studies. Information and Software Technology 178 (2025), 107611

  16. [16]

    Pizard, S., Lezama, J., García, R., Vallespir, D., and Kitchenham, B. Using rapid reviews to support software engineering practice: a systematic review and a replication study. Empirical Software Engineering 30, 1 (2025), 10

  17. [17]

    Rafi, D. M., Moses, K. R. K., Petersen, K., and Mäntylä, M. V. Benefits and limitations of automated software testing: Systematic literature review and practitioner survey. In 2012 7th international workshop on automation of software test (AST) (2012), IEEE, pp. 36–42

  18. [18]

    Romano, S., Conforti, A., Guidetti, G., Viotti, S., Converso, D., and Scanniello, G. On job demands and resources in the italian software industry. In International Conference on Product-Focused Software Process Improvement (2024), Springer, pp. 172–188

  19. [19]

    Shahzeidi, M., Mollahoseini Ardakani, M., Javdani Gandomani, T., and Mirzaie, K. A hybrid model of long short-term memory neural networks and quantum behavior pso for detecting self-admitted technical debt. Cluster Computing 28, 3 (2025), 152

  20. [20]

    Sotaqirá-Gutiérrez, R., Beltran, L. M., and Garzon Ruiz, J. P. Hackathons as experiential learning platforms for engineering design skills. Cogent Education 12, 1 (2025), 2442187

  21. [21]

    Thode, L., Iftikhar, U., and Mendez, D. Exploring the use of llms for the selection phase in systematic literature studies. Information and Software Technology (2025), 107757

  22. [22]

    Watt, A., Cameron, A., Sturm, L., Lathlean, T., Babidge, W., Blamey, S., Facey, K., Hailey, D., Norderhaug, I., and Maddern, G. Rapid reviews versus full systematic reviews: an inventory of current methods and practice in health technology assessment. International journal of technology assessment in health care 24, 2 (2008), 133–139