CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

Don Enwerem; Jacinta Arnold; Kevin Song; Kristian Quevada; Satvik Tripathi; Tessa S. Cook

arXiv: 2605.16377 · v1 · pith:4M2UWHRLnew · submitted 2026-05-10 · 💻 cs.DL · cs.AI · cs.LG

Pith reviewed 2026-05-20 22:31 UTC · model grok-4.3

keywords: reporting checklists · manuscript automation · local LLM · scientific reproducibility · evidence extraction · CPU inference · open source tool

The pith

A locally deployed LLM system recommends scientific reporting checklists at 90 percent accuracy and completes their items at 88 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CheckSupport as an open-source tool that uses instruction-tuned large language models running entirely on local hardware to first recommend the right reporting checklist for a manuscript and then fill in its items by pulling evidence directly from the text. It breaks the process into staged, constrained prompts that favor accurate extraction over free-form generation. A sympathetic reader would care because manual checklist work is a known barrier to consistent, reproducible reporting; automating it locally could lower that barrier without sending manuscripts to external servers. The reported results show 90 percent accuracy on checklist choice and 88 percent on individual item completion across peer-reviewed papers, with an average wall-clock time of 12.5 seconds per manuscript on CPU-only hardware.

Core claim

CheckSupport is an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. It employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90 percent overall accuracy for checklist recommendations and 88 percent overall accuracy for item-level completion while operating on CPU-only hardware.

What carries the argument

A staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis.
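To make the idea concrete, here is a minimal Python sketch of a two-stage workflow under stated assumptions. The paper's actual prompts, models, and checklist data are not reproduced on this page, so every name below (`CHECKLISTS`, `stub_model`, `recommend_checklist`, `complete_item`, `run_pipeline`) is hypothetical, and the keyword-matching stub merely stands in for a local instruction-tuned model. What it illustrates is the shape of the approach: stage 1 picks from a closed label set, and stage 2 accepts an extracted quote only if it appears verbatim in the manuscript.

```python
from typing import Callable, Optional

# Hypothetical checklist data: the guideline names are real, the items are illustrative only.
CHECKLISTS = {
    "CONSORT": ["randomization", "blinding"],
    "PRISMA": ["search strategy", "eligibility criteria"],
}


def stub_model(prompt: str, manuscript: str) -> str:
    """Stand-in for a local LLM call: first sentence that mentions the prompt keyword."""
    for sentence in manuscript.split("."):
        if prompt.lower() in sentence.lower():
            return sentence.strip()
    return ""


def recommend_checklist(manuscript: str) -> str:
    """Stage 1: choose from a closed label set (constrained output, no free text)."""
    text = manuscript.lower()
    scores = {
        name: sum(item in text for item in items)
        for name, items in CHECKLISTS.items()
    }
    return max(scores, key=scores.get)


def complete_item(
    item: str,
    manuscript: str,
    model: Callable[[str, str], str] = stub_model,
) -> Optional[str]:
    """Stage 2: extract evidence for one item; reject anything not found verbatim."""
    quote = model(item, manuscript)
    if quote and quote in manuscript:
        return quote
    return None  # no grounded evidence: leave the item unfilled rather than generate text


def run_pipeline(manuscript: str):
    """Run both stages and return the chosen checklist with per-item evidence."""
    name = recommend_checklist(manuscript)
    return name, {item: complete_item(item, manuscript) for item in CHECKLISTS[name]}


if __name__ == "__main__":
    sample = (
        "We used computer-generated randomization. "
        "Blinding was applied to assessors."
    )
    print(run_pipeline(sample))
```

The verbatim-substring check is one simple way to bias a pipeline toward extraction over generation; it is an assumption of this sketch, not a description of how CheckSupport enforces grounding.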

If this is right

Reduces the manual effort authors spend selecting and completing reporting checklists
Enables reproducible and auditable checklist workflows without sharing manuscript text externally
Supports more transparent scientific reporting across multiple disciplines
Runs on ordinary CPU hardware with an average of 12.5 seconds per manuscript

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Journal submission platforms could embed the tool to surface the correct checklist automatically at upload time.
The extraction approach might transfer to other document-heavy tasks such as grant compliance checks or regulatory filings.
Authors could run the system as a pre-submission self-audit to catch missing items before peer review begins.

Load-bearing premise

The staged prompting strategy produces faithful, evidence-grounded extractions from arbitrary manuscript text without systematic omissions or fabrications that would affect checklist accuracy.

What would settle it

Apply the same CheckSupport pipeline to a fresh set of manuscripts drawn from a discipline or journal set absent from the original corpus and measure whether checklist recommendation accuracy drops below 80 percent.
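The test above reduces to a small calculation, sketched here in Python. The function names and the example labels are hypothetical; only the 80 percent threshold comes from the text. Because a fresh evaluation set may be small, the sketch also reports an approximate 95 percent Wilson score interval, which is an addition of this example rather than something the paper describes.

```python
import math


def recommendation_accuracy(predicted, gold):
    """Fraction of manuscripts whose recommended checklist matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion observed on n items."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def drops_below(predicted, gold, threshold=0.80):
    """True when the point estimate of accuracy falls under the threshold."""
    return recommendation_accuracy(predicted, gold) < threshold


if __name__ == "__main__":
    # Made-up labels for illustration only; not results from the paper.
    gold = ["CONSORT", "PRISMA", "STARD", "STROBE", "CONSORT"]
    pred = ["CONSORT", "PRISMA", "STARD", "PRISMA", "CONSORT"]
    acc = recommendation_accuracy(pred, gold)
    print(acc, wilson_interval(acc, len(gold)), drops_below(pred, gold))
```

With only a handful of manuscripts the interval is wide, so a point estimate just under 80 percent would not by itself settle the question.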

Original abstract

Transparent and standardized reporting is essential for reproducible scientific research, yet adherence to reporting guidelines remains inconsistent because of the manual effort required to select and complete checklists. We present CheckSupport, an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models, preserving data privacy and enabling reproducible, auditable workflows. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion while operating on CPU-only hardware. On average, the wall-clock time per manuscript was 12.5 seconds, including the checklist recommendation and full checklist completion. These results demonstrate that large language models, when applied as structured inference components, can reduce reporting burden and support more transparent and reproducible scientific reporting across disciplines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CheckSupport gives a working local LLM tool for checklist selection and completion via staged prompting, but the accuracy numbers lack enough evaluation detail to fully trust them yet.

The main point is that this paper describes an open-source, CPU-only system called CheckSupport that uses local instruction-tuned models and a staged prompting workflow to recommend reporting checklists and then extract evidence to fill them out. It reports 90% accuracy on recommendations and 88% on item-level completion, with an average runtime of 12.5 seconds per manuscript on peer-reviewed papers. That combination of local deployment and constrained extraction steps is the concrete new piece here, aimed at cutting down manual work while keeping data private and workflows auditable.

The paper does a reasonable job showing the architecture is feasible and that the approach runs without cloud services, which matters for labs that care about reproducibility and compliance. It also gives timing numbers that suggest it could be practical in real submissions.

The soft spot is the evaluation. The headline accuracies come from a corpus of manuscripts, but the write-up does not spell out corpus size, selection criteria, or how accuracy was scored: by exact spans, with partial credit, or with post-checks for omissions and fabrications. The staged prompting is meant to reduce hallucinations, yet without targeted verification against the source text, it is hard to know how often the system might skip key passages or add ungrounded support. That is a moderate rather than fatal gap for a tool paper, but it needs tightening.

This work is for people building or using tools to improve reporting standards in publishing, or for groups that want a starting point for local LLM automation in this domain. A reader who needs a ready implementation to test or extend would get value from it. It deserves a serious referee because it ships a functional system with measurable results on a real problem, even if the validation section needs more rigor. I would recommend sending it for review and asking specifically for expanded details on the test corpus and error analysis.

Referee Report

2 major / 1 minor

Summary. The paper presents CheckSupport, an open-source, locally deployable system that uses instruction-tuned LLMs with a staged prompting strategy to automate reporting checklist recommendation and evidence-grounded item completion for scientific manuscripts. All inference runs on CPU-only hardware. On a corpus of peer-reviewed manuscripts the system reports 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion, with an average wall-clock time of 12.5 seconds per manuscript.

Significance. If the reported accuracies are robust, the work offers a practical, privacy-preserving tool that could reduce manual reporting burden and improve adherence to standardized guidelines across disciplines. The emphasis on constrained, staged inference rather than open-ended generation is a sound design choice for extraction tasks, and the open-source release plus CPU-only operation lowers barriers to adoption.

major comments (2)

[Abstract] Abstract: The headline performance claims (90% recommendation accuracy, 88% item-level completion accuracy) are presented without any information on corpus size, manuscript selection criteria, disciplinary coverage, ground-truth annotation protocol, or inter-annotator agreement. It is also unclear whether accuracy is measured by exact span match, allows partial credit, or incorporates post-hoc human verification of every LLM extraction for hallucination or omission. These details are load-bearing for interpreting the central empirical results.
[Methods] Staged prompting strategy (described in the methods): The claim that the decomposition into constrained tasks produces faithful, evidence-grounded extractions is not accompanied by targeted validation. No error analysis, hallucination audit, or comparison against human-extracted spans is reported to confirm that systematic omissions or fabrications do not inflate the headline accuracies, especially on complex or interdisciplinary papers.
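The distinction raised in the first major comment, exact span match versus partial credit, can be illustrated with a short Python sketch. This is not the paper's scoring code; `normalize`, `exact_match`, and `token_f1` are hypothetical helpers, and token-level F1 is used here only as one common way to award partial credit for overlapping spans.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace so trivial formatting differences are ignored."""
    return text.lower().split()


def exact_match(predicted, gold):
    """Strict scoring: the extracted span must equal the annotated span after normalization."""
    return normalize(predicted) == normalize(gold)


def token_f1(predicted, gold):
    """Partial-credit scoring: harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = normalize(predicted), normalize(gold)
    if not pred_tokens and not gold_tokens:
        return 1.0  # both empty: the item was correctly left blank
    if not pred_tokens or not gold_tokens:
        return 0.0  # an omission or an unsupported extraction
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "patients were randomized"
    pred = "Patients were randomized by computer"
    print(exact_match(pred, gold), token_f1(pred, gold))
```

An extraction that includes extra trailing words fails the strict test but still earns substantial partial credit, which is why the choice of metric changes how an 88% figure should be read.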

minor comments (1)

The average wall-clock time of 12.5 seconds is useful, but reporting variance or a breakdown by manuscript length or checklist complexity would help readers assess practical deployment.
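A minimal Python sketch of the kind of timing summary this comment asks for follows. The `summarize_timings` helper, the `(word_count, seconds)` record layout, and the 5000-word cutoff are all assumptions made for illustration, and the sample numbers are placeholders rather than measurements from the paper.

```python
import statistics


def summarize_timings(records, length_cutoff=5000):
    """Summarize per-manuscript wall-clock times.

    records: non-empty list of (word_count, seconds) pairs.
    length_cutoff: hypothetical boundary between short and long manuscripts.
    """
    seconds = [s for _, s in records]
    short = [s for words, s in records if words < length_cutoff]
    long_ = [s for words, s in records if words >= length_cutoff]
    return {
        "mean": statistics.mean(seconds),
        # Sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(seconds) if len(seconds) > 1 else 0.0,
        "min": min(seconds),
        "max": max(seconds),
        "mean_short": statistics.mean(short) if short else None,
        "mean_long": statistics.mean(long_) if long_ else None,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = [(3000, 8.0), (4000, 10.0), (8000, 16.0), (9000, 18.0)]
    print(summarize_timings(sample))
```

The same breakdown could be keyed on checklist length instead of word count if item count turns out to drive runtime.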

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our empirical results and validation of the approach.

Referee: [Abstract] Abstract: The headline performance claims (90% recommendation accuracy, 88% item-level completion accuracy) are presented without any information on corpus size, manuscript selection criteria, disciplinary coverage, ground-truth annotation protocol, or inter-annotator agreement. It is also unclear whether accuracy is measured by exact span match, allows partial credit, or incorporates post-hoc human verification of every LLM extraction for hallucination or omission. These details are load-bearing for interpreting the central empirical results.

Authors: We agree that the abstract would benefit from additional context to support interpretation of the headline accuracies. In the revised manuscript we will expand the abstract to summarize the evaluation corpus size, manuscript selection criteria, disciplinary coverage, ground-truth annotation protocol, and inter-annotator agreement. We will also clarify that accuracy is assessed via exact span match against human annotations with post-hoc verification for hallucinations and omissions. These details are already present in the Methods and Results sections; the revision will bring a concise version into the abstract. revision: yes
Referee: [Methods] Staged prompting strategy (described in the methods): The claim that the decomposition into constrained tasks produces faithful, evidence-grounded extractions is not accompanied by targeted validation. No error analysis, hallucination audit, or comparison against human-extracted spans is reported to confirm that systematic omissions or fabrications do not inflate the headline accuracies, especially on complex or interdisciplinary papers.

Authors: We acknowledge that a dedicated error analysis focused on the staged prompting would provide additional reassurance. The reported accuracies are computed against human-annotated ground truth, which directly measures omissions and fabrications at the item level. Nevertheless, we will add a targeted error analysis subsection that categorizes failure modes, performs a hallucination audit on a sample of complex and interdisciplinary papers, and reports comparisons against human-extracted spans. This will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tool evaluation with external benchmarks

The paper describes an applied system (CheckSupport) that uses staged prompting with local LLMs to recommend and complete reporting checklists, then reports empirical accuracies (90% overall for recommendations, 88% for item-level completion) measured on a corpus of peer-reviewed manuscripts. No equations, derivations, or parameter-fitting steps are present that could reduce any claimed result to its own inputs by construction. The evaluation relies on external manuscript text rather than self-referential definitions or self-citation chains, satisfying the criteria for a self-contained empirical assessment against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claims rest on the assumption that LLMs can perform reliable constrained extraction on scientific text and that the chosen evaluation corpus is representative of real manuscripts.

axioms (1)

Domain assumption: Large language models can be prompted to perform faithful evidence extraction from manuscript text without introducing unsupported content.
Invoked in the description of the staged prompting strategy and the accuracy claims.

pith-pipeline@v0.9.0 · 5734 in / 1071 out tokens · 79563 ms · 2026-05-20T22:31:41.460435+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Tate, R. L., & Douglas, J. (2011). Use of reporting guidelines in scientific writing: PRISMA, CONSORT, STROBE, STARD and other resources. Brain Impairment, 12(1), 1-21.
[2] Prager, R., Bowdridge, J., Kareemi, H., Wright, C., McGrath, T. A., & McInnes, M. D. (2020). Adherence to the standards for reporting of diagnostic accuracy (STARD) 2015 guidelines in acute point-of-care ultrasound research. JAMA Network Open, 3(5), e203871.
[3] Tripathi, S., Alkhulaifat, D., Doo, F. X., Rajpurkar, P., McBeth, R., Daye, D., & Cook, T. S. (2025). Development, Evaluation, and Assessment of Large Language Models (DEAL) Checklist: A Technical Report. NEJM AI, 2(6), AIp2401106.
[4] McInnes, M. D., Lim, C. S., van der Pol, C. B., Salameh, J. P., McGrath, T. A., & Frank, R. A. (2019, March). Reporting guidelines for imaging research. In Seminars in Nuclear Medicine (Vol. 49, No. 2, pp. 121-135). WB Saunders.
[5] Nawijn, F., Ham, W. H., Houwert, R. M., Groenwold, R. H., Hietbrink, F., & Smeeing, D. P. (2019). Quality of reporting of systematic reviews and meta-analyses in emergency medicine based on the PRISMA statement. BMC Emergency Medicine, 19(1), 19.
[6] Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., ... & Stroup, D. F. (1996). Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA, 276(8), 637-639.
[7] Scherbakov, D., Hubig, N., Jansari, V., Bakumenko, A., & Lenert, L. A. (2025). The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review. Journal of the American Medical Informatics Association, 32(6), 1071-1086.
[8] Marshall, I. J., & Wallace, B. C. (2019). Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8(1), 163.
[9] Chang, Y., Lo, K., Goyal, T., & Iyyer, M. (2023). Booookscore: A systematic exploration of book-length summarization in the era of LLMs. arXiv preprint arXiv:2310.00785.
[10] Azher, I. A., Seethi, V. D. R., Akella, A. P., & Alhoori, H. (2024, December). Limtopic: LLM-based topic modeling and text summarization for analyzing scientific articles limitations. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries (pp. 1-12).
[11] Green, A., Ribas, C. E., Ontiveros-Palacios, N., Griffiths-Jones, S., Petrov, A. I., Bateman, A., & Sweeney, B. (2025). LitSumm: large language models for literature summarization of noncoding RNAs. Database, 2025, baaf006.
[12] Alshami, A., Elsayed, M., Ali, E., Eltoukhy, A. E., & Zayed, T. (2023). Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems, 11(7), 351.
[13] Tripathi, S., Gabriel, K., Dheer, S., Parajuli, A., Augustin, A. I., Elahi, A., ... & Dako, F. (2023). Understanding biases and disparities in radiology AI datasets: a review. Journal of the American College of Radiology, 20(9), 836-841.
