sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3
The pith
A locally running open-source tool can verify that a scientific manuscript's citations exist, are unretracted, and support the claims made about them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
sciwrite-lint provides a local verification pipeline that confirms references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers to verify that they support the manuscript's claims, extends the check one level deeper into the cited papers' bibliographies, and assigns each reference a reliability score aggregating all of these signals.
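Concretely, that chain of checks can be sketched as a loop over named verifiers whose outputs become the signals later aggregated into the score. A minimal sketch; the checker names and interfaces below are illustrative assumptions, not sciwrite-lint's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ReferenceReport:
    """Per-reference record aggregating all verification signals."""
    ref_id: str
    signals: dict = field(default_factory=dict)

def verify_reference(ref_id, checkers):
    """Run each named check and collect its signal.

    `checkers` maps a signal name to a callable taking the reference id;
    the real tool would back these with lookups against free public
    databases, retraction data, and a local open-weights model.
    """
    report = ReferenceReport(ref_id)
    for name, check in checkers.items():
        report.signals[name] = check(ref_id)
    return report

# Stubbed checks standing in for the real lookups.
checkers = {
    "exists": lambda r: True,
    "not_retracted": lambda r: True,
    "metadata_match": lambda r: 0.95,
    "supports_claim": lambda r: 0.80,
}
report = verify_reference("doi:10.1000/example", checkers)
```

Keeping each check behind a uniform callable interface is what lets the pipeline run checks independently and still produce one record per reference.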
What carries the argument
The sciwrite-lint verification pipeline, which uses open-weights models running locally to parse downloaded papers and test whether they support specific claims.
If this is right
- Authors can run automated integrity checks on drafts before submission without relying on external services or peer review.
- Fabricated citations that evade current gatekeeping can be detected at the author level through metadata, retraction, and claim-support checks.
- Each reference receives a composite reliability score that combines existence, retraction status, metadata match, and claim verification.
- Verification extends one bibliographic level deeper, revealing inconsistencies in the cited papers' own reference lists.
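One plausible way to realize the composite score in the third bullet is to treat existence and retraction status as hard gates and take a weighted average of the graded signals. The gate-and-weight scheme below is an assumption for illustration, not the paper's published formula:

```python
def reliability_score(signals, weights=None):
    """Aggregate per-reference verification signals into a score in [0, 1].

    Boolean gate signals (existence, retraction status) act as hard
    filters: one failure zeroes the score. Graded signals (metadata
    match, claim support) contribute via a weighted average.
    """
    weights = weights or {"metadata_match": 0.4, "supports_claim": 0.6}
    if not all(signals.get(g, False) for g in ("exists", "not_retracted")):
        return 0.0
    total = sum(weights.values())
    return sum(w * signals.get(k, 0.0) for k, w in weights.items()) / total

score = reliability_score({"exists": True, "not_retracted": True,
                           "metadata_match": 1.0, "supports_claim": 0.5})
```

Treating existence and retraction as gates means a fabricated or retracted reference can never be rescued by good metadata, matching the intuition that those failures are disqualifying.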
Where Pith is reading between the lines
- Embedding the pipeline into common writing environments could turn citation verification into a routine, real-time step during drafting.
- Increasing the recursion depth beyond one level might expose longer chains of weak or fabricated citations.
- The experimental contribution-scoring module, which draws on philosophy-of-science frameworks, could evolve into quantitative tests of argument structure if the integrity component proves reliable.
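The second bullet, raising the recursion depth, is a small generalization of the one-level check: a depth-limited traversal of the citation graph with cycle protection. A sketch, where the fetch and verify interfaces are hypothetical stand-ins for the tool's internals:

```python
def check_bibliography(ref_id, fetch_refs, verify, depth=1, seen=None):
    """Verify a reference, then recurse into its own bibliography.

    `fetch_refs(ref_id)` returns the reference ids cited by `ref_id`;
    `verify(ref_id)` returns that reference's verification result.
    `depth=1` reproduces the one-level-deep check; larger values follow
    longer citation chains. `seen` guards against citation cycles.
    """
    seen = seen if seen is not None else set()
    if ref_id in seen:
        return {}
    seen.add(ref_id)
    results = {ref_id: verify(ref_id)}
    if depth > 0:
        for cited in fetch_refs(ref_id):
            results.update(check_bibliography(cited, fetch_refs, verify,
                                              depth - 1, seen))
    return results

# Toy citation graph: A cites B and C; B cites C.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
results = check_bibliography("A", lambda r: graph[r], lambda r: True, depth=1)
```

The cost grows roughly with the branching factor raised to the depth, which is presumably why the shipped tool stops at one level.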
Load-bearing premise
Automated local parsing of cited papers using open-weights models can reliably determine whether those papers support the specific claims made in the manuscript under review.
What would settle it
A controlled test set of manuscripts seeded with fabricated or unsupported citations: the claim stands if the pipeline's claim-verification step flags them at a high rate, and falls if it does not.
Original abstract
Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI-generated text and the public record is the author's integrity. AI-assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite-lint (pip install sciwrite-lint) is an open-source linter for scientific manuscripts that runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers' own bibliographies. Each reference receives a per-reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM-adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.
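The "compares metadata against canonical records" step in the abstract reduces to fetching the canonical entry for a DOI (for instance from the free Crossref REST API at https://api.crossref.org/works/<doi>) and comparing normalized fields. The offline comparison half might look like this; the field layout and exact-match policy are simplifying assumptions, not the tool's documented behavior:

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for tolerant matching."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def metadata_match(manuscript_entry, canonical_record):
    """Compare a manuscript's reference entry against a canonical record.

    Returns per-field booleans; a real pipeline would fold these into
    the reference's aggregated reliability score.
    """
    return {
        "title": normalize(manuscript_entry["title"])
                 == normalize(canonical_record["title"]),
        "year": manuscript_entry["year"] == canonical_record["year"],
        "first_author": normalize(manuscript_entry["authors"][0])
                        == normalize(canonical_record["authors"][0]),
    }

entry = {"title": "Read Before You Cite!", "year": 2003,
         "authors": ["Simkin, Mikhail V."]}
canonical = {"title": "Read before you cite", "year": 2003,
             "authors": ["Simkin, Mikhail V."]}
result = metadata_match(entry, canonical)
```

Normalization before comparison is what separates genuine mismatches (wrong paper, wrong year) from harmless formatting drift between citation styles.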
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces sciwrite-lint, an open-source local linter for scientific manuscripts. The tool verifies reference existence, retraction status, and metadata consistency against canonical records; downloads and parses cited papers to check claim support using open-weights models; and extends verification one level into the cited papers' bibliographies. Each reference receives an aggregated reliability score. An experimental SciLint Score is also proposed, combining the integrity pipeline with a contribution metric derived from five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo). The evaluation consists of running the pipeline on 30 unseen arXiv and bioRxiv papers after error injection, followed by LLM-adjudicated false-positive analysis.
Significance. If the core verification steps can be shown to be reliable, the tool would provide a practical, privacy-preserving, and accessible method for authors to perform pre-submission integrity checks without transmitting manuscripts to external services. The fully local execution using free public databases, consumer hardware, and open-weights models is a clear strength that supports reproducibility and broad adoption. The open-source release and experimental contribution component also invite community extension. However, the absence of quantitative performance metrics and independent validation limits the immediate significance of the reported results.
Major comments (2)
- [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.
- [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.
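For the first comment, an error-injection protocol becomes measurable once the injector also emits ground-truth labels, since those labels are what precision and recall are computed against. The perturbation types below are illustrative guesses, not the paper's documented protocol:

```python
import random

def inject_errors(references, rate=0.2, rng=None):
    """Corrupt a fraction of references so detection can be measured.

    Each corrupted entry gets one perturbation; the injector returns
    the corrupted list plus ground-truth labels, which is what lets
    precision/recall be computed for the verification pipeline.
    """
    rng = rng or random.Random(0)
    perturbations = [
        lambda r: {**r, "year": r["year"] + rng.choice([-2, -1, 1, 2])},
        lambda r: {**r, "doi": "10.0000/fabricated." + r["doi"].split("/")[-1]},
        lambda r: {**r, "title": r["title"] + " revisited"},
    ]
    corrupted, labels = [], []
    for ref in references:
        if rng.random() < rate:
            corrupted.append(rng.choice(perturbations)(ref))
            labels.append(True)   # injected error
        else:
            corrupted.append(ref)
            labels.append(False)  # left intact
    return corrupted, labels
```

Publishing the perturbation list and the seed would go most of the way toward the methodological transparency the referee asks for.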
Minor comments (2)
- [Abstract] The abstract states that the contribution component is 'released as experimental code' but provides no link, repository details, or usage instructions in the provided text.
- [Pipeline description] The manuscript would benefit from a table summarizing the verification signals aggregated into the per-reference reliability score.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the manuscript. We address each major comment below with clarifications on the evaluation design and planned revisions to improve transparency around methodology and limitations.
Point-by-point responses
-
Referee: [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.
Authors: We agree that quantitative metrics such as precision, recall, or error rates would provide stronger evidence for the pipeline's effectiveness. The reported evaluation is a proof-of-concept demonstration using error injection on 30 unseen papers followed by LLM adjudication for false-positive analysis, rather than a full benchmark study. We will revise the evaluation section to include a more detailed description of the error-injection protocol, the specific types of errors introduced, the adjudication prompts used, and any observed patterns in detected issues. We will also add an explicit statement that this constitutes an initial functional validation rather than a comprehensive performance benchmark. (Revision: partial)
-
Referee: [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.
Authors: We acknowledge this limitation in the current evaluation design. The LLM adjudication serves as a practical, scalable method to surface potential mismatches in the absence of existing human-annotated ground-truth datasets for scientific claim support. We recognize that this approach is better suited to identifying clear discrepancies than to resolving nuanced interpretive cases. In the revised manuscript, we will expand the limitations discussion to emphasize that the per-reference reliability scores function as aggregated heuristic signals to guide author attention, not as definitive accuracy measures. We will also outline directions for future work, including community-driven creation of human-labeled subsets for more rigorous validation. (Revision: partial)
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes a software pipeline and its evaluation on 30 papers using error injection plus LLM adjudication, with no mathematical derivations, equations, fitted parameters, or predictions. The core integrity verification is presented as a direct implementation of public databases and open models; the SciLint Score contribution component is explicitly labeled experimental and not used to support integrity claims. No self-citations are load-bearing for any central result, and no step reduces by construction to its own inputs. This is a self-contained tool description rather than a derivation chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: public databases and open-weights models suffice to verify reference existence, retraction status, metadata, and claim support without external services or human intervention.
Invented entities (1)
- SciLint Score: no independent evidence.
Reference graph
Works this paper leans on
- [1] American Society for Cell Biology. San Francisco Declaration on Research Assessment (DORA). https://sfdora.org, 2012.
- [2] Jawad Ansari. Compound deception in elite peer review. arXiv preprint arXiv:2602.05930, 2026.
- [3] Joeran Beel, Tobias Vente, and Christin Mahlich. Evaluating Sakana's AI Scientist: Bold claims, mixed results. arXiv preprint arXiv:2502.14297, 2025.
- [4] Timothy Bienz, Arianna Pearson, and Sylvie Garcia de Gonzalo. The case of the mysterious citations. arXiv preprint arXiv:2602.05867, 2026.
- [5] Bloomberg News. Samsung bans staff's AI use after spotting ChatGPT data leak, 2023. URL: https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-and-other-generative-ai-use-by-staff-after-leak
- [6] Björn Brembs, Katherine Button, and Marcus Munafò. Deep impact: Unintended consequences of journal rank. Frontiers in Human Neuroscience, 7:291, 2013. doi:10.3389/fnhum.2013.00291.
- [7] Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov. Tortured phrases: A dubious writing technique widespread in published science. Evidence of critical issues affecting established journals. arXiv preprint arXiv:2107.06751, 2021.
- [8] Wanyou Dai, Staša Milojević, and Vincent Larivière. The increasing citation references: A study across disciplines and document types. PLOS ONE, 16(4):e0249878, 2021. doi:10.1371/journal.pone.0249878.
- [9] Mingcui Du, Fengshan Bai, and Yushen Liu. PaperRank: A ranking model for scientific publications. In Proceedings of the 2009 International Conference on Semantics, Knowledge and Grid, 2009. doi:10.1109/csie.2009.479.
- [10] Marc A. Edwards and Siddhartha Roy. Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1):51–61, 2017. doi:10.1089/ees.2016.0223.
- [11] G. Nigel Gilbert. Referencing as persuasion. Social Studies of Science, 7(1):113–122, 1977. doi:10.1177/030631277700700112.
- [12] GPTZero. 100 hallucinations in NeurIPS 2025. https://gptzero.me/news/neurips/, 2026.
- [13] Sebastian Haan. SemanticCite: Citation verification with AI-powered full-text analysis and evidence-based reasoning. arXiv preprint arXiv:2511.16198, 2025.
- [14] Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The Leiden Manifesto for research metrics. Nature, 520(7548):429–431, 2015. doi:10.1038/520429a.
- [15] Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court, 2nd edition, 1993. ISBN 978-0-8126-9235-8.
- [16] Jie Huang et al. Large language models cannot self-correct reasoning yet. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
- [17] Yuxuan Huang et al. POPPER: An agentic framework for automated hypothesis validation via Karl Popper's falsification. arXiv preprint arXiv:2502.09858, 2025. ICML 2025.
- [18] Stephen C. Johnson. Lint, a C program checker. Technical Report 65, Bell Laboratories, 1978. URL: https://wolfram.schneider.org/bsd/7thEdManVol2/lint/lint.pdf
- [19] Ryo Kamoi et al. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12:1417–1440, 2024. doi:10.1162/tacl_a_00713. arXiv:2406.01297.
- [20] David Kernohan, Matthew Hattle, et al. Automated measurement of research practices and reproducibility. Frontiers in Research Metrics and Analytics, 6:751734, 2021. doi:10.3389/frma.2021.751734.
- [21] Miloš Košprdić et al. Scientific claim verification with fine-tuned NLI models. In Proceedings of the 16th International Conference on Knowledge Discovery and Information Retrieval (KDIR), pages 129–136, 2024. doi:10.5220/0012900000003838.
- [22] Bianca Kramer. More open abstracts? Sesame Open Science, November 24, 2024. https://bmkramer.github.io/SesameOpenScience_site/thought/202411_open_abstracts/
- [23] Bruno Latour. Science in Action: How to Follow Scientists and Engineers Through Society. Harvard University Press, 1987. ISBN 978-0-674-79290-6.
- [24] Ian R. McKenzie et al. Inverse scaling: When bigger isn't better. Transactions on Machine Learning Research, 2023.
- [25] Robert K. Merton. The Matthew effect in science. Science, 159(3810):56–63, 1968. doi:10.1126/science.159.3810.56.
- [26] Robert K. Merton. The normative structure of science. In The Sociology of Science, pages 267–278. University of Chicago Press, 1973. ISBN 978-0-226-52091-9. Originally published 1942.
- [27] Michael J. Moravcsik and Poovanalingam Murugesan. Some results on the function and quality of citations. Social Studies of Science, 5(1):86–92, 1975. doi:10.1177/030631277500500106.
- [28] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
- [29] Joshua M. Nicholson et al. scite: A smart citation index. Quantitative Science Studies, 2(3):882–898, 2021. doi:10.1162/qss_a_00146.
- [30] K. S. Novoselov, A. K. Geim, S. V. Morozov, D. Jiang, Y. Zhang, S. V. Dubonos, I. V. Grigorieva, and A. A. Firsov. Electric field effect in atomically thin carbon films. Science, 306(5696):666–669, 2004. doi:10.1126/science.1102896.
- [31] Michèle B. Nuijten et al. Statistical reporting errors in psychology. Behavior Research Methods, 48:1205–1226, 2016. doi:10.3758/s13428-015-0664-2.
- [32] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, 1999. URL: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
- [33] Karl Popper. The Logic of Scientific Discovery. Routledge, 1959. ISBN 978-0-415-27844-7. doi:10.4324/9780203994627.
- [34] Mark Russinovich. RefChecker. https://github.com/markrussinovich/refchecker, 2025.
- [35] Sergey Samsonau. scicode-lint: A linter for machine learning research code. arXiv preprint arXiv:2603.17893, 2025.
- [36] SciScore. SciScore: Automated rigor and transparency scoring. https://sciscore.com, 2018. Deployed in Editorial Manager.
- [37] Seeds of Science. Is a qualitative metric of falsifiability possible? The F-index. Seeds of Science, 2024. doi:10.53975/1y7h-g9wd.
- [38] Mrinank Sharma, Meg Tong, et al. Towards understanding sycophancy in language models. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
- [39] Mikhail V. Simkin and Vwani P. Roychowdhury. Read before you cite! Complex Systems, 14:269–274, 2003. doi:10.25088/complexsystems.14.3.269.
- [40] Henry Small. Cited documents as concept symbols. Social Studies of Science, 8(3):327–340, 1978. doi:10.1177/030631277800800305.
- [41] Liyan Tang et al. MiniCheck: Efficient fact-checking of LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8818–8847, 2024. doi:10.18653/v1/2024.emnlp-main.499.
- [42] Paul Thagard. Explanatory coherence. Behavioral and Brain Sciences, 12(3):435–467, 1989. doi:10.1017/S0140525X00057046.
- [43] David Wadden et al. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, 2020. doi:10.18653/v1/2020.emnlp-main.609.
- [44] Ludo Waltman and Nees Jan van Eck. The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2):406–415, 2012. doi:10.1002/asi.21678.
- [45] Yutao Wang et al. A review on the novelty measurements of academic papers. arXiv preprint arXiv:2501.17456, 2025.
- [46] Jason Wei, Najoung Kim, Yi Tay, and Quoc V. Le. Inverse scaling can become U-shaped. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15580–15591, 2023. doi:10.18653/v1/2023.emnlp-main.963. arXiv:2211.02011.
- [47] Zhe Xu et al. GhostCite: Citation validity in the age of LLMs. arXiv preprint arXiv:2602.06718, 2026.
- [48] Zhengqing Yuan, Kexin Shi, Zhaonan Zhang, Lichao Sun, Nitesh V. Chawla, and Yanfang Ye. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452, 2026.
- [49] Yuheng Zha et al. AlignScore: Evaluating factual consistency with a unified alignment function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 11328–11348, 2023. doi:10.18653/v1/2023.acl-long.634.