pith. machine review for the scientific record.

arxiv: 2604.08501 · v1 · submitted 2026-04-09 · 💻 cs.DL · cs.CL · cs.SE

Recognition: unknown

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

Sergey V Samsonau

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3

classification 💻 cs.DL · cs.CL · cs.SE
keywords scientific manuscript verification · citation integrity checking · local AI linter · reference support verification · retraction detection · open-weights models · AI-assisted scientific writing · bibliographic analysis

The pith

A locally running open-source tool can verify that a scientific manuscript's citations exist, are unretracted, and support the claims made about them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes sciwrite-lint as a third option for scientific quality assurance beyond journal gatekeeping or pure open science. It is an open-source pipeline that runs entirely on the researcher's machine using public databases, a consumer GPU, and open-weights models, without sending any manuscripts externally. The system checks reference existence and retraction status, aligns metadata with canonical records, downloads and parses cited papers to confirm they back the stated claims, and recurses one level into those papers' own bibliographies. Each reference receives an aggregated reliability score. This approach is evaluated on thirty unseen arXiv and bioRxiv papers that include injected errors, with false-positive analysis performed by an LLM adjudicator.
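As a rough illustration of how the cheaper, API-level stages of such a pipeline could be wired together locally (the choice of Crossref, the field names, and the helper functions below are assumptions for illustration; the paper commits only to free public databases and fully local execution):

    import requests

    CROSSREF = "https://api.crossref.org/works/"

    def fetch_canonical_record(doi: str) -> dict | None:
        """Look up a DOI in a public bibliographic database (Crossref used here as an example)."""
        resp = requests.get(CROSSREF + doi, timeout=30)
        if resp.status_code != 200:
            return None  # no canonical record found under this DOI
        return resp.json()["message"]

    def metadata_matches(ref: dict, record: dict) -> bool:
        """Crude title comparison between the manuscript's entry and the canonical record."""
        canonical = (record.get("title") or [""])[0].lower()
        claimed = ref["title"].lower()
        return claimed in canonical or canonical in claimed

    def check_reference(ref: dict) -> dict:
        """Aggregate the API-level signals for one bibliography entry."""
        record = fetch_canonical_record(ref["doi"])
        return {
            "exists": record is not None,
            "metadata_match": record is not None and metadata_matches(ref, record),
            # Retraction status and claim-support checks (done locally with
            # open-weights models in the paper) would be layered on here.
        }

    if __name__ == "__main__":
        print(check_reference({"doi": "10.1126/science.1102896",
                               "title": "Electric field effect in atomically thin carbon films"}))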

Core claim

sciwrite-lint provides a local verification pipeline that confirms references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers to verify support for the manuscript's claims, extends the check one level deeper into the cited papers' bibliographies, and assigns each reference a reliability score that aggregates all of these signals.

What carries the argument

The sciwrite-lint verification pipeline, which uses open-weights models running locally to parse downloaded papers and test whether they support specific claims.
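A minimal stand-in for that claim-support step treats it as textual entailment with an off-the-shelf open-weights NLI model; the model choice, input framing, and threshold below are assumptions for illustration, not the paper's actual setup:

    from transformers import pipeline

    # Off-the-shelf NLI model standing in for the open-weights LLMs the paper runs locally.
    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    def claim_supported(claim: str, cited_passage: str, threshold: float = 0.5) -> bool:
        """Treat claim support as entailment: does the cited passage entail the manuscript's claim?"""
        result = nli({"text": cited_passage, "text_pair": claim}, top_k=None)
        if result and isinstance(result[0], list):  # some transformers versions nest per-input lists
            result = result[0]
        scores = {r["label"].lower(): r["score"] for r in result}
        return scores.get("entailment", 0.0) >= threshold

    print(claim_supported(
        "Graphene exhibits an electric field effect.",
        "We report the electric field effect in atomically thin carbon films."))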

If this is right

  • Authors can run automated integrity checks on drafts before submission without relying on external services or peer review.
  • Fabricated citations that evade current gatekeeping can be detected at the author level through metadata, retraction, and claim-support checks.
  • Each reference receives a composite reliability score that combines existence, retraction status, metadata match, and claim verification (a toy aggregation is sketched after this list).
  • Verification extends one bibliographic level deeper, revealing inconsistencies in the cited papers' own reference lists.
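The composite score mentioned above could be as simple as a weighted combination of the individual signals. The weights and signal names below are illustrative assumptions; the paper does not publish its aggregation formula.

    from dataclasses import dataclass

    @dataclass
    class ReferenceSignals:
        exists: bool
        not_retracted: bool
        metadata_match: float   # 0..1 similarity against the canonical record
        claim_support: float    # 0..1 from the local claim-verification step

    # Hypothetical weights, chosen only to make the example concrete.
    WEIGHTS = {"exists": 0.3, "not_retracted": 0.3, "metadata_match": 0.2, "claim_support": 0.2}

    def reliability_score(s: ReferenceSignals) -> float:
        """Collapse all verification signals for one reference into a single 0..1 score."""
        return (WEIGHTS["exists"] * float(s.exists)
                + WEIGHTS["not_retracted"] * float(s.not_retracted)
                + WEIGHTS["metadata_match"] * s.metadata_match
                + WEIGHTS["claim_support"] * s.claim_support)

    print(reliability_score(ReferenceSignals(True, True, 0.9, 0.4)))  # 0.86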

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the pipeline into common writing environments could turn citation verification into a routine, real-time step during drafting.
  • Increasing the recursion depth beyond one level might expose longer chains of weak or fabricated citations (a depth-parameterized sketch follows this list).
  • The experimental contribution-scoring module, which draws on philosophy-of-science frameworks, could evolve into quantitative tests of argument structure if the integrity component proves reliable.
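Deeper recursion is conceptually a small change: the one-level limit becomes a parameter in a depth-limited walk over the citation graph. A minimal sketch over a toy in-memory graph (the real tool downloads and parses papers; everything named here is illustrative):

    # Toy citation graph standing in for downloaded and parsed papers.
    TOY_GRAPH = {
        "manuscript": ["A", "B"],
        "A": ["C"],
        "B": [],
        "C": [],
    }

    def verify_recursive(paper_id: str, depth: int = 0, max_depth: int = 2, seen: set | None = None) -> dict:
        """Walk references down to max_depth levels below the manuscript; the paper stops at two."""
        seen = set() if seen is None else seen
        results: dict = {}
        if paper_id in seen or depth >= max_depth:
            return results
        seen.add(paper_id)
        for ref in TOY_GRAPH.get(paper_id, []):
            results[ref] = {"level": depth + 1}  # real existence/retraction/claim checks would go here
            results.update(verify_recursive(ref, depth + 1, max_depth, seen))
        return results

    print(verify_recursive("manuscript", max_depth=3))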

Load-bearing premise

Automated local parsing of cited papers using open-weights models can reliably determine whether those papers support the specific claims made in the manuscript under review.

What would settle it

A controlled test set of manuscripts with known fabricated or unsupported citations: if the pipeline's claim-verification step fails to flag a large fraction of them, the core claim does not hold.
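Scoring such a test reduces to a detection-rate computation over references with known injected problems. The sets below are placeholders, not results from the paper's 30-paper evaluation.

    def detection_metrics(flagged: set, injected: set, all_refs: set) -> dict:
        """Recall and precision over injected errors, plus the false-positive rate on clean references."""
        true_positives = flagged & injected
        clean = all_refs - injected
        return {
            "recall": len(true_positives) / len(injected) if injected else 0.0,
            "precision": len(true_positives) / len(flagged) if flagged else 0.0,
            "false_positive_rate": len(flagged - injected) / len(clean) if clean else 0.0,
        }

    # Placeholder example only.
    print(detection_metrics(flagged={"r1", "r4", "r7"},
                            injected={"r1", "r4", "r9"},
                            all_refs={f"r{i}" for i in range(1, 11)}))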

Figures

Figures reproduced from arXiv: 2604.08501 by Sergey V Samsonau.

Figure 1. Verification architecture. Top: per-paper operations (steps 1–6). Bottom: the pipeline fans out through the citation graph at three levels: the manuscript (all operations), cited papers (downloaded, parsed, consistency-checked), and their references (existence and metadata checked via API). Solid borders: full text available; dashed: API-verified only.
Figure 2. Contribution profiles for 3 of the 20 calibration papers. Left: Nobel Prize discovery (balanced).
Original abstract

Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI-generated text and the public record is the author's integrity. AI-assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite-lint (pip install sciwrite-lint) is an open-source linter for scientific manuscripts that runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers' own bibliographies. Each reference receives a per-reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM-adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces sciwrite-lint, an open-source local linter for scientific manuscripts that verifies reference existence, retraction status, metadata consistency against canonical records, downloads and parses cited papers to check claim support using open-weights models, and extends verification one level into the cited papers' bibliographies. Each reference receives an aggregated reliability score. An experimental SciLint Score is proposed that combines the integrity pipeline with a contribution metric derived from five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo). The evaluation consists of running the pipeline on 30 unseen arXiv and bioRxiv papers after error injection, followed by LLM-adjudicated false-positive analysis.

Significance. If the core verification steps can be shown to be reliable, the tool would provide a practical, privacy-preserving, and accessible method for authors to perform pre-submission integrity checks without transmitting manuscripts to external services. The fully local execution using free public databases, consumer hardware, and open-weights models is a clear strength that supports reproducibility and broad adoption. The open-source release and experimental contribution component also invite community extension. However, the absence of quantitative performance metrics and independent validation limits the immediate significance of the reported results.

major comments (2)
  1. [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.
  2. [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.
minor comments (2)
  1. [Abstract] The abstract states that the contribution component is 'released as experimental code' but provides no link, repository details, or usage instructions in the provided text.
  2. [Pipeline description] The manuscript would benefit from a table summarizing the verification signals aggregated into the per-reference reliability score.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address each major comment below with clarifications on the evaluation design and planned revisions to improve transparency around methodology and limitations.

Point-by-point responses
  1. Referee: [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.

    Authors: We agree that quantitative metrics such as precision, recall, or error rates would provide stronger evidence for the pipeline's effectiveness. The reported evaluation is a proof-of-concept demonstration using error injection on 30 unseen papers followed by LLM adjudication for false-positive analysis, rather than a full benchmark study. We will revise the evaluation section to include a more detailed description of the error-injection protocol, the specific types of errors introduced, the adjudication prompts used, and any observed patterns in detected issues (one illustrative form such a protocol could take is sketched after these responses). We will also add an explicit statement that this constitutes an initial functional validation rather than a comprehensive performance benchmark. revision: partial

  2. Referee: [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.

    Authors: We acknowledge this limitation in the current evaluation design. The LLM adjudication serves as a practical, scalable method to surface potential mismatches in the absence of existing human-annotated ground-truth datasets for scientific claim support. We recognize that this approach is better suited to identifying clear discrepancies than to resolving nuanced interpretive cases. In the revised manuscript, we will expand the limitations discussion to emphasize that the per-reference reliability scores function as aggregated heuristic signals to guide author attention, not as definitive accuracy measures. We will also outline directions for future work, including community-driven creation of human-labeled subsets for more rigorous validation. revision: partial
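For concreteness, the error-injection protocol discussed in the first response could perturb known-good bibliography entries so that detection can later be scored against ground truth. The perturbation types below are illustrative guesses, not the paper's documented protocol.

    import copy
    import random

    def inject_errors(references: list, rate: float = 0.2, seed: int = 0) -> tuple:
        """Corrupt a fraction of references; return the corrupted list and the injected indices."""
        rng = random.Random(seed)
        corrupted = copy.deepcopy(references)
        injected = set()
        for i, ref in enumerate(corrupted):
            if rng.random() >= rate:
                continue
            injected.add(i)
            kind = rng.choice(["wrong_year", "fabricated_doi", "title_swap"])
            if kind == "wrong_year":
                ref["year"] = ref.get("year", 2020) + rng.choice([-3, 3])
            elif kind == "fabricated_doi":
                ref["doi"] = f"10.9999/fake.{rng.randint(1000, 9999)}"
            else:  # pair the entry with a title it does not actually carry
                ref["title"] = "A plausible but nonexistent study title"
        return corrupted, injected

    refs = [{"title": f"Paper {i}", "year": 2020, "doi": f"10.1234/p{i}"} for i in range(5)]
    corrupted, injected = inject_errors(refs, rate=0.4)
    print(sorted(injected))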

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes a software pipeline and its evaluation on 30 papers using error injection plus LLM adjudication, with no mathematical derivations, equations, fitted parameters, or predictions. The core integrity verification is presented as a direct implementation of public databases and open models; the SciLint Score contribution component is explicitly labeled experimental and not used to support integrity claims. No self-citations are load-bearing for any central result, and no step reduces by construction to its own inputs. This is a self-contained tool description rather than a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of external public databases for metadata and retractions plus the capability of open-weights models to parse and compare scientific claims accurately in a fully local setting; the experimental SciLint Score adds an unvalidated layer.

axioms (1)
  • domain assumption Public databases and open-weights models suffice to verify reference existence, retraction status, metadata, and claim support without external services or human intervention.
    Invoked throughout the pipeline description as the basis for local-only operation.
invented entities (1)
  • SciLint Score: no independent evidence
    purpose: Combines integrity verification with a contribution component operationalizing five philosophy-of-science frameworks into computable structural properties.
    Introduced as an experimental extension released for community development, with no validation results provided.

pith-pipeline@v0.9.0 · 5591 in / 1534 out tokens · 67233 ms · 2026-05-10T17:14:24.319457+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 35 canonical work pages · 3 internal anchors

  1. [1]

    San Francisco declaration on research assessment (DORA)

    American Society for Cell Biology. San Francisco declaration on research assessment (DORA). https://sfdora.org, 2012

  2. [2]

    Compound deception in elite peer review

    Jawad Ansari. Compound deception in elite peer review. arXiv preprint arXiv:2602.05930, 2026

  3. [3]

    Evaluating sakana's AI scientist: Bold claims, mixed results

    Joeran Beel, Tobias Vente, and Christin Mahlich. Evaluating sakana's AI scientist: Bold claims, mixed results. arXiv preprint arXiv:2502.14297, 2025

  4. [4]

    The case of the mysterious citations

    Timothy Bienz, Arianna Pearson, and Sylvie Garcia de Gonzalo. The case of the mysterious citations. arXiv preprint arXiv:2602.05867, 2026

  5. [5]

    Samsung bans staff's AI use after spotting ChatGPT data leak, 2023

    Bloomberg News . Samsung bans staff's AI use after spotting ChatGPT data leak, 2023. URL https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-and-other-generative-ai-use-by-staff-after-leak

  6. [6]

    Deep impact: Unintended consequences of journal rank

    Björn Brembs, Katherine Button, and Marcus Munafò. Deep impact: Unintended consequences of journal rank. Frontiers in Human Neuroscience, 7: 291, 2013. doi:10.3389/fnhum.2013.00291

  7. [7]

    Tortured phrases: A dubious writing technique emerging in science

    Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov. Tortured phrases: A dubious writing technique emerging in science. Evidence of critical issues in several published papers. arXiv preprint arXiv:2107.06751, 2021

  8. [8]

    The increasing citation references: A study across disciplines and document types

    Wanyou Dai, Staša Milojević, and Vincent Larivière. The increasing citation references: A study across disciplines and document types. PLOS ONE, 16(4): e0249878, 2021. doi:10.1371/journal.pone.0249878

  9. [9]

    PaperRank : A ranking model for scientific publications

    Mingcui Du, Fengshan Bai, and Yushen Liu. PaperRank : A ranking model for scientific publications. In Proceedings of the 2009 International Conference on Semantics, Knowledge and Grid, 2009. doi:10.1109/csie.2009.479

  10. [10]

    Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition

    Marc A. Edwards and Siddhartha Roy. Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1): 51--61, 2017. doi:10.1089/ees.2016.0223

  11. [11]

    Referencing as persuasion

    G. Nigel Gilbert. Referencing as persuasion. Social Studies of Science, 7(1): 113--122, 1977. doi:10.1177/030631277700700112

  12. [12]

    100 hallucinations in NeurIPS 2025

    GPTZero . 100 hallucinations in NeurIPS 2025. https://gptzero.me/news/neurips/, 2026

  13. [13]

    SemanticCite : Citation verification with AI -powered full-text analysis and evidence-based reasoning

    Sebastian Haan. SemanticCite : Citation verification with AI -powered full-text analysis and evidence-based reasoning. arXiv preprint arXiv:2511.16198, 2025

  14. [14]

    Bibliometrics: The Leiden manifesto for research metrics

    Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The Leiden manifesto for research metrics. Nature, 520(7548): 429--431, 2015. doi:10.1038/520429a

  15. [15]

    Scientific Reasoning: The Bayesian Approach

    Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach . Open Court, 2nd edition, 1993. ISBN 978-0-8126-9235-8

  16. [16]

    Large language models cannot self-correct reasoning yet

    Jie Huang et al. Large language models cannot self-correct reasoning yet. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024

  17. [17]

    POPPER: An agentic framework for automated hypothesis validation via Karl Popper's falsification

    Yuxuan Huang et al. POPPER: An agentic framework for automated hypothesis validation via Karl Popper's falsification. arXiv preprint arXiv:2502.09858, 2025. ICML 2025

  18. [18]

    Stephen C. Johnson. Lint, a C program checker. Technical Report 65, Bell Laboratories, 1978. URL https://wolfram.schneider.org/bsd/7thEdManVol2/lint/lint.pdf

  19. [19]

    When can LLMs actually correct their own mistakes? A survey of self-correction

    Ryo Kamoi et al. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12: 1417--1440, 2024. doi:10.1162/tacl_a_00713. arXiv:2406.01297

  20. [20]

    Automated measurement of research practices and reproducibility

    David Kernohan, Matthew Hattle, et al. Automated measurement of research practices and reproducibility. Frontiers in Research Metrics and Analytics, 6: 751734, 2021. doi:10.3389/frma.2021.751734

  21. [21]

    Scientific claim verification with fine-tuned NLI models

    Miloš Kosprdic et al. Scientific claim verification with fine-tuned NLI models. In Proceedings of the 16th International Conference on Knowledge Discovery and Information Retrieval (KDIR), pages 129--136, 2024. doi:10.5220/0012900000003838

  22. [22]

    More open abstracts?

    Bianca Kramer. More open abstracts? https://bmkramer.github.io/SesameOpenScience_site/thought/202411_open_abstracts/, 2024. Sesame Open Science, November 24, 2024

  23. [23]

    Science in Action: How to Follow Scientists and Engineers Through Society

    Bruno Latour. Science in Action: How to Follow Scientists and Engineers Through Society. Harvard University Press, 1987. ISBN 978-0-674-79290-6

  24. [24]

    Inverse scaling: When bigger isn't better

    Ian R. McKenzie et al. Inverse scaling: When bigger isn't better. Transactions on Machine Learning Research, 2023

  25. [25]

    Robert K. Merton. The Matthew effect in science. Science, 159(3810): 56--63, 1968. doi:10.1126/science.159.3810.56

  26. [26]

    Robert K. Merton. The normative structure of science. In The Sociology of Science, pages 267--278. University of Chicago Press, 1973. ISBN 978-0-226-52091-9. Originally published 1942

  27. [27]

    Some results on the function and quality of citations

    Michael J. Moravcsik and Poovanalingam Murugesan. Some results on the function and quality of citations. Social Studies of Science, 5(1): 86--92, 1975. doi:10.1177/030631277500500106

  28. [28]

    Scalable extraction of training data from (production) language models

    Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023

  29. [29]

    scite: A smart citation index

    Joshua M. Nicholson et al. scite: A smart citation index. Quantitative Science Studies, 2(3): 882--898, 2021. doi:10.1162/qss_a_00146

  30. [30]

    K. S. Novoselov, A. K. Geim, S. V. Morozov, D. Jiang, Y. Zhang, S. V. Dubonos, I. V. Grigorieva, and A. A. Firsov. Electric field effect in atomically thin carbon films. Science, 306(5696): 666--669, 2004. doi:10.1126/science.1102896

  31. [31]

    Statistical reporting errors in psychology

    Michèle B. Nuijten et al. Statistical reporting errors in psychology. Behavior Research Methods, 48: 1205--1226, 2016. doi:10.3758/s13428-015-0664-2

  32. [32]

    The PageRank citation ranking: Bringing order to the web

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, 1999. URL http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

  33. [33]

    The Logic of Scientific Discovery

    Karl Popper. The Logic of Scientific Discovery. Routledge, 1959. ISBN 978-0-415-27844-7. doi:10.4324/9780203994627

  34. [34]

    RefChecker

    Mark Russinovich. RefChecker . https://github.com/markrussinovich/refchecker, 2025

  35. [35]

    scicode-lint: A linter for machine learning research code

    Sergey Samsonau. scicode-lint: A linter for machine learning research code. arXiv preprint arXiv:2603.17893, 2025

  36. [36]

    SciScore : Automated rigor and transparency scoring

    SciScore . SciScore : Automated rigor and transparency scoring. https://sciscore.com, 2018. Deployed in Editorial Manager

  37. [37]

    Is a qualitative metric of falsifiability possible? The F-index

    Seeds of Science . Is a qualitative metric of falsifiability possible? The F-index . Seeds of Science, 2024. doi:10.53975/1y7h-g9wd

  38. [38]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, et al. Towards understanding sycophancy in language models. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024

  39. [39]

    Read before you cite!

    Mikhail V. Simkin and Vwani P. Roychowdhury. Read before you cite! Complex Systems, 14: 269--274, 2003. doi:10.25088/complexsystems.14.3.269

  40. [40]

    Cited documents as concept symbols

    Henry Small. Cited documents as concept symbols. Social Studies of Science, 8(3): 327--340, 1978. doi:10.1177/030631277800800305

  41. [41]

    MiniCheck : Efficient fact-checking of LLMs

    Liyan Tang et al. MiniCheck : Efficient fact-checking of LLMs . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8818--8847, 2024. doi:10.18653/v1/2024.emnlp-main.499

  42. [42]

    Explanatory coherence

    Paul Thagard. Explanatory coherence. Behavioral and Brain Sciences, 12(3): 435--467, 1989. doi:10.1017/S0140525X00057046

  43. [43]

    Fact or fiction: Verifying scientific claims

    David Wadden et al. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534--7550, 2020. doi:10.18653/v1/2020.emnlp-main.609

  44. [44]

    The inconsistency of the h-index

    Ludo Waltman and Nees Jan van Eck. The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2): 406--415, 2012. doi:10.1002/asi.21678

  45. [45]

    A review on the novelty measurements of academic papers

    Yutao Wang et al. A review on the novelty measurements of academic papers. arXiv preprint arXiv:2501.17456, 2025

  46. [46]

    Jason Wei, Najoung Kim, Yi Tay, and Quoc V. Le. Inverse scaling can become U-shaped. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15580--15591, 2023. doi:10.18653/v1/2023.emnlp-main.963. arXiv:2211.02011

  47. [47]

    GhostCite: Citation validity in the age of LLMs

    Zhe Xu et al. GhostCite: Citation validity in the age of LLMs. arXiv preprint arXiv:2602.06718, 2026

  48. [48]

    CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

    Zhengqing Yuan, Kexin Shi, Zhaonan Zhang, Lichao Sun, Nitesh V. Chawla, and Yanfang Ye. CiteAudit : You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452, 2026

  49. [49]

    AlignScore : Evaluating factual consistency with a unified alignment function

    Yuheng Zha et al. AlignScore : Evaluating factual consistency with a unified alignment function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 11328--11348, 2023. doi:10.18653/v1/2023.acl-long.634