pith. machine review for the scientific record.

arxiv: 2604.08501 · v1 · submitted 2026-04-09 · 💻 cs.DL · cs.CL · cs.SE

Recognition: unknown

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

Sergey V Samsonau

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3

classification 💻 cs.DL · cs.CL · cs.SE
keywords scientific manuscript verification · citation integrity checking · local AI linter · reference support verification · retraction detection · open-weights models · AI-assisted scientific writing · bibliographic analysis

The pith

A locally running open-source tool can verify that a scientific manuscript's citations exist, are unretracted, and support the claims made about them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes sciwrite-lint as a third option for scientific quality assurance beyond journal gatekeeping or pure open science. It is an open-source pipeline that runs entirely on the researcher's machine using public databases, a consumer GPU, and open-weights models, without sending any manuscripts externally. The system checks reference existence and retraction status, aligns metadata with canonical records, downloads and parses cited papers to confirm they back the stated claims, and recurses one level into those papers' own bibliographies. Each reference receives an aggregated reliability score. This approach is evaluated on thirty unseen arXiv and bioRxiv papers that include injected errors, with false-positive analysis performed by an LLM adjudicator.
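As a rough illustration of how the cheaper, API-level stages of such a pipeline could be wired together locally (the choice of Crossref, the field names, and the helper functions below are assumptions for illustration; the paper commits only to free public databases and fully local execution):

    import requests

    CROSSREF = "https://api.crossref.org/works/"

    def fetch_canonical_record(doi: str) -> dict | None:
        """Look up a DOI in a public bibliographic database (Crossref used here as an example)."""
        resp = requests.get(CROSSREF + doi, timeout=30)
        if resp.status_code != 200:
            return None  # no canonical record found under this DOI
        return resp.json()["message"]

    def metadata_matches(ref: dict, record: dict) -> bool:
        """Crude title comparison between the manuscript's entry and the canonical record."""
        canonical = (record.get("title") or [""])[0].lower()
        claimed = ref["title"].lower()
        return claimed in canonical or canonical in claimed

    def check_reference(ref: dict) -> dict:
        """Aggregate the API-level signals for one bibliography entry."""
        record = fetch_canonical_record(ref["doi"])
        return {
            "exists": record is not None,
            "metadata_match": record is not None and metadata_matches(ref, record),
            # Retraction status and claim-support checks (done locally with
            # open-weights models in the paper) would be layered on here.
        }

    if __name__ == "__main__":
        print(check_reference({"doi": "10.1126/science.1102896",
                               "title": "Electric field effect in atomically thin carbon films"}))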

Core claim

sciwrite-lint provides a local verification pipeline that confirms references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers to verify support for the manuscript's claims, extends the check one level deeper into the cited papers' bibliographies, and assigns each reference a reliability score that aggregates all of these signals.

What carries the argument

The sciwrite-lint verification pipeline, which uses open-weights models running locally to parse downloaded papers and test whether they support specific claims.
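A minimal stand-in for that claim-support step treats it as textual entailment with an off-the-shelf open-weights NLI model; the model choice, input framing, and threshold below are assumptions for illustration, not the paper's actual setup:

    from transformers import pipeline

    # Off-the-shelf NLI model standing in for the open-weights LLMs the paper runs locally.
    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    def claim_supported(claim: str, cited_passage: str, threshold: float = 0.5) -> bool:
        """Treat claim support as entailment: does the cited passage entail the manuscript's claim?"""
        result = nli({"text": cited_passage, "text_pair": claim}, top_k=None)
        if result and isinstance(result[0], list):  # some transformers versions nest per-input lists
            result = result[0]
        scores = {r["label"].lower(): r["score"] for r in result}
        return scores.get("entailment", 0.0) >= threshold

    print(claim_supported(
        "Graphene exhibits an electric field effect.",
        "We report the electric field effect in atomically thin carbon films."))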

If this is right

  • Authors can run automated integrity checks on drafts before submission without relying on external services or peer review.
  • Fabricated citations that evade current gatekeeping can be detected at the author level through metadata, retraction, and claim-support checks.
  • Each reference receives a composite reliability score that combines existence, retraction status, metadata match, and claim verification (a toy aggregation is sketched after this list).
  • Verification extends one bibliographic level deeper, revealing inconsistencies in the cited papers' own reference lists.
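The composite score mentioned above could be as simple as a weighted combination of the individual signals. The weights and signal names below are illustrative assumptions; the paper does not publish its aggregation formula.

    from dataclasses import dataclass

    @dataclass
    class ReferenceSignals:
        exists: bool
        not_retracted: bool
        metadata_match: float   # 0..1 similarity against the canonical record
        claim_support: float    # 0..1 from the local claim-verification step

    # Hypothetical weights, chosen only to make the example concrete.
    WEIGHTS = {"exists": 0.3, "not_retracted": 0.3, "metadata_match": 0.2, "claim_support": 0.2}

    def reliability_score(s: ReferenceSignals) -> float:
        """Collapse all verification signals for one reference into a single 0..1 score."""
        return (WEIGHTS["exists"] * float(s.exists)
                + WEIGHTS["not_retracted"] * float(s.not_retracted)
                + WEIGHTS["metadata_match"] * s.metadata_match
                + WEIGHTS["claim_support"] * s.claim_support)

    print(reliability_score(ReferenceSignals(True, True, 0.9, 0.4)))  # 0.86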

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the pipeline into common writing environments could turn citation verification into a routine, real-time step during drafting.
  • Increasing the recursion depth beyond one level might expose longer chains of weak or fabricated citations (a depth-parameterized sketch follows this list).
  • The experimental contribution-scoring module, which draws on philosophy-of-science frameworks, could evolve into quantitative tests of argument structure if the integrity component proves reliable.
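Deeper recursion is conceptually a small change: the one-level limit becomes a parameter in a depth-limited walk over the citation graph. A minimal sketch over a toy in-memory graph (the real tool downloads and parses papers; everything named here is illustrative):

    # Toy citation graph standing in for downloaded and parsed papers.
    TOY_GRAPH = {
        "manuscript": ["A", "B"],
        "A": ["C"],
        "B": [],
        "C": [],
    }

    def verify_recursive(paper_id: str, depth: int = 0, max_depth: int = 2, seen: set | None = None) -> dict:
        """Walk references down to max_depth levels below the manuscript; the paper stops at two."""
        seen = set() if seen is None else seen
        results: dict = {}
        if paper_id in seen or depth >= max_depth:
            return results
        seen.add(paper_id)
        for ref in TOY_GRAPH.get(paper_id, []):
            results[ref] = {"level": depth + 1}  # real existence/retraction/claim checks would go here
            results.update(verify_recursive(ref, depth + 1, max_depth, seen))
        return results

    print(verify_recursive("manuscript", max_depth=3))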

Load-bearing premise

Automated local parsing of cited papers using open-weights models can reliably determine whether those papers support the specific claims made in the manuscript under review.

What would settle it

A controlled test set of manuscripts with known fabricated or unsupported citations: if the pipeline's claim-verification step fails to flag a large fraction of them, the core claim does not hold.
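Scoring such a test reduces to a detection-rate computation over references with known injected problems. The sets below are placeholders, not results from the paper's 30-paper evaluation.

    def detection_metrics(flagged: set, injected: set, all_refs: set) -> dict:
        """Recall and precision over injected errors, plus the false-positive rate on clean references."""
        true_positives = flagged & injected
        clean = all_refs - injected
        return {
            "recall": len(true_positives) / len(injected) if injected else 0.0,
            "precision": len(true_positives) / len(flagged) if flagged else 0.0,
            "false_positive_rate": len(flagged - injected) / len(clean) if clean else 0.0,
        }

    # Placeholder example only.
    print(detection_metrics(flagged={"r1", "r4", "r7"},
                            injected={"r1", "r4", "r9"},
                            all_refs={f"r{i}" for i in range(1, 11)}))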

Figures

Figures reproduced from arXiv: 2604.08501 by Sergey V Samsonau.

Figure 1. Verification architecture. Top: per-paper operations (steps 1–6). Bottom: the pipeline fans out through the citation graph at three levels: the manuscript (all operations), cited papers (downloaded, parsed, consistency-checked), and their references (existence and metadata checked via API). Solid borders: full text available; dashed: API-verified only.
Figure 2. Contribution profiles for 3 of the 20 calibration papers. Left: Nobel Prize discovery (balanced).
Original abstract

Science currently offers two options for quality assurance, both inadequate. Journal gatekeeping claims to verify both integrity and contribution, but actually measures prestige: peer review is slow, biased, and misses fabricated citations even at top venues. Open science provides no quality assurance at all: the only filter between AI-generated text and the public record is the author's integrity. AI-assisted writing makes both worse by producing more papers faster than either system can absorb. We propose a third option: measure the paper itself. sciwrite-lint (pip install sciwrite-lint) is an open-source linter for scientific manuscripts that runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models) with no manuscripts sent to external services. The pipeline verifies that references exist, checks retraction status, compares metadata against canonical records, downloads and parses cited papers, verifies that they support the claims made about them, and follows one level further to check cited papers' own bibliographies. Each reference receives a per-reference reliability score aggregating all verification signals. We evaluate the pipeline on 30 unseen papers from arXiv and bioRxiv with error injection and LLM-adjudicated false positive analysis. As an experimental extension, we propose SciLint Score, combining integrity verification with a contribution component that operationalizes five frameworks from philosophy of science (Popper, Lakatos, Kitcher, Laudan, Mayo) into computable structural properties of scientific arguments. The integrity component is the core of the tool and is evaluated in this paper; the contribution component is released as experimental code for community development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces sciwrite-lint, an open-source local linter for scientific manuscripts that verifies reference existence, retraction status, metadata consistency against canonical records, downloads and parses cited papers to check claim support using open-weights models, and extends verification one level into the cited papers' bibliographies. Each reference receives an aggregated reliability score. An experimental SciLint Score is proposed that combines the integrity pipeline with a contribution metric derived from five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo). The evaluation consists of running the pipeline on 30 unseen arXiv and bioRxiv papers after error injection, followed by LLM-adjudicated false-positive analysis.

Significance. If the core verification steps can be shown to be reliable, the tool would provide a practical, privacy-preserving, and accessible method for authors to perform pre-submission integrity checks without transmitting manuscripts to external services. The fully local execution using free public databases, consumer hardware, and open-weights models is a clear strength that supports reproducibility and broad adoption. The open-source release and experimental contribution component also invite community extension. However, the absence of quantitative performance metrics and independent validation limits the immediate significance of the reported results.

major comments (2)
  1. [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.
  2. [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.
minor comments (2)
  1. [Abstract] The abstract states that the contribution component is 'released as experimental code' but provides no link, repository details, or usage instructions in the provided text.
  2. [Pipeline description] The manuscript would benefit from a table summarizing the verification signals aggregated into the per-reference reliability score.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address each major comment below with clarifications on the evaluation design and planned revisions to improve transparency around methodology and limitations.

Point-by-point responses
  1. Referee: [Evaluation (abstract and main text)] The evaluation on 30 papers (described in the abstract and evaluation section) reports no quantitative results, error rates, precision/recall figures, or detailed methodology for the error-injection protocol and LLM adjudication process. This is load-bearing for the central claim that the pipeline provides effective verification, as it leaves the actual performance of the claim-support step unquantified.

    Authors: We agree that quantitative metrics such as precision, recall, or error rates would provide stronger evidence for the pipeline's effectiveness. The reported evaluation is a proof-of-concept demonstration using error injection on 30 unseen papers followed by LLM adjudication for false-positive analysis, rather than a full benchmark study. We will revise the evaluation section to include a more detailed description of the error-injection protocol, the specific types of errors introduced, the adjudication prompts used, and any observed patterns in detected issues (one illustrative form such a protocol could take is sketched after these responses). We will also add an explicit statement that this constitutes an initial functional validation rather than a comprehensive performance benchmark. revision: partial

  2. Referee: [Pipeline description and evaluation] The claim-support verification component, which downloads cited papers and uses open-weights models to decide whether they support specific manuscript claims, relies on LLM adjudication for false-positive analysis without any human-labeled ground-truth subset. This setup can detect obvious mismatches but does not establish accuracy on nuanced cases of partial support, context omission, or interpretive disagreement, directly affecting the reliability of the per-reference scores.

    Authors: We acknowledge this limitation in the current evaluation design. The LLM adjudication serves as a practical, scalable method to surface potential mismatches in the absence of existing human-annotated ground-truth datasets for scientific claim support. We recognize that this approach is better suited to identifying clear discrepancies than to resolving nuanced interpretive cases. In the revised manuscript, we will expand the limitations discussion to emphasize that the per-reference reliability scores function as aggregated heuristic signals to guide author attention, not as definitive accuracy measures. We will also outline directions for future work, including community-driven creation of human-labeled subsets for more rigorous validation. revision: partial
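For concreteness, the error-injection protocol discussed in the first response could perturb known-good bibliography entries so that detection can later be scored against ground truth. The perturbation types below are illustrative guesses, not the paper's documented protocol.

    import copy
    import random

    def inject_errors(references: list, rate: float = 0.2, seed: int = 0) -> tuple:
        """Corrupt a fraction of references; return the corrupted list and the injected indices."""
        rng = random.Random(seed)
        corrupted = copy.deepcopy(references)
        injected = set()
        for i, ref in enumerate(corrupted):
            if rng.random() >= rate:
                continue
            injected.add(i)
            kind = rng.choice(["wrong_year", "fabricated_doi", "title_swap"])
            if kind == "wrong_year":
                ref["year"] = ref.get("year", 2020) + rng.choice([-3, 3])
            elif kind == "fabricated_doi":
                ref["doi"] = f"10.9999/fake.{rng.randint(1000, 9999)}"
            else:  # pair the entry with a title it does not actually carry
                ref["title"] = "A plausible but nonexistent study title"
        return corrupted, injected

    refs = [{"title": f"Paper {i}", "year": 2020, "doi": f"10.1234/p{i}"} for i in range(5)]
    corrupted, injected = inject_errors(refs, rate=0.4)
    print(sorted(injected))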

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes a software pipeline and its evaluation on 30 papers using error injection plus LLM adjudication, with no mathematical derivations, equations, fitted parameters, or predictions. The core integrity verification is presented as a direct implementation of public databases and open models; the SciLint Score contribution component is explicitly labeled experimental and not used to support integrity claims. No self-citations are load-bearing for any central result, and no step reduces by construction to its own inputs. This is a self-contained tool description rather than a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of external public databases for metadata and retractions plus the capability of open-weights models to parse and compare scientific claims accurately in a fully local setting; the experimental SciLint Score adds an unvalidated layer.

axioms (1)
  • domain assumption Public databases and open-weights models suffice to verify reference existence, retraction status, metadata, and claim support without external services or human intervention.
    Invoked throughout the pipeline description as the basis for local-only operation.
invented entities (1)
  • SciLint Score: no independent evidence
    purpose: Combines integrity verification with a contribution component operationalizing five philosophy-of-science frameworks into computable structural properties.
    Introduced as an experimental extension released for community development, with no validation results provided.

pith-pipeline@v0.9.0 · 5591 in / 1534 out tokens · 67233 ms · 2026-05-10T17:14:24.319457+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 35 canonical work pages · 3 internal anchors

  1. [1]

    San Francisco declaration on research assessment (DORA)

    American Society for Cell Biology. San Francisco declaration on research assessment (DORA). https://sfdora.org, 2012

  2. [2]

    Compound deception in elite peer review

    Jawad Ansari. Compound deception in elite peer review. arXiv preprint arXiv:2602.05930, 2026

  3. [3]

    Evaluating sakana's AI scientist: Bold claims, mixed results

    Joeran Beel, Tobias Vente, and Christin Mahlich. Evaluating sakana's AI scientist: Bold claims, mixed results. arXiv preprint arXiv:2502.14297, 2025

  4. [4]

    The case of the mysterious citations

    Timothy Bienz, Arianna Pearson, and Sylvie Garcia de Gonzalo. The case of the mysterious citations. arXiv preprint arXiv:2602.05867, 2026

  5. [5]

    Samsung bans staff's AI use after spotting ChatGPT data leak, 2023

    Bloomberg News . Samsung bans staff's AI use after spotting ChatGPT data leak, 2023. URL https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-and-other-generative-ai-use-by-staff-after-leak

  6. [6]

    Deep impact: Unintended consequences of journal rank

    Björn Brembs, Katherine Button, and Marcus Munafò. Deep impact: Unintended consequences of journal rank. Frontiers in Human Neuroscience, 7: 291, 2013. doi:10.3389/fnhum.2013.00291

  7. [7]

    Tortured phrases: A dubious writing technique emerging in science

    Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov. Tortured phrases: A dubious writing technique emerging in science. Evidence of critical issues in several published papers. arXiv preprint arXiv:2107.06751, 2021

  8. [8]

    The increasing citation references: A study across disciplines and document types

    Wanyou Dai, Staša Milojević, and Vincent Larivière. The increasing citation references: A study across disciplines and document types. PLOS ONE, 16(4): e0249878, 2021. doi:10.1371/journal.pone.0249878

  9. [9]

    PaperRank : A ranking model for scientific publications

    Mingcui Du, Fengshan Bai, and Yushen Liu. PaperRank : A ranking model for scientific publications. In Proceedings of the 2009 International Conference on Semantics, Knowledge and Grid, 2009. doi:10.1109/csie.2009.479

  10. [10]

    Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition

    Marc A. Edwards and Siddhartha Roy. Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1): 51--61, 2017. doi:10.1089/ees.2016.0223

  11. [11]

    Referencing as persuasion

    G. Nigel Gilbert. Referencing as persuasion. Social Studies of Science, 7(1): 113--122, 1977. doi:10.1177/030631277700700112

  12. [12]

    100 hallucinations in NeurIPS 2025

    GPTZero . 100 hallucinations in NeurIPS 2025. https://gptzero.me/news/neurips/, 2026

  13. [13]

    SemanticCite : Citation verification with AI -powered full-text analysis and evidence-based reasoning

    Sebastian Haan. SemanticCite : Citation verification with AI -powered full-text analysis and evidence-based reasoning. arXiv preprint arXiv:2511.16198, 2025

  14. [14]

    Bibliometrics: The Leiden manifesto for research metrics

    Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The Leiden manifesto for research metrics. Nature, 520(7548): 429--431, 2015. doi:10.1038/520429a

  15. [15]

    Scientific Reasoning: The Bayesian Approach

    Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach . Open Court, 2nd edition, 1993. ISBN 978-0-8126-9235-8

  16. [16]

    Large language models cannot self-correct reasoning yet

    Jie Huang et al. Large language models cannot self-correct reasoning yet. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024

  17. [17]

    POPPER: An agentic framework for automated hypothesis validation via Karl Popper's falsification

    Yuxuan Huang et al. POPPER: An agentic framework for automated hypothesis validation via Karl Popper's falsification. arXiv preprint arXiv:2502.09858, 2025. ICML 2025

  18. [18]

    Stephen C. Johnson. Lint, a C program checker. Technical Report 65, Bell Laboratories, 1978. URL https://wolfram.schneider.org/bsd/7thEdManVol2/lint/lint.pdf

  19. [19]

    When can LLMs actually correct their own mistakes? A survey of self-correction

    Ryo Kamoi et al. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12: 1417--1440, 2024. doi:10.1162/tacl_a_00713. arXiv:2406.01297

  20. [20]

    Automated measurement of research practices and reproducibility

    David Kernohan, Matthew Hattle, et al. Automated measurement of research practices and reproducibility. Frontiers in Research Metrics and Analytics, 6: 751734, 2021. doi:10.3389/frma.2021.751734

  21. [21]

    Scientific claim verification with fine-tuned NLI models

    Miloš Kosprdic et al. Scientific claim verification with fine-tuned NLI models. In Proceedings of the 16th International Conference on Knowledge Discovery and Information Retrieval (KDIR), pages 129--136, 2024. doi:10.5220/0012900000003838

  22. [22]

    More open abstracts?

    Bianca Kramer. More open abstracts? https://bmkramer.github.io/SesameOpenScience_site/thought/202411_open_abstracts/, 2024. Sesame Open Science, November 24, 2024

  23. [23]

    Science in Action: How to Follow Scientists and Engineers Through Society

    Bruno Latour. Science in Action: How to Follow Scientists and Engineers Through Society. Harvard University Press, 1987. ISBN 978-0-674-79290-6

  24. [24]

    Inverse scaling: When bigger isn't better

    Ian R. McKenzie et al. Inverse scaling: When bigger isn't better. Transactions on Machine Learning Research, 2023

  25. [25]

    Robert K. Merton. The Matthew effect in science. Science, 159(3810): 56--63, 1968. doi:10.1126/science.159.3810.56

  26. [26]

    Robert K. Merton. The normative structure of science. In The Sociology of Science, pages 267--278. University of Chicago Press, 1973. ISBN 978-0-226-52091-9. Originally published 1942

  27. [27]

    Some results on the function and quality of citations

    Michael J. Moravcsik and Poovanalingam Murugesan. Some results on the function and quality of citations. Social Studies of Science, 5(1): 86--92, 1975. doi:10.1177/030631277500500106

  28. [28]

    Scalable extraction of training data from (production) language models

    Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023

  29. [29]

    scite: A smart citation index

    Joshua M. Nicholson et al. scite: A smart citation index. Quantitative Science Studies, 2(3): 882--898, 2021. doi:10.1162/qss_a_00146

  30. [30]

    K. S. Novoselov, A. K. Geim, S. V. Morozov, D. Jiang, Y. Zhang, S. V. Dubonos, I. V. Grigorieva, and A. A. Firsov. Electric field effect in atomically thin carbon films. Science, 306(5696): 666--669, 2004. doi:10.1126/science.1102896

  31. [31]

    Statistical reporting errors in psychology

    Michèle B. Nuijten et al. Statistical reporting errors in psychology. Behavior Research Methods, 48: 1205--1226, 2016. doi:10.3758/s13428-015-0664-2

  32. [32]

    The PageRank citation ranking: Bringing order to the web

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, 1999. URL http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

  33. [33]

    The Logic of Scientific Discovery

    Karl Popper. The Logic of Scientific Discovery. Routledge, 1959. ISBN 978-0-415-27844-7. doi:10.4324/9780203994627

  34. [34]

    RefChecker

    Mark Russinovich. RefChecker . https://github.com/markrussinovich/refchecker, 2025

  35. [35]

    scicode-lint: A linter for machine learning research code

    Sergey Samsonau. scicode-lint: A linter for machine learning research code. arXiv preprint arXiv:2603.17893, 2025

  36. [36]

    SciScore : Automated rigor and transparency scoring

    SciScore . SciScore : Automated rigor and transparency scoring. https://sciscore.com, 2018. Deployed in Editorial Manager

  37. [37]

    Is a qualitative metric of falsifiability possible? The F-index

    Seeds of Science . Is a qualitative metric of falsifiability possible? The F-index . Seeds of Science, 2024. doi:10.53975/1y7h-g9wd

  38. [38]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, et al. Towards understanding sycophancy in language models. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024

  39. [39]

    Read before you cite!

    Mikhail V. Simkin and Vwani P. Roychowdhury. Read before you cite! Complex Systems, 14: 269--274, 2003. doi:10.25088/complexsystems.14.3.269

  40. [40]

    Cited documents as concept symbols

    Henry Small. Cited documents as concept symbols. Social Studies of Science, 8(3): 327--340, 1978. doi:10.1177/030631277800800305

  41. [41]

    MiniCheck : Efficient fact-checking of LLMs

    Liyan Tang et al. MiniCheck : Efficient fact-checking of LLMs . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8818--8847, 2024. doi:10.18653/v1/2024.emnlp-main.499

  42. [42]

    Explanatory coherence

    Paul Thagard. Explanatory coherence. Behavioral and Brain Sciences, 12(3): 435--467, 1989. doi:10.1017/S0140525X00057046

  43. [43]

    Fact or fiction: Verifying scientific claims

    David Wadden et al. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534--7550, 2020. doi:10.18653/v1/2020.emnlp-main.609

  44. [44]

    The inconsistency of the h-index

    Ludo Waltman and Nees Jan van Eck. The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2): 406--415, 2012. doi:10.1002/asi.21678

  45. [45]

    A review on the novelty measurements of academic papers

    Yutao Wang et al. A review on the novelty measurements of academic papers. arXiv preprint arXiv:2501.17456, 2025

  46. [46]

    Jason Wei, Najoung Kim, Yi Tay, and Quoc V. Le. Inverse scaling can become U-shaped. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15580--15591, 2023. doi:10.18653/v1/2023.emnlp-main.963. arXiv:2211.02011

  47. [47]

    GhostCite: Citation validity in the age of LLMs

    Zhe Xu et al. GhostCite: Citation validity in the age of LLMs. arXiv preprint arXiv:2602.06718, 2026

  48. [48]

    CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

    Zhengqing Yuan, Kexin Shi, Zhaonan Zhang, Lichao Sun, Nitesh V. Chawla, and Yanfang Ye. CiteAudit : You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452, 2026

  49. [49]

    AlignScore : Evaluating factual consistency with a unified alignment function

    Yuheng Zha et al. AlignScore : Evaluating factual consistency with a unified alignment function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 11328--11348, 2023. doi:10.18653/v1/2023.acl-long.634