CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity

Binh Vu; Budanur Madappa Darshan Gowda; Gowrika Mahesh; Kavana Gopladevarahalli Papegowda; Mehrdad Jalali; Prajwal Basavaraj; Swati Chandna

arXiv: 2605.16000 · v2 · pith:RGEEDHKDnew · submitted 2026-05-15 · 💻 cs.SI · cs.AI · cs.DL


Gowrika Mahesh, Budanur Madappa Darshan Gowda, Kavana Gopladevarahalli Papegowda, Prajwal Basavaraj, Binh Vu, Swati Chandna, Mehrdad Jalali

Pith reviewed 2026-05-19 18:56 UTC · model grok-4.3

classification: cs.SI, cs.AI, cs.DL

keywords: citation auditing, human-in-the-loop, AI decision support, editorial integrity, bibliographic analysis, relevance scoring, manuscript review


The pith

CitePrism combines AI analysis with human review to help editors audit citation relevance and integrity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CitePrism as a hybrid framework designed to assist with the manual task of checking whether citations in a manuscript are relevant, accurate, and appropriate. It integrates large language model reasoning, semantic similarity measures, and metadata checks, then routes uncertain cases to a human analyst for final judgment. Tested on one pavement engineering paper with 104 references, the system reached moderate agreement with human labels (Cohen's kappa = 0.429) and flagged every citation that humans labeled irrelevant, though it also flagged some relevant ones. This positions CitePrism as a conservative screening aid rather than a replacement for editorial expertise.

Core claim

CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores using LLM contextual reasoning and embedding similarity, generates integrity flags for issues like self-citations, and applies configurable thresholds to triage citations for human review. In the preliminary validation on a pavement engineering manuscript, it reached a Cohen's kappa of 0.429 with human binary relevance labels and at threshold tau = 17 flagged all human-labeled irrelevant citations while producing false positives that require analyst attention.
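The scoring and triage step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The 0.6 and 0.4 weights are the ones quoted from the paper in the passage near the end of this page; everything else is an assumption made for the example: that both component scores share a 0 to 100 scale, that a reference is flagged when its fused score falls below tau, and the names `ScoredReference`, `fused_score`, and `triage`, which are hypothetical.

```python
from dataclasses import dataclass

# Weights quoted from the paper: RS_final = 0.6 * RS_llm + 0.4 * RS_embed
W_LLM = 0.6
W_EMBED = 0.4


@dataclass
class ScoredReference:
    ref_id: str
    rs_llm: float    # LLM relevance score (ASSUMED to be on a 0-100 scale)
    rs_embed: float  # embedding similarity score (ASSUMED to be on a 0-100 scale)


def fused_score(ref: ScoredReference) -> float:
    """Weighted combination of the two relevance signals."""
    return W_LLM * ref.rs_llm + W_EMBED * ref.rs_embed


def triage(refs, tau=17.0):
    """Split references into (flagged, clean) lists of ids.

    ASSUMPTION: a reference is flagged for analyst review when its fused
    score is strictly below tau. The paper's exact rule is not given here.
    """
    flagged, clean = [], []
    for ref in refs:
        (flagged if fused_score(ref) < tau else clean).append(ref.ref_id)
    return flagged, clean


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    refs = [ScoredReference("r1", 10, 20), ScoredReference("r2", 80, 60)]
    print(triage(refs))
```

Flagged references would go to the analyst queue; nothing in this sketch makes a final decision about a citation.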

What carries the argument

The hybrid decision-support framework that fuses LLM-assisted contextual reasoning with embedding-based semantic similarity, metadata verification, and integrity flags, all under human-in-the-loop analyst review.
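One of the integrity flags, the self-citation prompt, could look roughly like the sketch below. The page does not describe how CitePrism matches authors, so the approach here (loose matching on family name plus first initial, after stripping accents) and the function names `author_key` and `self_citation_overlap` are assumptions chosen for illustration.

```python
import unicodedata


def author_key(full_name):
    """Reduce 'Given [Middle] Family' to ('family', 'g') for loose matching.

    ASSUMPTION: names are written given-name first, family name last.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    parts = ascii_name.lower().split()
    if not parts:
        return None
    return (parts[-1], parts[0][0])


def self_citation_overlap(manuscript_authors, reference_authors):
    """Return the set of author keys shared by the manuscript and a reference."""
    ms_keys = {author_key(a) for a in manuscript_authors} - {None}
    ref_keys = {author_key(a) for a in reference_authors} - {None}
    return ms_keys & ref_keys


if __name__ == "__main__":
    # Placeholder names, not taken from any real manuscript.
    overlap = self_citation_overlap(
        ["Ana García", "Li Wei"], ["A. Garcia", "John Smith"]
    )
    print(overlap)
```

A non-empty overlap is only a prompt for the analyst to look at the reference. Matching on family name and initial will produce false matches for common names, which is one reason to keep the human review step.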

If this is right

Manuscript editors could use the system to prioritize which citations need close inspection, reducing the overall manual workload.
Configurable thresholds allow tailoring the balance between catching problematic citations and minimizing unnecessary reviews.
The approach surfaces specific issues, such as poor metadata or unusual self-citation patterns, that bear on a manuscript's bibliographic integrity.
By keeping humans in the decision loop, the framework avoids fully automated judgments on citation quality.
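The threshold trade-off in the second point can be illustrated with a small sweep. This is a sketch under stated assumptions: scores below the threshold are flagged, and the scores and labels are made-up toy values, not data from the paper. The function name `sweep` is hypothetical.

```python
def sweep(scores, irrelevant, thresholds):
    """Report flag count, precision, and recall at each threshold.

    scores:      dict mapping reference id -> fused relevance score
    irrelevant:  set of reference ids a human labeled irrelevant
    ASSUMPTION:  a reference is flagged when its score is below tau.
    """
    rows = []
    for tau in thresholds:
        flagged = {ref for ref, score in scores.items() if score < tau}
        true_pos = len(flagged & irrelevant)
        precision = true_pos / len(flagged) if flagged else 0.0
        recall = true_pos / len(irrelevant) if irrelevant else 0.0
        rows.append((tau, len(flagged), precision, recall))
    return rows


if __name__ == "__main__":
    # Toy values for illustration only.
    toy_scores = {"a": 5, "b": 12, "c": 16, "d": 30, "e": 55, "f": 80}
    toy_irrelevant = {"a", "c"}
    for tau, n_flagged, precision, recall in sweep(toy_scores, toy_irrelevant, [10, 17, 40]):
        print(f"tau={tau}: flagged={n_flagged} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold in this toy example increases recall of irrelevant citations while lowering precision, which is the same trade-off an editor would tune when deciding how much analyst time to spend on false positives.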

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the system to multiple domains could require adjusting the LLM prompts and recalibrating the similarity measures to account for field-specific citation norms.
Over time, widespread use might change how authors prepare citations knowing that automated checks are part of the process.
Combining this with other integrity tools, such as those for checking data availability or conflict of interest statements, could create more comprehensive manuscript screening pipelines.
Further studies could test whether the moderate agreement level improves with better prompt engineering or additional training data from editorial decisions.

Load-bearing premise

The level of agreement and flagging performance seen in the single case-study manuscript will apply to other manuscripts, research domains, and different human annotators.

What would settle it

A study that applies CitePrism to several manuscripts across different scientific fields, collects independent relevance judgments from multiple editors or reviewers for each citation, and measures whether the system's recall of irrelevant citations and overall agreement remain consistent with the original results.

Figures

Figures reproduced from arXiv: 2605.16000 by Binh Vu, Budanur Madappa Darshan Gowda, Gowrika Mahesh, Kavana Gopladevarahalli Papegowda, Mehrdad Jalali, Prajwal Basavaraj, Swati Chandna.

**Figure 1.** Editorial problem context and positioning of CitePrism as optional, human-supervised citation screening before or alongside peer review.

**Figure 2.** CitePrism system architecture: five processing stages, SQLite persistence (documents, processing logs, API cache), and an editorial analyst interface.

**Figure 3.** Hybrid relevance scoring and citation-risk detection workflow from manuscript ingestion through fused scoring, band assignment, integrity flags, and

**Figure 4.** Pilot evaluation for Paper 1 at τ = 17: confusion matrix and summary metrics (κ = 0.429; n = 104).

Table II. Classification metrics for Paper 1 at operating threshold τ = 17 (n = 104 references).

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Flagged (0) | 0.420 | 1.000 | 0.592 | 21 |
| Clean (1) | 1.000 | 0.651 | 0.788 | 83 |
| Macro avg | 0.710 | 0.825 | 0.690 | 104 |
| Weighted avg | 0.883 | 0.721 | 0.749 | 104 |

Accuracy: 0.721

Cohen’s κ = 0.429 (moderate agreem…
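The Table II figures can be cross-checked from a 2×2 confusion matrix. The cell counts used below are not stated on this page; they are reconstructed from the reported supports and recalls (recall 1.000 on 21 irrelevant references gives 21 true positives and 0 misses; recall 0.651 on 83 relevant references gives 54 left clean and 29 flagged). The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the 'flagged' class, plus accuracy and Cohen's kappa.

    tp: flagged by system, irrelevant per human
    fp: flagged by system, relevant per human
    fn: not flagged, irrelevant per human
    tn: not flagged, relevant per human
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    # Chance agreement: product of marginals for each label, summed over both labels.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Counts RECONSTRUCTED from Table II, not reported directly.
    for name, value in binary_metrics(tp=21, fp=29, fn=0, tn=54).items():
        print(f"{name}: {value:.3f}")
```

With these reconstructed counts the function reproduces the reported precision (0.420), recall (1.000), F1 (0.592), accuracy (0.721), and κ (0.429), so the table is internally consistent.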

**Figure 5.** Human-in-the-loop governance model for editorial deployment: automated signals inform analyst review; editors retain accountability and policy

Original abstract

Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CitePrism, a transparent hybrid human-in-the-loop framework for citation auditing that fuses LLM contextual reasoning, embedding-based semantic similarity, metadata verification, self-citation flags, and configurable threshold-based triage. In a preliminary single-manuscript validation on a pavement-engineering paper containing 104 references, the system achieves Cohen's kappa = 0.429 against human binary relevance labels; at operating threshold tau = 17 it flags all human-labeled irrelevant citations while generating false positives that require analyst review. The authors explicitly limit their claim to suggesting possible support for conservative editorial screening and triage, stating that broader validation is required before operational use.

Significance. If the observed conservative flagging behavior and moderate agreement generalize under human oversight, CitePrism could offer a practical, auditable aid for scaling citation-integrity checks in editorial workflows. The explicit framing as pilot-stage decision support rather than an autonomous detector is a responsible design choice that aligns with current best practices for AI in scholarly publishing.

major comments (1)

Validation section: the reported Cohen's kappa = 0.429 is obtained from a single manuscript and a single (implicit) annotator set; without reported inter-annotator agreement, multiple domains, or any baseline comparator (e.g., metadata-only or random flagging), it is difficult to assess whether the observed flagging performance exceeds what simpler heuristics would achieve. This directly affects the strength of even the narrow claim that the system 'may support conservative editorial screening.'
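To make the random-flagging comparator concrete, the expected performance of a flagger that picks references uniformly at random can be worked out from two numbers. The sketch below assumes the system flagged 50 of the 104 references; that count is derived from the reported precision of 0.420 on 21 true positives and is not stated directly. The function name `random_flagging_baseline` is hypothetical, and this is not the comparison the authors report adding.

```python
def random_flagging_baseline(n_refs, n_irrelevant, n_flagged):
    """Expected precision and recall when n_flagged references are chosen at random.

    Under uniform random selection, each flagged reference is irrelevant with
    probability equal to the prevalence, and each irrelevant reference is
    flagged with probability equal to the flag rate.
    """
    prevalence = n_irrelevant / n_refs
    flag_rate = n_flagged / n_refs
    return {"expected_precision": prevalence, "expected_recall": flag_rate}


if __name__ == "__main__":
    # 104 references and 21 irrelevant are reported; 50 flagged is DERIVED.
    baseline = random_flagging_baseline(n_refs=104, n_irrelevant=21, n_flagged=50)
    for name, value in baseline.items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions a random flagger with the same workload would be expected to reach precision of about 0.202 and recall of about 0.481, against the reported 0.420 and 1.000. A single manuscript cannot show whether that gap holds elsewhere, which is the referee's point.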

minor comments (2)

Abstract and §4: the precise definition and computation of the 'fused relevance scores' and the operating threshold tau = 17 are referenced but not fully specified; adding a short algorithmic outline or pseudocode would improve reproducibility.
The manuscript would benefit from a brief discussion of how the system handles citation contexts that are ambiguous or require domain expertise beyond the LLM's training data.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on the validation limitations. We agree that the pilot nature of the study restricts the strength of claims and have revised the manuscript to better contextualize these issues while preserving the narrow, conservative framing of the work.

Point-by-point responses

Referee: Validation section: the reported Cohen's kappa = 0.429 is obtained from a single manuscript and a single (implicit) annotator set; without reported inter-annotator agreement, multiple domains, or any baseline comparator (e.g., metadata-only or random flagging), it is difficult to assess whether the observed flagging performance exceeds what simpler heuristics would achieve. This directly affects the strength of even the narrow claim that the system 'may support conservative editorial screening.'

Authors: We acknowledge that the validation is limited to a single manuscript with relevance labels from one annotator (the corresponding author, acting as domain expert). We have revised the Validation section to explicitly state the lack of inter-annotator agreement and explain that a multi-annotator protocol was outside the scope of this initial pilot due to time and resource constraints. To address the baseline concern, we added a short comparison in the revised text showing that the hybrid system outperforms a metadata-only heuristic and a random baseline in recall for irrelevant citations (while noting higher false positives). We have further softened the language in the abstract and conclusion to emphasize that the results only 'suggest possible support' for triage and do not claim superiority over all heuristics. These changes directly respond to the referee's point without overstating the current evidence.

revision: partial

standing simulated objections not resolved

A full multi-domain validation with multiple independent annotators and exhaustive baseline comparisons would require a new, larger empirical study beyond the scope of the current single-case pilot manuscript.

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes CitePrism as a pilot-stage hybrid human-in-the-loop framework for citation auditing that combines LLM reasoning, semantic similarity, metadata checks, and analyst review. No mathematical derivations, equations, fitted parameters, or predictions are present that reduce to inputs by construction. The single-manuscript validation (kappa=0.429 on 104 references) is reported with explicit limitations and does not claim cross-domain generality or autonomous performance. No self-citation load-bearing for theorems, uniqueness results, or ansatzes occurs; the central claims about potential support for conservative screening remain independent and self-contained against the stated narrow scope.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on assumptions about the effectiveness of LLMs and embeddings in citation contexts, plus a configurable threshold tuned for the pilot.

free parameters (1)

operating threshold tau = 17
Chosen threshold for flagging in the validation case study.

axioms (2)

domain assumption: LLM-assisted contextual reasoning provides reliable assessment of citation relevance
Core to the system's ability to extract and evaluate citation neighborhoods.
domain assumption: Embedding-based semantic similarity accurately reflects citation appropriateness
Used to compute fused relevance scores.

pith-pipeline@v0.9.0 · 5819 in / 1466 out tokens · 65024 ms · 2026-05-19T18:56:14.146681+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

CitePrism computes two complementary signals per reference: Embedding score ($RS_{\mathrm{embed}}$) and LLM score ($RS_{\mathrm{llm}}$); fused relevance score $RS_{\mathrm{final}} = 0.6 \times RS_{\mathrm{llm}} + 0.4 \times RS_{\mathrm{embed}}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.