Pith Integrity Protocol v1
Public, machine-readable contract for the Pith Scientific Integrity Layer. Every finding emitted by Pith conforms to this protocol. Verifiers can reproduce any finding from public detector code and the cited paper.
1. What this layer is for
The Pith integrity layer is an automated detector pool that runs over every paper in the Pith corpus and emits machine-verified findings. Each finding is keyed by a stable evidence hash, signed with the Pith Ed25519 key, persisted in integrity_findings, and emitted as a pith.integrity.v1 event in the paper's Pith Number Open Graph Bundle.
The layer makes claims only about what mechanical resolution finds. It does not impute intent, attribute fabrications to authors, or rank papers. Phrasing is factual: identifiers either resolve or don't; URLs either return 200 or don't; regex patterns either match or don't.
2. Verdict classes
Every detector commits to one verdict class up front. A class describes the kind of evidence the detector is allowed to publish on. Findings that don't meet the class bar are dropped at the source.
| Class | What it means | Examples |
|---|---|---|
| incontrovertible | The evidence is the verbatim text or a transport-layer fact. False positives require physical impossibility (e.g. doi.org returning 404 for a working DOI). | Regex literal match in body text; HTTP 404 from doi.org; syntactic identifier malformation. |
| cross_source | Two independent authoritative resolvers agree on a finding before it is published. | Crossref-by-DOI and OpenAlex-by-DOI both return a paper whose title disagrees with the cited title. |
| threshold_with_margin | Probabilistic similarity above a calibrated threshold, where the calibration set produced zero false positives. Detector flags only above the calibrated margin. | Citation-to-quotation BM25 + cosine match below threshold (Phase 2). |
| rescinded | A previously published finding has been withdrawn. The original event remains in the bundle; a follow-up event with this class supersedes it. | Operator review identifies a parser artifact missed by detector reconstruction. |
3. Severity
Severity is independent of verdict class and reflects how disruptive the finding is to a reader who tries to follow the cited evidence.
| Severity | Meaning |
|---|---|
| critical | A reader cannot follow the citation or use the artifact as printed. Examples: identifier returns 404 from doi.org and has no Crossref record; code-link URL returns 404; AI meta-comment present in body. |
| advisory | The artifact is fragmented or anomalous but a longer/correct form was visible in the surrounding text. Recoverable identifiers and dead URLs that are not code links fall here. |
| informational | Used by detectors that only surface aggregated context (rare). |
4. Detectors
doi_compliance
For every reference with a DOI or arXiv ID, the detector resolves the identifier through Crossref-by-DOI, OpenAlex-by-DOI, the internal arXiv corpus, and finally a doi.org HEAD request. Findings:
- broken_identifier: DOI as printed is malformed and doi.org HEAD returns 404. Critical.
- recoverable_identifier: parser truncated or whitespace-fragmented DOI. A longer candidate was visible in surrounding text but could not be confirmed as printed. Advisory.
- unresolvable_identifier: identifier syntactically valid but doi.org returns 404 and Crossref/OpenAlex have no record. Critical, cross_source class.
The doi.org HEAD check is mandatory before flagging anything as unresolvable; this prevents false positives on Zenodo/Figshare/DataCite DOIs that aren't indexed by Crossref.
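The decision order described above can be sketched as a pure function over resolver outcomes. This is a minimal illustration under stated assumptions, not the Pith detector itself: the `Lookup` record, its field names, the `classify_doi` function, and the simplified `DOI_SHAPE` pattern are all hypothetical, and the sketch assumes that a recoverable candidate is checked only after the doi.org gate.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified DOI shape for illustration only; the real syntax check is not
# specified in this document. A trailing "." is treated as malformed.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S*[^\s.]$")


@dataclass
class Lookup:
    """Hypothetical record of what each resolver returned for one reference."""
    doi_as_printed: str
    crossref_found: bool
    openalex_found: bool
    doi_org_status: int                     # status of the doi.org HEAD request
    longer_candidate: Optional[str] = None  # fuller DOI seen nearby, if any


def classify_doi(r: Lookup) -> Optional[tuple]:
    """Return (finding_type, severity), or None when nothing is published."""
    if r.crossref_found or r.openalex_found:
        return None
    # Mandatory gate: a DOI that doi.org resolves is never flagged, which
    # protects DataCite-style DOIs that Crossref does not index.
    if r.doi_org_status != 404:
        return None
    if not DOI_SHAPE.match(r.doi_as_printed):
        if r.longer_candidate is not None:
            return ("recoverable_identifier", "advisory")
        return ("broken_identifier", "critical")
    return ("unresolvable_identifier", "critical")


print(classify_doi(Lookup("10.1109/JPROC.", False, False, 404)))
```

Keeping the doi.org status as an explicit input makes the gate easy to audit: no branch that returns a finding can be reached unless that status is 404.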
doi_title_agreement
For every reference where the parser extracted a title AND the reference resolved to a cited_works row through Crossref or OpenAlex, the detector compares the parsed bibliographic title against the resolved work title using normalized Jaccard plus token overlap. Below 0.20 similarity it emits identifier_title_mismatch (critical).
Required preconditions: parsed title at least 18 characters; resolver was Crossref or OpenAlex (not a fuzzy fallback); no flag if both Crossref and OpenAlex were not consulted.
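The comparison and its preconditions can be sketched as follows. The document does not specify how "normalized Jaccard plus token overlap" are combined, so this sketch assumes plain Jaccard similarity over lowercased alphanumeric tokens; the function names and the resolver labels `"crossref"` and `"openalex"` are hypothetical.

```python
import re

MIN_TITLE_CHARS = 18   # precondition stated in the text
MISMATCH_BELOW = 0.20  # similarity threshold stated in the text


def tokens(title: str) -> set:
    """Lowercase the title and keep alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (assumed scoring function)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def title_mismatch(parsed: str, resolved: str, resolver: str) -> bool:
    """True when identifier_title_mismatch would be emitted in this sketch."""
    if len(parsed) < MIN_TITLE_CHARS:
        return False  # parsed title too short to compare reliably
    if resolver not in {"crossref", "openalex"}:
        return False  # fuzzy fallbacks never produce a finding
    return jaccard(parsed, resolved) < MISMATCH_BELOW


print(title_mismatch("Attention Is All You Need",
                     "Attention is all you need.", "crossref"))  # False
```

Case and punctuation differences do not lower the score here, because both titles are reduced to the same token set before comparison.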
ai_meta_artifact
Scans paper body text (bibliography stripped) for verbatim AI-assistant artifacts. Patterns include LLM disclaimers (as an AI language model, training-cutoff phrasing), refusal templates, summary offers, placeholder cite markers ([insert citation here], [TODO: cite]), illustrative-only table notes, and lorem ipsum.
Contextual filters before publishing: matches inside double-quoted spans or TeX verbatim/lstlisting blocks are dropped; papers whose title or abstract is about LLMs (topic-paper heuristic) drop a curated subset of disclaimers (as an AI language model, model self-references) to avoid false positives in survey papers.
The matched span itself is the evidence. Each finding records match, match_offset, regex_id, and 240 surrounding characters as context_excerpt.
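A finding record of this shape could be produced by a scan along the following lines. This is a sketch, not the published detector: the pattern table and its `regex_id` names are hypothetical, the contextual filters (quoted spans, verbatim blocks, topic-paper heuristic) are omitted, and splitting the 240 context characters evenly around the match is an assumption.

```python
import re

# Hypothetical pattern table; the real regex_id values are not listed here.
PATTERNS = {
    "llm_disclaimer": re.compile(r"as an AI language model", re.IGNORECASE),
    "placeholder_cite": re.compile(
        r"\[(?:insert citation here|TODO: cite)\]", re.IGNORECASE
    ),
    "lorem_ipsum": re.compile(r"lorem ipsum", re.IGNORECASE),
}
CONTEXT_CHARS = 240  # total surrounding characters, per the text


def scan(body: str) -> list:
    """Return one record per verbatim match found in the body text."""
    half = CONTEXT_CHARS // 2  # assumption: context split evenly on both sides
    findings = []
    for regex_id, pattern in PATTERNS.items():
        for m in pattern.finditer(body):
            start = max(0, m.start() - half)
            end = min(len(body), m.end() + half)
            findings.append({
                "match": m.group(0),
                "match_offset": m.start(),
                "regex_id": regex_id,
                "context_excerpt": body[start:end],
            })
    return findings


body = "We evaluate on three datasets [insert citation here] and report accuracy."
for finding in scan(body):
    print(finding["regex_id"], finding["match_offset"], finding["match"])
```

Because `match` and `match_offset` are taken from the source text itself, a verifier can re-run the same pattern on the cited paper and compare the result character for character.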
external_links
Extracts every URL from paper body text. Stores them in paper_external_links with a per-host classification (github, gitlab, huggingface, zenodo, etc.). Periodically re-checks each URL via HTTP HEAD with a GET fallback. Findings:
- dead_code_link: github/gitlab/bitbucket URL returns 404. Critical.
- dead_url: any other URL returns 404 or transport-layer failure. Advisory.
HTTP 401 and 403 are not flagged: those indicate permission gating, not absence. Skip-listed hosts (DOI registries, paywalled publishers, social media) are never flagged because false-negative gating is preferred to false-positive accusations.
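The flagging rules above reduce to a small mapping from host and response status to a finding. The sketch below assumes hypothetical names (`classify_link`, `CODE_HOSTS`, `SKIP_HOSTS`) and an illustrative skip list. The text does not say how a transport-layer failure on a code host is treated, so the sketch leaves that case unflagged.

```python
from typing import Optional
from urllib.parse import urlparse

CODE_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}
# Illustrative entries only; the real skip list is not given in this document.
SKIP_HOSTS = {"doi.org", "twitter.com"}


def classify_link(url: str, status: Optional[int]) -> Optional[tuple]:
    """Map a re-check result to (finding_type, severity), or None.

    `status` is the HTTP status code, or None for a transport-layer failure.
    """
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in SKIP_HOSTS:
        return None  # skip-listed hosts are never flagged
    if status in (401, 403):
        return None  # permission gating, not absence
    if host in CODE_HOSTS:
        # Only a 404 is documented for code hosts; other outcomes are
        # left unflagged in this sketch.
        return ("dead_code_link", "critical") if status == 404 else None
    if status == 404 or status is None:
        return ("dead_url", "advisory")
    return None


print(classify_link("https://github.com/example/repo", 404))
```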
5. Event payload
Every finding produces one signed event of type pith.integrity.v1. The full schema lives at /schemas/pith-integrity-event/v1.json.
{
"event_type": "pith.integrity.v1",
"detector": "doi_compliance",
"detector_version": "1.0.0",
"finding_type": "broken_identifier",
"verdict_class": "incontrovertible",
"severity": "critical",
"arxiv_id": "2605.12345",
"paper_version": 1,
"evidence_hash": "<sha256>",
"evidence": { "doi_as_printed": "10.1109/JPROC.", "raw_excerpt": "...", "verdict_class": "incontrovertible" },
"snippet": "...short surrounding context...",
"note": "DOI '10.1109/JPROC.' as printed in the bibliography is syntactically invalid and cannot resolve.",
"ref_index": 17,
"audited_at": "2026-05-19T05:30:00Z",
"detector_url": "https://pith.science/pith-integrity-protocol#doi_compliance"
}
The signature is over the canonical (sorted-key, comma+colon) JSON form of the payload, signed with the Pith Ed25519 key documented at /pith-signing-key.json.
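The canonical form described above (sorted keys, comma and colon separators with no whitespace) can be produced with the Python standard library. This sketch covers only the byte serialization; the UTF-8 encoding and the default ASCII escaping of non-ASCII characters are assumptions, as is the function name `canonical_bytes`. Verification itself would pass these bytes, the signature, and the published public key to an Ed25519 routine, which is not shown because the key file format is not described here.

```python
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize with sorted keys and compact comma/colon separators."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")  # assumption: signature is over UTF-8 bytes


payload = {"severity": "critical", "detector": "doi_compliance", "ref_index": 17}
print(canonical_bytes(payload))
# b'{"detector":"doi_compliance","ref_index":17,"severity":"critical"}'
```

Sorting keys and removing optional whitespace means two parties that parse the same event always arrive at the same bytes, regardless of the key order in which the event was delivered.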
6. Public endpoints
- /findings: live human-readable feed, severity-banded and detector-filterable.
- /pith/<arxiv_id>/integrity.json: per-paper integrity record, includes the most recent signed events for replay verification.
- /pith/<arxiv_id>/bundle.json: full Open Graph Bundle. The graph_snapshot.integrity block is updated on every build.
- /schemas/pith-integrity-event/v1.json: JSON Schema for individual events.
- /schemas/pith-open-graph-bundle/v1.json: JSON Schema for the bundle envelope.
7. Framing rules
Every public-facing string emitted by a detector follows three rules:
- State the fact, not the inference. "DOI returns 404 from doi.org at 2026-05-19T05:30:00Z" is allowed. "Author cited a fake paper" is not.
- Cite the resolver. Every claim about absence names the database that returned the absence. "Crossref returned no record" is allowed. "This paper does not exist" is not.
- Allow recovery. Every detector emits a follow-up event when the underlying fact changes. A revived URL gets a fresh event. A rescinded finding gets a rescinded event that supersedes the original.
8. Versioning and rescission
Every detector carries a detector_version. Bumping the version (semver) signals a behavior change. Findings emitted by older versions remain in the event log; consumers can filter by version when reproducing.
An operator can rescind a finding at POST /admin/findings/<id>/rescind with a reason. The original row stays in the database with status='rescinded'; the public feed hides it; a follow-up bundle event with verdict_class='rescinded' publishes the withdrawal.
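A consumer replaying the event log might filter by detector version and drop superseded findings roughly as follows. This is a sketch under a loud assumption: the document does not say how a rescinding event points at the original, so the code assumes both events share the same `evidence_hash`. The function name `active_findings` is hypothetical.

```python
from typing import Optional


def active_findings(events: list, detector_version: Optional[str] = None) -> list:
    """Return findings that no rescinded event supersedes.

    Assumption: a rescinding event carries the same evidence_hash as the
    finding it withdraws.
    """
    rescinded = {
        e["evidence_hash"] for e in events if e["verdict_class"] == "rescinded"
    }
    return [
        e for e in events
        if e["verdict_class"] != "rescinded"
        and e["evidence_hash"] not in rescinded
        and (detector_version is None or e["detector_version"] == detector_version)
    ]


events = [
    {"evidence_hash": "a", "verdict_class": "incontrovertible", "detector_version": "1.0.0"},
    {"evidence_hash": "b", "verdict_class": "cross_source", "detector_version": "1.1.0"},
    {"evidence_hash": "a", "verdict_class": "rescinded", "detector_version": "1.0.0"},
]
print([e["evidence_hash"] for e in active_findings(events)])  # ['b']
```

The original events stay in the input list untouched, which mirrors the rule that a rescission supersedes a finding without deleting it.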
9. Open lanes
Phase 2 adds two probabilistic lanes (citation-to-quotation validity and 40-token shingle plagiarism) under the threshold_with_margin and incontrovertible classes respectively. Phase 3 covers figure perceptual hashing, statistical anomaly tests, ORCID provenance, and proof-linkage extensions. Public detectors will continue to commit to a verdict class and ship under this protocol.