Pith Number

pith:3ELOZYL4

pith:2023:3ELOZYL4ZIAQTPWHMW6B6PB5IN
not attested · not anchored · not stored · refs resolved

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Adian Liusie, Mark J. F. Gales, Potsawee Manakul

Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.

arxiv:2303.08896 v3 · 2023-03-15 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{3ELOZYL4ZIAQTPWHMW6B6PB5IN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

SelfCheckGPT can detect non-factual and factual sentences and rank passages in terms of factuality, achieving considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.

C2 weakest assumption

That divergence among stochastically sampled responses reliably signals hallucinated facts rather than other sources of output variation such as stylistic differences or partial knowledge.

C3 one line summary

SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
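
The consistency idea in the summary above can be illustrated with a toy sketch. This is not the scoring used in the paper: it swaps in a plain token-overlap proxy, and the function names (`tokenize`, `inconsistency_score`) and the example sentences are made up for illustration. A sentence whose tokens rarely appear in independently sampled responses gets a higher score, which is read here as weaker support.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def inconsistency_score(sentence: str, samples: List[str]) -> float:
    """Mean fraction of the sentence's tokens that are absent from each sample.

    0.0 means every token appears in every sample; 1.0 means none do.
    This is a crude stand-in for a real consistency measure (assumption).
    """
    sent_tokens = tokenize(sentence)
    if not sent_tokens or not samples:
        return 0.0
    missing = [
        len(sent_tokens - tokenize(sample)) / len(sent_tokens)
        for sample in samples
    ]
    return sum(missing) / len(missing)


# Hypothetical stochastic samples drawn from the same prompt.
samples = [
    "Ada Lovelace was born in London in 1815.",
    "Born in London in 1815, Ada Lovelace was a mathematician.",
    "Ada Lovelace was born in 1815 in London.",
]

supported_score = inconsistency_score("Ada Lovelace was born in London.", samples)
unsupported_score = inconsistency_score("Ada Lovelace was born in Paris.", samples)

print(supported_score)    # every token is found in every sample
print(unsupported_score)  # "paris" is missing from all three samples
```

A token-overlap proxy like this also reacts to purely stylistic variation between samples, which is the concern raised in the weakest-assumption entry above.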

References

288 extracted · 288 resolved · 11 Pith anchors

[3] GPT-NeoX-20B: An open-source autoregressive language model 2022 · doi:10.18653/v1/2022.bigscience-1.9
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear 2020
[6] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46 1960
[8] A Survey on Automated Fact-Checking 2022 · doi:10.1162/tacl_a_00454
[9] Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. https://openreview.net/forum?id=sE7-XhLxHA DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sh 2023

Formal links

1 machine-checked theorem link

Cited by

43 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:51.162387Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

d916ece17cca0109bec765bc1f3c3d4355b4cf38d4c18e83a0f0bc24a7771a7d

Aliases

arxiv: 2303.08896 · arxiv_version: 2303.08896v3 · doi: 10.48550/arxiv.2303.08896 · pith_short_12: 3ELOZYL4ZIAQ · pith_short_16: 3ELOZYL4ZIAQTPWH · pith_short_8: 3ELOZYL4
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d916ece17cca0109bec765bc1f3c3d4355b4cf38d4c18e83a0f0bc24a7771a7d
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "05943fbaa1b7804bec5ee7292f4abfdfd47bb7dc322d85807a25d98066a762a1",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-03-15T19:31:21Z",
    "title_canon_sha256": "5ef7ef8161143d74354342d262c2ce6a2cdbd7bbeb3a33fd76d64210e7f55add"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2303.08896",
    "kind": "arxiv",
    "version": 3
  }
}
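
For readers who prefer to check the hash offline, the following is a minimal Python sketch of the same canonicalization the shell command above performs: sorted keys, compact separators, UTF-8 encoding, then SHA-256. It assumes the JSON displayed on this page is identical to the `canonical_record` field served by the API, which is not verified here; the page gives the expected digest as d916ece17cca0109bec765bc1f3c3d4355b4cf38d4c18e83a0f0bc24a7771a7d. The helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record copied from the page above.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "05943fbaa1b7804bec5ee7292f4abfdfd47bb7dc322d85807a25d98066a762a1",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-03-15T19:31:21Z",
    "title_canon_sha256": "5ef7ef8161143d74354342d262c2ce6a2cdbd7bbeb3a33fd76d64210e7f55add"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2303.08896",
    "kind": "arxiv",
    "version": 3
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it with sorted keys and no extra whitespace."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the fields.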