{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:3ELOZYL4ZIAQTPWHMW6B6PB5IN","short_pith_number":"pith:3ELOZYL4","schema_version":"1.0","canonical_sha256":"d916ece17cca0109bec765bc1f3c3d4355b4cf38d4c18e83a0f0bc24a7771a7d","source":{"kind":"arxiv","id":"2303.08896","version":3},"attestation_state":"computed","paper":{"title":"SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Adian Liusie, Mark J. F. Gales, Potsawee Manakul","submitted_at":"2023-03-15T19:31:21Z","abstract_excerpt":"Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose \"SelfCheckGPT\", a simple sampling-based approach that can be used to fact-check the "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2303.08896","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-03-15T19:31:21Z","cross_cats_sorted":[],"title_canon_sha256":"5ef7ef8161143d74354342d262c2ce6a2cdbd7bbeb3a33fd76d64210e7f55add","abstract_canon_sha256":"05943fbaa1b7804bec5ee7292f4abfdfd47bb7dc322d85807a25d98066a762a1"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:51.163011Z","signature_b64":"ycuRxxhUGLre2ceu6UmDXroNct55XFR3x+MGtVDZiLqkbznPsd/LIyqMIdud5Wjsy96msRZthMugeP2BJNCvCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d916ece17cca0109bec765bc1f3c3d4355b4cf38d4c18e83a0f0bc24a7771a7d","last_reissued_at":"2026-05-17T23:38:51.162387Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:51.162387Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Adian Liusie, Mark J. F. Gales, Potsawee Manakul","submitted_at":"2023-03-15T19:31:21Z","abstract_excerpt":"Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose \"SelfCheckGPT\", a simple sampling-based approach that can be used to fact-check the "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"SelfCheckGPT can detect non-factual and factual sentences and rank passages in terms of factuality, achieving considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That divergence among stochastically sampled responses reliably signals hallucinated facts rather than other sources of output variation such as stylistic differences or partial knowledge.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"dbac617b47caa7da80dd48bc2147ca7d0ac5b7fd2e3de27486567bd451215b24"},"source":{"id":"2303.08896","kind":"arxiv","version":3},"verdict":{"id":"6ad901ca-6132-43f9-82c2-bb996c3136e5","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T15:06:20.879049Z","strongest_claim":"SelfCheckGPT can detect non-factual and factual sentences and rank passages in terms of factuality, achieving considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.","one_line_summary":"SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That divergence among stochastically sampled responses reliably signals hallucinated facts rather than other sources of output variation such as stylistic differences or partial knowledge.","pith_extraction_headline":"Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency."},"references":{"count":288,"sample":[{"doi":"10.18653/v1/2022.bigscience-1.9","year":2022,"title":"GPT - N eo X -20 B : An open-source autoregressive language model","work_id":"2af0caa7-05e2-4d08-961c-64691965a173","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear","work_id":"50684699-ce18-4086-8bac-7cecd178fad0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1960,"title":"Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37 -- 46","work_id":"2d455e5d-b145-45fa-84d9-5e14d0a95593","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1162/tacl_a_00454","year":2022,"title":"A Survey on Automated Fact-Checking","work_id":"564d8ed0-5089-4f26-b690-83f1ed571e24","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. https://openreview.net/forum?id=sE7-XhLxHA De BERT av3: Improving de BERT a using ELECTRA -style pre-training with gradient-disentangled embedding sh","work_id":"fd00a3fd-c211-44e1-ae23-d5d502e59f7f","ref_index":9,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":288,"snapshot_sha256":"a2c34709888ffac31ce178ab980948608d9bc60f27ce6f96a343326c7846bb3c","internal_anchors":11},"formal_canon":{"evidence_count":1,"snapshot_sha256":"fc284037100753a67bc6f4587e953341ff9f61346fe1a495cf60fde44181e5ce"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2303.08896","created_at":"2026-05-17T23:38:51.162485+00:00"},{"alias_kind":"arxiv_version","alias_value":"2303.08896v3","created_at":"2026-05-17T23:38:51.162485+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2303.08896","created_at":"2026-05-17T23:38:51.162485+00:00"},{"alias_kind":"pith_short_12","alias_value":"3ELOZYL4ZIAQ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"3ELOZYL4ZIAQTPWH","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"3ELOZYL4","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":43,"internal_anchor_count":43,"sample":[{"citing_arxiv_id":"2605.02443","citing_title":"HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14427","citing_title":"Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2503.18562","citing_title":"Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2505.11737","citing_title":"TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21801","citing_title":"Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2511.10292","citing_title":"Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16359","citing_title":"LLM4Log: A Systematic Review of Large Language Model-based Log Analysis","ref_index":110,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20312","citing_title":"Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18776","citing_title":"Mask-to-Correct$^+$: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17487","citing_title":"Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10930","citing_title":"Evaluating the False Trust Engendered by LLM Explanations","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16953","citing_title":"How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2506.19807","citing_title":"KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18473","citing_title":"Principled Detection of Hallucinations in Large Language Models via Multiple Testing","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2509.17314","citing_title":"Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2509.25868","citing_title":"ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2309.11495","citing_title":"Chain-of-Verification Reduces Hallucination in Large Language Models","ref_index":112,"is_internal_anchor":true},{"citing_arxiv_id":"2308.05374","citing_title":"Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2511.09803","citing_title":"Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2309.15217","citing_title":"Ragas: Automated Evaluation of Retrieval Augmented Generation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05922","citing_title":"A Survey of Hallucination in Large Foundation Models","ref_index":136,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10398","citing_title":"LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2309.03883","citing_title":"DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15189","citing_title":"ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16359","citing_title":"LLM4Log: A Systematic Review of Large Language Model-based Log Analysis","ref_index":110,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN","json":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN.json","graph_json":"https://pith.science/api/pith-number/3ELOZYL4ZIAQTPWHMW6B6PB5IN/graph.json","events_json":"https://pith.science/api/pith-number/3ELOZYL4ZIAQTPWHMW6B6PB5IN/events.json","paper":"https://pith.science/paper/3ELOZYL4"},"agent_actions":{"view_html":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN","download_json":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN.json","view_paper":"https://pith.science/paper/3ELOZYL4","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2303.08896&json=true","fetch_graph":"https://pith.science/api/pith-number/3ELOZYL4ZIAQTPWHMW6B6PB5IN/graph.json","fetch_events":"https://pith.science/api/pith-number/3ELOZYL4ZIAQTPWHMW6B6PB5IN/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN/action/storage_attestation","attest_author":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN/action/author_attestation","sign_citation":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN/action/citation_signature","submit_replication":"https://pith.science/pith/3ELOZYL4ZIAQTPWHMW6B6PB5IN/action/replication_record"}},"created_at":"2026-05-17T23:38:51.162485+00:00","updated_at":"2026-05-17T23:38:51.162485+00:00"}