{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:6UG7QNNZDYOU574GK653U7DRKD","short_pith_number":"pith:6UG7QNNZ","schema_version":"1.0","canonical_sha256":"f50df835b91e1d4eff8657bbba7c7150d4fba63d1d8e0aa443e3eb3899ff1c48","source":{"kind":"arxiv","id":"2302.04166","version":2},"attestation_state":"computed","paper":{"title":"GPTScore: Evaluate as You Desire","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Jinlan Fu, Pengfei Liu, See-kiong Ng, Zhengbao Jiang","submitted_at":"2023-02-08T16:17:29Z","abstract_excerpt":"Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pr"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2302.04166","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-02-08T16:17:29Z","cross_cats_sorted":[],"title_canon_sha256":"fbe6d4804eed2dc343a3d0d23df63ede9cf07092a9013bf8f85857fc3b06ba7f","abstract_canon_sha256":"5a087b62a0a246edf049a858a2bb51dfda02d2af9147f8f2f29c0ca602acb3ec"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.552079Z","signature_b64":"9BgGoEtd6bD0zuhrG+Khx1u/vxlijEdwHbu5H+9fv4OJCBTvbzHa2BBcqBjp3YMjg0VrpMDPqMrYyA43nT2JDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"f50df835b91e1d4eff8657bbba7c7150d4fba63d1d8e0aa443e3eb3899ff1c48","last_reissued_at":"2026-05-17T23:38:13.551525Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.551525Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"GPTScore: Evaluate as You Desire","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Jinlan Fu, Pengfei Liu, See-kiong Ng, Zhengbao Jiang","submitted_at":"2023-02-08T16:17:29Z","abstract_excerpt":"Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pr"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the emergent zero-shot instruction-following abilities of the tested pre-trained models can produce scores that meaningfully reflect the desired evaluation criteria without task-specific fine-tuning or annotated samples.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"bde13bac3bc5ae4bf8564f15f92524aa60810022602e12205bc3b64ce58f54d8"},"source":{"id":"2302.04166","kind":"arxiv","version":2},"verdict":{"id":"a3a32912-9185-4ea9-a20b-b29fb853ecfc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T17:06:06.842651Z","strongest_claim":"Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.","one_line_summary":"GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the emergent zero-shot instruction-following abilities of the tested pre-trained models can produce scores that meaningfully reflect the desired evaluation criteria without task-specific fine-tuning or annotated samples.","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"651a82ecb0c5242a58809d718f6641ff889812b0957a4fb3adc1a49a31950172"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2302.04166","created_at":"2026-05-17T23:38:13.551644+00:00"},{"alias_kind":"arxiv_version","alias_value":"2302.04166v2","created_at":"2026-05-17T23:38:13.551644+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2302.04166","created_at":"2026-05-17T23:38:13.551644+00:00"},{"alias_kind":"pith_short_12","alias_value":"6UG7QNNZDYOU","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"6UG7QNNZDYOU574G","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"6UG7QNNZ","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2311.07911","citing_title":"Instruction-Following Evaluation for Large Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2503.12374","citing_title":"Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2504.20605","citing_title":"TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2404.13076","citing_title":"LLM Evaluators Recognize and Favor Their Own Generations","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22435","citing_title":"Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation","ref_index":147,"is_internal_anchor":true},{"citing_arxiv_id":"2512.24366","citing_title":"On the Factual Consistency of Text-based Explainable Recommendation Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2502.10248","citing_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","ref_index":187,"is_internal_anchor":true},{"citing_arxiv_id":"2509.00891","citing_title":"ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2509.11206","citing_title":"Evalet: Evaluating Large Language Models through Functional Fragmentation","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2510.02837","citing_title":"Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23213","citing_title":"Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.00514","citing_title":"The Illusion of Insight in Reasoning Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2303.08896","citing_title":"SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2603.10477","citing_title":"PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06608","citing_title":"The Prompt Report: A Systematic Survey of Prompt Engineering Techniques","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03724","citing_title":"Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2308.07201","citing_title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":94,"is_internal_anchor":true},{"citing_arxiv_id":"2303.16634","citing_title":"G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27547","citing_title":"Diagnosing Capability Gaps in Fine-Tuning Data","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05579","citing_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2305.14314","citing_title":"QLoRA: Efficient Finetuning of Quantized LLMs","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11419","citing_title":"Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2303.17651","citing_title":"Self-Refine: Iterative Refinement with Self-Feedback","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15302","citing_title":"Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations","ref_index":29,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD","json":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD.json","graph_json":"https://pith.science/api/pith-number/6UG7QNNZDYOU574GK653U7DRKD/graph.json","events_json":"https://pith.science/api/pith-number/6UG7QNNZDYOU574GK653U7DRKD/events.json","paper":"https://pith.science/paper/6UG7QNNZ"},"agent_actions":{"view_html":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD","download_json":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD.json","view_paper":"https://pith.science/paper/6UG7QNNZ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2302.04166&json=true","fetch_graph":"https://pith.science/api/pith-number/6UG7QNNZDYOU574GK653U7DRKD/graph.json","fetch_events":"https://pith.science/api/pith-number/6UG7QNNZDYOU574GK653U7DRKD/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD/action/timestamp_anchor","attest_storage":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD/action/storage_attestation","attest_author":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD/action/author_attestation","sign_citation":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD/action/citation_signature","submit_replication":"https://pith.science/pith/6UG7QNNZDYOU574GK653U7DRKD/action/replication_record"}},"created_at":"2026-05-17T23:38:13.551644+00:00","updated_at":"2026-05-17T23:38:13.551644+00:00"}