{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:ZGTKTZFW6VAR2IZJADJARJK4TJ","short_pith_number":"pith:ZGTKTZFW","schema_version":"1.0","canonical_sha256":"c9a6a9e4b6f5411d232900d208a55c9a7de412fd7489d4c2e8ab15a9219e1409","source":{"kind":"arxiv","id":"2308.05374","version":2},"attestation_state":"computed","paper":{"title":"Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.","cross_cats":["cs.LG"],"primary_cat":"cs.AI","authors_text":"Hang Li, Hao Cheng, Jean-Francois Ton, Muhammad Faaiz Taufiq, Ruocheng Guo, Xiaoying Zhang, Yang Liu, Yegor Klochkov, Yuanshun Yao","submitted_at":"2023-08-10T06:43:44Z","abstract_excerpt":"Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key d"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2308.05374","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.AI","submitted_at":"2023-08-10T06:43:44Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"e4f29685ef9212d331f35b161dfd4efe86e04c62c4d0faf6cdb9dac9031623f4","abstract_canon_sha256":"f486721c6f283b619343311b946661d598241a74b6d7b31ef1a7c3e8492341d3"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:12.821267Z","signature_b64":"V0lqwxKY0ZcqinddUuWyfIupwATVdmtYNaClm0borAh+iIUcSN1QIL9ewWm7FKZ24qj12ueiYY9Qy09MidOBDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c9a6a9e4b6f5411d232900d208a55c9a7de412fd7489d4c2e8ab15a9219e1409","last_reissued_at":"2026-05-17T23:38:12.820356Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:12.820356Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.","cross_cats":["cs.LG"],"primary_cat":"cs.AI","authors_text":"Hang Li, Hao Cheng, Jean-Francois Ton, Muhammad Faaiz Taufiq, Ruocheng Guo, Xiaoying Zhang, Yang Liu, Yegor Klochkov, Yuanshun Yao","submitted_at":"2023-08-10T06:43:44Z","abstract_excerpt":"Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key d"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the seven categories and 29 sub-categories comprehensively capture trustworthiness and that the selected eight sub-categories plus the chosen measurement methods accurately reflect real-world alignment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b3c0c23793df0f8c6173f470a100a0500014ec208030a0169927937bb1ec13a6"},"source":{"id":"2308.05374","kind":"arxiv","version":2},"verdict":{"id":"16967dac-bb6b-4eca-9c61-8bcb939a565a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T22:26:47.271196Z","strongest_claim":"The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered.","one_line_summary":"Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the seven categories and 29 sub-categories comprehensively capture trustworthiness and that the selected eight sub-categories plus the chosen measurement methods accurately reflect real-world alignment.","pith_extraction_headline":"A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories."},"references":{"count":300,"sample":[{"doi":"","year":2022,"title":"Training language models to follow instructions with human feedback","work_id":"843d640e-e399-40c5-8e8f-789cae25da17","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Alignment of language agents","work_id":"2dc6ed25-0b66-42f5-b67e-eb7e67977011","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"OpenAI. Gpt-4. https://openai.com/research/gpt-4, 2023","work_id":"d644ff37-47a2-4b15-a9ab-83415b8a3a60","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021","work_id":"3ad5196c-6f5c-4854-b3dd-8d67a2979292","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Language models are unsupervised multitask learners","work_id":"5fa609b3-1203-4f0e-a526-0110cb3f8046","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":300,"snapshot_sha256":"e592caa9fdb93f6a85a7aec285226791d31ccf6db0821a84e9d67bab365dc226","internal_anchors":29},"formal_canon":{"evidence_count":2,"snapshot_sha256":"42472f81a16dfaaadc5fa68af5aae9630f67e9d4c39de6f58e42e446375e0c77"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2308.05374","created_at":"2026-05-17T23:38:12.820492+00:00"},{"alias_kind":"arxiv_version","alias_value":"2308.05374v2","created_at":"2026-05-17T23:38:12.820492+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2308.05374","created_at":"2026-05-17T23:38:12.820492+00:00"},{"alias_kind":"pith_short_12","alias_value":"ZGTKTZFW6VAR","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ZGTKTZFW6VAR2IZJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ZGTKTZFW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":13,"sample":[{"citing_arxiv_id":"2401.02458","citing_title":"Data-Centric Foundation Models in Computational Healthcare: A Survey","ref_index":180,"is_internal_anchor":true},{"citing_arxiv_id":"2408.12622","citing_title":"The AI risk repository: A meta-review, database, and taxonomy of risks from artificial intelligence","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2409.10102","citing_title":"Trustworthiness in Retrieval-Augmented Generation Systems: A Survey","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2409.18169","citing_title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey","ref_index":102,"is_internal_anchor":true},{"citing_arxiv_id":"2503.19444","citing_title":"AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2510.07239","citing_title":"Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16776","citing_title":"Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2505.22073","citing_title":"A Closer Look at the Existing Risks of Generative AI: Mapping the Who, What, and How of Real-World Incidents","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06572","citing_title":"Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05561","citing_title":"TrustLLM: Trustworthiness in Large Language Models","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2510.21293","citing_title":"Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2402.13116","citing_title":"A Survey on Knowledge Distillation of Large Language Models","ref_index":124,"is_internal_anchor":true},{"citing_arxiv_id":"2511.10287","citing_title":"OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2409.02977","citing_title":"Large Language Model-Based Agents for Software Engineering: A Survey","ref_index":285,"is_internal_anchor":false},{"citing_arxiv_id":"2410.17196","citing_title":"VoiceBench: Benchmarking LLM-Based Voice Assistants","ref_index":86,"is_internal_anchor":false},{"citing_arxiv_id":"2404.13501","citing_title":"A Survey on the Memory Mechanism of Large Language Model based Agents","ref_index":11,"is_internal_anchor":false},{"citing_arxiv_id":"2605.13875","citing_title":"Common-agency Games for Multi-Objective Test-Time Alignment","ref_index":15,"is_internal_anchor":false},{"citing_arxiv_id":"2605.11920","citing_title":"Domain Restriction via Multi SAE Layer Transitions","ref_index":19,"is_internal_anchor":false},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":200,"is_internal_anchor":false},{"citing_arxiv_id":"2604.27618","citing_title":"Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs","ref_index":44,"is_internal_anchor":false},{"citing_arxiv_id":"2604.27624","citing_title":"Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior","ref_index":26,"is_internal_anchor":false},{"citing_arxiv_id":"2503.13657","citing_title":"Why Do Multi-Agent LLM Systems Fail?","ref_index":35,"is_internal_anchor":false},{"citing_arxiv_id":"2605.10639","citing_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","ref_index":18,"is_internal_anchor":false},{"citing_arxiv_id":"2605.10622","citing_title":"Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination","ref_index":25,"is_internal_anchor":false},{"citing_arxiv_id":"2605.04243","citing_title":"Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA","ref_index":48,"is_internal_anchor":false}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ","json":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ.json","graph_json":"https://pith.science/api/pith-number/ZGTKTZFW6VAR2IZJADJARJK4TJ/graph.json","events_json":"https://pith.science/api/pith-number/ZGTKTZFW6VAR2IZJADJARJK4TJ/events.json","paper":"https://pith.science/paper/ZGTKTZFW"},"agent_actions":{"view_html":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ","download_json":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ.json","view_paper":"https://pith.science/paper/ZGTKTZFW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2308.05374&json=true","fetch_graph":"https://pith.science/api/pith-number/ZGTKTZFW6VAR2IZJADJARJK4TJ/graph.json","fetch_events":"https://pith.science/api/pith-number/ZGTKTZFW6VAR2IZJADJARJK4TJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ/action/storage_attestation","attest_author":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ/action/author_attestation","sign_citation":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ/action/citation_signature","submit_replication":"https://pith.science/pith/ZGTKTZFW6VAR2IZJADJARJK4TJ/action/replication_record"}},"created_at":"2026-05-17T23:38:12.820492+00:00","updated_at":"2026-05-17T23:38:12.820492+00:00"}