{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:RFZR3QBSBPJSZTMJWL5BRKQE7Z","short_pith_number":"pith:RFZR3QBS","schema_version":"1.0","canonical_sha256":"89731dc0320bd32ccd89b2fa18aa04fe7d12472eb6034ce52229d7b01d876ede","source":{"kind":"arxiv","id":"2310.19852","version":6},"attestation_state":"computed","paper":{"title":"AI Alignment: A Comprehensive Survey","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"AI alignment research can be structured around four principles and split into forward training versus backward assurance.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Aidan O'Gara, Borong Zhang, Boyuan Chen, Brian Tse, Donghai Hong, Fanzhi Zeng, Hantao Lou, Hua Xu, Jiaming Ji, Jiayi Zhou, Jie Fu, Juntao Dai, Kaile Wang, Kwan Yee Ng, Lukas Vierling, Song-Chun Zhu, Stephen McAleer, Tianyi Qiu, Wen Gao, Xuehai Pan, Yaodong Yang, Yawen Duan, Yike Guo, Yizhou Wang, Zhaowei Zhang, Zhonghao He","submitted_at":"2023-10-30T15:52:15Z","abstract_excerpt":"AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward al"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2310.19852","kind":"arxiv","version":6},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2023-10-30T15:52:15Z","cross_cats_sorted":[],"title_canon_sha256":"cdaea1ef56d9f162173f93e128895eccd8b87d13027ab82760ec2de0f01c0492","abstract_canon_sha256":"71a187f0c4ce7bfd2c1937bf3c1f3edbe4d30a000688c961a7f23470d31deaa0"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.827598Z","signature_b64":"OMYZUAXR9jJ7WDfoyKmTEXtE86xTP9rUfCX6EpYc8YkiQpki30QgQpvcJykK96uaVfrmNAvrEAWC7gahwe6UBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"89731dc0320bd32ccd89b2fa18aa04fe7d12472eb6034ce52229d7b01d876ede","last_reissued_at":"2026-05-17T23:38:13.826924Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.826924Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"AI Alignment: A Comprehensive Survey","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"AI alignment research can be structured around four principles and split into forward training versus backward assurance.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Aidan O'Gara, Borong Zhang, Boyuan Chen, Brian Tse, Donghai Hong, Fanzhi Zeng, Hantao Lou, Hua Xu, Jiaming Ji, Jiayi Zhou, Jie Fu, Juntao Dai, Kaile Wang, Kwan Yee Ng, Lukas Vierling, Song-Chun Zhu, Stephen McAleer, Tianyi Qiu, Wen Gao, Xuehai Pan, Yaodong Yang, Yawen Duan, Yike Guo, Yizhou Wang, Zhaowei Zhang, Zhonghao He","submitted_at":"2023-10-30T15:52:15Z","abstract_excerpt":"AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward al"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the four RICE principles comprehensively capture the essential objectives of AI alignment and that the forward/backward decomposition provides a useful, largely non-overlapping categorization of the existing literature.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"The paper surveys AI alignment by proposing the RICE principles and categorizing research into forward training-based alignment and backward assurance and governance approaches.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AI alignment research can be structured around four principles and split into forward training versus backward assurance.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"02de8cc83381a4b480240e1fd7b7f66862f21d1b17603f0bf46b0a92995be6e3"},"source":{"id":"2310.19852","kind":"arxiv","version":6},"verdict":{"id":"99c9dcc1-b5ae-4e55-b3e7-90159c358059","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T14:23:02.346341Z","strongest_claim":"We identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment.","one_line_summary":"The paper surveys AI alignment by proposing the RICE principles and categorizing research into forward training-based alignment and backward assurance and governance approaches.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the four RICE principles comprehensively capture the essential objectives of AI alignment and that the forward/backward decomposition provides a useful, largely non-overlapping categorization of the existing literature.","pith_extraction_headline":"AI alignment research can be structured around four principles and split into forward training versus backward assurance."},"references":{"count":18,"sample":[{"doi":"","year":2024,"title":"Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Improving generalization of alignment with human pr","work_id":"1616c33c-f214-4842-8e80-d260d99dd514","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2016,"title":"Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proceedings of the ieee conference on computer vision and ","work_id":"fad17738-1abc-45af-a2b5-a3d435135851","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Revisiting the Importance of Individual Units in CNNs via Ablation","work_id":"535f8b61-fa75-402c-b5e6-6b2a743c732f","ref_index":3,"cited_arxiv_id":"1806.02891","is_internal_anchor":true},{"doi":"","year":2024,"title":"Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment.Advances in Neural Information Proce","work_id":"42054b08-7fb3-4067-9c17-a1623b8fe55a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. 2022. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence","work_id":"c978e1cc-6a48-4426-a68c-6ddd7ded6dc0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":18,"snapshot_sha256":"31adfe32c9331b1800d1db3a04e9ccee5c405ad9b9c4e08c0d9dbf377004e023","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"e522669f010c013a7832f8cb2f018d3cc38b3dea088dcbd535fc66893c82f3f4"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.19852","created_at":"2026-05-17T23:38:13.827039+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.19852v6","created_at":"2026-05-17T23:38:13.827039+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.19852","created_at":"2026-05-17T23:38:13.827039+00:00"},{"alias_kind":"pith_short_12","alias_value":"RFZR3QBSBPJS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"RFZR3QBSBPJSZTMJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"RFZR3QBS","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2605.23024","citing_title":"The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2404.09005","citing_title":"Proof-of-Learning with Incentive Security","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2503.03480","citing_title":"SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07340","citing_title":"Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13372","citing_title":"MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2505.19241","citing_title":"ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18759","citing_title":"Interoceptive Divergence in Aesthetic Evaluation and Implications for Human-AI Alignment","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21351","citing_title":"The Human-AI Delegation Dilemma: Individual Strategies, Collective Equilibria and Sociotechnical Lock-in","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17342","citing_title":"Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18660","citing_title":"Evaluating Multi-turn Human-AI Interaction","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19940","citing_title":"Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2506.05171","citing_title":"Towards provable probabilistic safety for scalable embodied AI systems","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06816","citing_title":"How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01925","citing_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2402.05070","citing_title":"A Roadmap to Pluralistic Alignment","ref_index":272,"is_internal_anchor":true},{"citing_arxiv_id":"2602.21843","citing_title":"The economic alignment problem of artificial intelligence","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2603.02259","citing_title":"The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15340","citing_title":"Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02574","citing_title":"Understanding the Effects of Safety Unalignment on Large Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2406.00515","citing_title":"A Survey on Large Language Models for Code Generation","ref_index":120,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23218","citing_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27917","citing_title":"A Logic of Inability","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05710","citing_title":"On the Blessing of Pre-training in Weak-to-Strong Generalization","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07535","citing_title":"Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07096","citing_title":"Query-efficient model evaluation using cached responses","ref_index":131,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z","json":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z.json","graph_json":"https://pith.science/api/pith-number/RFZR3QBSBPJSZTMJWL5BRKQE7Z/graph.json","events_json":"https://pith.science/api/pith-number/RFZR3QBSBPJSZTMJWL5BRKQE7Z/events.json","paper":"https://pith.science/paper/RFZR3QBS"},"agent_actions":{"view_html":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z","download_json":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z.json","view_paper":"https://pith.science/paper/RFZR3QBS","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.19852&json=true","fetch_graph":"https://pith.science/api/pith-number/RFZR3QBSBPJSZTMJWL5BRKQE7Z/graph.json","fetch_events":"https://pith.science/api/pith-number/RFZR3QBSBPJSZTMJWL5BRKQE7Z/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z/action/timestamp_anchor","attest_storage":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z/action/storage_attestation","attest_author":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z/action/author_attestation","sign_citation":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z/action/citation_signature","submit_replication":"https://pith.science/pith/RFZR3QBSBPJSZTMJWL5BRKQE7Z/action/replication_record"}},"created_at":"2026-05-17T23:38:13.827039+00:00","updated_at":"2026-05-17T23:38:13.827039+00:00"}