{"work":{"id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","openalex_id":null,"doi":null,"arxiv_id":"2212.08073","raw_key":null,"title":"Constitutional AI: Harmlessness from AI Feedback","authors":null,"authors_text":"Bai Y, Kadavath S, Kundu S, et al","year":2022,"venue":"cs.CL","abstract":"As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. 
These methods make it possible to control AI behavior more precisely and with far fewer human labels.","external_url":"https://arxiv.org/abs/2212.08073","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:15:23.805502+00:00","pith_arxiv_id":"2212.08073","created_at":"2026-05-08T18:44:01.574669+00:00","updated_at":"2026-05-25T06:15:23.805502+00:00","title_quality_ok":true,"display_title":"Constitutional AI: Harmlessness from AI Feedback","render_title":"Constitutional AI: Harmlessness from AI Feedback"},"hub":{"state":{"work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":353,"external_cited_by_count":null,"distinct_field_count":24,"first_pith_cited_at":"2023-03-30T16:01:52+00:00","last_pith_cited_at":"2026-05-22T12:59:16+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T13:54:44.145201+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":86},{"context_role":"baseline","n":3},{"context_role":"method","n":3},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":79},{"context_polarity":"unclear","n":5},{"context_polarity":"baseline","n":3},{"context_polarity":"support","n":3},{"context_polarity":"use_method","n":3},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Constitutional AI: Harmlessness from AI Feedback","claims":[{"claim_text":"As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. 
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Constitutional AI: Harmlessness from AI Feedback because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:23:33.736546+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"609b7325-e0af-4e6c-9f23-5200b6ef5d3b","orcid":null,"display_name":"Bai Y"},{"id":"1a45344d-526b-4780-a0c1-1cef0dfcfe4f","orcid":null,"display_name":"Kadavath S"},{"id":"c32a45c9-366e-4ffa-a04d-13a413f90c00","orcid":null,"display_name":"Kundu S"},{"id":"535def4b-a2c7-4c75-ad8b-db3e939f2bc9","orcid":null,"display_name":"et al"}]},"error":null,"updated_at":"2026-05-13T20:23:52.054300+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T20:23:44.490177+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":38},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":34},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":28},{"title":"Training Verifiers to Solve Math Word 
Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":28},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":27},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":24},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":24},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":23},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":22},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":21},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":20},{"title":"Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations","work_id":"93844332-869b-448c-a1be-35466150b1b2","shared_citers":19},{"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","shared_citers":19},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":18},{"title":"Concrete Problems in AI Safety","work_id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","shared_citers":16},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":15},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":15},{"title":"A General Language Assistant as a Laboratory for 
Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","shared_citers":13},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":12},{"title":"Holistic Evaluation of Language Models","work_id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","shared_citers":11},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":11}],"time_series":[{"n":5,"year":2023},{"n":5,"year":2024},{"n":4,"year":2025},{"n":172,"year":2026}]},"error":null,"updated_at":"2026-05-13T20:23:32.301275+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T20:23:39.630486+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Constitutional AI: Harmlessness from AI Feedback","claims":[{"claim_text":"As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. 
In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Constitutional AI: Harmlessness from AI Feedback because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:23:49.488395+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Constitutional AI: Harmlessness from AI Feedback","claims":[{"claim_text":"As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Constitutional AI: Harmlessness from AI Feedback because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T20:23:32.199047+00:00"}},"summary":{"title":"Constitutional AI: Harmlessness from AI Feedback","claims":[{"claim_text":"As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. 
In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Constitutional AI: Harmlessness from AI Feedback because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":38},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":34},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":30},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":28},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":28},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":27},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":24},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":24},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":23},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":22},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":21},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":20},{"title":"Llama Guard: 
LLM-based Input-Output Safeguard for Human-AI Conversations","work_id":"93844332-869b-448c-a1be-35466150b1b2","shared_citers":19},{"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","shared_citers":19},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":18},{"title":"Concrete Problems in AI Safety","work_id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","shared_citers":16},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":15},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":15},{"title":"A General Language Assistant as a Laboratory for Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","shared_citers":13},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":12},{"title":"Holistic Evaluation of Language Models","work_id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","shared_citers":11},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":11}],"time_series":[{"n":5,"year":2023},{"n":5,"year":2024},{"n":4,"year":2025},{"n":172,"year":2026}]},"authors":[{"id":"609b7325-e0af-4e6c-9f23-5200b6ef5d3b","orcid":null,"display_name":"Bai Y","source":"manual","import_confidence":0.72},{"id":"535def4b-a2c7-4c75-ad8b-db3e939f2bc9","orcid":null,"display_name":"et 
al","source":"manual","import_confidence":0.72},{"id":"1a45344d-526b-4780-a0c1-1cef0dfcfe4f","orcid":null,"display_name":"Kadavath S","source":"manual","import_confidence":0.72},{"id":"c32a45c9-366e-4ffa-a04d-13a413f90c00","orcid":null,"display_name":"Kundu S","source":"manual","import_confidence":0.72}]}}