{"work":{"id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","openalex_id":null,"doi":null,"arxiv_id":"1606.06565","raw_key":null,"title":"Concrete Problems in AI Safety","authors":null,"authors_text":"Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané","year":2016,"venue":"cs.AI","abstract":"Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (\"avoiding side effects\" and \"avoiding reward hacking\"), an objective function that is too expensive to evaluate frequently (\"scalable supervision\"), or undesirable behavior during the learning process (\"safe exploration\" and \"distributional shift\"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. 
Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.","external_url":"https://arxiv.org/abs/1606.06565","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T19:51:10.643747+00:00","pith_arxiv_id":"1606.06565","created_at":"2026-05-09T23:04:17.703789+00:00","updated_at":"2026-05-25T19:51:10.643747+00:00","title_quality_ok":false,"display_title":"Concrete Problems in AI Safety","render_title":"Concrete Problems in AI Safety"},"hub":{"state":{"work_id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":147,"external_cited_by_count":null,"distinct_field_count":19,"first_pith_cited_at":"2017-02-28T02:19:20+00:00","last_pith_cited_at":"2026-05-22T13:21:05+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-29T22:50:29.028373+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":39},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":36},{"context_polarity":"support","n":2},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Concrete Problems in AI Safety","claims":[{"claim_text":"Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. 
We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (\"avoiding side effects\" and \"avoiding reward hacking\"), a","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Safe multi-agent behavior must be maintained, not merely asserted. References [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization, 2017. URLhttps://arxiv.org/abs/1705.10528. [2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URLhttps://arxiv.org/abs/1606.06565. [3] Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/ model-context-protocol, Nove","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Recent direct distillation methods [128] leverage teacher-model outputs to reduce annotation costs. How- ever, SFT remains susceptible to distribution shift, where adversarial prompts outside the training distribution can bypass the learned safety guardrails [105, 133]. 2.3 Safety Alignment Safety alignment ensures that model behavior conforms to hu- man values and ethical constraints [3, 13]. Traditionally, Reinforce- ment Learning from Human Feedback (RLHF) [104] maximizes a reward signal 𝑟(𝑥,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"GP: Greedy Precision; GR: Greedy Recall; Cel: Celebrity; Land: Landmark; Mat: Material; Pos: Position; Rel: Relationship; Neg: Negation . cise localization abilities of dedicated OD models and the semantic understanding of VLMs. 5.3. Ablations & Extended experiments 5.3.1 Investigation about \"reward hacking\" What is reward hacking? 
Reward hacking [5] in rein- forcement learning refers to a phenomenon where an agent exploits loopholes in the reward function to achieve high reward without truly fu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"State of Exception. University of Chicago Press, Chicago, 2005. Translated by Kevin Attell. [3] Anthony Aguirre. Keep the future human: ASI without humanity's consent is theft. arXiv:2311.09452, 2025. [4] Sam Altman, Greg Brockman, and Ilya Sutskever. Governance of superintelligence. OpenAI Blog, 2023. URLhttps://openai.com/index/governance-of-superintelligence/. [5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:16","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"literature, this systemic vulnerability is also discussed under various synonymous or closely related concepts, including reward gaming[ 9],reward overoptimization[ 12],specification gaming[ 13],goal misgeneralization[ 14], andreward tampering[ 15]. At its core, reward hacking occurs when a model produces behavioral trajectories that mathematically maximize the proxy reward while actively degrading or bypassing the intended objective [10, 16]. While reward hacking has long been recognized as a t","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"such as privacy or robustness might be placed on information transmitted between local machines; and memory/storage constraints might be placed on local machines or a central server. 
Recent work:We first provide an overview of several theoretical questions studied in recent work, with an emphasis on high-dimensional settings; we refer the reader to excellent survey papers [GLW `22, CCX14, 17 <latexit sha1_base64=\"pdJLtZ7ichNC1J1ruhuzZGzmtE8=\">AAAB7HicbVBNS8NAFHypX7V+VT16WSyCp5KIVI9FLx4rmLbQhrLZb","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Concrete Problems in AI Safety because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (20 contexts).","role_counts":[{"n":20,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-17T15:09:55.417684+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"77c3c2d7-9b00-4b88-b1b8-5cdd4b2ea747","orcid":null,"display_name":"Dario Amodei"},{"id":"c51bdf66-c375-4dec-86c0-647ca4f89b02","orcid":null,"display_name":"Chris Olah"},{"id":"9abdb3e2-caf2-47ea-9ab5-d10786d948e0","orcid":null,"display_name":"Jacob Steinhardt"},{"id":"b5229722-2551-40ed-8eca-d2ef7b4fd5df","orcid":null,"display_name":"Paul Christiano"},{"id":"298fbbc8-0497-4319-a5d8-d4bcef3f7f3d","orcid":null,"display_name":"John Schulman"},{"id":"2bf2bc1d-a1b5-492a-b4eb-d4858cf531c8","orcid":null,"display_name":"Dan Mané"}]},"error":null,"updated_at":"2026-05-17T15:09:56.368330+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T08:07:48.063324+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":16},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":11},{"title":"DeepSeek-R1: 
Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":9},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"CoRR , volume =","work_id":"871c0bb7-e08b-4d8b-be76-610707c748dd","shared_citers":6},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":6},{"title":"AI safety via debate","work_id":"13c1ec37-af93-438a-bdf0-f2eafaee5635","shared_citers":5},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"Scalable agent alignment via reward modeling: a research direction","work_id":"1e9c4f6d-b369-4bd2-8e6e-1ec00318c924","shared_citers":5},{"title":"Towards Understanding Sycophancy in Language Models","work_id":"aeefec9a-6ad5-4743-92b9-de6983895e21","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"Universal and Transferable 
Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":5},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"Alignment faking in large language models","work_id":"cc253a89-cda1-4889-9631-bf3ce8147650","shared_citers":4},{"title":"arXiv preprint arXiv:2310.19852 , year=","work_id":"bc8d43a6-a842-4003-b1b6-c424256a8151","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":4},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":4}],"time_series":[{"n":1,"year":2017},{"n":1,"year":2018},{"n":1,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":2,"year":2025},{"n":67,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T08:18:09.128783+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T08:07:54.597072+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Concrete Problems in AI Safety","claims":[{"claim_text":"Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. 
In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (\"avoiding side effects\" and \"avoiding reward hacking\"), a","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Safe multi-agent behavior must be maintained, not merely asserted. References [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization, 2017. URLhttps://arxiv.org/abs/1705.10528. [2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URLhttps://arxiv.org/abs/1606.06565. [3] Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/ model-context-protocol, Nove","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Recent direct distillation methods [128] leverage teacher-model outputs to reduce annotation costs. How- ever, SFT remains susceptible to distribution shift, where adversarial prompts outside the training distribution can bypass the learned safety guardrails [105, 133]. 2.3 Safety Alignment Safety alignment ensures that model behavior conforms to hu- man values and ethical constraints [3, 13]. Traditionally, Reinforce- ment Learning from Human Feedback (RLHF) [104] maximizes a reward signal 𝑟(𝑥,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"GP: Greedy Precision; GR: Greedy Recall; Cel: Celebrity; Land: Landmark; Mat: Material; Pos: Position; Rel: Relationship; Neg: Negation . cise localization abilities of dedicated OD models and the semantic understanding of VLMs. 5.3. 
Ablations & Extended experiments 5.3.1 Investigation about \"reward hacking\" What is reward hacking? Reward hacking [5] in rein- forcement learning refers to a phenomenon where an agent exploits loopholes in the reward function to achieve high reward without truly fu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"State of Exception. University of Chicago Press, Chicago, 2005. Translated by Kevin Attell. [3] Anthony Aguirre. Keep the future human: ASI without humanity's consent is theft. arXiv:2311.09452, 2025. [4] Sam Altman, Greg Brockman, and Ilya Sutskever. Governance of superintelligence. OpenAI Blog, 2023. URLhttps://openai.com/index/governance-of-superintelligence/. [5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:16","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"literature, this systemic vulnerability is also discussed under various synonymous or closely related concepts, including reward gaming[ 9],reward overoptimization[ 12],specification gaming[ 13],goal misgeneralization[ 14], andreward tampering[ 15]. At its core, reward hacking occurs when a model produces behavioral trajectories that mathematically maximize the proxy reward while actively degrading or bypassing the intended objective [10, 16]. While reward hacking has long been recognized as a t","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"such as privacy or robustness might be placed on information transmitted between local machines; and memory/storage constraints might be placed on local machines or a central server. 
Recent work:We first provide an overview of several theoretical questions studied in recent work, with an emphasis on high-dimensional settings; we refer the reader to excellent survey papers [GLW `22, CCX14, 17 <latexit sha1_base64=\"pdJLtZ7ichNC1J1ruhuzZGzmtE8=\">AAAB7HicbVBNS8NAFHypX7V+VT16WSyCp5KIVI9FLx4rmLbQhrLZb","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Concrete Problems in AI Safety because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (20 contexts).","role_counts":[{"n":20,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-17T15:09:56.371950+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Concrete Problems in AI Safety","claims":[{"claim_text":"Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (\"avoiding side effects\" and \"avoiding reward hacking\"), a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Concrete Problems in AI Safety because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T08:18:16.872700+00:00"}},"summary":{"title":"Concrete Problems in AI Safety","claims":[{"claim_text":"Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. 
In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (\"avoiding side effects\" and \"avoiding reward hacking\"), a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Concrete Problems in AI Safety because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":16},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":11},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":9},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"CoRR , volume =","work_id":"871c0bb7-e08b-4d8b-be76-610707c748dd","shared_citers":6},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":6},{"title":"AI safety via 
debate","work_id":"13c1ec37-af93-438a-bdf0-f2eafaee5635","shared_citers":5},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"Scalable agent alignment via reward modeling: a research direction","work_id":"1e9c4f6d-b369-4bd2-8e6e-1ec00318c924","shared_citers":5},{"title":"Towards Understanding Sycophancy in Language Models","work_id":"aeefec9a-6ad5-4743-92b9-de6983895e21","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":5},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"Alignment faking in large language models","work_id":"cc253a89-cda1-4889-9631-bf3ce8147650","shared_citers":4},{"title":"arXiv preprint arXiv:2310.19852 , year=","work_id":"bc8d43a6-a842-4003-b1b6-c424256a8151","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":4},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at 
Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":4}],"time_series":[{"n":1,"year":2017},{"n":1,"year":2018},{"n":1,"year":2020},{"n":1,"year":2021},{"n":1,"year":2022},{"n":2,"year":2025},{"n":67,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"c51bdf66-c375-4dec-86c0-647ca4f89b02","orcid":null,"display_name":"Chris Olah","source":"manual","import_confidence":0.72},{"id":"2bf2bc1d-a1b5-492a-b4eb-d4858cf531c8","orcid":null,"display_name":"Dan Mané","source":"manual","import_confidence":0.72},{"id":"77c3c2d7-9b00-4b88-b1b8-5cdd4b2ea747","orcid":null,"display_name":"Dario Amodei","source":"manual","import_confidence":0.72},{"id":"9abdb3e2-caf2-47ea-9ab5-d10786d948e0","orcid":null,"display_name":"Jacob Steinhardt","source":"manual","import_confidence":0.72},{"id":"298fbbc8-0497-4319-a5d8-d4bcef3f7f3d","orcid":null,"display_name":"John Schulman","source":"manual","import_confidence":0.72},{"id":"b5229722-2551-40ed-8eca-d2ef7b4fd5df","orcid":null,"display_name":"Paul Christiano","source":"manual","import_confidence":0.72}]}}