{"paper":{"title":"Risks from Learned Optimization in Advanced Machine Learning Systems","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Learned models in machine learning can themselves become optimizers whose objectives diverge from the training loss.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Chris van Merwijk, Evan Hubinger, Joar Skalse, Scott Garrabrant, Vladimir Mikulik","submitted_at":"2019-06-05T04:43:25Z","abstract_excerpt":"We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be - how will it differ from the loss function it was train"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems: under what circumstances will learned models be optimizers, and when a learned model is an optimizer, what will its objective be and how can it be aligned?","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The analysis assumes that sufficiently capable learned models will contain internal optimization processes whose objectives can be analyzed separately from the outer training loss, without providing formal conditions or empirical thresholds for when this separation becomes load-bearing.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Learned models in machine learning can themselves become optimizers whose objectives diverge from the training loss.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2720bafcf2225627f4fe10aa4c8bc8687bc38d42ad942670459f997db0b67c42"},"source":{"id":"1906.01820","kind":"arxiv","version":3},"verdict":{"id":"ad4daa52-aae6-44a7-89dd-aabb3516faa4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:17:39.308599Z","strongest_claim":"We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems: under what circumstances will learned models be optimizers, and when a learned model is an optimizer, what will its objective be and how can it be aligned?","one_line_summary":"Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The analysis assumes that sufficiently capable learned models will contain internal optimization processes whose objectives can be analyzed separately from the outer training loss, without providing formal conditions or empirical thresholds for when this separation becomes load-bearing.","pith_extraction_headline":"Learned models in machine learning can themselves become optimizers whose objectives diverge from the training loss."},"references":{"count":40,"sample":[{"doi":"","year":2018,"title":"Bottle caps aren’t optimisers, 2018","work_id":"8ecf2977-b6f2-4d72-bc30-102fd9f0f239","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning","work_id":"54bcc5b2-9cd5-454d-b73d-e4e24db90330","ref_index":2,"cited_arxiv_id":"1710.11417","is_internal_anchor":true},{"doi":"","year":2018,"title":"Universal Planning Networks","work_id":"12d1df6e-e02d-4bc2-9d7e-1adfc87fa16c","ref_index":3,"cited_arxiv_id":"1804.00645","is_internal_anchor":true},{"doi":"","year":2016,"title":"2016 , month = nov, journal =","work_id":"9919eb6a-fde4-4d39-8047-21197365166d","ref_index":4,"cited_arxiv_id":"1606.04474","is_internal_anchor":true},{"doi":"","year":2016,"title":"Bartlett, Ilya Sutskever, and Pieter Abbeel","work_id":"f9b75698-414f-456e-a2e6-dbe568b2693d","ref_index":5,"cited_arxiv_id":"1611.02779","is_internal_anchor":true}],"resolved_work":40,"snapshot_sha256":"b7f5e0419c68a6a7655e64aa0f7f2583f0a3fabadfcde1ed8dc17422c6e4d104","internal_anchors":18},"formal_canon":{"evidence_count":1,"snapshot_sha256":"3187f144d2a547554246567d06b0e45853d60b4f975f40671375d8b31182bf69"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}