{"work":{"id":"43875dbe-bc2d-4ab5-af63-744411533ff7","openalex_id":null,"doi":null,"arxiv_id":"2209.10652","raw_key":null,"title":"Toy Models of Superposition","authors":null,"authors_text":"Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec","year":2022,"venue":"cs.LG","abstract":"Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in \"superposition.\" We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.","external_url":"https://arxiv.org/abs/2209.10652","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:36:40.324812+00:00","pith_arxiv_id":"2209.10652","created_at":"2026-05-09T06:00:36.347281+00:00","updated_at":"2026-05-25T05:36:40.324812+00:00","title_quality_ok":false,"display_title":"Toy Models of Superposition","render_title":"Toy Models of Superposition"},"hub":{"state":{"work_id":"43875dbe-bc2d-4ab5-af63-744411533ff7","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":76,"external_cited_by_count":null,"distinct_field_count":13,"first_pith_cited_at":"2023-03-14T17:47:09+00:00","last_pith_cited_at":"2026-05-22T13:59:13+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T10:22:21.535169+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":19},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":17},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:11:24.242753+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Sparse Autoencoders Find Highly Interpretable Features in Language Models","work_id":"51960d72-c69f-4db8-8efd-e90e8b4d9524","shared_citers":16},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":12},{"title":"In-context Learning and Induction Heads","work_id":"db2b0911-2758-4a2a-99dc-15b14b91bd5e","shared_citers":11},{"title":"Representation Engineering: A Top-Down Approach to AI Transparency","work_id":"45b326e2-e962-41a5-a542-2559e103a19b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"Steering Language Models With Activation Engineering","work_id":"d525fe06-5560-4e97-86fc-7a0e551f5b17","shared_citers":7},{"title":"The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets","work_id":"400e017f-8643-4166-b6da-a75d4446da80","shared_citers":7},{"title":"Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small","work_id":"d1167c73-3f2a-472b-8bf5-0ec282d7988a","shared_citers":6},{"title":"Scaling and evaluating sparse autoencoders","work_id":"f3faddeb-36ed-42bc-a7c9-9e764dc9b368","shared_citers":6},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":6},{"title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","work_id":"fb24e7e7-f336-4706-bc2d-62d656b28d74","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"The Linear Representation Hypothesis and the Geometry of Large Language Models","work_id":"a7b44adc-f2c2-4420-a27d-8ade97dd3b75","shared_citers":5},{"title":"arXiv preprint arXiv:2308.09124 , year=","work_id":"25f5f724-b6d7-427f-a2f3-e8b72fd3b5e2","shared_citers":4},{"title":"arXiv preprint arXiv:2501.16496 , year=","work_id":"f55f2189-55b1-4a1c-acfb-a5fa7bfa9e86","shared_citers":4},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":4},{"title":"Sparks of Artificial General Intelligence: Early experiments with GPT-4","work_id":"a23cfe92-7f7c-424b-98d4-b386a83002fb","shared_citers":4},{"title":"Steering llama 2 via contrastive activation addition","work_id":"1c681d32-40e4-4e00-b5a1-ddec8430f6ca","shared_citers":4},{"title":"Transformer Feed-Forward Layers Are Key-Value Memories","work_id":"af95ed34-d603-4b8d-b5d5-582dd09e9938","shared_citers":4},{"title":"2023 , journal=","work_id":"b5b39b4d-dcec-4f85-b0b1-54f58650c82e","shared_citers":3},{"title":"2024 , eprint=","work_id":"e17bb16b-c75e-4341-b40a-d9e96783cd08","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":2,"year":2024},{"n":42,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:11:15.768955+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:11:15.714071+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Toy Models of Superposition","claims":[{"claim_text":"Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in \"superposition.\" We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Toy Models of Superposition because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:11:13.407038+00:00"}},"summary":{"title":"Toy Models of Superposition","claims":[{"claim_text":"Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in \"superposition.\" We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Toy Models of Superposition because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Sparse Autoencoders Find Highly Interpretable Features in Language Models","work_id":"51960d72-c69f-4db8-8efd-e90e8b4d9524","shared_citers":16},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":12},{"title":"In-context Learning and Induction Heads","work_id":"db2b0911-2758-4a2a-99dc-15b14b91bd5e","shared_citers":11},{"title":"Representation Engineering: A Top-Down Approach to AI Transparency","work_id":"45b326e2-e962-41a5-a542-2559e103a19b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"Steering Language Models With Activation Engineering","work_id":"d525fe06-5560-4e97-86fc-7a0e551f5b17","shared_citers":7},{"title":"The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets","work_id":"400e017f-8643-4166-b6da-a75d4446da80","shared_citers":7},{"title":"Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small","work_id":"d1167c73-3f2a-472b-8bf5-0ec282d7988a","shared_citers":6},{"title":"Scaling and evaluating sparse autoencoders","work_id":"f3faddeb-36ed-42bc-a7c9-9e764dc9b368","shared_citers":6},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":6},{"title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","work_id":"fb24e7e7-f336-4706-bc2d-62d656b28d74","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"The Linear Representation Hypothesis and the Geometry of Large Language Models","work_id":"a7b44adc-f2c2-4420-a27d-8ade97dd3b75","shared_citers":5},{"title":"arXiv preprint arXiv:2308.09124 , year=","work_id":"25f5f724-b6d7-427f-a2f3-e8b72fd3b5e2","shared_citers":4},{"title":"arXiv preprint arXiv:2501.16496 , year=","work_id":"f55f2189-55b1-4a1c-acfb-a5fa7bfa9e86","shared_citers":4},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":4},{"title":"Sparks of Artificial General Intelligence: Early experiments with GPT-4","work_id":"a23cfe92-7f7c-424b-98d4-b386a83002fb","shared_citers":4},{"title":"Steering llama 2 via contrastive activation addition","work_id":"1c681d32-40e4-4e00-b5a1-ddec8430f6ca","shared_citers":4},{"title":"Transformer Feed-Forward Layers Are Key-Value Memories","work_id":"af95ed34-d603-4b8d-b5d5-582dd09e9938","shared_citers":4},{"title":"2023 , journal=","work_id":"b5b39b4d-dcec-4f85-b0b1-54f58650c82e","shared_citers":3},{"title":"2024 , eprint=","work_id":"e17bb16b-c75e-4341-b40a-d9e96783cd08","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":2,"year":2024},{"n":42,"year":2026}],"dependency_candidates":[]},"authors":[]}}