pith. machine review for the scientific record.

arxiv: 2605.01420 · v1 · submitted 2026-05-02 · 💻 cs.AI

Recognition: unknown

Artificial Jagged Intelligence as Uneven Optimization Energy Allocation: Capability Concentration, Redistribution, and Optimization Governance

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords artificial jagged intelligence · optimization energy allocation · finite-budget tradeoff · capability dispersion · uneven emergence · training anisotropy · redistribution mechanisms · gradient update energy

The pith

AI training distributes limited optimization energy unevenly, producing jagged capability profiles rather than uniform intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper models the training of large learning systems as a finite-budget process that allocates gradient-driven update energy across capability-relevant directions in parameter space. It establishes that persistent concentration of this energy creates lower bounds on dispersion in capability gains across domains. A tradeoff theorem shows that prioritizing one capability imposes opportunity costs on others unless positive coupling or shared structure offsets the cost. The analysis examines redistribution tools such as energy-variance regularization and auxiliary objectives that can reshape the optimization field. This reframes jagged performance as a direct consequence of resource allocation mechanics rather than an unexplained feature of intelligence.
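
A minimal numerical sketch of this allocation picture, assuming a fixed total budget and gains that scale linearly with allocated energy share; the names, the linear-gain rule, and the coupling term are illustrative choices, not taken from the paper:

    import numpy as np

    # Toy allocation model: a fixed optimization budget is split across capabilities.
    # Assumption (not from the paper): gain in capability i is proportional to its
    # cumulative energy share; `coupling` leaks a fraction of the budget to all
    # capabilities, standing in for shared structure.
    n_capabilities = 5
    total_budget = 1.0

    def capability_gains(shares, coupling=0.0):
        direct = shares * total_budget
        shared = coupling * total_budget / len(shares)
        return direct * (1.0 - coupling) + shared

    uniform = np.full(n_capabilities, 1.0 / n_capabilities)
    concentrated = np.array([0.8, 0.05, 0.05, 0.05, 0.05])

    for name, shares in [("uniform", uniform), ("concentrated", concentrated)]:
        gains = capability_gains(shares)
        print(f"{name:>12}: gains={np.round(gains, 3)}, dispersion (std)={gains.std():.3f}")

    # Concentrated allocation produces high dispersion ("jaggedness"); pushing
    # `coupling` toward 1 flattens it, mirroring the claim that positive coupling
    # or shared structure can offset the tradeoff.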

Core claim

Artificial Jagged Intelligence denotes the pattern of strong local capabilities alongside brittleness elsewhere, arising because training distributes a finite budget of gradient-driven update energy across anisotropic directions in parameter space. Persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem demonstrates that prioritizing one capability imposes opportunity costs on others unless positive coupling or shared structure offsets the cost. Redistribution mechanisms, including energy-variance regularization and auxiliary structural objectives, can reshape the optimization field. The framework predicts that early concentration of update energy forecasts later capability jaggedness, that scaling under a narrow objective need not eliminate anisotropy, and that explicitly funded auxiliary objectives can revive neglected capabilities.

What carries the argument

The finite-budget tradeoff theorem that links concentrated update energy to measurable dispersion in capability gains.
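
The review does not quote the theorem itself, so its exact form is unavailable here; as a hedged sketch only, under the abstract's setup (nonnegative energy shares summing to a fixed budget, gains increasing in allocated energy), the simplest linear instance of such a bound would read:

    % Illustrative sketch, not the paper's statement; the linear gain model
    % g_i = c e_i and the constant c are placeholder assumptions.
    \[
      \sum_{i=1}^{n} e_i = E, \qquad e_i \ge 0, \qquad g_i = c\, e_i .
    \]
    % If cumulative energy concentrates on one direction,
    \[
      \max_i e_i \;\ge\; (1+\delta)\,\frac{E}{n},
    \]
    % then the remaining shares average at most (E - \max_i e_i)/(n-1),
    % forcing a lower bound on the dispersion of gains:
    \[
      \max_i g_i \;-\; \min_i g_i \;\ge\; \frac{c\,\delta\,E}{n-1}.
    \]
    % Positive coupling (g_i also drawing on energy allocated elsewhere)
    % weakens or removes the bound, matching the "unless positive coupling
    % or shared structure offsets the cost" caveat.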

If this is right

  • Early concentration of update energy forecasts later capability jaggedness.
  • Scaling under a narrow objective need not eliminate anisotropy in capability profiles.
  • Explicitly funded auxiliary objectives can revive neglected capabilities.
  • Energy-variance regularization and similar interventions reshape the optimization field to reduce dispersion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could monitor update energy shares during training to intervene before jaggedness becomes entrenched (a minimal monitoring sketch follows this list).
  • Architectures that increase representational coupling might lower the energy required for redistribution.
  • The same allocation logic could apply to other constrained optimization settings such as reinforcement learning with multiple reward channels.
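
A minimal sketch of what such monitoring could look like, using per-domain gradient energy (squared gradient norms on held-out domain batches) as the share proxy; the helper below and its name are hypothetical, not the paper's instrumentation:

    import torch

    def energy_shares(model, domain_batches, loss_fn):
        """Proxy for per-domain update-energy share: squared gradient norm of the
        domain loss, normalized across domains. `domain_batches` maps domain name
        -> (inputs, targets). Hypothetical helper, one possible operationalization."""
        raw = {}
        for domain, (x, y) in domain_batches.items():
            model.zero_grad()
            loss_fn(model(x), y).backward()
            raw[domain] = sum(p.grad.pow(2).sum().item()
                              for p in model.parameters() if p.grad is not None)
        total = sum(raw.values()) or 1.0
        return {d: v / total for d, v in raw.items()}

    # Logging these shares every k steps would show whether the distribution stays
    # heavily concentrated early in training, which on the paper's reading would
    # forecast a jagged capability profile later.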

Load-bearing premise

Training can be modeled as a finite-budget process that distributes gradient-driven update energy across distinct capability directions in parameter space.

What would settle it

Training runs in which update energy concentrates heavily on one capability yet gains remain uniform across unrelated domains.
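
A hedged sketch of how that test could be scored across instrumented runs; the arrays below are placeholder numbers for illustration only, and Spearman correlation is one reasonable choice of statistic rather than the paper's protocol:

    import numpy as np
    from scipy import stats

    # Across many training runs, compare early update-energy concentration with
    # final capability-gain dispersion. Both arrays are placeholders standing in
    # for measurements from instrumented runs.
    early_max_share = np.array([0.25, 0.40, 0.55, 0.70, 0.85])
    final_gain_std  = np.array([0.02, 0.05, 0.09, 0.15, 0.22])

    rho, p = stats.spearmanr(early_max_share, final_gain_std)
    print(f"Spearman rho={rho:.2f}, p={p:.3f}")

    # The framework predicts a strong positive association; runs with high early
    # concentration but near-zero final dispersion would count against it.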

read the original abstract

Artificial Jagged Intelligence (AJI) denotes a recurring pattern in which large learning systems exhibit strong local capabilities while remaining weak or brittle in other domains. This paper develops a formal theory of AJI as uneven allocation of optimization pressure. We model training as a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space. In this model, jagged capability profiles arise from anisotropic objective structure, data geometry, and representational coupling rather than from a single scalar quantity called intelligence. The paper defines capability gain, optimization energy share, and jaggedness, then proves that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem shows why prioritizing one capability can impose opportunity costs on others unless positive coupling or shared structure offsets the cost. The analysis also studies redistribution mechanisms, including energy-variance regularization and auxiliary structural objectives, as interventions that reshape the optimization field. The resulting framework links uneven emergence, training architecture, and optimization governance. It predicts that early concentration of update energy should forecast later capability jaggedness; that scaling under a narrow objective need not eliminate anisotropy; and that explicitly funded auxiliary objectives can revive neglected capabilities. AJI is therefore not merely a descriptive label for uneven model behavior, but a testable theory of how finite optimization resources produce concentrated, delayed, and structurally uneven capability formation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Artificial Jagged Intelligence (AJI) as a pattern of strong local capabilities alongside brittleness in other domains in large learning systems. It models training as a finite-budget process distributing gradient-driven update energy across capability-relevant directions in parameter space, defines capability gain, optimization energy share, and jaggedness, and claims to prove that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem is presented to explain opportunity costs when prioritizing one capability, along with analysis of redistribution mechanisms such as energy-variance regularization and auxiliary objectives, and predictions that early energy concentration forecasts later jaggedness while narrow objectives do not eliminate anisotropy.

Significance. If the modeling and theorems hold with proper derivation from optimization dynamics, the framework could provide a principled account of uneven capability emergence in scaled models, shifting emphasis from monolithic intelligence to anisotropic objective structure, data geometry, and representational coupling. It offers concrete predictions and intervention strategies (e.g., auxiliary objectives to revive neglected capabilities) that could inform training design and optimization governance, with potential for falsifiable tests on real systems.

major comments (2)
  1. [Modeling of training as finite-budget process and statement of the tradeoff theorem] The finite-budget tradeoff theorem and lower bounds on dispersion rest on the modeling assumption that training distributes a conserved total optimization energy across directions (with concentration producing measurable opportunity costs unless offset by coupling). This conservation is not derived from the gradient descent update equations; standard optimizers (SGD, Adam) determine ||Δθ|| via learning-rate schedules and adaptive statistics without a fixed scalar budget, so the claimed lower bounds appear to follow from an imposed accounting identity rather than from the loss landscape or data geometry. (An illustrative sketch of this point follows the minor comments.)
  2. [Definitions of capability gain, optimization energy share, and jaggedness] The definitions of optimization energy share and jaggedness are introduced directly in terms of the same allocation process that the lower bounds and predictions are meant to explain, creating a risk of circularity; the manuscript must show how these quantities are independently measurable or falsifiable against external benchmarks (e.g., capability evaluations) rather than tautological with the modeling choices.
minor comments (2)
  1. Notation for 'jaggedness' and 'energy share' should be formalized with explicit equations early in the paper to improve readability and allow direct comparison to standard optimization quantities such as gradient norms or Fisher information.
  2. The manuscript would benefit from additional references to prior work on anisotropic loss landscapes, multi-task optimization, and emergent capabilities to situate the contribution and avoid reinventing related concepts.
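
A minimal sketch illustrating the referee's first major point (not drawn from the paper or the report): under a standard optimizer, cumulative squared update magnitude is set by the learning-rate schedule and adaptive statistics rather than conserved.

    import torch

    # Toy objective; Adam's per-step update magnitude depends on its adaptive
    # statistics and the schedule, so the running total of ||Δθ||² is not a fixed
    # quantity that training merely "allocates" across directions.
    torch.manual_seed(0)
    theta = torch.randn(100, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=1e-2)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

    cumulative_energy = 0.0
    for step in range(150):
        prev = theta.detach().clone()
        loss = ((theta - 1.0) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
        cumulative_energy += (theta.detach() - prev).pow(2).sum().item()
        if step % 50 == 49:
            print(f"step {step+1:3d}: cumulative ||dtheta||^2 = {cumulative_energy:.4f}")

    # The total keeps drifting with the schedule and gradient scale; any "finite
    # budget" therefore has to be imposed as a modeling assumption.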

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the AJI framework in providing a principled account of anisotropic capability emergence. We address each major comment below with clarifications and planned revisions to strengthen the manuscript's rigor and empirical grounding.

read point-by-point responses
  1. Referee: [Modeling of training as finite-budget process and statement of the tradeoff theorem] The finite-budget tradeoff theorem and lower bounds on dispersion rest on the modeling assumption that training distributes a conserved total optimization energy across directions (with concentration producing measurable opportunity costs unless offset by coupling). This conservation is not derived from the gradient descent update equations; standard optimizers (SGD, Adam) determine ||Δθ|| via learning-rate schedules and adaptive statistics without a fixed scalar budget, so the claimed lower bounds appear to follow from an imposed accounting identity rather than from the loss landscape or data geometry.

    Authors: We acknowledge that the finite-budget formulation is introduced as a modeling abstraction rather than a direct mathematical consequence of the standard gradient-descent update rules. The framework posits an effective conservation of optimization resources to reflect the practical finiteness of training (fixed step count, compute budget, and data exposure), under which cumulative update energy is allocated across parameter directions. Within this model, the lower bounds on capability dispersion follow from the concentration of updates combined with anisotropic objectives and limited representational coupling. In the revised manuscript we will add an explicit subsection that (i) derives the effective budget from training constraints such as total gradient steps and adaptive step-size statistics, (ii) shows that the tradeoff theorem continues to hold under relaxed (non-strictly conserved) budgets when opportunity costs are measured via directional gradient contributions, and (iii) discusses the relationship to the loss landscape geometry. We maintain that the core opportunity-cost insight is driven by objective anisotropy and coupling rather than the accounting identity alone. revision: partial

  2. Referee: [Definitions of capability gain, optimization energy share, and jaggedness] The definitions of optimization energy share and jaggedness are introduced directly in terms of the same allocation process that the lower bounds and predictions are meant to explain, creating a risk of circularity; the manuscript must show how these quantities are independently measurable or falsifiable against external benchmarks (e.g., capability evaluations) rather than tautological with the modeling choices.

    Authors: We agree that independent measurability is essential to avoid circularity. In the revision we will augment the definitions section with operational mappings to observable quantities: capability gain will be tied to performance deltas on standardized external benchmarks or task suites; optimization energy share will be proxied by integrated gradient norms or parameter-update magnitudes projected onto capability-relevant subspaces (identified via linear probes or attribution methods); and jaggedness will be quantified as the statistical dispersion (e.g., variance or Gini coefficient) of normalized benchmark scores across domains. These mappings enable falsifiable predictions, such as correlating early-training energy concentration with later benchmark dispersion. A new subsection on empirical validation strategies will outline concrete experimental protocols for testing these relations on existing models. revision: yes
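
A minimal sketch of the operational mappings the rebuttal proposes, with the Gini coefficient as the jaggedness measure and projected update energy as the share proxy; function names, shapes, and the example scores are illustrative assumptions:

    import numpy as np

    def gini(scores):
        """Gini coefficient of normalized benchmark scores, one of the dispersion
        measures the rebuttal proposes as an operational jaggedness metric."""
        s = np.sort(np.asarray(scores, dtype=float))
        n = s.size
        cum = np.cumsum(s)
        return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

    def energy_share_in_subspace(update_vectors, basis):
        """Fraction of cumulative squared update magnitude falling inside a
        capability-relevant subspace (orthonormal columns of `basis`, e.g. from
        linear probes). Illustrative proxy, not the paper's definition."""
        total = sum(float(u @ u) for u in update_vectors)
        inside = sum(float(np.linalg.norm(basis.T @ u) ** 2) for u in update_vectors)
        return inside / total if total else 0.0

    # Example: normalized benchmark scores across five domains (placeholder values).
    print(gini([0.92, 0.88, 0.35, 0.40, 0.30]))  # jagged profile -> higher Gini
    print(gini([0.60, 0.58, 0.62, 0.61, 0.59]))  # flat profile   -> near zero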

Circularity Check

1 step flagged

Finite-budget tradeoff theorem is an accounting identity imposed by the modeling assumption of conserved total update energy

specific steps
  1. self-definitional [Abstract]
    "We model training as a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space. ... The paper defines capability gain, optimization energy share, and jaggedness, then proves that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem shows why prioritizing one capability can impose opportunity costs on others unless positive coupling or shared structure offsets the cost."

    The lower bounds and tradeoff are claimed as derived results, but they reduce to the definitional accounting identity of a fixed total energy budget: any persistent concentration necessarily produces dispersion and opportunity costs by the finite-sum constraint. No step derives the conservation of total update energy from the gradient descent update rule itself; the theorem therefore holds by construction inside the model rather than from external dynamics.

full rationale

The paper's derivation chain starts by defining training as a finite-budget process that allocates a conserved total of gradient-driven update energy across capability directions. It then defines capability gain, energy share, and jaggedness within this framework and proves lower bounds on dispersion plus a tradeoff theorem. These results follow directly from the finite-sum constraint (concentration in one direction reduces others by definition) without deriving the conservation property from the actual update equations of SGD, Adam, or other optimizers. The claimed predictions and lower bounds are therefore tautological to the input model rather than independent consequences of the loss landscape or data geometry.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on treating training as finite-budget anisotropic optimization; the abstract offers no independent empirical grounding or external benchmarks, and new concepts such as optimization energy share are introduced to explain the target phenomenon.

axioms (1)
  • domain assumption Training is a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space
    Explicitly stated as the modeling premise in the abstract.
invented entities (2)
  • optimization energy share no independent evidence
    purpose: Quantify allocation of update energy to different capabilities
    New quantity defined to formalize the uneven allocation mechanism
  • jaggedness no independent evidence
    purpose: Measure dispersion in capability gains
    New derived metric tied directly to energy concentration

pith-pipeline@v0.9.0 · 5538 in / 1533 out tokens · 65338 ms · 2026-05-09T14:10:00.022826+00:00 · methodology

