The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces

Alec S. Xu; Can Yaras; Dogyoon Song; Laura Balzano; Qing Qu; Soo Min Kwon

arXiv: 2606.06814 · v1 · pith:4VGIB2VInew · submitted 2026-06-05 · stat.ML · cs.LG · math.ST · stat.AP · stat.TH

Soo Min Kwon, Alec S. Xu, Can Yaras, Dogyoon Song, Laura Balzano, Qing Qu

Pith reviewed 2026-06-27 21:02 UTC · model grok-4.3

keywords in-context learning · task diversity · low-rank Gaussians · linear attention · subspace overlap · generalization · optimization trajectory

The pith

Modeling training task vectors as a mixture of low-rank Gaussians explains how subspace diversity shortens the ICL plateau and produces apparent out-of-distribution generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that a minimal probabilistic model of task vectors accounts for observed effects of training diversity on in-context learning. Task vectors are drawn from a mixture of low-rank Gaussians whose covariance matrices are set by subspaces, and diversity is measured by how many columns of those subspaces do not overlap. If the model is correct, increasing this non-overlap both speeds the escape from the plateau phase of linear-attention training and makes the learned ICL behavior look as though it generalizes beyond the training distribution. The work then checks that the same pattern appears in nonlinear transformers and function classes.

Core claim

By representing training task vectors as samples from a mixture of low-rank Gaussians whose covariances are set by subspaces, the number of non-overlapping columns between those subspaces determines how task diversity affects the optimization and generalization of in-context learning using linear attention. This setup provably leads to faster escape from the ICL plateau when diversity increases and produces apparent out-of-distribution generalization.

What carries the argument

Mixture of low-rank Gaussians for task vectors, with diversity given by the count of non-overlapping columns between the subspaces that parameterize the covariance matrices.
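
A minimal sketch of this generative model, assuming each mixture component has covariance U_k U_k^T, where U_k stacks a block of columns shared by all components next to a block unique to component k. The function names, the ambient dimension `d = 32`, the two-component mixture, and the uniform mixture weights are illustrative assumptions; the paper's exact parameterization is not given in this review. The values `n_over = 6` and `n_ind = 4` echo the Figure 3 caption.

```python
import numpy as np

def make_subspaces(d, n_over, n_ind, n_components, rng):
    """Build orthonormal bases U_k = [shared columns | columns unique to k]."""
    total = n_over + n_components * n_ind
    if total > d:
        raise ValueError("need d >= n_over + n_components * n_ind")
    # Reduced QR of a Gaussian matrix gives `total` orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((d, total)))
    shared = Q[:, :n_over]
    bases = []
    for k in range(n_components):
        start = n_over + k * n_ind
        bases.append(np.hstack([shared, Q[:, start:start + n_ind]]))
    return bases

def sample_task_vectors(bases, n_samples, rng, weights=None):
    """Draw w from sum_k pi_k N(0, U_k U_k^T); returns (vectors, labels)."""
    labels = rng.choice(len(bases), size=n_samples, p=weights)
    d = bases[0].shape[0]
    W = np.empty((n_samples, d))
    for i, k in enumerate(labels):
        U = bases[k]
        # w = U z with z ~ N(0, I) has covariance U U^T (rank = number of columns).
        W[i] = U @ rng.standard_normal(U.shape[1])
    return W, labels

rng = np.random.default_rng(0)
# Hypothetical sizes: d and the number of components are assumptions.
bases = make_subspaces(d=32, n_over=6, n_ind=4, n_components=2, rng=rng)
W, labels = sample_task_vectors(bases, n_samples=1000, rng=rng)
```

In this construction the diversity knob is `n_ind`: raising it while holding the per-component rank fixed moves columns from the shared block to the unique blocks.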

If this is right

Higher task diversity, measured by subspace non-overlap, shortens the plateau phase in the optimization trajectory of linear attention for ICL.
The same diversity measure produces the appearance of out-of-distribution generalization even though the underlying mechanism stays inside the training distribution.
The low-rank Gaussian mixture model unifies phenomena previously studied under two different definitions of task diversity.
The same qualitative effects appear when the analysis is extended empirically to nonlinear transformers and nonlinear function classes.
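
The generalization consequence can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's experiment: it uses the reduced one-layer linear-attention predictor common in the ICL literature, x_q^T Γ (1/n) Σ y_i x_i, and sets Γ by hand to the projector onto a hypothetical training span instead of training it. The names `linear_attention_predict`, `icl_risk`, `U_train`, `U_inside`, and `U_outside`, and all sizes, are illustrative.

```python
import numpy as np

def linear_attention_predict(X, y, x_query, Gamma):
    """Reduced linear-attention prediction: x_q^T Gamma (1/n) sum_i y_i x_i."""
    n = X.shape[0]
    return x_query @ Gamma @ (X.T @ y) / n

def icl_risk(U_test, Gamma, n_context, n_trials, rng):
    """Monte Carlo squared error for test tasks w = U_test z, z ~ N(0, I)."""
    d, r = U_test.shape
    total = 0.0
    for _ in range(n_trials):
        w = U_test @ rng.standard_normal(r)       # test task vector
        X = rng.standard_normal((n_context, d))   # in-context inputs
        y = X @ w                                 # noiseless labels
        x_q = rng.standard_normal(d)              # query input
        pred = linear_attention_predict(X, y, x_q, Gamma)
        total += (pred - x_q @ w) ** 2
    return total / n_trials

rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal basis
U_train = Q[:, :4]            # hypothetical span of the training subspaces
Gamma = U_train @ U_train.T   # hand-set projector (assumption), not a trained weight

U_inside = Q[:, 1:3]          # test subspace inside the training span
U_outside = Q[:, 4:6]         # test subspace orthogonal to the training span

risk_inside_short = icl_risk(U_inside, Gamma, 20, 500, rng)
risk_inside_long = icl_risk(U_inside, Gamma, 2000, 500, rng)
risk_outside_long = icl_risk(U_outside, Gamma, 2000, 500, rng)
print(risk_inside_short, risk_inside_long, risk_outside_long)
```

With this hand-set Γ, the risk for test tasks inside the training span shrinks as the prompt grows, while the risk for tasks orthogonal to the span stays bounded away from zero. That is consistent with reading the generalization as "apparent" out-of-distribution behavior that never leaves the span of the training subspaces.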

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The subspace-overlap metric could be used to design or filter pre-training corpora so that the number of non-overlapping directions is deliberately maximized.
If the model captures the essential geometry, then random sampling of tasks is likely suboptimal and structured sampling that controls subspace overlap could improve sample efficiency.
The same low-rank mixture lens might be applied to other emergent transformer behaviors whose training dynamics currently lack simple explanations.
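
One way such a subspace-overlap metric could be computed is through principal angles between two subspace bases. The sketch below is an assumption about how to operationalize the idea: it counts shared directions as principal angles that are numerically zero, and the function names and tolerance are hypothetical. The paper itself defines diversity by the number of non-overlapping columns.

```python
import numpy as np

def principal_angles(U1, U2):
    """Principal angles (radians, ascending) between the column spans of U1 and U2."""
    Q1, _ = np.linalg.qr(U1)  # orthonormalize in case the inputs are not
    Q2, _ = np.linalg.qr(U2)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def count_shared_directions(U1, U2, tol=1e-6):
    """Number of principal angles that are numerically zero."""
    return int(np.sum(principal_angles(U1, U2) < tol))

# Toy example: two 3-dimensional subspaces of R^6 sharing two coordinate axes.
E = np.eye(6)
U1 = E[:, [0, 1, 2]]
U2 = E[:, [0, 1, 3]]
angles = principal_angles(U1, U2)
n_shared = count_shared_directions(U1, U2)
n_non_overlapping = U1.shape[1] - n_shared  # directions unique to each subspace
```

In the toy example the two subspaces share two directions, so each contributes one non-overlapping direction.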

Load-bearing premise

The distribution of training task vectors can be accurately captured by a mixture of low-rank Gaussians whose covariance matrices are set by subspaces.

What would settle it

Train a linear attention model on task vectors whose generating subspaces have more overlap, and observe that the ICL plateau does not lengthen or that the apparent out-of-distribution generalization is unaffected.

Figures

Figures reproduced from arXiv: 2606.06814 by Alec S. Xu, Can Yaras, Dogyoon Song, Laura Balzano, Qing Qu, Soo Min Kwon.

**Figure 1.** An illustrative overview. We introduce a subspace-based notion of task diversity and prove its benefits for transformer learning dynamics and generalization. Top: task diversity accelerates convergence in both linear attention and GPT-2. Bottom: at the global minimum, a transformer trained with diverse tasks can generalize to all subspaces within the span of training subspaces at principal angle $\theta$, even in…

**Figure 2.** Demonstrating the effects of training task diversity. We train a GPT-2 model for ICL and show that task diversity shortens the ICL plateau. Left: Reproducing the observations of [Kim+25] with $d = 10$; training on all tasks jointly (linear, sparse, Leaky ReLU) lowers the training loss faster than training on each task individually. Right: Plot of the training loss using our definition of task diversi…

**Figure 3.** Plots for simulated GF dynamics. We choose $q = 10$, $n_{\mathrm{ind}} = 4$, $n_{\mathrm{over}} = 6$, $\alpha = 0.005$, $\eta = 0.001$, and $\delta = 0.001$. $t_{\mathrm{mix,over}}$ denotes the predicted value of $t_{\mathrm{over}}$ for the task-diverse case. Left: Plot of the learning trajectories of $u_{\mathrm{ind}}(t) = v(t)\lambda_{\mathrm{ind}}^2(t)$ and $u_{\mathrm{over}}(t) = v(t)\lambda_{\mathrm{over}}^2(t)$. $u_{\mathrm{ind}}(t)$ stays close to its initialization until $u_{\mathrm{over}}(t)$ reaches 1, at which point $u_{\mathrm{ind}}(t)$ begins to learn. This de…

**Figure 4.** Plot of the test risk as a function of the prompt length when trained using diverse task vectors with test subspace drawn from Equation (24). Left: Plot of the risk for linear attention. Right: Plot of the risk for GPT-2. For both plots, when the prompt length at test time is large enough, the test risk goes nearly to zero for all $\theta \in$ …

**Figure 5.** Plot of the test risk as a function of prompt length when trained using diverse task vectors, but with the test subspace drawn from Equation (29). Left: Plot of the risk for linear attention. Right: Plot of the risk for GPT-2. The test loss does not approach zero even with large prompt lengths, and instead converges to the (normalized) test risk in Theorem 3 as the test subspace shifts away from the traini…

**Figure 6.** Depiction of the two-stage learning phenomenon on a single-layer linear attention model. We verify the two stages by testing on two different test-task vectors: one drawn from the overlapping subspace, and another drawn from the independent components (i.e., the remaining components orthogonal to the overlapping subspace). The right figure shows a zoomed-in version of the test losses, showing that the test…

**Figure 7.** Demonstrating that task diversity shortens the ICL plateau and accelerates convergence. We consider the case in which $n_{\mathrm{over}} = 4$ and $n_{\mathrm{ind}} = 8$ (which sets $n_{\mathrm{over}} = 12$ without diversity). Left: Simulated GF dynamics demonstrating that $t_{\mathrm{mix,low}}$, defined as $t_{\mathrm{mix}}$ evaluated at the lower bound of $t_{\mathrm{ind}}$, is strictly less than $t_{\mathrm{single}}$ and serves as an accurate approximation of the convergence time. Right: Plot of the…

**Figure 8.** Depiction of the two-stage learning phenomenon on GPT-2, where we train the model using two function classes, linear regression and sparse regression, sampled with equal probability. For sparse regression, we mask out 5 components such that the unmasked components can be viewed as the overlapping components. The ICL plateau drops once the sparse regression components (i.e., the common components) are lea…

**Figure 9.** Left: Phase plot of the test risk as we vary the angle between $\Sigma_s$ and $\Sigma_t$ and the prompt length with $m = n$ for a linear attention model trained with a mixture of Gaussians. The test risk is low across all angle shifts, and decreases further as the prompt length increases. Right: Plot of the test risk as a function of the prompt length for a case in which $\Sigma_s \neq \Sigma_t$ but with $\theta = 0$, following the OOD example in …

**Figure 10.** Visualization of the generalization behavior of transformers for learning nonlinear function classes in-context. Each corner of a triangle represents a one-dimensional subspace spanned by $\psi_1$ (bottom left), $\psi_2$ (bottom right), or $\psi_3$ (top), with all possible convex combinations given by the interior. In all cases, we show the risk when evaluated at different points in $\mathrm{span}(\{\psi_1, \psi_2, \psi_3\})$ for the appropriate …

Original abstract

The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a mixture-of-low-rank-Gaussians model that derives shorter ICL plateaus and apparent OOD generalization from reduced subspace overlap, but the fit to actual training distributions is untested.

The main point is that they model training task vectors as a mixture of low-rank Gaussians and tie diversity to the count of non-overlapping columns between the subspaces that set the covariances. From this they derive that greater diversity shortens the ICL plateau under linear attention and produces what looks like out-of-distribution generalization.

The construction is new in how it makes subspace overlap the direct control knob for both the optimization trajectory and the generalization behavior. It gives a single minimal setup that accounts for two separate lines of prior observations, which is cleaner than treating them separately. The empirical checks on nonlinear transformers and function classes are a reasonable next step to show the ideas are not limited to the linear case.

The limitation is that all the analytical results rest on the tasks actually following this low-rank Gaussian mixture structure, with diversity captured exactly by column overlap. The paper does not show that standard ICL data sources match this distribution or that the chosen diversity measure is the operative one in practice. If real task covariances differ, the link between the model and observed transformer behavior becomes weaker. The derivations themselves appear to follow from the model assumptions without obvious circularity, but the strength of the explanation depends on how well the generative story matches reality.

This is for people working on mechanistic or theoretical accounts of in-context learning. A reader who wants analytical tools to connect data properties to training dynamics would get something concrete from it. It deserves a serious referee because the model is tractable and makes testable predictions, even if the connection to real-scale ICL needs more validation.

Referee Report

2 major / 2 minor

Summary. The paper claims that modeling ICL training task vectors as a mixture of low-rank Gaussians, with diversity quantified by the number of non-overlapping columns between the subspaces parameterizing their covariance matrices, yields a minimal analytical model from which two phenomena provably emerge for linear attention: (i) increased task diversity shortens the ICL plateau and (ii) ICL exhibits apparent out-of-distribution generalization. The model is presented as explaining these effects directly from training-data properties, with empirical extension to nonlinear transformers and function classes.

Significance. If the generative assumptions are representative, the work supplies a tractable, low-dimensional subspace framework that analytically unifies several previously observed but theoretically unexplained effects of task diversity on ICL optimization trajectories and generalization. The explicit derivation under a mixture-of-low-rank-Gaussians model and the empirical checks on nonlinear cases are strengths that could make the framework useful for further theoretical study of linear attention.

major comments (2)

[Abstract / modeling section] The central explanatory claim, that the shortened plateau and apparent OOD generalization 'provably emerge from the properties of the training data', rests entirely on the unverified generative assumption that real ICL task vectors are well-approximated by a mixture of low-rank Gaussians whose diversity is captured by non-overlapping subspace columns. No section provides a direct empirical check (e.g., covariance estimation or subspace overlap statistics) on standard ICL training distributions to support that this structure is operative rather than an ad-hoc modeling choice.
[Analytical derivations] The analytical results on plateau length and OOD generalization are derived under the specific mixture-of-low-rank-Gaussians model for linear attention. If the true task covariances deviate from this structure (as the skeptic note suggests is possible), the link between the derived quantities and observed transformer behavior is weakened; the manuscript does not contain a sensitivity analysis or alternative generative models to test robustness of the two main phenomena.

minor comments (2)

[Modeling section] Notation for the subspace overlap count and the mixture weights should be introduced with a single, self-contained definition early in the modeling section to avoid later ambiguity.
[Empirical section] The empirical extension to nonlinear transformers would benefit from an explicit statement of the architecture, depth, and training hyperparameters used, so that the claimed qualitative agreement can be reproduced.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. Below, we provide point-by-point responses to the major comments, clarifying the scope and intent of our theoretical model.

Referee: [Abstract / modeling section] The central explanatory claim, that the shortened plateau and apparent OOD generalization 'provably emerge from the properties of the training data', rests entirely on the unverified generative assumption that real ICL task vectors are well-approximated by a mixture of low-rank Gaussians whose diversity is captured by non-overlapping subspace columns. No section provides a direct empirical check (e.g., covariance estimation or subspace overlap statistics) on standard ICL training distributions to support that this structure is operative rather than an ad-hoc modeling choice.

Authors: We thank the referee for this observation. Our claim is specifically that the described phenomena provably emerge from the properties of the training data *under the proposed minimal analytical model*. The model is not presented as a verified description of real ICL task distributions but as a tractable framework that isolates the effect of subspace diversity on ICL dynamics for linear attention. The low-rank Gaussian mixture assumption is motivated by the prevalence of low-dimensional structures in ICL analyses in the literature. While we agree that direct empirical checks on standard datasets would provide additional support, such verification falls outside the theoretical focus of this work. Our empirical contributions instead demonstrate that the qualitative predictions hold when extending to nonlinear transformers and function classes.

Revision: no

Referee: [Analytical derivations] The analytical results on plateau length and OOD generalization are derived under the specific mixture-of-low-rank-Gaussians model for linear attention. If the true task covariances deviate from this structure (as the skeptic note suggests is possible), the link between the derived quantities and observed transformer behavior is weakened; the manuscript does not contain a sensitivity analysis or alternative generative models to test robustness of the two main phenomena.

Authors: The derivations are performed under this specific model to enable analytical tractability and explicit connections between subspace overlap and the ICL phenomena. We selected the mixture-of-low-rank-Gaussians structure precisely because it permits closed-form analysis of the optimization trajectory and generalization. Regarding robustness, the manuscript includes empirical validation on nonlinear attention mechanisms and nonlinear function classes, which serves as an initial check that the effects are not artifacts of the linear setting. A comprehensive sensitivity analysis across multiple alternative generative models would be a substantial undertaking and is not included; however, the minimal nature of the model is intended to highlight the core mechanism rather than to exhaustively model all possible task distributions.

Revision: no

Circularity Check

0 steps flagged

The analytical model derives the ICL effects from posited generative assumptions; the results do not reduce to the inputs by construction.

The paper posits a mixture-of-low-rank-Gaussians model for task vectors and defines diversity via non-overlapping subspace columns, then mathematically derives shortened ICL plateau and apparent OOD generalization as consequences under linear attention. This is a standard forward derivation from stated assumptions rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or claims in the provided text reduce the output phenomena to the modeling choice by tautology; the results are conditional on the model. The assumption itself may be strong or unverified against real data, but that is a question of external validity, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters or invented entities; the modeling assumption itself is treated as a domain assumption below.

axioms (1)

Domain assumption: Training task vectors follow a mixture of low-rank Gaussians whose covariances are parameterized by subspaces.
Explicitly stated as the modeling choice that makes the diversity effects provable.

Reference graph

Works this paper leans on

77 extracted references · 2 canonical work pages

[1] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. In Forty-first International Conference on Machine Learning, 2024.
[2] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. The Impact of Initialization on Lo… 2024.
[3] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lo… 2024.
[4] Efficient Learning with Sine-Activated Low-Rank Matrices. Submitted to The Thirteenth International Conference on Learning Representations.
[5] Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate. The Twelfth International Conference on Learning Representations.
[6] The expressive power of low-rank adaptation. arXiv preprint arXiv:2310.17513.
[7] Task Diversity Shortens the In-Context Learning Plateau. Transactions on Machine Learning Research, 2025.
[8] Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[9] Multiplying a Gaussian matrix by a Gaussian vector. Statistics & Probability Letters, 2017.
[10] The matrix cookbook. Technical University of Denmark.
[11] Trained transformers learn linear models in-context. Journal of Machine Learning Research.
[12] Pretrained transformer efficiently learns low-dimensional target functions in-context. Advances in Neural Information Processing Systems.
[13] Tong Yang, Yu Huang, Yingbin Liang, and Yuejie Chi. In-Context Learning with Representations: Contextual Generalization of Trained Transformers.
[14] Training Dynamics of In-Context Learning in Linear Attention. Forty-second International Conference on Machine Learning.
[15] Can In-context Learning Really Generalize to Out-of-distribution Tasks? The Thirteenth International Conference on Learning Representations.
[16] Angelica Chen, Ravid Shwartz-Ziv, Kyunghyun Cho, Matthew L Leavitt, and Naomi Saphra. Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in… 2024.
[17] Pulkit Gopalani, Ekdeep Singh Lubana, and Wei Hu. Abrupt Learning in Transformers: A Case Study on Matrix Completion.
[18] Yue M. Lu, Mary Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Proceedings of the National Academy of Sciences, 2025. doi:10.1073/pnas.2502599122.
[19] Linear attention is (maybe) all you need (to understand Transformer optimization). The Twelfth International Conference on Learning Representations.
[20] Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation. International Conference on Machine Learning, 2024.
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.
[22] Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[23] Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, Roman Wang, Nikhil Singh, Taylor L. Patti, Jayson Lynch, Avi Shporer, Nakul Verma, Eugene Wu, and Gilbert Strang. Proceedings of the National Academy of Sciences, 2022. doi:10.1073/pnas.2123433119.
[24] Transformers as Algorithms: Generalization and Stability in In-context Learning. Proceedings of the 40th International Conference on Machine Learning, 2023.
[25] Emergent Abilities of Large Language Models. Transactions on Machine Learning Research, 2022.
[26] Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What Can Transformers Learn In-Context? 2022.
[27] Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems.
[28] One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention. The Twelfth International Conference on Learning Representations.
[29] Transformers learn in-context by gradient descent. International Conference on Machine Learning, 2023.
[30] Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems.
[31] What learning algorithm is in-context learning? investigations with linear models. The Eleventh International Conference on Learning Representations.
[32] Language models are few-shot learners. Advances in neural information processing systems.
[33] Pretraining data mixtures enable narrow model selection capabilities in transformer models. arXiv preprint arXiv:2311.00871.
[34] A closer look at in-context learning under distribution shifts. arXiv preprint arXiv:2305.16704.
[35] Transformers Can Learn Meta-skills for Task Generalization in In-Context Learning. NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward, 2024.
[36] How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? International Conference on Machine Learning, 2024.
[37] In-context Convergence of Transformers. International Conference on Machine Learning, 2024.
[38] Revisiting the equivalence of in-context learning and gradient descent: The impact of data distribution. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[39] What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning. The 61st Annual Meeting Of The Association For Computational Linguistics.
[40] In-Context Learning Learns Label Relationships but Is Not Conventional Learning. The Twelfth International Conference on Learning Representations.
[41] In-context learning of a linear transformer block: benefits of the mlp component and one-step gd initialization. Advances in Neural Information Processing Systems.
[42] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo… 2022.
[43] Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? Forty-first International Conference on Machine Learning.
[44] Language models are unsupervised multitask learners. OpenAI blog.
[45] Efficient Low-Dimensional Compression of Overparameterized Models. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024.
[46] Vidal, Ren… IEEE Signal Processing Magazine, 2011.
[47] Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in neural information processing systems.
[48] Transformers Meet In-Context Learning: A Universal Approximation Theory.
[49] Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering. arXiv preprint arXiv:2409.02426.
[50] Understanding How Nonlinear Layers Create Linearly Separable Features for Low-Dimensional Data. arXiv preprint arXiv:2501.02364.
[51]

Forty-second International Conference on Machine Learning , year=

Test-Time Training Provably Improves Transformers as In-context Learners , author=. Forty-second International Conference on Machine Learning , year=
[52]

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =

Provable Benefits of Task-Specific Prompts for In-context Learning , author =. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , pages =. 2025 , editor =

2025
[53]

The Twelfth International Conference on Learning Representations , year=

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? , author=. The Twelfth International Conference on Learning Representations , year=
[54]

A Global Geometric Analysis of Maximal Coding Rate Reduction. Forty-first International Conference on Machine Learning.
[55]

Learning Dynamics of Deep Matrix Factorization Beyond the Edge of Stability. The Thirteenth International Conference on Learning Representations.
[56]

An Overview of Low-Rank Structures in the Training and Adaptation of Large Models. arXiv preprint arXiv:2503.19859.
[57]

Wang, Zengzhi and Xie, Qiming and Feng, Yi and Ding, Zixiang and Yang, Zinong and Xia, Rui. Is
[58]

Improving In-Context Learning with Prediction Feedback for Sentiment Analysis. Findings of the Association for Computational Linguistics: ACL 2024.
[59]

Retrieval-style In-context Learning for Few-shot Hierarchical Text Classification. Transactions of the Association for Computational Linguistics.
[60]

In-context Examples Selection for Machine Translation. Findings of the Association for Computational Linguistics: ACL 2023.
[61]

Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.
[62]

When can in-context learning generalize out of task distribution? Forty-second International Conference on Machine Learning.
[63]

Evaluating In-Context Learning of Libraries for Code Generation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
[64]

Large language model-aware in-context learning for code generation. ACM Transactions on Software Engineering and Methodology.
[65]

Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[66]

Automated program repair in the era of large pre-trained language models. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023.
[67]

Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica, 2004.
[68]

Attention is all you need. Advances in Neural Information Processing Systems.
[69]

Scaling Laws for Neural Language Models. 2020.
[70]

On the spectral bias of two-layer linear networks. Thirty-seventh Conference on Neural Information Processing Systems.
[71]

Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective. The 29th International Conference on Artificial Intelligence and Statistics.
[72]

Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity. 2022.
[73]

From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. The Thirteenth International Conference on Learning Representations.
[74]

The Implicit Bias of Depth: How Incremental Learning Drives Generalization. International Conference on Learning Representations.
[75]

Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems.
[76]

Efficient Low-Dimensional Compression of Overparameterized Models. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024.
[77]

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning. International Conference on Learning Representations.
