Learning Linear Regression with Low-Rank Tasks in-Context

Kaito Takanami; Takashi Takahashi; Yoshiyuki Kabashima

arxiv: 2510.04548 · v2 · submitted 2025-10-06 · ❄️ cond-mat.dis-nn · cs.LG· stat.ML

Learning Linear Regression with Low-Rank Tasks in-Context

Kaito Takanami , Takashi Takahashi , Yoshiyuki Kabashima This is my paper

Pith reviewed 2026-05-18 09:52 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn cs.LGstat.ML

keywords in-context learninglinear attentionlow-rank regressiongeneralization errorphase transitionhigh-dimensional limitimplicit regularization

0 comments

The pith

A linear attention model on low-rank regression tasks shows a sharp phase transition in generalization error set by task structure in the high-dimensional limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies in-context learning by training a linear attention model on regression tasks that share a low-rank structure. It derives exact expressions for the distribution of the model's predictions and for the generalization error when the input dimension grows large. Finite pre-training data introduces random fluctuations that act as an implicit regularizer, improving performance without explicit penalties. A sharp phase transition appears in the generalization error, and the location of this transition is controlled by the rank and structure of the tasks. The results supply a concrete mathematical picture of how such models extract and exploit shared task structure.

Core claim

For a linear attention model trained on low-rank regression tasks, the distribution of predictions and the generalization error admit exact characterizations in the high-dimensional limit. Statistical fluctuations present in any finite pre-training set induce an implicit regularization. The generalization error undergoes a sharp phase transition whose location is governed by the underlying task structure.

What carries the argument

Linear attention model trained on low-rank regression tasks, analyzed in the high-dimensional limit to obtain exact expressions for predictions and error.

Load-bearing premise

The high-dimensional limit together with the linear attention architecture on low-rank tasks is representative enough to produce precise characterizations and a sharp phase transition.

What would settle it

Numerical simulations of the same linear attention model on low-rank tasks in large but finite dimensions that fail to display the predicted phase transition in generalization error would refute the claim.

read the original abstract

In-context learning (ICL) is a key building block of modern large language models, yet its theoretical mechanisms remain poorly understood. It is particularly mysterious how ICL operates in real-world applications where tasks have a common structure. In this work, we address this problem by analyzing a linear attention model trained on low-rank regression tasks. Within this setting, we precisely characterize the distribution of predictions and the generalization error in the high-dimensional limit. Moreover, we find that statistical fluctuations in finite pre-training data induce an implicit regularization. Finally, we identify a sharp phase transition of the generalization error governed by task structure. These results provide a framework for understanding how transformers learn to learn the task structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a clean high-dimensional asymptotic analysis of in-context learning for low-rank linear regression tasks in a linear attention model, with explicit characterizations of error and a phase transition.

read the letter

The main point is that this work derives the high-dimensional limits for prediction distributions and generalization error when training linear attention on low-rank regression tasks. It also shows that finite pre-training data fluctuations produce an implicit regularization effect and identifies a sharp phase transition in error that depends on task structure parameters like rank and scaling ratios. The derivations rely on standard tools such as random matrix theory or replica methods applied to the linear case, and the assumptions (Gaussian data, exact low-rank structure, linear attention) are stated upfront so the results stay within the model class. This is a solid piece of technical work that pins down how task structure influences ICL behavior in this setting. The implicit regularization angle is a useful addition because it ties finite-sample effects directly to the asymptotics without extra fitting. The analysis appears internally consistent with no load-bearing circularity or unfalsifiable steps. The main limitation is the restriction to linear attention and the specific high-dimensional scalings; these choices make the phase transition precise but limit direct carryover to nonlinear transformers or real data distributions. They do not overclaim transfer, which keeps the contribution proportionate. This is for readers already working in theoretical ICL or high-dimensional statistics who want exact characterizations rather than broad claims about full models. A serious referee should see it because the methods match the problem and the results are falsifiable within the stated framework. I would send it for review with the expectation that minor clarifications on the replica or RMT steps might come up.

Referee Report

0 major / 4 minor

Summary. The manuscript analyzes in-context learning in a linear attention model trained on low-rank linear regression tasks. In the high-dimensional limit with appropriate scalings for dimension d, context length, task rank, and number of pre-training samples, the authors derive precise characterizations of the distribution of predictions and the generalization error. They further show that statistical fluctuations arising from finite pre-training data induce an implicit regularization effect, and they identify a sharp phase transition in the generalization error that is governed by the underlying task structure. The analysis relies on standard tools such as random matrix theory or replica methods, with all modeling assumptions (linearity of attention, exact low-rank task structure, Gaussian data) declared explicitly and results stated to hold within this model class.

Significance. If the derivations hold, this provides a rigorous theoretical framework for understanding how transformers acquire task structure through in-context learning in structured regression settings. The high-dimensional asymptotic analysis yielding exact characterizations, the demonstration of implicit regularization from finite-sample fluctuations, and the identification of a sharp phase transition are notable strengths. Explicit declaration of assumptions and scope limitations to the linear attention model class strengthens the reliability of the claims.

minor comments (4)

[§2] §2 (Model definition): the precise form of the linear attention layer and the low-rank task generation process should be written with explicit matrix dimensions and scaling factors to avoid ambiguity when reproducing the high-dimensional limits.
[§4] §4 (Main results): while the phase transition is stated to be sharp in the d → ∞ limit, a brief remark on the rate of convergence or finite-d corrections would help readers assess practical relevance.
[Figures] Figure captions (throughout): ensure every parameter (e.g., task rank r, pre-training sample count N) appearing in plots is defined in the caption or immediately preceding text.
[Abstract, §1] Abstract and §1: the phrase 'precisely characterize' is appropriate for the asymptotic analysis but could be qualified once with a parenthetical reference to the specific scaling regime used.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The referee's summary accurately captures our main results on the high-dimensional characterization of predictions and generalization error for linear attention models trained on low-rank regression tasks, including the implicit regularization induced by finite-sample fluctuations and the sharp phase transition governed by task structure. We appreciate the recognition of our explicit modeling assumptions and the use of random matrix theory and replica methods.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives its characterizations of prediction distributions, generalization error, implicit regularization from finite-sample fluctuations, and the phase transition via high-dimensional asymptotic analysis of a linear attention model on low-rank tasks. It employs standard tools such as random matrix theory and replica methods under explicitly declared scalings and Gaussian assumptions. All load-bearing steps remain self-contained within the stated model class; no self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the derivation. The results are presented as holding within the analyzed setting rather than by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.0 · 5649 in / 1051 out tokens · 39703 ms · 2026-05-18T09:52:01.034554+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Result 3.1 (Decomposition of ICL prediction) … ŷ = ŷ_algo + ŷ_mem + ŷ_struct … replica method
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 5.1 (Spectrum of the Task Matrix) … ν_S(s) = (1−min(ρ,κ))δ(s) + … Marchenko-Pastur
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Result 4.2 … effective regularization coefficient ˆq̄ = O(1/α) > 0 only for finite pre-training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.