Investigation into In-Context Learning Capabilities of Transformers

Arya Mazumdar; Leo Bangayan; Rushil Chandrupatla; Sebastian Leng

arxiv: 2604.25858 · v2 · pith:PK5JOMRFnew · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 23:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords in-context learning · transformers · scaling behavior · Gaussian mixture models · binary classification · benign overfitting · synthetic data · empirical study

The pith

Transformers succeed at in-context binary classification when input dimension, signal strength, and contextual information satisfy specific geometric thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper maps the scaling behavior of in-context learning accuracy for Gaussian-mixture binary classification tasks in transformers. It shows how performance depends on input dimension, the number of in-context examples, and the number of pre-training tasks, using a controlled synthetic setup to isolate when models infer task structure from context alone. The work also identifies regimes where benign overfitting occurs, letting models memorize noisy labels while generalizing well on clean test data. A sympathetic reader would care because these empirical patterns clarify when in-context learning works without any task-specific fine-tuning and when it fails.

Core claim

Building on the linear in-context classifier framework, the authors demonstrate through systematic sweeps that in-context test accuracy for Gaussian-mixture classification is governed by the interplay of dimensionality, signal-to-noise ratio, and the volume of contextual information, producing a detailed map of parameter regions where transformers successfully extract and apply task structure from examples at inference time.

What carries the argument

Linear in-context classifier formulation inside a controlled synthetic Gaussian-mixture data setup that isolates the geometric conditions for inferring task structure from context alone.
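
A minimal sketch of this kind of setup, assuming identity-covariance noise and a plain label-weighted average as the in-context classifier. The function names, the parameter `signal_norm` (standing in for R), and the choice to treat any learned weight matrix as the identity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_task(d, n_context, signal_norm, rng):
    """Draw one Gaussian-mixture task: a hidden class mean plus labeled context points."""
    mu = rng.standard_normal(d)
    mu *= signal_norm / np.linalg.norm(mu)        # hidden mean with ||mu|| = signal_norm
    y = rng.choice([-1.0, 1.0], size=n_context)   # random +/-1 labels
    x = y[:, None] * mu + rng.standard_normal((n_context, d))  # x_i = y_i * mu + noise
    return mu, x, y

def linear_icl_predict(x_ctx, y_ctx, x_query):
    """Average the label-weighted context points, then read out the sign on each query."""
    w = (y_ctx[:, None] * x_ctx).mean(axis=0)
    return np.sign(x_query @ w)

def icl_accuracy(d, n_context, signal_norm, n_tasks=200, n_query=50, seed=0):
    """Mean query accuracy over freshly sampled tasks (no training involved)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_tasks):
        mu, x, y = sample_task(d, n_context, signal_norm, rng)
        yq = rng.choice([-1.0, 1.0], size=n_query)
        xq = yq[:, None] * mu + rng.standard_normal((n_query, d))
        total += np.mean(linear_icl_predict(x, y, xq) == yq)
    return total / n_tasks
```

Calling `icl_accuracy` over a grid of `d`, `n_context`, and `signal_norm` gives a rough analogue of the sweeps described here, although it leaves out pre-training entirely and so says nothing about the number of pre-training tasks.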

If this is right

In-context accuracy rises with more examples only up to a limit set by the input dimension relative to signal strength.
Benign overfitting appears reliably in intermediate signal-to-noise and dimensionality ranges, preserving generalization on clean data despite noisy in-context labels.
Greater diversity of pre-training tasks expands the region where context alone suffices to recover the underlying classification rule.
Failure occurs predictably when dimensionality grows faster than the available signal or contextual information.
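
The points above can be read through a back-of-envelope margin calculation for a plain averaging classifier. This is a hedged sketch, not a derivation taken from the paper: it assumes identity-covariance noise, a class mean $\mu$ with $\|\mu\| = R$, clean labels, and the estimator $w = \frac{1}{N}\sum_i y_i x_i$. The symbols $z$, $\bar z$, and $\Phi$ (the standard normal CDF) are introduced here for illustration, and the last line treats the margin as approximately Gaussian.

```latex
\begin{align*}
x_i &= y_i \mu + z_i, \qquad z_i \sim \mathcal{N}(0, I_d), \qquad \|\mu\| = R, \\
w &= \frac{1}{N}\sum_{i=1}^{N} y_i x_i = \mu + \bar z,
  \qquad \bar z \sim \mathcal{N}\!\left(0, \tfrac{1}{N} I_d\right), \\
y\, w^\top x &= R^2 + \mu^\top (y z) + \bar z^\top \mu + \bar z^\top (y z)
  \quad \text{for a query } x = y\mu + z, \\
\mathbb{E}\!\left[y\, w^\top x\right] &= R^2, \qquad
\operatorname{Var}\!\left[y\, w^\top x\right] = R^2\left(1 + \tfrac{1}{N}\right) + \tfrac{d}{N}, \\
\text{accuracy} &\approx
  \Phi\!\left(\frac{R^2}{\sqrt{R^2\left(1 + \tfrac{1}{N}\right) + \tfrac{d}{N}}}\right).
\end{align*}
```

Under these assumptions the argument of $\Phi$ grows with $N$ but saturates at $R$ as $N \to \infty$, and when $d \gg N R^2$ it behaves like $R^2\sqrt{N/d}$, so accuracy falls toward chance once $d$ outpaces $N R^4$.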

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If these scaling patterns transfer to natural data, practitioners could estimate required context length from simple dimension and signal measurements before deployment.
The same geometric view might explain why in-context learning sometimes emerges suddenly with scale in other sequence modeling settings.
Testing the map on nonlinear or multi-class variants of the mixture model would show whether the linear approximation is the main driver or merely a convenient proxy.

Load-bearing premise

A linear classifier applied to the synthetic Gaussian-mixture examples is sufficient to reveal the actual conditions under which transformers perform in-context learning.

What would settle it

Running the same Gaussian-mixture tasks on actual transformer models and finding that test accuracy scales differently from the reported map once dimensionality exceeds the number of in-context examples by a large margin.

Figures

Figures reproduced from arXiv: 2604.25858 by Arya Mazumdar, Leo Bangayan, Rushil Chandrupatla, Sebastian Leng.

**Figure 1.** Model Performance (d=50, N=20, B=1000, R=snr)

**Figure 2.** Model Performance (d=50, N=20, B=1000, R=const)

**Figure 3.** Model Performance (d=1000, N=20, B=1000, R=const)

**Figure 4.** Model Performance (d=1000, N=20, B=1000, R=snr)

**Figure 5.** Model Performance (d=500, N=5, B=1000, R=const)

**Figure 6.** Model Performance (d=500, N=5, B=1000, R=snr)

**Figure 7.** Model Performance (d=500, N=80, B=1000, R=const)

**Figure 8.** Model Performance (d=500, N=80, B=1000, R=snr)

**Figure 9.** Model Performance (d=500, N=20, B=50, R=const)

**Figure 10.** Model Performance (d=500, N=20, B=50, R=snr)

**Figure 11.** Model Performance (d=500, N=20, B=2000, R=const)

**Figure 12.** Model Performance (d=500, N=20, B=1000, R=const)

**Figure 13.** Model Performance (d=500, N=20, B=1000, R=snr, noise=0.20)

**Figure 14.** Model Performance (d=100, N=20, B=1000, R=snr, noise=0.20)

**Figure 15.** Model Performance (d=1000, N=20, B=1000, R=const, noise=0.20)

**Figure 16.** Model Performance (d=500, N=20, B=1000, R=snr, noise=0.40)

**Figure 17.** Model Performance (Benign Overfitting Case)

**Figure 18.** Model Performance (Non-Benign Overfitting Case)

**Figure 19.** Model Performance (d=1500, N=20, B=1500, R=1.35, noise=0.20)

**Figure 20.** Model Performance (d=1500, N=20, B=1500, R=8.97, noise=0.20)

**Figure 21.** Model Performance (d=500, N=20, B=1000, R=6.45, noise=0.30)

**Figure 22.** Model Performance (d=1000, N=20, B=1000, R=7.94, noise=0.30)

**Figure 23.** ChatGpt-4-mini results

**Figure 24.** Gemini-2.0-mini results

**Figure 25.** Gemini-2.0-mini results

**Figure 26.** ChatGpt-4-mini results aggregated as dimension increases

**Figure 27.** ChatGpt-4-mini validation vs ICL as context length increases

**Figure 28.** ChatGpt-4-mini vs Gemini 2.0 flash as context examples increase

Original abstract

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input-output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three fundamental factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still achieving strong generalization performance on clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide a comprehensive empirical map of scaling behavior in in-context classification, highlighting the critical role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails.
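
The benign overfitting described in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the paper's experiment: it uses a plain averaging classifier with no trained transformer, identity-covariance noise, and independent label flips with probability `flip_prob`. The function name and the default values (`d=4000`, `n_context=20`, `signal_norm=10.0`) are illustrative choices, not settings from the paper.

```python
import numpy as np

def benign_overfitting_demo(d=4000, n_context=20, signal_norm=10.0,
                            flip_prob=0.2, n_tasks=100, n_query=50, seed=0):
    """Return (fit to noisy context labels, accuracy on clean queries), averaged over tasks."""
    rng = np.random.default_rng(seed)
    fit_noisy, test_clean = 0.0, 0.0
    for _ in range(n_tasks):
        mu = rng.standard_normal(d)
        mu *= signal_norm / np.linalg.norm(mu)               # hidden class mean
        y_clean = rng.choice([-1.0, 1.0], size=n_context)
        x = y_clean[:, None] * mu + rng.standard_normal((n_context, d))
        flips = rng.random(n_context) < flip_prob            # corrupt a fraction of labels
        y_noisy = np.where(flips, -y_clean, y_clean)
        w = (y_noisy[:, None] * x).mean(axis=0)              # classifier built from noisy labels
        # Memorization: does the classifier reproduce the noisy labels on the context points?
        fit_noisy += np.mean(np.sign(x @ w) == y_noisy)
        # Generalization: accuracy on fresh queries with clean labels.
        yq = rng.choice([-1.0, 1.0], size=n_query)
        xq = yq[:, None] * mu + rng.standard_normal((n_query, d))
        test_clean += np.mean(np.sign(xq @ w) == yq)
    return fit_noisy / n_tasks, test_clean / n_tasks
```

In this sketch each context point contributes roughly `d / n_context` to its own score, which is what lets the classifier reproduce flipped labels when the dimension is large, while the averaged signal term still dominates on fresh queries. Lowering `signal_norm` or raising `flip_prob` should move the output toward the non-benign case.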

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This paper maps scaling regimes for ICL success and benign overfitting in a synthetic Gaussian-mixture task but rests on a linear classifier proxy that may not match real transformer behavior.

Colleague, the main thing to know is that this work delivers a systematic empirical map of when in-context learning succeeds on Gaussian-mixture binary classification, tracking accuracy across input dimension, context length, task count, and signal strength while also charting where benign overfitting emerges. It extends Frei and Vardi by filling in the scaling picture with controlled sweeps rather than deriving new theory from scratch.

The controlled synthetic setup and parameter sweeps are the clear strength here. They isolate geometric factors cleanly and show concrete regions where models generalize despite noisy labels, which gives a practical handle on the conditions the earlier theory left open. That kind of data is useful for anyone trying to connect abstract guarantees to observable behavior.

The soft spot is the linear in-context classifier they use to model inference from context. Transformers are nonlinear and can implement richer functions through attention, so the reported scaling boundaries and overfitting regimes might describe only the linear surrogate rather than the actual trained models, especially at high dimensions or low signal. The abstract ties the claims directly to this formulation, which makes the gap worth probing.

This paper is for researchers already working on ICL mechanisms and scaling in transformers, not for a general audience. A reader focused on synthetic benchmarks or theory-empirical bridges would find the maps worth looking at. I would send it to peer review; the experiments are thoughtfully scoped and the extension is legitimate even if the linear assumption needs closer examination in revisions.

Referee Report

1 major / 2 minor

Summary. The paper conducts a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks in transformers. Building on prior theoretical work, it examines how in-context test accuracy scales with input dimension, number of in-context examples, and number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, the authors isolate geometric conditions for successful task inference from context and characterize regimes of benign overfitting across sweeps of dimensionality, sequence length, task diversity, and signal-to-noise.

Significance. If the linear in-context classifier formulation accurately reflects transformer behavior, the work delivers a valuable empirical map of ICL scaling in classification, clarifying the roles of dimensionality, signal strength, and contextual information. The extensive parameter sweeps and focus on benign overfitting provide concrete guidance for when ICL succeeds or fails, which could inform both theory and model design.

major comments (1)

[Abstract] Abstract and linear in-context classifier formulation: the central claims about transformer ICL scaling and benign-overfitting regions rest on using a linear proxy to isolate geometric conditions. Transformers implement nonlinear functions via attention and feed-forward layers, so the proxy may diverge from actual decision boundaries in high-dimensional or low-signal regimes. Direct validation (e.g., comparison of linear vs. transformer outputs on held-out contexts) is needed to confirm the reported maps characterize the models studied rather than only the surrogate.

minor comments (2)

[Experimental Setup] The description of experimental details (data exclusion rules, exact training procedure, and number of runs) is insufficient to assess reproducibility; adding these would strengthen the manuscript.
[Figures] Figures reporting accuracy vs. dimension or sequence length should include error bars or shaded regions indicating variability across random seeds.
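
The variability reporting requested in the [Figures] comment could be produced along these lines. This is a hedged sketch: `summarize_over_seeds` and `toy_accuracy` are hypothetical names, and `toy_accuracy` is only a stand-in for one experiment run (an averaging classifier on a single synthetic task), not the authors' training procedure.

```python
import numpy as np

def summarize_over_seeds(run_fn, seeds):
    """Return mean, sample standard deviation, and standard error of a metric over seeds."""
    scores = np.array([run_fn(seed) for seed in seeds], dtype=float)
    mean = scores.mean()
    std = scores.std(ddof=1) if len(scores) > 1 else 0.0
    return mean, std, std / np.sqrt(len(scores))

def toy_accuracy(seed, d=50, n_context=20, signal_norm=2.0, n_query=500):
    """Hypothetical stand-in for one run: query accuracy on one Gaussian-mixture task."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d)
    mu *= signal_norm / np.linalg.norm(mu)
    y = rng.choice([-1.0, 1.0], size=n_context)
    x = y[:, None] * mu + rng.standard_normal((n_context, d))
    w = (y[:, None] * x).mean(axis=0)
    yq = rng.choice([-1.0, 1.0], size=n_query)
    xq = yq[:, None] * mu + rng.standard_normal((n_query, d))
    return float(np.mean(np.sign(xq @ w) == yq))
```

For a plot, the returned mean would be the curve and the standard deviation or standard error the shaded band, computed separately at each value of the swept parameter, for example `summarize_over_seeds(toy_accuracy, range(10))`.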

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address the major comment on the linear in-context classifier formulation below.

Referee: [Abstract] Abstract and linear in-context classifier formulation: the central claims about transformer ICL scaling and benign-overfitting regions rest on using a linear proxy to isolate geometric conditions. Transformers implement nonlinear functions via attention and feed-forward layers, so the proxy may diverge from actual decision boundaries in high-dimensional or low-signal regimes. Direct validation (e.g., comparison of linear vs. transformer outputs on held-out contexts) is needed to confirm the reported maps characterize the models studied rather than only the surrogate.

Authors: We appreciate the referee's observation that transformers can implement nonlinear functions. The linear in-context classifier is not an arbitrary proxy but is directly motivated by the analysis in Frei and Vardi (2024), which establishes that, for the Gaussian-mixture tasks considered here, the attention mechanism in transformers realizes a linear classifier whose weights are inferred from the in-context examples. Our empirical study therefore maps scaling and benign-overfitting regimes inside the setting where this equivalence holds, allowing us to isolate geometric effects of dimension, context length, and task diversity. We agree that explicit validation would strengthen the link to full transformer behavior. In the revised manuscript we will add a dedicated subsection that compares the linear classifier outputs against the actual transformer predictions on held-out contexts, reporting agreement rates and selected decision-boundary visualizations across representative high-dimensional and low-signal regimes.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in this empirical study

The paper is a systematic empirical investigation of transformer in-context learning on synthetic Gaussian-mixture tasks, relying on experimental sweeps over dimensionality, sequence length, task diversity, and signal-to-noise regimes rather than on any chain of mathematical derivation. It builds on the external theoretical framework of Frei and Vardi (2024) and does not depend on load-bearing self-citations or on uniqueness theorems imported from the present authors. The linear in-context classifier formulation is explicitly part of the controlled synthetic setup used to isolate geometric conditions, and it does not reduce any reported prediction or scaling map to fitted inputs by construction; all central claims rest on observed empirical outcomes that remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard machine-learning assumptions for synthetic data generation and linear classification.

pith-pipeline@v0.9.0 · 5778 in / 1141 out tokens · 31679 ms · 2026-05-20T23:41:48.490510+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1] Frei, Spencer, and Gal Vardi. 2024. “Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context.”

[2] Garg, Shivam, Dimitris Tsipras, Percy Liang, and Gregory Valiant. 2023. “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes.”
