Understanding in-context learning on structured manifolds: Bridging attention to kernel methods

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao · 2025 · cs.LG · arXiv 2506.10959

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with respect to the prompt length with the exponent depending on the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
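The attention-to-kernel connection described in the abstract can be illustrated with a textbook identity: softmax attention with scores equal to negative scaled squared distances gives the same prediction as Nadaraya-Watson regression with a Gaussian kernel. The sketch below is not the paper's construction, which the abstract does not spell out. The function names, the bandwidth value, the unit-circle prompt (a 1-dimensional manifold in a 2-dimensional ambient space), and the target function are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def attention_predict(x_query, X_prompt, y_prompt, bandwidth=0.3):
    """Single-query softmax attention with hand-set distance-based scores.

    Assumption: the score for prompt point x_i is -||x_q - x_i||^2 / (2 h^2).
    """
    sq_dists = np.sum((X_prompt - x_query) ** 2, axis=1)
    scores = -sq_dists / (2.0 * bandwidth ** 2)
    weights = softmax(scores)          # attention weights, sum to 1
    return float(weights @ y_prompt)   # weighted average of prompt labels


def nadaraya_watson(x_query, X_prompt, y_prompt, bandwidth=0.3):
    """Classical Gaussian-kernel Nadaraya-Watson estimator."""
    sq_dists = np.sum((X_prompt - x_query) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return float(k @ y_prompt / k.sum())


# Hypothetical toy prompt: 200 points on the unit circle in R^2.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
X_prompt = np.stack([np.cos(theta), np.sin(theta)], axis=1)
y_prompt = np.sin(2.0 * theta)         # smooth target on the manifold

theta_q = 1.0
x_query = np.array([np.cos(theta_q), np.sin(theta_q)])

pred_attn = attention_predict(x_query, X_prompt, y_prompt)
pred_nw = nadaraya_watson(x_query, X_prompt, y_prompt)
print(pred_attn, pred_nw)              # the two predictions coincide
```

The two functions agree up to floating-point error because normalizing the exponentiated scores is exactly the kernel-weight normalization. In a trained transformer the scores are learned rather than hand-set; the abstract reports only that the learned query-prompt scores correlate highly with the Gaussian kernel.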

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

A single-head softmax transformer with O(log(1/ε)) blocks and O(√(N/ε)) MLP width implements preconditioned Richardson iteration to achieve ε-accurate Gaussian KRR predictions on length-N prompts under bounded data.

Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

cs.LG · 2025-05-06 · unverdicted · novelty 7.0

Transformers achieve approximation and generalization error bounds for noisy manifold regression that scale with the intrinsic dimension of the task-level manifold.

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
cs.LG · 2026-05-08 · unverdicted · none · ref 33 · 2 links

Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights
cs.LG · 2025-05-06 · unverdicted · none · ref 12

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
cs.LG · 2026-05-06 · unverdicted · none · ref 32
