Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

Alexander Cloninger; Alex Havrilla; Rongjie Lai; Wenjing Liao; Zhaiming Shen

arxiv: 2505.03205 · v3 · pith:UOV3EJSInew · submitted 2025-05-06 · cs.LG · cs.NA · math.NA · math.ST · stat.TH

classification cs.LG · cs.NA · math.NA · math.ST · stat.TH

keywords transformers · data · manifold · input · learning · noisy · task-level · tasks

Transformers serve as the foundational architecture for large language and video generation models such as GPT, BERT, Sora, and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structure, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data and tasks, yet the theoretical understanding of this dependence remains largely undeveloped. This work establishes a theoretical foundation by analyzing the performance of transformers on regression tasks involving noisy input data near a manifold. Specifically, the input data lie in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto this manifold, referred to as the task-level manifold. We prove approximation and generalization error bounds that depend crucially on the intrinsic dimension of the task-level manifold. Our results demonstrate that transformers can leverage low-complexity structure in learning tasks even when the input data are perturbed by high-dimensional noise. Our novel proof technique constructs representations of basic arithmetic operations by transformers, which may be of independent interest.
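The data model in the abstract can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes the manifold is the unit circle embedded in the first two coordinates of R^D, and the names `sample_noisy_point`, `project_to_circle`, and `target` are hypothetical. It draws inputs from a tubular neighborhood of the circle and evaluates a regression target that depends only on the projection of the input onto the circle.

```python
import math
import random


def project_to_circle(x):
    """Nearest-point projection onto the unit circle in the first two coordinates."""
    r = math.hypot(x[0], x[1])
    return [x[0] / r, x[1] / r] + [0.0] * (len(x) - 2)


def sample_noisy_point(dim, radius, rng):
    """Draw a point on the circle, then perturb it in the normal space by less than `radius`.

    Assumption: radius < 1 (the reach of the unit circle), so the projection is unique.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    base = [math.cos(theta), math.sin(theta)]
    # Normal space at the base point: the radial direction plus the dim - 2 remaining axes.
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim - 1)]
    norm = math.sqrt(sum(v * v for v in noise))
    scale = radius * rng.random() / norm
    radial = noise[0] * scale
    rest = [v * scale for v in noise[1:]]
    x = [base[0] * (1.0 + radial), base[1] * (1.0 + radial)] + rest
    return x, theta


def target(x):
    """Illustrative ground truth: depends on x only through its projection onto the circle."""
    p = project_to_circle(x)
    return math.sin(2.0 * math.atan2(p[1], p[0]))


rng = random.Random(0)
x, theta = sample_noisy_point(dim=50, radius=0.5, rng=rng)
print(len(x), abs(target(x) - math.sin(2.0 * theta)) < 1e-9)
```

In this toy setting the ambient dimension is 50 and the noise fills all 49 normal directions, yet the target is a function of a single angle, which is the kind of gap between ambient and intrinsic dimension the abstract refers to.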

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
cs.LG 2026-05 unverdicted novelty 6.0

Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.

A Mathematical Explanation of Transformers
cs.LG 2025-10 unverdicted novelty 5.0

The Transformer is interpreted as discretization of a structured integro-differential equation in continuous domains for tokens and features, unifying attention, feedforward, and normalization via operator and variati...