arxiv: 2211.12316 · v2 · pith:BX6Z2CHBnew · submitted 2022-11-22 · 💻 cs.LG · cs.CL

Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

keywords: transformers, functions, boolean, sensitivity, models, recurrent, despite, generalization

Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
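The abstract does not define sensitivity, so the sketch below assumes the standard notion from Boolean function analysis: the sensitivity of f at an input x is the number of single-bit flips of x that change f(x), and the average sensitivity is the mean of that count over all inputs. The function names (`sensitivity_at`, `average_sensitivity`, `sparse_parity`) are illustrative and not taken from the paper. It shows why a sparse function, one that depends on only a few of its n input bits, has low sensitivity compared with full parity.

```python
from itertools import product


def sensitivity_at(f, x):
    """Count the single-bit flips of x that change the value of f."""
    fx = f(x)
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != fx:
            count += 1
    return count


def average_sensitivity(f, n):
    """Average sensitivity of f over all 2**n inputs (brute force, small n only)."""
    total = sum(sensitivity_at(f, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n


def sparse_parity(relevant):
    """Parity over the given index subset; all other bits are ignored."""
    def f(x):
        return sum(x[i] for i in relevant) % 2
    return f


def full_parity(x):
    """Parity over every input bit."""
    return sum(x) % 2


n = 8
sparse = average_sensitivity(sparse_parity((0, 3, 5)), n)  # depends on 3 of 8 bits
dense = average_sensitivity(full_parity, n)                # depends on all 8 bits
print(sparse, dense)  # 3.0 8.0
```

Under this definition a parity over k relevant bits has average sensitivity exactly k regardless of n, so sparse parities sit at the low-sensitivity end while full parity attains the maximum value n.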

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Looped Transformers with Layer Normalization Provably Learn the Power Method

    cs.LG 2026-05 unverdicted novelty 8.0

    Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.

  2. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  3. On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

    cs.LG 2026-01 unverdicted novelty 5.0

    Preconditioned gradient descent mitigates spectral bias and reduces grokking delays by enabling uniform parameter space exploration in the NTK regime, confirming grokking as a transition to the rich regime.