Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters

Debarghya Ghoshdastidar; Yaomengxi Han

arxiv: 2502.07553 · v2 · submitted 2025-02-11 · 💻 cs.LG

Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters

Yaomengxi Han , Debarghya Ghoshdastidar This is my paper

Pith reviewed 2026-05-23 03:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords transformerssparse parityfeature learningXOR functionpolylogarithmic parametersattention mechanismgradient descentsample complexity

0 comments

The pith

A single-layer two-head Transformer learns sparse XOR using O(polylog(d)) parameters and one gradient step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Transformers can identify the two unknown coordinates whose product determines the label and reduce loss to nearly zero on all inputs, using only polylogarithmic parameters. This capability stands in contrast to feedforward networks, which require linearly many parameters in the dimension d to achieve comparable feature selection on the same sparse parity task. The analysis focuses on a minimal architecture with exact softmax attention and proves that a single gradient update suffices for feature discovery under random initialization. The work also supplies a sample-complexity bound showing that finite data suffices for generalization. Empirical comparisons indicate that the exact softmax computation, rather than linear or elementwise substitutes, drives the rapid learning.

Core claim

We prove that a single-layer, two-head Transformer with O(polylog(d)) trainable parameters can discover the unknown indices i* and j* and drive the loss to nearly zero for every input with one gradient step, when labels are generated as y = -x_i* x_j*.

What carries the argument

Exact softmax attention in a two-head single-layer Transformer that computes pairwise interactions to isolate the relevant feature pair.

If this is right

Transformers overcome the Ω(d) parameter lower bound that holds for feedforward networks on this task.
Feature discovery and loss minimization occur after a single gradient step.
Exact softmax attention is necessary for the observed rapid learning, while linear and component-wise attention fail to match it.
A finite-sample generalization bound holds, establishing that the model succeeds from polynomially many examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention-driven mechanism may allow parameter-efficient learning of other sparse Boolean functions whose support size is constant.
Replacing softmax with cheaper approximations could eliminate the performance gap observed in the experiments.
Extending the construction to multi-layer Transformers might handle parity functions of higher but still constant degree without increasing the parameter count beyond polylog(d).

Load-bearing premise

The analysis assumes labels are generated exactly as the product of two fixed unknown coordinates and that the model uses exact softmax attention with a specific random initialization.

What would settle it

Running the same training procedure with linear attention instead of softmax and observing that loss does not reach near zero after one step on inputs where the two relevant coordinates are present.

Figures

Figures reproduced from arXiv: 2502.07553 by Debarghya Ghoshdastidar, Yaomengxi Han.

**Figure 1.** Figure 1: The architecture of the transformer and the example workflow to classify the parity of some given input. Given a binary string that consists of 7 tokens as input, the embedding layer (in green) will embed each token into a concatenation of a positional embedding and a token embedding wj = fpos(j) ◦ femb(xj ). An extra token embedding w0 will be prepended as the embedding of the CLS token. In the encoding … view at source ↗

**Figure 2.** Figure 2: Two heat maps of soft attention training. When there are no neighboring bits each head attend to a separate bit with a score very close to 1 (sub-figure on the left). If there exist neighboring bits (sub-figure on the right), a pair of attention heads could learn the same direction, which is in the middle of the positional embeddings of the neighboring bits. References Jimmy Ba, Murat A. Erdogdu, Taiji Suz… view at source ↗

read the original abstract

Learning sparse parity functions has become a theoretical testbed for studying feature learning in neural networks. However, existing analyses primarily focus on Feed-Forward Neural Networks (FFNNs). Meanwhile, theoretical understanding of Transformers in this setting remains limited, despite their empirical success and structural suitability for discovering sparse support over long sequences. To address this gap, we analyze how a single-layer, two-head Transformer learns the sparse XOR problem. Considering samples $(\mathbf{x}, y) \in \lbrace\pm 1\rbrace^d \times \lbrace\pm 1\rbrace$, where the label is defined by $y = -x_{i^*} x_{j^*}$ for some unknown $i^*, j^* \in [d]$, we prove that, with only $O(\mathrm{polylog}(d))$ trainable parameters, Transformers can successfully discover the relevant features and drive the loss for every input to nearly 0 with one gradient step. This result establishes that Transformers break the fundamental $\Omega(d)$ parameter bottleneck inherent to FFNNs for this problem. Furthermore, we empirically show that this rapid feature discovery is uniquely driven by the exact softmax attention, outperforming common substitutes such as linear or component-wise attention. Finally, we provide a theoretical sample complexity bound for learning from finite data, demonstrating the generalization ability of Transformers in this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Transformers learn sparse XOR with polylog parameters in one population gradient step, but the guarantee needs exact softmax and a tuned random initialization to align the gradient on the right pair.

read the letter

The main thing to know is that this paper proves a single-layer two-head Transformer with O(polylog(d)) parameters can identify the unknown indices i* and j* and drive population loss to near zero on every input in {±1}^d after one gradient step. That is the separation from the Ω(d) parameter cost shown for FFNNs on the same sparse parity task. They also run experiments showing that linear or componentwise attention versions do not achieve the same rapid feature discovery, which pins the result on the softmax normalization. The sample-complexity bound for finite data is included as well. These pieces are new relative to the FFNN-focused citations in the abstract. The proof strategy appears to work by showing that the population gradient on the shared Q/K projections, under the stated random initialization, points in a direction that boosts attention mass on the correct pair, after which the value head can recover the product. That part is cleanly scoped to the exact model and data distribution given. The soft spots are exactly where the stress-test note flags them. The one-step guarantee depends on the precise softmax to produce the required concentration and on the initialization variance plus positional embedding dimension so the gradient isolates the right pair out of d candidates. Small changes to either can send the update in a useless direction. The finite-sample result is stated but its dependence on d and failure probability is not visible in the abstract, so the practical reach is unclear. No load-bearing circularity or post-hoc fitting shows up. This is for readers working on architecture-specific feature learning bounds and attention mechanisms. It is narrow but addresses a documented gap with a concrete positive result, so it deserves a serious referee even if the assumptions stay stylized.

Referee Report

2 major / 2 minor

Summary. The paper claims that a single-layer two-head Transformer with O(polylog(d)) trainable parameters learns the sparse XOR problem (y = -x_{i*} x_{j*} for unknown i*, j* in [d]) by performing one gradient step on the population loss from a specific random initialization, driving the loss to nearly zero on all 2^d inputs and thereby breaking the Ω(d) parameter bottleneck of FFNNs. It further provides empirical evidence that exact softmax attention is essential for this rapid feature discovery (outperforming linear or componentwise substitutes) and derives a sample-complexity bound for finite-data generalization.

Significance. If the central derivation holds, the result supplies a concrete, architecture-specific explanation for why attention can achieve parameter-efficient feature learning on sparse parity tasks where FFNNs require Ω(d) parameters. The explicit sample-complexity bound and the controlled empirical isolation of exact softmax are additional strengths that make the contribution falsifiable and reproducible.

major comments (2)

[§3] §3 (main theorem and population-gradient calculation): the one-step guarantee that the updated attention concentrates exactly on the unknown pair (i*, j*) and yields loss ~0 for every x ∈ {±1}^d depends on the precise alignment of the population gradient with the fixed positional embeddings under a specific initialization variance; the manuscript should state the exact variance range and prove that the alignment is not an artifact of that choice.
[§3, §4] §3 and §4: the proof and the empirical claim both rely on exact softmax normalization to produce the required concentration; it is load-bearing to show whether the gradient direction remains isolating when softmax is replaced by any standard approximation (e.g., temperature-scaled or linear attention) or when numerical under/overflow occurs.

minor comments (2)

[Abstract, §2] Abstract and §2: the phrase “nearly 0” should be replaced by an explicit error bound (e.g., O(1/poly(d))) that is tracked through the proof.
[Notation] Notation section: clarify whether the positional embeddings are fixed random vectors or learned, and state their dimension relative to the model width.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments highlight important aspects of the initialization and the role of exact softmax. We address each below and will incorporate revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (main theorem and population-gradient calculation): the one-step guarantee that the updated attention concentrates exactly on the unknown pair (i*, j*) and yields loss ~0 for every x ∈ {±1}^d depends on the precise alignment of the population gradient with the fixed positional embeddings under a specific initialization variance; the manuscript should state the exact variance range and prove that the alignment is not an artifact of that choice.

Authors: We agree that the alignment relies on the initialization variance and that this should be made fully explicit. In the revised manuscript we will add a precise statement of the allowable variance range (σ² ∈ [c₁/log d, c₂/log d] for explicit constants c₁, c₂ > 0) together with a short lemma showing that the population-gradient inner product with the positional embedding vectors remains strictly positive and bounded away from zero uniformly throughout this interval. The proof of the lemma follows from a direct second-moment calculation on the random features and does not depend on any further special choice of σ beyond the stated interval, thereby confirming the result is not an artifact of a single variance value. revision: yes
Referee: [§3, §4] §3 and §4: the proof and the empirical claim both rely on exact softmax normalization to produce the required concentration; it is load-bearing to show whether the gradient direction remains isolating when softmax is replaced by any standard approximation (e.g., temperature-scaled or linear attention) or when numerical under/overflow occurs.

Authors: The theoretical analysis in §3 is derived under exact softmax; the normalization step is essential to the concentration argument and we do not claim the same one-step guarantee for approximate attention mechanisms. Our existing experiments in §4 already compare exact softmax against linear and component-wise attention and show that only exact softmax isolates the relevant pair. For the revision we will add a short empirical subsection reporting results for temperature-scaled softmax across a range of temperatures (including values that induce under/overflow in float32) and confirm that the isolating property degrades gracefully but is lost outside a narrow temperature band. We will also note in the text that a theoretical extension to approximate softmax is left for future work, as the current proof technique does not directly extend. revision: partial

Circularity Check

0 steps flagged

No significant circularity; proof is self-contained from architecture and data model

full rationale

The paper presents a direct theoretical proof that a single-layer two-head Transformer with exact softmax attention, under specific random initialization, achieves near-zero loss on the sparse XOR task after one population gradient step using only polylog(d) parameters. No steps reduce by construction to fitted quantities, self-citations, or renamed inputs; the derivation relies on the stated assumptions about the attention mechanism and initialization rather than smuggling in the target result. The result is framed as breaking the FFNN parameter bottleneck via explicit analysis of the gradient alignment with the unknown support indices, making the claim independent of the inputs by the paper's own equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the claim rests on the stated data model and architecture. No free parameters, invented entities, or additional axioms are visible.

axioms (1)

domain assumption Labels satisfy y = -x_i* x_j* for some unknown fixed i*, j* in [d], with x in {±1}^d
Defines the sparse XOR problem as stated in the abstract.

pith-pipeline@v0.9.0 · 5774 in / 1240 out tokens · 39032 ms · 2026-05-23T03:20:53.707058+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

transformers with k trainable attention heads can learn k-parity with only O(k) parameters
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

attention learning is necessary... fixed attention heads requires... ∥θ∥ · m² = O(n)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 5 internal anchors

[1]

Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang

Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradi- ent step improves the representation. In Sanmi Koyejo, S. Mohamed, A. Agar- wal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Infor- mation Processing Systems 35: Annual Conference on Neural...

work page 2022
[2]

Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche

URL http://papers.nips.cc/paper_files/paper/2022/hash/ f7e7fabd73b3df96c54a320862afcb78-Abstract-Conference.html. Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche. The power of hard attention transformers on data sequences: A formal language theoretic perspective. CoRR, abs/2405.16166,

work page arXiv 2022
[3]

Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche

doi: 10.48550/ARXIV.2405.16166. URL https://doi. org/10.48550/arXiv.2405.16166. Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom. Simplicity bias in transformers and their ability to learn sparse boolean functions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st An- nual Meeting of the Associatio...

work page doi:10.48550/arxiv.2405.16166 2023
[4]

URL https://doi.org/10.18653/v1/2023.acl-long.317

doi: 10.18653/V1/2023.ACL-LONG.317. URL https://doi.org/10.18653/v1/2023.acl-long.317. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kir- illov, and Sergey Zagoruyko. End-to-end object detection with transformers. CoRR, abs/2005.12872,

work page doi:10.18653/v1/2023.acl-long.317 2023
[5]

24 Attention Learning is Needed to Efficiently Learn Parity Function Amit Daniely and Eran Malach

URL https://arxiv.org/abs/2005.12872. 24 Attention Learning is Needed to Efficiently Learn Parity Function Amit Daniely and Eran Malach. Learning parities with neural networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Con- ference...

work page arXiv 2005
[6]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

URL https://proceedings.neurips.cc/paper/2020/hash/ eaae5e04a259d09af85c108fe4d7dd0c-Abstract.html. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North...

work page 2020
[7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Trans- formers for image recognition at scale. C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 2010
[8]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

URL https: //arxiv.org/abs/2010.11929. Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics , 8:156–171,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[10]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

doi: 10.1162/TACL \ A\ 00490. URL https://doi.org/10. 1162/tacl_a_00490. Yiwen Kou, Zixiang Chen, Quanquan Gu, and Sham M. Kakade. Matching the statistical query lower bound for k-sparse parity problems with stochastic gradient descent. CoRR, abs/2404.12376,

work page internal anchor Pith review doi:10.1162/tacl
[11]

URL https://doi.org/10

doi: 10.48550/ARXIV.2404.12376. URL https://doi.org/10. 48550/arXiv.2404.12376. Pierre Marion, Rapha¨ el Berthier, G´ erard Biau, and Claire Boyer. Attention layers prov- ably solve single-location regression. In The 13th International Conference on Learning Representations, ICLR 2025 . OpenReview.net,

work page doi:10.48550/arxiv.2404.12376 2025
[12]

William Merrill and Ashish Sabharwal

URL https://openreview.net/ forum?id=DVlPp7Jd7P. William Merrill and Ashish Sabharwal. A logic for expressing log-precision trans- formers. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Process- ing Systems 2...

work page 2023
[13]

25 Attention Learning is Needed to Efficiently Learn Parity Function William Merrill and Ashish Sabharwal

URL http://papers.nips.cc/paper_files/paper/2023/hash/ a48e5877c7bf86a513950ab23b360498-Abstract-Conference.html. 25 Attention Learning is Needed to Efficiently Learn Parity Function William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representa- tions, ICLR ...

work page 2023
[14]

Kuntal Kumar Pal, Kazuaki Kashihara, Ujjwala Anantheswaran, Kirby C

URL https://openreview.net/forum?id=NjNGlPh8Wh. Kuntal Kumar Pal, Kazuaki Kashihara, Ujjwala Anantheswaran, Kirby C. Kuznia, Siddhesh Jagtap, and Chitta Baral. Exploring the limits of transfer learning with unified model in the cybersecurity domain. CoRR, abs/2302.10346,

work page arXiv
[15]

doi: 10.48550/ARXIV.2302. 10346. URL https://doi.org/10.48550/arXiv.2302.10346. B.T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics , 3(4):864–878,

work page internal anchor Pith review doi:10.48550/arxiv.2302
[16]

doi: https: //doi.org/10.1016/0041-5553(63)90382-3

ISSN 0041-5553. doi: https: //doi.org/10.1016/0041-5553(63)90382-3. Zhenmei Shi, Junyi Wei, and Yingyu Liang. Provable guarantees for neural net- works via gradient feature learning. In Alice Oh, Tristan Naumann, Amir Glober- son, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neu- ral Information Processing Systems 36: Annual Conferen...

work page doi:10.1016/0041-5553(63)90382-3 2023
[18]

URL http://arxiv.org/abs/1706.03762. 26

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang

Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradi- ent step improves the representation. In Sanmi Koyejo, S. Mohamed, A. Agar- wal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Infor- mation Processing Systems 35: Annual Conference on Neural...

work page 2022

[2] [2]

Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche

URL http://papers.nips.cc/paper_files/paper/2022/hash/ f7e7fabd73b3df96c54a320862afcb78-Abstract-Conference.html. Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche. The power of hard attention transformers on data sequences: A formal language theoretic perspective. CoRR, abs/2405.16166,

work page arXiv 2022

[3] [3]

Pascal Bergstr¨ aßer, Chris K¨ ocher, Anthony Widjaja Lin, and Georg Zetzsche

doi: 10.48550/ARXIV.2405.16166. URL https://doi. org/10.48550/arXiv.2405.16166. Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom. Simplicity bias in transformers and their ability to learn sparse boolean functions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st An- nual Meeting of the Associatio...

work page doi:10.48550/arxiv.2405.16166 2023

[4] [4]

URL https://doi.org/10.18653/v1/2023.acl-long.317

doi: 10.18653/V1/2023.ACL-LONG.317. URL https://doi.org/10.18653/v1/2023.acl-long.317. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kir- illov, and Sergey Zagoruyko. End-to-end object detection with transformers. CoRR, abs/2005.12872,

work page doi:10.18653/v1/2023.acl-long.317 2023

[5] [5]

24 Attention Learning is Needed to Efficiently Learn Parity Function Amit Daniely and Eran Malach

URL https://arxiv.org/abs/2005.12872. 24 Attention Learning is Needed to Efficiently Learn Parity Function Amit Daniely and Eran Malach. Learning parities with neural networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Con- ference...

work page arXiv 2005

[6] [6]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

URL https://proceedings.neurips.cc/paper/2020/hash/ eaae5e04a259d09af85c108fe4d7dd0c-Abstract.html. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North...

work page 2020

[7] [7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Trans- formers for image recognition at scale. C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 2010

[8] [8]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

URL https: //arxiv.org/abs/2010.11929. Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics , 8:156–171,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[9] [10]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

doi: 10.1162/TACL \ A\ 00490. URL https://doi.org/10. 1162/tacl_a_00490. Yiwen Kou, Zixiang Chen, Quanquan Gu, and Sham M. Kakade. Matching the statistical query lower bound for k-sparse parity problems with stochastic gradient descent. CoRR, abs/2404.12376,

work page internal anchor Pith review doi:10.1162/tacl

[10] [11]

URL https://doi.org/10

doi: 10.48550/ARXIV.2404.12376. URL https://doi.org/10. 48550/arXiv.2404.12376. Pierre Marion, Rapha¨ el Berthier, G´ erard Biau, and Claire Boyer. Attention layers prov- ably solve single-location regression. In The 13th International Conference on Learning Representations, ICLR 2025 . OpenReview.net,

work page doi:10.48550/arxiv.2404.12376 2025

[11] [12]

William Merrill and Ashish Sabharwal

URL https://openreview.net/ forum?id=DVlPp7Jd7P. William Merrill and Ashish Sabharwal. A logic for expressing log-precision trans- formers. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Process- ing Systems 2...

work page 2023

[12] [13]

25 Attention Learning is Needed to Efficiently Learn Parity Function William Merrill and Ashish Sabharwal

URL http://papers.nips.cc/paper_files/paper/2023/hash/ a48e5877c7bf86a513950ab23b360498-Abstract-Conference.html. 25 Attention Learning is Needed to Efficiently Learn Parity Function William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representa- tions, ICLR ...

work page 2023

[13] [14]

Kuntal Kumar Pal, Kazuaki Kashihara, Ujjwala Anantheswaran, Kirby C

URL https://openreview.net/forum?id=NjNGlPh8Wh. Kuntal Kumar Pal, Kazuaki Kashihara, Ujjwala Anantheswaran, Kirby C. Kuznia, Siddhesh Jagtap, and Chitta Baral. Exploring the limits of transfer learning with unified model in the cybersecurity domain. CoRR, abs/2302.10346,

work page arXiv

[14] [15]

doi: 10.48550/ARXIV.2302. 10346. URL https://doi.org/10.48550/arXiv.2302.10346. B.T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics , 3(4):864–878,

work page internal anchor Pith review doi:10.48550/arxiv.2302

[15] [16]

doi: https: //doi.org/10.1016/0041-5553(63)90382-3

ISSN 0041-5553. doi: https: //doi.org/10.1016/0041-5553(63)90382-3. Zhenmei Shi, Junyi Wei, and Yingyu Liang. Provable guarantees for neural net- works via gradient feature learning. In Alice Oh, Tristan Naumann, Amir Glober- son, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neu- ral Information Processing Systems 36: Annual Conferen...

work page doi:10.1016/0041-5553(63)90382-3 2023

[16] [18]

URL http://arxiv.org/abs/1706.03762. 26

work page internal anchor Pith review Pith/arXiv arXiv