Provably Learning Attention with Queries
Pith reviewed 2026-05-16 11:25 UTC · model grok-4.3
The pith
Single-head attention regressors can be learned exactly with O(d^2) adaptive black-box queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that for a model with width d, there is an elementary algorithm to learn the parameters of single-head attention with O(d^2) queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks, then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. In the regime where the head dimension r ≪ d, single-head attention-based models can be learned with O(r d) queries via compressed sensing arguments. Under mild norm and margin conditions, parameters can be estimated to ε accuracy with a polynomial number of queries even when outputs are provided only up to additive tolerance. Multi-head attention is not 1
What carries the argument
The adaptive black-box query procedure that selects input sequences of vectors, observes outputs, and solves for the attention weight matrices via linear algebra or compressed sensing.
If this is right
- One-layer Transformers with single-head attention become learnable by combining the attention learner with any ReLU FFN learner.
- Epsilon-accurate parameter recovery is possible with polynomially many queries even under additive output noise.
- Query complexity reduces to O(r d) via compressed sensing whenever the head dimension is small relative to width.
- Multi-head attention requires extra structural assumptions such as orthogonality for identifiability and learnability.
Where Pith is reading between the lines
- The query complexity result suggests layer-wise recovery strategies could scale to deeper attention stacks.
- Non-identifiability of multi-head attention may explain the practical value of random head initialization during training.
- Similar black-box techniques might recover parameters of other attention variants such as linear or kernel attention.
- Efficient recovery could enable post-training verification or compression of deployed attention models.
Load-bearing premise
The target function is exactly a single-head attention regressor and the learner has adaptive black-box access to query any sequence of vectors.
What would settle it
Two distinct single-head attention parameter sets that produce identical outputs on every possible input sequence would show the parameters are not uniquely recoverable.
read the original abstract
We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target function. We begin with studying the learnability of the simplest formulation, that is, learning a single-head attention-based regressor with queries. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention with $O(d^2)$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, we show that, in the common regime where the head dimension $r \ll d$, single-head attention-based models can be learned with $O(rd)$ queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to $\varepsilon$ accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we consider the learnability of multi-head attention and show that they are not identifiable from queries, and hence, learnability in the same sense is not feasible without additional assumptions. We discuss potential approaches to learn multi-head attention-based models under certain structural assumptions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the learnability of Transformer sequence models under adaptive black-box query access to outputs. For single-head attention regressors of width d it claims an elementary algorithm recovering the exact query/key/value parameters with O(d²) queries; this extends to one-layer Transformers assuming an oracle for learning ReLU FFNs. When head dimension r ≪ d it obtains O(rd) queries via compressed-sensing arguments. It also proves polynomial-query robustness to additive noise under mild norm and margin conditions, and shows that multi-head attention is not identifiable from queries alone.
Significance. If the central claims hold, the work supplies the first explicit query-complexity bounds for recovering attention parameters, together with a clean separation between identifiable single-head and non-identifiable multi-head cases. The compressed-sensing reduction and the reduction to FFN learnability are technically standard yet cleanly applied; the non-identifiability result is a useful negative finding.
major comments (2)
- [Single-head attention learning section] The O(d²) algorithm for single-head attention (abstract and the corresponding theorem) asserts that an adaptive query strategy recovers the exact matrices despite the softmax nonlinearity, yet the manuscript provides no explicit description of the query sequences (e.g., repeated tokens, orthogonal bases, or norm-controlled vectors) that produce the required invertible linear systems. Without this construction the bound cannot be verified for generic parameter settings.
- [Compressed-sensing subsection] The compressed-sensing claim for the r ≪ d regime (abstract) relies on the value matrix being full-rank and on the chosen queries keeping softmax arguments in an algebraically invertible regime; neither condition is stated as an explicit assumption nor shown to hold generically.
minor comments (1)
- [Abstract] The abstract refers to an “elementary algorithm” without a high-level pseudocode sketch or step-by-step outline; adding one would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed reading and for identifying two points where the presentation of our algorithms requires additional explicit detail. Both comments are fair and we will revise the manuscript to address them. No standing objections remain after these clarifications.
read point-by-point responses
-
Referee: [Single-head attention learning section] The O(d²) algorithm for single-head attention (abstract and the corresponding theorem) asserts that an adaptive query strategy recovers the exact matrices despite the softmax nonlinearity, yet the manuscript provides no explicit description of the query sequences (e.g., repeated tokens, orthogonal bases, or norm-controlled vectors) that produce the required invertible linear systems. Without this construction the bound cannot be verified for generic parameter settings.
Authors: We agree that the current text describes the high-level recovery strategy (linear systems obtained after canceling the softmax via carefully chosen differences) but does not spell out the concrete adaptive query sequences. In the revision we will add an explicit construction in the proof of the O(d²) theorem: the learner first queries with scaled standard-basis vectors e_i repeated across positions to isolate each row of the query and key matrices, then perturbs with small orthogonal vectors drawn from a fixed basis to ensure the softmax arguments remain strictly positive and the resulting map is invertible by finite differences. This construction works for any fixed parameter matrices of bounded norm and yields exactly the claimed query count. revision: yes
-
Referee: [Compressed-sensing subsection] The compressed-sensing claim for the r ≪ d regime (abstract) relies on the value matrix being full-rank and on the chosen queries keeping softmax arguments in an algebraically invertible regime; neither condition is stated as an explicit assumption nor shown to hold generically.
Authors: The referee is correct; both conditions were used implicitly but not stated. In the revised manuscript we will add them as formal assumptions (value matrix V has full column rank r and all queried softmax inputs lie in an open set where the softmax is locally invertible). We will also include a short generic argument: for random Gaussian parameter matrices the value matrix is full-rank with probability 1, and a random perturbation of the query vectors of size O(1/d) keeps the arguments inside the invertible regime with high probability. The O(rd) bound then follows directly from standard compressed-sensing recovery. revision: yes
Circularity Check
No significant circularity; algorithm derives from explicit query constructions and external tools
full rationale
The paper's central result is an explicit elementary algorithm that recovers single-head attention parameters (query/key/value matrices) from O(d²) adaptive black-box queries. This construction is presented as self-contained and does not reduce to a fitted parameter renamed as a prediction, nor to a self-citation chain. The extension to one-layer Transformers invokes learnability of ReLU FFNs as an independent black-box subroutine. The O(rd) bound for r ≪ d is obtained via standard compressed-sensing arguments applied to the linearized attention scores, which are external to the paper. Multi-head non-identifiability is shown by exhibiting distinct parameter sets that produce identical input-output behavior, again without circular reduction. No load-bearing step collapses by definition or by self-citation to the target claim itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Mild norm and margin conditions on inputs and outputs for noisy case
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 4.1 ... two-row input X=[(u+e_j)^T; e_j^T] ... α(u;j)=σ(u^T w_j) ... u^T w_j = σ^{-1}((y-v_j^*)/(u^T v^*)) ... solve linear system
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Proposition 7.1 ... multi-head attention parameters are not identifiable from value queries
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.