arxiv: 2601.16873 · v2 · submitted 2026-01-23 · 💻 cs.LG

Provably Learning Attention with Queries

Satwik Bhattamishra , Kulin Shah , Michael Hahn , Varun Kanade This is my paper

Pith reviewed 2026-05-16 11:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords attentiontransformerquery complexityblack-box learningparameter recoverycompressed sensingmulti-head attentionlearnability

0 comments p. Extension

The pith

Single-head attention regressors can be learned exactly with O(d^2) adaptive black-box queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines learning Transformer sequence models when only black-box access to their outputs is given, allowing adaptive queries with any sequence of vectors. It proves that single-head attention parameters for width d are recoverable by an elementary algorithm using O(d^2) queries. The approach extends directly to one-layer Transformers once ReLU feedforward networks are assumed learnable. Compressed sensing cuts the query count to O(r d) when the head dimension r is much smaller than d. The work also shows polynomial-query recovery to epsilon accuracy under mild noise, norm, and margin conditions, while establishing that multi-head attention is not identifiable from queries alone.

Core claim

We show that for a model with width d, there is an elementary algorithm to learn the parameters of single-head attention with O(d^2) queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks, then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. In the regime where the head dimension r ≪ d, single-head attention-based models can be learned with O(r d) queries via compressed sensing arguments. Under mild norm and margin conditions, parameters can be estimated to ε accuracy with a polynomial number of queries even when outputs are provided only up to additive tolerance. Multi-head attention is not 1

What carries the argument

The adaptive black-box query procedure that selects input sequences of vectors, observes outputs, and solves for the attention weight matrices via linear algebra or compressed sensing.

If this is right

One-layer Transformers with single-head attention become learnable by combining the attention learner with any ReLU FFN learner.
Epsilon-accurate parameter recovery is possible with polynomially many queries even under additive output noise.
Query complexity reduces to O(r d) via compressed sensing whenever the head dimension is small relative to width.
Multi-head attention requires extra structural assumptions such as orthogonality for identifiability and learnability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The query complexity result suggests layer-wise recovery strategies could scale to deeper attention stacks.
Non-identifiability of multi-head attention may explain the practical value of random head initialization during training.
Similar black-box techniques might recover parameters of other attention variants such as linear or kernel attention.
Efficient recovery could enable post-training verification or compression of deployed attention models.

Load-bearing premise

The target function is exactly a single-head attention regressor and the learner has adaptive black-box access to query any sequence of vectors.

What would settle it

Two distinct single-head attention parameter sets that produce identical outputs on every possible input sequence would show the parameters are not uniquely recoverable.

read the original abstract

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target function. We begin with studying the learnability of the simplest formulation, that is, learning a single-head attention-based regressor with queries. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention with $O(d^2)$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, we show that, in the common regime where the head dimension $r \ll d$, single-head attention-based models can be learned with $O(rd)$ queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to $\varepsilon$ accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we consider the learnability of multi-head attention and show that they are not identifiable from queries, and hence, learnability in the same sense is not feasible without additional assumptions. We discuss potential approaches to learn multi-head attention-based models under certain structural assumptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Single-head attention can be learned with O(d²) queries while multi-head attention is not identifiable from black-box queries.

read the letter

The main result is an elementary algorithm that learns single-head attention parameters with O(d²) queries, plus a compressed sensing version that drops to O(rd) when the head dimension r is much smaller than d. They also show multi-head attention cannot be identified from queries alone. This is useful because it gives concrete positive bounds for the single-head case and a clear negative for the multi-head setting that is common in practice. The extension to one-layer transformers by reducing to FFN learning is a reasonable modular step, and the robustness to noisy outputs under mild conditions is a practical addition. The citations to compressed sensing are appropriate and the arguments seem to build on standard techniques without circularity. The soft spot is the O(d²) algorithm itself. The abstract claims it exists but does not describe how the adaptive queries are chosen to disentangle the softmax nonlinearity and recover the exact matrices. If the construction relies on special input sequences that keep the attention scores in a regime where inversion is possible, it may not hold for arbitrary parameters or inputs. The stress test concern about isolating the interactions looks worth checking in the full proofs. This paper is for people working on the theoretical learnability of Transformers and query-based access models. It has enough new bounds and a solid negative result to merit serious refereeing, though the single-head part may require careful verification of the query details during review.

Referee Report

2 major / 1 minor

Summary. The manuscript studies the learnability of Transformer sequence models under adaptive black-box query access to outputs. For single-head attention regressors of width d it claims an elementary algorithm recovering the exact query/key/value parameters with O(d²) queries; this extends to one-layer Transformers assuming an oracle for learning ReLU FFNs. When head dimension r ≪ d it obtains O(rd) queries via compressed-sensing arguments. It also proves polynomial-query robustness to additive noise under mild norm and margin conditions, and shows that multi-head attention is not identifiable from queries alone.

Significance. If the central claims hold, the work supplies the first explicit query-complexity bounds for recovering attention parameters, together with a clean separation between identifiable single-head and non-identifiable multi-head cases. The compressed-sensing reduction and the reduction to FFN learnability are technically standard yet cleanly applied; the non-identifiability result is a useful negative finding.

major comments (2)

[Single-head attention learning section] The O(d²) algorithm for single-head attention (abstract and the corresponding theorem) asserts that an adaptive query strategy recovers the exact matrices despite the softmax nonlinearity, yet the manuscript provides no explicit description of the query sequences (e.g., repeated tokens, orthogonal bases, or norm-controlled vectors) that produce the required invertible linear systems. Without this construction the bound cannot be verified for generic parameter settings.
[Compressed-sensing subsection] The compressed-sensing claim for the r ≪ d regime (abstract) relies on the value matrix being full-rank and on the chosen queries keeping softmax arguments in an algebraically invertible regime; neither condition is stated as an explicit assumption nor shown to hold generically.

minor comments (1)

[Abstract] The abstract refers to an “elementary algorithm” without a high-level pseudocode sketch or step-by-step outline; adding one would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed reading and for identifying two points where the presentation of our algorithms requires additional explicit detail. Both comments are fair and we will revise the manuscript to address them. No standing objections remain after these clarifications.

read point-by-point responses

Referee: [Single-head attention learning section] The O(d²) algorithm for single-head attention (abstract and the corresponding theorem) asserts that an adaptive query strategy recovers the exact matrices despite the softmax nonlinearity, yet the manuscript provides no explicit description of the query sequences (e.g., repeated tokens, orthogonal bases, or norm-controlled vectors) that produce the required invertible linear systems. Without this construction the bound cannot be verified for generic parameter settings.

Authors: We agree that the current text describes the high-level recovery strategy (linear systems obtained after canceling the softmax via carefully chosen differences) but does not spell out the concrete adaptive query sequences. In the revision we will add an explicit construction in the proof of the O(d²) theorem: the learner first queries with scaled standard-basis vectors e_i repeated across positions to isolate each row of the query and key matrices, then perturbs with small orthogonal vectors drawn from a fixed basis to ensure the softmax arguments remain strictly positive and the resulting map is invertible by finite differences. This construction works for any fixed parameter matrices of bounded norm and yields exactly the claimed query count. revision: yes
Referee: [Compressed-sensing subsection] The compressed-sensing claim for the r ≪ d regime (abstract) relies on the value matrix being full-rank and on the chosen queries keeping softmax arguments in an algebraically invertible regime; neither condition is stated as an explicit assumption nor shown to hold generically.

Authors: The referee is correct; both conditions were used implicitly but not stated. In the revised manuscript we will add them as formal assumptions (value matrix V has full column rank r and all queried softmax inputs lie in an open set where the softmax is locally invertible). We will also include a short generic argument: for random Gaussian parameter matrices the value matrix is full-rank with probability 1, and a random perturbation of the query vectors of size O(1/d) keeps the arguments inside the invertible regime with high probability. The O(rd) bound then follows directly from standard compressed-sensing recovery. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithm derives from explicit query constructions and external tools

full rationale

The paper's central result is an explicit elementary algorithm that recovers single-head attention parameters (query/key/value matrices) from O(d²) adaptive black-box queries. This construction is presented as self-contained and does not reduce to a fitted parameter renamed as a prediction, nor to a self-citation chain. The extension to one-layer Transformers invokes learnability of ReLU FFNs as an independent black-box subroutine. The O(rd) bound for r ≪ d is obtained via standard compressed-sensing arguments applied to the linearized attention scores, which are external to the paper. Multi-head non-identifiability is shown by exhibiting distinct parameter sets that produce identical input-output behavior, again without circular reduction. No load-bearing step collapses by definition or by self-citation to the target claim itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work builds on existing learning theory without introducing new free parameters or entities.

axioms (1)

domain assumption Mild norm and margin conditions on inputs and outputs for noisy case
Invoked for the robustness result to hold with polynomial queries.

pith-pipeline@v0.9.0 · 5554 in / 1215 out tokens · 49838 ms · 2026-05-16T11:25:01.926025+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 ... two-row input X=[(u+e_j)^T; e_j^T] ... α(u;j)=σ(u^T w_j) ... u^T w_j = σ^{-1}((y-v_j^*)/(u^T v^*)) ... solve linear system
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Proposition 7.1 ... multi-head attention parameters are not identifiable from value queries

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.