Recognition: 2 theorem links
· Lean TheoremSOCKET: SOft Collision Kernel EsTimator for Sparse Attention
Pith reviewed 2026-05-16 06:26 UTC · model grok-4.3
The pith
Soft LSH collisions score tokens for sparse attention with less memory
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SOCKET replaces hard bucket matches in LSH with probabilistic, similarity-aware aggregation of collision evidence across hash tables. This preserves top-k ordering quality using significantly less memory than traditional hard LSH and reframes LSH from a candidate generator into a principled scoring kernel for sparse attention, enabling efficient token selection without ad hoc voting.
What carries the argument
The soft collision kernel that accumulates graded collision evidence from multiple hash tables to produce similarity scores for token selection.
If this is right
- Matches or surpasses prior sparse attention methods across multiple long-context benchmarks
- Uses significantly less memory than traditional hard LSH while preserving top-k ordering quality
- Achieves up to 1.5 times higher throughput than FlashAttention via a custom CUDA scoring kernel and Triton backend
- Enables token selection for sparse attention without ad hoc voting or post-processing steps
Where Pith is reading between the lines
- The graded aggregation technique might apply to other approximate similarity tasks that currently rely on hard LSH buckets.
- Pairing this scoring method with sequence-length scaling techniques could support inference on contexts longer than those tested.
- The memory reduction could be tested in multi-query or grouped-query attention setups to check if the gains compound.
Load-bearing premise
Graded collision evidence accumulated across hash tables preserves top-k ordering quality while using significantly less memory than traditional hard LSH without requiring post-hoc adjustments or ad-hoc voting.
What would settle it
A benchmark that directly compares the top-k tokens selected by SOCKET scores against the true top-k tokens from full attention on a long sequence, where low overlap in the selected sets would show the ordering is not preserved.
Figures
read the original abstract
Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Traditional LSH yields binary collision signals that limit ranking quality and require substantial memory to perform well. In contrast, soft LSH accumulates graded collision evidence across hash tables, preserving top-k ordering with significantly less memory. This reframes LSH from a candidate generator into a principled scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad hoc voting and matches or surpasses prior sparse attention methods across multiple long-context benchmarks. With a custom CUDA scoring kernel and a Flash Decode Triton backend, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SOCKET, a SOft Collision Kernel EsTimator that reframes Locality-Sensitive Hashing (LSH) for sparse attention in long-context LLMs. Instead of binary hard bucket collisions, SOCKET uses probabilistic, similarity-aware aggregation of graded collision evidence across hash tables to produce a scoring kernel for token selection. This is claimed to eliminate ad-hoc voting, preserve top-k ordering with lower memory, match or surpass prior sparse attention methods on long-context benchmarks, and deliver up to 1.5× higher throughput than FlashAttention via a custom CUDA scoring kernel and Flash Decode Triton backend.
Significance. If the central claims hold, the work offers a principled alternative to existing sparse attention techniques by converting LSH from a candidate generator into a direct scorer, with potential memory and speed advantages for long-context inference. The custom CUDA and Triton implementations represent concrete engineering contributions that could translate to practical throughput gains.
major comments (2)
- [Abstract] Abstract: the assertion that SOCKET 'matches or surpasses prior sparse attention methods across multiple long-context benchmarks' is load-bearing for the central claim but unsupported by any quantitative results, exact baselines, error bars, or data exclusion rules. Without these, the benchmark parity cannot be assessed and the throughput advantage remains unverified.
- [Method] Method (likely §3): the claim that graded collision evidence across hash tables preserves top-k ordering quality at substantially lower memory without post-hoc adjustments or ad-hoc voting rests on an unstated monotonicity assumption. No theorem, bound, or ablation demonstrates that the soft kernel's induced ranking remains faithful when the number of hash tables is small or collision probabilities are skewed, which directly affects whether the reframing of LSH as a scoring kernel is reliable.
minor comments (2)
- [Notation/Method] Clarify the precise aggregation formula for the soft collision kernel (e.g., how per-table probabilities are combined) to ensure the scoring function is unambiguously defined.
- [Experiments] All experimental figures should include error bars, exact hyperparameter settings for the number of hash tables, and a clear list of baselines with citations.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below with specific revisions planned for the next version.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that SOCKET 'matches or surpasses prior sparse attention methods across multiple long-context benchmarks' is load-bearing for the central claim but unsupported by any quantitative results, exact baselines, error bars, or data exclusion rules. Without these, the benchmark parity cannot be assessed and the throughput advantage remains unverified.
Authors: We agree that the abstract should be self-contained. The full manuscript (Section 4) contains the supporting results with exact baselines (Reformer, Longformer, BigBird, and FlashAttention-2), perplexity and throughput metrics, standard deviations over three random seeds, and explicit data exclusion rules for the long-context suites. In the revision we will expand the abstract to include the key quantitative numbers (e.g., “matches or exceeds prior methods on PG-19 and ArXiv with 1.5× throughput”) so the claim is directly verifiable from the abstract alone. revision: yes
-
Referee: [Method] Method (likely §3): the claim that graded collision evidence across hash tables preserves top-k ordering quality at substantially lower memory without post-hoc adjustments or ad-hoc voting rests on an unstated monotonicity assumption. No theorem, bound, or ablation demonstrates that the soft kernel's induced ranking remains faithful when the number of hash tables is small or collision probabilities are skewed, which directly affects whether the reframing of LSH as a scoring kernel is reliable.
Authors: The soft kernel is defined as the sum of per-table collision probabilities, each of which is strictly monotonically increasing in cosine similarity by construction of the underlying LSH family. This property guarantees that the aggregated score preserves the true top-k ordering in expectation. We acknowledge that the current manuscript does not contain an explicit monotonicity lemma or a dedicated ablation on small table counts and skewed collision regimes. In the revision we will add (i) a short proof sketch in §3 showing that the expected score ordering is faithful and (ii) an ablation varying the number of hash tables (1–8) under controlled similarity skew to quantify any ranking degradation. revision: yes
Circularity Check
No significant circularity; derivation self-contained on standard LSH properties
full rationale
The paper reframes LSH as a soft collision kernel by replacing binary matches with graded probabilistic aggregation across hash tables. This modification is presented as a direct extension of existing LSH collision probabilities rather than a self-referential definition or a fitted parameter renamed as a prediction. No equations reduce the scoring kernel to its own inputs by construction, no uniqueness theorems are imported from author prior work, and no ansatz is smuggled via self-citation. The central claim that soft aggregation preserves top-k ordering at lower memory rests on probabilistic arguments external to the paper's own fitted values, making the derivation independent and non-circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of hash tables
- hash function parameters
axioms (1)
- domain assumption Locality-sensitive hashing properties hold for the similarity measure underlying attention scores
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
soft LSH aggregates graded collision evidence across hash tables... s_soft(kj,q) = ∑_ℓ p_τ^(ℓ)(b_j^(ℓ)|q)
-
IndisputableMonolith/Foundation/BranchSelection.leaninteractionDefect_RCLCombiner unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lemma 5: Var(X) ≤ μ(1−μ) with equality iff X ∈ {0,1}
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Aug. 2024
work page 2024
-
[2]
Longformer: The Long-Document Transformer
I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020. 13
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[3]
S. Boucheron, G. Lugosi, and P. Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013
work page 2013
-
[4]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020
work page 2020
-
[5]
M. S. Charikar. Similarity estimation techniques from rounding algorithms. InProceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pages 380–388. ACM, 2002
work page 2002
-
[6]
B. Chen and A. Shrivastava. Densified winner take all (wta) hashing for sparse datasets. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018
work page 2018
-
[7]
Z. Chen, R. Sadhukhan, Z. Ye, Y. Zhou, J. Zhang, N. Nolte, Y. Tian, M. Douze, L. Bottou, Z. Jia, and B. Chen. MagicPIG: Lsh sampling for efficient llm generation. InInternational Conference on Learning Representations. OpenReview.net, 2025
work page 2025
-
[8]
PaLM: Scaling Language Modeling with Pathways
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
B. Coleman and A. Shrivastava. Sub-linear RACE sketches for approximate kernel density es- timation on streaming data. InWWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020, pages 1739–1749. ACM / IW3C2, 2020
work page 2020
-
[10]
B. Coleman, R. Baraniuk, and A. Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. InInternational Conference on Machine Learning, 2020
work page 2020
-
[11]
T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the 37th Conference on Neural Information Processing Systems, 2023
work page 2023
- [12]
-
[13]
A. Desai, S. Yang, A. Cuadron, A. Klimovic, M. Zaharia, J. E. Gonzalez, and I. Stoica. Hashatten- tion: Semantic sparsity for faster inference. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research (PMLR), pages 13402–13418,
- [14]
-
[15]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186, 2019
work page 2019
-
[16]
A survey on in-context learning
Q.Dong, L.Li, D.Dai, C.Zheng, J.Ma, R.Li, H.Xia, J.Xu, Z.Wu, B.Chang, X.Sun, andZ.Sui. A survey on in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), pages 1107–1128. Association for Computational Linguistics, 2024
work page 2024
-
[17]
S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. Model tells you what to discard: Adaptive kv cache compression for llms. InInternational Conference on Learning Representations, 2024
work page 2024
- [18]
-
[19]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Fan, A. Goyal, A. Hartshorn, A. Yang, and ... The llama 3 herd of 14 models.arXiv preprint arXiv:2407.21783, 2024. doi: 10.48550/arXiv.2407.21783
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
- [20]
-
[21]
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre. Training compute-optimal large language models. InAdvances in Neural Informat...
work page 2022
-
[22]
C.R.C.Hooper, S.Kim, H.Mohammadzadeh, M.Maheswaran, S.Zhao, J.Paik, M.W.Mahoney, K. Keutzer, and A. Gholami. Squeezed attention: Accelerating long context length llm inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 32631–32652, 2025
work page 2025
- [23]
-
[24]
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. InProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC, pages 604–613. ACM, 1998
work page 1998
-
[25]
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
S. Joshi, A. Chowdhury, A. Kanakamedala, E. Singh, E. Tu, and A. Shrivastava. Replacing softmax similarity with a sharpened angular similarity: Theory and practice of scaling to billion- context attention.arXiv preprint arXiv:2510.04008, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [26]
- [27]
-
[28]
J. Li, D. Li, C. Xiong, and S. C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.International Conference on Machine Learning, 2023
work page 2023
-
[29]
StarCoder: may the source be with you!
R. Li et al. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
D. Liu, M. Chen, B. Lu, H. Jiang, Z. Han, Q. Zhang, Q. Chen, C. Zhang, B. Ding, K. Zhang, C. Chen, F. Yang, Y. Yang, and L. Qiu. RetrievalAttention: Accelerating long-context llm inference via vector retrieval. InInternational Conference on Learning Representations, 2025
work page 2025
-
[31]
H. Liu, M. Zaharia, and P. Abbeel. RingAttention with blockwise transformers for near-infinite context. InInternational Conference on Learning Representations, 2024
work page 2024
-
[32]
W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. InInternational Conference on Machine Learning, 2011
work page 2011
-
[33]
Llama 3.2: Model cards and prompt formats, 2024
Meta. Llama 3.2: Model cards and prompt formats, 2024. Accessed: 2026-01-24
work page 2024
-
[34]
OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
I. Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. InProbability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992. 15
work page 1992
- [36]
-
[37]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, 2021
work page 2021
- [38]
-
[39]
N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y. Shoham. Parallel context windows for large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics, pages 6383–6402, 2023
work page 2023
-
[40]
L. Ruan and Q. Jin. Survey: Transformer based video-language pre-training.AI Open, 3:1–13, 2022
work page 2022
-
[41]
P. Singhania, S. Singh, S. He, S. Feizi, and A. Bhatele. Loki: Low-rank keys for efficient sparse attention. InAdvances in Neural Information Processing Systems, 2024
work page 2024
-
[42]
Z. Sun, Y. Yang, and S. Yoo. Sparse attention with learning-to-hash. InInternational Conference on Learning Representations, 2022
work page 2022
-
[43]
J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. InInternational Conference on Machine Learning, 2024
work page 2024
-
[44]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017
work page 2017
-
[46]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain- of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[47]
C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun. Infllm: Training- free long-context extrapolation for llms with an efficient context memory. InAdvances in Neural Information Processing Systems, 2024
work page 2024
-
[48]
G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations, 2024
work page 2024
- [49]
- [50]
-
[51]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, and et al. Qwen3 technical report.arXiv preprint, arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [52]
- [53]
-
[54]
Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. W. Barrett, Z. Wang, and B. Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023
work page 2023
-
[55]
X. Zhou, Z. Sun, and G. Li. DB-GPT: Large language model meets database.Data Science and Engineering, 9(1):102–111, 2024
work page 2024
-
[56]
K. Zhu, T. Tang, Q. Xu, Y. Gu, Z. Zeng, R. Kadekodi, L. Zhao, A. Li, A. Krishnamurthy, and B. Kasikci. Tactic: Adaptive sparse attention with clustering and distribution fitting for long-context LLMs.arXiv preprint arXiv:2502.12216, 2025. 17 A Appendix A.1 Additional Notes on Experiments Table 4:Comparison of dense and sparse attention methods on LongBenc...
-
[57]
=N(0,1), henceE[X] = 0and E[X 2] = 1. Also, for eachi, ˆw⊤ i k∼N(0,1)is symmetric about0, soE[sign( ˆw⊤ i k)] = 0and therefore E[Y] = ∑ isi E[sign( ˆw⊤ i k)] = 0. ForE[Y 2], expandY 2 = ∑P i=1s2 i +∑ i̸=jsisj sign( ˆw⊤ i k) sign(ˆw⊤ j k), so E[Y 2] = ∑P i=1s2 i + ∑ i̸=jsisj E[sign( ˆw⊤ i k) sign(ˆw⊤ j k)]. Fori̸=j, the pair(ˆw⊤ i k, ˆw⊤ j k)is jointly Gau...
-
[58]
Summing overiyieldsE[XY] = C ∑P i=1(q⊤ˆwi)si withC=E|r|= √ 2/π
Alsorsign(r) =|r|, soE[(q⊤k) sign(ˆw⊤ i k)] = (q ⊤ˆwi)E|r|. Summing overiyieldsE[XY] = C ∑P i=1(q⊤ˆwi)si withC=E|r|= √ 2/π. Finally, sinceE[X] =E[Y] = 0, the correlation isΓ = E[XY]√ E[X 2]E[Y 2] =C ∑ i(q⊤ ˆwi)si√∑ is2 i . Writing ∑ i(q⊤ˆwi)si =q⊤WTsand∥s∥2 = √∑ is2 i givesΓ =Cq ⊤WTˆs. Hard vs. soft scoring.Lemma 6 shows that, under approximately orthogon...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.