arXiv preprint arXiv:2410.23079

Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He · 2024 · arXiv 2410.23079

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

cs.DC · 2026-06-18 · unverdicted · novelty 6.0

SAC uses CXL to fetch only top-k KV cache entries for sparse attention models, reporting 2.1x throughput, 9.7x lower TTFT and 1.8x lower TBT versus RDMA baselines on DeepSeek-V3.2.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

cs.CL · 2025-02-16 · unverdicted · novelty 6.0

NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

citing papers explorer

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · none · ref 143
