Recognition: no theorem link
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3
The pith
Pre-RoPE query-key concentration produces stable trigonometric distance preferences that let key importance be scored from position alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TriAttention scores keys for retention by using the trigonometric series induced by fixed Q/K concentration centers in pre-RoPE space, together with norm signals, to select a small subset of the cache that still supports full reasoning accuracy on long generations.
What carries the argument
The trigonometric series that maps the offset between Q and K concentration centers to a preference distribution over relative key positions.
If this is right
- 32K-token reasoning on AIME25 retains full accuracy while using roughly one-tenth the KV memory.
- The same accuracy can be achieved at 2.5 times the throughput of full attention.
- Single-GPU deployment becomes possible for tasks that previously exceeded memory limits.
- Key selection no longer depends on finding representative post-RoPE queries.
Where Pith is reading between the lines
- The same center-stability observation could be tested on non-reasoning long-context tasks to see whether the trigonometric scoring still selects useful keys.
- If the centers prove stable across model scales, the method might be applied to existing checkpoints without retraining.
- The trigonometric preference could be combined with other position-based signals such as recency to handle mixed reasoning and retrieval workloads.
Load-bearing premise
Query and key vectors remain highly concentrated around fixed non-zero centers in the space before rotary position embeddings are applied.
What would settle it
A measurement on a 32K-token reasoning trace showing that the pre-RoPE centers shift with position or that the keys TriAttention keeps differ substantially from the keys actually attended to by full attention.
Figures
read the original abstract
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TriAttention for KV cache compression in long-context LLM reasoning. It observes that Q and K vectors in pre-RoPE space concentrate around fixed non-zero centers stable across positions, inducing trigonometric distance preferences via a series expansion. Key importance is scored using this positional trigonometric preference plus Q/K norms, avoiding reliance on rotating post-RoPE queries. On AIME25 with 32K-token generation, it matches full-attention accuracy at 2.5x throughput or 10.7x KV memory reduction, while baselines achieve roughly half the accuracy at equivalent efficiency, enabling single-GPU deployment of models like OpenClaw.
Significance. If the pre-RoPE concentration observation and resulting scoring reliably preserve reasoning-critical keys, the work offers a practical path to efficient long reasoning on consumer hardware with substantial memory and throughput gains. The approach is falsifiable via direct measurement of center stability and alignment of high-attention keys with trig-preferred distances, and the empirical results (if reproducible) represent a clear advance over post-RoPE query-based compression baselines.
major comments (4)
- [Abstract / §3] Abstract and method description: the central premise that Q/K vectors are 'highly concentrated around fixed non-zero centers' and 'remain stable across positions' lacks any quantitative validation (e.g., mean/variance of distance to center, or stability metrics across layers/models/positions). This observation is load-bearing for deriving the trigonometric series and for claiming reliable importance scoring.
- [Abstract / §4] Abstract and §4: no ablation isolating the trigonometric component from the norm-based signal, and no quantitative check (e.g., overlap statistics) that high-attention keys for reasoning steps actually coincide with the trig-preferred distances. Without this, it is unclear whether the position-only + norm score preserves the exact keys needed for full AIME25 accuracy.
- [§4] §4 results: the AIME25 accuracy claim reports no error bars, no number of runs, and no analysis of center stability measurement protocol. The reported 2.5x throughput / 10.7x memory gains at matched accuracy are load-bearing for the efficiency claim but rest on unvalidated concentration tightness.
- [§3] Method: the trigonometric scoring depends only on relative position (via the series) and norm, independent of query content. This creates a risk that keys critical for a specific reasoning step but lying at non-preferred distances are down-ranked; the manuscript provides no bound or empirical frequency on how often this occurs.
minor comments (2)
- [Abstract] The abstract could explicitly state the exact compression ratios, the number of tokens or layers tested for center stability, and whether the centers are computed once per model or per sequence.
- [§3] Notation for the trigonometric series and center definitions should be introduced with an equation number in the method section for clarity.
Simulated Author's Rebuttal
We thank the referee for the insightful and constructive comments on our paper. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and method description: the central premise that Q/K vectors are 'highly concentrated around fixed non-zero centers' and 'remain stable across positions' lacks any quantitative validation (e.g., mean/variance of distance to center, or stability metrics across layers/models/positions). This observation is load-bearing for deriving the trigonometric series and for claiming reliable importance scoring.
Authors: We agree that quantitative validation strengthens the claim. Although the observation is derived from extensive internal analysis across models and layers, we did not include the metrics in the initial submission. In the revised manuscript, we will add a dedicated paragraph and figure in Section 3 showing the mean and standard deviation of distances to the centers (typically under 0.05 in normalized units), along with stability plots across positions and layers for models like Llama and others. This will provide the requested validation and support the trigonometric derivation. revision: yes
-
Referee: [Abstract / §4] Abstract and §4: no ablation isolating the trigonometric component from the norm-based signal, and no quantitative check (e.g., overlap statistics) that high-attention keys for reasoning steps actually coincide with the trig-preferred distances. Without this, it is unclear whether the position-only + norm score preserves the exact keys needed for full AIME25 accuracy.
Authors: We acknowledge the value of such ablations. We will include in the revised §4 an ablation study that isolates the trigonometric preference from the norm signal, reporting AIME25 accuracy for trig-only, norm-only, and combined. Additionally, we will compute and report overlap metrics (e.g., the fraction of top-k keys selected by full attention that match those preferred by the trigonometric distances) on reasoning traces from AIME25, demonstrating high alignment for critical keys. revision: yes
-
Referee: [§4] §4 results: the AIME25 accuracy claim reports no error bars, no number of runs, and no analysis of center stability measurement protocol. The reported 2.5x throughput / 10.7x memory gains at matched accuracy are load-bearing for the efficiency claim but rest on unvalidated concentration tightness.
Authors: We will revise §4 to include error bars where stochasticity is present (e.g., if using sampling in generation), and specify the number of runs (typically 3-5 for such benchmarks). We will also add a detailed description of the center stability measurement protocol, including how centers are estimated from pre-RoPE vectors over a range of positions. The efficiency numbers are from direct measurements on the specified hardware, and we will provide more details on the setup to ensure reproducibility. revision: yes
-
Referee: [§3] Method: the trigonometric scoring depends only on relative position (via the series) and norm, independent of query content. This creates a risk that keys critical for a specific reasoning step but lying at non-preferred distances are down-ranked; the manuscript provides no bound or empirical frequency on how often this occurs.
Authors: This is a valid concern regarding potential edge cases. However, the strong empirical match to full attention accuracy on AIME25 indicates that such mismatches are infrequent for reasoning-critical keys. In the revision, we will add an empirical analysis measuring the frequency of high-attention keys at non-preferred distances using full attention maps, and discuss theoretical aspects from the series expansion regarding when the preference holds strongly. We believe this will mitigate the risk highlighted. revision: yes
Circularity Check
No circularity: derivation from empirical centers to trig scoring is independent
full rationale
The paper begins with an empirical observation of Q/K concentration around fixed centers in pre-RoPE space, then mathematically derives the resulting positional distance preferences via a trigonometric series as a direct consequence of that concentration. The importance scoring is then built from those derived preferences plus separate norm signals. No quoted equations reduce the final score to the input centers by construction, no fitted parameters are relabeled as predictions, and no self-citation chain is invoked to justify uniqueness or the ansatz. The central claims rest on the mathematical step plus downstream empirical validation on AIME25, which is external to the derivation itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions in pre-RoPE space
Forward citations
Cited by 3 Pith papers
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
-
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
Reference graph
Works this paper leans on
-
[1]
Gqa: Training generalized multi-query transformer models from multi-head check- points
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y ., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head check- points. InProceedings of the 2023 Conference on Em- pirical Methods in Natural Language Processing, pp. 4895–4901,
2023
-
[2]
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
URL https://openreview.net/forum? id=RyOpooIxDF. Cai, Z., Zhang, Y ., Gao, B., Liu, Y ., Li, Y ., Liu, T., Lu, K., Xiong, W., Dong, Y ., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069,
work page internal anchor Pith review arXiv
-
[3]
Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025
Chen, Y ., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966,
-
[4]
Devoto, A., Jeblick, M., and J´egou, S. Expected attention: Kv cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636,
-
[5]
9 TriAttention: Efficient Long Reasoning with Trigonometric KV Compression Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters
Guo, Z., Kamigaito, H., and Watanabe, T. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21158–21166,
2024
-
[7]
On the token distance modeling ability of higher rope attention dimension
Hong, X., Dai, C., Li, B., Wu, S., Wang, Z., Wu, H., Wang, D., Zhu, J., He, S., and Sun, J.-R. On the token distance modeling ability of higher rope attention dimension. In Proceedings of the 2024 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP),
2024
-
[8]
Raas: Reasoning-aware attention sparsity for efficient llm reasoning
Hu, J., Huang, W., Wang, W., Li, Z., Hu, T., Liu, Z., Chen, X., Xie, T., and Shan, Y . Raas: Reasoning-aware attention sparsity for efficient llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 2577–2590,
2025
-
[9]
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b.arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Attention is not only a weight: Analyzing transformers with vector norms
Kobayashi, G., Kuribayashi, T., Yokoi, S., and Inui, K. Attention is not only a weight: Analyzing transformers with vector norms. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7057–7075,
2020
-
[11]
arXiv preprint arXiv:2412.19442 , year =
Li, H., Li, Y ., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., and Chen, L. A survey on large lan- guage model acceleration based on kv cache management. arXiv preprint arXiv:2412.19442, 2024a. Li, Y ., Huang, Y ., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for ...
-
[12]
American invita- tional mathematics examination 2024,
Mathematical Association of America. American invita- tional mathematics examination 2024,
2024
-
[13]
American invita- tional mathematics examination 2025,
Mathematical Association of America. American invita- tional mathematics examination 2025,
2025
-
[14]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Transformers are multi-state rnns
Oren, M., Hassid, M., Yarden, N., Adi, Y ., and Schwartz, R. Transformers are multi-state rnns. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18724–18741,
2024
-
[16]
YaRN: Efficient Context Window Extension of Large Language Models
Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071,
work page internal anchor Pith review arXiv
-
[17]
Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
arXiv preprint arXiv:2407.18003
Shi, L., Zhang, H., Yao, Y ., Li, Z., and Zhao, H. Keep the cost down: A review on methods to optimize llm’s kv-cache consumption.arXiv preprint arXiv:2407.18003,
-
[21]
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Tang, J., Zhao, Y ., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context llm inference.arXiv preprint arXiv:2406.10774,
work page internal anchor Pith review arXiv
-
[22]
Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ram ´e, A., Rivi`ere, M., et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Zhou, Y ., Song, S., Liu, B., Xi, Z., Jin, S., Fan, X., Zhang, Z., Li, W., and Huang, X. Elitekv: Scalable kv cache compression via rope frequency selection and joint low- rank projection.arXiv preprint arXiv:2503.01586,
-
[24]
A Survey on Efficient Inference for Large Language Models
Zhou, Z., Ning, X., Hong, K., Fu, T., Xu, J., Li, S., Lou, Y ., Wang, L., Yuan, Z., Li, X., et al. A survey on effi- cient inference for large language models.arXiv preprint arXiv:2404.14294,
work page internal anchor Pith review arXiv
-
[25]
on AIME24 using DeepSeek-R1-Distill-Qwen-7B at multiple KV budgets, alongside H2O (Zhang et al., 2023), TOV A (Oren et al., 2024), and RaaS (Hu et al.,
2023
-
[26]
We compare against StreamingLLM (Xiao et al., 2024), PyramidKV (Cai et al., 2024), KnormPress (Devoto et al., 2025), Ada-KV+SnapKV (Feng et al., 2025), and H2O (Zhang et al., 2023)
(retrieval tasks, 4K context). We compare against StreamingLLM (Xiao et al., 2024), PyramidKV (Cai et al., 2024), KnormPress (Devoto et al., 2025), Ada-KV+SnapKV (Feng et al., 2025), and H2O (Zhang et al., 2023). Table B presents the full LongBench results. TriAttention achieves the highest average (48.1) across 16 subtasks, winning 11 out of 16, and surp...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.