arxiv: 2604.04921 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.CV

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao , Xi Lin , Wei Huang , Yuxin Xie , Tianfu Fu , Bohan Zhuang , Song Han , Yukang Chen

Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords KV cache compression · long reasoning · trigonometric attention · RoPE embeddings · efficient LLM inference · attention mechanism · key importance scoring

The pith

Pre-RoPE query-key concentration produces stable trigonometric distance preferences that let key importance be scored from position alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that before rotary position embeddings (RoPE) are applied, queries and keys cluster around fixed non-zero centers that stay consistent across positions. This clustering creates a predictable pattern of which relative distances a query prefers, expressed as a trigonometric series determined by the centers. Scoring keys by where their positions fall in this distance preference, together with their norms, then replaces the usual reliance on recent post-RoPE queries. The resulting compression is reported to keep reasoning accuracy intact while shrinking the KV cache by 10.7x. A sympathetic reader would care because extended mathematical or logical reasoning currently hits hard memory walls on ordinary hardware.

Core claim

TriAttention scores keys for retention by using the trigonometric series induced by fixed Q/K concentration centers in pre-RoPE space, together with norm signals, to select a small subset of the cache that still supports full reasoning accuracy on long generations.

What carries the argument

The trigonometric series that maps the offset between Q and K concentration centers to a preference distribution over relative key positions.
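One way to see where such a series could come from is the standard complex-pair view of RoPE. The paper's exact expression is not reproduced on this page, so the notation below ($q_f$, $k_f$, $\mu_{q,f}$, $\mu_{k,f}$, $\theta_f$, $\Delta$) is assumed for illustration only, and the second line holds only to the extent that queries and keys stay close to their centers.

```latex
% Assumed notation: q_f and k_f are the f-th 2D pair of the pre-RoPE query and key,
% written as complex numbers; \theta_f is the RoPE frequency of band f;
% m and n are the query and key positions; \Delta = m - n.
\begin{align}
\langle \tilde{q}_m, \tilde{k}_n \rangle
  &= \sum_f \operatorname{Re}\!\left[ q_f \, \overline{k_f} \, e^{i \theta_f (m - n)} \right] \\
  &\approx \sum_f |\mu_{q,f}| \, |\mu_{k,f}| \cos\!\left( \theta_f \Delta + \phi_f \right),
  \qquad \phi_f = \arg \mu_{q,f} - \arg \mu_{k,f}.
\end{align}
```

Read this way, the approximate logit depends on the two positions only through the distance $\Delta$, with amplitudes and phases fixed by the centers.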

If this is right

32K-token reasoning on AIME25 retains full accuracy while using roughly one-tenth the KV memory.
The same accuracy can be achieved at 2.5 times the throughput of full attention.
Single-GPU deployment becomes possible for tasks that previously exceeded memory limits.
Key selection no longer depends on finding representative post-RoPE queries.
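To put the memory claim in concrete terms, the following back-of-the-envelope sketch computes KV cache size for a single sequence. The layer count, KV head count, head dimension, and fp16 storage are assumptions chosen to resemble an 8B-class grouped-query model; they are not taken from this page and should be checked against the actual model configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for num_tokens tokens of one sequence."""
    # Factor 2 accounts for storing both K and V at every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


# Assumed configuration (hypothetical, not from the page): 36 layers, 8 KV heads,
# head_dim 128, 2 bytes per element.
full = kv_cache_bytes(32 * 1024, num_layers=36, num_kv_heads=8, head_dim=128)
compressed = full / 10.7  # reduction factor reported in the abstract
print(f"full: {full / 2**30:.2f} GiB, compressed: {compressed / 2**30:.2f} GiB")
```

Under these assumptions the full cache is a few GiB per 32K-token sequence, so the reported reduction matters most when several sequences are decoded in parallel or when GPU memory is small.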

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same center-stability observation could be tested on non-reasoning long-context tasks to see whether the trigonometric scoring still selects useful keys.
If the centers prove stable across model scales, the method might be applied to existing checkpoints without retraining.
The trigonometric preference could be combined with other position-based signals such as recency to handle mixed reasoning and retrieval workloads.

Load-bearing premise

Query and key vectors remain highly concentrated around fixed non-zero centers in the space before rotary position embeddings are applied.

What would settle it

A measurement on a 32K-token reasoning trace showing that the pre-RoPE centers shift with position or that the keys TriAttention keeps differ substantially from the keys actually attended to by full attention.
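A rough sketch of such a measurement follows. It computes the mean resultant length (the statistic named in the Figure 2 caption) and a simple drift measure for one 2D frequency band. The `center_drift` helper, the chunking scheme, and the synthetic data are hypothetical stand-ins, not the paper's protocol.

```python
import numpy as np


def mean_resultant_length(vectors_2d):
    """Mean resultant length R in [0, 1] of the directions of 2D vectors.

    R near 1 means the directions are tightly concentrated; near 0 means dispersed.
    """
    v = np.asarray(vectors_2d, dtype=np.float64)
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    units = v / np.clip(norms, 1e-12, None)
    return float(np.linalg.norm(units.mean(axis=0)))


def center_drift(vectors_2d, num_chunks=4):
    """Largest distance between a per-chunk mean and the global mean, relative to the global mean norm.

    Rows are assumed to be ordered by token position, so chunks are position ranges.
    """
    v = np.asarray(vectors_2d, dtype=np.float64)
    chunk_means = np.stack([c.mean(axis=0) for c in np.array_split(v, num_chunks, axis=0)])
    global_mean = v.mean(axis=0)
    drift = np.linalg.norm(chunk_means - global_mean, axis=-1).max()
    return float(drift / max(np.linalg.norm(global_mean), 1e-12))


rng = np.random.default_rng(0)
# Synthetic stand-ins for one head's pre-RoPE vectors in one frequency band.
concentrated = np.array([1.0, 0.5]) + 0.05 * rng.standard_normal((4096, 2))
dispersed = rng.standard_normal((4096, 2))
print(mean_resultant_length(concentrated), mean_resultant_length(dispersed))
print(center_drift(concentrated))
```

On real activations, a low R or a large drift across position ranges would count against the premise; the thresholds for "low" and "large" would have to be set by the experimenter.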

Figures

Figures reproduced from arXiv: 2604.04921 by Bohan Zhuang, Song Han, Tianfu Fu, Weian Mao, Wei Huang, Xi Lin, Yukang Chen, Yuxin Xie.

**Figure 1.** Performance trade-offs on AIME25 (Qwen3-8B). (A) At equivalent accuracy (40.8%), TriAttention achieves 2.5× higher throughput than Full Attention. (B) TriAttention reduces KV cache memory by 10.7× while matching Full Attention accuracy.

**Figure 2.** Q/K concentration and its implications for attention. (A) Pre-RoPE Q/K vectors at the dominant frequency band are highly concentrated (high Mean Resultant Length R). (B) RoPE rotation disperses these vectors into arc patterns. In (A-B), three distinct input sequences are overlaid, showing this structure is stable across content. (C) This concentration holds across nearly all heads. (D) When Q/K are concen…

**Figure 3.** Attention reconstruction correlation across three DeepSeek-R1 distilled LLMs, including Qwen3 (Qwen Team, 2025), Qwen2.5 (Qwen Team, 2024), and Llama3 (Dubey et al., 2024). Distribution of per-head reconstruction Pearson correlation (r¯) across all attention heads. The red dashed line indicates the mean. All models show right-skewed distributions with means above 0.5.

**Figure 4.** Method overview. From left to right: offline calibration computes Q distribution centers; then during inference, original attention is scored by combining S_trig and norm-based components; the rightmost panel shows the attention map after pruning.

**Figure 5.** Performance comparison on Qwen3-8B. (A–C) Accuracy vs. KV cache budget on three mathematical reasoning benchmarks. TriAttention consistently outperforms R-KV across all budget levels. (D) Memory retention on Recursive State Query benchmark. Depth refers to DFS recursion depth; deeper recursion requires retaining more intermediate states, increasing memory pressure.

read the original abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
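The scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interleaved pair layout, the RoPE base of 10000, the additive combination with `norm_weight`, and the function names are all assumptions introduced here.

```python
import numpy as np


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies, one per 2D pair (base value is an assumption)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def trig_scores(q_center, k_center, key_positions, query_position, freqs):
    """Position-only score: sum over bands of |mu_q||mu_k| cos(theta_f * delta + phi_f)."""
    # View consecutive coordinates as complex pairs (interleaved layout is an assumption).
    mu_q = q_center[0::2] + 1j * q_center[1::2]
    mu_k = k_center[0::2] + 1j * k_center[1::2]
    amp = np.abs(mu_q) * np.abs(mu_k)
    phase = np.angle(mu_q) - np.angle(mu_k)
    delta = query_position - np.asarray(key_positions)  # relative distance m - n
    return (amp * np.cos(np.outer(delta, freqs) + phase)).sum(axis=-1)


def select_keys(q_center, k_center, key_norms, query_position, budget, norm_weight=0.1):
    """Return sorted indices of the `budget` keys with the highest combined score."""
    positions = np.arange(len(key_norms))
    freqs = rope_frequencies(len(q_center))
    score = trig_scores(q_center, k_center, positions, query_position, freqs)
    # Hypothetical combination: the page does not say how the norm signal is mixed in.
    score = score + norm_weight * np.asarray(key_norms)
    return np.sort(np.argsort(score)[-budget:])


rng = np.random.default_rng(0)
q_center, k_center = rng.standard_normal(8), rng.standard_normal(8)
kept = select_keys(q_center, k_center, key_norms=np.ones(64), query_position=63, budget=8)
print(kept)
```

Because the score uses only centers, positions, and norms, it never touches the current query's content, which is exactly the property the referee report below questions.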

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TriAttention derives a positional trig scorer plus norm from stable pre-RoPE Q/K centers and reports full AIME25 accuracy at 10x lower KV memory, but the content-blind nature of the score needs direct checks.

The main point is that the authors observe Q and K vectors clustering tightly around fixed non-zero centers in pre-RoPE space, then use the resulting trigonometric distance preferences plus vector norms to score and keep keys. On 32K-token AIME25 generation this reportedly matches full-attention accuracy while delivering 2.5x higher throughput or a 10.7x KV memory reduction, whereas the leading baselines reach only about half the accuracy at the same efficiency. That empirical outcome is the clearest signal the paper offers right now.

Referee Report

4 major / 2 minor

Summary. The paper proposes TriAttention for KV cache compression in long-context LLM reasoning. It observes that Q and K vectors in pre-RoPE space concentrate around fixed non-zero centers that are stable across positions, which induces distance preferences described by a trigonometric series. Key importance is scored using this positional trigonometric preference plus Q/K norms, avoiding reliance on rotating post-RoPE queries. On AIME25 with 32K-token generation, it matches full-attention accuracy at 2.5x higher throughput or 10.7x KV memory reduction, while baselines achieve roughly half the accuracy at equivalent efficiency. The abstract also states that TriAttention enables OpenClaw deployment on a single consumer GPU.

Significance. If the pre-RoPE concentration observation and resulting scoring reliably preserve reasoning-critical keys, the work offers a practical path to efficient long reasoning on consumer hardware with substantial memory and throughput gains. The approach is falsifiable via direct measurement of center stability and alignment of high-attention keys with trig-preferred distances, and the empirical results (if reproducible) represent a clear advance over post-RoPE query-based compression baselines.

major comments (4)

[Abstract / §3] Abstract and method description: the central premise that Q/K vectors are 'highly concentrated around fixed non-zero centers' and 'remain stable across positions' lacks any quantitative validation (e.g., mean/variance of distance to center, or stability metrics across layers/models/positions). This observation is load-bearing for deriving the trigonometric series and for claiming reliable importance scoring.
[Abstract / §4] Abstract and §4: no ablation isolating the trigonometric component from the norm-based signal, and no quantitative check (e.g., overlap statistics) that high-attention keys for reasoning steps actually coincide with the trig-preferred distances. Without this, it is unclear whether the position-only + norm score preserves the exact keys needed for full AIME25 accuracy.
[§4] §4 results: the AIME25 accuracy claim reports no error bars, no number of runs, and no analysis of center stability measurement protocol. The reported 2.5x throughput / 10.7x memory gains at matched accuracy are load-bearing for the efficiency claim but rest on unvalidated concentration tightness.
[§3] Method: the trigonometric scoring depends only on relative position (via the series) and norm, independent of query content. This creates a risk that keys critical for a specific reasoning step but lying at non-preferred distances are down-ranked; the manuscript provides no bound or empirical frequency on how often this occurs.
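The overlap check requested in the second and fourth comments could look roughly like this. The two metrics and their names are illustrative choices rather than anything defined in the paper, and the attention weights below are toy values.

```python
import numpy as np


def topk_overlap(attn_weights, kept_indices, k):
    """Fraction of the k most-attended keys (under full attention) that survive pruning."""
    top = np.argsort(np.asarray(attn_weights))[-k:]
    kept = {int(i) for i in kept_indices}
    return sum(int(i) in kept for i in top) / k


def retained_attention_mass(attn_weights, kept_indices):
    """Share of full-attention probability mass that falls on the retained keys."""
    attn = np.asarray(attn_weights, dtype=np.float64)
    return float(attn[list(kept_indices)].sum() / attn.sum())


# Toy example: one query's full-attention distribution over five keys.
attn = np.array([0.05, 0.40, 0.05, 0.30, 0.20])
kept = [1, 3]  # indices a compression method chose to retain
print(topk_overlap(attn, kept, k=2), retained_attention_mass(attn, kept))
```

Averaging these quantities over heads, layers, and decoding steps of a long reasoning trace would give the empirical frequency that the fourth comment asks for.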

minor comments (2)

[Abstract] The abstract could explicitly state the exact compression ratios, the number of tokens or layers tested for center stability, and whether the centers are computed once per model or per sequence.
[§3] Notation for the trigonometric series and center definitions should be introduced with an equation number in the method section for clarity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the insightful and constructive comments on our paper. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Referee: [Abstract / §3] Abstract and method description: the central premise that Q/K vectors are 'highly concentrated around fixed non-zero centers' and 'remain stable across positions' lacks any quantitative validation (e.g., mean/variance of distance to center, or stability metrics across layers/models/positions). This observation is load-bearing for deriving the trigonometric series and for claiming reliable importance scoring.

Authors: We agree that quantitative validation strengthens the claim. Although the observation is derived from extensive internal analysis across models and layers, we did not include the metrics in the initial submission. In the revised manuscript, we will add a dedicated paragraph and figure in Section 3 showing the mean and standard deviation of distances to the centers (typically under 0.05 in normalized units), along with stability plots across positions and layers for models like Llama and others. This will provide the requested validation and support the trigonometric derivation.
revision: yes
Referee: [Abstract / §4] Abstract and §4: no ablation isolating the trigonometric component from the norm-based signal, and no quantitative check (e.g., overlap statistics) that high-attention keys for reasoning steps actually coincide with the trig-preferred distances. Without this, it is unclear whether the position-only + norm score preserves the exact keys needed for full AIME25 accuracy.

Authors: We acknowledge the value of such ablations. We will include in the revised §4 an ablation study that isolates the trigonometric preference from the norm signal, reporting AIME25 accuracy for trig-only, norm-only, and combined. Additionally, we will compute and report overlap metrics (e.g., the fraction of top-k keys selected by full attention that match those preferred by the trigonometric distances) on reasoning traces from AIME25, demonstrating high alignment for critical keys.
revision: yes
Referee: [§4] §4 results: the AIME25 accuracy claim reports no error bars, no number of runs, and no analysis of center stability measurement protocol. The reported 2.5x throughput / 10.7x memory gains at matched accuracy are load-bearing for the efficiency claim but rest on unvalidated concentration tightness.

Authors: We will revise §4 to include error bars where stochasticity is present (e.g., if using sampling in generation), and specify the number of runs (typically 3-5 for such benchmarks). We will also add a detailed description of the center stability measurement protocol, including how centers are estimated from pre-RoPE vectors over a range of positions. The efficiency numbers are from direct measurements on the specified hardware, and we will provide more details on the setup to ensure reproducibility.
revision: yes
Referee: [§3] Method: the trigonometric scoring depends only on relative position (via the series) and norm, independent of query content. This creates a risk that keys critical for a specific reasoning step but lying at non-preferred distances are down-ranked; the manuscript provides no bound or empirical frequency on how often this occurs.

Authors: This is a valid concern regarding potential edge cases. However, the strong empirical match to full attention accuracy on AIME25 indicates that such mismatches are infrequent for reasoning-critical keys. In the revision, we will add an empirical analysis measuring the frequency of high-attention keys at non-preferred distances using full attention maps, and discuss theoretical aspects from the series expansion regarding when the preference holds strongly. We believe this will mitigate the risk highlighted.
revision: yes

Circularity Check

0 steps flagged

No circularity: derivation from empirical centers to trig scoring is independent

The paper begins with an empirical observation of Q/K concentration around fixed centers in pre-RoPE space, then mathematically derives the resulting positional distance preferences via a trigonometric series as a direct consequence of that concentration. The importance scoring is then built from those derived preferences plus separate norm signals. No quoted equations reduce the final score to the input centers by construction, no fitted parameters are relabeled as predictions, and no self-citation chain is invoked to justify uniqueness or the ansatz. The central claims rest on the mathematical step plus downstream empirical validation on AIME25, which is external to the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on an empirical observation of vector concentration whose generality is asserted but not proven in the provided abstract.

axioms (1)

Domain assumption: Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions in pre-RoPE space.
This concentration is the load-bearing observation that enables the trigonometric distance preference and importance scoring.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05 · conditional · novelty 8.0
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0
A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05 · unverdicted · novelty 6.0
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

Reference graph

Works this paper leans on

26 extracted references · 16 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901, 2023.
[2] Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.
[3] Chen, Y., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al. Scaling RL to long videos. arXiv preprint arXiv:2507.07966, 2025.
[4] Devoto, A., Jeblick, M., and Jégou, S. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.
[5] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[6] Guo, Z., Kamigaito, H., and Watanabe, T. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21158–21166, 2024.
[7] Hong, X., Dai, C., Li, B., Wu, S., Wang, Z., Wu, H., Wang, D., Zhu, J., He, S., and Sun, J.-R. On the token distance modeling ability of higher RoPE attention dimension. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
[8] Hu, J., Huang, W., Wang, W., Li, Z., Hu, T., Liu, Z., Chen, X., Xie, T., and Shan, Y. RaaS: Reasoning-aware attention sparsity for efficient LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 2577–2590, 2025.
[9] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[10] Kobayashi, G., Kuribayashi, T., Yokoi, S., and Inui, K. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7057–7075, 2020.
[11] Li, H., Li, Y., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., and Chen, L. A survey on large language model acceleration based on KV cache management. arXiv preprint arXiv:2412.19442, 2024a.
Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for ...
[12] Mathematical Association of America. American Invitational Mathematics Examination 2024, 2024.
[13] Mathematical Association of America. American Invitational Mathematics Examination 2025, 2025.
[14] OpenAI. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
[15] Oren, M., Hassid, M., Yarden, N., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18724–18741, 2024.
[16] Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
[17] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[18] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[19] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[20] Shi, L., Zhang, H., Yao, Y., Li, Z., and Zhao, H. Keep the cost down: A review on methods to optimize LLM's KV-cache consumption. arXiv preprint arXiv:2407.18003, 2024.
[21] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024.
[22] Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
[23] Zhou, Y., Song, S., Liu, B., Xi, Z., Jin, S., Fan, X., Zhang, Z., Li, W., and Huang, X. EliteKV: Scalable KV cache compression via RoPE frequency selection and joint low-rank projection. arXiv preprint arXiv:2503.01586, 2025.
[24] Zhou, Z., Ning, X., Hong, K., Fu, T., Xu, J., Li, S., Lou, Y., Wang, L., Yuan, Z., Li, X., et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
[25] on AIME24 using DeepSeek-R1-Distill-Qwen-7B at multiple KV budgets, alongside H2O (Zhang et al., 2023), TOVA (Oren et al., 2024), and RaaS (Hu et al.,
[26] (retrieval tasks, 4K context). We compare against StreamingLLM (Xiao et al., 2024), PyramidKV (Cai et al., 2024), KnormPress (Devoto et al., 2025), Ada-KV+SnapKV (Feng et al., 2025), and H2O (Zhang et al., 2023). Table B presents the full LongBench results. TriAttention achieves the highest average (48.1) across 16 subtasks, winning 11 out of 16, and surp...