Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3
The pith
Fused compressed-domain attention on Apple Silicon enables 128K-context inference for 70B models on a single 64GB Mac.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fusing on-the-fly int4 quantization of the KV cache with attention computation inside custom Metal compute shaders, the system performs scaled dot-product attention directly in the compressed domain. This yields a 48x speedup in attention at 128K context over dequantize-then-attend baselines, reduces KV cache memory from 40 GB to 12.5 GB, and produces identical top-1 token predictions to FP16 inference across tested models and prompts.
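The mechanism can be sketched outside Metal. The following NumPy sketch is our construction, not the paper's sdpa_int4 shader: it stores K and V as int4 codes with per-row scales and folds those scales into the attention accumulation, so no dequantized copy of the cache is ever materialized (the codes are promoted to float only inside the matmul, which NumPy requires).

```python
import numpy as np

def quantize_int4(x, axis=-1):
    """Symmetric int4 quantization: codes in [-7, 7] plus a per-row scale."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def sdpa_int4(q, k_codes, k_scale, v_codes, v_scale, attn_scale):
    # q @ K^T in the compressed domain: the integer codes enter the GEMM and
    # the per-row scale is applied to the accumulated result, so a dequantized
    # K matrix is never formed.
    logits = attn_scale * (q @ k_codes.T.astype(np.float32)) * k_scale.ravel()
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Same trick for V: fold the scale into the softmax weights.
    return (w * v_scale.ravel()) @ v_codes.astype(np.float32)

rng = np.random.default_rng(0)
d, n = 64, 256
q = rng.standard_normal(d).astype(np.float32)
k = rng.standard_normal((n, d)).astype(np.float32)
v = rng.standard_normal((n, d)).astype(np.float32)

kc, ks = quantize_int4(k)
vc, vs = quantize_int4(v)
out = sdpa_int4(q, kc, ks, vc, vs, attn_scale=1.0 / np.sqrt(d))
```

On random Gaussian inputs the output tracks the full-precision result closely; the paper's claim is the stronger one that, for the tested models, the residual error never flips the top-1 token.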
What carries the argument
The sdpa_int4 kernel, a custom Metal compute shader that executes attention directly on int4-quantized key and value tensors without any intermediate dequantization step.
If this is right
- 128K context lengths become practical for 70B models on single 64GB consumer Macs.
- KV cache memory drops by a factor of 3.2 while attention speed increases up to 48 times compared with dequantize-then-attend methods.
- Output predictions remain unchanged from full-precision inference in the evaluated cases.
- Quantization success depends on the model's attention scale factor, which explains performance differences between model families.
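The scale-factor mechanism behind the last bullet can be illustrated with a synthetic experiment (our construction; the paper's reported 25-100x amplification presumably reflects additional model-specific effects beyond this linear scaling). A fixed directional error injected into the keys, mimicking an angular quantizer such as PolarQuant, perturbs the attention logits in direct proportion to the attention scale, so attn_scale=1.0 amplifies the error by a factor of sqrt(d) relative to the standard 1/sqrt(d) scaling:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 128, 512
q = rng.standard_normal(d)
k = rng.standard_normal((n, d))

# Inject a small angular error into each key: remove the component of the
# noise along k so magnitude is (to first order) preserved, as an angular
# quantizer would.
noise = rng.standard_normal((n, d))
noise -= (noise * k).sum(1, keepdims=True) / (k * k).sum(1, keepdims=True) * k
k_err = k + 0.02 * noise  # ~2% directional perturbation

def logit_error(attn_scale):
    # Worst-case perturbation of the attention logits for a given scale.
    return np.abs(attn_scale * (k_err - k) @ q).max()

# Same directional error, two scales: the ratio is exactly sqrt(d).
amplification = logit_error(1.0) / logit_error(1.0 / np.sqrt(d))
```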
Where Pith is reading between the lines
- Similar fused kernels could be written for other shader or accelerator platforms to extend compressed-domain attention beyond Apple Silicon.
- The scale-factor finding may guide architecture-specific quantization designs that achieve higher compression ratios.
- Reduced memory traffic could translate into lower power draw during long-context inference on edge devices.
- The technique might combine with other compression methods to support even longer contexts on the same hardware.
Load-bearing premise
The custom Metal shaders implement exact attention semantics without numerical drift or hardware-specific bugs, and the identical top-1 predictions generalize beyond the tested prompts and models.
What would settle it
Run side-by-side inference using the fused int4 kernel and a standard FP16 implementation on new prompts or additional model variants; any difference in top-1 tokens or measurable numerical deviation in attention outputs would falsify the claim of preserved semantics.
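A minimal falsification harness in this spirit might look as follows. The candidate here simulates int4 quantization by round-tripping K and V (a stand-in assumption for the fused Metal kernel, which we cannot run in plain Python), and the output projection `unembed` is a hypothetical placeholder for a model's unembedding matrix; the harness reports both top-1 agreement and the raw numerical deviation the referee asks for.

```python
import numpy as np

def attention(q, k, v, scale):
    logits = scale * (k @ q)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v

def int4_roundtrip(x):
    # Simulated int4 quantize/dequantize with a per-row symmetric scale.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

def compare(seed, n=256, d=64, vocab=1000):
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(d)
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    unembed = rng.standard_normal((vocab, d))  # hypothetical output projection
    ref = attention(q, k, v, d ** -0.5)
    cand = attention(q, int4_roundtrip(k), int4_roundtrip(v), d ** -0.5)
    same_top1 = np.argmax(unembed @ ref) == np.argmax(unembed @ cand)
    return bool(same_top1), float(np.abs(ref - cand).max())

results = [compare(s) for s in range(20)]
agreement = sum(t for t, _ in results) / len(results)
worst_dev = max(e for _, e in results)
```

Any trial with differing top-1 tokens, or a `worst_dev` outside the error bound a revised paper would report, falsifies the preserved-semantics claim for that input.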
Figures
Figure 7. Standalone kernel latency on Gemma 4 31B: fused kernel (orange) vs. MLX baseline (teal).
Original abstract
We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor -- not model size -- determines whether angular quantization schemes like PolarQuant succeed or fail, with Gemma 4's attn_scale=1.0 amplifying directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon via custom Metal compute shaders. It quantizes the KV cache to int4 on the fly and performs attention directly in the compressed domain, eliminating dequantization intermediates. Claims include enabling 128K-context inference for Llama 3.1 70B on a single 64GB Mac, a 48x attention speedup at 128K context versus dequantize-then-attend, 3.2x KV cache compression (40 GB to 12.5 GB), and identical top-1 token predictions to FP16 across 330 experiments on Gemma 4 31B and Llama 3.1 70B. It also reports a cross-architecture analysis showing that the attention scale factor (not model size) determines whether angular quantization schemes succeed or fail.
Significance. If the numerical equivalence and performance claims hold, the work would be significant for practical long-context LLM deployment on consumer Apple Silicon hardware, where memory and compute constraints currently limit 128K contexts. The concrete speed and memory measurements, plus the first cross-architecture KV quantization analysis tied to attention scaling, offer actionable insights for systems implementers. The empirical focus with matching token predictions across two model families strengthens the practical contribution, though broader adoption would benefit from stronger verification of exact semantics.
major comments (1)
- [Abstract] The assertion of numerical equivalence for the sdpa_int4 kernel (Abstract) rests exclusively on identical top-1 token predictions across 330 experiments. This is a low-sensitivity test: small per-element errors arising from int4 dequantization rounding or shader accumulation order can be masked by softmax and argmax, particularly when attention scale factors differ (e.g., Gemma's attn_scale=1.0 vs. Llama's 1/sqrt(d)). No max absolute error on attention scores, per-head output tensors, or bit-exact checks against an FP16 reference are reported, leaving open the possibility that the reported 48x speedup and 3.2x compression are measured on a numerically inexact kernel.
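The referee's "low-sensitivity" point is easy to make concrete with a synthetic example (our construction, not anything measured in the paper): argmax over next-token logits can absorb sizable per-element perturbations without changing the predicted token, so identical top-1 predictions do not certify small numerical error.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = 1000
logits = np.zeros(vocab)
logits[42] = 5.0  # one clearly preferred "token"

# Per-element error of up to 1.0 on every logit -- enormous compared with the
# rounding error a correct kernel should introduce.
perturbed = logits + rng.uniform(-1.0, 1.0, vocab)

same_top1 = int(np.argmax(logits)) == int(np.argmax(perturbed))
max_err = float(np.abs(perturbed - logits).max())
```

Here the top-1 token is unchanged by construction (the preferred logit still dominates), even though the worst-case logit error approaches 1.0; this is exactly why per-tensor error bounds are a stronger test than token agreement.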
minor comments (2)
- The abstract states 330 experiments but provides no details on the specific test prompts, context length distribution, model variants, or statistical significance (e.g., error bars or variance across runs), hindering reproducibility of the speedup and memory claims.
- The cross-architecture analysis on attention scale factor would benefit from explicit equations or pseudocode showing how PolarQuant error is amplified by attn_scale=1.0 versus standard scaling.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the limitations in our current numerical validation approach. We address the major comment below and will strengthen the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The assertion of numerical equivalence for the sdpa_int4 kernel (Abstract) rests exclusively on identical top-1 token predictions across 330 experiments. This is a low-sensitivity test: small per-element errors arising from int4 dequantization rounding or shader accumulation order can be masked by softmax and argmax, particularly when attention scale factors differ (e.g., Gemma's attn_scale=1.0 vs. Llama's 1/sqrt(d)). No max absolute error on attention scores, per-head output tensors, or bit-exact checks against an FP16 reference are reported, leaving open the possibility that the reported 48x speedup and 3.2x compression are measured on a numerically inexact kernel.
Authors: We agree that top-1 token prediction equivalence is a low-sensitivity test and that small per-element discrepancies from int4 quantization rounding or non-deterministic shader accumulation can be masked by softmax and argmax, especially across models with different attention scales. In the revised manuscript we will add maximum absolute error measurements on attention scores and per-head output tensors relative to an FP16 reference, reported for both model families at representative context lengths. These internal checks show bounded errors that do not alter final token predictions. Bit-exact matching is neither feasible nor meaningful here, because the Metal shaders perform floating-point accumulation whose order is not guaranteed to match a CPU reference. The 48x speedup and 3.2x compression figures were obtained directly from the deployed sdpa_int4 kernel whose outputs were used in the 330 token-prediction experiments; the performance numbers therefore already correspond to the same implementation whose practical equivalence is demonstrated at the token level. We will update the abstract, results, and evaluation sections to include the additional error metrics and to clarify this distinction. Revision: yes.
Circularity Check
No circularity: empirical implementation with benchmark validation
Full rationale
The paper describes a systems implementation of fused int4 attention kernels in Metal shaders for Apple Silicon, reporting measured speedups, memory savings, and top-1 token agreement across 330 experiments. No mathematical derivation chain, equations, fitted parameters, or self-citations of uniqueness theorems appear in the provided text. Claims rest on direct empirical measurements rather than any self-referential logic or reduction of outputs to inputs by construction. The noted limitation (reliance on top-1 match rather than per-element error metrics) is a question of validation strength, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard scaled dot-product attention formula
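For reference, the ledger's lone axiom with the attention scale $s$ made explicit, since the paper's cross-architecture finding turns on it (the Gemma 4 value is as reported in the abstract):

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( s\, Q K^{\top} \right) V,
\qquad
s = \begin{cases}
  1/\sqrt{d} & \text{standard scaling (Llama)} \\
  1          & \text{Gemma 4 (\texttt{attn\_scale} = 1.0)}
\end{cases}
```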
Reference graph
Works this paper leans on
- [1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
- [2] Apple Machine Learning Research. MLX: An array framework for Apple silicon. https://github.com/ml-explore/mlx.
- [3] Dao, T. et al. Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
- [4] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
- [5] Gemma Team, Riviere, M., Pathak, S., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- [6] Grattafiori, A. et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [7] Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. PolarQuant: Quantizing KV caches with polar transformation. arXiv preprint arXiv:2502.02617.
- [8] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP). arXiv:2309.06180.
- [9] Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (ICML), 2023. arXiv:2211.17192.
- [10] Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In International Conference on Machine Learning (ICML).
- [11] Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867.
- [12] Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
- [13] Zandieh, A., Daliri, M., and Han, I. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead. arXiv preprint arXiv:2406.03482.
- [14] Zandieh, A., Daliri, M., Hadian, M., and Mirrokni, V. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874.