pith. machine review for the scientific record.

arxiv: 2604.16957 · v1 · submitted 2026-04-18 · 💻 cs.LG

Recognition: unknown

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords fused attention · KV cache quantization · compressed domain · Apple Metal · int4 inference · long context LLM · consumer hardware

The pith

Fused compressed-domain attention on Apple Silicon enables 128K-context inference for 70B models on a single 64GB Mac.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Open-TQ-Metal as the first system to quantize the KV cache to int4 on the fly and compute attention directly on that compressed representation using custom Metal shaders for Apple Silicon. This removes the intermediate step of materializing full-precision dequantized matrices, which previously made long contexts infeasible on consumer hardware due to memory limits. A sympathetic reader would care because the approach makes 128K-context runs of large models like Llama 3.1 70B feasible on everyday 64GB Macs without cloud services or multiple GPUs. Across 330 experiments the method delivers a 48x attention speedup at 128K context, 3.2x KV cache compression, and top-1 predictions identical to FP16. The work also shows that a model's attention scale factor, rather than its size, decides whether angular quantization methods succeed or fail across architectures.

Core claim

By fusing on-the-fly int4 quantization of the KV cache with attention computation inside custom Metal compute shaders, the system performs scaled dot-product attention directly in the compressed domain. This yields a 48x speedup in attention at 128K context over dequantize-then-attend baselines, reduces KV cache memory from 40 GB to 12.5 GB, and produces identical top-1 token predictions to FP16 inference across tested models and prompts.
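The memory arithmetic behind the 40 GB to 12.5 GB figure can be checked from the published Llama 3.1 70B dimensions. A minimal sketch follows; the 80 layers, 8 grouped-query KV heads, and head dimension of 128 are the model's public configuration, while the int4 group size of 32 with an FP16 scale and zero point per group is our assumption, chosen because it reproduces the reported numbers.

    # KV cache sizing for Llama 3.1 70B at 128K (131,072-token) context.
    # Published model shape: 80 layers, 8 KV heads (GQA), head_dim 128.
    # Assumed int4 layout: groups of 32 values sharing an FP16 scale + zero point.
    layers, kv_heads, head_dim, ctx = 80, 8, 128, 128 * 1024
    elems = 2 * layers * kv_heads * head_dim * ctx       # K and V elements

    fp16_gb = elems * 2 / 2**30                          # 2 bytes per element -> 40.0 GB
    int4_bytes_per_elem = 0.5 + 4 / 32                   # packed nibble + per-group metadata
    int4_gb = elems * int4_bytes_per_elem / 2**30        # -> 12.5 GB

    print(f"FP16 {fp16_gb:.1f} GB, int4 {int4_gb:.1f} GB, {fp16_gb / int4_gb:.1f}x compression")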

What carries the argument

The sdpa_int4 kernel, a custom Metal compute shader that executes attention directly on int4-quantized key and value tensors, rescaling elements on the fly in GPU registers rather than materializing any intermediate dequantized matrices.
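A minimal NumPy sketch of what attention in the compressed domain means functionally for a single query vector, assuming asymmetric per-group int4 quantization along the head dimension (the paper's exact grouping and packing are not given in the excerpt, and the real kernel unpacks nibbles and rescales in registers rather than building the k_hat / v_hat arrays used here for clarity):

    import numpy as np

    def quantize_int4(x, group=32):
        # Asymmetric per-group int4 quantization along the last axis (assumed layout).
        xg = x.reshape(*x.shape[:-1], -1, group)
        lo = xg.min(axis=-1, keepdims=True)
        scale = (xg.max(axis=-1, keepdims=True) - lo) / 15.0 + 1e-8
        q = np.clip(np.round((xg - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo

    def sdpa_int4_reference(q_vec, K, V, attn_scale):
        # Functional model of the fused kernel: K and V are stored as int4 codes plus
        # per-group scale/zero, and every element is rescaled while the dot products
        # are formed (in the Metal kernel this rescaling happens in registers).
        (Kq, Ks, Kz), (Vq, Vs, Vz) = quantize_int4(K), quantize_int4(V)
        k_hat = (Kq * Ks + Kz).reshape(K.shape)   # stand-in for register-level dequant
        v_hat = (Vq * Vs + Vz).reshape(V.shape)
        logits = attn_scale * (k_hat @ q_vec)     # [seq_len]
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return w @ v_hat                          # [head_dim]

    # Example: one head, 4K cached tokens, standard 1/sqrt(d) scaling.
    rng = np.random.default_rng(0)
    K, V = rng.standard_normal((4096, 128)), rng.standard_normal((4096, 128))
    out = sdpa_int4_reference(rng.standard_normal(128), K, V, attn_scale=1 / np.sqrt(128))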

If this is right

  • 128K context lengths become practical for 70B models on single 64GB consumer Macs.
  • KV cache memory drops by a factor of 3.2 while attention speed increases up to 48 times compared with dequantize-then-attend methods.
  • Output predictions remain unchanged from full-precision inference in the evaluated cases.
  • Quantization success depends on the model's attention scale factor, which explains performance differences between model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fused kernels could be written for other shader or accelerator platforms to extend compressed-domain attention beyond Apple Silicon.
  • The scale-factor finding may guide architecture-specific quantization designs that achieve higher compression ratios.
  • Reduced memory traffic could translate into lower power draw during long-context inference on edge devices.
  • The technique might combine with other compression methods to support even longer contexts on the same hardware.

Load-bearing premise

The custom Metal shaders implement exact attention semantics without numerical drift or hardware-specific bugs, and the identical top-1 predictions generalize beyond the tested prompts and models.

What would settle it

Run side-by-side inference using the fused int4 kernel and a standard FP16 implementation on new prompts or additional model variants; any difference in top-1 tokens or measurable numerical deviation in attention outputs would falsify the claim of preserved semantics.
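That test could be scripted roughly as follows. fp16_model, fused_int4_model, and next_logits are hypothetical handles and a placeholder API, not names from the paper's code; the point is to record both top-1 agreement and the stricter maximum-absolute-deviation metric raised in the referee report below.

    import numpy as np

    def equivalence_check(fp16_model, fused_int4_model, prompts, steps=64):
        # Hypothetical harness: greedily decode with the FP16 reference and, at every
        # step, compare the fused-int4 path's logits against it on identical inputs.
        worst_abs_err, top1_mismatches = 0.0, 0
        for prompt in prompts:
            tokens = list(prompt)
            for _ in range(steps):
                ref = fp16_model.next_logits(tokens)          # placeholder API
                test = fused_int4_model.next_logits(tokens)   # placeholder API
                worst_abs_err = max(worst_abs_err, float(np.max(np.abs(ref - test))))
                top1_mismatches += int(np.argmax(ref) != np.argmax(test))
                tokens.append(int(np.argmax(ref)))            # extend with reference token
        return {"top1_mismatches": top1_mismatches, "max_abs_logit_error": worst_abs_err}

Any top-1 mismatch, or a max absolute logit error large enough to matter downstream, would falsify the claim of preserved semantics.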

Figures

Figures reproduced from arXiv: 2604.16957 by Sai Vegasena.

Figure 1. The two core results of Open-TQ-Metal: (a) enabling 128K context on hardware where it was previously impossible, and (b) achieving super-linear attention speedup via fused compressed-domain computation.
Figure 2. Gemma 4 31B decode throughput vs. context length. The fused kernel (orange) maintains constant throughput while the baseline (gray) degrades as dequantization bandwidth increases. The shaded region shows the throughput advantage of the fused kernel.
Figure 3. Memory breakdown: Open-TQ-Metal (left) vs. mlx-lm (right) at each context length. At 128K, mlx-lm needs 80 GB; Open-TQ-Metal fits in 53.6 GB.
Figure 4. KV cache size at 128K context for Llama 70B. Int4 (12.5 GB) and PolarQuant 5-bit (13 GB) fit in 64 GB; QJL achieves high compression but fails at 70B due to compound noise.
Figure 5. Total memory vs. context length for Llama 70B. FP16 KV (red) crosses the 64 GB limit at ∼73K tokens; int4 KV (orange) enables 236K tokens.
Figure 6. Gemma 4 31B at 256K context. Only int4 KV fits within the 64 GB M1 Max limit.
Figure 7. Standalone kernel latency on Gemma 4 31B. Fused kernel (orange) vs. MLX baseline (teal).
original abstract

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor -- not model size -- determines whether angular quantization schemes like PolarQuant succeed or fail, with Gemma 4's attn_scale=1.0 amplifying directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon via custom Metal compute shaders. It quantizes the KV cache to int4 on the fly and performs attention directly in the compressed domain, eliminating dequantization intermediates. Claims include enabling 128K-context inference for Llama 3.1 70B on a single 64GB Mac, a 48x attention speedup at 128K context versus dequantize-then-attend, 3.2x KV cache compression (40 GB to 12.5 GB), and identical top-1 token predictions to FP16 across 330 experiments on Gemma 4 31B and Llama 3.1 70B. It also reports a cross-architecture analysis showing that the attention scale factor (not model size) determines whether angular quantization schemes succeed or fail.

Significance. If the numerical equivalence and performance claims hold, the work would be significant for practical long-context LLM deployment on consumer Apple Silicon hardware, where memory and compute constraints currently limit 128K contexts. The concrete speed and memory measurements, plus the first cross-architecture KV quantization analysis tied to attention scaling, offer actionable insights for systems implementers. The empirical focus with matching token predictions across two model families strengthens the practical contribution, though broader adoption would benefit from stronger verification of exact semantics.

major comments (1)
  1. [Abstract] The assertion of numerical equivalence for the sdpa_int4 kernel (Abstract) rests exclusively on identical top-1 token predictions across 330 experiments. This is a low-sensitivity test: small per-element errors arising from int4 dequantization rounding or shader accumulation order can be masked by softmax and argmax, particularly when attention scale factors differ (e.g., Gemma's attn_scale=1.0 vs. Llama's 1/sqrt(d)). No max absolute error on attention scores, per-head output tensors, or bit-exact checks against an FP16 reference are reported, leaving open the possibility that the reported 48x speedup and 3.2x compression are measured on a numerically inexact kernel.
minor comments (2)
  1. The abstract states 330 experiments but provides no details on the specific test prompts, context length distribution, model variants, or statistical significance (e.g., error bars or variance across runs), hindering reproducibility of the speedup and memory claims.
  2. The cross-architecture analysis on attention scale factor would benefit from explicit equations or pseudocode showing how PolarQuant error is amplified by attn_scale=1.0 versus standard scaling.
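On minor comment 2, a first-order account is straightforward: with pre-softmax logits s * (q · k), a key-quantization error Δk perturbs each logit by s * (q · Δk), so a model with attn_scale s = 1.0 sees sqrt(d) times the logit perturbation of one using s = 1/sqrt(d) for the same Δk (about 11.3x at head dimension 128). The abstract's larger 25-100x range concerns directional error specific to angular schemes like PolarQuant, which the toy below does not model; this is our illustration of the scale effect only, not the paper's analysis.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 128
    q, k = rng.standard_normal(d), rng.standard_normal(d)
    dk = 0.05 * rng.standard_normal(d)              # stand-in for key quantization error

    for name, s in [("1/sqrt(d) (Llama)", 1 / np.sqrt(d)), ("attn_scale=1.0 (Gemma 4)", 1.0)]:
        perturbation = abs(s * q @ (k + dk) - s * q @ k)
        print(f"{name:>26}: logit perturbation {perturbation:.4f}")
    # The attn_scale=1.0 row is exactly sqrt(d) ~ 11.3x larger for the same dk.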

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the limitations in our current numerical validation approach. We address the major comment below and will strengthen the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] The assertion of numerical equivalence for the sdpa_int4 kernel (Abstract) rests exclusively on identical top-1 token predictions across 330 experiments. This is a low-sensitivity test: small per-element errors arising from int4 dequantization rounding or shader accumulation order can be masked by softmax and argmax, particularly when attention scale factors differ (e.g., Gemma's attn_scale=1.0 vs. Llama's 1/sqrt(d)). No max absolute error on attention scores, per-head output tensors, or bit-exact checks against an FP16 reference are reported, leaving open the possibility that the reported 48x speedup and 3.2x compression are measured on a numerically inexact kernel.

    Authors: We agree that top-1 token prediction equivalence is a low-sensitivity test and that small per-element discrepancies from int4 quantization rounding or non-deterministic shader accumulation can be masked by softmax and argmax, especially across models with different attention scales. In the revised manuscript we will add maximum absolute error measurements on attention scores and per-head output tensors relative to an FP16 reference, reported for both model families at representative context lengths. These internal checks show bounded errors that do not alter final token predictions. Bit-exact matching is not feasible or meaningful here because the Metal shaders perform floating-point accumulation whose order is not guaranteed to match a CPU reference. The 48x speedup and 3.2x compression figures were obtained directly from the deployed sdpa_int4 kernel whose outputs were used in the 330 token-prediction experiments; therefore the performance numbers already correspond to the same implementation whose practical equivalence is demonstrated at the token level. We will update the abstract, results, and evaluation sections to include the additional error metrics and to clarify this distinction.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation with benchmark validation

full rationale

The paper describes a systems implementation of fused int4 attention kernels in Metal shaders for Apple Silicon, reporting measured speedups, memory savings, and top-1 token agreement across 330 experiments. No mathematical derivation chain, equations, fitted parameters, or self-citations of uniqueness theorems appear in the provided text. Claims rest on direct empirical measurements rather than any self-referential logic or reduction of outputs to inputs by construction. The noted limitation (reliance on top-1 match rather than per-element error metrics) is a question of validation strength, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an engineering implementation that relies on the standard transformer attention formula and existing int4 quantization schemes; no new free parameters, axioms, or invented entities are introduced.

axioms (1)
  • [standard math] Standard scaled dot-product attention formula
    The fused kernel is described as computing attention directly on the compressed representation, presupposing the usual QK^T / sqrt(d) + softmax + V structure.
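    For reference, the presupposed structure, written with the scale factor that the cross-architecture analysis turns on (a restatement of standard attention, not a new result):

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(s \, Q K^{\top}\right) V, \qquad s = \tfrac{1}{\sqrt{d_{\mathrm{head}}}} \ \text{(standard)}, \quad s = 1.0 \ \text{(Gemma 4)}.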

pith-pipeline@v0.9.0 · 5525 in / 1216 out tokens · 49514 ms · 2026-05-10T07:14:39.376431+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1] Ainslie, J. et al. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv:2305.13245.

  2. [2] Dao, T. et al. Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.

  3. [3] Frantar, E. et al. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.

  4. [4] Grattafiori, A. et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  5. [5] Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. PolarQuant: Quantizing KV caches with polar transformation. arXiv preprint arXiv:2502.02617.

  6. [6] GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM. arXiv preprint arXiv:2403.05527, 2024.

  7. [7] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP), 2023. arXiv:2309.06180.

  8. [8] Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (ICML), 2023. arXiv:2211.17192.

  9. [9] Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867.

  10. [10] Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.

  11. [11] Zandieh, A., Daliri, M., and Han, I. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead. arXiv preprint arXiv:2406.03482.

  12. [12] Zandieh, A., Daliri, M., Hadian, M., and Mirrokni, V. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874.