
arxiv: 2606.08761 · v2 · pith:AKOWOTOEnew · submitted 2026-06-07 · 💻 cs.DC · cs.AI

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Pith reviewed 2026-06-30 11:05 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords W4A4 quantization · LLM inference · GPU kernel optimization · Tensor Core utilization · dequantization overhead · intra-SM balancing · vLLM integration · INT4 GEMM

The pith

APEX4 shows that W4A4 LLM inference speed depends on matching dequantization work to each GPU's Tensor Core versus CUDA Core throughput ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that full 4-bit weight and activation quantization for large language models is limited by dequantization work running on general-purpose cores inside each streaming multiprocessor. By running controlled tests on four GPUs spanning Ampere and Ada designs, the authors identify the throughput ratio between matrix-specialized units and general units as the key predictor of whether a pure W4A4 kernel runs faster or slower than mixed-precision alternatives. They then build kernels that adapt the grouping size of quantized values to this ratio, keeping the general cores from becoming the dominant cost. The resulting system runs as a drop-in replacement inside vLLM, preserves model quality to within 0.63 perplexity of FP16 on LLaMA-2-70B, and produces end-to-end speedups that reach 2.09 times on lower-ratio GPUs while still recovering 1.2 to 1.4 times on higher-ratio devices through a mixed-granularity fallback.

Core claim

The central claim is that the Tensor Cores to CUDA Cores throughput ratio ρ governs W4A4 kernel behavior: the same g128 kernel yields 2.0–2.5× speedup on RTX 3090 (ρ=16) yet only 0.43–0.47× on A100 (ρ=64) in compute-bound cases. Guided by this platform dependence, APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation that removes the dequantization bottleneck, delivering perplexity within 0.63 of FP16 on LLaMA-2-70B, 4.0–4.4% higher zero-shot accuracy than W4Ax Atom-g128, and measured end-to-end speedups of 1.66× on L40S, 1.78× on RTX 3090, 2.09× on A40, and 1.20–1.40× on A100 via mixed mode.
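
To see why ρ and the group size interact, a back-of-the-envelope model helps. The sketch below is an editorial toy model, not the paper's analysis: it assumes each output element needs K Tensor Core multiply-accumulates plus K/g per-group rescales on CUDA Cores, that the two phases do not overlap, and that `ops_per_rescale` (a hypothetical knob) captures the CUDA Core cost of one rescale. Under those assumptions the dequantization-to-MMA time ratio scales as ρ/g.

```python
def dequant_to_mma_time_ratio(rho, group_size, ops_per_rescale=1.0):
    """Toy estimate of (CUDA Core dequant time) / (Tensor Core MMA time).

    Per output element, a K-long dot product costs K Tensor Core MACs and
    K / group_size per-group rescales on CUDA Cores. With
    rho = T_TC / T_CC (throughputs in the same units), the time ratio is
    ops_per_rescale * rho / group_size.
    ops_per_rescale is a hypothetical parameter, not a value from the paper.
    """
    if rho <= 0 or group_size <= 0:
        raise ValueError("rho and group_size must be positive")
    return ops_per_rescale * rho / group_size


# rho values as quoted in the summary; g = 128 matches the g128 kernel.
for name, rho in [("L40S", 8), ("RTX 3090", 16), ("A40", 16), ("A100", 64)]:
    ratio = dequant_to_mma_time_ratio(rho, group_size=128)
    print(f"{name:9s} rho={rho:2d}  relative dequant cost at g128 = {ratio:.4f}")
```

Only the relative scaling is meaningful here: at fixed g128 the model gives A100 (ρ=64) four times the relative dequantization cost of RTX 3090 (ρ=16), which is the direction of the reported kernel results. The absolute values depend on the assumed constant and should not be read as measurements.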

What carries the argument

ρ-aware granularity adaptation that rebalances intra-SM work between Tensor Core matrix multiplies and CUDA Core dequantization in pure INT4 GEMM kernels
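
The split between the two kinds of work can be made concrete with a reference implementation. The NumPy sketch below is an illustration only, assuming symmetric per-group INT4 quantization with one scale per row and group; the function names and the quantization scheme are editorial assumptions, not the APEX4 kernels. The integer matmul per group stands in for the Tensor Core part, and the per-group rescale-and-accumulate stands in for the CUDA Core dequantization.

```python
import numpy as np


def quantize_int4_groups(x, group_size):
    """Symmetric per-group INT4 quantization along the last (K) axis.

    Returns q with shape (rows, K // group_size, group_size) and one
    scale per (row, group). Values are clipped to [-7, 7].
    """
    rows, k = x.shape
    assert k % group_size == 0, "K must be a multiple of group_size"
    groups = x.reshape(rows, k // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.rint(groups / scale), -7, 7).astype(np.int8)
    return q, scale.squeeze(-1)


def w4a4_group_gemm(qa, sa, qw, sw):
    """Compute C = A @ W.T from group-quantized activations and weights."""
    m, n_groups, _ = qa.shape
    n = qw.shape[0]
    out = np.zeros((m, n), dtype=np.float64)
    for gi in range(n_groups):
        # Integer partial product over one K-group (Tensor Core role).
        partial = qa[:, gi, :].astype(np.int32) @ qw[:, gi, :].astype(np.int32).T
        # Per-group rescale and accumulate (CUDA Core role):
        # one multiply-add per output element per group.
        out += partial * np.outer(sa[:, gi], sw[:, gi])
    return out


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 256))
w = rng.standard_normal((8, 256))
qa, sa = quantize_int4_groups(a, group_size=128)
qw, sw = quantize_int4_groups(w, group_size=128)
c = w4a4_group_gemm(qa, sa, qw, sw)
```

With K=256 and g=128 the loop runs twice, so every output element is rescaled twice. Halving the group size doubles that rescale work while the integer matmul work stays the same, which is the trade-off a granularity choice has to balance.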

If this is right

  • Pure W4A4 inference becomes viable without mixed-precision fallbacks once granularity is chosen to match a GPU's ρ value.
  • Unmodified vLLM deployments obtain up to 2.09× end-to-end speedup on GPUs whose ρ lies near 16.
  • A mixed-granularity mode recovers 1.20–1.40× speedup on high-ρ platforms such as A100 while still using mostly INT4 arithmetic.
  • Quantization kernel design should treat the Tensor-to-CUDA core throughput ratio as a first-class tuning parameter rather than assuming uniform hardware behavior.
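
Treating ρ as a tuning parameter could look like the dispatch rule sketched below. This is a hypothetical illustration: the paper's actual switching logic is not described in this summary, and `rho_threshold` is an invented cut-off. The summary only reports that g128 works at ρ=16 and that the mixed mode is used at ρ=64, so any threshold between those values would fit the two data points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelChoice:
    mode: str    # "g128" = pure group-128 W4A4, "mixed" = mixed-granularity fallback
    reason: str


def choose_granularity(rho: float, rho_threshold: float = 32.0) -> KernelChoice:
    """Pick a kernel mode from a measured Tensor/CUDA Core throughput ratio.

    rho_threshold is a hypothetical value for illustration only; it is not
    taken from the paper.
    """
    if rho <= 0:
        raise ValueError("rho must be positive")
    if rho <= rho_threshold:
        return KernelChoice(
            "g128", f"rho={rho:g} <= {rho_threshold:g}: dequantization assumed cheap enough"
        )
    return KernelChoice(
        "mixed", f"rho={rho:g} > {rho_threshold:g}: fall back to reduce CUDA Core work"
    )


# The two platforms whose modes are stated in the summary.
for gpu, rho in [("RTX 3090", 16), ("A100", 64)]:
    print(gpu, choose_granularity(rho))
```

A real serving stack would more plausibly key this on a measured table per GPU model than on a single threshold; the single cut-off is used here only to keep the sketch short.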

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future GPU architectures could reduce the need for such adaptation by shipping more balanced Tensor Core and CUDA Core throughputs.
  • The same rebalancing logic could be applied to other low-precision formats or to accelerators whose matrix and scalar units have mismatched speeds.
  • Platform-specific granularity tables may become a standard part of LLM serving stacks as more quantized kernels are deployed across heterogeneous hardware.

Load-bearing premise

The throughput ratio between Tensor Cores and CUDA Cores is the dominant hardware factor that decides whether dequantization overhead prevents pure W4A4 kernels from being faster than mixed-precision baselines.

What would settle it

A controlled benchmark on a new GPU architecture in which measured ρ fails to predict the observed W4A4 kernel speedup or slowdown relative to mixed-precision baselines would falsify the claim.
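
One way to phrase that test mechanically is an ordering check: if ρ predicts kernel behavior, a GPU with higher ρ should not show a higher speedup at the same granularity. The sketch below is an editorial illustration under that simplified reading of the claim. The two reported entries use the conservative ends of the ranges quoted above (lower bound for RTX 3090, upper bound for A100); the third entry is entirely hypothetical and exists only to show what a counterexample would look like.

```python
from itertools import combinations


def ordering_violations(measurements):
    """Return (low_rho_gpu, high_rho_gpu) pairs that contradict the claim.

    measurements: iterable of (gpu_name, rho, speedup) tuples.
    A pair is a violation when the GPU with the higher rho has a
    speedup that is greater than or equal to the lower-rho GPU's.
    """
    violations = []
    for (name_a, rho_a, s_a), (name_b, rho_b, s_b) in combinations(measurements, 2):
        if rho_a == rho_b:
            continue  # equal rho makes no ordering prediction
        if rho_a < rho_b:
            low, high = (name_a, s_a), (name_b, s_b)
        else:
            low, high = (name_b, s_b), (name_a, s_a)
        if high[1] >= low[1]:
            violations.append((low[0], high[0]))
    return violations


# g128 kernel speedups from the summary, compute-bound case.
reported = [("RTX 3090", 16, 2.0), ("A100", 64, 0.47)]
print(ordering_violations(reported))

# Hypothetical future measurement, NOT real data.
with_counterexample = reported + [("hypothetical GPU X", 128, 1.5)]
print(ordering_violations(with_counterexample))
```

Two data points cannot establish a trend on their own, so a check like this is only useful once more GPUs are measured at a fixed granularity and workload.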

Figures

Figures reproduced from arXiv: 2606.08761 by Christoph Meinel, Haojin Yang, Hong Guo, Jona Otholt, Nianhui Guo, Weixing Wang.

Figure 1: The proposed W4A4-g128 GEMM kernel speedup over FP16 (N=K=8192) across GPUs with varying ρ. Higher ρ consistently yields lower speedup; A100 (ρ=64) falls below break-even. The central insight of this paper is that the severity of this bottleneck is not constant, but is largely governed by the intra-SM balance between Tensor Cores and CUDA Cores throughput, which we capture as ρ = T_TC/T_CC. This ratio varie…
Figure 2: Kernel-internal time ratio of W4A4 channel …
Figure 3: Effect of Hadamard-based activation smoothing.
Figure 4: Overview of transformer architecture with W4A4 quantization deployment. The linear layers …
Figure 5: W4A4 channel and group quantization principal overview. Subfigures (a) and (b) respectively …
Figure 6: Four-stage asynchronous pipeline timing diagram.
Figure 7: Activation matrix and weight matrix data preprocessing and bank conflict avoidance principle.
Figure 8: Thread layouts of activation matrix, weight matrix, and result matrix C on the minimal instruction-level tile, as well as the thread layouts when loading activation and weight matrix data using ldmatrix instructions, and the thread layouts when loading S1 and S2. Data preprocessing and memory management are key foundations for efficient kernel execution, involving the design of activation matrices, weigh…
Figure 9: Kernel speedup comparison across different precisions on four GPUs: A100, RTX 3090, A40, and …
Figure 10: Comparison of end-to-end speedup across different precisions on four GPUs: A100, RTX 3090, …
Figure 11: Average kernel time ratio of channel to group-128. Each data point is the mean ratio across six …

Original abstract

W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio (ρ) as the primary hardware indicator: the W4A4-g128 kernel yields 2.0–2.5× speedup on RTX 3090 (ρ=16) yet degrades to 0.43–0.47× on A100 (ρ=64) in compute-bound scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build APEX4, which co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to mitigate the CUDA Cores dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0%–4.4% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers up to 1.66× end-to-end speedup on L40S (ρ=8), 1.78× on RTX 3090 (ρ=16), and 2.09× on A40 (ρ=16), while recovering A100 (ρ=64) to 1.20–1.40× via the mixed-granularity mode. Our code is available at https://github.com/APEX4-W4A4/APEX4-W4A4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that the Tensor Cores to CUDA Cores throughput ratio ρ is the primary hardware parameter governing the W4A4 dequantization bottleneck on CUDA Cores. Through controlled benchmarks on four Ampere/Ada GPUs, the authors show that pure W4A4-g128 kernels deliver 2.0–2.5× speedup on low-ρ platforms (e.g., RTX 3090, ρ=16) but degrade on high-ρ platforms (A100, ρ=64). They introduce APEX4, which co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation (g128 and mixed modes) to mitigate the bottleneck. The system achieves perplexity within 0.63 of FP16 on LLaMA-2-70B, 4.0–4.4% higher zero-shot accuracy than W4Ax Atom-g128, and end-to-end speedups up to 2.09× on A40 (ρ=16) and 1.66× on L40S (ρ=8), recovering 1.20–1.40× on A100 via mixed mode. The implementation is released as a drop-in for vLLM with public code.

Significance. If the results hold, the work establishes that pure W4A4 inference is platform-dependent rather than universally infeasible, with ρ providing a concrete, measurable indicator for granularity adaptation. The direct hardware measurements of ρ across architectures (rather than fitted parameters) and the open-source kernels at https://github.com/APEX4-W4A4/APEX4-W4A4 constitute verifiable contributions to systems for quantized LLM inference. The empirical demonstration of accuracy-speedup trade-offs under unmodified vLLM strengthens the practical relevance.

minor comments (2)
  1. [Abstract] Abstract: the end-to-end speedup claims list specific ρ values and GPUs but omit workload parameters (batch size, sequence length, or model configuration) that produce the peak numbers; adding these would improve reproducibility.
  2. [Implementation] The description of how the mixed-granularity mode switches between g128 and finer granularity could be expanded with short pseudocode or a decision rule based on measured ρ to clarify the adaptation logic.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough and positive review, which accurately captures the core contributions of APEX4 regarding the role of the Tensor Core to CUDA Core throughput ratio ρ and the resulting platform-dependent viability of pure W4A4 inference. The recommendation for minor revision is appreciated. No specific major comments were listed in the report, so we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central results consist of direct hardware measurements of kernel speedups, perplexity, and accuracy on specific GPUs (A40, RTX 3090, A100, L40S) with stated ρ values. The identification of ρ as the governing parameter is presented as an empirical outcome of controlled cross-architecture benchmarks rather than a fitted model or self-referential definition. No equations, predictions, or uniqueness claims reduce by construction to parameters fitted on the same data; the mixed-granularity mode and granularity adaptation are described as engineering responses to the measured platform dependence. The work is therefore self-contained against external benchmarks with no load-bearing self-citation or definitional loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard GPU microarchitecture assumptions and empirical measurement rather than new theoretical constructs or fitted constants.

free parameters (1)
  • granularity modes (g128 and mixed)
    Specific grouping sizes chosen per ρ value; treated as design choices rather than data-fitted parameters.
axioms (1)
  • domain assumption: Tensor Core to CUDA Core throughput ratio ρ is the dominant factor controlling dequantization overhead in W4A4 kernels
    Invoked to interpret benchmark results across Ampere and Ada GPUs and to guide the adaptation policy.

pith-pipeline@v0.9.1-grok · 5895 in / 1365 out tokens · 38577 ms · 2026-06-30T11:05:48.517258+00:00 · methodology

