APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Christoph Meinel; Haojin Yang; Hong Guo; Jona Otholt; Nianhui Guo; Weixing Wang

arxiv: 2606.08761 · v2 · pith:AKOWOTOEnew · submitted 2026-06-07 · 💻 cs.DC · cs.AI

Pith reviewed 2026-06-27 17:46 UTC · model grok-4.3

keywords: W4A4 quantization, INT4 GEMM, LLM inference, Tensor Core, CUDA Core, granularity adaptation, vLLM, intra-SM compute balance

The pith

Pure W4A4 inference for LLMs pays off when the Tensor Core to CUDA Core throughput ratio ρ is 16 or lower; on higher-ρ GPUs it needs ρ-aware kernel granularity adaptation to stay ahead of FP16.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that full INT4 weight-and-activation quantization for large language models is limited not by quantization error itself but by the overhead of dequantization on CUDA cores inside each streaming multiprocessor. Controlled tests across Ampere and Ada GPUs reveal that this overhead is governed by the hardware ratio ρ of Tensor Core to CUDA Core throughput. When ρ is 16 or lower, pure W4A4 kernels outperform mixed-precision baselines; when ρ reaches 64, they fall behind. APEX4 applies this observation by co-designing INT4 GEMM kernels that adapt group granularity to the local ρ value, enabling drop-in use in vLLM with only a 0.63 perplexity gap to FP16 on LLaMA-2-70B.

Core claim

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to mitigate the CUDA Cores dequantization bottleneck, achieving perplexity within 0.63 of FP16 on LLaMA-2-70B, 4.0–4.4% higher zero-shot accuracy than W4Ax Atom-g128, and end-to-end speedups of 1.66× on L40S (ρ=8), 1.78× on RTX 3090 (ρ=16), 2.09× on A40 (ρ=16), and 1.20–1.40× on A100 (ρ=64) via mixed-granularity mode.
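
For quick reference, the speedups quoted in the core claim can be collected into a small table. The Python sketch below is illustrative only: the tuple layout and the helper names `by_rho` and `format_row` are assumptions made here, while the numbers are copied from the claim above. Only A100 is described as using the mixed-granularity mode, so the flag is set for that row alone.

```python
# End-to-end speedups over FP16 as quoted in the core claim.
# Layout: (gpu, rho, low, high, uses_mixed_granularity_mode).
# The layout and helper names are illustrative, not taken from the paper.
REPORTED_SPEEDUPS = [
    ("L40S", 8, 1.66, 1.66, False),
    ("RTX 3090", 16, 1.78, 1.78, False),
    ("A40", 16, 2.09, 2.09, False),
    ("A100", 64, 1.20, 1.40, True),
]


def by_rho(rows):
    """Group the reported rows by their rho value."""
    grouped = {}
    for gpu, rho, low, high, mixed in rows:
        grouped.setdefault(rho, []).append((gpu, low, high, mixed))
    return grouped


def format_row(gpu, rho, low, high, mixed):
    """Render one row as a fixed-width text line."""
    span = f"{low:.2f}x" if low == high else f"{low:.2f}-{high:.2f}x"
    note = " (mixed-granularity mode)" if mixed else ""
    return f"{gpu:<9} rho={rho:<3} {span}{note}"


for row in REPORTED_SPEEDUPS:
    print(format_row(*row))
```

Grouping by ρ makes it easy to see that the two ρ=16 cards report different speedups, so ρ is an indicator of viability rather than a full predictor of the end-to-end number.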

What carries the argument

ρ-aware granularity adaptation inside pure INT4 GEMM kernels, which selects group size according to the measured Tensor Core to CUDA Core throughput ratio to keep dequantization from dominating execution time.
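
A minimal sketch of what ρ-guided mode selection could look like is shown below. It is not the paper's implementation: `GpuProfile`, `select_granularity_mode`, and the constant `PURE_W4A4_MAX_RHO` are hypothetical names, and the throughput values in the usage lines are normalized placeholders rather than measured figures. The threshold of 16 comes from the summary above; how GPUs with ρ between 16 and 64 should be handled is not reported, so sending them to the mixed mode is an assumption.

```python
from dataclasses import dataclass

# Reported boundary: pure W4A4-g128 wins at rho <= 16 and loses at rho = 64.
# Anything strictly between those values is unreported; using "mixed" there
# is an assumption of this sketch.
PURE_W4A4_MAX_RHO = 16.0


@dataclass(frozen=True)
class GpuProfile:
    name: str
    tensor_core_throughput: float  # any unit, as long as both fields share it
    cuda_core_throughput: float

    @property
    def rho(self) -> float:
        """Tensor Core to CUDA Core throughput ratio."""
        return self.tensor_core_throughput / self.cuda_core_throughput


def select_granularity_mode(profile: GpuProfile) -> str:
    """Return 'g128' for low-rho GPUs and 'mixed' otherwise."""
    return "g128" if profile.rho <= PURE_W4A4_MAX_RHO else "mixed"


# Normalized placeholder throughputs chosen only to reproduce rho = 16 and 64.
low_rho = GpuProfile("low-rho example", 16.0, 1.0)
high_rho = GpuProfile("high-rho example", 64.0, 1.0)
print(low_rho.name, select_granularity_mode(low_rho))
print(high_rho.name, select_granularity_mode(high_rho))
```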

If this is right

On GPUs with ρ≤16, APEX4 can replace the existing kernels in unmodified vLLM and deliver 1.66–2.09× end-to-end speedup.
High-ρ platforms such as A100 require the mixed-granularity fallback to retain any speedup over FP16.
The same ρ-guided adaptation principle can be applied to other low-precision GEMM kernels that mix Tensor Core and CUDA Core work.
Accuracy results remain within 0.63 perplexity of FP16 when the adapted kernels are used end-to-end on LLaMA-2-70B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

GPU vendors could expose ρ as a queryable device property so that runtime systems automatically select the right granularity at load time.
The same intra-SM balancing technique may apply to future mixed-precision formats that also split work between Tensor Cores and CUDA Cores.
If newer architectures increase ρ further, pure low-bit inference may need dedicated dequantization hardware rather than software kernels.

Load-bearing premise

The dequantization overhead on CUDA cores is the main limiter of W4A4 performance and is controlled by the single hardware ratio ρ.

What would settle it

Run the W4A4-g128 kernel on a GPU with ρ=16 and measure whether it underperforms a mixed-precision baseline in a compute-bound matrix multiply; if it does, ρ alone does not predict W4A4 viability.
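
The comparison described above reduces to timing two kernels and checking which side of break-even the ratio lands on. The Python harness below is a generic sketch under stated assumptions: `median_seconds` and `verdict` are hypothetical helpers, and the callables passed in stand for whatever launches the baseline and the W4A4-g128 kernel. On a real GPU each callable would also need to wait for the device to finish before returning, which is not shown here.

```python
import statistics
import time


def median_seconds(fn, repeats=5):
    """Median wall-clock time of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def verdict(baseline_seconds, candidate_seconds):
    """Return (speedup, label); speedup > 1 means the candidate is faster."""
    speedup = baseline_seconds / candidate_seconds
    label = "outperforms" if speedup > 1.0 else "underperforms or ties"
    return speedup, label


# Usage with stand-in workloads; replace these with real kernel launches.
baseline = median_seconds(lambda: sum(range(20000)))
candidate = median_seconds(lambda: sum(range(10000)))
print(verdict(baseline, candidate))
```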

Figures

Figures reproduced from arXiv: 2606.08761 by Christoph Meinel, Haojin Yang, Hong Guo, Jona Otholt, Nianhui Guo, Weixing Wang.

**Figure 1.** The proposed W4A4-g128 GEMM kernel speedup over FP16 (N=K=8192) across GPUs with varying ρ. Higher ρ consistently yields lower speedup; A100 (ρ=64) falls below break-even. The central insight of this paper is that the severity of this bottleneck is not constant, but is largely governed by the intra-SM balance between Tensor Cores and CUDA Cores throughput, which we capture as ρ = T_TC/T_CC. This ratio varie…

**Figure 2.** Kernel-internal time ratio of W4A4 channel…

**Figure 3.** Effect of Hadamard-based activation smoothing.

**Figure 4.** Overview of transformer architecture with W4A4 quantization deployment. The linear layers…

**Figure 5.** W4A4 channel and group quantization principal overview. Subfigures (a) and (b) respectively…

**Figure 6.** Four-stage asynchronous pipeline timing diagram.

**Figure 7.** Activation matrix and weight matrix data preprocessing and bank conflict avoidance principle.

**Figure 8.** Thread layouts of activation matrix, weight matrix, and result matrix C on the minimal instruction-level tile, as well as the thread layouts when loading activation and weight matrix data using ldmatrix instructions, and the thread layouts when loading S1 and S2. Data preprocessing and memory management are key foundations for efficient kernel execution, involving the design of activation matrices, weigh…

**Figure 9.** Kernel speedup comparison across different precisions on four GPUs: A100, RTX 3090, A40, and…

**Figure 10.** Comparison of end-to-end speedup across different precisions on four GPUs: A100, RTX 3090,…

**Figure 11.** Average kernel time ratio of channel to group-128. Each data point is the mean ratio across six…

Original abstract

W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio (ρ) as the primary hardware indicator: the W4A4-g128 kernel yields 2.0–2.5× speedup on RTX 3090 (ρ=16) yet degrades to 0.43–0.47× on A100 (ρ=64) in compute-bound scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build APEX4, which co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to mitigate the CUDA Cores dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0%–4.4% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers up to 1.66× end-to-end speedup on L40S (ρ=8), 1.78× on RTX 3090 (ρ=16), and 2.09× on A40 (ρ=16), while recovering A100 (ρ=64) to 1.20–1.40× via the mixed-granularity mode. Our code is available at https://github.com/APEX4-W4A4/APEX4-W4A4.
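
To make the group dequantization overhead concrete, the sketch below computes one dot product with both operands quantized to INT4 in groups. It is a pure-Python illustration, not the paper's kernel: `quantize_group` and `w4a4_dot` are hypothetical helpers, symmetric scaling to the range [-8, 7] is an assumption, and so is quantizing activations with the same group size as weights. The integer multiply-accumulate inside each group stands for the Tensor Core work, and the floating-point rescale per group stands for the dequantization the abstract attributes to CUDA Cores.

```python
import random

INT4_MIN, INT4_MAX = -8, 7


def quantize_group(values):
    """Symmetric INT4 quantization of one group; returns (ints, scale)."""
    peak = max(abs(v) for v in values)
    scale = peak / INT4_MAX if peak > 0 else 1.0
    ints = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return ints, scale


def w4a4_dot(activations, weights, group_size):
    """Dot product with both operands quantized per group of `group_size`.

    Returns (approximate result, number of floating-point rescale steps).
    """
    assert len(activations) == len(weights)
    assert len(activations) % group_size == 0
    total, rescales = 0.0, 0
    for start in range(0, len(activations), group_size):
        qa, sa = quantize_group(activations[start:start + group_size])
        qw, sw = quantize_group(weights[start:start + group_size])
        int_acc = sum(a * w for a, w in zip(qa, qw))  # integer MACs
        total += int_acc * (sa * sw)  # per-group dequantization step
        rescales += 1
    return total, rescales


rng = random.Random(0)
K = 1024
x = [rng.uniform(-1.0, 1.0) for _ in range(K)]
w = [rng.uniform(-1.0, 1.0) for _ in range(K)]

exact = sum(a * b for a, b in zip(x, w))
grouped, n_group = w4a4_dot(x, w, group_size=128)  # g128: K/128 rescales
whole, n_whole = w4a4_dot(x, w, group_size=K)      # one scale over all of K
print(exact, grouped, whole)
print(n_group, n_whole)
```

With K=1024 the g128 variant performs eight rescale steps per output element, while a single scale across K needs one. Finer groups therefore shift more work onto the non-Tensor-Core path, which is the cost that grows in relative terms as ρ rises.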

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

APEX4 shows W4A4 inference speed is platform-dependent on ρ and offers a workable adaptation, but the accuracy numbers for the mixed-granularity mode on high-ρ GPUs are not broken out clearly enough.

The paper's core finding is that pure W4A4 kernels only deliver speedup when the GPU's Tensor Core to CUDA Core ratio ρ is low enough; on higher-ρ cards the dequantization overhead kills the gains unless granularity is adapted per platform. They build APEX4 around that observation and report it as a drop-in for vLLM.

What stands out is the set of controlled benchmarks across four GPUs that measure how ρ predicts the crossover point where W4A4-g128 stops helping. That is a useful concrete result rather than another generic quantization claim. The end-to-end numbers on LLaMA-2-70B (perplexity within 0.63 of FP16, a 4.0–4.4% zero-shot lift over Atom-g128) and the speedups (up to 2.09× on A40) are the kind of data people running inference care about.

The soft spot is exactly the one flagged in the stress test. The abstract states that A100 recovery uses the mixed-granularity mode, yet the perplexity and accuracy figures are not split by mode. If the mixed mode changes group sizes or dequant patterns on the high-ρ platform, those accuracy numbers may only apply to the uniform g128 runs on the low-ρ cards. Without that disaggregation it is hard to know whether the 1.20–1.40× A100 claim comes with the same accuracy guarantee.

This is useful reading for anyone tuning LLM serving kernels on Ampere/Ada hardware. The measurements and the vLLM integration give it enough substance to go to referees, though the accuracy reporting will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper presents APEX4, a co-designed system for pure W4A4 LLM inference that uses ρ-aware granularity adaptation (where ρ is the Tensor Cores to CUDA Cores throughput ratio) in INT4 GEMM kernels. It reports a systematic study across four GPUs showing that fixed g128 kernels yield 2.0–2.5× speedup at ρ=16 but only 0.43–0.47× at ρ=64 in compute-bound cases; APEX4 mitigates this via mixed-granularity mode on high-ρ platforms. End-to-end claims include perplexity within 0.63 of FP16 on LLaMA-2-70B, 4.0–4.4% zero-shot accuracy gains over W4Ax Atom-g128, and speedups of 1.66–2.09× on low-ρ GPUs plus 1.20–1.40× recovery on A100 (ρ=64) as a drop-in replacement in unmodified vLLM.

Significance. If the accuracy claims hold under the mixed-granularity configurations required for high-ρ GPUs, the work supplies the first explicit hardware-indicator analysis (ρ) for W4A4 viability and a practical kernel-level solution that avoids mixed-precision fallbacks while delivering measurable end-to-end gains; the controlled multi-GPU benchmarks and vLLM integration are concrete strengths.

major comments (2)

[Abstract] Abstract: the reported perplexity (within 0.63 of FP16 on LLaMA-2-70B) and zero-shot accuracy gains (4.0–4.4% over W4Ax Atom-g128) are not disaggregated by granularity mode. Because the A100 (ρ=64) recovery explicitly relies on the mixed-granularity mode while the g128 mode is stated to degrade on that platform, it is impossible to verify whether the accuracy numbers apply to the configurations actually used for the 1.20–1.40× claim; this directly affects the platform-dependent viability argument.
[Abstract] Abstract and § on kernel design: the primary hardware indicator ρ is measured on target GPUs and used to motivate the mixed-granularity adaptation, yet no quantitative breakdown is given of how group-size variation or tile-level dequantization patterns in the mixed mode affect the reported accuracy metrics versus the uniform g128 baseline.

minor comments (2)

Notation for ρ and the four GPUs (A40, RTX 3090, L40S, A100) should be introduced with explicit throughput ratios in a single table for reproducibility.
[Abstract] The abstract states "pure INT4 GEMM kernels" but the mixed-granularity mode description implies per-tile or per-layer variation; a brief clarification of what remains strictly INT4 would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and kernel design sections. The comments correctly identify that accuracy metrics are not disaggregated by granularity mode and that the impact of mixed-granularity on accuracy is not quantified. We will revise the manuscript to address both points by adding the requested breakdowns and platform-specific accuracy tables.

Referee: [Abstract] Abstract: the reported perplexity (within 0.63 of FP16 on LLaMA-2-70B) and zero-shot accuracy gains (4.0–4.4% over W4Ax Atom-g128) are not disaggregated by granularity mode. Because the A100 (ρ=64) recovery explicitly relies on the mixed-granularity mode while the g128 mode is stated to degrade on that platform, it is impossible to verify whether the accuracy numbers apply to the configurations actually used for the 1.20–1.40× claim; this directly affects the platform-dependent viability argument.

Authors: We agree that the abstract presents aggregate accuracy figures without explicit per-mode disaggregation, which obscures whether the reported perplexity and zero-shot gains apply to the mixed-granularity configuration used on A100. The underlying experiments used the ρ-aware mode for high-ρ platforms and g128 for low-ρ platforms; the accuracy bounds hold under those configurations. We will revise the abstract to state the mode used per platform and add a new table in the evaluation section that reports perplexity and zero-shot accuracy separately for uniform g128 and mixed-granularity modes on each GPU.
revision: yes
Referee: [Abstract] Abstract and § on kernel design: the primary hardware indicator ρ is measured on target GPUs and used to motivate the mixed-granularity adaptation, yet no quantitative breakdown is given of how group-size variation or tile-level dequantization patterns in the mixed mode affect the reported accuracy metrics versus the uniform g128 baseline.

Authors: The manuscript indeed lacks a quantitative comparison of accuracy under group-size variation and tile-level dequantization in mixed mode versus the g128 baseline. This omission weakens the claim that mixed granularity preserves accuracy. We will add a dedicated subsection (or appendix) with controlled experiments on LLaMA-2-7B/13B/70B that measure perplexity and zero-shot accuracy deltas when switching from uniform g128 to the mixed-granularity patterns employed on high-ρ GPUs. If any degradation appears, it will be reported explicitly.
revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results from direct GPU benchmarks and end-to-end measurements

The paper's central claims rest on controlled benchmarks across four GPUs to identify ρ as the key hardware indicator and on direct perplexity/accuracy/speedup measurements in unmodified vLLM. No equations reduce predictions to fitted inputs by construction, no self-citations are load-bearing for the viability or performance results, and the design choices are guided by measured platform differences rather than ansatzes or uniqueness theorems imported from prior author work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The work is empirical systems research; it introduces no new mathematical axioms or invented physical entities. The only free parameters are the granularity choices (g128 and mixed modes) that are adapted per GPU based on measured ρ.

free parameters (1)

granularity mode (g128 vs mixed)
Chosen per GPU architecture to balance Tensor Core and CUDA Core work; the choice is guided by measured ρ but still requires per-platform tuning.

pith-pipeline@v0.9.1-grok · 5871 in / 1409 out tokens · 16459 ms · 2026-06-27T17:46:26.206576+00:00 · methodology
