Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.
Weight gradient FP4 quantization drives LLM pretraining divergence, which deterministic Hadamard rotations can stabilize on native MXFP4 hardware.
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
MixFP4 extends NVFP4 by adaptively selecting between two FP4 micro-formats per block using repurposed scale sign bits and a unified E2M2 compute path, claiming better accuracy than standard NVFP4 at 3.1% area and 1.5% power overhead.
Attention-based architectures like Swin Transformer show greater robustness to FP4 QAT recipe choice than CNNs across model scales in anomaly segmentation, with architecture having the largest impact.
Timestep- and expert-aware W4A4 quantization framework for Wan2.2-I2V MoE DiT using SVDQuant-GPTQ achieves 59.3% peak GPU memory reduction with 0.9% VBench and 2.3% Imaging Quality drops.
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
The block-size paradox in LLM microscaling is caused by underflow in subnormal E4M3 scaling factors; preventing underflow and using 4-over-6 selection resolves it, with brute-force confirming MSE strictly improves as blocks get finer.
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.
citing papers explorer
-
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.