Microscaling data formats for deep learning

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al · 2023 · arXiv 2310.10537

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SOAR improves NVFP4 post-training quantization accuracy for LLMs by analytically solving joint scale optimization and searching decoupled scales.

The Entropy of Floating-Point Numbers

cs.IT · 2026-05-12 · unverdicted · novelty 6.0

An analytic approximation for floating-point entropy is derived that links to a new quantity, with scale-invariance proven and closed forms given for common distributions.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

Pretraining large language models with MXFP4 on Native FP4 Hardware

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

Weight gradient FP4 quantization drives LLM pretraining divergence, which deterministic Hadamard rotations can stabilize on native MXFP4 hardware.

LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM

cs.AR · 2026-04-06 · conditional · novelty 6.0

LOCALUT delivers 1.82x geometric mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.

The Thermodynamic Costs of Simple Linear Regression

cond-mat.stat-mech · 2026-05-18 · unverdicted · novelty 5.0

Thermodynamic lower bounds are approximated for exact and SGD linear regression, producing energy-aware scaling laws for optimal training dataset size given a target generalization error.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

cs.LG · 2026-05-04 · accept · novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering up to 1.78x speedup over W8A8 baselines.

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

cs.LG · 2026-04-05 · unverdicted · novelty 5.0

DMA delivers a fused low-bit MXFP attention kernel with diagonal tiling that achieves significant speedup on B200 GPUs with negligible generation quality loss.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization

cs.AR · 2026-05-22 · unverdicted · novelty 4.0

MASQ claims up to 16.06x speedup and 4.18x energy gain over A100 for masked diffusion via stage-wise multi-precision quantization and specialized hardware units while preserving quality.

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

cs.LG · 2026-04-09 · unverdicted · novelty 4.0

HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.

citing papers explorer

Showing 17 of 17 citing papers.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
cs.LG · 2026-05-12 · accept · none · ref 31
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
cs.CV · 2026-05-18 · unverdicted · none · ref 56
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
stat.ML · 2026-05-13 · unverdicted · none · ref 30
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL · 2025-12-01 · conditional · none · ref 10
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
cs.LG · 2026-05-21 · unverdicted · none · ref 15
ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization
cs.LG · 2026-05-12 · unverdicted · none · ref 33
SOAR improves NVFP4 post-training quantization accuracy for LLMs by analytically solving joint scale optimization and searching decoupled scales.

The Entropy of Floating-Point Numbers
cs.IT · 2026-05-12 · unverdicted · none · ref 6
An analytic approximation for floating-point entropy is derived that links to a new quantity, with scale-invariance proven and closed forms given for common distributions.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG · 2026-05-11 · unverdicted · none · ref 67 · 2 links
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

Pretraining large language models with MXFP4 on Native FP4 Hardware
cs.LG · 2026-05-11 · unverdicted · none · ref 3 · 3 links
Weight gradient FP4 quantization drives LLM pretraining divergence, which deterministic Hadamard rotations can stabilize on native MXFP4 hardware.

LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
cs.AR · 2026-04-06 · conditional · none · ref 77
LOCALUT delivers 1.82x geometric mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.

The Thermodynamic Costs of Simple Linear Regression
cond-mat.stat-mech · 2026-05-18 · unverdicted · none · ref 46
Thermodynamic lower bounds are approximated for exact and SGD linear regression, producing energy-aware scaling laws for optimal training dataset size given a target generalization error.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG · 2026-05-04 · accept · none · ref 23
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
cs.LG · 2026-04-14 · unverdicted · none · ref 5
OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering up to 1.78x speedup over W8A8 baselines.

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
cs.LG · 2026-04-05 · unverdicted · none · ref 13
DMA delivers a fused low-bit MXFP attention kernel with diagonal tiling that achieves significant speedup on B200 GPUs with negligible generation quality loss.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
cs.AR · 2025-09-11 · unverdicted · none · ref 60
PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization
cs.AR · 2026-05-22 · unverdicted · none · ref 33
MASQ claims up to 16.06x speedup and 4.18x energy gain over A100 for masked diffusion via stage-wise multi-precision quantization and specialized hardware units while preserving quality.

HiFloat4 Format for Language Model Pre-training on Ascend NPUs
cs.LG · 2026-04-09 · unverdicted · none · ref 13
HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.
