pith. machine review for the scientific record.
Pith Number

pith:KCRU7TYZ

pith:2026:KCRU7TYZSHBOFDOHIYB74RMM5E
not attested · not anchored · not stored · refs resolved

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Chengqiu Hu, Fangzheng Miao, Jun Li, Junyi Fan, Lingchao Zheng, Qichen Liao, Rui Shi, Yuwei Fan

Decomposing BF16 activations into low-precision scales lets quantized weights multiply directly via native GEMM.

arxiv:2605.13915 v1 · 2026-05-13 · stat.ML · cs.AI · cs.LG

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

For INT8 weights (W8A16), two-pass INT8 decomposition achieves near 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields near 6.6 effective bits with an error bound of 1/64 per block, surpassing single-pass MXFP8 while maintaining the same effective GEMM compute time.

C2 weakest assumption

That the multi-scale activation decomposition can be implemented with native hardware-accelerated GEMM without introducing pipeline stalls or accuracy loss beyond the derived bounds, and that the closed-form latency models accurately predict real hardware behavior.

C3 one line summary

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.
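
The decomposition idea in the summary above can be illustrated with a minimal pure-Python sketch. It is not the paper's kernel: it assumes a simple per-vector absmax scale, uses Python floats in place of BF16, and uses a plain integer dot product in place of a native INT8 GEMM. The helper names (`quantize_int8`, `two_pass_decompose`, `int_dot`) are hypothetical and chosen only for illustration.

```python
def quantize_int8(values, scale):
    """Round each value to the nearest multiple of `scale`, clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def two_pass_decompose(x):
    """Split x into two INT8 parts so that x ~= s1*q1 + s2*q2 (hypothetical helper)."""
    # Pass 1: coarse INT8 quantization with an absmax scale (assumed scheme).
    s1 = (max(abs(v) for v in x) or 1.0) / 127.0
    q1 = quantize_int8(x, s1)
    # Pass 2: quantize what pass 1 left behind, using a much finer scale.
    residual = [v - s1 * q for v, q in zip(x, q1)]
    s2 = (max(abs(r) for r in residual) or 1.0) / 127.0
    q2 = quantize_int8(residual, s2)
    return s1, q1, s2, q2


def int_dot(q, w):
    """Integer dot product, standing in for a native INT8 GEMM."""
    return sum(a * b for a, b in zip(q, w))


x = [0.5, -1.25, 3.0, 0.001]   # activations (floats standing in for BF16)
w = [3, -7, 12, 100]           # already-quantized INT8 weights, never dequantized
s1, q1, s2, q2 = two_pass_decompose(x)

exact = sum(a * b for a, b in zip(x, w))
one_pass = s1 * int_dot(q1, w)              # single low-precision pass
two_pass = one_pass + s2 * int_dot(q2, w)   # add the residual pass
recon_err = max(abs(v - (s1 * a + s2 * b)) for v, a, b in zip(x, q1, q2))
```

In this sketch the weights stay in integer form throughout, and only the two scalar scales are applied after the integer products. The residual pass reduces the worst-case reconstruction error to about half of `s2`, which is roughly 254 times smaller than `s1`; this is the intuition behind two 8-bit passes approaching 16 effective bits.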

References

30 extracted · 30 resolved · 6 Pith anchors

[1] DeepSeek. FlashMLA: Efficient MLA for Large Language Models. Technical Report, 2024. https://github.com/deepseek-ai/FlashMLA
[2] Technical Blog, 2025. https://github.com/deepseek-ai/FlashMLA/blob/main/docs/20250929-hopper-fp8-sparse-deep-dive.md
[3] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers 2022 · arXiv:2210.17323
[4] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration 2023 · arXiv:2306.00978
[5] Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh 2024
Receipt and verification
First computed 2026-05-17T23:39:18.763747Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

50a34fcf1991c2e28dc74603fe458ce90fbb394a604a1a63b5a39ad28dfaed34

Aliases

arxiv: 2605.13915 · arxiv_version: 2605.13915v1 · doi: 10.48550/arxiv.2605.13915 · pith_short_12: KCRU7TYZSHBO · pith_short_16: KCRU7TYZSHBOFDOH · pith_short_8: KCRU7TYZ
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/KCRU7TYZSHBOFDOHIYB74RMM5E \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 50a34fcf1991c2e28dc74603fe458ce90fbb394a604a1a63b5a39ad28dfaed34
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ab7b7c8573619592f3a7b9ec9deb8268e5e2cf91535a29a1df75056ca742f810",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "stat.ML",
    "submitted_at": "2026-05-13T09:49:56Z",
    "title_canon_sha256": "07c5f209588171790c38ceed67b05b60fde9b7ca4dc7724170c0db4ee2f5a181"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13915",
    "kind": "arxiv",
    "version": 1
  }
}
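
The shell one-liner under "Verify this Pith Number yourself" can be unpacked into a short Python function for readers who prefer to run the check offline. This is a sketch that mirrors the serialization options used in that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding); the function name `canonical_sha256` is hypothetical, and the record literal is copied from the Canonical record JSON above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize a record deterministically and return its SHA-256 hex digest."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "ab7b7c8573619592f3a7b9ec9deb8268e5e2cf91535a29a1df75056ca742f810",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "stat.ML",
        "submitted_at": "2026-05-13T09:49:56Z",
        "title_canon_sha256": "07c5f209588171790c38ceed67b05b60fde9b7ca4dc7724170c0db4ee2f5a181",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13915", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye against the value listed under "Canonical hash"
```

Because the keys are sorted before hashing, two mirrors that store the same record with different key order should compute the same digest.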