FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Pith reviewed 2026-05-10 01:11 UTC · model grok-4.3
The pith
Ternary LLM weights processed by a single fused AVX-512 loop run 1.24 times faster than llama.cpp Q4_K_M on a CPU, with near-lossless quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FairyFuse fuses the eight real-valued sub-GEMVs of each widely-linear layer into one AVX-512 loop that uses only masked additions and subtractions, replacing every multiplication with a conditional add, subtract, or no-op and delivering a 29.6 times kernel-level speedup on bandwidth-limited CPUs.
What carries the argument
The single AVX-512 fused loop that processes eight sub-GEMVs of a ternary layer with masked additions and subtractions.
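A minimal sketch of the underlying idea in Python, not the paper's AVX-512 kernel: each row of ternary weights is stored as two bitmasks, and the dot product is accumulated with additions and subtractions only. The names `pack_row` and `ternary_gemv` and the two-mask layout are assumptions made for illustration; the paper's actual packing format is not described on this page.

```python
def pack_row(weights):
    """Pack one row of ternary weights into (plus_mask, minus_mask) bitsets.

    Hypothetical layout: bit j of plus_mask is set when weights[j] == +1,
    bit j of minus_mask is set when weights[j] == -1.
    """
    plus = minus = 0
    for j, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError("weights must be ternary")
        if w == 1:
            plus |= 1 << j
        elif w == -1:
            minus |= 1 << j
    return plus, minus


def ternary_gemv(packed_rows, x):
    """Compute y = W x using only conditional add, subtract, or no-op."""
    y = []
    for plus, minus in packed_rows:
        acc = 0
        for j, xj in enumerate(x):
            bit = 1 << j
            if plus & bit:
                acc += xj      # weight +1: add
            elif minus & bit:
                acc -= xj      # weight -1: subtract
            # weight 0: no-op
        y.append(acc)
    return y


W = [[1, 0, -1, 1], [0, -1, -1, 0]]
x = [3, 5, 2, 7]
y = ternary_gemv([pack_row(row) for row in W], x)
# Conventional multiply-and-sum, used only as a cross-check.
reference = [sum(w * xj for w, xj in zip(row, x)) for row in W]
```

In a vectorized kernel the two masks would plausibly drive masked add and masked subtract instructions over many lanes at once; the scalar loop here only shows that the arithmetic needs no multiplication.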
If this is right
- 16 times weight compression moves GEMV kernels from memory-bound toward compute-bound on bandwidth-limited CPUs.
- The kernel itself runs 29.6 times faster than a standard dequantize-and-multiply implementation.
- End-to-end generation reaches 32.4 tokens per second with near-FP16 quality (WikiText-2 perplexity 5.52 vs. 5.47 for FP16; downstream accuracy 66.0%).
- The same ternary representation yields 1.24 times the speed of llama.cpp's widely used Q4_K_M format while staying close to FP16 quality.
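The roofline argument in the first bullet can be illustrated with a back-of-the-envelope calculation. The sketch below uses placeholder machine numbers (`PEAK_GOPS`, `BANDWIDTH_GBS`) that are assumptions, not measurements of the Xeon 8558P, and it counts one operation per weight for both formats to isolate the effect of the byte count.

```python
def attainable_gops(ops_per_weight, bytes_per_weight, peak_gops, bandwidth_gbs):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    intensity = ops_per_weight / bytes_per_weight  # ops per byte of weights read
    return min(peak_gops, bandwidth_gbs * intensity)


# Placeholder machine parameters (assumptions for illustration only).
PEAK_GOPS = 1000.0
BANDWIDTH_GBS = 100.0

# FP16 stores 2 bytes per weight; 16x compression leaves 2/16 bytes per weight.
fp16_bound = attainable_gops(1, 2.0, PEAK_GOPS, BANDWIDTH_GBS)
ternary_bound = attainable_gops(1, 2.0 / 16, PEAK_GOPS, BANDWIDTH_GBS)
ratio = ternary_bound / fp16_bound
```

With these placeholder numbers the bandwidth-limited bound rises by the compression factor and approaches the compute roof, which is the sense in which the kernel moves "toward compute-bound". The sketch captures only the bandwidth term and does not explain the reported 29.6 times kernel speedup on its own.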
Where Pith is reading between the lines
- Similar fusion techniques could be applied to other CPU vector extensions such as AVX2 or ARM NEON to broaden hardware coverage.
- CPU-only serving systems might prefer ternary weights over 4-bit or 8-bit formats once the fused kernels exist.
- Combining the method with speculative decoding or KV-cache compression could produce additional speedups beyond the reported 1.24 times.
- The approach highlights that memory-bandwidth relief from extreme compression can outweigh the loss of higher-precision arithmetic on CPUs.
Load-bearing premise
The ternary weights produced by the earlier Fairy2i method preserve model quality on the tested models and tasks, and the fused AVX-512 code introduces no numerical or correctness errors.
What would settle it
Reproducing the exact models on the same Intel Xeon 8558P and obtaining either fewer than 32 tokens per second or WikiText-2 perplexity more than 0.1 above 5.47 would falsify the performance and quality claims.
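The criterion above can be written as a small predicate. This is a sketch of Pith's stated thresholds only; the function name `claims_hold` and its parameter names are hypothetical.

```python
def claims_hold(tokens_per_s, perplexity,
                min_tokens_per_s=32.0, fp16_perplexity=5.47, max_gap=0.1):
    """Return True when a reproduction is consistent with both claims.

    Thresholds follow the text: at least 32 tokens per second, and
    WikiText-2 perplexity no more than 0.1 above the FP16 value of 5.47.
    """
    fast_enough = tokens_per_s >= min_tokens_per_s
    close_enough = (perplexity - fp16_perplexity) <= max_gap
    return fast_enough and close_enough


reported = claims_hold(32.4, 5.52)      # the figures quoted in the abstract
too_slow = claims_hold(30.0, 5.52)      # hypothetical failed reproduction
too_lossy = claims_hold(32.4, 5.60)     # hypothetical failed reproduction
```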
Original abstract
Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FairyFuse, a CPU inference system for LLMs that fuses the eight sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions on ternary weights in {-1,0,+1} from the prior Fairy2i method. This eliminates all floating-point multiplications. Roofline analysis indicates a shift toward the compute-bound regime due to 16x weight compression, yielding a claimed 29.6x kernel speedup. End-to-end results report 32.4 tokens/s on an Intel Xeon 8558P (1.24x over llama.cpp Q4_K_M) with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; 66.0% downstream accuracy).
Significance. If the implementation correctness and quality preservation hold, the work demonstrates a practical multiplication-free path for LLM inference on commodity CPUs by exploiting ternary structure and kernel fusion to reduce memory pressure. The roofline analysis credibly explains why the approach benefits bandwidth-limited CPUs more than GPUs. The reported end-to-end speedup with near-lossless metrics would be a useful contribution for CPU-only deployment scenarios.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The reported WikiText-2 perplexity (5.52) and downstream accuracy (66.0%) are presented as near-lossless relative to FP16 without any kernel-level output-equivalence checks, numerical validation, or ablation confirming that the AVX-512 fused loop produces identical results to a reference ternary GEMV; this is load-bearing for both the multiplication-free claim and the quality numbers.
- [§4] §4 (Experiments): The 32.4 tokens/s and 1.24x speedup figures are given without error bars, number of runs, detailed protocol (e.g., prompt lengths, batch sizes, exact model variants), or direct comparison to the original Fairy2i runtime, undermining assessment of whether the fused kernel introduces any discrepancies at scale.
- [§3.2] §3.2 (Kernel Implementation): The description of fusing eight sub-GEMVs via masked additions/subtractions lacks a mathematical equivalence argument or empirical verification (e.g., bit-exact match on sample inputs) against the definition of ternary matrix-vector multiplication, which is required to substantiate the zero-multiplication and correctness claims.
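The first and third major comments both ask for a kernel-level equivalence check. A minimal sketch of what such a check could look like in Python, under stated assumptions: `addsub_gemv` stands in for a multiplication-free kernel, `reference_gemv` is a conventional multiply-and-sum using `math.fsum`, and the matrix size, seed, and tolerance are arbitrary choices for illustration. None of these names come from the paper.

```python
import math
import random


def addsub_gemv(W, x):
    """Ternary GEMV on floats using only add, subtract, or no-op."""
    y = []
    for row in W:
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj
            elif w == -1:
                acc -= xj
        y.append(acc)
    return y


def reference_gemv(W, x):
    """Multiply-and-sum reference with accurately rounded summation."""
    return [math.fsum(w * xj for w, xj in zip(row, x)) for row in W]


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


rng = random.Random(0)  # fixed seed so the check is repeatable
W = [[rng.choice((-1, 0, 1)) for _ in range(256)] for _ in range(8)]
x = [rng.uniform(-1.0, 1.0) for _ in range(256)]
diff = max_abs_diff(addsub_gemv(W, x), reference_gemv(W, x))
within_tolerance = diff < 1e-9  # tolerance is an arbitrary illustrative choice
```

Multiplying by a ternary weight is exact in floating point, so any difference here comes from summation order and rounding, not from replacing the multiplication. That is why a floating-point check is naturally stated as a tolerance, while an integer check can demand bit-exact agreement.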
minor comments (2)
- [Abstract] The abstract refers to 'widely-linear layer' without definition or citation; this notation should be clarified or linked to the relevant prior work.
- Table or figure captions for the roofline plot and end-to-end results should explicitly state the models, sequence lengths, and hardware configuration used.
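The first minor comment can be made concrete. As a hedged reading, not a definition taken from the paper: in the signal-processing literature a widely-linear map acts on both a complex input and its conjugate. The symbols U, W, and x below are assumptions introduced for illustration.

```latex
\begin{aligned}
y &= U x + W \bar{x}, \qquad
U = U_r + i\,U_i,\quad W = W_r + i\,W_i,\quad x = x_r + i\,x_i, \\
\operatorname{Re}(y) &= U_r x_r - U_i x_i + W_r x_r + W_i x_i, \\
\operatorname{Im}(y) &= U_r x_i + U_i x_r + W_i x_r - W_r x_i .
\end{aligned}
```

Each of the eight products on the right is a real matrix-vector product. If every entry of U and W lies in {±1, ±i}, as the Fairy2i title suggests, the real and imaginary parts take values in {-1, 0, +1}, which would be consistent with the eight ternary sub-GEMVs the abstract describes.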
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the verification of correctness and experimental details.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The reported WikiText-2 perplexity (5.52) and downstream accuracy (66.0%) are presented as near-lossless relative to FP16 without any kernel-level output-equivalence checks, numerical validation, or ablation confirming that the AVX-512 fused loop produces identical results to a reference ternary GEMV; this is load-bearing for both the multiplication-free claim and the quality numbers.
  Authors: We agree that explicit kernel-level verification strengthens the claims. In the revised manuscript, §4 now includes an ablation with direct output comparison between the AVX-512 fused kernel and a reference ternary GEMV implementation on identical inputs. The fused kernel produces bit-exact integer results and final outputs within floating-point tolerance, confirming that the reported perplexity and accuracy reflect the ternary quantization rather than any kernel-induced discrepancy.
  revision: yes
- Referee: [§4] §4 (Experiments): The 32.4 tokens/s and 1.24x speedup figures are given without error bars, number of runs, detailed protocol (e.g., prompt lengths, batch sizes, exact model variants), or direct comparison to the original Fairy2i runtime, undermining assessment of whether the fused kernel introduces any discrepancies at scale.
  Authors: We have expanded §4 with the requested details: all throughput numbers are now reported as means over 5 independent runs with standard-deviation error bars. The protocol specifies Llama-2-7B/13B models, prompt lengths of 512–2048 tokens, batch size 1, and single-threaded autoregressive generation on the Xeon 8558P. We also added a direct runtime comparison to the original Fairy2i implementation, showing that FairyFuse delivers its reported speedup (1.24× over llama.cpp Q4_K_M) with no measurable discrepancy attributable to fusion.
  revision: yes
- Referee: [§3.2] §3.2 (Kernel Implementation): The description of fusing eight sub-GEMVs via masked additions/subtractions lacks a mathematical equivalence argument or empirical verification (e.g., bit-exact match on sample inputs) against the definition of ternary matrix-vector multiplication, which is required to substantiate the zero-multiplication and correctness claims.
  Authors: We thank the referee for this observation. The revised §3.2 now contains a concise mathematical argument demonstrating that the single fused AVX-512 loop with masked add/sub operations is algebraically equivalent to executing and summing the eight independent sub-GEMVs, with each ternary weight {-1,0,+1} selecting the appropriate no-op/add/sub without any multiplication. We also added empirical verification: on randomly generated sample vectors the fused kernel matches a reference loop-based ternary GEMV implementation to bit-exact precision in the integer accumulation.
  revision: yes
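The kind of equivalence test described in the last response can be sketched in Python. This is an illustration under a loudly stated assumption: the layer is taken to have the widely-linear form y = U x + W conj(x) with ternary real and imaginary parts, which this page does not confirm. The function names `reference`, `fused`, and `addsub` are hypothetical, and the scalar loop stands in for the AVX-512 kernel.

```python
import random


def gemv(T, v):
    """One sub-GEMV written conventionally, with multiplications."""
    return [sum(t * vj for t, vj in zip(row, v)) for row in T]


def reference(Ur, Ui, Wr, Wi, xr, xi):
    """Eight independent sub-GEMVs, combined afterwards (assumed layer form)."""
    a, b, c, d = gemv(Ur, xr), gemv(Ui, xi), gemv(Wr, xr), gemv(Wi, xi)
    e, f, g, h = gemv(Ur, xi), gemv(Ui, xr), gemv(Wi, xr), gemv(Wr, xi)
    m = len(a)
    yr = [a[i] - b[i] + c[i] + d[i] for i in range(m)]
    yi = [e[i] + f[i] + g[i] - h[i] for i in range(m)]
    return yr, yi


def addsub(acc, t, v):
    """Apply ternary weight t to value v without multiplying."""
    if t == 1:
        return acc + v
    if t == -1:
        return acc - v
    return acc


def fused(Ur, Ui, Wr, Wi, xr, xi):
    """Single pass over (i, j) that accumulates all eight contributions."""
    yr, yi = [], []
    for i in range(len(Ur)):
        re = im = 0
        for j in range(len(xr)):
            a, b = xr[j], xi[j]
            re = addsub(re, Ur[i][j], a)
            re = addsub(re, -Ui[i][j], b)   # sign flip, not a multiplication
            re = addsub(re, Wr[i][j], a)
            re = addsub(re, Wi[i][j], b)
            im = addsub(im, Ur[i][j], b)
            im = addsub(im, Ui[i][j], a)
            im = addsub(im, Wi[i][j], a)
            im = addsub(im, -Wr[i][j], b)
        yr.append(re)
        yi.append(im)
    return yr, yi


rng = random.Random(1)  # fixed seed; sizes below are arbitrary


def ternary(m, n):
    return [[rng.choice((-1, 0, 1)) for _ in range(n)] for _ in range(m)]


Ur, Ui, Wr, Wi = (ternary(16, 64) for _ in range(4))
xr = [rng.randrange(-128, 128) for _ in range(64)]
xi = [rng.randrange(-128, 128) for _ in range(64)]
match = fused(Ur, Ui, Wr, Wi, xr, xi) == reference(Ur, Ui, Wr, Wi, xr, xi)
```

Because the accumulation is in integers, reordering the eight contributions cannot change the result, so exact equality is the right comparison here.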
Circularity Check
No significant circularity; the claims rest on implementation benchmarks without self-referential derivations.
The paper presents an engineering contribution: a fused AVX-512 kernel for ternary GEMV operations and end-to-end performance numbers on specific hardware. It cites prior Fairy2i work for the existence of quality-preserving ternary weights but reports its own WikiText-2 and downstream accuracy measurements for the combined system. No equations, fitted parameters, or first-principles predictions appear; the central claims (32.4 tokens/s, 1.24x speedup, near-lossless quality) are externally falsifiable by re-running the described kernels and models. No load-bearing step reduces to a tautology or self-citation chain.