Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Pith reviewed 2026-05-17 00:42 UTC · model grok-4.3
The pith
Vec-LUT replaces scalar lookups with one vector lookup per index to speed parallel inference of ultra-low-bit LLMs on edge CPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vec-LUT constructs a unified LUT across parallel tokens and performs a single 1 → N lookup per index, realized through Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup techniques.
What carries the argument
vector LUT: a unified table across parallel tokens that returns results for N tokens from one index access
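To make the 1 → N idea concrete, here is a minimal pure-Python sketch that contrasts the two paradigms on ternary weights. It is an illustration only, not the paper's kernel: the group size `G`, the `encode` helper, and both function names are hypothetical. The real implementation relies on SIMD and a dedicated tensor layout, which this sketch does not model.

```python
from itertools import product

G = 2  # hypothetical group size: ternary weights packed into one LUT index
PATTERNS = list(product((-1, 0, 1), repeat=G))  # all 3**G weight patterns


def encode(weights):
    """Pack each group of G ternary weights into one LUT index."""
    return [PATTERNS.index(tuple(weights[i:i + G]))
            for i in range(0, len(weights), G)]


def scalar_lut_matmul(w_idx_rows, acts_per_token):
    """Scalar paradigm: every token builds its own tables and does its own lookups."""
    out = []
    for acts in acts_per_token:  # one full pass per token
        tables = [[sum(p * a for p, a in zip(pat, acts[k:k + G]))
                   for pat in PATTERNS]
                  for k in range(0, len(acts), G)]
        out.append([sum(tables[k][idx] for k, idx in enumerate(row))
                    for row in w_idx_rows])
    return out  # indexed as out[token][row]


def vector_lut_matmul(w_idx_rows, acts_per_token):
    """Vector paradigm: one unified table; each entry holds N values, one per token."""
    n = len(acts_per_token)
    k_groups = len(acts_per_token[0]) // G  # activation length must be a multiple of G
    tables = [[[sum(p * acts[k * G + j] for j, p in enumerate(pat))
                for acts in acts_per_token]
               for pat in PATTERNS]
              for k in range(k_groups)]
    out = []
    for row in w_idx_rows:
        acc = [0] * n
        for k, idx in enumerate(row):
            entry = tables[k][idx]  # a single lookup returns N contiguous values
            acc = [a + e for a, e in zip(acc, entry)]
        out.append(acc)
    return out  # indexed as out[row][token]
```

Both functions compute the same products. The difference is that the vector version touches each weight index once and reads a contiguous run of N values, whereas the scalar version revisits every index once per token.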
If this is right
- Prefilling and test-time scaling become memory-bandwidth efficient on general-purpose CPUs.
- Up to 4.2× speedup over state-of-the-art baselines holds across five edge devices and three LLMs.
- Direct integration into llama.cpp makes the gains available in existing open-source deployments.
- Ultra-low-bit models remain competitive without relying on NPUs for parallel workloads.
Where Pith is reading between the lines
- The same unified-lookup idea may reduce bandwidth waste in other parallel, memory-bound kernels beyond LLMs.
- If cache behavior remains favorable at larger scales, CPU-based edge inference could narrow the gap with specialized accelerators.
- Combining Vec-LUT with operator fusion or different memory hierarchies is a natural next experiment.
Load-bearing premise
The vector LUT and streamed lookup incur no hidden cache thrashing or synchronization cost that would erase gains once token parallelism exceeds the tested regimes.
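One way to see why this premise matters is a back-of-the-envelope estimate of how the table's footprint grows with N. The sketch below assumes ternary weights (three levels, as implied by 1.58-bit quantization); the group size, the bytes per table value, and the cache size used in the usage note are hypothetical values chosen for illustration, not figures from the paper.

```python
def vector_lut_bytes(n_tokens, group_size=4, levels=3, bytes_per_value=2):
    """Footprint of one vector LUT: one N-wide entry per possible weight pattern.

    group_size and bytes_per_value are illustrative assumptions.
    """
    entries = levels ** group_size
    return entries * n_tokens * bytes_per_value


def max_tokens_fitting(cache_bytes, **lut_params):
    """Largest N for which a single vector LUT still fits in cache_bytes."""
    per_token = vector_lut_bytes(1, **lut_params)
    return cache_bytes // per_token
```

Under these assumptions, `vector_lut_bytes(256)` is 41472 bytes, and `max_tokens_fitting(32 * 1024)` is 202. A single table would therefore outgrow a 32 KiB cache a little past 200 parallel tokens, which is the kind of regime where streaming or tiling the lookup would have to carry the argument.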
What would settle it
Measure speed at token counts much higher than those tested; if speedups vanish because of extra cache misses or synchronization overhead, the central claim fails.
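Such a measurement could be organized with a small harness like the following. This is a hedged sketch: `kernel` stands for any callable that runs one inference pass over `n` parallel tokens, and both function names are hypothetical. Wiring it to llama.cpp or to the released code is left out because it depends on interfaces not described here.

```python
import time


def sweep_token_counts(kernel, token_counts, repeats=5):
    """Time kernel(n) for each n and return the best seconds-per-token observed."""
    results = {}
    for n in token_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(n)
            best = min(best, time.perf_counter() - start)
        results[n] = best / n
    return results


def speedup_curve(baseline, candidate):
    """Per-N speedup of candidate over baseline (values above 1.0 favor candidate)."""
    return {n: baseline[n] / candidate[n] for n in baseline if n in candidate}
```

If the curve returned by `speedup_curve` drops toward 1.0 as N grows past the tested range, that would be the failure signal described above.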
Original abstract
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/OpenBitSys/vlut.cpp.
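The bandwidth argument in the abstract can be illustrated by counting table accesses for a single LUT-based matrix multiplication. The sketch below is a simplified accounting model, not a measurement: the function name, the parameter names, and the two-byte table value are assumptions made for illustration.

```python
def lookup_counts(n_tokens, out_rows, k_groups, bytes_per_value=2):
    """Count table lookups for one LUT-based matmul under each paradigm."""
    scalar = {
        # One lookup per token for every (row, group) weight index.
        "lookups": n_tokens * out_rows * k_groups,
        # Each access fetches a single value.
        "bytes_per_lookup": bytes_per_value,
    }
    vector = {
        # One 1 -> N lookup per (row, group) weight index.
        "lookups": out_rows * k_groups,
        # Each access fetches N contiguous values.
        "bytes_per_lookup": n_tokens * bytes_per_value,
    }
    return scalar, vector
```

In this model both paradigms move the same number of useful bytes. The vector form issues N times fewer lookups and reads them as contiguous runs, which is the property the abstract credits for better bandwidth utilization.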
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vec-LUT, a vector-based lookup table paradigm for parallel inference of ultra-low-bit (1.58-bit) LLMs on edge CPUs. It diagnoses scalar LUT's repetitive non-contiguous accesses as the cause of bandwidth underutilization in multi-token settings such as prefilling, introduces a unified 1→N lookup realized via Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup, and reports up to 4.2× speedup over state-of-the-art baselines across 5 edge devices and 3 LLMs, with the implementation integrated into llama.cpp and code released.
Significance. If the measured speedups prove robust, the work would meaningfully advance practical on-device LLM deployment by improving memory-bandwidth utilization in parallel token regimes without requiring specialized hardware. The public release of code and integration into a widely used inference engine are concrete strengths that support reproducibility and adoption.
major comments (2)
- [§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.
- [§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.
minor comments (2)
- [Abstract] Abstract: the measurement methodology (devices, models, batch sizes) should be summarized in one sentence so readers can immediately gauge the scope of the 4.2× claim.
- [§4] Figure captions and §4 tables: axis labels and legend entries should explicitly state the token parallelism level (N) for each bar so the scaling behavior is immediately visible.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to enhance clarity and provide supporting analysis.
- Referee: [§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.
  Authors: We agree that explicit experimental parameters are necessary for reproducibility. In the revised version, we have updated the abstract and Section 4 to state the batch sizes (ranging from 1 to 32), sequence lengths (up to 2048), and parallel token counts N for each reported result. We also include run-to-run variance as standard deviation over five repeated executions for both Vec-LUT and all baselines. These additions confirm that the 4.2× figure was obtained under consistent conditions representative of prefilling workloads.
  Revision: yes
- Referee: [§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.
  Authors: We acknowledge the request for stronger scalability evidence. Our current evaluations already cover multiple values of N on five edge devices, and Section 3.2 explains how the 1→N vector lookup eliminates repetitive non-contiguous accesses. In the revision we have added an analytical model in §3.2 that bounds cache-line contention and L2/L3 thrashing as a function of N, demonstrating an O(1/N) reduction relative to scalar LUT. We have also extended the discussion in §4.3 with projected bandwidth utilization for larger N based on this model. New hardware measurements beyond the evaluated range are not included, as they fall outside the scope of the practical edge-device scenarios targeted by the work.
  Revision: partial
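The variance reporting mentioned in the first response (standard deviation over five repeated executions) could be computed along these lines. This is a sketch with hypothetical function names. Pairing baseline and candidate runs one-to-one is an assumption made here for simplicity, not something the rebuttal specifies.

```python
from statistics import mean, stdev


def summarize(times):
    """Mean and sample standard deviation of repeated timings, in seconds."""
    return mean(times), stdev(times)


def speedup_with_spread(baseline_times, candidate_times):
    """Mean and spread of per-run speedups, pairing run i with run i (an assumption)."""
    ratios = [b / c for b, c in zip(baseline_times, candidate_times)]
    return mean(ratios), stdev(ratios)
```

Reporting the spread next to each headline ratio lets a reader judge whether a difference between two configurations exceeds run-to-run noise.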
Circularity Check
No circularity: empirical performance claims rest on direct runtime measurements with no fitted parameters or self-referential derivations.
The paper presents a new vector LUT paradigm for parallel ultra-low-bit LLM inference, supported by two implementation techniques (Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup) and validated through direct evaluations on 5 edge devices across 3 LLMs. No equations, first-principles derivations, or parameter-fitting steps appear in the provided text; the central speedup claim (up to 4.2×) is reported as an observed outcome of the implementation rather than a quantity that reduces to its own inputs by construction. Self-citations are absent from the load-bearing claims, and the work is self-contained against external benchmarks via open-source integration into llama.cpp.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Memory bandwidth is the primary limiter for parallel LUT-based inference on edge CPUs
invented entities (1)
- Vector LUT: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: vector LUT ... constructs a unified LUT across parallel tokens, and performs a single 1→N lookup per index ... Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to 4.2×
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.