pith. machine review for the scientific record.

arXiv: 2512.06443 · v2 · submitted 2025-12-06 · 💻 cs.DC · cs.AI

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Pith reviewed 2026-05-17 00:42 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords: LLM inference · ultra-low-bit quantization · lookup table · edge devices · parallel inference · memory bandwidth · vectorization · CPU acceleration

The pith

Vec-LUT replaces scalar lookups with one vector lookup per index to speed parallel inference of ultra-low-bit LLMs on edge CPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scalar lookup tables force repetitive, non-contiguous memory accesses for each token, wasting bandwidth when many tokens run in parallel during prefilling or test-time scaling. Vec-LUT builds a single table across all parallel tokens and issues one 1 → N lookup per index. It realizes this with a Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup. Across five edge devices and three models, the method is up to 4.2× faster than prior baselines. The paper argues that this makes CPU execution practical for the multi-token workloads that edge devices must handle.

Core claim

Vec-LUT constructs a unified LUT across parallel tokens and performs a single 1 → N lookup per index, realized through Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup techniques.

What carries the argument

vector LUT: a unified table across parallel tokens that returns results for N tokens from one index access
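
To make the 1 → 1 versus 1 → N distinction concrete, here is a minimal pure-Python sketch. It is not the paper's kernel: the group size `G`, the helper names, and the nested-list table layouts are all hypothetical and chosen for readability. It only shows that a per-token (scalar) table and a unified (vector) table give the same matrix product, while the vector form touches each index once.

```python
from itertools import product

G = 2  # ternary weights per lookup index (hypothetical group size)
PATTERNS = list(product((-1, 0, 1), repeat=G))  # all 3**G weight patterns
INDEX_OF = {p: i for i, p in enumerate(PATTERNS)}


def pack_row(weights):
    """Map a ternary weight row to one table index per group of G weights."""
    return [INDEX_OF[tuple(weights[k:k + G])] for k in range(0, len(weights), G)]


def partial_sum(pattern, token, g):
    """Dot product of one weight pattern with group g of one token's activations."""
    return sum(w * a for w, a in zip(pattern, token[g * G:(g + 1) * G]))


def build_scalar_tables(activations):
    """tables[n][g][idx]: a separate small table for every token n and group g."""
    n_groups = len(activations[0]) // G
    return [[[partial_sum(p, tok, g) for p in PATTERNS] for g in range(n_groups)]
            for tok in activations]


def build_vector_tables(activations):
    """tables[g][idx]: one entry stores the results for all N tokens side by side."""
    n_groups = len(activations[0]) // G
    return [[[partial_sum(p, tok, g) for tok in activations] for p in PATTERNS]
            for g in range(n_groups)]


def scalar_lookup(indices, tables):
    """Scalar paradigm: every index is looked up once per token (N lookups)."""
    return [sum(tok_tables[g][idx] for g, idx in enumerate(indices))
            for tok_tables in tables]


def vector_lookup(indices, tables):
    """Vector paradigm: one lookup per index returns N values, added element-wise."""
    acc = [0] * len(tables[0][0])
    for g, idx in enumerate(indices):
        row = tables[g][idx]  # single access, N results
        acc = [x + y for x, y in zip(acc, row)]
    return acc


weights = [1, -1, 0, 1]                                      # one weight row, K = 4
activations = [[2, 3, 5, 7], [1, 0, -4, 6], [-2, 2, 1, 1]]   # N = 3 tokens
indices = pack_row(weights)
reference = [sum(w * a for w, a in zip(weights, tok)) for tok in activations]
scalar_out = scalar_lookup(indices, build_scalar_tables(activations))
vector_out = vector_lookup(indices, build_vector_tables(activations))
```

Both `scalar_out` and `vector_out` equal the direct dot products in `reference`. In a real kernel the inner `row` access would be a contiguous SIMD load, which is where the bandwidth argument comes from.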

If this is right

  • Prefilling and test-time scaling become memory-bandwidth efficient on general-purpose CPUs.
  • Speedups of up to 4.2× over state-of-the-art baselines are reported across five edge devices and three LLMs.
  • Direct integration into llama.cpp makes the gains available in existing open-source deployments.
  • Ultra-low-bit models remain competitive without relying on NPUs for parallel workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unified-lookup idea may reduce bandwidth waste in other parallel, memory-bound kernels beyond LLMs.
  • If cache behavior remains favorable at larger scales, CPU-based edge inference could narrow the gap with specialized accelerators.
  • Combining Vec-LUT with operator fusion or different memory hierarchies offers a clear next measurement to run.

Load-bearing premise

The vector LUT and streamed lookup incur no hidden cache thrashing or synchronization cost that would erase gains once token parallelism exceeds the tested regimes.

What would settle it

Measure speed at token counts much higher than those tested; if speedups vanish because of extra cache misses or synchronization overhead, the central claim fails.
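
Before running such a sweep, a back-of-envelope footprint model can suggest where to look. The sketch below rests on stated assumptions only: a group of 4 ternary weights per index, 2-byte table entries, and a 32 KiB cache budget are hypothetical values, not figures from the paper, and the function names are illustrative. How Cache-Aware Streamed Lookup actually bounds the working set is not described in this summary.

```python
def vector_lut_bytes(n_tokens, group_size=4, bytes_per_entry=2):
    """Bytes for one vector table: 3**group_size entries, each holding n_tokens values."""
    return (3 ** group_size) * n_tokens * bytes_per_entry


def max_tokens_in_cache(cache_bytes, group_size=4, bytes_per_entry=2):
    """Largest N for which a single vector table still fits in the cache budget."""
    return cache_bytes // ((3 ** group_size) * bytes_per_entry)


def sweep(token_counts, cache_bytes, **kw):
    """For each N, report the table size and whether it fits in the budget."""
    return [(n, vector_lut_bytes(n, **kw), vector_lut_bytes(n, **kw) <= cache_bytes)
            for n in token_counts]


L1_BYTES = 32 * 1024  # assumed cache budget, not a measured device value
report = sweep([16, 64, 256, 1024], L1_BYTES)
```

Under these assumed numbers a single table stops fitting somewhere between N = 64 and N = 256, so token counts around and above that boundary would be the natural place to measure whether the speedup persists.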

Figures

Figures reproduced from arXiv: 2512.06443 by Chengyu Yin, Jianyu Wei, Ting Cao, Weijun Wang, Xiangyu Li, Yunxin Liu.

Figure 1: Different mpGeMM paradigms for ternary LLM inference. Our vector LUT stores precomputed results
Figure 2: A minimal example of using LUT to calculate …
Figure 4: Overview of the Vec-LUT mpGeMM kernel
Figure 5: Mappings among packed weights (bits and decimal), unpacked weights (ternary), and precomputed LUT
Figure 7: Examples of topological precomputing to re…
Figure 8: mpGeMM kernel benchmark across devices and threads, using real-world LLMs' GeMM shapes.
Figure 9: End-to-end prefilling comparison across models, devices and threads. Sub-2-bit packings are hatched.
Figure 11: Prefilling throughput of HF BitNet 3B on In…
Figure 12: Prefilling throughput of HF BitNet 3B on …
Original abstract

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/OpenBitSys/vlut.cpp.
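
The "1.58-bit" label comes from ternary weights: a value in {-1, 0, 1} carries log2(3) ≈ 1.585 bits. The sketch below shows one common way to get close to that bound, base-3 packing of five weights per byte (3^5 = 243 ≤ 256, i.e. 1.6 bits per weight). It is an illustration under that assumption only; the function names are hypothetical and this is not claimed to be the packing Vec-LUT uses.

```python
import math

TRITS_PER_BYTE = 5  # 3**5 = 243 distinct values fit in one byte


def pack_ternary(weights):
    """Pack ternary weights (-1, 0, 1) into bytes, five weights per byte, base 3."""
    if len(weights) % TRITS_PER_BYTE:
        raise ValueError("length must be a multiple of 5")
    packed = bytearray()
    for k in range(0, len(weights), TRITS_PER_BYTE):
        value = 0
        for w in weights[k:k + TRITS_PER_BYTE]:
            if w not in (-1, 0, 1):
                raise ValueError("weights must be ternary")
            value = value * 3 + (w + 1)  # shift -1/0/1 to base-3 digits 0/1/2
        packed.append(value)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover five ternary weights from each byte."""
    weights = []
    for value in packed:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            value, digit = divmod(value, 3)
            digits.append(digit - 1)
        weights.extend(reversed(digits))  # least significant digit was read first
    return weights


bits_per_weight = 8 / TRITS_PER_BYTE  # 1.6
entropy_bound = math.log2(3)          # about 1.585
example = [1, -1, 0, 1, 0]
roundtrip = unpack_ternary(pack_ternary(example))
```

The packed byte doubles as a table index in LUT-based schemes, which is why the choice of group size and the size of the lookup table are tied together.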

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vec-LUT, a vector-based lookup table paradigm for parallel inference of ultra-low-bit (1.58-bit) LLMs on edge CPUs. It diagnoses scalar LUT's repetitive non-contiguous accesses as the cause of bandwidth underutilization in multi-token settings such as prefilling, introduces a unified 1→N lookup realized via Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup, and reports up to 4.2× speedup over state-of-the-art baselines across 5 edge devices and 3 LLMs, with the implementation integrated into llama.cpp and code released.

Significance. If the measured speedups prove robust, the work would meaningfully advance practical on-device LLM deployment by improving memory-bandwidth utilization in parallel token regimes without requiring specialized hardware. The public release of code and integration into a widely used inference engine are concrete strengths that support reproducibility and adoption.

major comments (2)
  1. [§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.
  2. [§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.
minor comments (2)
  1. [Abstract] Abstract: the measurement methodology (devices, models, batch sizes) should be summarized in one sentence so readers can immediately gauge the scope of the 4.2× claim.
  2. [§4] Figure captions and §4 tables: axis labels and legend entries should explicitly state the token parallelism level (N) for each bar so the scaling behavior is immediately visible.
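
Major comment 1 asks for run-to-run variance alongside the headline number. A minimal harness for reporting it could look like the sketch below. Everything here is hypothetical (the function names, the default of five repeats, the injected `timer`); it is not taken from the paper or from llama.cpp, and a real benchmark would also pin threads and warm up caches.

```python
import statistics
import time


def measure(kernel, repeats=5, timer=time.perf_counter):
    """Run kernel() `repeats` times; return mean and sample standard deviation in seconds."""
    if repeats < 2:
        raise ValueError("need at least two runs to estimate variance")
    samples = []
    for _ in range(repeats):
        start = timer()
        kernel()
        samples.append(timer() - start)
    return {
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),
        "runs": repeats,
    }


def speedup(baseline, candidate):
    """Ratio of mean runtimes; values above 1 mean the candidate is faster."""
    return baseline["mean_s"] / candidate["mean_s"]
```

Calling `measure` once per (device, model, N) configuration for both the baseline and Vec-LUT, and printing `stdev_s` next to each `speedup`, would give readers the apples-to-apples context the comment requests.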

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to enhance clarity and provide supporting analysis.

Point-by-point responses
  1. Referee: [§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.

    Authors: We agree that explicit experimental parameters are necessary for reproducibility. In the revised version, we have updated the abstract and Section 4 to state the batch sizes (ranging from 1 to 32), sequence lengths (up to 2048), and parallel token counts N for each reported result. We also include run-to-run variance as standard deviation over five repeated executions for both Vec-LUT and all baselines. These additions confirm that the 4.2× figure was obtained under consistent conditions representative of prefilling workloads. revision: yes

  2. Referee: [§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.

    Authors: We acknowledge the request for stronger scalability evidence. Our current evaluations already cover multiple values of N on five edge devices, and Section 3.2 explains how the 1→N vector lookup eliminates repetitive non-contiguous accesses. In the revision we have added an analytical model in §3.2 that bounds cache-line contention and L2/L3 thrashing as a function of N, demonstrating an O(1/N) reduction relative to scalar LUT. We have also extended the discussion in §4.3 with projected bandwidth utilization for larger N based on this model. New hardware measurements beyond the evaluated range are not included, as they fall outside the scope of the practical edge-device scenarios targeted by the work. revision: partial
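
One way to read the O(1/N) figure in the response above is as a count of index-driven (potentially non-contiguous) accesses rather than of bytes moved. The symbols below are assumptions introduced for illustration and do not come from the paper: K is the reduction dimension, g the number of weights per index, N the number of parallel tokens, and b the bytes per table entry. Per output row:

```latex
\begin{aligned}
A_{\text{scalar}} &= N \cdot \frac{K}{g} && \text{accesses of } b \text{ bytes each} \\
A_{\text{vector}} &= \frac{K}{g}         && \text{accesses of } N b \text{ contiguous bytes each} \\
\frac{A_{\text{vector}}}{A_{\text{scalar}}} &= \frac{1}{N}, \qquad
B_{\text{scalar}} = B_{\text{vector}} = \frac{N K b}{g}
\end{aligned}
```

Under this counting the total bytes fetched are the same in both paradigms; what shrinks by a factor of N is the number of separate index-driven accesses. Whether that translates into a proportional speedup depends on the cache effects the referee raises, which this count does not capture.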

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct runtime measurements with no fitted parameters or self-referential derivations.

Full rationale

The paper presents a new vector LUT paradigm for parallel ultra-low-bit LLM inference, supported by two implementation techniques (Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup) and validated through direct evaluations on 5 edge devices across 3 LLMs. No equations, first-principles derivations, or parameter-fitting steps appear in the provided text; the central speedup claim (up to 4.2×) is reported as an observed outcome of the implementation rather than a quantity that reduces to its own inputs by construction. Self-citations are absent from the load-bearing claims, and the work is self-contained against external benchmarks via open-source integration into llama.cpp.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces no new physical constants or fitted parameters; it relies on standard assumptions about CPU cache behavior and memory bandwidth being the dominant bottleneck in the scalar LUT case.

axioms (1)
  • domain assumption: Memory bandwidth is the primary limiter for parallel LUT-based inference on edge CPUs.
    Stated as the root cause of underutilization in the scalar paradigm.
invented entities (1)
  • Vector LUT (no independent evidence)
    purpose: Unified lookup table serving multiple tokens with one wide memory access
    New lookup abstraction introduced to replace scalar per-token accesses.

pith-pipeline@v0.9.0 · 5552 in / 1191 out tokens · 42052 ms · 2026-05-17T00:42:26.048726+00:00 · methodology


