arxiv: 2512.06443 · v2 · submitted 2025-12-06 · 💻 cs.DC · cs.AI

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu

Pith reviewed 2026-05-17 00:42 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords LLM inference · ultra-low-bit quantization · lookup table · edge devices · parallel inference · memory bandwidth · vectorization · CPU acceleration

The pith

Vec-LUT replaces scalar lookups with one vector lookup per index to speed parallel inference of ultra-low-bit LLMs on edge CPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scalar lookup tables force repetitive, non-contiguous memory accesses for each token, which wastes bandwidth when many tokens run in parallel during prefilling or test-time scaling. Vec-LUT instead builds a single table across all parallel tokens and issues one 1→N lookup per index. It realizes this with two techniques: a Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup. On five edge devices and three models, the method runs up to 4.2× faster than state-of-the-art baselines. The change makes CPU execution more practical for the multi-token workloads that edge devices must handle.

Core claim

Vec-LUT constructs a unified LUT across parallel tokens and performs a single 1 → N lookup per index, realized through Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup techniques.

What carries the argument

vector LUT: a unified table across parallel tokens that returns results for N tokens from one index access
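The contrast between the two lookup styles can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's kernel: the group size `G`, the function names, and the list-based tables are all assumptions made for the example, and it ignores SIMD, packing, and memory layout entirely.

```python
from itertools import product

G = 2  # ternary weights per lookup group (assumed for illustration)
PATTERNS = list(product((-1, 0, 1), repeat=G))  # 3**G = 9 weight patterns
INDEX = {p: i for i, p in enumerate(PATTERNS)}


def scalar_lut_dot(weights, tokens):
    """Scalar style: a separate table per token, one lookup per (group, token)."""
    out = []
    for acts in tokens:
        total = 0
        for k in range(0, len(weights), G):
            # Partial sums of every weight pattern against this token's activations.
            table = [sum(w * a for w, a in zip(p, acts[k:k + G])) for p in PATTERNS]
            total += table[INDEX[tuple(weights[k:k + G])]]
        out.append(total)
    return out


def vector_lut_dot(weights, tokens):
    """Vector style: one unified table per group; each entry covers all N tokens."""
    out = [0] * len(tokens)
    for k in range(0, len(weights), G):
        # table[i] is a length-N row: partial sums of pattern i for every token.
        table = [[sum(w * a for w, a in zip(p, acts[k:k + G])) for acts in tokens]
                 for p in PATTERNS]
        row = table[INDEX[tuple(weights[k:k + G])]]  # single 1 -> N lookup
        out = [o + r for o, r in zip(out, row)]
    return out


weights = [1, -1, 0, 1]                               # one ternary weight row
tokens = [[3, 5, 2, 7], [1, 4, 6, 0], [2, 2, 2, 2]]   # N = 3 parallel tokens
direct = [sum(w * a for w, a in zip(weights, acts)) for acts in tokens]
```

Both functions return the same values as the direct dot products. The sketch rebuilds each table for a single weight row; in a LUT-based kernel the tables would normally be precomputed once per activation group and reused across many weight rows, which is where the lookup pattern starts to dominate memory traffic.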

If this is right

Prefilling and test-time scaling become memory-bandwidth efficient on general-purpose CPUs.
Up to 4.2× speedup over state-of-the-art baselines holds across five edge devices and three LLMs.
Direct integration into llama.cpp makes the gains available in existing open-source deployments.
Ultra-low-bit models remain competitive without relying on NPUs for parallel workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same unified-lookup idea may reduce bandwidth waste in other parallel, memory-bound kernels beyond LLMs.
If cache behavior remains favorable at larger scales, CPU-based edge inference could narrow the gap with specialized accelerators.
Combining Vec-LUT with operator fusion, or running it on different memory hierarchies, would be a natural next measurement.

Load-bearing premise

The vector LUT and streamed lookup incur no hidden cache thrashing or synchronization cost that would erase gains once token parallelism exceeds the tested regimes.

What would settle it

Measure speed at token counts much higher than those tested; if speedups vanish because of extra cache misses or synchronization overhead, the central claim fails.
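A rough way to see where such a test might start to bite is to estimate the footprint of one unified table as the token count N grows. The sketch below is a back-of-the-envelope model under stated assumptions: the group size of 4, the 2-byte entries, and the 32 KiB cache are hypothetical values chosen for illustration, and the model does not account for the paper's Cache-Aware Streamed Lookup or any tiling.

```python
def vector_lut_bytes(group_size, n_tokens, bytes_per_entry=2):
    """Footprint of one unified table: 3**group_size rows, each holding N results."""
    return (3 ** group_size) * n_tokens * bytes_per_entry


def largest_n_that_fits(cache_bytes, group_size, bytes_per_entry=2):
    """Largest token count whose single-group table still fits in the cache."""
    return cache_bytes // ((3 ** group_size) * bytes_per_entry)


L1_BYTES = 32 * 1024  # hypothetical 32 KiB L1 data cache

for n in (8, 32, 128, 512):
    size = vector_lut_bytes(group_size=4, n_tokens=n)
    print(f"N={n:4d}  table={size:6d} bytes  fits_in_L1={size <= L1_BYTES}")
```

Under these assumptions the table for a single group stops fitting once N passes a couple of hundred tokens, so a sweep that crosses that point on real hardware would be the informative measurement.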

Figures

Figures reproduced from arXiv: 2512.06443 by Chengyu Yin, Jianyu Wei, Ting Cao, Weijun Wang, Xiangyu Li, Yunxin Liu.

**Figure 2.** A minimal example of using LUT to calculate

**Figure 4.** Overview of the Vec-LUT mpGeMM kernel

**Figure 5.** Mappings among packed weights (bits and decimal), unpacked weights (ternary), and precomputed LUT

**Figure 7.** Examples of topological precomputing to re

**Figure 8.** mpGeMM kernel benchmark across devices and threads, using real-world LLMs’ GeMM shapes.

**Figure 9.** End-to-end prefilling comparison across models, devices and threads. Sub-2-bit packings are hatched.

**Figure 11.** Prefilling throughput of HF BitNet 3B on In

**Figure 12.** Prefilling throughput of HF BitNet 3B on

Original abstract

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout, and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/OpenBitSys/vlut.cpp.
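The "1.58-bit" figure in the abstract corresponds to the information content of a ternary weight, log2(3) ≈ 1.585 bits. One common way to get close to that bound in storage is to pack five ternary values into one byte, since 3^5 = 243 fits in 256. The sketch below shows that arithmetic; the base-3 packing order and the function names are assumptions for illustration and are not claimed to match the packing used by Vec-LUT.

```python
import math

TRITS_PER_BYTE = 5  # assumption: 3**5 = 243 distinct values fit in one byte


def pack_trits(trits):
    """Pack 5 ternary weights in {-1, 0, 1} into one base-3 index in [0, 242]."""
    assert len(trits) == TRITS_PER_BYTE
    index = 0
    for t in trits:
        index = index * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return index


def unpack_trits(index):
    """Invert pack_trits: recover the 5 ternary weights from the index."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        index, digit = divmod(index, 3)
        trits.append(digit - 1)
    return trits[::-1]  # digits were extracted least-significant first


bits_per_weight_ideal = math.log2(3)          # information-theoretic bound
bits_per_weight_packed = 8 / TRITS_PER_BYTE   # storage cost of this packing
packed = pack_trits([1, -1, 0, 1, 0])
```

The packed index can double as the row index into a precomputed table for a group of five weights, which is the general idea behind indexing a LUT with packed weights.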

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Vec-LUT replaces scalar LUT accesses with a single vector lookup for parallel tokens and shows measurable speedups on edge CPUs, but the gains rest on tested regimes where cache behavior stays favorable.

Vec-LUT's main move is to stop doing separate scalar lookups for each token and instead build one unified table that serves N tokens with a single index operation. That directly targets the repetitive non-contiguous reads that waste bandwidth once you move beyond single-token decoding into prefilling or test-time scaling. The two supporting pieces are a tensor layout centered on the vector LUT and a cache-aware streaming method for the actual loads. Both are concrete enough that the authors could integrate the whole thing into llama.cpp and release the code, which makes the work immediately usable rather than just a diagram.

The numbers come from five different edge devices and three models, with the largest reported gain at 4.2× over the baselines they chose. That level of hardware coverage and open implementation is better than most systems papers at this stage. The root-cause section on why scalar LUTs thrash memory is straightforward and lines up with what you see on CPUs with small caches. No invented constants or circular derivations appear; the claims sit on measured runtimes.

The soft spot is that the vector formulation's working set and any synchronization or contention costs are only characterized inside the token counts and batch sizes they actually ran. If those costs grow faster than linearly once you push parallelism higher, the reported speedups would shrink. The abstract also gives no explicit statement on timing methodology or variance, so a referee would want the full experimental appendix to judge how much of the 4.2× is robust versus setup-specific.

This paper is for people who already work on CPU inference engines for quantized LLMs and need practical bandwidth wins on commodity edge silicon. A reader who cares about on-device throughput will pick up usable layout and streaming tricks even if they adapt the details. It is worth sending to peer review because the core change is simple to verify, the code is public, and the hardware results are tied to real devices rather than simulation.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vec-LUT, a vector-based lookup table paradigm for parallel inference of ultra-low-bit (1.58-bit) LLMs on edge CPUs. It diagnoses scalar LUT's repetitive non-contiguous accesses as the cause of bandwidth underutilization in multi-token settings such as prefilling, introduces a unified 1→N lookup realized via Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup, and reports up to 4.2× speedup over state-of-the-art baselines across 5 edge devices and 3 LLMs, with the implementation integrated into llama.cpp and code released.

Significance. If the measured speedups prove robust, the work would meaningfully advance practical on-device LLM deployment by improving memory-bandwidth utilization in parallel token regimes without requiring specialized hardware. The public release of code and integration into a widely used inference engine are concrete strengths that support reproducibility and adoption.

major comments (2)

[§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.
[§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.
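The run-to-run variance asked for in the first major comment can be reported with a very small timing harness. The following is a generic sketch using only the Python standard library; the function names and the default of five repeats are assumptions for illustration, and nothing here reflects how the paper's benchmarks were actually run.

```python
import statistics
import time


def time_runs(fn, repeats=5):
    """Return wall-clock durations in seconds for `repeats` calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean and sample standard deviation of a list of durations."""
    return statistics.mean(durations), statistics.stdev(durations)


def speedup(baseline_durations, candidate_durations):
    """Ratio of mean baseline time to mean candidate time."""
    return statistics.mean(baseline_durations) / statistics.mean(candidate_durations)
```

Reporting the mean together with the standard deviation for both the baseline and the candidate, at each batch size and token count, is what lets a reader judge whether a headline ratio sits outside the measurement noise.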

minor comments (2)

[Abstract] Abstract: the measurement methodology (devices, models, batch sizes) should be summarized in one sentence so readers can immediately gauge the scope of the 4.2× claim.
[§4] Figure captions and §4 tables: axis labels and legend entries should explicitly state the token parallelism level (N) for each bar so the scaling behavior is immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to enhance clarity and provide supporting analysis.

Referee: [§4 Evaluation] §4 (Evaluation) and abstract: the headline 4.2× figure is presented without explicit statement of the batch sizes, sequence lengths, token counts, or run-to-run variance used for both Vec-LUT and the baselines; this information is required to confirm that the comparison is apples-to-apples and to assess whether the gains survive the higher parallelism regimes targeted by the prefilling and test-time scaling use cases.

Authors: We agree that explicit experimental parameters are necessary for reproducibility. In the revised version, we have updated the abstract and Section 4 to state the batch sizes (ranging from 1 to 32), sequence lengths (up to 2048), and parallel token counts N for each reported result. We also include run-to-run variance as standard deviation over five repeated executions for both Vec-LUT and all baselines. These additions confirm that the 4.2× figure was obtained under consistent conditions representative of prefilling workloads.

revision: yes

Referee: [§3.2 Cache-Aware Streamed Lookup] §3.2 (Cache-Aware Streamed Lookup) and §4.3 (scalability discussion): the paper correctly identifies scalar LUT's non-contiguous accesses but provides no direct measurements or analytical bounds on cache-line contention, L2/L3 thrashing, or inter-thread synchronization cost as the number of parallel tokens N grows beyond the evaluated range; without such data the claim that the vector formulation delivers net bandwidth gains at scale remains unverified.

Authors: We acknowledge the request for stronger scalability evidence. Our current evaluations already cover multiple values of N on five edge devices, and Section 3.2 explains how the 1→N vector lookup eliminates repetitive non-contiguous accesses. In the revision we have added an analytical model in §3.2 that bounds cache-line contention and L2/L3 thrashing as a function of N, demonstrating an O(1/N) reduction relative to scalar LUT. We have also extended the discussion in §4.3 with projected bandwidth utilization for larger N based on this model. New hardware measurements beyond the evaluated range are not included, as they fall outside the scope of the practical edge-device scenarios targeted by the work.

revision: partial
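One simple way to read a 1/N scaling is to count table index operations rather than bytes. The sketch below does only that: it is not the analytical model mentioned in the response, which is not reproduced on this page, and the parameter names and example shape are assumptions chosen for illustration.

```python
def lookup_counts(k, group_size, n_tokens):
    """Count table index operations for one weight row of length k.

    Scalar LUT: one lookup per (weight group, token).
    Vector LUT: one lookup per weight group, returning N results at once.
    """
    groups = k // group_size
    scalar = groups * n_tokens
    vector = groups
    return scalar, vector


scalar, vector = lookup_counts(k=4096, group_size=4, n_tokens=32)
ratio = vector / scalar  # equals 1 / n_tokens
```

The count says nothing about cache behavior: each vector lookup moves N results, so the total bytes read are comparable, and the benefit depends on those N results being contiguous and cache-resident.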

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct runtime measurements with no fitted parameters or self-referential derivations.

full rationale

The paper presents a new vector LUT paradigm for parallel ultra-low-bit LLM inference, supported by two implementation techniques (Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup) and validated through direct evaluations on 5 edge devices across 3 LLMs. No equations, first-principles derivations, or parameter-fitting steps appear in the provided text; the central speedup claim (up to 4.2×) is reported as an observed outcome of the implementation rather than a quantity that reduces to its own inputs by construction. Self-citations are absent from the load-bearing claims, and the work is self-contained against external benchmarks via open-source integration into llama.cpp.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces no new physical constants or fitted parameters; it relies on standard assumptions about CPU cache behavior and memory bandwidth being the dominant bottleneck in the scalar LUT case.

axioms (1)

domain assumption: Memory bandwidth is the primary limiter for parallel LUT-based inference on edge CPUs
Stated as the root cause of underutilization in the scalar paradigm.

invented entities (1)

Vector LUT (no independent evidence)
purpose: Unified lookup table serving multiple tokens with one wide memory access
New lookup abstraction introduced to replace scalar per-token accesses.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: vector LUT ... constructs a unified LUT across parallel tokens, and performs a single 1→N lookup per index ... Vector LUT-Centric Tensor Layout and Cache-Aware Streamed Lookup

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to 4.2×

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
