Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale
4 Pith papers cite this work.
citing papers explorer
- Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
- Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.
- Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference
A formalized design-space framework with generator and TSMC 16nm-validated cost model shows that LUT reuse gains depend on activation type and that larger cores improve density, yielding 2.2x area reduction over multiplier baselines.
- BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware distillation.