Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale

Ayush Kaushal, Tejas Vaidhya, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish · 2024 · arXiv 2407.12327

4 Pith papers cite this work.

representative citing papers

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

cs.DC · 2025-12-06 · conditional · novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

cs.CL · 2026-05-07 · conditional · novelty 6.0

Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

A formalized design-space framework with a generator and a TSMC 16nm-validated cost model shows that LUT reuse gains depend on activation type and that larger cores improve density, yielding a 2.2x area reduction over multiplier baselines.

BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

cs.NE · 2026-04-14 · unverdicted · novelty 6.0

BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware distillation.

citing papers explorer

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices · cs.DC · 2025-12-06 · conditional · none · ref 22
Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks · cs.CL · 2026-05-07 · conditional · none · ref 6
Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference · cs.AR · 2026-04-28 · unverdicted · none · ref 10
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation · cs.NE · 2026-04-14 · unverdicted · none · ref 7
