TeLLMe: An energy-efficient ternary LLM accelerator for prefill and decode on edge FPGAs
3 Pith papers cite this work.
Fields: cs.AR (3)
Years: 2026 (3)
Citing papers explorer
- VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
  VitaLLM demonstrates a 16nm silicon prototype accelerator achieving 72.46 tokens/s decode for 3B ternary LLMs in 0.214 mm² area with reduced KV cache traffic via predictive sparse attention.
- VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling
  VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW with a figure-of-merit of 17.4 TOPS/mm²/W by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.
- Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference
  A formalized design-space framework with generator and TSMC 16nm-validated cost model shows that LUT reuse gains depend on activation type and that larger cores improve density, yielding 2.2x area reduction over multiplier baselines.