TeLLMe: An energy-efficient ternary LLM accelerator for prefill and decode on edge FPGAs

2025 · arXiv 2504.16266

3 Pith papers cite this work.

representative citing papers

VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

cs.AR · 2026-05-01 · unverdicted · novelty 7.0 · ref 5

VitaLLM demonstrates a 16 nm silicon prototype accelerator that achieves 72.46 tokens/s decode throughput for 3B ternary LLMs in 0.214 mm² of area, and reduces KV cache traffic through predictive sparse attention.

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

cs.AR · 2026-04-30 · conditional · novelty 6.0 · ref 8

VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW, with a figure of merit of 17.4 TOPS/mm²/W, by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

cs.AR · 2026-04-28 · unverdicted · novelty 6.0 · ref 6

A formalized design-space framework, with a hardware generator and a cost model validated in TSMC 16 nm, shows that LUT reuse gains depend on the activation type and that larger cores improve density, yielding a 2.2× area reduction over multiplier-based baselines.
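The third paper concerns lookup-table (LUT) based inference for 1.58-bit (ternary) weights. As a rough illustration of the general idea only, and not the method of TeLLMe or of any paper listed here, the Python sketch below precomputes, for each group of `g` activations, a table of partial sums covering every ternary weight pattern, then evaluates each weight row with one lookup per group. The function names (`build_luts`, `encode`, `lut_matvec`, `reference_matvec`) and the base-3 indexing scheme are hypothetical choices made for this sketch. Because the tables depend only on the activations, they are built once and reused across every output row, which is one plausible reading of "LUT reuse".

```python
from itertools import product

TERNARY = (-1, 0, 1)  # 1.58-bit weights take one of three values


def build_luts(x, g):
    """Hypothetical helper: one table per group of g activations.

    Entry i of a table holds sum(w * a) for the i-th ternary pattern w,
    in the order produced by itertools.product (first weight most significant).
    """
    luts = []
    for start in range(0, len(x), g):
        group = x[start:start + g]  # the last group may be shorter than g
        luts.append([sum(w * a for w, a in zip(pattern, group))
                     for pattern in product(TERNARY, repeat=len(group))])
    return luts


def encode(weights):
    """Map a ternary weight group to its base-3 table index (digit = w + 1)."""
    idx = 0
    for w in weights:
        idx = idx * 3 + (w + 1)
    return idx


def lut_matvec(W, x, g=2):
    """Ternary matrix-vector product using lookups instead of multiplies."""
    luts = build_luts(x, g)  # built once, shared by every row of W
    out = []
    for row in W:
        acc = 0
        for k, start in enumerate(range(0, len(x), g)):
            acc += luts[k][encode(row[start:start + g])]
        out.append(acc)
    return out


def reference_matvec(W, x):
    """Plain multiply-accumulate baseline for comparison."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


if __name__ == "__main__":
    W = [[1, 0, -1, 1, 0], [-1, -1, 0, 0, 1]]
    x = [3, 5, -2, 7, 4]
    print(lut_matvec(W, x))  # → [12, -4]
```

A larger `g` means fewer lookups per row but tables with 3^g entries each, so the group size trades lookup count against table storage.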