Full stack optimization of transformer inference: a survey

2023 · arXiv 2302.14017

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

cs.AR · 2026-04-13 · unverdicted · novelty 6.0

A CIM-based hardware-software co-design in 65 nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

cs.DC · 2026-04-10 · unverdicted · novelty 6.0

Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.

D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

cs.AR · 2026-02-05 · unverdicted · novelty 6.0

D-Legion proposes a scalable architecture of Legions containing adaptive-precision systolic array cores that accelerates quantized LLM matrix multiplications, delivering up to 8.2x lower latency and 3.8x higher memory savings versus prior designs.

CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

cs.AR · 2026-04-17 · unverdicted · novelty 5.0

CIMple delivers a 32 kb digital SRAM-based compute-in-memory accelerator for transformer self-attention that reaches 26.1 TOPS/W at 0.85 V in 28 nm with INT8 precision using dual-banked architecture and LUT-based split softmax.

citing papers explorer

EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

cs.AR · 2026-04-13 · unverdicted · none · ref 31

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

cs.DC · 2026-04-10 · unverdicted · none · ref 16

D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

cs.AR · 2026-02-05 · unverdicted · none · ref 6

CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

cs.AR · 2026-04-17 · unverdicted · none · ref 15