SCALE-Sim: Systolic CNN Accelerator Simulator

Ananda Samajdar; Matthew Mattina; Paul Whatmough; Tushar Krishna; Yuhao Zhu

arxiv: 1811.02883 · v2 · pith:XDG66KQDnew · submitted 2018-10-16 · 💻 cs.DC · cs.AR

keywords: scale-sim, simulator, systolic, accelerator, accelerators, deep, design, insights

Systolic arrays are one of the most popular compute substrates within deep learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools for gaining insight into both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce the Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. To the best of our knowledge, this is the first systolic-array simulator tuned for running DNNs. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, dataflow, and aspect ratio on the overall runtime and energy of deep learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
cs.DC 2026-05 unverdicted novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error...

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
cs.AR 2026-04 unverdicted novelty 7.0

DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
cs.AR 2024-05 unverdicted novelty 7.0

FEATHER integrates data reordering into its reduction network via a new spatial array (Nest) and multi-stage network (BIRRD) to enable low-overhead dataflow switching in ML accelerators, delivering 1.27-2.89x latency ...

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
cs.AR 2026-04 unverdicted novelty 6.0

AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5x lower attention latency and 6.9x lower energy than NVIDIA H...

FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
cs.AR 2026-03 conditional novelty 6.0

FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as s...

CHICO-Agent: An LLM Agent for the Cross-layer Optimization of 2.5D and 3D Chiplet-based Systems
cs.AR 2026-04 unverdicted novelty 5.0

CHICO-Agent uses LLM agents with a knowledge base to find lower-cost configurations for 2.5D/3D chiplet systems than simulated annealing while providing an interpretable audit trail.