SCALE-Sim: Systolic CNN Accelerator Simulator

Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna

classification 💻 cs.DC cs.AR

keywords scale-sim, simulator, systolic, accelerator, accelerators, deep, design, insights

Systolic arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools that provide insights into both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. To the best of our knowledge, this is the first systolic-array simulator tuned for running DNNs. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, dataflow, and aspect ratio on the overall runtime and energy of Deep Learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners.
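
To make the aspect-ratio effect mentioned in the abstract concrete, here is a minimal first-order sketch in Python. It is not SCALE-Sim's model or API: the function names, the output-stationary assumption, and the per-fold cost of `k + rows + cols - 2` cycles (accumulation plus skew fill and drain) are all illustrative assumptions, and memory bandwidth stalls are ignored entirely.

```python
from math import ceil


def os_runtime_cycles(m, k, n, rows, cols):
    """Rough cycle estimate for an (m x k) @ (k x n) GEMM on a rows x cols array.

    Assumed model (hypothetical, not taken from SCALE-Sim): output-stationary
    dataflow, one output tile of size rows x cols per fold, no memory stalls.
    """
    # Number of output tiles ("folds") needed to cover the m x n result.
    folds = ceil(m / rows) * ceil(n / cols)
    # k accumulation steps plus the skew needed to fill and drain the array.
    cycles_per_fold = k + rows + cols - 2
    return folds * cycles_per_fold


def utilization(m, k, n, rows, cols):
    """Fraction of PE-cycles doing useful multiply-accumulates under the model."""
    useful_macs = m * k * n
    pe_cycles = os_runtime_cycles(m, k, n, rows, cols) * rows * cols
    return useful_macs / pe_cycles


def sweep_aspect_ratios(m, k, n, total_pes):
    """Estimate runtime for each power-of-two row count that divides total_pes."""
    results = {}
    rows = 1
    while rows <= total_pes:
        if total_pes % rows == 0:
            cols = total_pes // rows
            results[(rows, cols)] = os_runtime_cycles(m, k, n, rows, cols)
        rows *= 2
    return results


if __name__ == "__main__":
    # Same PE budget, different shapes: print the estimated cycles for each.
    for shape, cycles in sweep_aspect_ratios(64, 64, 64, total_pes=16).items():
        print(shape, cycles)
```

Even under these simplifying assumptions, arrays with the same number of processing elements but different shapes yield different cycle counts for the same workload, which is the kind of trade-off a cycle-accurate simulator quantifies in far more detail.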

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
cs.AR 2026-04 unverdicted novelty 7.0

DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
cs.AR 2026-04 unverdicted novelty 6.0

AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
cs.AR 2026-03 conditional novelty 6.0

FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as s...
CHICO-Agent: An LLM Agent for the Cross-layer Optimization of 2.5D and 3D Chiplet-based Systems
cs.AR 2026-04 unverdicted novelty 5.0

CHICO-Agent uses LLM agents with a knowledge base to find lower-cost configurations for 2.5D/3D chiplet systems than simulated annealing while providing an interpretable audit trail.