SCALE-Sim: Systolic CNN Accelerator Simulator
Systolic arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools that provide insights into both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. To the best of our knowledge, this is the first systolic-array simulator tuned for running DNNs. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, dataflow, and aspect ratio on the overall runtime and energy of Deep Learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners.
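To make the aspect-ratio effect mentioned in the abstract concrete, here is a minimal first-order sketch in Python. It is not SCALE-Sim's cycle-accurate model or its API. The function name `os_gemm_cycles`, the output-stationary mapping, and the per-fold cost of `k + rows + cols - 2` cycles are all assumptions chosen for illustration, and memory bandwidth stalls are ignored entirely.

```python
import math


def os_gemm_cycles(m, k, n, rows, cols):
    """Rough cycle estimate for an (m x k) @ (k x n) matrix multiply.

    Hypothetical toy model, NOT SCALE-Sim's method. Assumptions:
    - output-stationary mapping: each PE accumulates one output element
    - the m x n output is tiled into "folds" of size rows x cols
    - each fold streams k operands plus a skew fill/drain of
      (rows + cols - 2) cycles
    - operands are always available (no bandwidth stalls)
    """
    folds = math.ceil(m / rows) * math.ceil(n / cols)
    cycles_per_fold = k + rows + cols - 2
    return folds * cycles_per_fold


if __name__ == "__main__":
    # Same PE count (1024), three aspect ratios, one tall-and-narrow workload.
    m, k, n = 256, 64, 16
    for rows, cols in [(8, 128), (32, 32), (128, 8)]:
        cycles = os_gemm_cycles(m, k, n, rows, cols)
        print(f"{rows:>3} x {cols:<3} array: {cycles} cycles")
```

Even under these simplifications, arrays with the same number of processing elements give different estimates for the same workload, because the array shape determines how many folds the output needs. A cycle-accurate simulator refines this picture with real dataflows and memory behavior.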
Forward citations
Cited by 4 Pith papers
- DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
  DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
- AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
  AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
- FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
  FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as s...
- CHICO-Agent: An LLM Agent for the Cross-layer Optimization of 2.5D and 3D Chiplet-based Systems
  CHICO-Agent uses LLM agents with a knowledge base to find lower-cost configurations for 2.5D/3D chiplet systems than simulated annealing while providing an interpretable audit trail.