PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
2 Pith papers cite this work.
Fields: cs.AR (2)
Citing papers:
- LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
  LOCALUT delivers a 1.82x geometric-mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.
- DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
  DCC is a data-centric compiler that co-optimizes data partitioning strategies with compute loop partitioning for ML kernels on multiple PIM architectures, reporting up to 13.17x speedup on AttAcc PIM and 4.52x average for LLM inference over GPU.