NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.DC (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
UMDAM introduces a column-major, tile-based data layout and a configurable DRAM address mapping to enable efficient NPU-PIM co-execution for LLM inference, reducing time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x on OPT models without added memory overhead or bandwidth loss.