UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
Pith reviewed 2026-05-21 20:26 UTC · model grok-4.3
The pith
A column-major tile-based data layout with configurable DRAM mapping unifies NPU and PIM access patterns for LLM inference without added overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency without introducing extra memory overhead or bandwidth loss.
What carries the argument
The unified memory-affinity data layout and DRAM address mapping (UMDAM) that aligns column-major tiling with programmable row-buffer mappings to serve both NPU compute patterns and PIM access simultaneously.
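To make the layout idea concrete, the following is a minimal sketch of one possible column-major tiled addressing scheme. It is not the paper's implementation. It assumes that tiles are ordered column-major across the matrix, that elements inside each tile are also column-major, and that the matrix dimensions are multiples of the tile size. The function names and parameters are hypothetical.

```python
def tiled_col_major_offset(r, c, rows, cols, tile_r, tile_c):
    """Linear element offset of matrix entry (r, c) in a column-major tiled layout.

    Assumption (not from the paper): tiles are ordered column-major across the
    matrix, and elements inside a tile are column-major as well.
    """
    if rows % tile_r or cols % tile_c:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    if not (0 <= r < rows and 0 <= c < cols):
        raise IndexError("element index out of range")
    tiles_per_col = rows // tile_r            # tiles stacked along the row dimension
    tile_row, in_r = divmod(r, tile_r)        # which tile, and position inside it
    tile_col, in_c = divmod(c, tile_c)
    tile_index = tile_col * tiles_per_col + tile_row
    return tile_index * (tile_r * tile_c) + in_c * tile_r + in_r


def is_permutation(rows, cols, tile_r, tile_c):
    """True if every element gets a unique offset in 0 .. rows*cols - 1."""
    offsets = sorted(
        tiled_col_major_offset(r, c, rows, cols, tile_r, tile_c)
        for r in range(rows)
        for c in range(cols)
    )
    return offsets == list(range(rows * cols))
```

Because the mapping is a permutation of the element offsets, a layout of this kind needs no padding or duplicate copies when the dimensions divide evenly. That is the capacity side of the zero-overhead claim; it says nothing about bandwidth.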
If this is right
- Decode phases in LLM inference on edge devices can leverage PIM for memory-bound operations while keeping NPU execution unchanged.
- End-to-end latency for token generation drops without requiring extra DRAM capacity or redesign of existing memory controllers.
- Heterogeneous NPU-PIM setups become viable for production LLM serving on resource-constrained hardware.
- The same layout can support both prefill and decode stages without separate data copies.
Where Pith is reading between the lines
- The configurable mapping could be extended to other memory-bound workloads such as recommendation models or graph neural networks on similar edge platforms.
- Automated selection of mapping parameters based on model size and layer dimensions might further reduce manual tuning.
- Integration with dynamic voltage-frequency scaling could amplify the energy savings already implied by lower execution time.
- The approach may apply to other PIM variants beyond DRAM-based ones if the tile size is adjusted to match their internal granularity.
Load-bearing premise
A single column-major tile layout can be made fully compatible with NPU operations and optimal for PIM through mapping adjustments alone, without any performance or capacity trade-offs.
What would settle it
Measure TTFT and TTLT on the same NPU-PIM hardware running OPT models, once with the UMDAM layout and once with a standard row-major layout, while tracking total memory footprint and sustained bandwidth. The claim holds only if the reported speedups appear with an unchanged footprint and no drop in sustained bandwidth.
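The comparison described above can be sketched as a small bookkeeping helper. This is an illustrative outline only: the class, field, and function names are hypothetical, the timestamps would come from whatever profiler the platform provides, and sustained bandwidth is left out because it needs hardware counters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunTiming:
    """Timestamps in seconds for one inference request (illustrative names)."""
    request_start: float
    token_done: List[float]  # completion time of each generated token, in order

    def ttft(self) -> float:
        # Time-to-first-token: request start until the first token completes.
        return self.token_done[0] - self.request_start

    def ttlt(self) -> float:
        # Time-to-last-token: request start until the final token completes.
        return self.token_done[-1] - self.request_start


def compare(baseline: RunTiming, candidate: RunTiming,
            baseline_bytes: int, candidate_bytes: int) -> dict:
    """Speedups of candidate over baseline, plus relative memory overhead."""
    return {
        "ttft_speedup": baseline.ttft() / candidate.ttft(),
        "ttlt_speedup": baseline.ttlt() / candidate.ttlt(),
        # 0.0 means the candidate layout uses exactly the baseline footprint.
        "footprint_overhead": candidate_bytes / baseline_bytes - 1.0,
    }
```

A zero-overhead result would show a `footprint_overhead` of 0.0 next to speedups above 1.0. Any positive overhead would contradict the no-extra-memory claim.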
Original abstract
Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end LLM inference efficiency on edge devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes UMDAM, a unified data layout and DRAM address mapping scheme for heterogeneous NPU-PIM co-execution on edge devices running LLMs. It introduces a column-major, tile-based data layout paired with a configurable DRAM address mapping to achieve NPU computation compatibility while maximizing PIM efficiency, claiming zero extra memory overhead or bandwidth loss. Evaluations on OPT models report up to 3.0x reduction in time-to-first-token (TTFT) and 2.18x in time-to-last-token (TTLT), improving end-to-end inference efficiency.
Significance. If the zero-overhead compatibility and bandwidth preservation claims hold under detailed scrutiny, the work would offer a practical software-only optimization for memory-bound LLM decode phases on edge NPUs augmented with PIM, potentially enabling higher efficiency without hardware modifications or storage penalties. The reported speedups on real models like OPT indicate deployable gains for heterogeneous systems, though confirmation requires full methodological transparency.
major comments (2)
- [Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.
- [Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.
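The kind of cache-line utilization example the referee asks for could look like the sketch below. It counts how many cache lines a single tile read touches when the matrix is stored row-major and when each tile is stored contiguously. All sizes are hypothetical (a 64x64 matrix of 2-byte elements, 16x16 tiles, 64-byte lines), the base address is assumed to be line-aligned, and the model ignores DRAM banks and row buffers entirely.

```python
def cache_line_utilization(offsets, elem_bytes, line_bytes):
    """Fraction of fetched cache-line bytes that the access actually uses.

    offsets are element indices relative to a line-aligned base (assumption).
    """
    unique = set(offsets)
    lines = {(off * elem_bytes) // line_bytes for off in unique}
    return (len(unique) * elem_bytes) / (len(lines) * line_bytes)


def tile_in_row_major(tile_row, tile_col, cols, tile_r, tile_c):
    """Element offsets of one tile when the whole matrix is stored row-major."""
    r0, c0 = tile_row * tile_r, tile_col * tile_c
    return [(r0 + r) * cols + (c0 + c)
            for r in range(tile_r) for c in range(tile_c)]


def tile_in_tiled_layout(tile_index, tile_r, tile_c):
    """Element offsets of one tile when each tile is stored contiguously."""
    start = tile_index * tile_r * tile_c
    return list(range(start, start + tile_r * tile_c))


# Hypothetical sizes: 64x64 matrix, 2-byte elements, 16x16 tiles, 64-byte lines.
row_major_util = cache_line_utilization(tile_in_row_major(1, 2, 64, 16, 16), 2, 64)
tiled_util = cache_line_utilization(tile_in_tiled_layout(6, 16, 16), 2, 64)
```

Under these assumptions each 16-element tile row fills only half of a 64-byte line in the row-major case, while the contiguous tile fills every line it touches. A kernel that streams full matrix rows instead of tiles would see a different picture, which is why the analysis has to be done per kernel.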
minor comments (2)
- [Abstract] Abstract and evaluation descriptions omit baseline definitions, number of runs, error bars, and data-handling rules (e.g., warm-up iterations or model quantization details), which should be added for reproducibility.
- [Section 3.1] Notation for the DRAM mapping function and tile parameters could be clarified with a small example table showing address bits before and after reconfiguration.
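A small decoder illustrates what such an address-bit example might show. The field order follows the mapping quoted elsewhere on this page, (MSB) Row, Col M, Bank, Rank, Channel, Col L, Offset (LSB), reading Col M and Col L as the upper and lower column bits. The bit widths below are hypothetical and chosen only to add up to 32 bits; the paper's actual widths are not given here.

```python
FIELD_ORDER_MSB_TO_LSB = ("row", "col_m", "bank", "rank", "channel", "col_l", "offset")

# Hypothetical widths (in bits); both configurations total 32 bits.
WIDTHS_A = {"row": 14, "col_m": 4, "bank": 4, "rank": 1, "channel": 2, "col_l": 2, "offset": 5}
WIDTHS_B = {"row": 14, "col_m": 2, "bank": 4, "rank": 1, "channel": 2, "col_l": 4, "offset": 5}


def decode_address(addr, widths, order=FIELD_ORDER_MSB_TO_LSB):
    """Split a physical address into DRAM fields, peeling bits from the LSB side."""
    total_bits = sum(widths[name] for name in order)
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address does not fit the configured field widths")
    fields = {}
    for name in reversed(order):
        width = widths[name]
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return fields


def encode_address(fields, widths, order=FIELD_ORDER_MSB_TO_LSB):
    """Inverse of decode_address: pack the fields back into one address."""
    addr = 0
    for name in order:
        width = widths[name]
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"field {name} out of range")
        addr = (addr << width) | value
    return addr
```

With `WIDTHS_A`, byte address 128 decodes to channel 1, so consecutive 128-byte chunks rotate across channels. With `WIDTHS_B`, the same address stays in channel 0 and only advances the lower column bits. Moving bits between Col M and Col L is one way a configurable mapping could trade channel interleaving against contiguity within a row.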
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the supporting evidence for our claims.
- Referee: [Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.
  Authors: We agree that isolated NPU-only measurements would provide stronger direct evidence for the absence of NPU-side degradation. The current manuscript focuses on end-to-end gains for OPT models under the proposed layout, which was designed to maintain compatibility. To address this point, we will add NPU-only execution times, cache-miss counters, and pre/post memory-bandwidth traces in the revised evaluation section.
  Revision: yes
- Referee: [Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.
  Authors: We will expand Section 3 with an explicit stride analysis for representative NPU kernels and include a concrete example demonstrating cache-line utilization under the chosen tile sizes and column-major reordering. This will directly support the compatibility and zero-bandwidth-loss claims with access-pattern details.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper proposes UMDAM as an engineering design for unified column-major tile layout plus configurable DRAM mapping to support NPU-PIM co-execution. Claims of zero extra memory overhead and no bandwidth loss are presented as properties of the chosen layout and mapping strategy, then validated through end-to-end benchmarking on OPT models that reports TTFT and TTLT speedups. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results rest on implementation measurements against external hardware baselines rather than reducing to the inputs by construction. The work is therefore self-contained.
Axiom & Free-Parameter Ledger
invented entities (1)
- UMDAM unified layout and mapping scheme: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: The mapping order is defined as (MSB) Row–Col M–Bank–Rank–Channel–Col L–Offset (LSB)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.