arxiv: 2511.03293 · v2 · pith:CCOUXCTCnew · submitted 2025-11-05 · 💻 cs.DC

UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

Pith reviewed 2026-05-21 20:26 UTC · model grok-4.3

classification 💻 cs.DC
keywords NPU-PIM co-execution · data layout · DRAM address mapping · LLM inference · edge devices · time-to-first-token · memory affinity

The pith

A column-major tile-based data layout with configurable DRAM mapping unifies NPU and PIM access patterns for LLM inference without added overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UMDAM to fix the data layout mismatches and bandwidth loss that arise when Neural Processing Units (NPUs) and Processing-in-Memory (PIM) units work together on edge devices running large language models. It combines a column-major, tile-based arrangement of data with a configurable DRAM address mapping that keeps NPU computation unchanged while letting PIM units access data efficiently. According to the authors, this design avoids the extra memory storage and bandwidth loss that usually occur in such mixed systems. Evaluations on OPT models report that the approach reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, improving end-to-end inference efficiency on edge hardware.

Core claim

UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency without introducing extra memory overhead or bandwidth loss.

What carries the argument

The unified memory-affinity data layout and DRAM address mapping (UMDAM) that aligns column-major tiling with programmable row-buffer mappings to serve both NPU compute patterns and PIM access simultaneously.
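
As a rough illustration of what a column-major, tile-based layout means for addressing, the sketch below maps a matrix coordinate to a linear element offset. The function name `tiled_offset`, the tile dimensions, and the choice of row-major order inside each tile are assumptions made for illustration; this page does not state the paper's exact intra-tile ordering.

```python
def tiled_offset(r, c, rows, cols, tile_r, tile_c):
    """Linear element offset of (r, c) in a column-major, tile-based layout.

    Assumptions (not taken from the paper): tiles are ordered column-major
    across the matrix, and elements inside one tile are stored row-major.
    """
    if rows % tile_r or cols % tile_c:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    if not (0 <= r < rows and 0 <= c < cols):
        raise IndexError("coordinate outside the matrix")
    tiles_per_col = rows // tile_r                    # tiles stacked vertically
    tile_row, in_r = divmod(r, tile_r)                # which tile, and row inside it
    tile_col, in_c = divmod(c, tile_c)                # which tile, and column inside it
    tile_index = tile_col * tiles_per_col + tile_row  # column-major order of tiles
    return tile_index * (tile_r * tile_c) + in_r * tile_c + in_c
```

When the matrix dimensions are multiples of the tile size, this mapping is a permutation of the element indices, so a layout of this kind needs no padding. That is consistent with the zero-storage-overhead claim but does not by itself establish it.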

If this is right

  • Decode phases in LLM inference on edge devices can leverage PIM for memory-bound operations while keeping NPU execution unchanged.
  • End-to-end latency for token generation drops without requiring extra DRAM capacity or redesign of existing memory controllers.
  • Heterogeneous NPU-PIM setups become viable for production LLM serving on resource-constrained hardware.
  • The same layout can support both prefill and decode stages without separate data copies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The configurable mapping could be extended to other memory-bound workloads such as recommendation models or graph neural networks on similar edge platforms.
  • Automated selection of mapping parameters based on model size and layer dimensions might further reduce manual tuning.
  • Integration with dynamic voltage-frequency scaling could amplify the energy savings already implied by lower execution time.
  • The approach may apply to other PIM variants beyond DRAM-based ones if the tile size is adjusted to match their internal granularity.

Load-bearing premise

A single column-major tile layout can be made fully compatible with NPU operations and optimal for PIM through mapping adjustments alone, without any performance or capacity trade-offs.

What would settle it

Measure TTFT and TTLT on the same NPU-PIM hardware running OPT models once with the UMDAM layout and once with a standard row-major layout, while tracking total memory footprint and sustained bandwidth to check for the claimed speedups and zero overhead.
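
The comparison described above can be reduced to a few ratios. The sketch below is one possible way to tabulate them; the `Run` record, its field names, and the `compare` helper are hypothetical, and any numbers passed in would have to come from real measurements rather than from this page.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured configuration (all fields are hypothetical names)."""
    ttft_ms: float          # time-to-first-token
    ttlt_ms: float          # time-to-last-token
    footprint_bytes: int    # total memory footprint of weights and buffers
    bandwidth_gbps: float   # sustained memory bandwidth during the run


def compare(baseline: Run, candidate: Run) -> dict:
    """Speedups and overheads of `candidate` relative to `baseline`."""
    return {
        "ttft_speedup": baseline.ttft_ms / candidate.ttft_ms,
        "ttlt_speedup": baseline.ttlt_ms / candidate.ttlt_ms,
        # 0.0 means no extra storage; positive values mean added overhead.
        "footprint_overhead": candidate.footprint_bytes / baseline.footprint_bytes - 1.0,
        # 1.0 means no bandwidth loss; values below 1.0 mean bandwidth was lost.
        "bandwidth_ratio": candidate.bandwidth_gbps / baseline.bandwidth_gbps,
    }
```

Under this framing, the paper's claim would be supported if the speedups exceed 1.0 while `footprint_overhead` stays at 0.0 and `bandwidth_ratio` stays at or above 1.0.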

Figures

Figures reproduced from arXiv: 2511.03293 by Hai Huang.

Figure 1: (a) Differences between Conventional layout and PIM
Figure 2: An illustration of UMDAM: (a) Unified data layout and DRAM address mapping for NPU-PIM, (b) System overview.
Figure 3: TTFT speedup of UMDAM over the NPU-PIM baseline
Figure 5: Deployability verification of UMDAM on Ascend
Original abstract

Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end LLM inference efficiency on edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes UMDAM, a unified data layout and DRAM address mapping scheme for heterogeneous NPU-PIM co-execution on edge devices running LLMs. It introduces a column-major, tile-based data layout paired with a configurable DRAM address mapping to achieve NPU computation compatibility while maximizing PIM efficiency, claiming zero extra memory overhead or bandwidth loss. Evaluations on OPT models report up to 3.0x reduction in time-to-first-token (TTFT) and 2.18x in time-to-last-token (TTLT), improving end-to-end inference efficiency.

Significance. If the zero-overhead compatibility and bandwidth preservation claims hold under detailed scrutiny, the work would offer a practical software-only optimization for memory-bound LLM decode phases on edge NPUs augmented with PIM, potentially enabling higher efficiency without hardware modifications or storage penalties. The reported speedups on real models like OPT indicate deployable gains for heterogeneous systems, though confirmation requires full methodological transparency.

major comments (2)
  1. [Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.
  2. [Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.
minor comments (2)
  1. [Abstract] Abstract and evaluation descriptions omit baseline definitions, number of runs, error bars, and data-handling rules (e.g., warm-up iterations or model quantization details), which should be added for reproducibility.
  2. [Section 3.1] Notation for the DRAM mapping function and tile parameters could be clarified with a small example table showing address bits before and after reconfiguration.
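
To make the second minor comment concrete, the sketch below shows what "address bits before and after reconfiguration" could look like. The field names, bit widths, and both mappings (`INTERLEAVED` and `PIM_FRIENDLY`) are hypothetical and are not the mappings used in the paper or in any particular memory controller.

```python
def decode(addr, fields):
    """Split a physical address into DRAM fields.

    `fields` is a list of (name, bit_width) pairs, least-significant first.
    """
    out = {}
    for name, width in fields:
        out[name] = addr & ((1 << width) - 1)  # take the low `width` bits
        addr >>= width                         # move on to the next field
    return out


# Hypothetical mapping that spreads consecutive 32-byte blocks across channels.
INTERLEAVED = [("offset", 5), ("channel", 1), ("bank", 3), ("column", 6), ("row", 14)]

# Hypothetical mapping that keeps consecutive blocks in one row of one bank.
PIM_FRIENDLY = [("offset", 5), ("column", 6), ("channel", 1), ("bank", 3), ("row", 14)]

for addr in (0x20, 0x800):
    print(hex(addr), decode(addr, INTERLEAVED), decode(addr, PIM_FRIENDLY))
```

With these assumed widths, address 0x20 lands in channel 1 under the first mapping but in column 1 of channel 0 under the second, so 2 KiB of consecutive addresses stay within a single row of a single bank. A table of this form for the paper's actual mapping would answer the referee's request.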

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the supporting evidence for our claims.

  1. Referee: [Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.

    Authors: We agree that isolated NPU-only measurements would provide stronger direct evidence for the absence of NPU-side degradation. The current manuscript focuses on end-to-end gains for OPT models under the proposed layout, which was designed to maintain compatibility. To address this point, we will add NPU-only execution times, cache-miss counters, and pre/post memory-bandwidth traces in the revised evaluation section. revision: yes

  2. Referee: [Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.

    Authors: We will expand Section 3 with an explicit stride analysis for representative NPU kernels and include a concrete example demonstrating cache-line utilization under the chosen tile sizes and column-major reordering. This will directly support the compatibility and zero-bandwidth-loss claims with access-pattern details. revision: yes
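
The cache-line utilization example requested by the referee could take roughly the following form. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the access is modeled as a read of one whole tile, and element size is assumed to divide the cache-line size so that no element straddles two lines.

```python
def row_major_tile_offsets(tile_row, tile_col, cols, tile_r, tile_c):
    """Element offsets of one tile when the whole matrix is stored row-major."""
    r0, c0 = tile_row * tile_r, tile_col * tile_c
    return [(r0 + r) * cols + (c0 + c) for r in range(tile_r) for c in range(tile_c)]


def contiguous_tile_offsets(tile_index, tile_r, tile_c):
    """Element offsets of one tile when every tile is stored contiguously."""
    base = tile_index * tile_r * tile_c
    return list(range(base, base + tile_r * tile_c))


def cache_line_utilization(offsets, elem_bytes, line_bytes):
    """Fraction of fetched bytes that the access actually uses."""
    lines = {(off * elem_bytes) // line_bytes for off in offsets}
    return (len(offsets) * elem_bytes) / (len(lines) * line_bytes)


# Assumed parameters: 64-column matrix, 8x8 tiles, 2-byte elements, 64-byte lines.
row_major = cache_line_utilization(row_major_tile_offsets(0, 0, 64, 8, 8), 2, 64)
tiled = cache_line_utilization(contiguous_tile_offsets(0, 8, 8), 2, 64)
print(row_major, tiled)  # 0.25 1.0
```

With these assumed parameters, reading one tile from a row-major matrix touches eight cache lines and uses a quarter of the fetched bytes, while a contiguous tile fills two lines completely. Whether real NPU kernels read whole tiles in this way is exactly what the stride analysis would need to show.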

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes UMDAM as an engineering design for unified column-major tile layout plus configurable DRAM mapping to support NPU-PIM co-execution. Claims of zero extra memory overhead and no bandwidth loss are presented as properties of the chosen layout and mapping strategy, then validated through end-to-end benchmarking on OPT models that reports TTFT and TTLT speedups. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results rest on implementation measurements against external hardware baselines rather than reducing to the inputs by construction. The work is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields limited visibility into parameters or assumptions; the design implicitly rests on standard computer-architecture assumptions about memory affinity and address mapping compatibility.

invented entities (1)
  • UMDAM unified layout and mapping scheme (no independent evidence)
    purpose: To resolve data layout mismatches and bandwidth loss in NPU-PIM co-execution
    The scheme is introduced by the paper as the core solution to the stated challenges.

pith-pipeline@v0.9.0 · 5695 in / 1318 out tokens · 49830 ms · 2026-05-21T20:26:54.739759+00:00 · methodology


