UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

Hai Huang

arxiv: 2511.03293 · v2 · pith:CCOUXCTCnew · submitted 2025-11-05 · 💻 cs.DC

Pith reviewed 2026-05-21 20:26 UTC · model grok-4.3

classification 💻 cs.DC

keywords NPU-PIM co-execution · data layout · DRAM address mapping · LLM inference · edge devices · time-to-first-token · memory affinity

The pith

A column-major tile-based data layout with configurable DRAM mapping unifies NPU and PIM access patterns for LLM inference without added overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UMDAM to fix the data layout mismatches and bandwidth loss that arise when Neural Processing Units (NPUs) and Processing-in-Memory (PIM) units work together on edge devices running large language models. It combines a column-major, tile-based arrangement of data with a configurable DRAM address mapping that keeps NPU computation unchanged while letting PIM units access data efficiently. According to the paper, this design avoids the extra memory storage and bandwidth loss that usually arise in such mixed systems. Tests on OPT models show the approach cuts time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, raising overall inference speed on edge hardware.

Core claim

UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency without introducing extra memory overhead or bandwidth loss.

What carries the argument

The unified memory-affinity data layout and DRAM address mapping (UMDAM) that aligns column-major tiling with programmable row-buffer mappings to serve both NPU compute patterns and PIM access simultaneously.
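
To make the idea of a column-major, tile-based layout concrete, the sketch below maps a matrix coordinate to a linear element offset. The tile shape, the ordering of tiles, and the function name `tiled_col_major_offset` are illustrative assumptions; the summary above does not give the paper's exact tiling parameters.

```python
def tiled_col_major_offset(r, c, rows, cols, tile_r, tile_c):
    """Linear element offset of entry (r, c) under a hypothetical
    column-major, tile-based layout.

    Assumptions (not taken from the paper): tiles are ordered
    column-major across the matrix, and elements inside each tile
    are also stored column-major.
    """
    if rows % tile_r or cols % tile_c:
        raise ValueError("matrix dimensions must be multiples of the tile shape")
    if not (0 <= r < rows and 0 <= c < cols):
        raise IndexError("matrix index out of range")

    tiles_per_col = rows // tile_r        # tiles stacked along the row axis
    tr, ir = divmod(r, tile_r)            # tile row, row inside the tile
    tc, ic = divmod(c, tile_c)            # tile column, column inside the tile

    tile_index = tc * tiles_per_col + tr  # walk tiles down a tile column first
    in_tile = ic * tile_r + ir            # walk elements down a column first
    return tile_index * (tile_r * tile_c) + in_tile


# A 4x4 matrix split into 2x2 tiles: every entry gets a unique offset.
offsets = [tiled_col_major_offset(r, c, 4, 4, 2, 2)
           for r in range(4) for c in range(4)]
print(sorted(offsets) == list(range(16)))
```

Because this mapping is a permutation of the element indices, a layout of this kind needs no padding or duplicate copies when the matrix dimensions are multiples of the tile shape. That is one sense in which a tiled layout can avoid extra storage; whether it also preserves bandwidth depends on the DRAM mapping, which this sketch does not model.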

If this is right

Decode phases in LLM inference on edge devices can leverage PIM for memory-bound operations while keeping NPU execution unchanged.
End-to-end latency for token generation drops without requiring extra DRAM capacity or redesign of existing memory controllers.
Heterogeneous NPU-PIM setups become viable for production LLM serving on resource-constrained hardware.
The same layout can support both prefill and decode stages without separate data copies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The configurable mapping could be extended to other memory-bound workloads such as recommendation models or graph neural networks on similar edge platforms.
Automated selection of mapping parameters based on model size and layer dimensions might further reduce manual tuning.
Integration with dynamic voltage-frequency scaling could amplify the energy savings already implied by lower execution time.
The approach may apply to other PIM variants beyond DRAM-based ones if the tile size is adjusted to match their internal granularity.

Load-bearing premise

A single column-major tile layout can be made fully compatible with NPU operations and optimal for PIM through mapping adjustments alone, without any performance or capacity trade-offs.

What would settle it

Measure TTFT and TTLT on the same NPU-PIM hardware running OPT models once with the UMDAM layout and once with a standard row-major layout, while tracking total memory footprint and sustained bandwidth to check for the claimed speedups and zero overhead.
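
The comparison described above reduces to a few ratios. The following sketch shows one way to compute them; the `Run` record, its field names, and the numbers in the example are placeholders chosen for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured configuration (hypothetical record layout)."""
    ttft_ms: float          # time-to-first-token
    ttlt_ms: float          # time-to-last-token
    footprint_bytes: int    # total memory footprint
    bandwidth_gbps: float   # sustained memory bandwidth


def compare(baseline: Run, candidate: Run) -> dict:
    """Speedups and overheads of `candidate` relative to `baseline`."""
    if min(candidate.ttft_ms, candidate.ttlt_ms) <= 0:
        raise ValueError("latencies must be positive")
    if baseline.footprint_bytes <= 0 or baseline.bandwidth_gbps <= 0:
        raise ValueError("baseline footprint and bandwidth must be positive")
    return {
        "ttft_speedup": baseline.ttft_ms / candidate.ttft_ms,
        "ttlt_speedup": baseline.ttlt_ms / candidate.ttlt_ms,
        # 0.0 means no extra storage relative to the baseline layout.
        "footprint_overhead": candidate.footprint_bytes / baseline.footprint_bytes - 1.0,
        # 1.0 means sustained bandwidth is unchanged.
        "bandwidth_ratio": candidate.bandwidth_gbps / baseline.bandwidth_gbps,
    }


# Placeholder values only, NOT results from the paper.
row_major = Run(ttft_ms=200.0, ttlt_ms=1000.0, footprint_bytes=1 << 30, bandwidth_gbps=50.0)
unified = Run(ttft_ms=100.0, ttlt_ms=500.0, footprint_bytes=1 << 30, bandwidth_gbps=50.0)
report = compare(row_major, unified)
print(report)
```

Under this framing, the zero-overhead claim corresponds to `footprint_overhead` staying at 0.0 and `bandwidth_ratio` staying at or above 1.0 while both speedups exceed 1.0.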

Figures

Figures reproduced from arXiv: 2511.03293 by Hai Huang.

**Figure 2.** An illustration of UMDAM: (a) Unified data layout and DRAM address mapping for NPU-PIM, (b) System overview. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** TTFT speedup of UMDAM over the NPU-PIM baseline. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 5.** Deployability verification of UMDAM on Ascend. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png]

Original abstract

Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end LLM inference efficiency on edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

UMDAM gives a concrete layout fix for NPU-PIM co-execution on edge LLMs with reported TTFT gains, but the no-bandwidth-loss claim rests on NPU-side effects that the abstract does not isolate.

The main point for you is that this paper puts forward UMDAM, a column-major tile layout plus configurable DRAM mapping meant to let NPUs and PIM run together on the same data without extra storage or bandwidth penalties, and it reports up to 3.0x TTFT and 2.18x TTLT reductions on OPT models for edge inference. That combination looks like the actual new piece rather than a broad new theory. It does a clean job of naming the real frictions in heterogeneous NPU-PIM setups, such as layout mismatches that force copies or penalize one side, and it frames the solution as a single scheme that keeps NPU compatibility while feeding PIM better. The end-to-end numbers on real models give a practical sense of the payoff.

The soft spot is the zero-overhead guarantee. Column-major reorganization can change stride patterns and bank access even when total bytes stay the same, and typical NPU kernels are tuned for row-major or channel-last layouts. The abstract reports overall speedups but does not break out NPU-only time, cache-miss rates, or memory-bandwidth traces before and after the change, so it is hard to tell whether PIM wins are partly offset by slower NPU execution. That gap calls for tighter evidence; it is not a fatal flaw.

This paper is aimed at hardware and systems people working on memory-bound edge AI, especially those already looking at PIM for decode phases. A reader who cares about concrete memory-affinity tricks will find usable ideas even if they want to re-run the experiments themselves. It is worth sending to peer review so the methodology and measurement details can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes UMDAM, a unified data layout and DRAM address mapping scheme for heterogeneous NPU-PIM co-execution on edge devices running LLMs. It introduces a column-major, tile-based data layout paired with a configurable DRAM address mapping to achieve NPU computation compatibility while maximizing PIM efficiency, claiming zero extra memory overhead or bandwidth loss. Evaluations on OPT models report up to 3.0x reduction in time-to-first-token (TTFT) and 2.18x in time-to-last-token (TTLT), improving end-to-end inference efficiency.

Significance. If the zero-overhead compatibility and bandwidth preservation claims hold under detailed scrutiny, the work would offer a practical software-only optimization for memory-bound LLM decode phases on edge NPUs augmented with PIM, potentially enabling higher efficiency without hardware modifications or storage penalties. The reported speedups on real models like OPT indicate deployable gains for heterogeneous systems, though confirmation requires full methodological transparency.

major comments (2)

[Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.
[Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.
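
One way to produce the kind of concrete example requested here is to count how many cache lines (or DMA bursts) a given access pattern touches. The sketch below is a simplified model under stated assumptions, not the paper's analysis: it assumes 2-byte elements, 64-byte lines, and that each touched line is fetched exactly once. The helper names are hypothetical.

```python
def row_major_offset(r, c, rows, cols):
    """Element offset of (r, c) in a plain row-major matrix."""
    return r * cols + c


def col_major_offset(r, c, rows, cols):
    """Element offset of (r, c) in a plain column-major matrix."""
    return c * rows + r


def cache_line_utilization(offsets, elem_bytes=2, line_bytes=64):
    """Fraction of fetched bytes that are actually used.

    Simplifying assumption: every distinct line touched by the access
    pattern is fetched once in full, and `offsets` has no repeats.
    """
    if not offsets:
        raise ValueError("access pattern is empty")
    lines = {(off * elem_bytes) // line_bytes for off in offsets}
    return (len(offsets) * elem_bytes) / (len(lines) * line_bytes)


# Read one full matrix row from a 64x64 matrix of 2-byte elements.
rows = cols = 64
read_row_rm = [row_major_offset(0, c, rows, cols) for c in range(cols)]
read_row_cm = [col_major_offset(0, c, rows, cols) for c in range(cols)]

util_rm = cache_line_utilization(read_row_rm)
util_cm = cache_line_utilization(read_row_cm)
print(util_rm, util_cm)
```

In this toy model a row read uses every fetched byte under row-major storage but only 1/32 of them under untiled column-major storage. A tiled variant would fall somewhere in between depending on the tile shape, which is the quantity a stride analysis would need to pin down.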

minor comments (2)

[Abstract] Abstract and evaluation descriptions omit baseline definitions, number of runs, error bars, and data-handling rules (e.g., warm-up iterations or model quantization details), which should be added for reproducibility.
[Section 3.1] Notation for the DRAM mapping function and tile parameters could be clarified with a small example table showing address bits before and after reconfiguration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the supporting evidence for our claims.

Referee: [Evaluation] Evaluation section: The reported TTFT and TTLT speedups on OPT models lack isolated NPU-only execution time measurements, cache-miss counters, or pre/post-layout memory-bandwidth traces. This is load-bearing for the central claim because the column-major tile layout must preserve NPU access efficiency and effective bandwidth identical to baseline; without these metrics, it remains possible that PIM gains are partially offset by NPU-side degradation in stride patterns or bank interleaving.

Authors: We agree that isolated NPU-only measurements would provide stronger direct evidence for the absence of NPU-side degradation. The current manuscript focuses on end-to-end gains for OPT models under the proposed layout, which was designed to maintain compatibility. To address this point, we will add NPU-only execution times, cache-miss counters, and pre/post memory-bandwidth traces in the revised evaluation section. revision: yes
Referee: [Section 3] Section 3 (Design): The claim that the configurable DRAM mapping ensures 'full compatibility with NPU computation patterns' and 'no bandwidth loss' is not supported by explicit stride analysis or access-pattern proofs for typical NPU kernels (which are often tuned for row-major or channel-last layouts). A concrete example of how tile sizes and column-major reordering affect cache-line utilization would be required to substantiate the zero-cost premise.

Authors: We will expand Section 3 with an explicit stride analysis for representative NPU kernels and include a concrete example demonstrating cache-line utilization under the chosen tile sizes and column-major reordering. This will directly support the compatibility and zero-bandwidth-loss claims with access-pattern details. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes UMDAM as an engineering design for unified column-major tile layout plus configurable DRAM mapping to support NPU-PIM co-execution. Claims of zero extra memory overhead and no bandwidth loss are presented as properties of the chosen layout and mapping strategy, then validated through end-to-end benchmarking on OPT models that reports TTFT and TTLT speedups. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results rest on implementation measurements against external hardware baselines rather than reducing to the inputs by construction. The work is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields limited visibility into parameters or assumptions; the design implicitly rests on standard computer-architecture assumptions about memory affinity and address mapping compatibility.

invented entities (1)

UMDAM unified layout and mapping scheme · no independent evidence
purpose: To resolve data layout mismatches and bandwidth loss in NPU-PIM co-execution
The scheme is introduced by the paper as the core solution to the stated challenges.

pith-pipeline@v0.9.0 · 5695 in / 1318 out tokens · 49830 ms · 2026-05-21T20:26:54.739759+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat unclear

Relation between the paper passage and the cited Recognition theorem.

The mapping order is defined as (MSB) Row–Col M–Bank–Rank-Channel–Col L–Offset (LSB)
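
The quoted mapping order can be read as a bit-field split of a physical address. A minimal sketch follows; the field widths are hypothetical (the passage gives only the order), and Rank and Channel are treated as two separate fields, which is an assumption about the quoted text.

```python
# Field order from MSB to LSB as quoted above.
# Widths are HYPOTHETICAL placeholders, not values from the paper.
FIELDS = [
    ("row", 16),
    ("col_m", 4),    # upper column bits
    ("bank", 4),
    ("rank", 1),
    ("channel", 2),
    ("col_l", 2),    # lower column bits
    ("offset", 5),   # byte offset inside one access
]


def decode(addr, fields=FIELDS):
    """Split a physical address into named DRAM fields."""
    total = sum(width for _, width in fields)
    if not 0 <= addr < (1 << total):
        raise ValueError("address does not fit the configured field widths")
    out = {}
    shift = total
    for name, width in fields:
        shift -= width
        out[name] = (addr >> shift) & ((1 << width) - 1)
    return out


def encode(values, fields=FIELDS):
    """Inverse of decode: pack named DRAM fields into an address."""
    addr = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        addr = (addr << width) | value
    return addr


print(decode(1 << 7))
```

In this sketch, placing the low column bits (Col L) below the channel, rank, and bank bits keeps a short run of consecutive addresses in one row of one bank before the next run rotates to another channel. Changing the order or widths in `FIELDS` is what a configurable mapping would amount to in this model.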

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

