HE-PIM: Demystifying Homomorphic Operations on a Real-world Processing-in-Memory System
Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3
The pith
Processing-in-memory systems become competitive with CPUs and GPUs for homomorphic encryption when equipped with native modular multiplication and efficient data movement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that HE operations on real-world PIM expose two primary bottlenecks: the lack of native 64-bit modular integer multiplication, which is the dominant compute limit, and limited per-bank memory capacity, which forces costly inter-bank data movement. The paper argues that PIM nonetheless remains a viable alternative to state-of-the-art CPU and GPU systems once native modular multiplication and efficient inter-PIM data movement are supplied.
What carries the argument
Implementation and bottleneck characterization of a full set of HE kernels on the UPMEM PIM system, isolating performance limits to the absence of native modular multiplication and constrained bank capacity.
If this is right
- HE applications expose distinct bottlenecks across execution stages, with some kernels compute-bound on modular arithmetic and others memory-bound on large ciphertexts.
- The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication.
- Limited per-bank memory capacity forces frequent inter-bank data movement for HE ciphertexts and auxiliary metadata.
- PIM hardware becomes competitive with CPUs and GPUs for HE once native modular multiplication and efficient inter-PIM data movement are added.
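To make the modular-multiplication bottleneck concrete, here is a minimal sketch of how a core without a native wide modular multiply could compute (a * b) mod q using only additions, shifts, and comparisons. It is a generic shift-and-add illustration, not the paper's kernel and not a description of the UPMEM instruction set. The function name `mulmod_shift_add` and the modulus used in the example are assumptions chosen for illustration.

```python
def mulmod_shift_add(a: int, b: int, q: int) -> tuple:
    """Return ((a * b) mod q, loop iterations) using add/shift/compare only.

    Hypothetical helper for illustration. The bound q < 2**63 mimics a
    machine whose registers are 64 bits wide: every intermediate value
    stays below 2 * q, so it would fit in one 64-bit register.
    """
    assert 0 < q < (1 << 63)
    a %= q
    b %= q
    result = 0
    steps = 0
    while b:
        if b & 1:              # current bit of b is set: accumulate a
            result += a
            if result >= q:    # one conditional subtraction keeps result < q
                result -= q
        a <<= 1                # double a for the next bit position
        if a >= q:
            a -= q
        b >>= 1
        steps += 1
    return result, steps


if __name__ == "__main__":
    q = (1 << 61) - 1          # assumed example modulus, not from the paper
    a, b = 0x123456789ABCDEF, 0xFEDCBA987654321
    value, steps = mulmod_shift_add(a, b, q)
    print(value == (a * b) % q, steps)
```

The loop runs once per bit of the second operand, so a single product of two roughly 60-bit values costs on the order of 60 iterations of adds and compares. A native instruction would replace that whole loop, which is the gap the bullet list above points to.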
Where Pith is reading between the lines
- Hardware designers should consider adding modular multiplication units directly to PIM cores to accelerate cryptographic workloads.
- Better inter-bank or inter-PIM communication could enable larger-scale encrypted computations without repeated data shuffling.
- The same bottleneck analysis could guide PIM adoption for other memory-intensive secure computation primitives beyond HE.
Load-bearing premise
The UPMEM PIM system and the chosen HE kernels are representative of future general-purpose PIM hardware and of the workloads that will actually be deployed in encrypted databases and machine learning.
What would settle it
Running the same HE kernels on a PIM system that includes native 64-bit modular multiplication hardware and measuring whether its end-to-end performance exceeds current CPU and GPU baselines would confirm or refute the viability claim.
Original abstract
Homomorphic encryption (HE) enables computation over encrypted data, offering strong privacy guarantees for untrusted computing environments. Practical adoption remains limited by high computational complexity, large ciphertext sizes, and substantial data movement. Processor-centric architectures (CPUs, GPUs, ASICs) hit fundamental bottlenecks on HE workloads because ciphertexts are large, data locality is low, and primitives such as relinearization and bootstrapping repeatedly access large auxiliary metadata. Processing-In-Memory (PIM) is a promising way to mitigate these bottlenecks because it computes near or inside memory. Prior PIM proposals for HE either do not target real-world PIM systems or cover only a narrow set of operations. We comprehensively characterize HE operations on a real-world, general-purpose PIM system. We implement a complete set of HE kernels used by emerging applications (databases, machine learning) on the UPMEM PIM system, evaluate performance and scalability, compare against CPU and GPU baselines, and discuss implications for future PIM hardware. Our results demonstrate four major findings. (1) HE-based applications expose distinct bottlenecks across execution stages: some kernels are compute-bound due to modular arithmetic, while others are memory-bound due to large ciphertexts and intermediate data. These bottlenecks are exacerbated by limited per-core compute and per-bank capacity, which force frequent data movement. (2) The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication, a key HE primitive. (3) Limited per-bank memory capacity is the second major bottleneck, since HE ciphertexts and auxiliary metadata do not fit and require inter-bank movement. (4) Despite these limits, PIM can be a viable alternative to state-of-the-art CPU and GPU systems for HE when equipped with native modular multiplication and efficient inter-PIM data movement.
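The capacity argument in finding (3) can be sketched with simple arithmetic. The page does not list the HE parameter sets used in the paper, so every number below (ring degree, limb count, bank capacity, the size of the working set) is a hypothetical value chosen only to show the calculation. The helper names are also assumptions.

```python
def ciphertext_bytes(ring_degree: int, num_limbs: int,
                     polys: int = 2, bytes_per_coeff: int = 8) -> int:
    """Size of one ciphertext stored as `polys` polynomials, each with
    `ring_degree` coefficients per limb and `num_limbs` 64-bit limbs."""
    return polys * ring_degree * num_limbs * bytes_per_coeff


def banks_needed(total_bytes: int, bank_bytes: int) -> int:
    """Smallest number of banks whose combined capacity holds total_bytes."""
    return -(-total_bytes // bank_bytes)   # ceiling division


if __name__ == "__main__":
    # All values below are assumptions for illustration, not from the paper.
    ring_degree = 2 ** 16
    num_limbs = 30
    bank_bytes = 64 * 2 ** 20              # assumed per-bank capacity

    ct = ciphertext_bytes(ring_degree, num_limbs)
    working_set = 3 * ct                   # e.g. two inputs plus one output
    print(ct, banks_needed(working_set, bank_bytes))
```

Under these assumed values a single ciphertext occupies tens of megabytes, so a working set of a few ciphertexts already spans more than one bank before any auxiliary key material is counted. Once data is spread across banks, operations that combine it need inter-bank movement.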
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper implements a complete set of HE kernels on the UPMEM real-world PIM system, measures their performance and scalability, compares them to CPU and GPU baselines, identifies compute-bound modular arithmetic and memory-capacity-driven inter-bank movement as primary bottlenecks, and concludes that PIM becomes a viable alternative to CPU/GPU systems for HE once equipped with native 64-bit modular multiplication and efficient inter-PIM data movement.
Significance. The work supplies the first comprehensive, hardware-measured characterization of HE workloads on a commercially available general-purpose PIM platform. Direct performance numbers, bottleneck breakdowns, and cross-architecture comparisons provide concrete data that prior simulation-only or narrow-operation PIM-HE studies lacked, informing future PIM hardware features for encrypted computation.
major comments (1)
- [Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.
minor comments (2)
- [Section 3] The manuscript would benefit from an explicit statement of the exact HE parameter sets (N, q, etc.) used for each kernel and a table summarizing them.
- [Figures 4-7] Figure captions and axis labels should explicitly state whether reported times include or exclude data movement between host and PIM banks.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work's significance and for the constructive feedback on the abstract and conclusion. We address the major comment below.
- Referee: [Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.
  Authors: We agree that the original claim would be strengthened by a quantitative projection. In the revised manuscript we have added a new subsection (Discussion 6.3) containing a back-of-envelope model. Using measured per-kernel cycle counts and utilization rates from our UPMEM experiments, the model assumes 1-cycle native 64-bit modular multiplication and 4x inter-bank bandwidth. It projects 2.8–4.1x speedup for compute-bound kernels and 1.7–2.3x for memory-bound kernels relative to current UPMEM, while explicitly noting that per-core throughput and bank-level parallelism remain secondary constraints that could cap further gains. We have updated the abstract and conclusion to describe PIM as “potentially competitive” once the two primary bottlenecks are removed, rather than stating it as a direct conclusion from the unmodified hardware results.
  Revision: yes
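The kind of back-of-envelope projection discussed in this exchange can be sketched with an Amdahl-style formula. This is not the model from the authors' Discussion 6.3, whose details are not given on this page. The time fractions and gain factors below are hypothetical inputs, and the function names are assumptions.

```python
def projected_speedup(frac_modmul: float, frac_move: float,
                      modmul_gain: float, move_gain: float) -> float:
    """Amdahl-style projection.

    frac_modmul: share of baseline time spent in modular multiplication
    frac_move:   share of baseline time spent in inter-bank data movement
    The remaining share (1 - frac_modmul - frac_move) is left unchanged.
    """
    assert frac_modmul >= 0 and frac_move >= 0
    assert frac_modmul + frac_move <= 1
    assert modmul_gain > 0 and move_gain > 0
    rest = 1.0 - frac_modmul - frac_move
    return 1.0 / (rest + frac_modmul / modmul_gain + frac_move / move_gain)


def speedup_ceiling(frac_modmul: float, frac_move: float) -> float:
    """Upper bound when both accelerated parts become free; the untouched
    remainder of the kernel caps the achievable speedup."""
    rest = 1.0 - frac_modmul - frac_move
    return float("inf") if rest == 0 else 1.0 / rest


if __name__ == "__main__":
    # Hypothetical compute-bound kernel: 70% mod-mul, 10% movement.
    print(projected_speedup(0.7, 0.1, modmul_gain=20.0, move_gain=4.0))
    print(speedup_ceiling(0.7, 0.1))
```

The ceiling function shows why the secondary constraints the referee raises matter: however fast modular multiplication and inter-bank transfers become, the untouched portion of each kernel bounds the overall gain.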
Circularity Check
No circularity: claims rest on direct hardware measurements
The paper implements and measures a complete set of HE kernels on unmodified UPMEM PIM hardware, reports raw performance and scalability numbers, identifies compute-bound vs. memory-bound stages from those timings, and compares against CPU/GPU baselines. The viability statement is a qualitative inference drawn from the observed bottlenecks (lack of native 64-bit mod-mul and per-bank capacity limits). No equations, fitted parameters, or self-referential definitions appear; no derivation reduces to its own inputs by construction. All load-bearing evidence is externally falsifiable via replication on the same hardware.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: HE kernels require repeated 64-bit modular integer multiplication and large ciphertext storage
Forward citations
Cited by 1 Pith paper
- Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
  Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.