HE-PIM: Demystifying Homomorphic Operations on a Real-world Processing-in-Memory System

Antonio J. Peña; Harshita Gupta; Jaewoo Park; Juan Gómez-Luna; Konstantinos Kanellopoulos; Mayank Kabra; Mohammad Sadrosadati; Nisa Bostancı; Onur Mutlu; Phillip Widdowson

arxiv: 2605.12841 · v1 · pith:Z73BFPAMnew · submitted 2026-05-13 · 💻 cs.CR

Harshita Gupta, Mayank Kabra, Jaewoo Park, Priyam Mehta, Phillip Widdowson, Tathagata Barik, Nisa Bostancı, Konstantinos Kanellopoulos, Juan Gómez-Luna, Antonio J. Peña, Mohammad Sadrosadati, Onur Mutlu

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

keywords: homomorphic encryption · processing-in-memory · PIM · UPMEM · modular multiplication · performance bottlenecks · privacy-preserving computation · encrypted databases

The pith

Processing-in-memory systems become competitive with CPUs and GPUs for homomorphic encryption when equipped with native modular multiplication and efficient data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper implements a complete set of homomorphic encryption kernels on the real UPMEM PIM hardware and measures their performance across execution stages. It finds that HE workloads split into compute-bound kernels limited by modular arithmetic and memory-bound kernels limited by small per-bank capacity that forces frequent inter-bank transfers. The authors conclude that despite current limits, PIM hardware can serve as a practical alternative to conventional processors for privacy-preserving workloads in databases and machine learning once future designs add native 64-bit modular multiplication support.

Core claim

The central claim is that HE operations on real-world PIM expose two primary bottlenecks—lack of native 64-bit modular integer multiplication as the dominant compute limit and limited per-bank memory capacity that requires costly inter-bank movement—yet PIM remains a viable alternative to state-of-the-art CPU and GPU systems once those two features are supplied.

What carries the argument

Implementation and bottleneck characterization of a full set of HE kernels on the UPMEM PIM system, isolating performance limits to the absence of native modular multiplication and constrained bank capacity.

If this is right

HE applications expose distinct bottlenecks across execution stages, with some kernels compute-bound on modular arithmetic and others memory-bound on large ciphertexts.
The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication.
Limited per-bank memory capacity forces frequent inter-bank data movement for HE ciphertexts and auxiliary metadata.
PIM hardware becomes competitive with CPUs and GPUs for HE once native modular multiplication and efficient inter-PIM data movement are added.
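
To make the compute bottleneck concrete, the following is a minimal sketch (not the authors' implementation) of how a 64-bit modular multiplication can be assembled when only narrower multiplies are available. It assumes 32-bit multiplies as the building block. The names `mul64_wide` and `mod_mul` are hypothetical, and the final reduction uses Python's `%` as a stand-in for the Barrett or Montgomery reduction a real kernel would likely use.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul64_wide(a, b):
    """Return (hi, lo) halves of the 128-bit product of two 64-bit values.

    Only 32x32 -> 64-bit multiplies are used, mimicking hardware that
    lacks a native 64-bit multiplier (an assumption for illustration).
    """
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four partial products of the schoolbook multiplication.
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi

    # Middle column: upper half of p0 plus the lower halves of p1 and p2.
    mid = (p0 >> 32) + (p1 & MASK32) + (p2 & MASK32)

    lo = (p0 & MASK32) | ((mid & MASK32) << 32)
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)
    return hi & MASK64, lo


def mod_mul(a, b, q):
    """Compute (a * b) mod q from the emulated wide product."""
    hi, lo = mul64_wide(a, b)
    # Stand-in reduction; a real kernel would avoid a 128-bit division.
    return ((hi << 64) | lo) % q
```

Each modular multiplication in this sketch costs four narrower multiplies plus carry handling before the reduction even starts, which is the kind of overhead a native instruction would remove.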

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware designers should consider adding modular multiplication units directly to PIM cores to accelerate cryptographic workloads.
Better inter-bank or inter-PIM communication could enable larger-scale encrypted computations without repeated data shuffling.
The same bottleneck analysis could guide PIM adoption for other memory-intensive secure computation primitives beyond HE.

Load-bearing premise

The UPMEM PIM system and the chosen HE kernels are representative of future general-purpose PIM hardware and of the workloads that will actually be deployed in encrypted databases and machine learning.

What would settle it

Running the same HE kernels on a PIM system that includes native 64-bit modular multiplication hardware and measuring whether its end-to-end performance exceeds current CPU and GPU baselines would confirm or refute the viability claim.

Figures

Figures reproduced from arXiv: 2605.12841 by Antonio J. Peña, Harshita Gupta, Jaewoo Park, Juan Gómez-Luna, Konstantinos Kanellopoulos, Mayank Kabra, Mohammad Sadrosadati, Nisa Bostancı, Onur Mutlu, Phillip Widdowson, Priyam Mehta, Tathagata Barik.

**Figure 2.** High-level system organization of the UPMEM PIM

**Figure 3.** Overview of HE-based Neural Network Inference.

**Figure 4.** Dataflow and mapping of NTT on the PIM system for …

**Figure 5.** Data mapping of matrix multiplication on the PIM

**Figure 6.** Total execution time (in s) for the Conv layer on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. … of execution rounds remains constant (four rounds for 512 PIM cores and a single round for 2048 PIM cores) regardless of the chosen limb count. Obs #2. Increasing the number of PIM cores significantly reduces execution time across all ciphertext limb co…

**Figure 7.** Total execution time (in s) for the BConv subroutine on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. The execution time of matrix multiplication in the BConv subroutine is constant because the twiddle-factor matrix size depends only on the polynomial degree (N) and not on the limb count. Hence, the number of PIM execution rounds remains the same. …

**Figure 8.** Total execution time (in s) for the EvalMod operation on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. Obs #12. For a fixed number of PIM cores, increasing the number of limbs increases the execution time of EvalMod. For example, on a 512-core PIM system, the total execution time for the EvalMod operation increases from 100.7 s at a limb count of …

**Figure 9.** Total execution time (in s) for all operations/subrou…

**Figure 10.** Execution time (in s) for end-to-end neural networks.

**Figure 11.** Total execution time on three PIM designs compared …

Original abstract

Homomorphic encryption (HE) enables computation over encrypted data, offering strong privacy guarantees for untrusted computing environments. Practical adoption remains limited by high computational complexity, large ciphertext sizes, and substantial data movement. Processor-centric architectures (CPUs, GPUs, ASICs) hit fundamental bottlenecks on HE workloads because ciphertexts are large, data locality is low, and primitives such as relinearization and bootstrapping repeatedly access large auxiliary metadata. Processing-In-Memory (PIM) is a promising mitigation because it computes near or inside memory. Prior PIM proposals for HE either do not target real-world PIM systems or cover only a narrow set of operations. We comprehensively characterize HE operations on a real-world, general-purpose PIM system. We implement a complete set of HE kernels used by emerging applications (databases, machine learning) on the UPMEM PIM system, evaluate performance and scalability, compare against CPU and GPU baselines, and discuss implications for future PIM hardware. Our results demonstrate four major findings. (1) HE-based applications expose distinct bottlenecks across execution stages: some kernels are compute-bound due to modular arithmetic, while others are memory-bound due to large ciphertexts and intermediate data. These bottlenecks are exacerbated by limited per-core compute and per-bank capacity, which force frequent data movement. (2) The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication, a key HE primitive. (3) Limited per-bank memory capacity is the second major bottleneck, since HE ciphertexts and auxiliary metadata do not fit and require inter-bank movement. (4) Despite these limits, PIM can be a viable alternative to state-of-the-art CPU and GPU systems for HE when equipped with native modular multiplication and efficient inter-PIM data movement.
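
A back-of-envelope footprint calculation illustrates finding (3). The sketch below assumes an RNS-style ciphertext made of two polynomials with 8-byte coefficients. The polynomial degree, limb count, and 64 MiB bank size used in the example are illustrative assumptions, not parameters taken from the paper, and the function names are hypothetical.

```python
def ciphertext_bytes(poly_degree, limbs, polys=2, bytes_per_coeff=8):
    """Size of one ciphertext: polys x limbs x degree coefficients."""
    return polys * limbs * poly_degree * bytes_per_coeff


def ciphertexts_per_bank(poly_degree, limbs, bank_bytes):
    """How many whole ciphertexts fit in a single memory bank."""
    return bank_bytes // ciphertext_bytes(poly_degree, limbs)


# Illustrative (assumed) parameters: N = 2^16, 30 limbs, 64 MiB per bank.
N = 1 << 16
LIMBS = 30
BANK_BYTES = 64 << 20

size_mib = ciphertext_bytes(N, LIMBS) / (1 << 20)
print(f"one ciphertext: {size_mib:.0f} MiB")
print(f"ciphertexts per bank: {ciphertexts_per_bank(N, LIMBS, BANK_BYTES)}")
```

Under these assumptions a bank holds only a couple of ciphertexts, leaving little room for the operands of a binary operation together with auxiliary key material, so data has to move between banks.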

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Real hardware measurements of HE on PIM are the strength here, but the viability argument for modified hardware lacks supporting projections.

This paper provides the first detailed performance breakdown of a full homomorphic encryption kernel set on commercial UPMEM PIM hardware. The work is new because prior efforts either stayed in simulation or looked at only a few operations. Here the authors port a complete suite of kernels used in databases and machine learning to the real UPMEM system. They measure performance and scalability, run direct comparisons to CPU and GPU, and identify where time is spent. The data shows that some stages are compute-bound on modular arithmetic while others are limited by memory capacity and the resulting data movement between banks.

This is done well. The performance numbers come from actual hardware runs rather than models, and the baselines are straightforward. The identification of the two dominant bottlenecks (lack of native 64-bit modular multiply and limited per-bank capacity) follows directly from the measurements.

The main soft spot is the forward-looking claim. The paper argues that PIM could compete with CPU and GPU once native modular multiplication and efficient inter-PIM movement are added. But the results all come from the current unmodified hardware. There is no projection, cycle estimate, or simulation to show what the speedups would be after those changes or whether other limits would appear. That part rests on the assumption that fixing those two things is sufficient.

This paper is aimed at systems and architecture researchers interested in near-memory computing for cryptographic workloads. Anyone thinking about hardware support for homomorphic encryption will find the bottleneck analysis practical and actionable. It is not a theoretical advance but a useful empirical study. I would send it for peer review. The empirical work is strong enough to benefit from referee feedback, particularly on how to strengthen the discussion of future hardware requirements.

Referee Report

1 major / 2 minor

Summary. The paper implements a complete set of HE kernels on the UPMEM real-world PIM system, measures their performance and scalability, compares them to CPU and GPU baselines, identifies compute-bound modular arithmetic and memory-capacity-driven inter-bank movement as primary bottlenecks, and concludes that PIM becomes a viable alternative to CPU/GPU systems for HE once equipped with native 64-bit modular multiplication and efficient inter-PIM data movement.

Significance. The work supplies the first comprehensive, hardware-measured characterization of HE workloads on a commercially available general-purpose PIM platform. Direct performance numbers, bottleneck breakdowns, and cross-architecture comparisons provide concrete data that prior simulation-only or narrow-operation PIM-HE studies lacked, informing future PIM hardware features for encrypted computation.

major comments (1)

[Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.

minor comments (2)

[Section 3] The manuscript would benefit from an explicit statement of the exact HE parameter sets (N, q, etc.) used for each kernel and a table summarizing them.
[Figures 4-7] Figure captions and axis labels should explicitly state whether reported times include or exclude data movement between host and PIM banks.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive feedback on the abstract and conclusion. We address the major comment below.

Referee: [Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.

Authors: We agree that the original claim would be strengthened by a quantitative projection. In the revised manuscript we have added a new subsection (Discussion 6.3) containing a back-of-envelope model. Using measured per-kernel cycle counts and utilization rates from our UPMEM experiments, the model assumes 1-cycle native 64-bit modular multiplication and 4x inter-bank bandwidth. It projects 2.8–4.1x speedup for compute-bound kernels and 1.7–2.3x for memory-bound kernels relative to current UPMEM, while explicitly noting that per-core throughput and bank-level parallelism remain secondary constraints that could cap further gains. We have updated the abstract and conclusion to describe PIM as “potentially competitive” once the two primary bottlenecks are removed, rather than stating it as a direct conclusion from the unmodified hardware results.

Revision: yes
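
The general shape of a projection like this can be illustrated with an Amdahl-style estimate. This is a sketch under assumed inputs, not the model described in the rebuttal: the time fractions in the example are hypothetical placeholders rather than measured values, and `projected_speedup` and `speedup_ceiling` are illustrative names.

```python
def projected_speedup(frac_modmul, frac_transfer, modmul_gain, transfer_gain):
    """Amdahl-style kernel speedup when two components are accelerated.

    frac_modmul / frac_transfer: shares of baseline time spent in modular
    multiplication and in inter-bank transfers (hypothetical inputs).
    modmul_gain / transfer_gain: how much faster each component becomes.
    """
    if frac_modmul < 0 or frac_transfer < 0 or frac_modmul + frac_transfer > 1:
        raise ValueError("time fractions must be non-negative and sum to <= 1")
    if modmul_gain <= 0 or transfer_gain <= 0:
        raise ValueError("gains must be positive")
    untouched = 1.0 - frac_modmul - frac_transfer
    new_time = untouched + frac_modmul / modmul_gain + frac_transfer / transfer_gain
    return 1.0 / new_time


def speedup_ceiling(frac_modmul, frac_transfer):
    """Upper bound on speedup if both components became free."""
    untouched = 1.0 - frac_modmul - frac_transfer
    return float("inf") if untouched <= 0 else 1.0 / untouched


# Hypothetical kernel: 60% modular multiplication, 20% inter-bank transfers.
print(projected_speedup(0.6, 0.2, modmul_gain=10.0, transfer_gain=4.0))
print(speedup_ceiling(0.6, 0.2))
```

The ceiling makes the referee's concern explicit: whatever time is spent outside the two accelerated components bounds the achievable gain, no matter how fast modular multiplication and data movement become.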

Circularity Check

0 steps flagged

No circularity: claims rest on direct hardware measurements

The paper implements and measures a complete set of HE kernels on unmodified UPMEM PIM hardware, reports raw performance and scalability numbers, identifies compute-bound vs. memory-bound stages from those timings, and compares against CPU/GPU baselines. The viability statement is a qualitative inference drawn from the observed bottlenecks (lack of native 64-bit mod-mul and per-bank capacity limits). No equations, fitted parameters, or self-referential definitions appear; no derivation reduces to its own inputs by construction. All load-bearing evidence is externally falsifiable via replication on the same hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about HE primitives and the documented behavior of the UPMEM hardware; no free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (1)

Domain assumption: HE kernels require repeated 64-bit modular integer multiplication and large ciphertext storage.
Standard property of homomorphic encryption schemes used in the implemented kernels.
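
As a small illustration of why modular multiplication recurs so often, the sketch below runs a textbook iterative radix-2 number-theoretic transform (NTT) and counts its butterfly multiplications. It is a simplified cyclic NTT over a toy modulus, not the negacyclic variant or the 64-bit moduli that HE libraries typically use, and it is not the paper's PIM mapping. The function name `ntt` and the parameters q = 17, ω = 2 are assumptions chosen for readability.

```python
def ntt(values, q, omega):
    """Iterative radix-2 cyclic NTT modulo q.

    omega must be a primitive n-th root of unity mod q, with n = len(values)
    a power of two. Returns (transformed list, butterfly multiplication count).
    Twiddle-factor updates are not included in the count.
    """
    n = len(values)
    a = list(values)

    # Bit-reversal permutation so butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    muls = 0
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)  # root of unity for this stage
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q  # one modular multiplication
                muls += 1
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a, muls


# Toy parameters: 2 has multiplicative order 8 modulo 17.
result, muls = ntt([1, 2, 3, 4, 5, 6, 7, 8], q=17, omega=2)
print(result, muls)
```

The count grows as (n/2)·log2(n) per transform, and a transform of this kind is applied per limb and per polynomial, so the cost of each individual modular multiplication is multiplied many times over.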

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
cs.CR 2026-05 unverdicted novelty 5.0

Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.


S. Hsia, U. Gupta, M. Wilkening, C.-J. Wu, G.-Y. Wei, and D. Brooks, “Cross-Stack Workload Characterization of Deep Recommendation Systems, ” inIISWC, 2020

work page 2020
[74]

vbench: Benchmarking Video Transcoding in the Cloud,

A. Lottarini, A. Ramirez, J. Coburn, M. A. Kim, P. Ranganathan, D. Stodolsky, and M. Wachsler, “vbench: Benchmarking Video Transcoding in the Cloud, ” inASPLOS, 2018

work page 2018
[75]

Characterizing Job Microarchitectural Profiles at Scale: Dataset and Analysis,

K. Wang, Y. Li, C. Wang, T. Jia, K. Chow, Y. Wen, Y. Dou, G. Xu, C. Hou, J. Yaoet al., “Characterizing Job Microarchitectural Profiles at Scale: Dataset and Analysis, ” in ICPP, 2022

work page 2022
[76]

Missing the Forest for the Trees: End-to-End AI Application Performance in Edge Data Centers,

D. Richins, D. Doshi, M. Blackmore, A. T. Nair, N. Pathapati, A. Patel, B. Daguman, D. Dobrijalowski, R. Illikkal, K. Longet al., “Missing the Forest for the Trees: End-to-End AI Application Performance in Edge Data Centers, ” inHPCA, 2020

work page 2020
[77]

Reflections on the Memory Wall,

S. A. McKee, “Reflections on the Memory Wall, ” inCF, 2004

work page 2004
[78]

ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter- operation Key Reuse,

J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, “ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter- operation Key Reuse, ” inIEEE MICRO 2022

work page 2022
[79]

A Modern Primer on Processing In Memory,

O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, “A Modern Primer on Processing In Memory, ” inEmerging Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann, 2022

work page 2022
[80]

Memory-Centric Computing,

O. Mutlu, “Memory-Centric Computing, ” inDAC, 2023

work page 2023

