arxiv: 2605.12841 · v1 · pith:Z73BFPAMnew · submitted 2026-05-13 · 💻 cs.CR

HE-PIM: Demystifying Homomorphic Operations on a Real-world Processing-in-Memory System

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.CR
keywords homomorphic encryption · processing-in-memory · PIM · UPMEM · modular multiplication · performance bottlenecks · privacy-preserving computation · encrypted databases

The pith

Processing-in-memory systems become competitive with CPUs and GPUs for homomorphic encryption when equipped with native modular multiplication and efficient data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper implements a complete set of homomorphic encryption kernels on the real UPMEM PIM hardware and measures their performance across execution stages. It finds that HE workloads split into compute-bound kernels limited by modular arithmetic and memory-bound kernels limited by small per-bank capacity that forces frequent inter-bank transfers. The authors conclude that despite current limits, PIM hardware can serve as a practical alternative to conventional processors for privacy-preserving workloads in databases and machine learning once future designs add native 64-bit modular multiplication support.
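
The compute-bound versus memory-bound split described above is the distinction a roofline model draws (Figure 1 of the paper is a roofline plot). A minimal sketch of that classification follows. The peak throughput, bandwidth, kernel names, and arithmetic intensities are hypothetical placeholders, not values measured in the paper.

```python
def attainable_gops(arith_intensity, peak_gops, mem_bw_gbs):
    """Roofline bound: the lower of the compute roof and the bandwidth roof.

    arith_intensity is in operations per byte moved,
    peak_gops in Gop/s, mem_bw_gbs in GB/s.
    """
    return min(peak_gops, arith_intensity * mem_bw_gbs)


def classify(arith_intensity, peak_gops, mem_bw_gbs):
    """Label a kernel by which roof limits it."""
    ridge_point = peak_gops / mem_bw_gbs  # intensity where the two roofs meet
    if arith_intensity >= ridge_point:
        return "compute-bound"
    return "memory-bound"


# Hypothetical machine and kernels, for illustration only.
PEAK_GOPS = 50.0
MEM_BW_GBS = 100.0
kernels = {"kernel_a": 4.0, "kernel_b": 0.125}  # ops per byte

for name, intensity in kernels.items():
    bound = attainable_gops(intensity, PEAK_GOPS, MEM_BW_GBS)
    print(name, classify(intensity, PEAK_GOPS, MEM_BW_GBS), bound)
```

A kernel to the right of the ridge point gains from faster arithmetic (such as native modular multiplication), while a kernel to the left gains only from moving fewer bytes or moving them faster.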

Core claim

The central claim is that HE operations on real-world PIM expose two primary bottlenecks—lack of native 64-bit modular integer multiplication as the dominant compute limit and limited per-bank memory capacity that requires costly inter-bank movement—yet PIM remains a viable alternative to state-of-the-art CPU and GPU systems once those two features are supplied.

What carries the argument

Implementation and bottleneck characterization of a full set of HE kernels on the UPMEM PIM system, isolating performance limits to the absence of native modular multiplication and constrained bank capacity.

If this is right

  • HE applications expose distinct bottlenecks across execution stages, with some kernels compute-bound on modular arithmetic and others memory-bound on large ciphertexts.
  • The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication.
  • Limited per-bank memory capacity forces frequent inter-bank data movement for HE ciphertexts and auxiliary metadata.
  • PIM hardware becomes competitive with CPUs and GPUs for HE once native modular multiplication and efficient inter-PIM data movement are added.
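
To illustrate why the missing native 64-bit modular multiplication matters, the sketch below shows two generic software fallbacks: building a wide product from 32-bit partial products, and a shift-and-add modular multiply that needs no multiplier at all. This is a textbook illustration, not the paper's UPMEM implementation; the function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def mul64_from_32(a, b):
    """Full 128-bit product of two 64-bit values using four 32x32 multiplies."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    low = a_lo * b_lo
    mid = a_lo * b_hi + a_hi * b_lo  # the two cross terms
    high = a_hi * b_hi
    return low + (mid << 32) + (high << 64)


def mulmod_shift_add(a, b, q):
    """Compute (a * b) mod q with only shifts, adds, and compares.

    One loop iteration per bit of b, so up to 64 iterations for 64-bit
    operands. With q below 2**63, every intermediate value stays under 2**64.
    """
    a %= q
    b %= q
    result = 0
    while b:
        if b & 1:
            result += a
            if result >= q:
                result -= q
        a <<= 1  # double a, then reduce back below q
        if a >= q:
            a -= q
        b >>= 1
    return result


q = (1 << 61) - 1  # a Mersenne prime, used here only as an example modulus
x, y = 0x123456789ABCDEF, 0xFEDCBA987654321
assert mul64_from_32(x, y) == x * y
assert mulmod_shift_add(x, y, q) == (x * y) % q
```

Either route turns one logical operation into many instructions, which is the kind of overhead a native modular multiplier would remove.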

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware designers should consider adding modular multiplication units directly to PIM cores to accelerate cryptographic workloads.
  • Better inter-bank or inter-PIM communication could enable larger-scale encrypted computations without repeated data shuffling.
  • The same bottleneck analysis could guide PIM adoption for other memory-intensive secure computation primitives beyond HE.

Load-bearing premise

The UPMEM PIM system and the chosen HE kernels are representative of future general-purpose PIM hardware and of the workloads that will actually be deployed in encrypted databases and machine learning.

What would settle it

Running the same HE kernels on a PIM system that includes native 64-bit modular multiplication hardware and measuring whether its end-to-end performance exceeds current CPU and GPU baselines would confirm or refute the viability claim.

Figures

Figures reproduced from arXiv: 2605.12841 by Antonio J. Peña, Harshita Gupta, Jaewoo Park, Juan Gómez-Luna, Konstantinos Kanellopoulos, Mayank Kabra, Mohammad Sadrosadati, Nisa Bostancı, Onur Mutlu, Phillip Widdowson, Priyam Mehta, Tathagata Barik.

Figure 1: Roofline Model of the homomorphic operations
Figure 2: High-level system organization of the UPMEM PIM
Figure 3: Overview of HE-based Neural Network Inference.
Figure 4: Dataflow and mapping of NTT on the PIM system for…
Figure 5: Data mapping of matrix multiplication on the PIM
Figure 6: Total execution time (in s) for the Conv layer on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. … of execution rounds remains constant (four rounds for 512 PIM cores and a single round for 2048 PIM cores) regardless of the chosen limb count. Obs #2. Increasing the number of PIM cores significantly reduces execution time across all ciphertext limb co…
Figure 7: Total execution time (in s) for the BConv subroutine on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. The execution time of matrix multiplication in the BConv subroutine is constant because the twiddle-factor matrix size depends only on the polynomial degree (N) and not on the limb count. Hence, the number of PIM execution rounds remains the same…
Figure 8: Total execution time (in s) for the EvalMod operation on the UPMEM PIM system for (a) varying limb count and (b) varying image sizes compared with the CPU system. Obs #12. For a fixed number of PIM cores, increasing the number of limbs increases the execution time of EvalMod. For example, on a 512-core PIM system, the total execution time for the EvalMod operation increases from 100.7 s at a limb count of …
Figure 9: Total execution time (in s) for all operations/subrou…
Figure 10: Execution time (in s) for end-to-end neural networks.
Figure 11: Total execution time on three PIM designs compared…
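
The text captured with Figure 6 reports four execution rounds on 512 PIM cores and a single round on 2048 PIM cores. That pattern is what a ceiling division of independent work items over the available cores would give. The sketch below assumes 2048 work items, a number chosen only because it is consistent with the excerpt; the paper's actual work partitioning is not stated here.

```python
def execution_rounds(work_items, pim_cores):
    """Rounds needed when each core handles one work item per round."""
    if pim_cores <= 0:
        raise ValueError("pim_cores must be positive")
    return -(-work_items // pim_cores)  # integer ceiling division


WORK_ITEMS = 2048  # assumed, not taken from the paper

for cores in (512, 2048):
    print(cores, "cores:", execution_rounds(WORK_ITEMS, cores), "round(s)")
```

Under this model the round count depends only on the number of work items and cores, which matches the excerpt's statement that it stays constant regardless of limb count.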
Original abstract

Homomorphic encryption (HE) enables computation over encrypted data, offering strong privacy guarantees for untrusted computing environments. Practical adoption remains limited by high computational complexity, large ciphertext sizes, and substantial data movement. Processor-centric architectures (CPUs, GPUs, ASICs) hit fundamental bottlenecks on HE workloads because ciphertexts are large, data locality is low, and primitives such as relinearization and bootstrapping repeatedly access large auxiliary metadata. Processing-in-memory (PIM) is a promising mitigation because it computes near or inside memory. Prior PIM proposals for HE either do not target real-world PIM systems or cover only a narrow set of operations. We comprehensively characterize HE operations on a real-world, general-purpose PIM system. We implement a complete set of HE kernels used by emerging applications (databases, machine learning) on the UPMEM PIM system, evaluate performance and scalability, compare against CPU and GPU baselines, and discuss implications for future PIM hardware. Our results demonstrate four major findings. (1) HE-based applications expose distinct bottlenecks across execution stages: some kernels are compute-bound due to modular arithmetic, while others are memory-bound due to large ciphertexts and intermediate data. These bottlenecks are exacerbated by limited per-core compute and per-bank capacity, which force frequent data movement. (2) The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication, a key HE primitive. (3) Limited per-bank memory capacity is the second major bottleneck, since HE ciphertexts and auxiliary metadata do not fit and require inter-bank movement. (4) Despite these limits, PIM can be a viable alternative to state-of-the-art CPU and GPU systems for HE when equipped with native modular multiplication and efficient inter-PIM data movement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper implements a complete set of HE kernels on the UPMEM real-world PIM system, measures their performance and scalability, compares them to CPU and GPU baselines, identifies compute-bound modular arithmetic and memory-capacity-driven inter-bank movement as primary bottlenecks, and concludes that PIM becomes a viable alternative to CPU/GPU systems for HE once equipped with native 64-bit modular multiplication and efficient inter-PIM data movement.

Significance. The work supplies the first comprehensive, hardware-measured characterization of HE workloads on a commercially available general-purpose PIM platform. Direct performance numbers, bottleneck breakdowns, and cross-architecture comparisons provide concrete data that prior simulation-only or narrow-operation PIM-HE studies lacked, informing future PIM hardware features for encrypted computation.

major comments (1)
  1. [Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.
minor comments (2)
  1. [Section 3] The manuscript would benefit from an explicit statement of the exact HE parameter sets (N, q, etc.) used for each kernel and a table summarizing them.
  2. [Figures 4-7] Figure captions and axis labels should explicitly state whether reported times include or exclude data movement between host and PIM banks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive feedback on the abstract and conclusion. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract and conclusion: the central claim that PIM 'can be a viable alternative ... when equipped with native modular multiplication and efficient inter-PIM data movement' rests on an untested inference. All reported timings, bottleneck analysis, and comparisons are obtained from unmodified UPMEM hardware; no analytical model, cycle-accurate simulation, or back-of-envelope projection quantifies the expected speedup once native 64-bit mod-mul (assumed 1-cycle) and higher inter-bank bandwidth are added. This leaves open whether other UPMEM constraints (per-core throughput, bank-level parallelism) would remain dominant.

    Authors: We agree that the original claim would be strengthened by a quantitative projection. In the revised manuscript we have added a new subsection (Discussion 6.3) containing a back-of-envelope model. Using measured per-kernel cycle counts and utilization rates from our UPMEM experiments, the model assumes 1-cycle native 64-bit modular multiplication and 4x inter-bank bandwidth. It projects 2.8–4.1x speedup for compute-bound kernels and 1.7–2.3x for memory-bound kernels relative to current UPMEM, while explicitly noting that per-core throughput and bank-level parallelism remain secondary constraints that could cap further gains. We have updated the abstract and conclusion to describe PIM as “potentially competitive” once the two primary bottlenecks are removed, rather than stating it as a direct conclusion from the unmodified hardware results. revision: yes
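
The rebuttal does not spell out the form of its back-of-envelope model. One common way to build such a projection is an Amdahl-style calculation, sketched below. The time fractions and the modular-multiplication gain are hypothetical inputs, not the authors' measured cycle counts, and this is not presented as the model from Discussion 6.3.

```python
def projected_speedup(frac_modmul, modmul_gain, frac_transfer, transfer_gain):
    """Amdahl-style projection when two parts of the runtime are accelerated.

    frac_modmul / frac_transfer: share of baseline time spent in modular
    multiplication and in inter-bank transfers (both hypothetical inputs).
    modmul_gain / transfer_gain: assumed speedup of each accelerated part.
    """
    frac_other = 1.0 - frac_modmul - frac_transfer
    if frac_other < 0:
        raise ValueError("fractions must sum to at most 1")
    new_time = (frac_other
                + frac_modmul / modmul_gain
                + frac_transfer / transfer_gain)
    return 1.0 / new_time


def speedup_ceiling(frac_modmul, frac_transfer):
    """Upper bound if both accelerated parts became free."""
    frac_other = 1.0 - frac_modmul - frac_transfer
    return float("inf") if frac_other == 0 else 1.0 / frac_other


# Hypothetical example: half the time in modmul, a quarter in transfers.
print(projected_speedup(0.5, 10.0, 0.25, 4.0))
print(speedup_ceiling(0.5, 0.25))
```

The ceiling function makes the rebuttal's caveat concrete: whatever share of time goes to other constraints, such as per-core throughput, bounds the gain no matter how fast the two targeted parts become.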

Circularity Check

0 steps flagged

No circularity: claims rest on direct hardware measurements

Full rationale

The paper implements and measures a complete set of HE kernels on unmodified UPMEM PIM hardware, reports raw performance and scalability numbers, identifies compute-bound vs. memory-bound stages from those timings, and compares against CPU/GPU baselines. The viability statement is a qualitative inference drawn from the observed bottlenecks (lack of native 64-bit mod-mul and per-bank capacity limits). No equations, fitted parameters, or self-referential definitions appear; no derivation reduces to its own inputs by construction. All load-bearing evidence is externally falsifiable via replication on the same hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about HE primitives and the documented behavior of the UPMEM hardware; no free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (1)
  • domain assumption: HE kernels require repeated 64-bit modular integer multiplication and large ciphertext storage
    Standard property of homomorphic encryption schemes used in the implemented kernels.
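
A rough size estimate shows why ciphertext storage strains per-bank capacity. The sketch below assumes a ciphertext of two polynomials stored as 64-bit residues, one per coefficient per limb, and a bank capacity of 64 MB, the MRAM size commonly documented for a UPMEM DPU. Neither the capacity nor the example parameters (N = 2^16, 30 limbs) are stated on this page, so treat them as assumptions.

```python
BANK_BYTES = 64 * 1024 * 1024  # assumed per-bank capacity (64 MB)


def ciphertext_bytes(poly_degree, limbs, polys=2, word_bytes=8):
    """Footprint of one ciphertext: polys x N coefficients x limbs x word size."""
    return polys * poly_degree * limbs * word_bytes


def banks_needed(total_bytes, bank_bytes=BANK_BYTES):
    """Minimum number of banks required to hold a working set."""
    return -(-total_bytes // bank_bytes)  # integer ceiling division


# Hypothetical parameters, for illustration only.
one_ct = ciphertext_bytes(poly_degree=2**16, limbs=30)
print(one_ct / 2**20, "MiB per ciphertext")
print(banks_needed(3 * one_ct), "banks for a three-ciphertext working set")
```

Under these assumptions a single ciphertext takes about half a bank, so any operation touching several ciphertexts plus auxiliary keys spills across banks and triggers the inter-bank movement the review describes.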

pith-pipeline@v0.9.0 · 5680 in / 1202 out tokens · 64134 ms · 2026-05-14T19:21:05.219355+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

    cs.CR 2026-05 unverdicted novelty 5.0

    Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.
