
arxiv: 2501.07047 · v4 · submitted 2025-01-13 · cs.CR · cs.AR · cs.CL · cs.PL

Leveraging ASIC AI Chips for Homomorphic Encryption

Pith reviewed 2026-05-23 06:01 UTC · model grok-4.3

classification: cs.CR, cs.AR, cs.CL, cs.PL
keywords: homomorphic encryption, AI accelerators, TPU, compiler optimization, energy efficiency, NTT, matrix multiplication, privacy-preserving computation

The pith

Compiler framework turns AI ASICs into efficient platforms for homomorphic encryption by mapping modular arithmetic to low-precision matrix multiplications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the energy gap between GPU-based HE acceleration and specialized HE ASICs by repurposing existing AI accelerators such as TPUs. It identifies two mismatches that waste TPU resources: reliance on 32-bit integer arithmetic that bypasses the high-throughput INT8 matrix engine, and fine-grained permutations that clash with coarse memory hardware. CROSS addresses these with two offline transformations that realign the workload to the TPU architecture. The result is higher throughput per watt on NTT and HE operators than several established GPU libraries, positioning AI ASICs as the leading energy-efficient platform for HE.

Core claim

CROSS is a compiler that applies Basis-Aligned Transformation to convert high-precision modular arithmetic into dense INT8 matrix multiplications that utilize the TPU MXU, and Memory-Aligned Transformation to fold data reordering into the kernels offline; on TPU v6e this yields higher throughput per watt for NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar.

What carries the argument

Basis-Aligned Transformation (BAT), which rewrites high-precision modular arithmetic as low-precision (INT8) matrix multiplications that preserve exact correctness and security.
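
One way to make this concrete, as a minimal sketch rather than the paper's implementation: assume the multiplier b is known ahead of time, the modulus q is below 2^32, and operands are split into unsigned 8-bit limbs. Writing a as the sum of a_i * 2^(8i), the product a*b mod q equals the sum of a_i * c_i mod q, where c_i = (b * 2^(8i)) mod q is precomputed offline. Splitting each c_i into bytes turns the runtime work into a byte-vector times byte-matrix product followed by one final reduction. The names `to_limbs`, `precompute_matrix`, and `mod_mul_via_bytes` are hypothetical and chosen only for illustration.

```python
LIMB_BITS = 8
LIMBS = 4  # assumption: 32-bit operands split into four unsigned 8-bit limbs


def to_limbs(x, limbs=LIMBS):
    """Little-endian 8-bit limbs of a non-negative integer below 2^(8*limbs)."""
    return [(x >> (LIMB_BITS * i)) & 0xFF for i in range(limbs)]


def precompute_matrix(b, q):
    """Offline step: row i holds the byte limbs of (b * 2^(8i)) mod q."""
    return [to_limbs((b << (LIMB_BITS * i)) % q) for i in range(LIMBS)]


def mod_mul_via_bytes(a, matrix, q):
    """Runtime step: byte-vector times byte-matrix, then recombine and reduce."""
    a_limbs = to_limbs(a)
    # Each column sum uses only 8-bit by 8-bit products, accumulated in a wider integer.
    cols = [sum(a_limbs[i] * matrix[i][j] for i in range(LIMBS)) for j in range(LIMBS)]
    # Recombine the column sums with their byte weights and reduce once.
    combined = sum(c << (LIMB_BITS * j) for j, c in enumerate(cols))
    return combined % q


if __name__ == "__main__":
    q = 2**31 - 1  # an illustrative prime modulus, not taken from the paper
    a, b = 123456789, 987654321
    table = precompute_matrix(b, q)
    print(mod_mul_via_bytes(a, table, q) == (a * b) % q)
```

In this sketch the equivalence with ordinary modular multiplication follows directly from the limb expansion, so it can be checked exhaustively for small moduli. Whether the paper's BAT uses exactly this decomposition is an assumption here.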

If this is right

  • Existing AI ASIC hardware becomes usable for privacy-preserving cloud workloads without custom HE silicon.
  • HE operators can reach ASIC-level energy efficiency on commodity AI accelerators.
  • Compiler transformations can eliminate runtime data-reordering overhead for coarse-grained memory systems.
  • NTT and other HE primitives become practical at higher scale under power constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment technique may apply to other matrix-oriented AI accelerators beyond TPUs.
  • Production HE services could shift from GPU clusters to lower-power AI ASIC fleets.
  • Algorithm designers may begin co-optimizing new HE schemes for low-precision matrix engines from the start.

Load-bearing premise

The basis-aligned transformation converts high-precision modular arithmetic into low-precision matrix multiplications while preserving exact correctness and security properties required by the HE schemes.

What would settle it

A side-by-side measurement of throughput per watt for NTT and HE operators on TPU v6e using CROSS versus the five listed GPU libraries; failure of CROSS to exceed all five would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2501.07047 by Anirudh Itagi, Anupam Golder, Arvind, Asra Ali, G. Edward Suh, Jeremy Kun, Jevin Jiang, Jianming Tong, Jingtian Dang, Leo de Castro, Tianhao Huang, Tushar Krishna.

Figure 1: CROSS enables direct computation on encrypted …
Figure 2: TPU’s compute/memory granularity is > GPU. [diagram labels include: SotA GPU's HE Algorithm, CROSS, BAT, MAT]
Figure 3: CROSS aims at (1) eliminating compute redundancy, …
Figure 4: Overview of TPUv4 architecture based on public …
Figure 5: AI ASICs deliver better energy efficiency among …
Figure 6: Abstract compilation layers for accelerating HE-based privacy-preserving applications.
Figure 7: BAT converts high-precision modular scalar multiplication into dense, low-precision matrix multiplication. Compared …
Figure 8: BAT could be applied to convert each individual …
Figure 9: MAT Illustration for Permute(VecMul) and Transpose(MatMul). MAT moves explicit memory reordering to compile time by applying reordering directly on preknown parameters offline for runtime latency saving. [adjacent paper text captured with the caption: "… the desired data layouts. This strategy effectively eliminates runtime memory reordering costs. 1) MAT Key Idea and Illustration: MAT leverages the insight that any reordering operation on a one-dimensio…"]
Figure 10: CROSS converts high-precision NTT into a combination of low-precision matrix multiplications and element-wise multiplications to fully exploit the MXU and VPU. Row 1 shows the conventional 4-step NTT algorithm [27]; Row 2 illustrates how the Memory Aligned Transformation (MAT) eliminates the explicit transpose and bit-reverse permutation to keep data layout invariant in NTT. Row 3 details the mapping of t…
Figure 11: Ablation study: Impact of different hardware and …
Figure 12: Latency breakdown of HE multiplication and …
Figure 13: Ablation study: Impact of modular reduction …
Figure 14: Latency profiling of HE operators using OpenFHE, …
Figure 16: CROSS maps high-precision scalar multiplication into 1D convolution and temporal shifted accumulation when input operands are not known a priori. [adjacent section heading captured with the caption: "H. Fall-back Algorithm for unknown parameters of …"]
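
The Figure 10 caption refers to the conventional four-step NTT. As a hedged illustration of that general algorithm (a textbook cyclic formulation, not the paper's kernel and not the negacyclic variant HE schemes typically use), the sketch below computes a length-N transform as two small matrix multiplications with an element-wise twiddle multiplication in between, and compares it with a direct evaluation. The function names and the tiny parameters are assumptions chosen for illustration.

```python
def direct_ntt(x, omega, q):
    """Reference O(N^2) cyclic NTT: X[k] = sum_n x[n] * omega^(n*k) mod q."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, q) for j in range(n)) % q for k in range(n)]


def matmul_mod(a, b, q):
    """Plain matrix product with every entry reduced mod q."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner)) % q
             for j in range(len(b[0]))] for i in range(len(a))]


def four_step_ntt(x, omega, q, rows, cols):
    """Four-step NTT for len(x) == rows * cols, with omega a primitive N-th root mod q."""
    assert len(x) == rows * cols
    # Layout step: m[n1][n2] = x[n1 + rows * n2].
    m = [[x[n1 + rows * n2] for n2 in range(cols)] for n1 in range(rows)]
    # Small transform matrices built from omega^rows (order cols) and omega^cols (order rows).
    f_cols = [[pow(omega, rows * n2 * k2, q) for k2 in range(cols)] for n2 in range(cols)]
    f_rows = [[pow(omega, cols * k1 * n1, q) for n1 in range(rows)] for k1 in range(rows)]
    a = matmul_mod(m, f_cols, q)                      # step 1: length-cols transforms
    b = [[a[n1][k2] * pow(omega, n1 * k2, q) % q      # step 2: twiddle factors
          for k2 in range(cols)] for n1 in range(rows)]
    d = matmul_mod(f_rows, b, q)                      # step 3: length-rows transforms
    # Step 4: read out row-major, so X[cols * k1 + k2] = d[k1][k2].
    return [d[k1][k2] for k1 in range(rows) for k2 in range(cols)]


if __name__ == "__main__":
    q, omega = 17, 9  # 9 has multiplicative order 8 modulo 17
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    print(four_step_ntt(x, omega, q, rows=2, cols=4) == direct_ntt(x, omega, q))
```

The layout step and the final read-out are kept explicit here; they are the kind of data movement that, per the caption, MAT is meant to remove from the runtime path.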
Original abstract

Homomorphic Encryption (HE) provides strong data privacy for cloud services, but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, there remains an order-of-magnitude energy-efficiency gap compared to specialized (but expensive) HE ASICs. This paper explores an alternate direction: leveraging existing AI accelerators, like Google's TPUs with coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with SotA HE algorithms designed for GPUs. These algorithms rely heavily on: (1) high-precision (32-bit) integer arithmetic, which must run on a TPU's low-throughput vector unit and leaves its high-throughput low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations that are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle these challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking and improving the utilization of the TPU's MXU for HE, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding reordering into compute kernels through offline parameter transformation. CROSS (TPU v6e) achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASICs as the SotA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS
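
The MAT idea in the abstract, embedding a reordering into precomputed parameters, can be illustrated with a small sketch. It assumes the simplest case: a fixed permutation applied to the output of a matrix-vector product whose matrix is known offline. Permuting the matrix rows once, ahead of time, yields the reordered output with no runtime shuffle. The helper names are hypothetical, and the paper's actual transformation may cover more cases than this.

```python
def matvec_mod(matrix, vec, q):
    """Matrix-vector product with every entry reduced mod q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in matrix]


def permute(values, perm):
    """Runtime reordering: output position i takes the element at index perm[i]."""
    return [values[p] for p in perm]


def fold_permutation(matrix, perm):
    """Offline step: reorder the rows of the known matrix once."""
    return [matrix[p] for p in perm]


if __name__ == "__main__":
    q = 17
    w = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    x = [3, 1, 4, 1]
    bit_reversal = [0, 2, 1, 3]  # bit-reversed index order for length 4

    runtime_shuffle = permute(matvec_mod(w, x, q), bit_reversal)
    folded = matvec_mod(fold_permutation(w, bit_reversal), x, q)
    print(runtime_shuffle == folded)
```

The two paths agree because a row permutation commutes with the matrix-vector product; the folded version simply moves the cost to a one-time offline step.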

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CROSS, a compiler framework for mapping Homomorphic Encryption (HE) workloads onto TPUs. It proposes two transformations: Basis-Aligned Transformation (BAT), which rewrites high-precision modular arithmetic (including NTT) as dense INT8 matrix multiplications to utilize the TPU MXU, and Memory-Aligned Transformation (MAT), which folds data permutations into offline kernel parameters. The central empirical claim is that CROSS on TPU v6e delivers higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar. Open-source code is provided at https://github.com/EfficientPPML/CROSS.

Significance. If BAT and MAT are shown to preserve exact modular semantics and HE security parameters, the result would demonstrate that commodity AI ASICs can close much of the energy-efficiency gap to purpose-built HE accelerators. The direct hardware measurements and public code release are concrete strengths that support reproducibility.

major comments (2)
  1. [Abstract, BAT paragraph] Abstract (paragraph describing BAT): the claim that BAT converts high-precision modular arithmetic into exact INT8 matrix multiplications without precision loss or overflow is asserted but not accompanied by a derivation, invariant proof, or explicit verification that every intermediate value and final modular reduction matches the original arithmetic for the primes and polynomial degrees used in the evaluated schemes. This equivalence is load-bearing for both correctness and the reported performance numbers.
  2. [Experimental evaluation] Experimental evaluation (throughput-per-watt results): the comparisons against the five named libraries are presented on real TPU v6e hardware, yet the manuscript does not supply sufficient detail on precision handling during BAT, measurement methodology (e.g., power metering, batch sizes), or workload selection to allow independent assessment of whether these factors affect the central claim of superiority.
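
The overflow concern in major comment 1 can be bounded with simple arithmetic. The sketch below is an illustration of the kind of check the referee asks for, not something taken from the manuscript. It assumes unsigned 8-bit limbs and a signed 32-bit accumulator (the accumulator width is an assumption about the hardware, not a figure from the source), and computes the worst-case dot-product value for a given reduction depth.

```python
def max_accumulated_value(depth, limb_bits=8):
    """Largest possible dot product of `depth` pairs of unsigned limbs."""
    limb_max = (1 << limb_bits) - 1
    return depth * limb_max * limb_max


def fits_accumulator(depth, limb_bits=8, acc_bits=32):
    """True if the worst-case sum fits in a signed accumulator of acc_bits bits."""
    return max_accumulated_value(depth, limb_bits) <= (1 << (acc_bits - 1)) - 1


def max_safe_depth(limb_bits=8, acc_bits=32):
    """Largest reduction depth whose worst case still fits the accumulator."""
    limb_max = (1 << limb_bits) - 1
    return ((1 << (acc_bits - 1)) - 1) // (limb_max * limb_max)


if __name__ == "__main__":
    depth = max_safe_depth()
    print(depth, fits_accumulator(depth), fits_accumulator(depth + 1))
```

A full verification would also have to cover the recombination of limb columns and the final modular reduction for each prime in use; this bound addresses only the accumulation step.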
minor comments (2)
  1. [BAT description] Notation for the basis alignment in BAT could be introduced with a small worked example (one prime, small degree) to make the transformation concrete for readers unfamiliar with the technique.
  2. [Introduction] The abstract lists five baseline libraries; a short table in the introduction summarizing their target platforms and key optimizations would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback highlighting the need for more explicit support of the BAT equivalence claim and additional experimental details. We will revise the manuscript to include these elements, strengthening the presentation of our results on TPU-based HE acceleration.

Point-by-point responses
  1. Referee: [Abstract, BAT paragraph] Abstract (paragraph describing BAT): the claim that BAT converts high-precision modular arithmetic into exact INT8 matrix multiplications without precision loss or overflow is asserted but not accompanied by a derivation, invariant proof, or explicit verification that every intermediate value and final modular reduction matches the original arithmetic for the primes and polynomial degrees used in the evaluated schemes. This equivalence is load-bearing for both correctness and the reported performance numbers.

    Authors: The manuscript asserts the no-precision-loss property based on the BAT design, but we acknowledge that a full derivation and verification are not present in the current version. We will add this in the revision, including the mathematical invariant and checks for the specific parameters used in our experiments. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation (throughput-per-watt results): the comparisons against the five named libraries are presented on real TPU v6e hardware, yet the manuscript does not supply sufficient detail on precision handling during BAT, measurement methodology (e.g., power metering, batch sizes), or workload selection to allow independent assessment of whether these factors affect the central claim of superiority.

    Authors: We agree that more details are needed for reproducibility. In the revised version, we will include a detailed description of precision handling (referencing the new BAT proof), power metering methodology using TPU hardware counters, specific batch sizes employed, and the criteria for selecting the HE workloads and parameters. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on implementation and hardware measurement

full rationale

The paper introduces CROSS, a compiler with BAT (converting high-precision modular ops to INT8 matmuls) and MAT (embedding reordering into kernels). Its strongest claim—higher throughput/watt on TPU v6e vs. listed baselines—is presented as the outcome of direct hardware execution and measurement, not any derivation, fitted parameter, or equation that reduces to its own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core mapping or performance numbers. The BAT correctness assertion is an implementation claim (with code released) rather than a self-referential mathematical step. This is the normal case of an engineering paper whose results are externally falsifiable via the provided implementation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard properties of modular arithmetic and matrix multiplication together with the assumption that the TPU hardware behaves as documented; no free parameters, new entities, or ad-hoc axioms are introduced beyond the compiler techniques themselves.

axioms (1)
  • domain assumption Modular arithmetic operations remain correct when rewritten as low-precision matrix multiplications under the basis-aligned mapping
    Invoked in the description of BAT to justify the conversion from high-precision integers to INT8 matrix operations.

pith-pipeline@v0.9.0 · 5915 in / 1297 out tokens · 33266 ms · 2026-05-23T06:01:18.161350+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

  1. [1]

    Number theoretic transforms to implement fast digital convolution,

    R. Agarwal and C. Burrus, “Number theoretic transforms to implement fast digital convolution,” Proceedings of the IEEE, 1975

  2. [2]

    Heap: A fully homomorphic encryption accelerator with parallelized bootstrapping,

    R. Agrawal, A. Chandrakasan, and A. Joshi, “Heap: A fully homomorphic encryption accelerator with parallelized bootstrapping,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 756–769

  3. [3]

    Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,

    R. Agrawal, L. De Castro, C. Juvekar, A. Chandrakasan, V. Vaikuntanathan, and A. Joshi, “Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023

  4. [4]

    Fab: An fpga-based accelerator for bootstrappable fully homomorphic encryption,

    R. Agrawal, L. de Castro, G. Yang, C. Juvekar, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, and A. Joshi, “Fab: An fpga-based accelerator for bootstrappable fully homomorphic encryption,” 2022

  5. [5]

    Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,

    C. Agulló-Domingo, Óscar Vera-López, S. Guzelhan, L. Daksha, A. E. Jerari, K. Shivdikar, R. Agrawal, D. Kaeli, A. Joshi, and J. L. Abellán, “Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,” 2025. [Online]. Available: https://arxiv.org/abs/2507.04775

  6. [6]

    Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,

    A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, “Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,” IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 2, pp. 941–956, 2021

  7. [7]

    Homomorphic encryption standard,

    M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine et al., “Homomorphic encryption standard,” Protecting privacy through homomorphic encryption, 2021

  8. [8]

    Pallas: a jax kernel language,

    JAX authors, “Pallas: a jax kernel language,” 2024. [Online]. Available: https://jax.readthedocs.io/en/latest/pallas/index.html

  9. [9]

    Openfhe: Open-source fully homomorphic encryption library,

    A. A. Badawi, J. Bates, F. Bergamaschi, D. B. Cousins, S. Erabelli, N. Genise, S. Halevi, H. Hunt, A. Kim, Y. Lee, Z. Liu, D. Micciancio, I. Quah, Y. Polyakov, S. R.V., K. Rohloff, J. Saylor, D. Suponitsky, M. Triplett, V. Vaikuntanathan, and V. Zucca, “Openfhe: Open-source fully homomorphic encryption library,” Cryptology ePrint Archive, Paper 2022/...

  10. [10]

    Demystifying bootstrapping in fully homomorphic encryption,

    A. A. Badawi and Y. Polyakov, “Demystifying bootstrapping in fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2023/149,

  11. [11]

    Available: https://eprint.iacr.org/2023/149

    [Online]. Available: https://eprint.iacr.org/2023/149

  12. [12]

    Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,

    P. Barrett, “Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,” in Proceedings on Advances in Cryptology—CRYPTO ’86. Berlin, Heidelberg: Springer-Verlag, 1987, p. 311–323

  13. [13]

    Intel HEXL (release 1.2),

    F. Boemer, S. Kim, G. Seifu, F. D. de Souza, V. Gopal et al., “Intel HEXL (release 1.2),” https://github.com/intel/hexl, Sep. 2021

  14. [14]

    JAX: composable transformations of Python+NumPy programs,

    J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/google/jax

  15. [15]

    Low Latency Privacy Preserving Inference

    A. Brutzkus, O. Elisha, and R. Gilad-Bachrach, “Low latency privacy preserving inference,” 2019. [Online]. Available: https://arxiv.org/abs/1812.10659

  16. [16]

    A full rns variant of approximate homomorphic encryption,

    J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “A full rns variant of approximate homomorphic encryption,” in Selected Areas in Cryptography–SAC 2018: 25th International Conference, Calgary, AB, Canada, August 15–17, 2018, Revised Selected Papers 25. Springer, 2019, pp. 347–368

  17. [17]

    Homomorphic encryption for arithmetic of approximate numbers,

    J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” Cryptology ePrint Archive, Paper 2016/421, 2016. [Online]. Available: https://eprint.iacr.org/2016/421

  18. [18]

    Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,

    S. Cheon, Y. Lee, D. Kim, J. M. Lee, S. Jung, T. Kim, D. Lee, and H. Kim, “Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,” in 33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 6993–7010

  19. [19]

    Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,

    W. Choi, J. Kim, and J. H. Ahn, “Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,” in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’26. New York, NY, USA: Association for Computing Machinery, 2025, p. 35–49. [Online]...

  20. [20]

    Profile your model on cloud tpu nodes,

    Google Cloud, “Profile your model on cloud tpu nodes,” 2024. [Online]. Available: https://cloud.google.com/tpu/docs/cloud-tpu-tools

  21. [21]

    Matrix multiplication background user’s guide

    NVIDIA Corporation. Matrix multiplication background user’s guide. NVIDIA. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

  22. [22]

    W. J. Dally and B. P. Towles, Principles and practices of interconnection networks. Elsevier, 2004

  23. [23]

    Chet: Compiler and runtime for homomorphic evaluation of tensor programs,

    R. Dathathri, O. Saarikivi, H. Chen, K. Laine, K. Lauter, S. Maleki, M. Musuvathi, and T. Mytkowicz, “Chet: Compiler and runtime for homomorphic evaluation of tensor programs,” 2018

  24. [24]

    Does fully homomorphic encryption need compute acceleration?

    L. de Castro, R. Agrawal, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, C. Juvekar, and A. Joshi, “Does fully homomorphic encryption need compute acceleration?” 2021

  25. [25]

    Orion: A fully homomorphic encryption framework for deep learning,

    A. Ebel, K. Garimella, and B. Reagen, “Orion: A fully homomorphic encryption framework for deep learning,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025

  26. [26]

    Warpdrive: Gpu-based fully homomorphic encryption acceleration leveraging tensor and cuda cores,

    G. Fan, M. Zhang, F. Zheng, S. Fan, T. Zhou, X. Deng, W. Tang, L. Kong, Y. Song, and S. Yan, “Warpdrive: Gpu-based fully homomorphic encryption acceleration leveraging tensor and cuda cores,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025, pp. 1187–1200

  27. [27]

    Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,

    G. Fan, F. Zheng, L. Wan, L. Gao, Y. Zhao, J. Dong, Y. Song, Y. Wang, and J. Lin, “Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023, pp. 798–808

  28. [28]

    Tensorfhe: Achieving practical computation on encrypted data using gpgpu,

    S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, “Tensorfhe: Achieving practical computation on encrypted data using gpgpu,” 2022

  29. [29]

    BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,

    R. Geelen, M. V. Beirendonck, H. V. L. Pereira, B. Huffman, T. McAuley, B. Selfridge, D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2022/657, 2022. [Online]. Available: https://eprint.iacr.org/2022/657

  30. [30]

    Google Cloud TPU,

    Google, “Google Cloud TPU,” 2024. [Online]. Available: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm

  31. [31]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

    Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,”

  32. [32]
  33. [33]

    Logistic regression on homomorphic encrypted data at scale,

    K. Han, S. Hong, J. H. Cheon, and D. Park, “Logistic regression on homomorphic encrypted data at scale,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, ser. AAAI’19/IAAI’1...

  34. [34]

    Cinnamon: A framework for scale-out encrypted ai,

    S. Jayashankar, E. Chen, T. Tang, W. Zheng, and D. Skarlatos, “Cinnamon: A framework for scale-out encrypted ai,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 133–150. [Online]. Av...

  35. [35]

    Ten lessons from three generations shaped google’s tpuv4i,

    N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped google’s tpuv4i,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021

  36. [36]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,

    N. P. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” 2023

  37. [37]

    A domain-specific supercomputer for training deep neural networks,

    N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. A. Patterson, “A domain-specific supercomputer for training deep neural networks,” Commun. ACM, vol. 63, no. 7, pp. 67–78, 2020. [Online]. Available: https://doi.org/10.1145/3360307

  38. [38]

    Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus,

    W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y. Lee, “Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 114–148, 2021.

  39. [39]

    Gazelle: A Low Latency Framework for Secure Neural Network Inference

    C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, “Gazelle: A low latency framework for secure neural network inference,” 2018. [Online]. Available: https://arxiv.org/abs/1801.05507

  40. [40]

    Revisiting homomorphic encryption schemes for finite fields,

    A. Kim, Y. Polyakov, and V. Zucca, “Revisiting homomorphic encryption schemes for finite fields,” Cryptology ePrint Archive, Paper 2021/204, 2021. [Online]. Available: https://eprint.iacr.org/2021/204

  41. [41]

    Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,

    J. Kim, S. Kim, J. Choi, J. Park, D. Kim, and J. H. Ahn, “Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/...

  42. [42]

    Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,

    J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, “Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 1237–1254

  43. [43]

    Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,

    S. Kim, W. Jung, J. Park, and J. H. Ahn, “Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,” in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, Oct. 2020, p. 264–275. [Online]. Available: http://dx.doi.org/10.1109/IISWC50251.2020.00033

  44. [44]

    Bts: An accelerator for bootstrappable fully homomorphic encryption,

    S. Kim, J. Kim, M. J. Kim, W. Jung, J. Kim, M. Rhu, and J. H. Ahn, “Bts: An accelerator for bootstrappable fully homomorphic encryption,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 711–725. [Online]. Available: https://doi.org/10.1145/3...

  45. [45]

    Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,

    W.-K. Lee, S. Akleylek, D. C.-K. Wong, W.-S. Yap, B.-M. Goi, and S.-O. Hwang, “Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,” J. Supercomput., vol. 77, no. 4, p. 3289–3314, Apr. 2021. [Online]. Available: https://doi.org/10.1007/s11227-020-03392-x

  46. [46]

    Error-latency-aware scale management for fully homomorphic encryption,

    Y. Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Error-latency-aware scale management for fully homomorphic encryption,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023

  47. [47]

    Performance-aware scale analysis with reserve for homomorphic encryption,

    Y. Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Performance-aware scale analysis with reserve for homomorphic encryption,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024

  48. [48]

    Hecate: Performance-aware scale optimization for homomorphic encryption compiler,

    Y. Lee, S. Heo, S. Cheon, S. Jeong, C. Kim, E. Kim, D. Lee, and H. Kim, “Hecate: Performance-aware scale optimization for homomorphic encryption compiler,” in 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2022, pp. 193–204

  49. [49]

    Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,

    Q. Li and R. Zong, “Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22227

  50. [50]

    A large-scale survey on the usability of ai programming assistants: Successes and challenges,

    J. T. Liang, C. Yang, and B. A. Myers, “A large-scale survey on the usability of ai programming assistants: Successes and challenges,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, 2024, pp. 1–13

  51. [51]

    Lattice signatures without trapdoors,

    V. Lyubashevsky, “Lattice signatures without trapdoors,” in Advances in Cryptology – EUROCRYPT 2012, D. Pointcheval and T. Johansson, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 738–755

  52. [52]

    Modular multiplication without trial division,

    P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985

  53. [53]

    Lattigo: A multiparty homomorphic encryption library in go,

    C. V. Mouchet, J.-P. Bossuat, J. R. Troncoso-Pastoriza, and J.-P. Hubaux, “Lattigo: A multiparty homomorphic encryption library in go,” in Proceedings of the 8th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, 2020, pp. 64–70

  54. [54]

    GPT-4o System Card

    OpenAI, “Gpt-4o system card,” 2024. [Online]. Available: https://arxiv.org/abs/2410.21276

  55. [55]

    Strix: An end-to-end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,

    A. Putra, Prasetiyo, Y. Chen, J. Kim, and J.-Y. Kim, “Strix: An end-to-end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,” 2023

  56. [56]

    Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,

    X. Ren, Z. Chen, Z. Gu, Y. Lu, R. Zhong, W.-J. Lu, J. Zhang, Y. Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y. Xie, “Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,” in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6

  57. [57]

    Heax: An architecture for computing on encrypted data,

    M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “Heax: An architecture for computing on encrypted data,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 1295–1309. [Online]. Available: https:...

  58. [58]

    F1: A fast and programmable accelerator for fully homomorphic encryption,

    N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic encryption,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 238–252. [Online]. Availab...

  59. [59]

    Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,

    N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, “Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machi...

  60. [60]

    Generative artificial intelligence: A systematic review and applications,

    S. S. Sengar, A. B. Hasan, S. Kumar, and F. Carroll, “Generative artificial intelligence: A systematic review and applications,” 2024. [Online]. Available: https://arxiv.org/abs/2405.11029

  61. [61]

    High-throughput polynomial multiplier architecture for lattice-based cryptography,

    T. Shimada and M. Ikeda, “High-throughput polynomial multiplier architecture for lattice-based cryptography,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1–5

  62. [62]

    Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,

    K. Shivdikar, Y. Bao, R. Agrawal, M. Shen, G. Jonatan, E. Mora, A. Ingare, N. Livesay, J. L. Abellán, J. Kim et al., “Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,” arXiv preprint arXiv:2309.11001, 2023

  63. [63]

    Accelerating polynomial multiplication for homomorphic encryption on gpus,

    K. Shivdikar, G. Jonatan, E. Mora, N. Livesay, R. Agrawal, A. Joshi, J. Abellan, J. Kim, and D. Kaeli, “Accelerating polynomial multiplication for homomorphic encryption on gpus,” 2022. [Online]. Available: https://arxiv.org/abs/2209.01290

  64. [64]

    Ntl: A library for doing number theory,

    V. Shoup et al., “Ntl: A library for doing number theory,” 2001

  65. [65]

    Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,

    S. Sinha Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, “Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019

  66. [66]

    Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,

    Y. Sun, S. Fan, Z. Yin, X. Song, X. Hu, Z. Du, Q. Guo, W. Xu, R. Hou, D. Meng, S. Bian, and M. Zhan, “Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,” IEEE Transactions on Computers, pp. 1–14, 2025

  67. [67]

    Client-optimized algorithms and acceleration for encrypted compute offloading,

    M. van der Hagen and B. Lucia, “Client-optimized algorithms and acceleration for encrypted compute offloading,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’22. New York, NY, USA: Association for Computing Machinery, 2022

  68. [68]

    Chameleon: An efficient fhe scheme switching acceleration on gpus,

    Z. Wang, H. He, L. Zhao, P. Li, Z. Li, D. Meng, and R. Hou, “Chameleon: An efficient fhe scheme switching acceleration on gpus,” [Online]. Available: https://arxiv.org/abs/2410.05934

  70. [70]

    He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,

    Z. Wang, P. Li, R. Hou, Z. Li, J. Cao, X. Wang, and D. Meng, “He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 4, pp. 1067–1081, 2023

  71. [71]

    WISE-HE. Wise. GitHub repository; main branch; commit f0689fd (“init”). [Online]. Available: https://github.com/WISE-HE/WISE

  72. [72]

    Phantom: A cuda-accelerated word-wise homomorphic encryption library,

    H. Yang, S. Shen, W. Dai, L. Zhou, Z. Liu, and Y. Zhao, “Phantom: A cuda-accelerated word-wise homomorphic encryption library,” IEEE Trans. Dependable Secur. Comput., vol. 21, no. 5, p. 4895–4906, Sep. [Online]. Available: https://doi.org/10.1109/TDSC.2024.3363900

  74. [74]

    Poseidon: Practical homomorphic encryption accelerator,

    Y. Yang, H. Zhang, S. Fan, H. Lu, M. Zhang, and X. Li, “Poseidon: Practical homomorphic encryption accelerator,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 870–881

  75. [75]

    Accelerating encrypted computing on intel gpus,

    Y. Zhai, M. Ibrahim, Y. Qiu, F. Boemer, Z. Chen, A. Titov, and A. Lyashevsky, “Accelerating encrypted computing on intel gpus,” [Online]. Available: https://arxiv.org/abs/2109.14704

  77. [77]

    Sok: Fully homomorphic encryption accelerators,

    J. Zhang, X. Cheng, L. Yang, J. Hu, X. Liu, and K. Chen, “Sok: Fully homomorphic encryption accelerators,” ACM Comput. Surv., vol. 56, no. 12, Oct. 2024. [Online]. Available: https://doi.org/10.1145/3676955

  78. [78]

    Hengine: A high performance optimization framework on a gpu for homomorphic encryption,

    J. Zhao, H. Yang, M. Hao, W. Zhang, H. He, and D. Wang, “Hengine: A high performance optimization framework on a gpu for homomorphic encryption,” ACM Trans. Archit. Code Optim., vol. 22, no. 2, Jul. 2025. [Online]. Available: https://doi.org/10.1145/3732942

  79. [79]

    Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,

    Y. Zhu, X. Wang, L. Ju, and S. Guo, “Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 896–907.

    APPENDIX

    A. Abstract

    We provide scripts to reproduce latency of BAT (Tab. V) and BConv (Tab. VI), throughput of NTT (Tab. VII, ...

    Radix-2 Cooley-Tukey NTT algorithm (Butterfly NTT): The NTT converts polynomial representations from the coefficient domain to the evaluation domain, where polynomial multiplication simplifies to element-wise (vectorized) coefficient multiplication. The NTT and INTT are computationally intensive, accounting for approximately 45.1% to 86.3% of the over...
