Leveraging ASIC AI Chips for Homomorphic Encryption

Anirudh Itagi; Anupam Golder; Arvind; Asra Ali; G. Edward Suh; Jeremy Kun; Jevin Jiang; Jianming Tong; Jingtian Dang; Leo de Castro

arxiv: 2501.07047 · v4 · submitted 2025-01-13 · cs.CR · cs.AR · cs.CL · cs.PL

Jianming Tong, Tianhao Huang, Jingtian Dang, Leo de Castro, Anirudh Itagi, Anupam Golder, Asra Ali, Jeremy Kun, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna

Pith reviewed 2026-05-23 06:01 UTC · model grok-4.3

keywords homomorphic encryption · AI accelerators · TPU · compiler optimization · energy efficiency · NTT · matrix multiplication · privacy-preserving computation

The pith

Compiler framework turns AI ASICs into efficient platforms for homomorphic encryption by mapping modular arithmetic to low-precision matrix multiplications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the energy gap between GPU-based HE acceleration and specialized HE ASICs by repurposing existing AI accelerators such as TPUs. It identifies two mismatches that waste TPU resources: reliance on 32-bit integer arithmetic, which runs on the low-throughput vector unit and bypasses the high-throughput INT8 matrix engine, and fine-grained data permutations, which map poorly onto the TPU's coarse-grained memory subsystem. CROSS addresses these with two offline transformations that realign the workload to the TPU architecture. The reported result is higher throughput per watt on NTT and HE operators than several established GPU, FPGA, and ASIC baselines, which the authors present as positioning AI ASICs as the leading energy-efficient platform for HE.

Core claim

CROSS is a compiler that applies Basis-Aligned Transformation to convert high-precision modular arithmetic into dense INT8 matrix multiplications that utilize the TPU MXU, and Memory-Aligned Transformation to fold data reordering into the kernels offline; on TPU v6e this yields higher throughput per watt for NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar.

What carries the argument

Basis-Aligned Transformation (BAT), which rewrites high-precision modular arithmetic as low-precision (INT8) matrix multiplications that preserve exact correctness and security.
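
As a rough illustration of the general idea, and not the paper's actual BAT construction, the sketch below multiplies a 32-bit value by a constant known ahead of time using only 8-bit limbs and a small matrix-vector product. The helper names (`to_limbs`, `build_limb_matrix`, `modmul_via_limbs`), the 4-limb split, and the sample modulus are all assumptions made for this example; it also assumes the modulus and both operands fit in 32 bits.

```python
# Illustrative sketch only: a modular multiply by a constant known ahead of
# time, expressed as an 8-bit matrix-vector product. Names and the 4-limb
# layout are assumptions for this example, not the CROSS implementation.

K = 4  # number of 8-bit limbs in a 32-bit word


def to_limbs(x, k=K):
    """Split a non-negative integer into k little-endian 8-bit limbs."""
    return [(x >> (8 * i)) & 0xFF for i in range(k)]


def from_limbs(limbs):
    """Recombine limbs (which may exceed 8 bits) using 8-bit shifts."""
    return sum(v << (8 * i) for i, v in enumerate(limbs))


def build_limb_matrix(b, q, k=K):
    """Offline step: entry [i][j] is limb i of (b * 2^(8j)) mod q."""
    cols = [to_limbs((b << (8 * j)) % q, k) for j in range(k)]
    return [[cols[j][i] for j in range(k)] for i in range(k)]


def modmul_via_limbs(a, matrix, q, k=K):
    """Runtime step: 8-bit matrix-vector product, then one final reduction."""
    a_limbs = to_limbs(a, k)
    acc = [sum(m * v for m, v in zip(row, a_limbs)) for row in matrix]
    return from_limbs(acc) % q


if __name__ == "__main__":
    q = 268369921   # sample 28-bit modulus (assumed for the demo)
    b = 123456789   # constant known offline, e.g. a twiddle factor
    M = build_limb_matrix(b, q)
    a = 987654321 % q
    print(modmul_via_limbs(a, M, q) == (a * b) % q)
```

Every matrix entry and every input limb is an 8-bit value, and each accumulator is a sum of only four such products, so the wide arithmetic is confined to the final recombination and reduction. How CROSS handles that last reduction is not shown here.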

If this is right

Existing AI ASIC hardware becomes usable for privacy-preserving cloud workloads without custom HE silicon.
HE operators can reach ASIC-level energy efficiency on commodity AI accelerators.
Compiler transformations can eliminate runtime data-reordering overhead for coarse-grained memory systems.
NTT and other HE primitives become practical at higher scale under power constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment technique may apply to other matrix-oriented AI accelerators beyond TPUs.
Production HE services could shift from GPU clusters to lower-power AI ASIC fleets.
Algorithm designers may begin co-optimizing new HE schemes for low-precision matrix engines from the start.

Load-bearing premise

The basis-aligned transformation converts high-precision modular arithmetic into low-precision matrix multiplications while preserving exact correctness and security properties required by the HE schemes.

What would settle it

A side-by-side measurement of throughput per watt for NTT and HE operators on TPU v6e using CROSS versus the five listed GPU libraries; failure of CROSS to exceed all five would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2501.07047 by Anirudh Itagi, Anupam Golder, Arvind, Asra Ali, G. Edward Suh, Jeremy Kun, Jevin Jiang, Jianming Tong, Jingtian Dang, Leo de Castro, Tianhao Huang, Tushar Krishna.

**Figure 2.** TPU’s compute/memory granularity is larger than the GPU’s. [Diagram labels only partly recoverable: the SotA GPU HE algorithm (sparse MatVecMul, high-precision MatMul, explicit data transform, costly layout transformation, low-throughput VPU only) is contrasted with CROSS (BAT and MAT; layout-invariant, low-precision dense MatVec… on the MXU).]

**Figure 3.** CROSS aims at (1) eliminating compute redundancy, …

**Figure 4.** Overview of TPUv4 architecture based on public …

**Figure 5.** AI ASICs deliver better energy efficiency among …

**Figure 6.** Abstract compilation layers for accelerating HE-based privacy-preserving applications.

**Figure 7.** BAT converts high-precision modular scalar multiplication into dense, low-precision matrix multiplication. Compared …

**Figure 8.** BAT could be applied to convert each individual …

**Figure 9.** MAT illustration for Permute(VecMul) and Transpose(MatMul). MAT moves explicit memory reordering to compile time by applying the reordering directly to pre-known parameters offline, which saves runtime latency. (Adjacent paper text captured with the caption: "… the desired data layouts. This strategy effectively eliminates runtime memory reordering costs. 1) MAT Key Idea and Illustration: MAT leverages the insight that any reordering operation on a one-dimensio…")
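
The caption's point, that a reordering applied to the output of a product with pre-known parameters can instead be applied to those parameters offline, can be sketched in a few lines. This is a minimal pure-Python illustration under assumed names (`matvec`, `permute`, `fold_permutation`, `bit_reverse_permutation`); it is not taken from the CROSS code.

```python
# Illustrative sketch only: folding an output permutation into a matrix that
# is known ahead of time, so no runtime reordering step is needed.

def matvec(matrix, x):
    """Plain matrix-vector product over Python integers."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def permute(v, perm):
    """Explicit runtime reordering: output[i] = v[perm[i]]."""
    return [v[p] for p in perm]


def fold_permutation(matrix, perm):
    """Offline step: reorder the rows of the known matrix once."""
    return [matrix[p] for p in perm]


def bit_reverse_permutation(n):
    """Bit-reversal permutation for a power-of-two length n."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


if __name__ == "__main__":
    n = 8
    M = [[(3 * i + 5 * j + 1) % 17 for j in range(n)] for i in range(n)]
    x = list(range(1, n + 1))
    perm = bit_reverse_permutation(n)
    runtime_reorder = permute(matvec(M, x), perm)      # two steps at runtime
    folded = matvec(fold_permutation(M, perm), x)      # one step at runtime
    print(runtime_reorder == folded)
```

Row `i` of the folded matrix is row `perm[i]` of the original, so the product already comes out in the permuted order and the explicit `permute` call disappears from the runtime path.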

**Figure 10.** CROSS converts high-precision NTT into a combination of low-precision matrix multiplications and element-wise multiplications to fully exploit the MXU and VPU. Row 1 shows the conventional 4-step NTT algorithm [27]; Row 2 illustrates how the Memory Aligned Transformation (MAT) eliminates the explicit transpose and bit-reverse permutation to keep data layout invariant in NTT. Row 3 details the mapping of t…
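
For readers unfamiliar with the conventional 4-step NTT that the caption refers to, the following sketch shows its textbook shape: a matrix product, an element-wise twiddle multiplication, a second matrix product, and a transposed read-out. It is a toy cyclic NTT over a small modulus with assumed names (`four_step_ntt`, `naive_ntt`, `matmul_mod`); HE libraries typically use a negacyclic variant with much larger parameters, and this is not the CROSS mapping itself.

```python
# Illustrative sketch only: textbook 4-step cyclic NTT of length N = R * C
# over Z_q, where w is a primitive N-th root of unity mod q.

def matmul_mod(A, B, q):
    """Matrix product with every entry reduced mod q."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % q for col in cols]
            for row in A]


def four_step_ntt(x, R, C, w, q):
    # View the input as an R x C matrix in row-major order.
    A = [x[C * r: C * r + C] for r in range(R)]
    # Step 1: length-R NTT down each column, written as a matrix product.
    W_R = [[pow(w, C * k1 * n1, q) for n1 in range(R)] for k1 in range(R)]
    B = matmul_mod(W_R, A, q)
    # Step 2: element-wise twiddle multiplication.
    B = [[B[k1][n2] * pow(w, k1 * n2, q) % q for n2 in range(C)]
         for k1 in range(R)]
    # Step 3: length-C NTT along each row, again a matrix product.
    W_C = [[pow(w, R * n2 * k2, q) for k2 in range(C)] for n2 in range(C)]
    D = matmul_mod(B, W_C, q)
    # Step 4: transposed read-out, X[k1 + R*k2] = D[k1][k2].
    return [D[k % R][k // R] for k in range(R * C)]


def naive_ntt(x, w, q):
    """Direct O(N^2) definition, used only as a reference."""
    n = len(x)
    return [sum(x[j] * pow(w, j * k, q) for j in range(n)) % q
            for k in range(n)]


if __name__ == "__main__":
    q, w = 17, 2   # 2 is a primitive 8th root of unity mod 17
    x = [3, 1, 4, 1, 5, 9, 2, 6]
    print(four_step_ntt(x, 2, 4, w, q) == naive_ntt(x, w, q))
```

Steps 1 and 3 are dense matrix products against matrices that depend only on `w`, `R`, and `C`, so they can be built offline. Step 4 is the explicit reordering that, per the caption, MAT removes from the runtime path.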

**Figure 11.** Ablation study: Impact of different hardware and …

**Figure 12.** Latency breakdown of HE multiplication and …

**Figure 13.** Ablation study: Impact of modular reduction …

**Figure 14.** Latency profiling of HE operators using OpenFHE, …

**Figure 16.** CROSS maps high-precision scalar multiplication into 1D convolution and temporally shifted accumulation when input operands are not known a priori. (Captured with the caption: the truncated appendix heading "H. Fall-back Algorithm for unknown parameters of …")
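
When neither operand is known in advance, there is no constant to fold into a matrix offline. The Figure 16 caption describes a 1D convolution with shifted accumulation for that case; the sketch below shows the schoolbook form of that general idea on 8-bit limbs. The function names and the 4-limb split are assumptions for illustration, and the paper's fall-back algorithm may differ in its details.

```python
# Illustrative sketch only: multiplying two 32-bit integers as a 1D
# convolution of their 8-bit limbs followed by shifted accumulation.

def to_limbs(x, k=4):
    """Split a non-negative integer into k little-endian 8-bit limbs."""
    return [(x >> (8 * i)) & 0xFF for i in range(k)]


def limb_convolution(a_limbs, b_limbs):
    """Full 1D convolution: out[t] = sum of a[i] * b[j] over i + j == t."""
    out = [0] * (len(a_limbs) + len(b_limbs) - 1)
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            out[i + j] += a * b
    return out


def shifted_accumulate(conv):
    """Add the convolution outputs, each shifted by 8 bits per position."""
    return sum(c << (8 * t) for t, c in enumerate(conv))


def mul_via_convolution(a, b, k=4):
    return shifted_accumulate(limb_convolution(to_limbs(a, k), to_limbs(b, k)))


if __name__ == "__main__":
    a, b, q = 123456789, 987654321, 268369921   # sample values, assumed
    print(mul_via_convolution(a, b) % q == (a * b) % q)
```

Each convolution output is a short sum of products of 8-bit values, so it fits comfortably in a 32-bit accumulator; carries are resolved only in the final shifted accumulation.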

Original abstract

Homomorphic Encryption (HE) provides strong data privacy for cloud services but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, there remains an order-of-magnitude energy-efficiency gap compared to specialized (but expensive) HE ASICs. This paper explores an alternate direction: leveraging existing AI accelerators, like Google's TPUs with coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with SoTA HE algorithms designed for GPUs. These algorithms rely heavily on: (1) high-precision (32-bit) integer arithmetic to now run on a TPU's low-throughput vector unit, leaving its high-throughput low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations that are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle above challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking and improving the utilization of TPU's MXU for HE, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding reordering into compute kernels through offline parameter transformation. CROSS (TPU v6e) achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASIC as the SotA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CROSS gets HE operators running on a real TPU v6e with better throughput per watt than the five named baselines by mapping modular work onto the MXU, but the exactness of that BAT mapping is the part that still needs checking.

The central result here is that the authors built a compiler, CROSS, that lets a TPU run NTT and other HE kernels more efficiently per watt than WarpDrive, FIDESlib, FAB, HEAP, or Cheddar. They do this with two transformations: BAT turns the high-precision modular arithmetic into dense INT8 matrix multiplies, and MAT folds the data permutations into the kernels offline so the coarse memory system does not thrash at runtime. Both are new as described, and the paper ships code plus measurements from actual TPU v6e silicon rather than simulation or modeling alone. That combination is what makes the performance numbers worth looking at.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CROSS, a compiler framework for mapping Homomorphic Encryption (HE) workloads onto TPUs. It proposes two transformations: Basis-Aligned Transformation (BAT), which rewrites high-precision modular arithmetic (including NTT) as dense INT8 matrix multiplications to utilize the TPU MXU, and Memory-Aligned Transformation (MAT), which folds data permutations into offline kernel parameters. The central empirical claim is that CROSS on TPU v6e delivers higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar. Open-source code is provided at https://github.com/EfficientPPML/CROSS.

Significance. If BAT and MAT are shown to preserve exact modular semantics and HE security parameters, the result would demonstrate that commodity AI ASICs can close much of the energy-efficiency gap to purpose-built HE accelerators. The direct hardware measurements and public code release are concrete strengths that support reproducibility.

major comments (2)

[Abstract, BAT paragraph] Abstract (paragraph describing BAT): the claim that BAT converts high-precision modular arithmetic into exact INT8 matrix multiplications without precision loss or overflow is asserted but not accompanied by a derivation, invariant proof, or explicit verification that every intermediate value and final modular reduction matches the original arithmetic for the primes and polynomial degrees used in the evaluated schemes. This equivalence is load-bearing for both correctness and the reported performance numbers.
[Experimental evaluation] Experimental evaluation (throughput-per-watt results): the comparisons against the five named libraries are presented on real TPU v6e hardware, yet the manuscript does not supply sufficient detail on precision handling during BAT, measurement methodology (e.g., power metering, batch sizes), or workload selection to allow independent assessment of whether these factors affect the central claim of superiority.
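
One concrete form the verification requested in the first major comment could take is a bound check on accumulator width. The sketch below assumes unsigned 8-bit operands and signed 32-bit accumulators; both are assumptions made here for illustration, not statements about the TPU MXU or the CROSS code, and the function names are hypothetical.

```python
# Illustrative sketch only: worst-case accumulator growth for a dot product
# of low-precision operands, under assumed operand and accumulator widths.

def max_accumulator_value(length, operand_bits=8):
    """Largest possible sum of `length` products of unsigned operands."""
    max_operand = (1 << operand_bits) - 1
    return length * max_operand * max_operand


def max_safe_length(operand_bits=8, acc_bits=32):
    """Longest dot product that cannot overflow a signed accumulator."""
    max_operand = (1 << operand_bits) - 1
    acc_max = (1 << (acc_bits - 1)) - 1
    return acc_max // (max_operand * max_operand)


def fits(length, operand_bits=8, acc_bits=32):
    """True if a dot product of this length is overflow-free in the worst case."""
    return length <= max_safe_length(operand_bits, acc_bits)


if __name__ == "__main__":
    for n in (4, 4096, 65536):
        print(n, max_accumulator_value(n), fits(n))
```

A check of this kind covers only accumulator overflow. The referee's broader request, that every intermediate value and the final modular reduction match the original arithmetic, would still need a derivation or exhaustive testing for the primes and degrees actually evaluated.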

minor comments (2)

[BAT description] Notation for the basis alignment in BAT could be introduced with a small worked example (one prime, small degree) to make the transformation concrete for readers unfamiliar with the technique.
[Introduction] The abstract lists five baseline libraries; a short table in the introduction summarizing their target platforms and key optimizations would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback highlighting the need for more explicit support of the BAT equivalence claim and additional experimental details. We will revise the manuscript to include these elements, strengthening the presentation of our results on TPU-based HE acceleration.

Referee: [Abstract, BAT paragraph] Abstract (paragraph describing BAT): the claim that BAT converts high-precision modular arithmetic into exact INT8 matrix multiplications without precision loss or overflow is asserted but not accompanied by a derivation, invariant proof, or explicit verification that every intermediate value and final modular reduction matches the original arithmetic for the primes and polynomial degrees used in the evaluated schemes. This equivalence is load-bearing for both correctness and the reported performance numbers.

Authors: The manuscript asserts the no-precision-loss property based on the BAT design, but we acknowledge that a full derivation and verification are not present in the current version. We will add this in the revision, including the mathematical invariant and checks for the specific parameters used in our experiments. revision: yes

Referee: [Experimental evaluation] Experimental evaluation (throughput-per-watt results): the comparisons against the five named libraries are presented on real TPU v6e hardware, yet the manuscript does not supply sufficient detail on precision handling during BAT, measurement methodology (e.g., power metering, batch sizes), or workload selection to allow independent assessment of whether these factors affect the central claim of superiority.

Authors: We agree that more details are needed for reproducibility. In the revised version, we will include a detailed description of precision handling (referencing the new BAT proof), power metering methodology using TPU hardware counters, specific batch sizes employed, and the criteria for selecting the HE workloads and parameters. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on implementation and hardware measurement

The paper introduces CROSS, a compiler with BAT (converting high-precision modular ops to INT8 matmuls) and MAT (embedding reordering into kernels). Its strongest claim—higher throughput/watt on TPU v6e vs. listed baselines—is presented as the outcome of direct hardware execution and measurement, not any derivation, fitted parameter, or equation that reduces to its own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core mapping or performance numbers. The BAT correctness assertion is an implementation claim (with code released) rather than a self-referential mathematical step. This is the normal case of an engineering paper whose results are externally falsifiable via the provided implementation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard properties of modular arithmetic and matrix multiplication together with the assumption that the TPU hardware behaves as documented; no free parameters, new entities, or ad-hoc axioms are introduced beyond the compiler techniques themselves.

axioms (1)

domain assumption: Modular arithmetic operations remain correct when rewritten as low-precision matrix multiplications under the basis-aligned mapping.
Invoked in the description of BAT to justify the conversion from high-precision integers to INT8 matrix operations.

pith-pipeline@v0.9.0 · 5915 in / 1297 out tokens · 33266 ms · 2026-05-23T06:01:18.161350+00:00 · methodology

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

[1]

Number theoretic transforms to implement fast digital convolution,

R. Agarwal and C. Burrus, “Number theoretic transforms to implement fast digital convolution,” Proceedings of the IEEE, 1975

work page 1975
[2]

Heap: A fully homomorphic encryption accelerator with parallelized bootstrapping,

R. Agrawal, A. Chandrakasan, and A. Joshi, “Heap: A fully homomorphic encryption accelerator with parallelized bootstrapping,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 756–769

work page 2024
[3]

Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,

R. Agrawal, L. De Castro, C. Juvekar, A. Chandrakasan, V. Vaikuntanathan, and A. Joshi, “Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023

work page 2023
[4]

Fab: An fpga-based accelerator for bootstrappable fully homomorphic encryption,

R. Agrawal, L. de Castro, G. Yang, C. Juvekar, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, and A. Joshi, “Fab: An fpga-based accelerator for bootstrappable fully homomorphic encryption,” 2022

work page 2022
[5]

Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,

C. Agulló-Domingo, Óscar Vera-López, S. Guzelhan, L. Daksha, A. E. Jerari, K. Shivdikar, R. Agrawal, D. Kaeli, A. Joshi, and J. L. Abellán, “Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,” 2025. [Online]. Available: https://arxiv.org/abs/2507.04775

work page arXiv 2025
[6]

Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,

A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, “Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,” IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 2, pp. 941–956, 2021

work page 2021
[7]

Homomorphic encryption standard,

M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine et al., “Homomorphic encryption standard,” Protecting privacy through homomorphic encryption, 2021

work page 2021
[8]

Pallas: a jax kernel language,

J. authors, “Pallas: a jax kernel language,” 2024. [Online]. Available: https://jax.readthedocs.io/en/latest/pallas/index.html

work page 2024
[9]

Openfhe: Open-source fully homomorphic encryption library,

A. A. Badawi, J. Bates, F. Bergamaschi, D. B. Cousins, S. Erabelli, N. Genise, S. Halevi, H. Hunt, A. Kim, Y. Lee, Z. Liu, D. Micciancio, I. Quah, Y. Polyakov, S. R.V., K. Rohloff, J. Saylor, D. Suponitsky, M. Triplett, V. Vaikuntanathan, and V. Zucca, “Openfhe: Open-source fully homomorphic encryption library,” Cryptology ePrint Archive, Paper 2022/...

work page 2022
[10]

Demystifying bootstrapping in fully homomorphic encryption,

A. A. Badawi and Y. Polyakov, “Demystifying bootstrapping in fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2023/149. [Online]. Available: https://eprint.iacr.org/2023/149

work page 2023
[12]

Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,

P. Barrett, “Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,” in Proceedings on Advances in Cryptology—CRYPTO ’86. Berlin, Heidelberg: Springer-Verlag, 1987, p. 311–323

work page 1987
[13]

Intel HEXL (release 1.2),

F. Boemer, S. Kim, G. Seifu, F. D. de Souza, V. Gopal et al., “Intel HEXL (release 1.2),” https://github.com/intel/hexl, Sep. 2021

work page 2021
[14]

JAX: composable transformations of Python+NumPy programs,

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/google/jax

work page 2018
[15]

Low Latency Privacy Preserving Inference

A. Brutzkus, O. Elisha, and R. Gilad-Bachrach, “Low latency privacy preserving inference,” 2019. [Online]. Available: https://arxiv.org/abs/1812.10659

work page internal anchor Pith review Pith/arXiv arXiv 2019
[16]

A full rns variant of approximate homomorphic encryption,

J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “A full rns variant of approximate homomorphic encryption,” in Selected Areas in Cryptography–SAC 2018: 25th International Conference, Calgary, AB, Canada, August 15–17, 2018, Revised Selected Papers 25. Springer, 2019, pp. 347–368

work page 2018
[17]

Homomorphic encryption for arithmetic of approximate numbers,

J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” Cryptology ePrint Archive, Paper 2016/421, 2016. [Online]. Available: https://eprint.iacr.org/2016/421

work page 2016
[18]

Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,

S. Cheon, Y. Lee, D. Kim, J. M. Lee, S. Jung, T. Kim, D. Lee, and H. Kim, “Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,” in 33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 6993–7010

work page 2024
[19]

Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,

W. Choi, J. Kim, and J. H. Ahn, “Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,” in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’26. New York, NY, USA: Association for Computing Machinery, 2025, p. 35–49. [Online]...

work page doi:10.1145/3760250.3762223 2025
[20]

Profile your model on cloud tpu nodes,

G. Cloud, “Profile your model on cloud tpu nodes,” 2024. [Online]. Available: https://cloud.google.com/tpu/docs/cloud-tpu-tools

work page 2024
[21]

Matrix multiplication background user’s guide,

NVIDIA Corporation, “Matrix multiplication background user’s guide,” NVIDIA. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html

work page
[22]

W. J. Dally and B. P. Towles, Principles and practices of interconnection networks. Elsevier, 2004

work page 2004
[23]

Chet: Compiler and runtime for homomorphic evaluation of tensor programs,

R. Dathathri, O. Saarikivi, H. Chen, K. Laine, K. Lauter, S. Maleki, M. Musuvathi, and T. Mytkowicz, “Chet: Compiler and runtime for homomorphic evaluation of tensor programs,” 2018

work page 2018
[24]

Does fully homomorphic encryption need compute acceleration?

L. de Castro, R. Agrawal, R. Yazicigil, A. Chandrakasan, V. Vaikuntanathan, C. Juvekar, and A. Joshi, “Does fully homomorphic encryption need compute acceleration?” 2021

work page 2021
[25]

Orion: A fully homomorphic encryption framework for deep learning,

A. Ebel, K. Garimella, and B. Reagen, “Orion: A fully homomorphic encryption framework for deep learning,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025

work page 2025
[26]

Warpdrive: Gpu-based fully homomorphic encryption acceleration leveraging tensor and cuda cores,

G. Fan, M. Zhang, F. Zheng, S. Fan, T. Zhou, X. Deng, W. Tang, L. Kong, Y. Song, and S. Yan, “Warpdrive: Gpu-based fully homomorphic encryption acceleration leveraging tensor and cuda cores,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025, pp. 1187–1200

work page 2025
[27]

Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,

G. Fan, F. Zheng, L. Wan, L. Gao, Y. Zhao, J. Dong, Y. Song, Y. Wang, and J. Lin, “Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023, pp. 798–808

work page 2023
[28]

Tensorfhe: Achieving practical computation on encrypted data using gpgpu,

S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, “Tensorfhe: Achieving practical computation on encrypted data using gpgpu,” 2022

work page 2022
[29]

BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,

R. Geelen, M. V. Beirendonck, H. V. L. Pereira, B. Huffman, T. McAuley, B. Selfridge, D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2022/657, 2022. [Online]. Available: https://eprint.iacr.org/2022/657

work page 2022
[30]

Google Cloud TPU,

Google, “Google Cloud TPU,” 2024. [Online]. Available: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm

work page 2024
[31]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” [Online]. Available: https://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Logistic regression on homomorphic encrypted data at scale,

K. Han, S. Hong, J. H. Cheon, and D. Park, “Logistic regression on homomorphic encrypted data at scale,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, ser. AAAI’19/IAAI’1...

work page doi:10.1609/aaai.v33i01.33019466 2019
[34]

Cinnamon: A framework for scale-out encrypted ai,

S. Jayashankar, E. Chen, T. Tang, W. Zheng, and D. Skarlatos, “Cinnamon: A framework for scale-out encrypted ai,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 133–150. [Online]. Av...

work page arXiv 2025
[35]

Ten lessons from three generations shaped google’s tpuv4i,

N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped google’s tpuv4i,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021

work page 2021
[36]

Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,

N. P. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” 2023

work page 2023
[37]

A domain-specific supercomputer for training deep neural networks,

N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. A. Patterson, “A domain-specific supercomputer for training deep neural networks,” Commun. ACM, vol. 63, no. 7, pp. 67–78, 2020. [Online]. Available: https://doi.org/10.1145/3360307

work page doi:10.1145/3360307 2020
[38]

Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus,

W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y. Lee, “Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 114–148, 2021.

work page 2021
[39]

Gazelle: A Low Latency Framework for Secure Neural Network Inference

C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, “Gazelle: A low latency framework for secure neural network inference,” 2018. [Online]. Available: https://arxiv.org/abs/1801.05507

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

Revisiting homomorphic encryption schemes for finite fields,

A. Kim, Y. Polyakov, and V. Zucca, “Revisiting homomorphic encryption schemes for finite fields,” Cryptology ePrint Archive, Paper 2021/204, 2021. [Online]. Available: https://eprint.iacr.org/2021/204

work page 2021
[41]

Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,

J. Kim, S. Kim, J. Choi, J. Park, D. Kim, and J. H. Ahn, “Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/...

work page doi:10.1145/3579371.3589053 2023
[42]

Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,

J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, “Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 1237–1254

work page 2022
[43]

Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,

S. Kim, W. Jung, J. Park, and J. H. Ahn, “Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,” in 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, Oct. 2020, p. 264–275. [Online]. Available: http://dx.doi.org/10.1109/IISWC50251.2020.00033

work page doi:10.1109/iiswc50251.2020.00033 2020
[44]

Bts: An accelerator for bootstrappable fully homomorphic encryption,

S. Kim, J. Kim, M. J. Kim, W. Jung, J. Kim, M. Rhu, and J. H. Ahn, “Bts: An accelerator for bootstrappable fully homomorphic encryption,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 711–725. [Online]. Available: https://doi.org/10.1145/3...

work page doi:10.1145/3470496.3527415 2022
[45]

Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,

W.-K. Lee, S. Akleylek, D. C.-K. Wong, W.-S. Yap, B.-M. Goi, and S.-O. Hwang, “Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,” J. Supercomput., vol. 77, no. 4, p. 3289–3314, Apr. 2021. [Online]. Available: https://doi.org/10.1007/s11227-020-03392-x

work page doi:10.1007/s11227-020-03392-x 2021
[46]

Error-latency-aware scale management for fully homomorphic encryption,

Y. Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Error-latency-aware scale management for fully homomorphic encryption,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023

work page 2023
[47]

Performance-aware scale analysis with reserve for homomorphic encryption,

Y. Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Performance-aware scale analysis with reserve for homomorphic encryption,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024

work page 2024
[48]

Hecate: Performance-aware scale optimization for homomorphic encryption compiler,

Y. Lee, S. Heo, S. Cheon, S. Jeong, C. Kim, E. Kim, D. Lee, and H. Kim, “Hecate: Performance-aware scale optimization for homomorphic encryption compiler,” in 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2022, pp. 193–204

work page 2022
[49]

Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,

Q. Li and R. Zong, “Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22227

work page arXiv 2025
[50]

A large-scale survey on the usability of ai programming assistants: Successes and challenges,

J. T. Liang, C. Yang, and B. A. Myers, “A large-scale survey on the usability of ai programming assistants: Successes and challenges,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, 2024, pp. 1–13

work page 2024
[51]

Lattice signatures without trapdoors,

V. Lyubashevsky, “Lattice signatures without trapdoors,” in Advances in Cryptology – EUROCRYPT 2012, D. Pointcheval and T. Johansson, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 738–755

work page 2012
[52]

Modular multiplication without trial division,

P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985

work page 1985
[53]

Lattigo: A multiparty homomorphic encryption library in go,

C. V. Mouchet, J.-P. Bossuat, J. R. Troncoso-Pastoriza, and J.-P. Hubaux, “Lattigo: A multiparty homomorphic encryption library in go,” in Proceedings of the 8th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, 2020, pp. 64–70

work page 2020
[54]

GPT-4o System Card

OpenAI, “Gpt-4o system card,” 2024. [Online]. Available: https://arxiv.org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Strix: An end-to-end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,

A. Putra, Prasetiyo, Y. Chen, J. Kim, and J.-Y. Kim, “Strix: An end-to-end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,” 2023

work page 2023
[56]

Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,

X. Ren, Z. Chen, Z. Gu, Y. Lu, R. Zhong, W.-J. Lu, J. Zhang, Y. Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y. Xie, “Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,” in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6

work page 2023
[57]

Heax: An architecture for computing on encrypted data,

M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “Heax: An architecture for computing on encrypted data,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1295–1309. [Online]. Available: https:...

work page doi:10.1145/3373376.3378523 2020
[58]

F1: A fast and programmable accelerator for fully homomorphic encryption,

N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic encryption,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 238–252. [Online]. Availab...

work page doi:10.1145/3466752 2021
[59]

Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,

N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, “Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY, USA: Association for Computing Machi...

work page 2022
[60]

Generative artificial intelligence: A systematic review and applications,

S. S. Sengar, A. B. Hasan, S. Kumar, and F. Carroll, “Generative artificial intelligence: A systematic review and applications,” 2024. [Online]. Available: https://arxiv.org/abs/2405.11029

work page arXiv 2024
[61]

High-throughput polynomial multiplier architecture for lattice-based cryptography,

T. Shimada and M. Ikeda, “High-throughput polynomial multiplier architecture for lattice-based cryptography,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1–5

work page 2021
[62]

Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,

K. Shivdikar, Y. Bao, R. Agrawal, M. Shen, G. Jonatan, E. Mora, A. Ingare, N. Livesay, J. L. Abellán, J. Kim et al., “Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,” arXiv preprint arXiv:2309.11001, 2023

work page arXiv 2023
[63]

Accelerating polynomial multiplication for homomorphic encryption on gpus,

K. Shivdikar, G. Jonatan, E. Mora, N. Livesay, R. Agrawal, A. Joshi, J. Abellan, J. Kim, and D. Kaeli, “Accelerating polynomial multiplication for homomorphic encryption on gpus,” 2022. [Online]. Available: https://arxiv.org/abs/2209.01290

work page arXiv 2022
[64]

Ntl: A library for doing number theory,

V. Shoup et al., “Ntl: A library for doing number theory,” 2001

work page 2001
[65]

Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,

S. Sinha Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, “Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019

work page 2019
[66]

Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,

Y. Sun, S. Fan, Z. Yin, X. Song, X. Hu, Z. Du, Q. Guo, W. Xu, R. Hou, D. Meng, S. Bian, and M. Zhan, “Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,” IEEE Transactions on Computers, pp. 1–14, 2025

work page 2025
[67]

Client-optimized algorithms and acceleration for encrypted compute offloading,

M. van der Hagen and B. Lucia, “Client-optimized algorithms and acceleration for encrypted compute offloading,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’22. New York, NY, USA: Association for Computing Machinery, 2022

work page 2022
[68]

Chameleon: An efficient fhe scheme switching acceleration on gpus,

Z. Wang, H. He, L. Zhao, P. Li, Z. Li, D. Meng, and R. Hou, “Chameleon: An efficient fhe scheme switching acceleration on gpus,” [Online]. Available: https://arxiv.org/abs/2410.05934

work page arXiv
[70]

He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,

Z. Wang, P. Li, R. Hou, Z. Li, J. Cao, X. Wang, and D. Meng, “He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 4, pp. 1067–1081, 2023

work page 2023
[71]

WISE-HE. Wise. GitHub repository; main branch; commit f0689fd (“init”). [Online]. Available: https://github.com/WISE-HE/WISE

work page
[72]

Phantom: A cuda-accelerated word-wise homomorphic encryption library,

H. Yang, S. Shen, W. Dai, L. Zhou, Z. Liu, and Y. Zhao, “Phantom: A cuda-accelerated word-wise homomorphic encryption library,” IEEE Trans. Dependable Secur. Comput., vol. 21, no. 5, pp. 4895–4906, Sep. 2024. [Online]. Available: https://doi.org/10.1109/TDSC.2024.3363900

work page doi:10.1109/tdsc.2024.3363900 2024
[74]

Poseidon: Practical homomorphic encryption accelerator,

Y. Yang, H. Zhang, S. Fan, H. Lu, M. Zhang, and X. Li, “Poseidon: Practical homomorphic encryption accelerator,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 870–881

work page 2023
[75]

Accelerating encrypted computing on intel gpus,

Y. Zhai, M. Ibrahim, Y. Qiu, F. Boemer, Z. Chen, A. Titov, and A. Lyashevsky, “Accelerating encrypted computing on intel gpus,” [Online]. Available: https://arxiv.org/abs/2109.14704

work page arXiv
[77]

Sok: Fully homomorphic encryption accelerators,

J. Zhang, X. Cheng, L. Yang, J. Hu, X. Liu, and K. Chen, “Sok: Fully homomorphic encryption accelerators,” ACM Comput. Surv., vol. 56, no. 12, Oct. 2024. [Online]. Available: https://doi.org/10.1145/3676955

work page doi:10.1145/3676955 2024
[78]

Hengine: A high performance optimization framework on a gpu for homomorphic encryption,

J. Zhao, H. Yang, M. Hao, W. Zhang, H. He, and D. Wang, “Hengine: A high performance optimization framework on a gpu for homomorphic encryption,” ACM Trans. Archit. Code Optim., vol. 22, no. 2, Jul. 2025. [Online]. Available: https://doi.org/10.1145/3732942

work page doi:10.1145/3732942 2025
[79]

Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,

Y. Zhu, X. Wang, L. Ju, and S. Guo, “Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 896–907.

APPENDIX A. Abstract: We provide scripts to reproduce latency of BAT (Tab. V) and BConv (Tab. VI), throughput of NTT (Tab. VII, ...

work page doi:10.5281/zenodo 2023
[80]

The NTT and INTT are computationally intensive, accounting for approximately 45.1% to 86.3% of the overall latency in various HE operators

Radix-2 Cooley-Tukey NTT algorithm (Butterfly NTT): The NTT converts polynomial representations from the coefficient domain to the evaluation domain, where polynomial multiplication simplifies to element-wise (vectorized) coefficient multiplication. The NTT and INTT are computationally intensive, accounting for approximately 45.1% to 86.3% of the over...

work page

Showing first 80 references.

[1] [1]

Number theoretic transforms to implement fast digital convolution,

R. Agarwal and C. Burrus, “Number theoretic transforms to implement fast digital convolution,”Proceedings of the IEEE, 1975

work page 1975

[2] [2]

Heap: A fully ho- momorphic encryption accelerator with parallelized bootstrapping,

R. Agrawal, A. Chandrakasan, and A. Joshi, “Heap: A fully ho- momorphic encryption accelerator with parallelized bootstrapping,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 756–769

work page 2024

[3] [3]

Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,

R. Agrawal, L. De Castro, C. Juvekar, A. Chandrakasan, V . Vaikun- tanathan, and A. Joshi, “Mad: Memory-aware design techniques for accelerating fully homomorphic encryption,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY , USA: Association for Computing Machinery, 2023

work page 2023

[4] [4]

Fab: An fpga-based accel- erator for bootstrappable fully homomorphic encryption,

R. Agrawal, L. de Castro, G. Yang, C. Juvekar, R. Yazicigil, A. Chan- drakasan, V . Vaikuntanathan, and A. Joshi, “Fab: An fpga-based accel- erator for bootstrappable fully homomorphic encryption,” 2022

work page 2022

[5] [5]

Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,

C. Agull ´o-Domingo, ´Oscar Vera-L´opez, S. Guzelhan, L. Daksha, A. E. Jerari, K. Shivdikar, R. Agrawal, D. Kaeli, A. Joshi, and J. L. Abell ´an, “Fideslib: A fully-fledged open-source fhe library for efficient ckks on gpus,” 2025. [Online]. Available: https://arxiv.org/abs/2507.04775

work page arXiv 2025

[6] [6]

Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,

A. Al Badawi, Y . Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohloff, “Implementation and performance evaluation of rns variants of the bfv homomorphic encryption scheme,”IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 2, pp. 941–956, 2021

work page 2021

[7] [7]

Homomorphic encryption standard,

M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laineet al., “Homomorphic encryption standard,”Protecting privacy through homomorphic encryption, 2021

work page 2021

[8] [8]

Pallas: a jax kernel language,

J. authors, “Pallas: a jax kernel language,” 2024. [Online]. Available: https://jax.readthedocs.io/en/latest/pallas/index.html

work page 2024

[9] [9]

Openfhe: Open-source fully homomorphic encryption library,

A. A. Badawi, J. Bates, F. Bergamaschi, D. B. Cousins, S. Erabelli, N. Genise, S. Halevi, H. Hunt, A. Kim, Y . Lee, Z. Liu, D. Micciancio, I. Quah, Y . Polyakov, S. R.V ., K. Rohloff, J. Saylor, D. Suponitsky, M. Triplett, V . Vaikuntanathan, and V . Zucca, “Openfhe: Open-source fully homomorphic encryption library,” Cryptology ePrint Archive, Paper 2022/...

work page 2022

[10] [10]

Demystifying bootstrapping in fully homomorphic encryption,

A. A. Badawi and Y . Polyakov, “Demystifying bootstrapping in fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2023/149,

work page 2023

[11] [11]

Available: https://eprint.iacr.org/2023/149

[Online]. Available: https://eprint.iacr.org/2023/149

work page 2023

[12] [12]

Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,

P. Barrett, “Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor,” inProceed- ings on Advances in Cryptology—CRYPTO ’86. Berlin, Heidelberg: Springer-Verlag, 1987, p. 311–323

work page 1987

[13] [13]

Intel HEXL (release 1.2),

F. Boemer, S. Kim, G. Seifu, F. D. de Souza, V . Gopalet al., “Intel HEXL (release 1.2),” https://github.com/intel/hexl, Sep. 2021

work page 2021

[14] [14]

JAX: composable transformations of Python+NumPy programs,

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/google/jax

work page 2018

[15] [15]

Low Latency Privacy Preserving Inference

A. Brutzkus, O. Elisha, and R. Gilad-Bachrach, “Low latency privacy preserving inference,” 2019. [Online]. Available: https: //arxiv.org/abs/1812.10659

work page internal anchor Pith review Pith/arXiv arXiv 2019

[16] [16]

A full rns variant of approximate homomorphic encryption,

J. H. Cheon, K. Han, A. Kim, M. Kim, and Y . Song, “A full rns variant of approximate homomorphic encryption,” inSelected Areas in Cryptography–SAC 2018: 25th International Conference, Calgary, AB, Canada, August 15–17, 2018, Revised Selected Papers 25. Springer, 2019, pp. 347–368

work page 2018

[17] [17]

Homomorphic encryption for arithmetic of approximate numbers,

J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Homomorphic encryption for arithmetic of approximate numbers,” Cryptology ePrint Archive, Paper 2016/421, 2016, https://eprint.iacr.org/2016/421. [Online]. Available: https://eprint.iacr.org/2016/421

work page 2016

[18] [18]

Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,

S. Cheon, Y . Lee, D. Kim, J. M. Lee, S. Jung, T. Kim, D. Lee, and H. Kim, “Dacapo: Automatic bootstrapping management for efficient fully homomorphic encryption,” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 6993–7010

work page 2024

[19] [19]

Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,

W. Choi, J. Kim, and J. H. Ahn, “Cheddar: A swift fully homomorphic encryption library designed for gpu architectures,” inProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’26. New York, NY , USA: Association for Computing Machinery, 2025, p. 35–49. [Online]...

work page doi:10.1145/3760250.3762223 2025

[20] [20]

Profile your model on cloud tpu nodes,

G. Cloud, “Profile your model on cloud tpu nodes,” 2024. [Online]. Available: https://cloud.google.com/tpu/docs/cloud-tpu-tools

work page 2024

[21] [21]

Corporation

N. Corporation. Matrix multiplication background user’s guide. NVIDIA. [Online]. Available: https://docs.nvidia.com/deeplearning/ performance/dl-performance-matrix-multiplication/index.html

work page

[22] [22]

W. J. Dally and B. P. Towles,Principles and practices of interconnection networks. Elsevier, 2004

work page 2004

[23] [23]

Chet: Compiler and runtime for homomorphic evaluation of tensor programs,

R. Dathathri, O. Saarikivi, H. Chen, K. Laine, K. Lauter, S. Maleki, M. Musuvathi, and T. Mytkowicz, “Chet: Compiler and runtime for homomorphic evaluation of tensor programs,” 2018

work page 2018

[24] [24]

Does fully homomorphic encryption need compute acceleration?

L. de Castro, R. Agrawal, R. Yazicigil, A. Chandrakasan, V . Vaikun- tanathan, C. Juvekar, and A. Joshi, “Does fully homomorphic encryption need compute acceleration?” 2021

work page 2021

[25] [25]

Orion: A fully homomorphic encryption framework for deep learning,

A. Ebel, K. Garimella, and B. Reagen, “Orion: A fully homomorphic encryption framework for deep learning,” inProceedings of the 30th ACM International Conference on Architectural Support for Program- ming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY , USA: Association for Computing Machinery, 2025

work page 2025

[26] [26]

Warpdrive: Gpu-based fully homo- morphic encryption acceleration leveraging tensor and cuda cores,

G. Fan, M. Zhang, F. Zheng, S. Fan, T. Zhou, X. Deng, W. Tang, L. Kong, Y . Song, and S. Yan, “Warpdrive: Gpu-based fully homo- morphic encryption acceleration leveraging tensor and cuda cores,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025, pp. 1187–1200

work page 2025

[27] [27]

Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,

G. Fan, F. Zheng, L. Wan, L. Gao, Y . Zhao, J. Dong, Y . Song, Y . Wang, and J. Lin, “Towards faster fully homomorphic encryption implementation with integer and floating-point computing power of gpus,” in2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023, pp. 798–808

work page 2023

[28] [28]

Tensorfhe: Achieving practical computation on encrypted data using gpgpu,

S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, “Tensorfhe: Achieving practical computation on encrypted data using gpgpu,” 2022

work page 2022

[29] [29]

BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,

R. Geelen, M. V . Beirendonck, H. V . L. Pereira, B. Huffman, T. McAuley, B. Selfridge, D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “BASALISC: Programmable hardware accelerator for BGV fully homomorphic encryption,” Cryptology ePrint Archive, Paper 2022/657, 2022. [Online]. Available: https://eprint.iacr.org/2022/657

work page 2022

[30] [30]

Google Cloud TPU,

Google, “Google Cloud TPU,” 2024. [Online]. Available: https: //cloud.google.com/tpu/docs/system-architecture-tpu-vm

work page 2024

[31] [31]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,”

work page

[32] [32]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

[Online]. Available: https://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Logistic regression on homomorphic encrypted data at scale,

K. Han, S. Hong, J. H. Cheon, and D. Park, “Logistic regression on homomorphic encrypted data at scale,” inProceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, ser. AAAI’19/IAAI’1...

work page doi:10.1609/aaai.v33i01.33019466 2019

[34] [34]

Cinnamon: A framework for scale-out encrypted ai,

S. Jayashankar, E. Chen, T. Tang, W. Zheng, and D. Skarlatos, “Cinnamon: A framework for scale-out encrypted ai,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 133–150. [Online]. Av...

work page arXiv 2025

[35] [35]

Ten lessons from three generations shaped google’s tpuv4i,

N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped google’s tpuv4i,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021

work page 2021

[36] [36]

Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,

N. P. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” 2023

work page 2023

[37] [37]

Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David A

N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. A. Patterson, “A domain-specific supercomputer for training deep neural networks,”Commun. ACM, vol. 63, no. 7, pp. 67–78, 2020. [Online]. Available: https://doi.org/10.1145/3360307

work page doi:10.1145/3360307 2020

[38] [38]

Over 100x faster bootstrapping in fully homomorphic encryption through memory- centric optimization with gpus,

W. Jung, S. Kim, J. H. Ahn, J. H. Cheon, and Y . Lee, “Over 100x faster bootstrapping in fully homomorphic encryption through memory- centric optimization with gpus,”IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 114–148, 2021. 14

work page 2021

[39] [39]

Gazelle: A Low Latency Framework for Secure Neural Network Inference

C. Juvekar, V . Vaikuntanathan, and A. Chandrakasan, “Gazelle: A low latency framework for secure neural network inference,” 2018. [Online]. Available: https://arxiv.org/abs/1801.05507

work page internal anchor Pith review Pith/arXiv arXiv 2018

[40] [40]

Revisiting homomorphic encryption schemes for finite fields,

A. Kim, Y . Polyakov, and V . Zucca, “Revisiting homomorphic encryption schemes for finite fields,” Cryptology ePrint Archive, Paper 2021/204, 2021. [Online]. Available: https://eprint.iacr.org/2021/204

work page 2021

[41] [41]

Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,

J. Kim, S. Kim, J. Choi, J. Park, D. Kim, and J. H. Ahn, “Sharp: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption,” inProceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. New York, NY , USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/...

work page doi:10.1145/3579371.3589053 2023

[42] [42]

Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,

J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, “Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse,” in2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 1237–1254

work page 2022

[43] [43]

Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,

S. Kim, W. Jung, J. Park, and J. H. Ahn, “Accelerating number theoretic transformations for bootstrappable homomorphic encryption on gpus,” in2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, Oct. 2020, p. 264–275. [Online]. Available: http://dx.doi.org/10.1109/IISWC50251.2020.00033

work page doi:10.1109/iiswc50251.2020.00033 2020

[44] [44]

Bts: An accelerator for bootstrappable fully homomorphic encryption,

S. Kim, J. Kim, M. J. Kim, W. Jung, J. Kim, M. Rhu, and J. H. Ahn, “Bts: An accelerator for bootstrappable fully homomorphic encryption,” inProceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 711–725. [Online]. Available: https://doi.org/10.1145/3...

work page doi:10.1145/3470496.3527415 2022

[45] [45]

Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,

W.-K. Lee, S. Akleylek, D. C.-K. Wong, W.-S. Yap, B.-M. Goi, and S.-O. Hwang, “Parallel implementation of nussbaumer algorithm and number theoretic transform on a gpu platform: application to qtesla,” J. Supercomput., vol. 77, no. 4, p. 3289–3314, Apr. 2021. [Online]. Available: https://doi.org/10.1007/s11227-020-03392-x

work page doi:10.1007/s11227-020-03392-x 2021

[46] [46]

Error-latency-aware scale management for fully homomorphic encryption,

Y . Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Error-latency-aware scale management for fully homomorphic encryption,” in32nd USENIX Security Symposium (USENIX Security 23), 2023

work page 2023

[47] [47]

Performance-aware scale analysis with reserve for homomorphic encryption,

Y . Lee, S. Cheon, D. Kim, D. Lee, and H. Kim, “Performance-aware scale analysis with reserve for homomorphic encryption,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2024

work page 2024

[48] [48]

Hecate: Performance-aware scale optimization for homomor- phic encryption compiler,

Y . Lee, S. Heo, S. Cheon, S. Jeong, C. Kim, E. Kim, D. Lee, and H. Kim, “Hecate: Performance-aware scale optimization for homomor- phic encryption compiler,” in2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2022, pp. 193–204

work page 2022

[49] [49]

Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,

Q. Li and R. Zong, “Cat: A gpu-accelerated fhe framework with its application to high-precision private dataset query,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22227

work page arXiv 2025

[50] [50]

A large-scale survey on the usability of ai programming assistants: Successes and challenges,

J. T. Liang, C. Yang, and B. A. Myers, “A large-scale survey on the usability of ai programming assistants: Successes and challenges,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, 2024, pp. 1–13

work page 2024

[51] [51]

Lattice signatures without trapdoors,

V . Lyubashevsky, “Lattice signatures without trapdoors,” inAdvances in Cryptology – EUROCRYPT 2012, D. Pointcheval and T. Johansson, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 738–755

work page 2012

[52] [52]

Modular multiplication without trial division,

P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985

work page 1985

[53] [53]

Lattigo: A multiparty homomorphic encryption library in go,

C. V . Mouchet, J.-P. Bossuat, J. R. Troncoso-Pastoriza, and J.-P. Hubaux, “Lattigo: A multiparty homomorphic encryption library in go,” in Proceedings of the 8th Workshop on Encrypted Computing and Applied Homomorphic Cryptography, 2020, pp. 64–70

work page 2020

[54] [54]

GPT-4o System Card

OpenAI, “Gpt-4o system card,” 2024. [Online]. Available: https: //arxiv.org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

Strix: An end-to- end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,

A. Putra, Prasetiyo, Y . Chen, J. Kim, and J.-Y . Kim, “Strix: An end-to- end streaming architecture with two-level ciphertext batching for fully homomorphic encryption with programmable bootstrapping,” 2023

work page 2023

[56] [56]

Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,

X. Ren, Z. Chen, Z. Gu, Y . Lu, R. Zhong, W.-J. Lu, J. Zhang, Y . Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y . Xie, “Cham: A customized homomorphic encryption accelerator for fast matrix-vector product,” in2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6

work page 2023

[57] [57]

Heax: An architecture for computing on encrypted data,

M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “Heax: An architecture for computing on encrypted data,” inProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 1295–1309. [Online]. Available: https:...

work page doi:10.1145/3373376.3378523 2020

[58] [58]

F1: A fast and programmable accelerator for fully homomorphic encryption,

N. Samardzic, A. Feldmann, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic encryption,” inMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY , USA: Association for Computing Machinery, 2021, p. 238–252. [Online]. Availab...

work page doi:10.1145/3466752 2021

[59] [59]

Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,

N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, “Craterlake: A hardware accelerator for efficient unbounded computation on encrypted data,” inProceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY , USA: Association for Computing Machi...

work page 2022

[60] [60]

Generative artificial intelligence: A systematic review and applications,

S. S. Sengar, A. B. Hasan, S. Kumar, and F. Carroll, “Generative artificial intelligence: A systematic review and applications,” 2024. [Online]. Available: https://arxiv.org/abs/2405.11029

work page arXiv 2024

[61] [61]

High-throughput polynomial multiplier architecture for lattice-based cryptography,

T. Shimada and M. Ikeda, “High-throughput polynomial multiplier architecture for lattice-based cryptography,” in2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1–5

work page 2021

[62] [62]

Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,

K. Shivdikar, Y . Bao, R. Agrawal, M. Shen, G. Jonatan, E. Mora, A. Ingare, N. Livesay, J. L. Abell ´an, J. Kimet al., “Gme: Gpu-based microarchitectural extensions to accelerate homomorphic encryption,” arXiv preprint arXiv:2309.11001, 2023

work page arXiv 2023

[63] [63]

Accelerating polynomial multiplication for homomorphic encryption on gpus,

K. Shivdikar, G. Jonatan, E. Mora, N. Livesay, R. Agrawal, A. Joshi, J. Abellan, J. Kim, and D. Kaeli, “Accelerating polynomial multiplication for homomorphic encryption on gpus,” 2022. [Online]. Available: https://arxiv.org/abs/2209.01290


[64]

Ntl: A library for doing number theory,

V. Shoup et al., “Ntl: A library for doing number theory,” 2001.


[65]

Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,

S. Sinha Roy, F. Turan, K. Jarvinen, F. Vercauteren, and I. Verbauwhede, “Fpga-based high-performance parallel architecture for homomorphic computing on encrypted data,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.


[66]

Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,

Y. Sun, S. Fan, Z. Yin, X. Song, X. Hu, Z. Du, Q. Guo, W. Xu, R. Hou, D. Meng, S. Bian, and M. Zhan, “Tensorfhe+: Fully homomorphic encryption acceleration based on linear algebra,” IEEE Transactions on Computers, pp. 1–14, 2025.


[67]

Client-optimized algorithms and acceleration for encrypted compute offloading,

M. van der Hagen and B. Lucia, “Client-optimized algorithms and acceleration for encrypted compute offloading,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’22. New York, NY, USA: Association for Computing Machinery, 2022.


[68]

Chameleon: An efficient fhe scheme switching acceleration on gpus,

Z. Wang, H. He, L. Zhao, P. Li, Z. Li, D. Meng, and R. Hou, “Chameleon: An efficient fhe scheme switching acceleration on gpus,” [Online]. Available: https://arxiv.org/abs/2410.05934

[70]

He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,

Z. Wang, P. Li, R. Hou, Z. Li, J. Cao, X. Wang, and D. Meng, “He-booster: An efficient polynomial arithmetic acceleration on gpus for fully homomorphic encryption,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 4, pp. 1067–1081, 2023.


[71]

WISE-HE. Wise. GitHub repository; main branch; commit f0689fd (“init”). [Online]. Available: https://github.com/WISE-HE/WISE


[72]

Phantom: A cuda-accelerated word-wise homomorphic encryption library,

H. Yang, S. Shen, W. Dai, L. Zhou, Z. Liu, and Y. Zhao, “Phantom: A cuda-accelerated word-wise homomorphic encryption library,” IEEE Trans. Dependable Secur. Comput., vol. 21, no. 5, pp. 4895–4906, Sep. 2024. [Online]. Available: https://doi.org/10.1109/TDSC.2024.3363900

[74]

Poseidon: Practical homomorphic encryption accelerator,

Y. Yang, H. Zhang, S. Fan, H. Lu, M. Zhang, and X. Li, “Poseidon: Practical homomorphic encryption accelerator,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 870–881.


[75]

Accelerating encrypted computing on intel gpus,

Y. Zhai, M. Ibrahim, Y. Qiu, F. Boemer, Z. Chen, A. Titov, and A. Lyashevsky, “Accelerating encrypted computing on intel gpus,” [Online]. Available: https://arxiv.org/abs/2109.14704

[77]

Sok: Fully homomorphic encryption accelerators,

J. Zhang, X. Cheng, L. Yang, J. Hu, X. Liu, and K. Chen, “Sok: Fully homomorphic encryption accelerators,”ACM Comput. Surv., vol. 56, no. 12, Oct. 2024. [Online]. Available: https://doi.org/10.1145/3676955


[78]

Hengine: A high performance optimization framework on a gpu for homomorphic encryption,

J. Zhao, H. Yang, M. Hao, W. Zhang, H. He, and D. Wang, “Hengine: A high performance optimization framework on a gpu for homomorphic encryption,”ACM Trans. Archit. Code Optim., vol. 22, no. 2, Jul. 2025. [Online]. Available: https://doi.org/10.1145/3732942


[79]

Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,

Y. Zhu, X. Wang, L. Ju, and S. Guo, “Fxhenn: Fpga-based acceleration framework for homomorphic encrypted cnn inference,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 896–907.

APPENDIX

A. Abstract

We provide scripts to reproduce latency of BAT (Tab. V) and BConv (Tab. VI), throughput of NTT (Tab. VII, ...



Radix-2 Cooley-Tukey NTT algorithm (Butterfly NTT): The NTT converts polynomial representations from the coefficient domain to the evaluation domain, where polynomial multiplication simplifies to element-wise (vectorized) coefficient multiplication. The NTT and INTT are computationally intensive, accounting for approximately 45.1% to 86.3% of the over...
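The coefficient-to-evaluation conversion described above can be illustrated with a minimal sketch of a recursive radix-2 Cooley-Tukey NTT. The parameters here are toy assumptions chosen for readability, not values from the paper: modulus q = 17, length n = 8, and omega = 9 as a primitive 8th root of unity mod 17. The function names are hypothetical. For simplicity the sketch computes a cyclic product (mod X^n - 1), whereas HE schemes typically use the negacyclic variant (mod X^n + 1), and it does not reflect the paper's TPU mapping.

```python
# Toy parameters (assumptions for illustration only, not from the paper).
Q = 17      # small prime modulus with 8 | (Q - 1)
N = 8       # transform length, a power of two
OMEGA = 9   # primitive 8th root of unity mod 17 (9**4 % 17 == 16, i.e. -1)


def ntt(a, omega, q):
    """Recursive radix-2 Cooley-Tukey NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)   # transform of even-indexed coefficients
    odd = ntt(a[1::2], omega_sq, q)    # transform of odd-indexed coefficients
    out = [0] * n
    w = 1                              # running twiddle factor omega**k
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q             # butterfly, upper output
        out[k + n // 2] = (even[k] - t) % q    # butterfly, lower output
        w = w * omega % q
    return out


def intt(a, omega, q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    inv_n = pow(len(a), -1, q)         # modular inverse (Python 3.8+)
    return [x * inv_n % q for x in ntt(a, pow(omega, -1, q), q)]


def cyclic_mul_ntt(a, b, omega, q):
    """Multiply mod (X^n - 1, q) via element-wise products in the NTT domain."""
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(fa, fb)], omega, q)


def cyclic_mul_naive(a, b, q):
    """Schoolbook O(n^2) cyclic product, used only as a reference check."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out


a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
assert cyclic_mul_ntt(a, b, OMEGA, Q) == cyclic_mul_naive(a, b, Q)
```

The final assertion checks the property the paragraph relies on: multiplying element-wise in the evaluation domain and transforming back gives the same result as direct polynomial multiplication in the coefficient domain.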
