pith. machine review for the scientific record.

arxiv: 2604.03425 · v1 · submitted 2026-04-03 · 💻 cs.CR · cs.AI · cs.DC

Recognition: 2 theorem links · Lean Theorem

AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:17 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.DC
keywords Fully Homomorphic Encryption · CKKS · Transformer Inference · Multi-GPU Parallelism · Privacy-Preserving AI · Communication Reduction · Long-Sequence Models · Hybrid Parallelism

The pith

By co-locating modulus-coherent and token-coherent ciphertexts according to joint Transformer and CKKS dependencies, AEGIS enables high-efficiency multi-GPU scaling for long-sequence homomorphic encrypted Transformer inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the memory limits and heavy inter-device communication that block fully homomorphic encryption from running Transformer models on long input sequences. It does so by deriving GPU placement from ciphertext dependencies created by both the model's dataflow and the encryption scheme's polynomial structure, then co-locating matching data and reordering operators to hide remaining communication behind computation. A reader would care because single-GPU memory quickly becomes insufficient for 2048-token encrypted inputs, and prior multi-GPU methods either replicate data or synchronize too often. The resulting system reports 96.62 percent scaling efficiency, 3.86 times end-to-end speedup, and 69.1 percent per-device memory reduction on four GPUs while cutting communication by up to 81.3 percent in attention layers.

Core claim

AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs this yields up to 96.62 percent scaling efficiency on four GPUs, 3.86 times end-to-end speedup, and 69.1 percent per-device memory reduction.
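The "modulus-coherent" half of the claim rests on a standard property of CKKS: coefficients live in a residue number system (RNS), and pointwise polynomial arithmetic proceeds independently per residue channel. A toy sketch (editorial illustration with made-up primes, not the paper's implementation) shows why a device holding one residue channel can multiply locally with no communication until a cross-modulus step is reached:

```python
# Toy RNS sketch: CKKS stores big coefficients as residues modulo
# several primes; ops on each residue channel are independent, which
# is why modulus-coherent placement defers communication until a
# cross-modulus step (e.g. rescaling). Primes here are illustrative.
from math import prod

PRIMES = [97, 101, 103]          # stand-ins for real CKKS RNS primes
M = prod(PRIMES)

def to_rns(x):
    return [x % p for p in PRIMES]

def from_rns(residues):
    # Chinese Remainder Theorem reconstruction.
    x = 0
    for r, p in zip(residues, PRIMES):
        n = M // p
        x += r * n * pow(n, -1, p)   # modular inverse (Python 3.8+)
    return x % M

a, b = 12345, 6789
# Each "device" holds one residue channel and multiplies locally.
local = [(ra * rb) % p for ra, rb, p in zip(to_rns(a), to_rns(b), PRIMES)]
assert from_rns(local) == (a * b) % M
```

The design question AEGIS answers is where the channel-independence breaks, because those are exactly the points where inter-GPU traffic becomes unavoidable.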

What carries the argument

Hybrid parallelism that places ciphertexts by jointly considering Transformer dataflow dependencies and CKKS modulus/token coherence, then overlaps collectives via operator reordering.
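The latency-hiding half of the mechanism can be sketched in miniature (hypothetical function names, not AEGIS code): operators with no data dependence on a collective's result are hoisted so that the collective and local computation run concurrently, and the schedules remain semantically equivalent.

```python
# Toy sketch of latency hiding via operator reordering. all_gather and
# local_ntt are stand-ins for a collective and an independent local op;
# the overlapped schedule returns the same value while the simulated
# communication is in flight during the computation.
from concurrent.futures import ThreadPoolExecutor
import time

def all_gather(shard):              # simulated collective
    time.sleep(0.05)
    return shard * 4                # pretend gather over 4 devices

def local_ntt(x):                   # independent local computation
    time.sleep(0.05)
    return x + 1

def serial(shard, x):
    g = all_gather(shard)           # computation waits on the collective
    return g + local_ntt(x)

def overlapped(shard, x):
    with ThreadPoolExecutor() as ex:
        fut = ex.submit(all_gather, shard)   # collective in flight...
        y = local_ntt(x)                     # ...while we compute
        return fut.result() + y

assert serial(3, 10) == overlapped(3, 10)    # same answer, less wall time
```

In the real system the "threads" are CUDA streams and the collectives are NCCL calls, but the scheduling invariant being exercised is the same.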

If this is right

  • Inter-GPU communication falls by up to 57.9 percent in feed-forward networks and 81.3 percent in self-attention for 2048-token inputs.
  • Four-GPU runs reach 96.62 percent scaling efficiency for long-sequence encrypted inference.
  • End-to-end inference accelerates by 3.86 times versus prior multi-GPU designs.
  • Memory footprint per GPU shrinks by 69.1 percent for the same workload.
  • Communication occurs only at application-level dependency points rather than at every encryption-level coupling.
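A back-of-envelope check (editorial arithmetic, not from the paper) shows the headline numbers are at least mutually plausible: if the 3.86x speedup were measured against a single-GPU run, four-GPU scaling efficiency would be speedup divided by device count, which lands within a hair of the reported 96.62 percent; the small gap suggests the two figures come from slightly different baselines or workloads.

```python
# Consistency check: scaling efficiency as speedup / device count.
speedup, num_gpus = 3.86, 4
efficiency = speedup / num_gpus
assert round(efficiency * 100, 1) == 96.5   # vs the reported 96.62%
```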

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-dependency placement logic could be applied to other FHE workloads that combine neural-network dataflow with arithmetic constraints such as matrix multiplications under RNS.
  • Extending operator reordering to deeper stacks or eight-plus GPUs might further hide latency on faster interconnects.
  • The memory reduction per device suggests the method could support even longer sequences on fixed hardware budgets without accuracy loss.

Load-bearing premise

Co-locating modulus-coherent and token-coherent ciphertexts and reordering polynomial operators preserves exact homomorphic correctness without hidden synchronization or numerical instability.
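Part of this premise is checkable in miniature (a toy sketch, not a CKKS proof): token-sharded elementwise work must give bit-identical results to an unsharded reference, so sharding and co-location by themselves introduce no drift. What the sketch does not model, and what the paper would need to argue, is that modulus switches and rotations survive the reordering.

```python
# Toy check: contiguous token shards processed per-"device" and
# concatenated match the unsharded reference exactly. f is a stand-in
# for a per-token encrypted operator; real CKKS correctness also
# depends on modulus-switch ordering, which this does not model.
def f(v):
    return [(3 * x + 7) % 97 for x in v]

tokens = list(range(16))
shards = [tokens[i * 4:(i + 1) * 4] for i in range(4)]   # 4 "devices"
sharded = [y for s in shards for y in f(s)]
assert sharded == f(tokens)
```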

What would settle it

Executing AEGIS on four GPUs with 2048-token encrypted inputs and verifying that decrypted outputs match a single-GPU reference while communication volume drops by the claimed percentages and scaling efficiency exceeds 90 percent.
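The shape of that settling experiment can be sketched with mock primitives (a toy linear "scheme" standing in for CKKS, purely illustrative): evaluate the sharded pipeline, decrypt, and require a bit-for-bit match against a single-device reference.

```python
# Hypothetical settling harness with a toy linearly homomorphic
# "encryption" (c = 5x mod 257) standing in for CKKS. Adding a
# ciphertext to itself encrypts the doubled plaintext, so sharded
# and single-device evaluation must decrypt identically.
P, KEY = 257, 5

def encrypt(x): return (KEY * x) % P
def decrypt(c): return (c * pow(KEY, -1, P)) % P

def double(cts):                     # "homomorphic" op: c + c encrypts 2x
    return [(c + c) % P for c in cts]

xs = list(range(8))
cts = [encrypt(x) for x in xs]
ref = [decrypt(c) for c in double(cts)]          # single-device reference
shards = [cts[i * 2:(i + 1) * 2] for i in range(4)]   # four "devices"
out = [decrypt(c) for s in shards for c in double(s)]
assert out == ref == [2 * x for x in xs]
```

The communication-volume and efficiency halves of the test would come from profiler counters rather than anything checkable in plaintext arithmetic.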

Figures

Figures reproduced from arXiv: 2604.03425 by Fan Yao, Ran Ran, Wujie Wen, Zhaoting Gong.

Figure 2. Encrypted Transformer inference pipeline with …
Figure 4. Memory breakdown of encrypted BERT-Base infer…
Figure 5. (a) Collective in RNS-parallel BSGS matrix multipli…
Figure 6. (a) Encrypted diagonal matrix multiplication. (b) …
Figure 7. Output-slot coupling: (a) weight slices within the …
Figure 8. Parallel self-attention: (a) data partition and communication insertion; (b) latency hiding via operator reordering.
Figure 10. Weak scaling: (a) upper bound on the number of …
Figure 11. Communication volume on two GPUs with 128 …
Figure 12. End-to-end inference (a) per-device memory con…
Figure 13. End-to-end inference on four GPUs: (a) per-device …
Original abstract

Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency. We present AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs, AEGIS reduces inter-GPU communication by up to 57.9% in feed-forward networks and 81.3% in self-attention versus prior state-of-the-art designs. On four GPUs, it achieves up to 96.62% scaling efficiency, 3.86x end-to-end speedup, and 69.1% per-device memory reduction. These results establish coordinated application-encryption parallelism as a practical foundation for scalable homomorphic Transformer inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents AEGIS, a system for scaling long-sequence homomorphic encrypted Transformer inference on multi-GPU systems through hybrid parallelism. It co-locates modulus-coherent and token-coherent ciphertexts based on joint Transformer and CKKS dependencies, reorders polynomial operators to overlap communication with computation, and reports substantial reductions in inter-GPU communication along with 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction on four GPUs for 2048-token inputs.

Significance. If the empirical results are supported by rigorous correctness verification, this work would be significant for advancing practical FHE applications in machine learning, particularly for long-context models where memory constraints are severe. The coordinated application-encryption approach offers a new paradigm for parallelism in encrypted computation.

major comments (3)
  1. [§4 (Hybrid Parallelism Design)] The reordering of polynomial operators to overlap collectives is central to the claimed communication reductions (57.9% in FFN, 81.3% in self-attention), but no formal invariant, dataflow analysis, or proof is provided to guarantee that this reordering preserves the exact sequence of CKKS operations including modulus switches and rotations required for correctness.
  2. [Experimental Results (§6)] The reported 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction lack details on experimental setup, baseline implementations, number of trials, error bars, or ablation studies separating the effects of co-location and reordering, making the quantitative claims impossible to assess.
  3. [§5.1 (Memory Reduction Analysis)] The 69.1% per-device memory reduction claim is presented without explicit before/after memory measurements or breakdown by component (weights vs. activations), which is load-bearing for the memory-scaling argument.
minor comments (2)
  1. [Abstract] The abstract mentions 'prior state-of-the-art designs' without naming them or providing citations in the provided text.
  2. [Figures 3-5] Figure legends and axis labels in the scaling plots would benefit from explicit single-GPU baseline times for direct visual comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for clarity and rigor.

Point-by-point responses
  1. Referee: [§4 (Hybrid Parallelism Design)] The reordering of polynomial operators to overlap collectives is central to the claimed communication reductions (57.9% in FFN, 81.3% in self-attention), but no formal invariant, dataflow analysis, or proof is provided to guarantee that this reordering preserves the exact sequence of CKKS operations including modulus switches and rotations required for correctness.

    Authors: We thank the referee for this observation. The reordering is based on the joint dependency graph of Transformer dataflow and CKKS polynomial operations (ensuring modulus switches and rotations occur in equivalent order), but the manuscript indeed omits an explicit formal invariant or proof sketch. In the revised version, we will add a dataflow analysis section and a proof outline demonstrating that the reordering preserves semantic equivalence and the required CKKS operation sequence. revision: yes

  2. Referee: [Experimental Results (§6)] The reported 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction lack details on experimental setup, baseline implementations, number of trials, error bars, or ablation studies separating the effects of co-location and reordering, making the quantitative claims impossible to assess.

    Authors: We agree that the experimental section requires expansion for reproducibility. The revised manuscript will detail the hardware setup (GPU models and interconnect), baseline replication (how prior SOTA designs were implemented), number of trials (5 runs with averages and standard deviations), error bars, and ablation studies isolating co-location versus reordering contributions. revision: yes

  3. Referee: [§5.1 (Memory Reduction Analysis)] The 69.1% per-device memory reduction claim is presented without explicit before/after memory measurements or breakdown by component (weights vs. activations), which is load-bearing for the memory-scaling argument.

    Authors: We acknowledge this gap. The revised §5.1 will include explicit before/after per-device memory measurements obtained via profiling, with a component-wise breakdown separating weights from activations to substantiate the 69.1% reduction claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system measurements only

full rationale

The paper describes an engineering system (AEGIS) whose core claims are measured speedups, scaling efficiencies, and memory reductions on 2048-token inputs. No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. Device placement and operator reordering are design choices whose correctness is asserted via implementation and benchmark results rather than any self-referential proof or prediction step. Self-citations, if present, are not load-bearing for the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard CKKS and Transformer components already present in the cited prior literature.

pith-pipeline@v0.9.0 · 5564 in / 1114 out tokens · 43646 ms · 2026-05-13T19:17:11.257048+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
