Recognition: 2 theorem links · Lean Theorem
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
Pith reviewed 2026-05-13 19:17 UTC · model grok-4.3
The pith
By co-locating modulus-coherent and token-coherent ciphertexts according to joint Transformer and CKKS dependencies, AEGIS enables high-efficiency multi-GPU scaling for long-sequence homomorphic encrypted Transformer inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs this yields up to 96.62 percent scaling efficiency on four GPUs, 3.86 times end-to-end speedup, and 69.1 percent per-device memory reduction.
What carries the argument
Hybrid parallelism that places ciphertexts by jointly considering Transformer dataflow dependencies and CKKS modulus/token coherence, then overlaps collectives via operator reordering.
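A minimal sketch of what such a coherence-guided placement pass could look like, written in Python for illustration; the shard tags, the split into 'ckks' versus 'app' dependency kinds, and the modulo placement policy are assumptions for this sketch, not the paper's implementation.

```python
from collections import Counter

# Sketch: co-locate modulus-coherent and token-coherent ciphertext shards,
# then count which dependency kinds still cross devices.
def place_shards(shards, edges, num_gpus):
    """shards: {shard_id: {'token_block': int, 'limb_group': int}}
    edges:  [(src_id, dst_id, kind)] with kind in {'ckks', 'app'}.
    Places every RNS limb group of a token block on the same device."""
    placement = {sid: tags['token_block'] % num_gpus for sid, tags in shards.items()}
    cross = Counter(kind for a, b, kind in edges if placement[a] != placement[b])
    return placement, cross

# Toy layout: 4 token blocks x 2 RNS limb groups spread over 4 GPUs.
shards = {i: {'token_block': i // 2, 'limb_group': i % 2} for i in range(8)}
edges = [
    (0, 1, 'ckks'), (2, 3, 'ckks'), (4, 5, 'ckks'), (6, 7, 'ckks'),  # limb coupling
    (0, 2, 'app'), (2, 4, 'app'), (4, 6, 'app'),                     # cross-token aggregation
]
placement, cross = place_shards(shards, edges, num_gpus=4)
print(placement)    # all limbs of a token block share a device
print(dict(cross))  # only application-level edges cross devices: {'app': 3}
```

In this toy layout no encryption-level (RNS) dependency crosses a device boundary; the only cross-device edges are the application-level aggregation edges, which is the placement behavior the paper describes.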
If this is right
- Inter-GPU communication falls by up to 57.9 percent in feed-forward networks and 81.3 percent in self-attention for 2048-token inputs.
- Four-GPU runs reach 96.62 percent scaling efficiency for long-sequence encrypted inference.
- End-to-end inference accelerates by 3.86 times versus prior multi-GPU designs.
- Memory footprint per GPU shrinks by 69.1 percent for the same workload.
- Communication occurs only at application-level dependency points rather than at every encryption-level coupling.
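As an arithmetic cross-check, with the usual convention that scaling efficiency is speedup divided by device count, the speedup and efficiency figures above are mutually consistent:

```latex
\text{efficiency} = \frac{\text{speedup}}{\text{GPUs}} = \frac{3.86}{4} \approx 0.965
```

which matches the reported 96.62 percent up to rounding of the speedup to two decimals.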
Where Pith is reading between the lines
- The same joint-dependency placement logic could be applied to other FHE workloads that combine neural-network dataflow with arithmetic constraints such as matrix multiplications under RNS.
- Extending operator reordering to deeper stacks or eight-plus GPUs might further hide latency on faster interconnects.
- The memory reduction per device suggests the method could support even longer sequences on fixed hardware budgets without accuracy loss.
Load-bearing premise
Co-locating modulus-coherent and token-coherent ciphertexts and reordering polynomial operators preserves exact homomorphic correctness without hidden synchronization or numerical instability.
What would settle it
Executing AEGIS on four GPUs with 2048-token encrypted inputs and verifying that decrypted outputs match a single-GPU reference while communication volume drops by the claimed percentages and scaling efficiency exceeds 90 percent.
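A minimal sketch of such a settling experiment, assuming hypothetical runner callables that return the decrypted output, wall-clock time, and measured inter-GPU traffic (these are stand-ins, not an existing API):

```python
import numpy as np

def settle(run_single_gpu, run_multi_gpu, encrypted_input, num_gpus=4, rtol=1e-3):
    """Each runner is assumed to return (decrypted_output, wall_time_s, inter_gpu_bytes)."""
    ref_out, t_ref, _ = run_single_gpu(encrypted_input)
    out, t_multi, comm_bytes = run_multi_gpu(encrypted_input, num_gpus)

    # CKKS is approximate, so decrypted outputs should agree within a tolerance
    # rather than bit-exactly for the run to count as correctness-preserving.
    matches_reference = np.allclose(ref_out, out, rtol=rtol)

    speedup = t_ref / t_multi
    efficiency = speedup / num_gpus   # the claim to check: above 0.90 at 2048 tokens
    return matches_reference, efficiency, comm_bytes
```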
Original abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency. We present AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs, AEGIS reduces inter-GPU communication by up to 57.9% in feed-forward networks and 81.3% in self-attention versus prior state-of-the-art designs. On four GPUs, it achieves up to 96.62% scaling efficiency, 3.86x end-to-end speedup, and 69.1% per-device memory reduction. These results establish coordinated application-encryption parallelism as a practical foundation for scalable homomorphic Transformer inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AEGIS, a system for scaling long-sequence homomorphic encrypted Transformer inference on multi-GPU systems through hybrid parallelism. It co-locates modulus-coherent and token-coherent ciphertexts based on joint Transformer and CKKS dependencies, reorders polynomial operators to overlap communication with computation, and reports substantial reductions in inter-GPU communication along with 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction on four GPUs for 2048-token inputs.
Significance. If the empirical results are supported by rigorous correctness verification, this work would be significant for advancing practical FHE applications in machine learning, particularly for long-context models where memory constraints are severe. The coordinated application-encryption approach offers a new paradigm for parallelism in encrypted computation.
Major comments (3)
- [§4 (Hybrid Parallelism Design)] The reordering of polynomial operators to overlap collectives is central to the claimed communication reductions (57.9% in FFN, 81.3% in self-attention), but no formal invariant, dataflow analysis, or proof is provided to guarantee that this reordering preserves the exact sequence of CKKS operations including modulus switches and rotations required for correctness.
- [Experimental Results (§6)] The reported 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction lack details on experimental setup, baseline implementations, number of trials, error bars, or ablation studies separating the effects of co-location and reordering, making the quantitative claims impossible to assess.
- [§5.1 (Memory Reduction Analysis)] The 69.1% per-device memory reduction claim is presented without explicit before/after memory measurements or breakdown by component (weights vs. activations), which is load-bearing for the memory-scaling argument.
Minor comments (2)
- [Abstract] The abstract refers to 'prior state-of-the-art designs' without naming or citing them in the text provided.
- [Figures 3-5] Figure legends and axis labels in the scaling plots would benefit from explicit single-GPU baseline times for direct visual comparison.
Simulated Author's Rebuttal
We appreciate the referee's constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for clarity and rigor.
Point-by-point responses
Referee: [§4 (Hybrid Parallelism Design)] The reordering of polynomial operators to overlap collectives is central to the claimed communication reductions (57.9% in FFN, 81.3% in self-attention), but no formal invariant, dataflow analysis, or proof is provided to guarantee that this reordering preserves the exact sequence of CKKS operations including modulus switches and rotations required for correctness.
Authors: We thank the referee for this observation. The reordering is based on the joint dependency graph of Transformer dataflow and CKKS polynomial operations (ensuring modulus switches and rotations occur in equivalent order), but the manuscript indeed omits an explicit formal invariant or proof sketch. In the revised version, we will add a dataflow analysis section and a proof outline demonstrating that the reordering preserves semantic equivalence and the required CKKS operation sequence. revision: yes
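One concrete shape such a dataflow analysis could take is a check that the overlapped schedule remains a valid topological order of the original CKKS dependency DAG; the operation names below are hypothetical placeholders.

```python
def preserves_dependencies(original_deps, reordered_schedule):
    """original_deps: (earlier_op, later_op) pairs that must keep their relative
    order, e.g. a rescale preceding the multiplication that consumes its output.
    Returns True iff no operation was hoisted past something it depends on."""
    position = {op: i for i, op in enumerate(reordered_schedule)}
    return all(position[a] < position[b] for a, b in original_deps)

# Overlapping an all-gather with an independent NTT is legal; hoisting it
# before the rescale that produces its input is not.
deps = [("rescale_q1", "allgather_ct"), ("allgather_ct", "mul_qk")]
ok_schedule  = ["rescale_q1", "ntt_v", "allgather_ct", "mul_qk"]
bad_schedule = ["allgather_ct", "rescale_q1", "ntt_v", "mul_qk"]
print(preserves_dependencies(deps, ok_schedule))   # True
print(preserves_dependencies(deps, bad_schedule))  # False
```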
Referee: [Experimental Results (§6)] The reported 96.62% scaling efficiency, 3.86x speedup, and 69.1% memory reduction lack details on experimental setup, baseline implementations, number of trials, error bars, or ablation studies separating the effects of co-location and reordering, making the quantitative claims impossible to assess.
Authors: We agree that the experimental section requires expansion for reproducibility. The revised manuscript will detail the hardware setup (GPU models and interconnect), baseline replication (how prior SOTA designs were implemented), number of trials (5 runs with averages and standard deviations), error bars, and ablation studies isolating co-location versus reordering contributions. revision: yes
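For illustration, per-configuration reporting could follow a pattern like the one below; the timings are placeholders, not measurements from the paper.

```python
import statistics

def summarize(trials_s, baseline_mean_s, num_gpus):
    """trials_s: wall-clock times (seconds) for one configuration across runs.
    Reports mean, standard deviation, speedup over the single-GPU baseline,
    and scaling efficiency, so ablations (co-location only, reordering only,
    both) can be compared on the same footing."""
    mean = statistics.mean(trials_s)
    std = statistics.stdev(trials_s) if len(trials_s) > 1 else 0.0
    speedup = baseline_mean_s / mean
    return {"mean_s": mean, "std_s": std,
            "speedup": speedup, "efficiency": speedup / num_gpus}

# Placeholder numbers only, to show the shape of the report.
print(summarize([10.4, 10.1, 10.6, 10.3, 10.2], baseline_mean_s=39.8, num_gpus=4))
```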
Referee: [§5.1 (Memory Reduction Analysis)] The 69.1% per-device memory reduction claim is presented without explicit before/after memory measurements or breakdown by component (weights vs. activations), which is load-bearing for the memory-scaling argument.
Authors: We acknowledge this gap. The revised §5.1 will include explicit before/after per-device memory measurements obtained via profiling, with a component-wise breakdown separating weights from activations to substantiate the 69.1% reduction claim. revision: yes
Circularity Check
No circularity: empirical system measurements only
Full rationale
The paper describes an engineering system (AEGIS) whose core claims are measured speedups, scaling efficiencies, and memory reductions on 2048-token inputs. No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. Device placement and operator reordering are design choices whose correctness is asserted via implementation and benchmark results rather than any self-referential proof or prediction step. Self-citations, if present, are not load-bearing for the reported numbers.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "polynomial-operator reordering mechanism that restructures low-level instruction schedules to overlap collective communication with computation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.