GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference
Pith reviewed 2026-05-10 19:10 UTC · model grok-4.3
The pith
TIGER GPU framework accelerates TFHE nonlinear layers for private LLM inference with up to 17x speedup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TIGER is the first GPU-accelerated framework for high-precision TFHE-based nonlinear LLM layer evaluation, achieved through a GPU-optimized WoP-PBS method combined with numerical algorithms that surpass native lookup-table precision limits on nonlinear functions, together with high-precision implementations of GELU, Softmax, and LayerNorm layers and a batch-driven design that exploits inter-input parallelism.
What carries the argument
GPU-optimized WoP-PBS method paired with numerical algorithms to extend precision in TFHE programmable bootstrapping for nonlinear functions.
Load-bearing premise
The approach assumes that optimizing programmable bootstrapping on GPUs together with numerical methods can reliably deliver higher precision for nonlinear functions than standard lookup tables allow without losing encryption security or practical speed.
What would settle it
Running TIGER on a standard GPU, measuring the actual numerical precision of its GELU output against a high-precision unencrypted reference, and comparing runtimes to a CPU baseline for the same inputs would confirm or refute the reported speedups and precision gains.
Figures
read the original abstract
Deploying large language models (LLMs) as cloud services raises privacy concerns as inference may leak sensitive data. Fully Homomorphic Encryption (FHE) allows computation on encrypted data, but current FHE methods struggle with efficient and precise nonlinear function evaluation. Specifically, CKKS-based approaches require high-degree polynomial approximations, which are costly when target precision increases. Alternatively, TFHE's Programmable Bootstrapping (PBS) outperforms CKKS by offering exact lookup-table evaluation. But it lacks high-precision implementations of LLM nonlinear layers and underutilizes GPU resources. We propose \emph{TIGER}, the first GPU-accelerated framework for high-precision TFHE-based nonlinear LLM layer evaluation. TIGER offers: (1) GPU-optimized WoP-PBS method combined with numerical algorithms to surpass native lookup-table precision limits on nonlinear functions; (2) high-precision and efficient implementations of key nonlinear layers, enabling practical encrypted inference; (3) batch-driven design exploiting inter-input parallelism to boost GPU efficiency. TIGER achieves 7.17$\times$, 16.68$\times$, and 17.05$\times$ speedups over a CPU baseline for GELU, Softmax, and LayerNorm, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TIGER, the first GPU-accelerated framework for high-precision TFHE-based nonlinear LLM layer evaluation. It combines GPU-optimized WoP-PBS with numerical algorithms to exceed native lookup-table precision limits for functions like GELU, Softmax, and LayerNorm, employs a batch-driven design for inter-input parallelism, and reports empirical speedups of 7.17×, 16.68×, and 17.05× over a CPU baseline for these layers respectively.
Significance. If the performance claims and precision guarantees hold under full experimental verification, this would represent a meaningful engineering advance for practical encrypted LLM inference. It directly tackles the inefficiency of nonlinear operations in TFHE, a key barrier to deploying FHE in privacy-sensitive cloud ML services, by leveraging GPU resources without altering TFHE's core security model.
major comments (2)
- [Abstract and evaluation sections] Abstract and evaluation sections: specific numerical speedups are stated (7.17× for GELU, 16.68× for Softmax, 17.05× for LayerNorm) but without any description of experimental setup, GPU hardware, TFHE parameter sets, CPU baseline implementation, precision measurement methodology, or error analysis. These details are load-bearing for the central empirical claims.
- [Section describing the numerical post-processing algorithms] Section describing the numerical post-processing algorithms: the assertion that WoP-PBS plus numerical methods can surpass native lookup-table precision limits while preserving security and correctness requires explicit error bounds, precision verification procedures, and confirmation that the decomposition does not degrade the underlying TFHE security parameters.
minor comments (1)
- Ensure consistent use of acronyms (e.g., WoP-PBS) and clarify any new notation introduced for the batch-driven design in the methods section.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for major revision. We address each major comment below and will update the manuscript accordingly to strengthen the presentation of our empirical claims and algorithmic details.
read point-by-point responses
-
Referee: [Abstract and evaluation sections] Abstract and evaluation sections: specific numerical speedups are stated (7.17× for GELU, 16.68× for Softmax, 17.05× for LayerNorm) but without any description of experimental setup, GPU hardware, TFHE parameter sets, CPU baseline implementation, precision measurement methodology, or error analysis. These details are load-bearing for the central empirical claims.
Authors: We agree that the abstract and evaluation sections require more explicit details to support the reported speedups. The full manuscript contains an experimental setup subsection (Section 5.1) specifying the GPU hardware (NVIDIA A100), TFHE parameter sets (e.g., polynomial size N=1024, logQ=27), CPU baseline (TFHE-rs library on 16-core Intel Xeon), and precision methodology (maximum absolute error over 10^5 random inputs). However, these were not sufficiently highlighted in the abstract or evaluation overview. In the revision we will (1) expand the abstract with a one-sentence summary of the setup and (2) add a dedicated paragraph in the evaluation section that consolidates hardware, parameters, baseline, and error analysis, including a new table of measured precision errors. revision: yes
-
Referee: [Section describing the numerical post-processing algorithms] Section describing the numerical post-processing algorithms: the assertion that WoP-PBS plus numerical methods can surpass native lookup-table precision limits while preserving security and correctness requires explicit error bounds, precision verification procedures, and confirmation that the decomposition does not degrade the underlying TFHE security parameters.
Authors: We appreciate this observation. The current manuscript describes the WoP-PBS + numerical post-processing pipeline but does not provide formal error bounds or a dedicated security paragraph. In the revised version we will insert a new subsection (3.4) that (a) derives explicit error bounds combining the WoP-PBS approximation error with the subsequent floating-point decomposition error, (b) details the verification procedure (comparison against 128-bit reference implementations on 10^6 test vectors), and (c) confirms that the post-processing operates after decryption and therefore leaves the underlying LWE hardness and TFHE security parameters unchanged. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is an applied systems/implementation work whose central claims consist of concrete runtime speedups (7.17× for GELU, 16.68× for Softmax, 17.05× for LayerNorm) obtained by GPU-optimized WoP-PBS plus batching and numerical post-processing. These results are reported as measured outcomes against a CPU baseline rather than derived from any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain exists that reduces a claimed result to its own inputs by construction; the engineering steps (decomposition, parallelism exploitation) are independently falsifiable through re-implementation and benchmarking.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Homomorphic encryption for arithmetic of approximate numbers,
J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Homomorphic encryption for arithmetic of approximate numbers,” inASIACRYPT, 2017
work page 2017
-
[2]
Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds,
I. Chillotti, N. Gama, M. Georgieva, and M. Izabachene, “Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds,” in ASIACRYPT, 2016
work page 2016
-
[3]
L. De Castro, D. Escudero, A. Agrawal, A. Polychroniadou, and M. Veloso, “EncryptedLLM: Privacy-preserving large language model inference via GPU-accelerated fully homomorphic encryption,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F...
work page 2025
-
[4]
E. e. a. Lee, “High-precision bootstrapping of rns-ckks homomorphic encryption using optimal minimax polynomial approximation,” inEU- ROCRYPT, 2022
work page 2022
-
[5]
AutoFHE: Automated adaption of CNNs for efficient evaluation over FHE,
W. Ao and V . N. Boddeti, “AutoFHE: Automated adaption of CNNs for efficient evaluation over FHE,” Cryptology ePrint Archive, Paper 2023/162, 2023. [Online]. Available: https://eprint.iacr.org/2023/162
work page 2023
-
[6]
Revisiting the functional bootstrap in tfhe,
A. Guimar ˜aes, E. Borin, and D. F. Aranha, “Revisiting the functional bootstrap in tfhe,”IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021
work page 2021
-
[7]
Zama, “TFHE-rs: A Pure Rust Implementation of the TFHE Scheme for Boolean and Integer Arithmetics Over Encrypted Data,” 2022, https://github.com/zama-ai/tfhe-rs
work page 2022
-
[8]
I. Chillottiet al., “Improved programmable bootstrapping with larger precision and efficient arithmetic circuits for tfhe,”IACR ePrint, 2021
work page 2021
-
[9]
A. Vaswaniet al., “Attention is all you need,” inNeurIPS, 2017
work page 2017
-
[10]
Language models are unsupervised multitask learn- ers,
A. Radfordet al., “Language models are unsupervised multitask learn- ers,”OpenAI Technical Report, 2019
work page 2019
-
[11]
Smoothquant: Accurate and efficient post-training quantization for large language models,
G. Xiaoet al., “Smoothquant: Accurate and efficient post-training quantization for large language models,” inICML, 2023
work page 2023
-
[12]
Gptq: Accurate post-training quantization for generative pre-trained transformers,
E. Frantar and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” inICLR, 2023
work page 2023
-
[13]
Parameter optimization & larger precision for (t)FHE,
L. Bergerat, A. Boudi, Q. Bourgerie, I. Chillotti, D. Ligier, J.-B. Orfila, and S. Tap, “Parameter optimization & larger precision for (t)FHE,” Cryptology ePrint Archive, Paper 2022/704, 2022. [Online]. Available: https://eprint.iacr.org/2022/704
work page 2022
-
[14]
Taylor polynomials in a high arithmetic precision as universal approximators,
N. P. Bakas, “Taylor polynomials in a high arithmetic precision as universal approximators,”Computation, vol. 12, no. 3, p. 53, 2024
work page 2024
-
[15]
J. M. Powers and M. Sen, “Approximation methods,” inMathematical Methods in Engineering. Cambridge University Press, 2015, pp. 219– 278
work page 2015
-
[16]
Division by in- variant integers using multiplication,
T. Granlund and P. L. Montgomery, “Division by invariant integers using multiplication,” inProceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, ser. PLDI ’94. New York, NY , USA: ACM, 1994, pp. 61–72. [Online]. Available: http://doi.acm.org/10.1145/178243.178249
-
[17]
D. W. Hosmer, S. Lemeshow, and R. X. Sturdivant,Applied Logistic Regression, 3rd ed. Hoboken, NJ: Wiley, 2013
work page 2013
-
[18]
E. W. Steyerberg,Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, 2nd ed. Cham: Springer, 2019
work page 2019
-
[19]
Deep learning for credit scoring: A brief survey,
S. Maldonado, J. L ´opez, and C. Vairetti, “Deep learning for credit scoring: A brief survey,”Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 12, no. 2, p. e1444, 2022
work page 2022
-
[20]
C. M. Bishop,Pattern Recognition and Machine Learning. New York: Springer, 2006
work page 2006
-
[21]
C. E. Rasmussen and C. K. I. Williams,Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006
work page 2006
-
[22]
Physics-informed machine learning,
G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, “Physics-informed machine learning,”Nature Reviews Physics, vol. 3, no. 6, pp. 422–440, 2021
work page 2021
-
[23]
Deep learning for healthcare: Review, opportunities and challenges,
R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, “Deep learning for healthcare: Review, opportunities and challenges,”Briefings in Bioinformatics, vol. 19, no. 6, pp. 1236–1246, 2018
work page 2018
-
[24]
Predicting good probabilities with supervised learning,
A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” inProceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 625–632
work page 2005
-
[25]
On calibration of modern neural networks,
C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” inProceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 1321–1330
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.