
arxiv: 2410.09457 · v2 · submitted 2024-10-12 · 💻 cs.LG · cs.CR

Power-Softmax: Towards Secure LLM Inference over Encrypted Data

Pith reviewed 2026-05-23 18:44 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords Power-Softmax · homomorphic encryption · secure LLM inference · polynomial transformers · encrypted data · in-context learning · transformer variants

The pith

A new Power-Softmax attention variant enables stable training of billion-parameter polynomial LLMs for homomorphic encryption while preserving reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Power-Softmax as a replacement for the standard softmax in transformer attention layers. This variant is designed to be polynomial, making it compatible with homomorphic encryption for private inference. Previous methods either approximated existing models inefficiently or used simpler but less scalable replacements. By using Power-Softmax, the authors train models exceeding a billion parameters that match standard transformers on reasoning and in-context learning tasks. This advances privacy-preserving LLMs by allowing much larger models than before.

Core claim

The central discovery is that Power-Softmax provides a stable training form for self-attention that is easy to approximate with polynomials, enabling the first polynomial LLMs over a billion parameters with reasoning and ICL capabilities comparable to standard transformers of the same size.

What carries the argument

Power-Softmax, a polynomial-friendly variant of the softmax function in self-attention that replaces the exponential with a power-based form for stability and approximability under encryption.
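
The idea of swapping the exponential for a power can be illustrated with a minimal sketch. The page does not give the paper's exact equation, so the form below (raise each score to an even power `p`, then divide by the sum plus a small `eps`) is an assumption made for illustration; the function name, the default `p`, and the default `eps` are hypothetical.

```python
import math


def softmax(scores):
    """Standard softmax, shown only as the baseline being replaced."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def power_softmax(scores, p=4, eps=1e-6):
    """Assumed power-based normalization: x**p / (sum(x**p) + eps).

    This is an illustrative guess at the functional form, not the
    paper's exact definition. An even p keeps every term non-negative.
    """
    if p <= 0 or p % 2 != 0:
        raise ValueError("p must be a positive even integer")
    powered = [s ** p for s in scores]
    denom = sum(powered) + eps
    return [v / denom for v in powered]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]
    print("softmax      :", softmax(scores))
    print("power_softmax:", power_softmax(scores, p=2, eps=0.0))
```

In this sketch the numerator is already a polynomial in the scores; the only non-polynomial step left is the division, which would still need a polynomial approximation under encryption.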

If this is right

  • Secure inference becomes feasible for LLMs at billion-parameter scale using homomorphic encryption.
  • Models using Power-Softmax can achieve performance parity with standard transformers on reasoning tasks.
  • Latency breakdowns for encrypted computations can guide further optimizations in privacy-preserving systems.
  • Inductive biases differ between Power-Softmax models and standard transformers, which may affect performance on specific tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deploying such models could allow private AI services without exposing user data to the model owner.
  • Further work might explore combining Power-Softmax with other polynomial approximations for layer normalization to create fully polynomial transformers.
  • Testing these models on a wider range of benchmarks could reveal where the inductive bias differences matter most.

Load-bearing premise

The Power-Softmax attention can be trained stably at billion-parameter scale and its polynomial approximation preserves sufficient inductive bias to match standard transformer performance.

What would settle it

Training a billion-parameter model with Power-Softmax and finding that its polynomial version underperforms standard transformers significantly on in-context learning benchmarks would challenge the central claim.

Figures

Figures reproduced from arXiv: 2410.09457 by Allon Adir, Ehud Aharoni, Itamar Zimerman, Jenny Lerner, Matan Avitan, Moran Baruch, Nir Drucker, Omri Soceanu, Ramy Masalha, Reut Meiri.

Figure 1. Comparison of Softmax and PowerSoftmax normalization on normally distributed values on the left, uniformly distributed values in the middle, and evenly spaced values on the right. As can be seen, the empirical scaling trends are relatively similar. Section 4.1, HE-Friendly Attention: To design a HE-friendly variant of Softmax-based attention, we start by distilling its properties that correlate with its performance:…
Figure 2. (middle) illustrates our HE-friendly training variant, built on top of Eqs. 4 and 5, compared to the original attention
Figure 3. … measured using HElayers 1.5.4 (Aharoni et al., 2023) configured for CKKS with 128-bit security and poly-degree of 2^16. Here, matrix multiplication took 49% + 18% = 67%, most of which was spent on encoding the plaintext weights. Polynomial approximation accounted for 14% + 6% + 4% = 24% of the total time, where PowerSoftmax took 6% of it. Interestingly, in all polynomial approximations, the most …
Figure 4. Training Curves for NTP: Comparison of test perplexity for transformers with Softmax and power normalization when trained over several datasets including Pile, Wikitext-103, and Text8. Section 5.2, Justify Design Choices: To justify our design choices, we conduct a series of ablations. Power-Softmax Attention. We first compare PowerSoftmax and Softmax outside the context of HE, showing that in addition to being a H…
Figure 5. Results on Vision Tasks. Training curves for ViT variants with PowerSoftmax (red) and the Softmax baseline (blue). On the left, results are presented for Tiny-ImageNet, and in the middle and on the right for CIFAR-100 and CIFAR-10, respectively.
Figure 6. The Significance of the Stable Variant. Training curves for NTP on Wikitext for large models. The stable variant (red) consistently outperforms the vanilla PowerSoftmax (blue). Stability. To assess the contribution of our numerically stable variant, we conduct dedicated experiments. In …
Figure 7. Measuring the polynomial approximation error for different values of ϵ. ϵ-Bounded Division for Softmax. The HE-friendly attention variant from Eq. 4 proposes adding epsilon to make the approximation problem of division easier, resulting in an approximation of a 1/ϵ²-Lipschitz continuous function
Figure 8. Measuring the attention mean distance for different transformer variants. PowerSoftmax introduces an important hyperparameter p that differentiates it from the traditional Softmax function. To better understand its mechanistic behavior, we examine how the attention matrices evolve with varying values of p. Our analysis reveals that as p increases, the resulting attention matrices become more localized as…
Figure 9. Visualisation of Averaged Attention Matrices: Layer Index\Model, where models from left to right are PowerSoftmax with p = 4, 8, 12 and Softmax
Figure 10. Visualisation of polynomial average attention matrices: Models with P = 4 (first column) generate more local attention matrices, with reduced mass near the diagonal compared to models with P = 8 or P = 12, particularly in layers 4-10. In all models, the final layers (rows at the bottom) display more global attention patterns than the middle layers.
Figure 11. Visualisation of random samples of polynomial attention matrices: Although the attention matrices are noisy and a small number of samples may not capture the full distribution trend, the Power-softmax-based models (first three columns) show behavior similar to the original Softmax (last column). Notably, our attention layers can dynamically adjust focus across different parts of the input, allowing attent…
Figure 12. Comparison of training curves for 12-layer RoBERTa models with different attention
Figure 13. The impact of different values of ϵ on training dynamics of PowerSoftmax-based models
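
The "attention mean distance" mentioned in the Figure 8 caption is not defined on this page. A minimal sketch of one common way to compute such a quantity follows, assuming it means the attention-weighted average of |i - j| between query position i and key position j, averaged over queries. The function name and the input layout (a list of rows that each sum to 1) are hypothetical.

```python
def mean_attention_distance(attn):
    """Average token distance spanned by attention.

    attn[i][j] is the weight query i places on key j; each row is
    assumed to sum to 1. This definition is an assumption and may
    differ from the one used in the paper.
    """
    if not attn:
        raise ValueError("attention matrix must have at least one row")
    total = 0.0
    for i, row in enumerate(attn):
        # Weighted distance from query position i to every key position j.
        total += sum(w * abs(i - j) for j, w in enumerate(row))
    return total / len(attn)


if __name__ == "__main__":
    local = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    spread = [[0.5, 0.5], [0.5, 0.5]]
    print(mean_attention_distance(local))   # purely diagonal attention
    print(mean_attention_distance(spread))  # uniform attention over 2 tokens
```

Under this definition, a smaller value corresponds to more localized attention.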
Abstract

Modern cryptographic methods for implementing privacy-preserving LLMs such as homomorphic encryption (HE) require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial components, such as Softmax and layer normalization. Previous approaches have either directly approximated pre-trained models with large-degree polynomials, which are less efficient over HE, or replaced non-polynomial components with easier-to-approximate primitives before training, e.g., Softmax with pointwise attention. The latter approach might introduce scalability challenges. We present a new HE-friendly variant of self-attention that offers a stable form for training and is easy to approximate with polynomials for secure inference. Our work introduces the first polynomial LLMs over a billion parameters, exceeding the size of previous models by more than tenfold. The resulting models demonstrate reasoning and in-context learning (ICL) capabilities comparable to standard transformers of the same size, representing a breakthrough in the field. Finally, we provide a detailed latency breakdown for each computation over encrypted data, paving the way for further optimization, and explore the differences in inductive bias between models relying on our HE-friendly variant and standard transformers.
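
To make "polynomial form" concrete for the division inside an attention normalizer, here is a minimal sketch of a Goldschmidt-style reciprocal, which uses only additions and multiplications. Whether the authors use this particular scheme is an assumption; the function name, the input range (0, 2), and the iteration count are hypothetical choices for illustration.

```python
def goldschmidt_reciprocal(d, iters=6):
    """Approximate 1/d for 0 < d < 2 with additions and multiplications only.

    With e = 1 - d, the product (1 + e)(1 + e**2)(1 + e**4)... equals
    (1 - e**(2**iters)) / d, which tends to 1/d when |e| < 1.
    The result is a polynomial of degree 2**iters - 1 in d.
    """
    if not 0.0 < d < 2.0:
        raise ValueError("this sketch assumes the input was scaled into (0, 2)")
    x = 1.0
    e = 1.0 - d
    for _ in range(iters):
        x = x * (1.0 + e)  # accumulate the next factor of the product
        e = e * e          # square the error term
    return x


if __name__ == "__main__":
    for d in (0.5, 1.0, 1.5):
        print(d, goldschmidt_reciprocal(d), 1.0 / d)
```

Each iteration costs two multiplications, so the iteration count trades accuracy against multiplicative depth, which is the resource that matters under HE. Inputs close to 0 converge slowly, which is one reason a lower bound on the denominator makes the approximation easier.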

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Power-Softmax, a new self-attention variant intended to be stable for training and amenable to low-degree polynomial approximation, enabling homomorphic-encryption (HE) friendly LLMs. It claims the first such models exceeding one billion parameters (more than 10x prior work), with reasoning and in-context learning performance comparable to standard transformers of the same size, plus a latency breakdown for encrypted inference.

Significance. If the performance and stability claims hold, the result would be a substantial advance for privacy-preserving inference, as it would demonstrate that polynomial LLMs can be scaled to practical sizes while retaining core capabilities.

major comments (2)
  1. [Abstract] The central claim, that the models exceed prior work by more than tenfold and achieve 'comparable' reasoning/ICL performance, is made without model sizes, benchmark scores, training hyperparameters, polynomial degrees, or approximation-error metrics; without these, the 'first' and 'comparable' assertions cannot be evaluated.
  2. [Abstract (and results sections)] The weakest assumption (training stability of Power-Softmax and preservation of inductive bias under polynomial approximation at >1B parameters) is asserted but not supported by any derivation, ablation, or scaling experiment in the provided text; if either fails the headline result collapses.
minor comments (1)
  1. [Abstract] 'Power-Softmax' is named without an equation or definition; a brief functional form would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Abstract] The central claim, that the models exceed prior work by more than tenfold and achieve 'comparable' reasoning/ICL performance, is made without model sizes, benchmark scores, training hyperparameters, polynomial degrees, or approximation-error metrics; without these, the 'first' and 'comparable' assertions cannot be evaluated.

    Authors: We agree that the abstract would be clearer with explicit quantitative details. The full manuscript reports model sizes (1.3B parameters), benchmark scores on reasoning and ICL tasks, training hyperparameters, polynomial degrees used, and approximation errors in the results and experimental sections. In revision we will expand the abstract to include these key figures (e.g., exact parameter counts, selected benchmark accuracies, and degree values) while preserving brevity. revision: yes

  2. Referee: [Abstract (and results sections)] The weakest assumption (training stability of Power-Softmax and preservation of inductive bias under polynomial approximation at >1B parameters) is asserted but not supported by any derivation, ablation, or scaling experiment in the provided text; if either fails the headline result collapses.

    Authors: The results section presents training curves, loss stability across scales, and direct performance comparisons between Power-Softmax models and standard transformers at >1B parameters, which empirically support both stability and retention of capabilities under the polynomial approximation. However, we acknowledge that dedicated ablations isolating the effect of polynomial degree on inductive bias at this scale would strengthen the claim. We will add such targeted ablations and scaling plots in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper introduces Power-Softmax as a new attention variant and asserts empirical outcomes (first >1B-parameter polynomial LLMs with comparable reasoning/ICL). The provided abstract and description contain no equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled via prior work. All load-bearing claims are external empirical assertions about training stability and approximation fidelity at scale; these are falsifiable outside the paper rather than reducing to inputs by construction. This is the expected non-finding for an empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces Power-Softmax as a new attention primitive without specifying its exact functional form or any fitted constants. No free parameters, additional axioms, or invented entities beyond the new primitive itself are mentioned.

invented entities (1)
  • Power-Softmax (no independent evidence)
    Purpose: HE-friendly replacement for softmax in self-attention that remains stable for training and admits low-degree polynomial approximation.
    Introduced in the abstract as the core technical contribution enabling the billion-parameter models.

pith-pipeline@v0.9.0 · 5769 in / 1299 out tokens · 19831 ms · 2026-05-23T18:44:52.131762+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1]

    HElayers: A tile tensors framework for large neural networks on encrypted data

    Ehud Aharoni, Allon Adir, Moran Baruch, Nir Drucker, Gilad Ezov, Ariel Farkash, Lev Greenberg, Ramy Masalha, Guy Moshkowich, Dov Murik, et al. HElayers: A tile tensors framework for large neural networks on encrypted data . PoPETs, 2023. doi:10.56553/popets-2023-0020

  2. [2]

    On the privacy of protocols based on cpa-secure homomorphic encryption

    Adi Akavia and Margarita Vald. On the privacy of protocols based on CPA-secure homomorphic encryption. IACR Cryptol. ePrint Arch., 2021:803, 2021. URL https://eprint.iacr.org/2021/803

  3. [3]

    GPT-NeoX: Large scale autoregressive language modeling in PyTorch

    Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Jason Phang, Shivanshu Purohit, Hailey Schoelkopf, Dashiell Stander, Tri Songz, Curt Tigges, Benjamin Thérien, Phil Wang, and Samuel Weinbach. Gpt-neox: Large scale autoregressive lan...

  4. [4]

    AutoFHE: Automated adaption of CNNs for efficient evaluation over FHE

    Wei Ao and Vishnu Naresh Boddeti. AutoFHE: Automated adaption of CNNs for efficient evaluation over FHE. In 33rd USENIX Security Symposium (USENIX Security 24), pp. 2173–2190, Philadelphia, PA, August 2024. USENIX Association. ISBN 978-1-939133-44-1. URL https://www.usenix.org/conference/usenixsecurity24/presentation/ao

  5. [5]

    A Methodology for Training Homomorphic Encryption Friendly Neural Networks

    Moran Baruch, Nir Drucker, Lev Greenberg, and Guy Moshkowich. A Methodology for Training Homomorphic Encryption Friendly Neural Networks. In Applied Cryptography and Network Security Workshops, pp. 536–553, Cham, 2022. Springer International Publishing. ISBN 978-3-031-16815-4. doi:10.1007/978-3-031-16815-4_29

  6. [6]

    Sensitive Tuning of Large Scale CNNs for E2E Secure Prediction using Homomorphic Encryption

    Moran Baruch, Nir Drucker, Gilad Ezov, Eyal Kushnir, Jenny Lerner, Omri Soceanu, and Itamar Zimerman. Sensitive Tuning of Large Scale CNNs for E2E Secure Prediction using Homomorphic Encryption . arXiv preprint arXiv:2304.14836, 2023. URL https://arxiv.org/pdf/2304.14836. To appear in CSCML 2024

  7. [7]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, Usvsn Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. Pythia: A suite for analyzing large language models across training and scaling. In Andreas Krause, Emma Brunskill, Kyun...

  8. [8]

    (Leveled) Fully Homomorphic Encryption without Bootstrapping

    Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory, 6(3), July 2014. ISSN 1942-3454. doi:10.1145/2633600

  9. [9]

    The-x: Privacy-preserving transformer inference with homomorphic encryption

    Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. The-x: Privacy-preserving transformer inference with homomorphic encryption. arXiv preprint arXiv:2206.00216, 2022. URL https://arxiv.org/abs/2206.00216

  10. [10]

    Homomorphic encryption for arithmetic of approximate numbers

    Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security, pp. 409–437. Springer, 2017. doi:10.1007/978-3-319-70694-8_15

  11. [11]

    P-nets: Deep polynomial neural networks

    Grigorios G Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, and Stefanos Zafeiriou. P-nets: Deep polynomial neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7325–7335, 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Chrysos_P-nets_Deep_Polynomial_...

  12. [12]

    East: Efficient and accurate secure transformer framework for inference

    Yuanchao Ding, Hua Guo, Yewei Guan, Weixin Liu, Jiarong Huo, Zhenyu Guan, and Xiyong Zhang. East: Efficient and accurate secure transformer framework for inference. arXiv preprint arXiv:2308.09923, 2023. URL https://arxiv.org/abs/2308.09923

  13. [13]

    Efficient skip connections realization for secure inference on encrypted data

    Nir Drucker and Itamar Zimerman. Efficient skip connections realization for secure inference on encrypted data. In Shlomi Dolev, Ehud Gudes, and Pascal Paillier (eds.), Cyber Security, Cryptology, and Machine Learning, pp. 65–73, Cham, 2023. Springer Nature Switzerland. ISBN 978-3-031-34671-2. doi:10.1007/978-3-031-34671-2_5

  14. [14]

    Somewhat Practical Fully Homomorphic Encryption

    Junfeng Fan and Frederik Vercauteren. Somewhat Practical Fully Homomorphic Encryption. Proceedings of the 15th international conference on Practice and Theory in Public Key Cryptography, pp. 1–16, 2012. URL https://eprint.iacr.org/2012/144

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. URL https://arxiv.org/abs/2101.00027

  16. [16]

    A fully homomorphic encryption scheme

    Craig Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, Palo Alto, CA, 2009. URL https://crypto.stanford.edu/craig/craig-thesis.pdf

  17. [17]

    Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy

    Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International conference on machine learning, pp. 201–210. PMLR, 2016. URL http://proceedings.mlr.press/v48/gilad-bachrach16.pdf

  18. [18]

    Openwebtext corpus

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  19. [19]

    Applications of division by convergence

    Robert E Goldschmidt. Applications of division by convergence. PhD thesis, Massachusetts Institute of Technology, 1964. URL https://dspace.mit.edu/bitstream/handle/1721.1/11113/34136725-MIT.pdf

  20. [20]

    Polynomial activation functions

    Vikas Gottemukkula. Polynomial activation functions. OpenReview, 2020. URL https://openreview.net/forum?id=rkxsgkHKvH

  21. [21]

    Improved polynomial neural networks with normalised activations

    Mohit Goyal, Rajan Goyal, and Brejesh Lall. Improved polynomial neural networks with normalised activations. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2020. doi:10.1109/IJCNN48605.2020.9207535

  22. [22]

    SIGMA: Secure GPT inference with function secret sharing

    Kanav Gupta, Neha Jawalkar, Ananta Mukherjee, Nishanth Chandran, Divya Gupta, Ashish Panwar, and Rahul Sharma. SIGMA: Secure GPT inference with function secret sharing. Cryptology ePrint Archive, 2023. URL https://eprint.iacr.org/2023/1269

  23. [23]

    Neujeans: Private neural network inference with joint optimization of convolution and bootstrapping

    Jae Hyung Ju, Jaiyoung Park, Jongmin Kim, Donghwan Kim, and Jung Ho Ahn. Neujeans: Private neural network inference with joint optimization of convolution and bootstrapping. arXiv preprint arXiv:2312.04356, 2023. URL https://arxiv.org/abs/2312.04356

  24. [24]

    Low-complexity deep convolutional neural networks on fully homomorphic encryption using multiplexed parallel convolutions

    Eunsang Lee, Joon-Woo Lee, Junghyun Lee, Young-Sik Kim, Yongjune Kim, Jong-Seon No, and Woosuk Choi. Low-complexity deep convolutional neural networks on fully homomorphic encryption using multiplexed parallel convolutions. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th Intern...

  25. [25]

    Precise approximation of convolutional neural networks for homomorphically encrypted data

    Junghyun Lee, Eunsang Lee, Joon-Woo Lee, Yongjune Kim, Young-Sik Kim, and Jong-Seon No. Precise approximation of convolutional neural networks for homomorphically encrypted data. arXiv preprint arXiv:2105.10879, 2021. URL https://arxiv.org/abs/2105.10879

  26. [26]

    Optimizing layerwise polynomial approximation for efficient private inference on fully homomorphic encryption: A dynamic programming approach

    Junghyun Lee, Eunsang Lee, Young-Sik Kim, Yongwoo Lee, Joon-Woo Lee, Yongjune Kim, and Jong-Seon No. Optimizing layerwise polynomial approximation for efficient private inference on fully homomorphic encryption: A dynamic programming approach. arXiv preprint arXiv:2310.10349, 2023. URL https://arxiv.org/abs/2310.10349

  27. [27]

    MERGE: Fast private text generation

    Zi Liang, Pinghui Wang, Ruofei Zhang, Nuo Xu, Shuo Zhang, Lifeng Xing, Haitao Bai, and Ziyang Zhou. MERGE: Fast private text generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(18):19884–19892, Mar. 2024. doi:10.1609/aaai.v38i18.29964

  28. [28]

    Llms can understand encrypted prompt: Towards privacy-computing friendly transformers

    Xuanqi Liu and Zhuotao Liu. LLMs can understand encrypted prompt: Towards privacy-computing friendly transformers. arXiv preprint arXiv:2305.18396, 2023. URL https://arxiv.org/abs/2305.18396

  29. [29]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692

  30. [30]

    Financial news classification dataset

    Nicholas Muchinguri. Financial news classification dataset. https://huggingface.co/datasets/nickmuchi/financial-classification, 2022. Accessed: 2024-05-26

  31. [31]

    fairseq: A fast, extensible toolkit for sequence modeling

    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019

  32. [32]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017. URL https://arxiv.org/abs/1706.03762

  33. [33]

    Analyzing the structure of attention in a transformer language model

    Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 63–76, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-4808. URL https://aclanthology.org/W19-4808

  34. [34]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  35. [35]

    On protecting the data privacy of large language models (LLMs): A survey

    Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzheng Cheng. On protecting the data privacy of large language models (LLMs): A survey. arXiv preprint arXiv:2403.05156, 2024. URL https://arxiv.org/abs/2403.05156

  36. [36]

    A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly

    Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, 2024. ISSN 2667-2952. doi:10.1016/j.hcc.2024.100211

  37. [37]

    Neural networks with (low-precision) polynomial approximations: New insights and techniques for accuracy improvement

    Chi Zhang, Man Ho Au, and Siu Ming Yiu. Neural networks with (low-precision) polynomial approximations: New insights and techniques for accuracy improvement. arXiv preprint arXiv:2402.11224, 2024a. URL https://arxiv.org/abs/2402.11224

  38. [38]

    Secure transformer inference made non-interactive

    Jiawen Zhang, Jian Liu, Xinpeng Yang, Yinghao Wang, Kejia Chen, Xiaoyang Hou, Kui Ren, and Xiaohu Yang. Secure transformer inference made non-interactive. Cryptology ePrint Archive, 2024b. URL https://eprint.iacr.org/2024/136

  39. [39]

    Primer: Fast private transformer inference on encrypted data

    Mengxin Zheng, Qian Lou, and Lei Jiang. Primer: Fast private transformer inference on encrypted data. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, 2023. doi:10.1109/DAC56929.2023.10247719

  40. [40]

    Polynomial activation neural networks: Modeling, stability analysis and coverage bp-training

    Jun Zhou, Huimin Qian, Xinbiao Lu, Zhaoxia Duan, Haoqian Huang, and Zhen Shao. Polynomial activation neural networks: Modeling, stability analysis and coverage bp-training. Neurocomputing, 359:227–240, 2019. ISSN 0925-2312. doi:10.1016/j.neucom.2019.06.004

  41. [41]

    Converting transformers to polynomial form for secure inference over homomorphic encryption

    Itamar Zimerman, Moran Baruch, Nir Drucker, Gilad Ezov, Omri Soceanu, and Lior Wolf. Converting transformers to polynomial form for secure inference over homomorphic encryption. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conferen...
