
arxiv: 2606.11827 · v1 · pith:5NCWBGYQnew · submitted 2026-06-10 · 💻 cs.CR

Jaguar: Fast Private CNN Inference with Power-of-Two Homomorphic Arithmetic

Pith reviewed 2026-06-27 09:10 UTC · model grok-4.3

classification 💻 cs.CR
keywords private CNN inference · homomorphic encryption · power-of-two ring · coefficient-domain convolution · ReLU truncation · hybrid HE/2PC · ResNet · latency reduction

The pith

Jaguar replaces prime-modulus homomorphic arithmetic with a power-of-two ciphertext ring to accelerate private CNN inference through coefficient-domain convolution and exact local truncation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Jaguar around one design decision: switching the underlying homomorphic encryption to a power-of-two ring. This change replaces NTT-based polynomial multiplication in convolution with scalar-polynomial accumulation and removes the separate post-ReLU truncation protocol by permitting exact right shifts directly on ciphertexts. ReLU therefore runs at the target fixed-point precision rather than doubled bitwidth. The same ring still permits standard NTT at the client for the single decryption multiplication. On ImageNet-scale ResNet-18, ResNet-50, and MobileNetV2 the resulting system reports lower end-to-end latency and communication than prior hybrid HE/2PC baselines.
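
To make the "doubled bitwidth" point concrete, here is a minimal fixed-point sketch in plain Python with no encryption involved. The fractional width `F = 8` and the helper names `encode` and `decode` are illustrative assumptions, not parameters or APIs from the paper. Multiplying two values that each carry F fractional bits gives a product at scale 2^(2F); a right shift by F brings it back to the target scale.

```python
F = 8  # fractional bits; an illustrative choice, not the paper's parameter

def encode(x: float, f: int = F) -> int:
    """Map a real number to a fixed-point integer with f fractional bits."""
    return round(x * (1 << f))

def decode(v: int, f: int = F) -> float:
    """Map a fixed-point integer with f fractional bits back to a real number."""
    return v / (1 << f)

x = encode(1.5)      # scale 2^F
w = encode(-2.25)    # scale 2^F
prod = x * w         # scale 2^(2F): the "doubled" precision
trunc = prod >> F    # arithmetic right shift back to scale 2^F

print(decode(prod, 2 * F), decode(trunc))      # same real value at two scales
print(prod.bit_length(), trunc.bit_length())   # the product needs F more bits
```

Per the abstract, earlier hybrid pipelines run ReLU on the wider `prod` and only then invoke a truncation protocol; Jaguar's claim is that the shift can be applied locally on the ciphertext side so ReLU sees the narrower `trunc`.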

Core claim

Jaguar shows that a power-of-two ciphertext ring preserves the required homomorphic properties while enabling SPA-Conv, a coefficient-domain convolution kernel that performs scalar-polynomial accumulation instead of NTT-centric multiplication, together with exact ciphertext-side right-shift truncation that lets ReLU execute directly at target precision and eliminates the auxiliary truncation protocol.

What carries the argument

The power-of-two ciphertext ring, which supports both SPA-Conv scalar-polynomial accumulation for convolution and exact local right-shift truncation after ReLU.
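
A minimal sketch of what scalar-polynomial accumulation can look like in the coefficient domain, assuming the ring Z_{2^Q}[X]/(X^N + 1) and assuming each weight acts as an integer scalar times a monomial shift. The function name `spa_accumulate`, the `taps` layout, and the toy sizes are hypothetical; the paper's actual SPA-Conv packing and kernel are not reproduced here. The bit-mask reduction mirrors the remark captured with the Figure 2 caption that reduction modulo 2^Q is a single bitwise AND.

```python
def spa_accumulate(ct, taps, Q):
    """Return sum_j w_j * X^(k_j) * ct(X) in Z_{2^Q}[X]/(X^N + 1).

    ct   : list of N coefficients in [0, 2^Q)
    taps : list of (k_j, w_j) pairs with 0 <= k_j < N and integer weight w_j
    """
    n = len(ct)
    mask = (1 << Q) - 1          # reduction mod 2^Q is one bitwise AND
    acc = [0] * n
    for shift, w in taps:
        for i, c in enumerate(ct):
            j = i + shift
            if j < n:
                acc[j] = (acc[j] + w * c) & mask
            else:                # wrap past X^N, using X^N = -1
                acc[j - n] = (acc[j - n] - w * c) & mask
    return acc

ct = [3, 250, 7, 1]                 # toy polynomial, N = 4, Q = 8
taps = [(0, 2), (1, -3), (3, 5)]    # toy weights 2 - 3X + 5X^3
print(spa_accumulate(ct, taps, 8))
```

Each tap costs N scalar multiply-adds and no transform, so the work scales with the number of nonzero weights rather than with an NTT per polynomial product. In a BFV-style ciphertext the same loop would run on each component polynomial.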

If this is right

  • SPA-Conv replaces NTT-centric polynomial multiplication with scalar-polynomial accumulation in the convolution layers.
  • ReLU can be evaluated directly at the target fixed-point precision without a separate post-ReLU truncation protocol.
  • Client-side decryption retains an auxiliary NTT prime so its cost remains O(N log N).
  • Measured end-to-end latency drops 2.07-3.72x versus Cheetah and 2.16-3.36x versus Rhombus on the listed models when AVX is disabled.
  • Communication volume is reduced 1.16-1.76x compared with Cheetah on the same workloads.
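
The third bullet says decryption keeps an O(N log N) multiplication by routing it through an auxiliary NTT prime. One way this can work, sketched below as an assumption rather than as the paper's construction: lift the centered ciphertext coefficients to integers, multiply by a small secret modulo a prime p large enough that the integer product never wraps, then mask back to 2^Q. The toy sizes (N = 8, Q = 8), the prime 12289, the ternary secret, and all function names are illustrative choices.

```python
def find_psi(n, p):
    """Find a primitive 2n-th root of unity mod prime p (needs p = 1 mod 2n)."""
    assert (p - 1) % (2 * n) == 0
    for g in range(2, p):
        psi = pow(g, (p - 1) // (2 * n), p)
        if pow(psi, n, p) == p - 1:   # psi^n = -1, so the order is exactly 2n
            return psi
    raise ValueError("no 2n-th root of unity found")

def ntt(a, omega, p):
    """Recursive radix-2 cyclic NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    sq = omega * omega % p
    even, odd = ntt(a[0::2], sq, p), ntt(a[1::2], sq, p)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * omega % p
    return out

def mul_mod_2q_via_prime(c, s, Q, p):
    """c(X) * s(X) in Z_{2^Q}[X]/(X^N + 1), computed with an NTT mod p.

    c has coefficients in [0, 2^Q); s is ternary, with entries in {-1, 0, 1}.
    """
    n = len(c)
    # Centered c has magnitude <= 2^(Q-1), so product coefficients are
    # bounded by n * 2^(Q-1); p > n * 2^Q makes the centered lift unique.
    assert p > n * (1 << Q)
    centered = [x - (1 << Q) if x >= (1 << (Q - 1)) else x for x in c]
    psi = find_psi(n, p)
    omega = psi * psi % p
    twist = [pow(psi, i, p) for i in range(n)]
    fa = ntt([x * t % p for x, t in zip(centered, twist)], omega, p)
    fb = ntt([x * t % p for x, t in zip(s, twist)], omega, p)
    prod = ntt([x * y % p for x, y in zip(fa, fb)], pow(omega, -1, p), p)
    n_inv, psi_inv = pow(n, -1, p), pow(psi, -1, p)
    out = []
    for i, x in enumerate(prod):
        v = x * n_inv % p * pow(psi_inv, i, p) % p
        if v > p // 2:               # centered lift back to the integers
            v -= p
        out.append(v & ((1 << Q) - 1))
    return out

c = [(37 * i + 11) % 256 for i in range(8)]
s = [1, 0, -1, 1, -1, 0, 0, 1]
print(mul_mod_2q_via_prime(c, s, 8, 12289))
```

The server-side arithmetic never touches p in this sketch; the prime appears only inside the one client-side product, which matches the division of labor the bullet describes.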

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ring choice could be tested on other hybrid protocols that currently rely on prime moduli for convolution or truncation.
  • Hardware implementations that already favor power-of-two arithmetic might see additional gains once the ring change is adopted.
  • Security reductions for power-of-two rings would need explicit verification if the scheme is deployed at scale.
  • The approach might simplify fixed-point analysis in other private machine-learning settings that currently double bitwidth for ReLU.

Load-bearing premise

Switching to a power-of-two ring keeps both the correctness of the homomorphic operations and the security of the scheme intact while allowing exact right-shift truncation without precision loss or extra protocols.

What would settle it

A side-by-side execution of Jaguar and a prime-modulus baseline on identical inputs that produces different decrypted outputs, or a concrete attack that breaks semantic security of the power-of-two scheme, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.11827 by Hyeri Roh, Nayoung Jung, Woo-Seok Choi, Yewon Jeong.

Figure 1: Motivation for Jaguar’s design choices. (a) Hybrid HE/2PC private inference framework.
Figure 2: Computation flow of SPA-CONV. … share-domain mismatch SIMD-encoded systems must bridge. 3) Replaces modular reduction with bit masking: reduction modulo 2^Q is a single bitwise AND with 2^Q − 1. Correctness and security. The arithmetic backend changes; the privacy goal does not. Correctness follows the standard BFV requirement that accumulated noise stays below ∆/2 minus the truncation slack of Theorem 1; de…
Figure 3: Overview of Jaguar.
Figure 4: Jaguar convolution protocol.
Figure 5: End-to-end latency and communication breakdown across ImageNet-scale CNNs: (a) …
Figure 6: Worst-case pre-truncation output-noise standard deviation at the maximum scalar–ciphertext …
Figure 7: Channel-wise pre-truncation output-noise standard deviation for representative maximum …
Figure 8: FC/MatMul latency trend under different weight densities.
Abstract

Hybrid HE/2PC private CNN inference remains bottlenecked by prime-modulus homomorphic arithmetic in convolution and by a precision flow that runs ReLU at doubled bitwidth before invoking a separate truncation protocol. We present Jaguar, a system built on a single design choice--a power-of-two ciphertext ring--that addresses both. The choice enables SPA-Conv, a coefficient-domain convolution kernel that replaces NTT-centric polynomial multiplication with scalar-polynomial accumulation, and an exact ciphertext-side truncation by local right shifts that lets ReLU run directly at the target fixed-point precision and eliminates the post-ReLU truncation protocol. Where NTT remains genuinely useful--at the client, for the single polynomial multiplication during decryption--we recover it through an auxiliary NTT prime, preserving the power-of-two protocol substrate while keeping decryption O(N log N). On ImageNet-scale ResNet-18, ResNet-50, and MobileNetV2 with AVX disabled, Jaguar achieves 2.07-3.72x lower end-to-end latency than Cheetah and 2.16-3.36x lower than Rhombus, with 1.16-1.76x lower communication than Cheetah.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Jaguar, a hybrid HE/2PC system for private CNN inference on ImageNet-scale models (ResNet-18/50, MobileNetV2). Its core design choice replaces the standard prime-modulus ciphertext ring with a power-of-two ring. This enables (1) SPA-Conv, a coefficient-domain convolution that avoids NTT-based polynomial multiplication, and (2) exact local right-shift truncation after ReLU at target fixed-point precision, eliminating the post-ReLU truncation protocol. Client-side NTT is retained only for the single decryption multiplication via an auxiliary prime. The paper reports 2.07-3.72× lower end-to-end latency than Cheetah and 2.16-3.36× lower than Rhombus (AVX disabled), together with 1.16-1.76× lower communication than Cheetah.

Significance. If the power-of-two ring indeed preserves both correctness and security of the underlying HE scheme while permitting exact truncation without auxiliary protocols or precision loss, the work would materially simplify the precision flow and arithmetic kernels in private inference, yielding concrete latency and communication gains on realistic CNNs. The empirical speedups are measured on full ImageNet models rather than toy networks, which strengthens the practical claim.

major comments (3)
  1. [power-of-two ring construction] The power-of-two ring construction (described after the abstract and in the system overview): the manuscript asserts that switching to a power-of-two modulus preserves both the correctness and the security of the underlying lattice HE scheme while enabling exact right-shift truncation. No security reduction, noise-growth analysis, or reference to a hardness result for power-of-two moduli is supplied; standard BFV/CKKS security reductions rely on prime moduli for NTT and for the ring-LWE hardness assumption. This assumption is load-bearing for all claimed speedups.
  2. [ReLU and truncation subsection] The exact truncation claim (ReLU and truncation subsection): the paper states that local right shifts after ReLU incur no precision loss and require no auxiliary 2PC protocol. The argument must demonstrate that modular wrap-around cannot occur at the target fixed-point bit-width and that noise growth remains compatible with the subsequent layers; without an explicit bound or modular-arithmetic invariant, the elimination of the truncation protocol is not yet justified.
  3. [evaluation section, Table X] Experimental comparison (evaluation section, Table X): the reported 2.07-3.72× latency advantage is measured with AVX disabled. The manuscript should clarify whether the baseline Cheetah and Rhombus implementations were also compiled without AVX or whether the comparison mixes optimized and unoptimized code; otherwise the speedup attribution to the power-of-two design alone is ambiguous.
minor comments (2)
  1. [system overview] Notation for the auxiliary NTT prime and the power-of-two modulus should be introduced once and used consistently; the current description mixes “auxiliary prime” and “power-of-two protocol substrate” without a single defining equation.
  2. [SPA-Conv description] The abstract claims “parameter-free” behavior for SPA-Conv; if any scaling factors or bit-width choices remain, they should be listed explicitly in the main text.
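
Major comment 2 turns on modular wrap-around. The sketch below is a share-domain toy, not Jaguar's ciphertext-side protocol: it shows why a power-of-two modulus is friendly to local shifts. If two additive shares modulo 2^k are each shifted right by f bits and the result is read modulo 2^(k-f), the wrap term becomes a multiple of the new modulus and vanishes, leaving an error of at most one unit in the last place. Read modulo the original 2^k, the same shifts can be off by roughly 2^(k-f). The function names and the sizes `K = 8`, `F = 3` are hypothetical.

```python
def share(x, r, k):
    """Split x into two additive shares modulo 2^k using mask value r."""
    m = (1 << k) - 1
    return r & m, (x - r) & m

def local_shift(x0, x1, f, out_bits):
    """Each party shifts its own share by f bits; the sum is read mod 2^out_bits."""
    return ((x0 >> f) + (x1 >> f)) & ((1 << out_bits) - 1)

K, F = 8, 3
x = 200
x0, x1 = share(x, 201, K)                 # 201 + 255 wraps past 2^K once
shrunk = local_shift(x0, x1, F, K - F)    # modulus shrinks with the shift
kept = local_shift(x0, x1, F, K)          # modulus left at 2^K

print("exact:", x >> F, "shrunk modulus:", shrunk, "kept modulus:", kept)
```

An invariant of this kind, stated for the ciphertext setting together with the noise bound, is what the referee asks the authors to make explicit.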

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating planned revisions where appropriate. All responses are based on the existing manuscript content without introducing unsubstantiated claims.

Point-by-point responses
  1. Referee: [power-of-two ring construction] The power-of-two ring construction (described after the abstract and in the system overview): the manuscript asserts that switching to a power-of-two modulus preserves both the correctness and the security of the underlying lattice HE scheme while enabling exact right-shift truncation. No security reduction, noise-growth analysis, or reference to a hardness result for power-of-two moduli is supplied; standard BFV/CKKS security reductions rely on prime moduli for NTT and for the ring-LWE hardness assumption. This assumption is load-bearing for all claimed speedups.

    Authors: The manuscript relies on the standard ring-LWE assumption over power-of-two cyclotomic rings, which prior RLWE-based work has also instantiated with power-of-two moduli. The auxiliary prime is used only for client-side NTT during decryption and does not alter the server-side power-of-two protocol. However, we acknowledge that an explicit reference to hardness results and a short noise-growth comparison paragraph are missing from the current text. We will add both in the revision.
    Revision: partial

  2. Referee: [ReLU and truncation subsection] The exact truncation claim (ReLU and truncation subsection): the paper states that local right shifts after ReLU incur no precision loss and require no auxiliary 2PC protocol. The argument must demonstrate that modular wrap-around cannot occur at the target fixed-point bit-width and that noise growth remains compatible with the subsequent layers; without an explicit bound or modular-arithmetic invariant, the elimination of the truncation protocol is not yet justified.

    Authors: We agree that an explicit modular invariant is required. In the power-of-two ring, ReLU outputs are bounded by construction to remain within the representable range before the right-shift (ensuring no wrap-around at the target fixed-point width), and the subsequent noise growth is controlled by the same modulus size chosen for the overall scheme. We will insert a short proof sketch with the required bounds in the ReLU and truncation subsection.
    Revision: yes

  3. Referee: [evaluation section, Table X] Experimental comparison (evaluation section, Table X): the reported 2.07-3.72× latency advantage is measured with AVX disabled. The manuscript should clarify whether the baseline Cheetah and Rhombus implementations were also compiled without AVX or whether the comparison mixes optimized and unoptimized code; otherwise the speedup attribution to the power-of-two design alone is ambiguous.

    Authors: All reported timings, including the Cheetah and Rhombus baselines, were obtained with AVX explicitly disabled in the compilation flags to isolate the effect of the algorithmic changes. We will add an explicit statement to this effect in the evaluation section and update the table caption accordingly.
    Revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims are empirical measurements of an independent design choice

The paper presents a design choice (power-of-two ciphertext ring) that is asserted to enable SPA-Conv and exact local right-shift truncation. The headline performance numbers (2.07-3.72x latency reduction etc.) are reported as direct experimental measurements on ResNet-18/50 and MobileNetV2 rather than quantities obtained by fitting parameters to a subset of the same data or by algebraic reduction to prior fitted values. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the abstract or described structure. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted beyond the implicit assumption that the power-of-two ring preserves HE security and correctness.

pith-pipeline@v0.9.1-grok · 5748 in / 1131 out tokens · 19840 ms · 2026-06-27T09:10:56.334091+00:00 · methodology

