GPU Acceleration of Learning With Errors KEMs Using OpenACC for Post-Quantum Cryptography

Daniele Gregori; Elisabetta Boella; Gabriella Bettonte; Marco Pedicini; Matteo Barbieri; Nitin Shukla; Simone Rizzo; Tiziana Liberati

arxiv: 2606.01211 · v1 · pith:7PXRDJ7Nnew · submitted 2026-05-31 · 💻 cs.CR · cs.DC

GPU Acceleration of Learning With Errors KEMs Using OpenACC for Post-Quantum Cryptography

Tiziana Liberati , Nitin Shukla , Matteo Barbieri , Gabriella Bettonte , Elisabetta Boella , Simone Rizzo , Daniele Gregori , Marco Pedicini This is my paper

Pith reviewed 2026-06-28 17:01 UTC · model grok-4.3

classification 💻 cs.CR cs.DC

keywords LWEKEMGPU accelerationOpenACCPost-Quantum CryptographyGrace Hopperspeedupenergy efficiency

0 comments

The pith

OpenACC GPU code accelerates LWE KEMs up to 208 times on the Grace Hopper Superchip.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a Key Encapsulation Mechanism from the plain Learning with Errors problem and port it to GPUs using OpenACC directives for parallel execution. Tests across NVIDIA platforms show large gains in both time and energy, with the Grace Hopper Superchip reaching 208 times the speed of a multithreaded CPU baseline while also handling problem sizes blocked by CPU memory and synchronization limits. Energy efficiency improves by roughly a factor of two on the Superchip compared with x86 CPUs paired with H100 GPUs. The work focuses on bare-metal and container runs to quantify these benefits for post-quantum workloads that are otherwise too slow on conventional hardware.

Core claim

A plain LWE KEM is implemented and accelerated via OpenACC on NVIDIA GPUs; on the Grace Hopper Superchip the approach delivers up to 208× speedup over a multithreaded CPU baseline, enables problem sizes that exceed CPU memory and synchronization capacity, and achieves approximately 2× better energy efficiency than x86 CPU plus H100 GPU combinations.

What carries the argument

OpenACC directives applied to the matrix-vector operations and error sampling inside the LWE KEM to distribute work across GPU threads and memory hierarchy.

If this is right

Larger LWE parameter sets become practical on GPU hardware that CPU systems cannot run.
Time-to-solution for LWE KEM encapsulation and decapsulation drops sharply on the tested NVIDIA platforms.
Energy-to-solution improves by a factor of about two on the Grace Hopper Superchip relative to x86-plus-H100 systems.
Both bare-metal and containerized deployments show consistent acceleration across GPU generations.
The same OpenACC approach can be applied to other LWE-derived primitives without changing the underlying math.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same directive-based porting strategy could be reused for other lattice-based post-quantum schemes that share similar matrix operations.
High-performance-computing centers could adopt GPU clusters for quantum-resistant key exchange at scale.
Energy savings might lower the operational cost of running post-quantum protocols in data centers.
Newer GPU architectures with larger unified memory could push the feasible LWE parameter range even further.

Load-bearing premise

The OpenACC version must produce bitwise-identical results to the CPU reference and must not create new side-channel leaks or correctness errors.

What would settle it

Execute the GPU and CPU implementations on identical inputs and verify that every output bit matches exactly while also checking whether timing or power traces reveal additional information on the GPU path.

Figures

Figures reproduced from arXiv: 2606.01211 by Daniele Gregori, Elisabetta Boella, Gabriella Bettonte, Marco Pedicini, Matteo Barbieri, Nitin Shukla, Simone Rizzo, Tiziana Liberati.

**Figure 2.** Figure 2: Pseudo-code of the LWE-based key encapsulation mechanism, showing encapsulation and de [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Key establishment using a KEM Overall, the chosen parameter sets are intended to ensure reliable decapsulation in our experiments while spanning a range of increasing, approximate hardness levels. 2.3 Bottlenecks and Parallelization Approach Our baseline implementation is written in C++ and parallelized using OpenMP to exploit multicore architectures, addressing the high computational cost of lattice-based… view at source ↗

**Figure 4.** Figure 4: Comparison of LWE-KEM execution timelines on the NVIDIA A100 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: presents the scaling behavior for a fixed problem size of n = 4096. The execution time increases approximately linearly with respect to the workload parameter N for both implementations. The OpenMP-based CPU implementation (blue) consistently exhibits higher execution times, whereas the OpenACC-based GPU implementation (green) maintains substantially lower values across all tested workloads. The roughly c… view at source ↗

**Figure 6.** Figure 6: Top row: distribution of time-to-solution (TTS) over 20 simulations for runs with [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Execution times and relative speedups for the different GPU models evaluated in our cross [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

read the original abstract

Shor's algorithm proved that asymmetric cryptographic protocols based on the integer factorization and discrete logarithm problems are no longer safe in a world with large-scale quantum computers. As a result, Post-Quantum Cryptography (PQC) has been developed over the last few years, seeking cryptographic primitives resistant to quantum attacks. One of the main hard problems underlying PQC schemes is the Learning with Errors (LWE) problem, which is significantly more computationally intensive than its classical predecessors. In this work, we present a Key Encapsulation Mechanism (KEM) based on plain LWE and develop a GPU-oriented implementation using OpenACC. We evaluate the performance of our accelerated application in terms of both time-to-solution and energy-to-solution, considering bare-metal and containerized executions across multiple NVIDIA GPU models and generations. Our implementation achieves significant acceleration across all tested GPU platforms. In particular, on the NVIDIA Grace Hopper Superchip, it attains up to a $208\times$ speedup over a multithreaded CPU baseline and enables the execution of problem sizes that are impractical on CPU architectures due to memory and synchronization constraints. Energy consumption analysis also shows $\approx 2\times$ better efficiency when using the Superchip compared to systems equipped with x86-based CPUs and NVIDIA H100 GPUs. These results highlight the effectiveness of GPU acceleration for computationally demanding LWE-based cryptographic workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper ports a plain LWE KEM to OpenACC on GPUs and reports 208x speedup plus better energy numbers on Grace Hopper, but supplies no evidence of output equivalence or constant-time preservation.

read the letter

The main thing to know is that the authors accelerated a basic LWE-based KEM using OpenACC on NVIDIA GPUs and measured up to 208x speedup over CPU on the Grace Hopper Superchip, plus roughly 2x better energy efficiency.

They do well by testing the implementation on multiple GPU generations and in both bare-metal and containerized environments. This kind of cross-platform data is useful for anyone trying to deploy these schemes in practice.

The soft spots are in the verification area. The stress-test note is accurate here: there is no mention of checking that the OpenACC version gives bitwise identical outputs or preserves constant-time execution. Since OpenACC can change loop scheduling and memory access, this needs explicit confirmation to make the performance claims credible. The paper also does not detail the exact problem sizes or how functional equivalence was ensured.

This is a paper for people working on high-performance implementations of post-quantum cryptography rather than for theorists. It applies established techniques to a specific problem.

I would bring it to a reading group if the group discusses practical PQC performance. I would not cite the work in my own papers. It deserves serious peer review because the empirical results on newer hardware could be valuable once the correctness questions are addressed.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an LWE-based KEM and its GPU implementation via OpenACC directives. It reports empirical timing and energy measurements across NVIDIA GPU platforms, claiming up to 208× speedup over a multithreaded CPU baseline on the Grace Hopper Superchip, the ability to handle problem sizes infeasible on CPUs due to memory and synchronization limits, and approximately 2× better energy efficiency versus x86 CPU + H100 GPU systems.

Significance. If the implementation is shown to be functionally equivalent and side-channel safe, the work would provide concrete evidence that directive-based OpenACC parallelization can deliver substantial performance and energy gains for LWE KEMs on modern GPU hardware, including support for larger parameter sets. This would be relevant for high-performance post-quantum cryptography deployments.

major comments (2)

[Abstract] Abstract: the central claims of 208× speedup and ≈2× energy improvement presuppose that the OpenACC GPU kernels produce bitwise-identical outputs to the reference multithreaded CPU implementation and preserve constant-time behavior. No description of equivalence testing, differential output verification, reduction-order checks, or side-channel timing analysis is supplied, rendering the performance numbers impossible to evaluate.
[Results] Results section (implied by the performance numbers): the manuscript supplies no experimental setup details, baseline code versions, problem-size definitions (e.g., lattice dimension n, modulus q, error distribution), number of runs, or error bars. Without these, the reported speedups cannot be reproduced or assessed for statistical significance.

minor comments (1)

[Abstract] The abstract mentions both bare-metal and containerized executions but does not clarify which environment produced the 208× figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of verification and reproducibility. We address each major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 208× speedup and ≈2× energy improvement presuppose that the OpenACC GPU kernels produce bitwise-identical outputs to the reference multithreaded CPU implementation and preserve constant-time behavior. No description of equivalence testing, differential output verification, reduction-order checks, or side-channel timing analysis is supplied, rendering the performance numbers impossible to evaluate.

Authors: We agree that explicit documentation of functional equivalence and constant-time properties is required to substantiate the reported speedups for a cryptographic primitive. The revised manuscript will add a new subsection under Methodology describing the verification process: bitwise output comparison against the CPU reference for 10,000 random test vectors across all evaluated parameter sets, confirmation that reduction operations are order-independent, and timing measurements on the GPU kernels to verify absence of data-dependent execution paths. These additions will directly support the abstract claims. revision: yes
Referee: [Results] Results section (implied by the performance numbers): the manuscript supplies no experimental setup details, baseline code versions, problem-size definitions (e.g., lattice dimension n, modulus q, error distribution), number of runs, or error bars. Without these, the reported speedups cannot be reproduced or assessed for statistical significance.

Authors: We acknowledge that the current manuscript lacks sufficient experimental detail for reproducibility. The revised version will expand the Results section (and add an Experimental Setup subsection) to specify: exact LWE parameters (n, q, error distribution σ) for each reported data point; baseline CPU implementation (OpenMP multithreaded reference with version and compiler flags); number of timed runs (minimum 100 per configuration with warm-up); and statistical reporting including mean, standard deviation, and error bars on all speedup and energy figures. Problem sizes will be tabulated explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance measurements

full rationale

The paper consists of implementation details and direct empirical measurements of runtime and energy on GPU platforms versus a multithreaded CPU baseline. No derivations, equations, fitted parameters, or self-citations are used to support the central claims of speedup and efficiency. Results are obtained by running the OpenACC code and recording wall-clock time and power draw; these are falsifiable against external hardware without any reduction to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering implementation and benchmarking paper. It introduces no mathematical free parameters, axioms, or new entities.

pith-pipeline@v0.9.1-grok · 5800 in / 1085 out tokens · 37187 ms · 2026-06-28T17:01:58.878747+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages

[1]

In: SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Almerol, J.L., et al.: Accelerating gravitational n-body simulations using the risc-v-based tenstorrent wormhole. In: SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1729–1735 (2025). https://doi.org/10.1145/3731599.3767528

work page doi:10.1145/3731599.3767528 2025
[2]

Astronomy and Computing56, 101121 (2026)

Almerol, J.L., et al.: Assessing performance and porting strategies for gravitational n-body simulations on the risc-v-based tenstorrent wormhole™. Astronomy and Computing56, 101121 (2026). https://doi.org/ https://doi.org/10.1016/j.ascom.2026.101121

work page doi:10.1016/j.ascom.2026.101121 2026
[3]

Amati, G., et al.: Experience on clock rate adjustment for energy-efficient gpu-accelerated real-world codes. p. 245–257. Springer-Verlag, Berlin, Heidelberg (2025). https://doi.org/10.1007/978-3-032-07612-0_19, https://doi.org/10.1007/978-3-032-07612-0_19

work page doi:10.1007/978-3-032-07612-0_19 2025
[4]

Cryptology ePrint Archive, Paper 2024/1895 (2024)

Biasioli, B., et al.: A tool for fast and secure LWE parameter selection: the FHE case. Cryptology ePrint Archive, Paper 2024/1895 (2024)

2024
[5]

Boella, E., et al.: Accelerating the particle-in-cell code ecsim with openacc (2026)

2026
[6]

Fujisaki, E., et al.: Secure integration of asymmetric and symmetric encryption schemes. J. Cryptol.26, 80–101 (2013). https://doi.org/10.1007/s00145-011-9114-1, https://doi.org/10.1007/s00145-011-9114-1

work page doi:10.1007/s00145-011-9114-1 2013
[7]

IEEE Transactions on Parallel and Distributed Systems32(3), 575–586 (2021)

Gupta, N., et al.: Pqc acceleration using gpus: Frodokem, newhope, and kyber. IEEE Transactions on Parallel and Distributed Systems32(3), 575–586 (2021). https://doi.org/10.1109/TPDS.2020.3025691

work page doi:10.1109/tpds.2020.3025691 2021
[8]

Gustafson, J.L.: Amdahl’s Law, pp. 53–60. Springer US, Boston, MA (2011). https://doi.org/10.1007/ 978-0-387-09766-4_77

2011
[9]

Array15, 100242 (08 2022)

Kumar, M.: Post-quantum cryptography algorithm’s standardization and performance analysis. Array15, 100242 (08 2022). https://doi.org/10.1016/j.array.2022.100242

work page doi:10.1016/j.array.2022.100242 2022
[10]

National Institute of Standards and Technology: FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism Standard. Tech. Rep. FIPS 203, U.S. Department of Commerce, National Institute of Stan- dards and Technology (2024)

2024
[11]

Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. J. ACM56(6) (2009). https://doi.org/10.1145/1568318.1568324

work page doi:10.1145/1568318.1568324 2009
[12]

IEEE Transactions on Parallel and Distributed Systems35(11), 1964–1976 (2024)

Shen, S., et al.: High-throughput gpu implementation of dilithium post-quantum digital signature. IEEE Transactions on Parallel and Distributed Systems35(11), 1964–1976 (2024). https://doi.org/10.1109/ TPDS.2024.3453289

arXiv 1964
[13]

In: Proceedings 35th Annual Symposium on Foundations of Computer Science

Shor, P.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th Annual Symposium on Foundations of Computer Science. pp. 124–134 (1994). https://doi.org/10.1109/ SFCS.1994.365700

arXiv 1994
[14]

Computing in Science & Engineering24(5), 53–63 (2022)

Stack, M., et al.: Openacc acceleration of an agent-based biological simulation framework. Computing in Science & Engineering24(5), 53–63 (2022). https://doi.org/10.1109/MCSE.2022.3226602

work page doi:10.1109/mcse.2022.3226602 2022
[15]

Turisini, M., et al.: Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications (2023)

2023
[16]

The Inter- national Journal of High Performance Computing Applications39(4), 502–518 (2025)

Vignolo, A., et al.: A tale of two codes: Cuda vs openacc for mass-zero constrained dynamics. The Inter- national Journal of High Performance Computing Applications39(4), 502–518 (2025). https://doi.org/ 10.1177/10943420251331673

work page doi:10.1177/10943420251331673 2025

[1] [1]

In: SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Almerol, J.L., et al.: Accelerating gravitational n-body simulations using the risc-v-based tenstorrent wormhole. In: SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1729–1735 (2025). https://doi.org/10.1145/3731599.3767528

work page doi:10.1145/3731599.3767528 2025

[2] [2]

Astronomy and Computing56, 101121 (2026)

Almerol, J.L., et al.: Assessing performance and porting strategies for gravitational n-body simulations on the risc-v-based tenstorrent wormhole™. Astronomy and Computing56, 101121 (2026). https://doi.org/ https://doi.org/10.1016/j.ascom.2026.101121

work page doi:10.1016/j.ascom.2026.101121 2026

[3] [3]

Amati, G., et al.: Experience on clock rate adjustment for energy-efficient gpu-accelerated real-world codes. p. 245–257. Springer-Verlag, Berlin, Heidelberg (2025). https://doi.org/10.1007/978-3-032-07612-0_19, https://doi.org/10.1007/978-3-032-07612-0_19

work page doi:10.1007/978-3-032-07612-0_19 2025

[4] [4]

Cryptology ePrint Archive, Paper 2024/1895 (2024)

Biasioli, B., et al.: A tool for fast and secure LWE parameter selection: the FHE case. Cryptology ePrint Archive, Paper 2024/1895 (2024)

2024

[5] [5]

Boella, E., et al.: Accelerating the particle-in-cell code ecsim with openacc (2026)

2026

[6] [6]

Fujisaki, E., et al.: Secure integration of asymmetric and symmetric encryption schemes. J. Cryptol.26, 80–101 (2013). https://doi.org/10.1007/s00145-011-9114-1, https://doi.org/10.1007/s00145-011-9114-1

work page doi:10.1007/s00145-011-9114-1 2013

[7] [7]

IEEE Transactions on Parallel and Distributed Systems32(3), 575–586 (2021)

Gupta, N., et al.: Pqc acceleration using gpus: Frodokem, newhope, and kyber. IEEE Transactions on Parallel and Distributed Systems32(3), 575–586 (2021). https://doi.org/10.1109/TPDS.2020.3025691

work page doi:10.1109/tpds.2020.3025691 2021

[8] [8]

Gustafson, J.L.: Amdahl’s Law, pp. 53–60. Springer US, Boston, MA (2011). https://doi.org/10.1007/ 978-0-387-09766-4_77

2011

[9] [9]

Array15, 100242 (08 2022)

Kumar, M.: Post-quantum cryptography algorithm’s standardization and performance analysis. Array15, 100242 (08 2022). https://doi.org/10.1016/j.array.2022.100242

work page doi:10.1016/j.array.2022.100242 2022

[10] [10]

National Institute of Standards and Technology: FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism Standard. Tech. Rep. FIPS 203, U.S. Department of Commerce, National Institute of Stan- dards and Technology (2024)

2024

[11] [11]

Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. J. ACM56(6) (2009). https://doi.org/10.1145/1568318.1568324

work page doi:10.1145/1568318.1568324 2009

[12] [12]

IEEE Transactions on Parallel and Distributed Systems35(11), 1964–1976 (2024)

Shen, S., et al.: High-throughput gpu implementation of dilithium post-quantum digital signature. IEEE Transactions on Parallel and Distributed Systems35(11), 1964–1976 (2024). https://doi.org/10.1109/ TPDS.2024.3453289

arXiv 1964

[13] [13]

In: Proceedings 35th Annual Symposium on Foundations of Computer Science

Shor, P.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th Annual Symposium on Foundations of Computer Science. pp. 124–134 (1994). https://doi.org/10.1109/ SFCS.1994.365700

arXiv 1994

[14] [14]

Computing in Science & Engineering24(5), 53–63 (2022)

Stack, M., et al.: Openacc acceleration of an agent-based biological simulation framework. Computing in Science & Engineering24(5), 53–63 (2022). https://doi.org/10.1109/MCSE.2022.3226602

work page doi:10.1109/mcse.2022.3226602 2022

[15] [15]

Turisini, M., et al.: Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications (2023)

2023

[16] [16]

The Inter- national Journal of High Performance Computing Applications39(4), 502–518 (2025)

Vignolo, A., et al.: A tale of two codes: Cuda vs openacc for mass-zero constrained dynamics. The Inter- national Journal of High Performance Computing Applications39(4), 502–518 (2025). https://doi.org/ 10.1177/10943420251331673

work page doi:10.1177/10943420251331673 2025