Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040
Pith reviewed 2026-05-15 08:42 UTC · model grok-4.3
The pith
ML-KEM-512 completes a full key exchange in 35.7 ms on ARM Cortex-M0+, seventeen times faster than ECDH P-256.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the RP2040 running at 133 MHz, unmodified PQClean implementations of ML-KEM-512 complete a full key exchange (key generation, encapsulation, and decapsulation) in 35.7 ms, with an estimated energy cost of 2.83 mJ under the datasheet power model. That is 17 times faster than a full ECDH P-256 key agreement on identical hardware. ML-DSA signing latency varies widely because of rejection sampling: coefficients of variation run from 66 to 73 percent, and the 99th-percentile time reaches 1,125 ms for ML-DSA-87. The M0+ is only 1.8 to 1.9 times slower than published Cortex-M4 reference C results (built with -O3, versus -Os here), despite lacking 64-bit multiply, DSP, and SIMD instructions.
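The energy figure rests on a constant-power model, E = P · t. As a quick consistency check, the sketch below back-computes the average power implied by the two reported numbers (35.7 ms and 2.83 mJ). The function names are illustrative, and the implied power is derived arithmetic, not a value quoted from the datasheet.

```python
def energy_mj(power_mw: float, time_ms: float) -> float:
    """Constant-power model: energy (mJ) = power (mW) * time (ms) / 1000."""
    return power_mw * time_ms / 1000.0


def implied_power_mw(energy_mj_value: float, time_ms: float) -> float:
    """Invert the model to recover the average power the estimate assumes."""
    return energy_mj_value * 1000.0 / time_ms


# Figures reported for the ML-KEM-512 full key exchange.
TIME_MS = 35.7
ENERGY_MJ = 2.83

power = implied_power_mw(ENERGY_MJ, TIME_MS)
print(f"implied average power: {power:.1f} mW")  # about 79.3 mW
print(f"round trip: {energy_mj(power, TIME_MS):.2f} mJ")
```

Because the model is linear, any error in the assumed power carries over proportionally to the energy estimate.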
What carries the argument
Unmodified reference C implementations from PQClean, compiled with -Os for the ARM Cortex-M0+ of the RP2040 microcontroller, clocked at 133 MHz with 264 KB of SRAM.
If this is right
- Post-quantum key exchange with ML-KEM is practical for IoT devices with long service lives without exceeding typical latency budgets.
- ML-DSA signing latency is unpredictable because of rejection sampling, so systems must budget for tail latencies above one second (99th percentile of 1,125 ms for ML-DSA-87).
- The modest slowdown factor of 1.8-1.9x versus M4 suggests that algorithm performance is not heavily penalized by the simpler M0+ architecture.
- Higher security parameter sets scale predictably in resource use, allowing informed selection for different device classes.
- Releasing the benchmark suite as open source supports reproducibility and further development by others.
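The latency-budget point for ML-DSA can be made concrete with a toy model. The sketch below simulates a signing loop in which each attempt is accepted independently with a fixed probability, then reports the mean, coefficient of variation, and 99th percentile. The acceptance probability, per-attempt cost, and fixed overhead are hypothetical placeholders chosen for illustration; they are not parameters taken from the paper or from FIPS 204.

```python
import math
import random
import statistics


def simulate_signing_ms(n, accept_prob, per_attempt_ms, fixed_ms, seed=0):
    """Simulate n signing latencies under a simple rejection-sampling model.

    Each attempt succeeds independently with probability accept_prob, so the
    number of attempts is geometrically distributed. All parameters are
    hypothetical placeholders, not values measured in the paper.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        attempts = 1
        while rng.random() >= accept_prob:
            attempts += 1
        samples.append(fixed_ms + attempts * per_attempt_ms)
    return samples


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


latencies = simulate_signing_ms(n=10_000, accept_prob=0.25,
                                per_attempt_ms=50.0, fixed_ms=20.0)
print(f"mean = {statistics.fmean(latencies):.1f} ms")
print(f"CV   = {coefficient_of_variation(latencies):.2f}")
print(f"p99  = {percentile(latencies, 99):.1f} ms")
```

Even in this simplified model the 99th percentile sits several times above the mean, which is why a deadline sized to average signing time would be missed regularly.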
Where Pith is reading between the lines
- Direct hardware energy measurements could validate the datasheet model and provide more accurate consumption figures.
- Adding simple optimizations, such as assembly routines for critical loops, might cut execution times well below these reference results.
- The results imply that post-quantum migration for embedded systems is more feasible than previously assumed based on older, less efficient algorithms.
- Similar benchmarks on other constrained platforms like 8-bit or 16-bit MCUs would help map the full feasibility landscape.
Load-bearing premise
The power consumption estimates derived from the microcontroller datasheet accurately represent the actual energy used during the cryptographic computations, and the PQClean reference code represents realistic performance without custom tuning.
What would settle it
Measuring the actual current draw on the RP2040 during ML-KEM-512 execution with a precision power monitor and comparing it to the 2.83 mJ estimate would test the energy model.
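Such a measurement would yield a current trace, not a single number, so the energy has to be integrated. A minimal sketch, assuming a fixed supply voltage and evenly timestamped current samples from a power monitor, is shown below. The 3.3 V supply, the sampling interval, and the synthetic current values are all assumptions made for illustration; no measured trace is available in the source.

```python
def energy_from_trace_mj(times_s, currents_a, supply_v):
    """Trapezoidal integration of P(t) = V * I(t), returned in millijoules."""
    if len(times_s) != len(currents_a) or len(times_s) < 2:
        raise ValueError("need two or more matching samples")
    joules = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        avg_current = 0.5 * (currents_a[i] + currents_a[i - 1])
        joules += supply_v * avg_current * dt
    return joules * 1000.0


# Hypothetical trace: assumed 3.3 V supply, one sample every 0.1 ms
# across a 35.7 ms window, with a made-up sawtooth current profile.
SUPPLY_V = 3.3
times = [i * 0.0001 for i in range(358)]
currents = [0.024 + 0.002 * ((i % 10) / 10.0) for i in range(358)]

measured = energy_from_trace_mj(times, currents, SUPPLY_V)
modeled = 2.83  # datasheet-model estimate reported in the paper
deviation = 100.0 * (measured - modeled) / modeled
print(f"measured {measured:.2f} mJ vs modeled {modeled:.2f} mJ ({deviation:+.1f}%)")
```

The size of the deviation between the integrated and modeled values would indicate how much the constant-power assumption matters for this workload.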
Original abstract
The migration to post-quantum cryptography is urgent for Internet of Things devices with 10–20 year lifespans, yet no systematic benchmarks exist for the finalised NIST standards on the most constrained 32-bit processor class. This paper presents the first isolated algorithm-level benchmarks of ML-KEM (FIPS 203) and ML-DSA (FIPS 204) on ARM Cortex-M0+, measured on the RP2040 (Raspberry Pi Pico) at 133 MHz with 264 KB SRAM. Using PQClean reference C implementations, we measure all three security levels of ML-KEM (512/768/1024) and ML-DSA (44/65/87) across key generation, encapsulation/signing, and decapsulation/verification. ML-KEM-512 completes a full key exchange in 35.7 ms with an estimated energy cost of 2.83 mJ (datasheet power model), 17x faster than a complete ECDH P-256 key agreement on the same hardware. ML-DSA signing exhibits high latency variance due to rejection sampling (coefficient of variation 66–73%, 99th-percentile up to 1,125 ms for ML-DSA-87). The M0+ incurs only a 1.8–1.9x slowdown relative to published Cortex-M4 reference C results (compiled with -O3 versus our -Os), despite lacking 64-bit multiply, DSP, and SIMD instructions, making this a conservative upper bound on the microarchitectural penalty. All code, data, and scripts are released as an open-source benchmark suite for reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper benchmarks the NIST post-quantum cryptography standards ML-KEM and ML-DSA on the ARM Cortex-M0+ processor of the RP2040 microcontroller. Using PQClean implementations, it reports execution time, memory footprint, and estimated energy consumption for key generation, encapsulation/signing, and decapsulation/verification at all security levels. Notable findings: ML-KEM-512 completes a full key exchange in 35.7 ms at an estimated 2.83 mJ (17 times faster than ECDH P-256); ML-DSA signing shows high latency variance due to rejection sampling; and the slowdown relative to Cortex-M4 results is a modest 1.8-1.9x despite the simpler instruction set.
Significance. The results offer practical guidance for implementing post-quantum cryptography on resource-constrained IoT devices with long lifespans. The release of all code, data, and scripts enhances reproducibility and allows the community to build upon these baselines. Direct timing measurements on actual hardware provide high confidence in the latency claims, making this a solid contribution to the field of embedded post-quantum security.
major comments (1)
- [Energy consumption results] The energy figures, such as 2.83 mJ for ML-KEM-512, are calculated using a power value from the RP2040 datasheet rather than direct measurements with a current probe or shunt resistor. This approach assumes constant average power draw during the cryptographic operations, which may vary with the specific instruction and memory access patterns of the PQClean code. While the timing measurements are direct and reliable, the energy estimates carry this modeling uncertainty and should be presented with stronger caveats if they are to support claims about energy efficiency.
minor comments (2)
- Ensure that all compiler flags and optimization levels are consistently documented across comparisons to the M4 reference.
- The high variance in ML-DSA signing times due to rejection sampling is well-noted; consider discussing implications for real-time applications.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work and for the constructive comment on the energy estimates. We address the point below.
Referee: The energy figures, such as 2.83 mJ for ML-KEM-512, are calculated using a power value from the RP2040 datasheet rather than direct measurements with a current probe or shunt resistor. This approach assumes constant average power draw during the cryptographic operations, which may vary with the specific instruction and memory access patterns of the PQClean code. While the timing measurements are direct and reliable, the energy estimates carry this modeling uncertainty and should be presented with stronger caveats if they are to support claims about energy efficiency.
Authors: We agree that the reported energy figures are estimates obtained by multiplying measured execution times by the typical power consumption value from the RP2040 datasheet, rather than direct current measurements. In the revised manuscript we will add explicit caveats in the energy-results section, the abstract, and the conclusions, stating that these are modeled estimates assuming constant average power and that actual consumption may vary with instruction mix and memory-access patterns of the PQClean implementations. We will also qualify all comparative energy-efficiency statements to reflect this modeling uncertainty.
Revision: yes
Circularity Check
No circularity; purely empirical measurements with no derivations or self-referential steps
The paper consists entirely of direct hardware benchmarks on the RP2040 using unmodified PQClean reference implementations. Latency (e.g., 35.7 ms for ML-KEM-512 key exchange) is measured via cycle counts or timers. Energy (2.83 mJ) is computed by scaling measured time by a constant power value taken from the RP2040 datasheet; this is an external input, not a fitted parameter or self-defined quantity. No equations, ansatzes, uniqueness theorems, or self-citations appear as load-bearing steps in the provided text. The 17x comparison to ECDH is also a direct time ratio. All claims reduce to reproducible runs on open code rather than any internal reduction to inputs by construction.
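Since latency comes from cycle counts or timers, the reported times map directly onto cycle counts at the 133 MHz clock. The sketch below shows the conversion in both directions; the helper names are illustrative, and the cycle figure it produces is derived from the reported 35.7 ms, not read from the paper's raw data.

```python
CLOCK_HZ = 133_000_000  # RP2040 system clock stated in the paper


def cycles_to_ms(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles * 1000.0 / clock_hz


def ms_to_cycles(ms: float, clock_hz: int = CLOCK_HZ) -> int:
    """Convert milliseconds to the nearest whole cycle count."""
    return round(ms * clock_hz / 1000.0)


kem_cycles = ms_to_cycles(35.7)
print(f"35.7 ms at 133 MHz is about {kem_cycles / 1e6:.2f} million cycles")
print(f"back to time: {cycles_to_ms(kem_cycles):.1f} ms")
```

Comparing cycle counts instead of wall-clock times removes the clock-frequency difference when setting these results against figures from other Cortex-M parts.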
Appendix A: Benchmark source code
The complete benchmark source code, raw data, analysis scripts, and reproduction instructions are available in the companion open-source repository: https://github.com/rojinc/pqc-cort...