Low-Stack HAETAE for Memory-Constrained Microcontrollers
Pith reviewed 2026-05-10 08:04 UTC · model grok-4.3
The pith
HAETAE lattice signing reduced to 5.8-6.0 kB of stack on microcontrollers with 8-16 kB of SRAM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Three techniques carry the stack reduction: rejection-aware pass decomposition isolates encoding to the post-acceptance path, component-level early rejection short-circuits response computation on partial-norm overflow, and reverse-order streaming rANS coding removes the need for full hint and high-bits staging buffers. Combined with streamed matrix generation, a two-pass hyperball sampler, and row-streamed verification, they reduce signing stack from 71-141 kB to 5.8-6.0 kB, key generation stack to 4.7-5.7 kB, and verification stack to 4.7-4.8 kB on a Nucleo-L4R5ZI for HAETAE-2/3/5. Signing slows down by at most 3.4x, while verification speeds up by up to 18 percent.
What carries the argument
Rejection-aware pass decomposition that isolates encoding to the post-acceptance path, together with component-level early rejection and reverse-order streaming rANS entropy coding.
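To make the reverse-order idea concrete, here is a minimal byte-wise rANS sketch in C. It is not the paper's coder: the three-symbol frequency table, the `symbol_fn` callback, `demo_symbol`, and both function names are assumptions chosen for illustration. The encoder visits symbols from last to first and pulls each one from a callback, so no array of symbols has to be staged; the decoder reads the stream front to back and recovers them in forward order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALE_BITS 4u          /* symbol frequencies sum to 1 << SCALE_BITS */
#define RANS_L (1u << 23)      /* lower bound of the normalized state range */

/* Hypothetical static model for three symbols (not HAETAE's tables). */
static const uint32_t FREQ[3]  = {12, 3, 1};
static const uint32_t START[3] = {0, 12, 15};

/* Callback that recomputes symbol i on demand instead of reading a buffer. */
typedef uint8_t (*symbol_fn)(size_t i, const void *ctx);

/* Illustrative symbol source: derives a symbol in {0,1,2} from the index. */
static uint8_t demo_symbol(size_t i, const void *ctx)
{
    (void)ctx;
    uint32_t v = ((uint32_t)i * 2654435761u) >> 28;   /* top 4 bits, 0..15 */
    return (uint8_t)(v < START[1] ? 0 : (v < START[2] ? 1 : 2));
}

/* Encode n symbols, visiting them from index n-1 down to 0 and writing bytes
 * from the end of buf toward its start. Returns a pointer to the first byte
 * of the finished stream, or NULL if buf is too small. */
static uint8_t *rans_encode_reverse(symbol_fn sym, const void *ctx, size_t n,
                                    uint8_t *buf, size_t cap)
{
    uint8_t *p = buf + cap;
    uint32_t x = RANS_L;
    for (size_t i = n; i-- > 0;) {
        uint8_t s = sym(i, ctx);
        assert(s < 3);
        uint32_t x_max = ((RANS_L >> SCALE_BITS) << 8) * FREQ[s];
        while (x >= x_max) {               /* renormalize: emit low bytes */
            if (p == buf) return NULL;
            *--p = (uint8_t)(x & 0xFFu);
            x >>= 8;
        }
        x = ((x / FREQ[s]) << SCALE_BITS) + (x % FREQ[s]) + START[s];
    }
    if ((size_t)(p - buf) < 4) return NULL;
    p -= 4;                                /* flush the final state */
    p[0] = (uint8_t)(x & 0xFFu);
    p[1] = (uint8_t)((x >> 8) & 0xFFu);
    p[2] = (uint8_t)((x >> 16) & 0xFFu);
    p[3] = (uint8_t)(x >> 24);
    return p;
}

/* Decode n symbols in forward order from the stream [p, end). */
static void rans_decode(const uint8_t *p, const uint8_t *end,
                        uint8_t *out, size_t n)
{
    uint32_t x = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    p += 4;
    for (size_t i = 0; i < n; i++) {
        uint32_t slot = x & ((1u << SCALE_BITS) - 1u);
        uint8_t s = (uint8_t)(slot >= START[2] ? 2 : (slot >= START[1] ? 1 : 0));
        out[i] = s;
        x = FREQ[s] * (x >> SCALE_BITS) + slot - START[s];
        while (x < RANS_L && p < end)      /* renormalize: pull bytes back in */
            x = (x << 8) | *p++;
    }
}
```

A caller would hand the encoder an output buffer and treat the returned pointer as the start of the stream, then pass that pointer and the end of the buffer to `rans_decode`. Only the output bytes and the 32-bit state live in memory during encoding.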
If this is right
- Verification at every security level fits inside 8 kB total RAM including the signature buffer and runs 2.34-3.34x faster than ML-DSA m4fstack at comparable levels.
- Performance cost is bounded by a factor of 1.8 for key generation and 3.4 for signature generation while verification can improve by up to 18 percent.
- The pure-C code covers HAETAE-2, -3, and -5, works under RIOT-OS on both ARM Cortex-M4 and RISC-V, and reduces key-generation and signing stack by 75-95 percent versus the pqm4 reference for HAETAE-2 and -3 (8 and 24 percent versus the prior memory-optimized HAETAE-5 implementation).
- All three parameter sets become feasible on devices whose SRAM budget was previously too small for HAETAE.
Where Pith is reading between the lines
- The same streaming-plus-early-rejection pattern could be applied to other module-lattice schemes that currently exceed microcontroller RAM limits.
- Verification fitting in 8 kB opens the door to using HAETAE directly in very small sensor nodes that previously had to fall back to classical signatures.
- Further reductions below 4 kB might be possible by trading additional computation for smaller intermediate buffers in the two-pass sampler.
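The compute-for-memory trade behind a two-pass sampler can be sketched as follows. This is an illustrative pattern, not HAETAE's sampler: the xorshift generator, the `standin_gauss` sample (a centered sum of bytes rather than a real Gaussian), the integer scaling, and every name here are assumptions. Pass one streams the samples only to accumulate their squared norm; pass two re-seeds the generator, regenerates the same samples, and hands each scaled coordinate to a consumer, so the full vector is never stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small non-cryptographic PRNG, used only so both passes can replay a seed. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *s = x;
    return x;
}

/* Stand-in for a Gaussian sample: sum of 12 PRNG bytes, centered at zero. */
static int32_t standin_gauss(uint32_t *s)
{
    int32_t sum = 0;
    for (int k = 0; k < 3; k++) {
        uint32_t w = xorshift32(s);
        sum += (int32_t)(w & 0xFFu) + (int32_t)((w >> 8) & 0xFFu)
             + (int32_t)((w >> 16) & 0xFFu) + (int32_t)(w >> 24);
    }
    return sum - 1530;                     /* 12 * 127.5 */
}

/* Smallest r with r * r >= v, for v <= 2^62. */
static uint64_t isqrt_ceil(uint64_t v)
{
    assert(v <= (1ull << 62));
    uint64_t lo = 0, hi = 1ull << 31;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (mid * mid >= v) hi = mid; else lo = mid + 1;
    }
    return lo;
}

/* Consumer that receives one scaled coordinate at a time. */
typedef void (*coord_sink)(int32_t coord, void *ctx);

/* Example consumer: tracks how many coordinates arrived and their norm. */
struct norm_acc { uint64_t sumsq; size_t count; };

static void norm_acc_sink(int32_t coord, void *ctx)
{
    struct norm_acc *a = (struct norm_acc *)ctx;
    a->sumsq += (uint64_t)((int64_t)coord * coord);
    a->count++;
}

/* Returns 0 on success, -1 if every sample was zero. */
static int sample_scaled_two_pass(uint32_t seed, size_t n, int64_t radius,
                                  coord_sink sink, void *ctx)
{
    assert(seed != 0);                     /* xorshift needs a nonzero state */

    /* Pass 1: stream the samples, keep only the running squared norm. */
    uint32_t s = seed;
    uint64_t sumsq = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t y = standin_gauss(&s);
        sumsq += (uint64_t)(y * y);
    }
    if (sumsq == 0) return -1;
    int64_t norm = (int64_t)isqrt_ceil(sumsq);

    /* Pass 2: replay the same seed and emit each coordinate scaled to radius. */
    s = seed;
    for (size_t i = 0; i < n; i++) {
        int64_t y = standin_gauss(&s);
        sink((int32_t)(y * radius / norm), ctx);
    }
    return 0;
}
```

The cost is that every sample is generated twice; the benefit is constant working memory regardless of `n`. Rounding the norm up and truncating each quotient keeps the emitted vector inside the requested radius in this sketch.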
Load-bearing premise
The new streaming and early-rejection techniques preserve the exact statistical distribution and security properties of the original HAETAE scheme without introducing side-channel or correctness issues.
What would settle it
Run the low-stack implementation on a target microcontroller, generate many signatures, and verify that they match the reference implementation's acceptance probability, pass all statistical tests on the hyperball sampler output, and produce identical verification results.
Original abstract
We present a low-stack implementation of the module-lattice signature scheme HAETAE, targeting microcontrollers with 8 kB-16 kB of available SRAM. On such devices, peak stack usage is often the binding constraint, and HAETAE's hyperball-based sampler, large transient polynomial vectors, and variable-length signature payloads (hint and high-bits arrays) pose a particular challenge. To address this, we introduce (i) rejection-aware pass decomposition, which isolates encoding to the post-acceptance path; (ii) component-level early rejection, which short-circuits the response computation when a partial norm already exceeds the bound; and (iii) reverse-order streaming entropy coding using range Asymmetric Numeral Systems (rANS), which eliminates full hint and high-bits staging buffers. Combined with streamed matrix generation, a two-pass hyperball sampler with streaming Gaussian backend, and row-streamed verification, these techniques bring signing stack from 71 kB-141 kB in the reference implementation down to 5.8 kB-6.0 kB, key generation to 4.7 kB-5.7 kB, and verification to 4.7 kB-4.8 kB across all three security levels. Our pure C implementation covers all three security levels (HAETAE-2/3/5), whose optimization paths differ due to the public-key domain (d>0 vs. d=0) and rejection structure. We implement our optimization on a Nucleo-L4R5ZI and compare to the reference pqm4 (for HAETAE-2 and -3) and a recently published memory-optimized implementation (targeting HAETAE-5 only). We reduce the stack of HAETAE-2, -3, and -5 by 75, 86, and 8 % respectively for key generation, 92, 95, and 24 % for signature generation, and 85, 91, and 22 % for verification. Depending on the parameter set, this impacts performance by at most a factor 1.8 and 3.4 for key and signature generation respectively, while even offering a performance improvement up to 18 % for verification.
Verification at all security levels fits within 8 kB of RAM (signature buffer + stack) and is 2.34-3.34x faster than ML-DSA m4fstack at each comparable security level. We additionally validate portability under RIOT-OS on ARM Cortex-M4 and RISC-V targets.
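Streamed matrix generation and row-streamed processing both rest on the same idea: derive each matrix entry from the seed at the moment it is needed and keep only one accumulator per row. The sketch below shows that idea with scalar entries instead of polynomials. The integer mixer `expand_entry` is a stand-in for a real seed expander, the modulus `Q` is a placeholder for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q 64513u   /* placeholder modulus, used here only for illustration */

/* Stand-in for seed expansion: derives entry (row, col) from the seed.
 * A real implementation would use an extendable-output hash instead. */
static uint32_t expand_entry(uint32_t seed, uint32_t row, uint32_t col)
{
    uint32_t x = seed ^ (row * 0x9E3779B9u) ^ (col * 0x85EBCA6Bu);
    x ^= x >> 16;
    x *= 0x7FEB352Du;
    x ^= x >> 15;
    x *= 0x846CA68Bu;
    x ^= x >> 16;
    return x % Q;
}

/* Computes out = A * v mod Q without ever storing A.
 * Each entry is regenerated on the fly and folded into a per-row accumulator,
 * so the working set is one accumulator plus the input and output vectors. */
static void matvec_streamed(uint32_t seed, const uint32_t *v,
                            size_t rows, size_t cols, uint32_t *out)
{
    for (size_t r = 0; r < rows; r++) {
        uint64_t acc = 0;
        for (size_t c = 0; c < cols; c++) {
            assert(v[c] < Q);
            uint64_t a = expand_entry(seed, (uint32_t)r, (uint32_t)c);
            acc = (acc + a * v[c]) % Q;
        }
        out[r] = (uint32_t)acc;
    }
}
```

Because the entries are a deterministic function of the seed and the indices, the streamed product equals the product computed from a fully materialized matrix. The price is re-running the expander whenever an entry is needed again.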
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a low-stack implementation of the HAETAE module-lattice signature scheme targeting microcontrollers with 8-16 kB SRAM. It introduces rejection-aware pass decomposition, component-level early rejection on partial norms, reverse-order rANS streaming for hint/high-bits, streamed matrix generation, a two-pass hyperball sampler with streaming Gaussian backend, and row-streamed verification. These yield signing stack of 5.8-6.0 kB, key generation 4.7-5.7 kB, and verification 4.7-4.8 kB across HAETAE-2/3/5, with performance overhead at most 3.4x for signing and up to 18% improvement for verification, validated via concrete stack/cycle counts on Nucleo-L4R5ZI hardware against pqm4 reference and prior work, plus RIOT-OS portability.
Significance. If the optimizations preserve HAETAE's distribution and security, the work is significant for practical post-quantum signatures on severely memory-constrained embedded devices. It supplies reproducible hardware measurements, direct reference comparisons, and shows verification fits in 8 kB RAM while outperforming ML-DSA m4fstack equivalents at comparable levels. The explicit reporting of stack reductions (75-95% for most operations) and architecture portability strengthens its engineering value.
major comments (1)
- The description of the two-pass hyperball sampler and component-level early rejection provides no formal argument, equivalence proof, or statistical test (e.g., acceptance-rate matching or distribution comparison to reference) showing that short-circuiting on partial norm and the streamed Gaussian backend produce exactly the same output distribution and acceptance probabilities as the original HAETAE. This is load-bearing for the central claim, as any bias would mean the reported implementation does not realize the claimed scheme.
minor comments (3)
- The abstract states 'pure C implementation' without clarifying whether compiler intrinsics or target-specific code are used in the reported timings; this should be explicit for fair comparison.
- Stack and cycle measurement methodology (e.g., exact tool, compiler flags, or runtime monitoring) is not detailed in the results, which would aid reproducibility.
- A few minor phrasing issues in the abstract (e.g., 'impacts performance by at most a factor 1.8 and 3.4') could be clarified for precision.
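For context on the methodology comment above: the paper's measurement method is not stated here, but one common way to obtain peak stack figures on microcontrollers is stack painting. The region is filled with a known byte before the call, and afterwards the number of overwritten bytes gives the high-water mark. The sketch below simulates this on an ordinary byte array; the canary value, the names, and the assumption that the stack grows downward from the end of the region are all illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY 0xA5u   /* arbitrary fill byte, chosen for illustration */

/* Fill the whole region with the canary before running the code under test. */
static void paint_region(uint8_t *region, size_t len)
{
    assert(region != NULL);
    memset(region, CANARY, len);
}

/* Assumes a stack that grows downward from region + len.
 * Scans from the lowest address and counts bytes still holding the canary;
 * everything above the first overwritten byte is reported as used. */
static size_t high_water_mark(const uint8_t *region, size_t len)
{
    size_t untouched = 0;
    while (untouched < len && region[untouched] == CANARY) {
        untouched++;
    }
    return len - untouched;
}
```

The estimate is a lower bound: if the deepest bytes written by the code happen to equal the canary, they are counted as untouched.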
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need to explicitly confirm that our memory optimizations preserve HAETAE's output distribution. We address the concern below.
-
Referee: The description of the two-pass hyperball sampler and component-level early rejection provides no formal argument, equivalence proof, or statistical test (e.g., acceptance-rate matching or distribution comparison to reference) showing that short-circuiting on partial norm and the streamed Gaussian backend produce exactly the same output distribution and acceptance probabilities as the original HAETAE. This is load-bearing for the central claim, as any bias would mean the reported implementation does not realize the claimed scheme.
Authors: We agree that an explicit argument would strengthen the presentation. Component-level early rejection computes the squared Euclidean norm incrementally. Because the norm is a sum of non-negative squares, any partial sum that already exceeds the rejection bound implies that the full norm will exceed it; early rejection is therefore equivalent in both acceptance probability and conditional distribution of accepted signatures. The two-pass hyperball sampler generates the required Gaussian samples in the first pass (streamed, without materializing the full vector) and performs the norm check and rejection decision in the second pass; this reorders computation but produces identical samples and identical rejection outcomes. We will add a concise equivalence argument to Section 3.3 and include side-by-side acceptance-rate measurements against the reference implementation in the revised manuscript.
Revision: yes
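The monotonicity argument in the response can be checked with a small C sketch. It assumes a vector that is produced one fixed-size component at a time through a callback; `COMP_LEN`, `component_fn`, `array_component`, and both `accept_*` functions are hypothetical names, not the paper's API. The early variant stops as soon as the running sum of squares exceeds the bound and reports how many components it had to compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMP_LEN 4u   /* coefficients per component, illustrative only */

/* Produces component j of the response into a small scratch buffer. */
typedef void (*component_fn)(size_t j, int32_t out[COMP_LEN], const void *ctx);

/* Example producer: copies component j out of a flat array passed as ctx. */
static void array_component(size_t j, int32_t out[COMP_LEN], const void *ctx)
{
    const int32_t *z = (const int32_t *)ctx;
    for (size_t i = 0; i < COMP_LEN; i++) {
        out[i] = z[j * COMP_LEN + i];
    }
}

/* Adds the squared norm of one component to the running sum. */
static uint64_t add_squares(uint64_t sum, const int32_t buf[COMP_LEN])
{
    for (size_t i = 0; i < COMP_LEN; i++) {
        sum += (uint64_t)((int64_t)buf[i] * buf[i]);
    }
    return sum;
}

/* Baseline: compute all k components, then compare the full squared norm. */
static int accept_full(component_fn f, const void *ctx, size_t k,
                       uint64_t bound_sq)
{
    int32_t buf[COMP_LEN];
    uint64_t sum = 0;
    for (size_t j = 0; j < k; j++) {
        f(j, buf, ctx);
        sum = add_squares(sum, buf);
    }
    return sum <= bound_sq;
}

/* Early rejection: the partial sum never decreases, so once it exceeds the
 * bound the final decision is already known and later components are skipped. */
static int accept_early(component_fn f, const void *ctx, size_t k,
                        uint64_t bound_sq, size_t *computed)
{
    assert(computed != NULL);
    int32_t buf[COMP_LEN];
    uint64_t sum = 0;
    *computed = 0;
    for (size_t j = 0; j < k; j++) {
        f(j, buf, ctx);
        (*computed)++;
        sum = add_squares(sum, buf);
        if (sum > bound_sq) return 0;
    }
    return 1;
}
```

Both functions return the same decision for every input; they differ only in how much work is done on the rejecting path. The sketch says nothing about timing side channels, which a real implementation would need to assess separately.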
Circularity Check
No circularity: empirical implementation results against external references
The paper is an engineering report on stack-usage reductions for HAETAE via streaming, early-rejection, and rANS techniques. It contains no mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. All performance numbers are direct measurements on Nucleo hardware compared to the external pqm4 reference and a prior memory-optimized implementation; the central claims are therefore falsifiable outside the paper's own code and do not reduce to their inputs by construction.