
arxiv: 2604.15868 · v1 · submitted 2026-04-17 · 💻 cs.CR

Low-Stack HAETAE for Memory-Constrained Microcontrollers

Pith reviewed 2026-05-10 08:04 UTC · model grok-4.3

classification 💻 cs.CR
keywords HAETAE · lattice-based signatures · microcontrollers · stack optimization · post-quantum cryptography · memory-constrained devices · embedded implementation

The pith

HAETAE lattice signatures are reduced to 5.8-6.0 kB of signing stack on microcontrollers with 8-16 kB of SRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates practical memory optimizations for the HAETAE module-lattice signature scheme so that signing, key generation, and verification fit inside the tight RAM budgets of small microcontrollers. The reference implementation requires 71-141 kB of stack for signing because of large polynomial vectors, buffers for the hyperball sampler, and variable-length signature components. The authors introduce rejection-aware pass decomposition, component-level early rejection, and reverse-order streaming rANS entropy coding, together with streamed matrix generation and a two-pass sampler. These changes drop stack usage to 5.8-6.0 kB for signing, 4.7-5.7 kB for key generation, and 4.7-4.8 kB for verification across all three security levels while preserving correctness.
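To make the streaming idea concrete, here is a minimal C sketch of a matrix-vector product in which each matrix entry is re-derived from a seed at the moment it is needed, so the matrix never occupies RAM. Everything named here is an assumption for illustration: `toy_entry` is a hypothetical stand-in for the scheme's seeded expansion, scalars stand in for polynomials, and `TOY_Q` is only an illustrative modulus. It is not the paper's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_Q 64513u   /* illustrative modulus for this sketch */
#define TOY_MAX 8      /* largest dimension the materialized variant supports */

/* Hypothetical stand-in for seeded matrix expansion: mixes (seed, row, col)
 * into a pseudo-random value in [0, TOY_Q). Not a cryptographic XOF. */
static uint32_t toy_entry(uint64_t seed, uint32_t row, uint32_t col) {
    uint64_t z = seed ^ ((((uint64_t)row << 32) | col) * 0x9E3779B97F4A7C15ULL);
    z ^= z >> 30; z *= 0xBF58476D1CE4E5B9ULL;
    z ^= z >> 27; z *= 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return (uint32_t)(z % TOY_Q);
}

/* Materialized variant: expands the whole matrix onto the stack first. */
void matvec_materialized(uint32_t *y, const uint32_t *x,
                         size_t rows, size_t cols, uint64_t seed) {
    uint32_t a[TOY_MAX][TOY_MAX];
    assert(rows <= TOY_MAX && cols <= TOY_MAX);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            a[r][c] = toy_entry(seed, (uint32_t)r, (uint32_t)c);
    for (size_t r = 0; r < rows; r++) {
        uint64_t acc = 0;
        for (size_t c = 0; c < cols; c++)
            acc = (acc + (uint64_t)a[r][c] * x[c]) % TOY_Q;
        y[r] = (uint32_t)acc;
    }
}

/* Streamed variant: one entry lives in a register at a time, so the
 * working memory is a single accumulator regardless of matrix size. */
void matvec_streamed(uint32_t *y, const uint32_t *x,
                     size_t rows, size_t cols, uint64_t seed) {
    for (size_t r = 0; r < rows; r++) {
        uint64_t acc = 0;
        for (size_t c = 0; c < cols; c++) {
            uint32_t a_rc = toy_entry(seed, (uint32_t)r, (uint32_t)c);
            acc = (acc + (uint64_t)a_rc * x[c]) % TOY_Q;
        }
        y[r] = (uint32_t)acc;
    }
}
```

Both variants compute the same result; the streamed one trades repeated expansion work for the removal of the matrix buffer, which is the general shape of the time-for-memory trade reported above.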

Core claim

Rejection-aware pass decomposition isolates encoding to the post-acceptance path, component-level early rejection short-circuits response computation on partial-norm overflow, and reverse-order streaming rANS coding removes the need for full hint and high-bits staging buffers. Combined with streamed matrix generation, a two-pass hyperball sampler, and row-streamed verification, these techniques reduce signing stack from 71-141 kB to 5.8-6.0 kB, key generation stack to 4.7-5.7 kB, and verification stack to 4.7-4.8 kB on a Nucleo-L4R5ZI for HAETAE-2/3/5. The cost is at most a 3.4x slowdown in signing, while verification speeds up by as much as 18 percent.
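The two-pass sampler can be pictured with the following C sketch. It assumes the general pattern in which a vector is drawn, its norm is computed, and the vector is then rescaled to a target radius; the sketch runs the seeded sample stream twice instead of storing the vector. The generator `toy_prng`, the functions `two_pass_init` and `two_pass_next`, and the integer scaling are all hypothetical, and the toy samples are small uniform integers rather than Gaussian samples. The paper's actual sampler may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a seeded sample stream (xorshift64).
 * It yields small centered integers, NOT discrete Gaussian samples. */
typedef struct { uint64_t s; } toy_prng;

static void toy_seed(toy_prng *g, uint64_t seed) { g->s = seed ? seed : 1; }

static int32_t toy_sample(toy_prng *g) {
    g->s ^= g->s << 13;
    g->s ^= g->s >> 7;
    g->s ^= g->s << 17;
    return (int32_t)(g->s & 0xff) - 128;   /* value in [-128, 127] */
}

/* Integer square root by linear search; adequate for the small norms here. */
static uint64_t isqrt64(uint64_t v) {
    uint64_t r = 0;
    while ((r + 1) * (r + 1) <= v) r++;
    return r;
}

/* Rescale one sample to the target radius (truncating integer division). */
static int32_t scale(int32_t x, int32_t radius, uint64_t norm) {
    return norm ? (int32_t)(((int64_t)x * radius) / (int64_t)norm) : 0;
}

/* Baseline: materialize all n samples, then normalize them in place. */
void sample_materialized(int32_t *out, size_t n, uint64_t seed, int32_t radius) {
    toy_prng g;
    uint64_t norm_sq = 0;
    assert(n > 0);
    toy_seed(&g, seed);
    for (size_t i = 0; i < n; i++) {
        out[i] = toy_sample(&g);
        norm_sq += (uint64_t)((int64_t)out[i] * out[i]);
    }
    uint64_t norm = isqrt64(norm_sq);
    for (size_t i = 0; i < n; i++) out[i] = scale(out[i], radius, norm);
}

/* Two-pass variant: constant-size state, no sample vector in memory. */
typedef struct { toy_prng g; uint64_t norm; int32_t radius; } two_pass;

/* Pass 1: stream all n samples and keep only the norm, then rewind. */
void two_pass_init(two_pass *t, size_t n, uint64_t seed, int32_t radius) {
    uint64_t norm_sq = 0;
    assert(n > 0);
    toy_seed(&t->g, seed);
    for (size_t i = 0; i < n; i++) {
        int64_t v = toy_sample(&t->g);
        norm_sq += (uint64_t)(v * v);
    }
    t->norm = isqrt64(norm_sq);
    t->radius = radius;
    toy_seed(&t->g, seed);   /* restart the identical stream for pass 2 */
}

/* Pass 2: regenerate the next sample and emit it already rescaled. */
int32_t two_pass_next(two_pass *t) {
    return scale(toy_sample(&t->g), t->radius, t->norm);
}
```

Because the stream is deterministic in the seed, pass 2 reproduces exactly the samples that pass 1 measured, so the streamed output matches the materialized baseline coefficient by coefficient. The price is that every sample is generated twice.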

What carries the argument

Rejection-aware pass decomposition that isolates encoding to the post-acceptance path, together with component-level early rejection and reverse-order streaming rANS entropy coding.
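Why reverse order matters for rANS can be seen in a small byte-wise coder: the encoder behaves like a stack, so symbols must be pushed last-to-first for the decoder to read them first-to-last. The C sketch below feeds the encoder directly from a producer that is called with descending indices, so no symbol array is staged. The three-symbol frequency table, `symbol_at`, and all function names are hypothetical and are not HAETAE's real tables or the paper's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALE_BITS 4u            /* frequencies sum to 1 << SCALE_BITS = 16 */
#define RANS_L (1u << 23)        /* lower bound of the normalized state */
#define NSYM 3

/* Hypothetical model, not HAETAE's tables: cumulative starts and counts. */
static const uint32_t FREQ[NSYM]  = {10, 4, 2};
static const uint32_t START[NSYM] = {0, 10, 14};

/* Encoder writes bytes backward from the end of the buffer toward `lo`. */
typedef struct { uint32_t x; uint8_t *ptr; uint8_t *lo; } rans_enc;

static void enc_put(rans_enc *e, unsigned s) {
    uint32_t x_max = ((RANS_L >> SCALE_BITS) << 8) * FREQ[s];
    while (e->x >= x_max) {                 /* renormalize: spill low bytes */
        assert(e->ptr > e->lo);
        *--e->ptr = (uint8_t)(e->x & 0xff);
        e->x >>= 8;
    }
    e->x = ((e->x / FREQ[s]) << SCALE_BITS) + (e->x % FREQ[s]) + START[s];
}

static uint8_t *enc_flush(rans_enc *e) {    /* store the final 32-bit state */
    assert(e->ptr - e->lo >= 4);
    e->ptr -= 4;
    e->ptr[0] = (uint8_t)(e->x);
    e->ptr[1] = (uint8_t)(e->x >> 8);
    e->ptr[2] = (uint8_t)(e->x >> 16);
    e->ptr[3] = (uint8_t)(e->x >> 24);
    return e->ptr;
}

/* Hypothetical producer: recomputes symbol i on demand, standing in for a
 * hint or high-bits coefficient that is derived rather than stored. */
unsigned symbol_at(size_t i) { return (unsigned)(((i * 7u + 3u) % 5u) % NSYM); }

/* Pull symbols n-1 .. 0 straight from the producer into the encoder.
 * Returns the encoded length; *out points at the first encoded byte. */
size_t encode_reverse_stream(uint8_t *buf, size_t cap, size_t n, uint8_t **out) {
    rans_enc e = { RANS_L, buf + cap, buf };
    for (size_t i = n; i-- > 0; ) enc_put(&e, symbol_at(i));
    *out = enc_flush(&e);
    return (size_t)(buf + cap - *out);
}

typedef struct { uint32_t x; const uint8_t *ptr; } rans_dec;

void dec_init(rans_dec *d, const uint8_t *p) {
    d->x = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    d->ptr = p + 4;
}

/* Decoder yields symbols in forward order 0 .. n-1. */
unsigned dec_get(rans_dec *d) {
    uint32_t slot = d->x & ((1u << SCALE_BITS) - 1u);
    unsigned s = 0;
    while (slot >= START[s] + FREQ[s]) s++;
    d->x = FREQ[s] * (d->x >> SCALE_BITS) + slot - START[s];
    while (d->x < RANS_L) d->x = (d->x << 8) | *d->ptr++;
    return s;
}
```

The only encoder state is one 32-bit word plus a write pointer, which is why a producer that can regenerate symbols in descending order removes the need for a staging array. The sketch leaves open how the output bytes are placed in the final signature buffer.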

If this is right

  • Verification at every security level fits inside 8 kB total RAM including the signature buffer and runs 2.34-3.34x faster than ML-DSA m4fstack at comparable levels.
  • Performance cost is bounded by a factor of 1.8 for key generation and 3.4 for signature generation while verification can improve by up to 18 percent.
  • The pure-C code covers HAETAE-2, -3, and -5 and works under RIOT-OS on both ARM Cortex-M4 and RISC-V. For key generation and signing, it reduces stack by 75-95 percent versus the pqm4 reference for HAETAE-2 and -3, and by 8 and 24 percent versus the prior memory-optimized implementation for HAETAE-5.
  • All three parameter sets become feasible on devices whose SRAM budget was previously too small for HAETAE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same streaming-plus-early-rejection pattern could be applied to other module-lattice schemes that currently exceed microcontroller RAM limits.
  • Verification fitting in 8 kB opens the door to using HAETAE directly in very small sensor nodes that previously had to fall back to classical signatures.
  • Further reductions below 4 kB might be possible by trading additional computation for smaller intermediate buffers in the two-pass sampler.

Load-bearing premise

The new streaming and early-rejection techniques preserve the exact statistical distribution and security properties of the original HAETAE scheme without introducing side-channel or correctness issues.

What would settle it

Run the low-stack implementation on a target microcontroller, generate many signatures, and verify that they match the reference implementation's acceptance probability, pass all statistical tests on the hyperball sampler output, and produce identical verification results.

Figures

Figures reproduced from arXiv: 2604.15868 by Gustavo Banegas (LIX, GRACE), Youngbeom Kim, Seog Chung Seo, Christine van Vredendaal.

Figure 1: Overview of RIOT-OS modularization packages.
Figure 2: Memory allocation of the proposed memory-optimized HAETAE-2,3 key gen…
Figure 3: Memory allocation of the proposed pass-decomposed HAETAE signing. The…
Figure 4: Memory allocation of the row-streamed HAETAE verification. Four memory…
Figure 5: Implementation specification of HAETAE key generation […
Figure 6: Memory allocation of HAETAE-5 Key generation (…
Original abstract

We present a low-stack implementation of the module-lattice signature scheme HAETAE, targeting microcontrollers with 8 kB-16 kB of available SRAM. On such devices, peak stack usage is often the binding constraint, and HAETAE's hyperball-based sampler, large transient polynomial vectors, and variable-length signature payloads (hint and high-bits arrays) pose a particular challenge. To address this, we introduce (i) rejection-aware pass decomposition, which isolates encoding to the post-acceptance path; (ii) component-level early rejection, which short-circuits the response computation when a partial norm already exceeds the bound; and (iii) reverse-order streaming entropy coding using range Asymmetric Numeral Systems (rANS), which eliminates full hint and high-bits staging buffers. Combined with streamed matrix generation, a two-pass hyperball sampler with streaming Gaussian backend, and row-streamed verification, these techniques bring signing stack from 71 kB-141 kB in the reference implementation down to 5.8 kB-6.0 kB, key generation to 4.7 kB-5.7 kB, and verification to 4.7 kB-4.8 kB across all three security levels. Our pure C implementation covers all three security levels (HAETAE-2/3/5), whose optimization paths differ due to the public-key domain (d>0 vs. d=0) and rejection structure. We implement our optimization on a Nucleo-L4R5ZI and compare to the reference pqm4 (for HAETAE-2 and -3) and a recently published memory-optimized implementation (targeting HAETAE-5 only). We reduce HAETAE-2, -3, and -5 stack by 75%, 86%, and 8%, respectively, for key generation; 92%, 95%, and 24% for signature generation; and 85%, 91%, and 22% for verification. Depending on the parameter set, this impacts performance by at most a factor 1.8 and 3.4 for key and signature generation respectively, while even offering a performance improvement up to 18% for verification. Verification at all security levels fits within 8 kB of RAM (signature buffer + stack) and is 2.34-3.34x faster than ML-DSA m4fstack at each comparable security level. We additionally validate portability under RIOT-OS on ARM Cortex-M4 and RISC-V targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper presents a low-stack implementation of the HAETAE module-lattice signature scheme targeting microcontrollers with 8-16 kB SRAM. It introduces rejection-aware pass decomposition, component-level early rejection on partial norms, reverse-order rANS streaming for hint/high-bits, streamed matrix generation, a two-pass hyperball sampler with streaming Gaussian backend, and row-streamed verification. These yield signing stack of 5.8-6.0 kB, key generation 4.7-5.7 kB, and verification 4.7-4.8 kB across HAETAE-2/3/5, with performance overhead at most 3.4x for signing and up to 18% improvement for verification, validated via concrete stack/cycle counts on Nucleo-L4R5ZI hardware against pqm4 reference and prior work, plus RIOT-OS portability.

Significance. If the optimizations preserve HAETAE's distribution and security, the work is significant for practical post-quantum signatures on severely memory-constrained embedded devices. It supplies reproducible hardware measurements, direct reference comparisons, and shows verification fits in 8 kB RAM while outperforming ML-DSA m4fstack equivalents at comparable levels. The explicit reporting of stack reductions (75-95% for most operations) and architecture portability strengthens its engineering value.

major comments (1)
  1. The description of the two-pass hyperball sampler and component-level early rejection provides no formal argument, equivalence proof, or statistical test (e.g., acceptance-rate matching or distribution comparison to reference) showing that short-circuiting on partial norm and the streamed Gaussian backend produce exactly the same output distribution and acceptance probabilities as the original HAETAE. This is load-bearing for the central claim, as any bias would mean the reported implementation does not realize the claimed scheme.
minor comments (3)
  1. The abstract states 'pure C implementation' without clarifying whether compiler intrinsics or target-specific code are used in the reported timings; this should be explicit for fair comparison.
  2. Stack and cycle measurement methodology (e.g., exact tool, compiler flags, or runtime monitoring) is not detailed in the results, which would aid reproducibility.
  3. A few minor phrasing issues in the abstract (e.g., 'impacts performance by at most a factor 1.8 and 3.4') could be clarified for precision.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need to explicitly confirm that our memory optimizations preserve HAETAE's output distribution. We address the concern below.

  1. Referee: The description of the two-pass hyperball sampler and component-level early rejection provides no formal argument, equivalence proof, or statistical test (e.g., acceptance-rate matching or distribution comparison to reference) showing that short-circuiting on partial norm and the streamed Gaussian backend produce exactly the same output distribution and acceptance probabilities as the original HAETAE. This is load-bearing for the central claim, as any bias would mean the reported implementation does not realize the claimed scheme.

    Authors: We agree that an explicit argument would strengthen the presentation. Component-level early rejection computes the squared Euclidean norm incrementally. Because the norm is a sum of non-negative squares, any partial sum that already exceeds the rejection bound implies that the full norm will exceed it; early rejection is therefore equivalent in both acceptance probability and conditional distribution of accepted signatures. The two-pass hyperball sampler generates the required Gaussian samples in the first pass (streamed, without materializing the full vector) and performs the norm check and rejection decision in the second pass; this reorders computation but produces identical samples and identical rejection outcomes. We will add a concise equivalence argument to Section 3.3 and include side-by-side acceptance-rate measurements against the reference implementation in the revised manuscript. revision: yes
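The monotonicity argument in this response can be checked with a short C sketch that compares a full-norm check against a component-wise early exit. The function names, the flat coefficient array, and the fixed component length are assumptions for illustration; in a real low-stack signer each component would presumably be computed on the fly rather than read from an array. Coefficients are assumed small enough that the 64-bit sum cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reference check: accumulate the full squared norm, then compare once. */
bool norm_ok_full(const int32_t *z, size_t n_comp, size_t comp_len,
                  uint64_t bound_sq) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n_comp * comp_len; i++)
        acc += (uint64_t)((int64_t)z[i] * z[i]);
    return acc <= bound_sq;
}

/* Early-rejection check: test the running sum after every component and
 * stop as soon as it exceeds the bound. Since every term is non-negative,
 * the running sum never decreases, so the decision equals the full check.
 * *comps_done reports how many components were actually processed. */
bool norm_ok_early(const int32_t *z, size_t n_comp, size_t comp_len,
                   uint64_t bound_sq, size_t *comps_done) {
    uint64_t acc = 0;
    assert(comp_len > 0 && comps_done != NULL);
    for (size_t k = 0; k < n_comp; k++) {
        const int32_t *comp = z + k * comp_len;
        for (size_t j = 0; j < comp_len; j++)
            acc += (uint64_t)((int64_t)comp[j] * comp[j]);
        if (acc > bound_sq) {          /* partial norm already too large */
            *comps_done = k + 1;
            return false;
        }
    }
    *comps_done = n_comp;
    return true;
}
```

The sketch only demonstrates that the accept or reject decision is unchanged. It says nothing about timing behavior: the early exit makes the running time of a rejected attempt depend on where the bound was crossed, and whether that matters is a separate question from the distributional one raised by the referee.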

Circularity Check

0 steps flagged

No circularity: empirical implementation results against external references


The paper is an engineering report on stack-usage reductions for HAETAE via streaming, early-rejection, and rANS techniques. It contains no mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. All performance numbers are direct measurements on Nucleo hardware compared to the external pqm4 reference and a prior memory-optimized implementation; the central claims are therefore falsifiable outside the paper's own code and do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied implementation paper. It assumes the security and correctness of the original HAETAE scheme from prior work and standard properties of C and microcontroller hardware. No new free parameters, axioms, or invented entities are introduced.


