
arxiv: 2604.23205 · v1 · submitted 2026-04-25 · 💻 cs.CR · cs.AR · cs.LG

Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

Pith reviewed 2026-05-08 07:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.AR · cs.LG
keywords weight streaming · UMA edge accelerators · cache-line decryption · AES-256-CTR · DNN memory protection · bandwidth overhead · physical adversaries · secure DMA

The pith

Tessera decrypts DNN weights at cache-line granularity during DRAM fetches on UMA edge accelerators, hiding AES-256-CTR latency to reach a projected 98.4 percent of the theoretical peak memory bandwidth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reference architecture that intercepts 64-byte memory bursts on shared DRAM systems and computes the required decryption keystream in parallel with the fetch operation. This streams decrypted weights straight into isolated NPU SRAM without reserving permanent memory regions or forcing page-sized transfers. Measurements on three SoC platforms show the cryptographic work stays hidden behind ordinary DRAM timing even in worst-case scenarios. As a result, proprietary models avoid the severe bandwidth penalties that page-granularity encryption imposes on sub-page tensor tiles. The design also closes specific attack paths such as physical DRAM readout and rogue DMA transfers while keeping the active memory footprint to the current tile only.

Core claim

By computing AES-256-CTR keystreams concurrently with 64-byte AXI DRAM bursts, Tessera delivers plaintext model weights directly into NPU SRAM on UMA platforms. This parallelization conceals cryptographic latency behind standard memory access times across three measured SoCs, yielding a projected 98.4 percent of theoretical bandwidth utilization and eliminating the need for permanent secure-memory carve-outs. The approach maintains a 1x footprint for arbitrary layer geometries while blocking UMA-specific leakage vectors including physical extraction and compute hijacking.

What carries the argument

A parallel AES-256-CTR keystream generator attached to intercepted 64-byte AXI bursts. It produces decryption material concurrently with DRAM reads, so plaintext arrives directly in isolated NPU SRAM.
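The reason counter mode suits this datapath is that the keystream for a 64-byte line depends only on the key, a nonce, and the line's position, never on the ciphertext itself, so it can be prepared while the burst is still in flight. The following is a minimal sketch of that property, not the paper's implementation. To stay within the standard library it uses truncated SHA-256 as a stand-in pseudorandom function where the real design uses AES-256; the function names, the 8-byte nonce, and the rule that derives counters from the line index are all assumptions made for illustration.

```python
import hashlib

LINE_BYTES = 64    # one 64-byte AXI burst, as described in the paper
BLOCK_BYTES = 16   # AES block size; four blocks cover one line
BLOCKS_PER_LINE = LINE_BYTES // BLOCK_BYTES


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """One 16-byte keystream block.

    STAND-IN: truncated SHA-256 replaces AES-256 here so the sketch runs
    with the standard library only. It is not a substitute for AES.
    """
    material = key + nonce + counter.to_bytes(8, "big")
    return hashlib.sha256(material).digest()[:BLOCK_BYTES]


def line_keystream(key: bytes, nonce: bytes, line_index: int) -> bytes:
    """64 bytes of keystream for one line.

    ASSUMPTION: counters are derived from the line index, so the keystream
    needs no data from DRAM and can be computed during the fetch.
    """
    first = line_index * BLOCKS_PER_LINE
    return b"".join(
        keystream_block(key, nonce, first + i) for i in range(BLOCKS_PER_LINE)
    )


def crypt_line(key: bytes, nonce: bytes, line_index: int, line: bytes) -> bytes:
    """XOR one 64-byte line with its keystream (same call encrypts or decrypts)."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected exactly one 64-byte line")
    stream = line_keystream(key, nonce, line_index)
    return bytes(a ^ b for a, b in zip(line, stream))
```

Because each line has its own counter range, any single line can be decrypted without touching its neighbours, which is what lets a sub-page tile be fetched without pulling in a whole page.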

If this is right

  • Page-level memory encryption incurs a bandwidth penalty of up to 32x for sub-page tensor tiles, whereas Tessera sustains a 1x footprint for all layer shapes.
  • Memory usage remains transient and limited to the active tile, removing any requirement for permanently reserved secure DRAM regions.
  • Major UMA attack surfaces, including physical DRAM extraction, rogue DMA, and compute hijacking, are neutralized, and the paper claims a formal guarantee against plaintext leakage across sparse tensors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parallel-fetch pattern could extend to other shared-memory accelerators that must protect bulk data without custom secure enclaves.
  • If the timing assumptions hold across additional process nodes, the technique might support broader consumer deployment of proprietary models on commodity edge silicon.
  • Adapting the keystream unit to alternate modes or ciphers would allow security policies to be tuned without changing the bandwidth profile.

Load-bearing premise

AES-256-CTR keystream computation can be parallelized to fully hide its latency behind DRAM fetch times on UMA systems even under worst-case timing variations.

What would settle it

A timing measurement on any of the three tested SoC platforms in which AES-256-CTR latency for a 64-byte burst exceeds the DRAM fetch duration under worst-case variation, dropping effective bandwidth below 98 percent.
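The settling condition reduces to an inequality per burst: keystream time must not exceed the DRAM fetch window. The sketch below expresses only that check. The class name, field names, and the cycle counts in the usage note are hypothetical; none of them are taken from the paper, and the single-cycle XOR stage is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstTiming:
    """Per-burst timing in cycles. All values are hypothetical inputs."""
    dram_fetch_cycles: int   # request to last beat of the 64-byte burst
    keystream_cycles: int    # time to produce 64 bytes of keystream
    xor_cycles: int = 1      # ASSUMPTION: one-cycle XOR once data arrives


def latency_hidden(t: BurstTiming) -> bool:
    """True when keystream generation finishes within the fetch window."""
    return t.keystream_cycles <= t.dram_fetch_cycles


def exposed_cycles(t: BurstTiming) -> int:
    """Cycles by which the keystream overruns the fetch (0 if fully hidden)."""
    return max(0, t.keystream_cycles - t.dram_fetch_cycles)


def burst_latency(t: BurstTiming) -> int:
    """Simple model: fetch and keystream overlap, then the XOR stage runs."""
    return max(t.dram_fetch_cycles, t.keystream_cycles) + t.xor_cycles
```

With made-up numbers, `BurstTiming(dram_fetch_cycles=40, keystream_cycles=14)` counts as hidden, while `BurstTiming(dram_fetch_cycles=40, keystream_cycles=55)` exposes 15 cycles. A measurement of the second kind on one of the tested platforms is what would undercut the claim.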

Figures

Figures reproduced from arXiv: 2604.23205 by Animan Naskar.

Figure 1. Summary of the Tessera datapath and trust boundary.
Figure 2. Page-level ICE bandwidth amplification A(t) = ⌈4096/t⌉ by layer type (tile sizes t from TensorRT/Gemmini schedules [18,19]): Batch Norm 128 B; DW Conv 3×3 288 B; PW Conv narrow 512 B; Conv 3×3 mid 1024 B; PW Conv wide 2048 B; Conv/Attn/FC ≥4096 B (A = 1). Tessera holds A ≈ 1× throughout (dashed line).
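The amplification formula in the Figure 2 caption can be evaluated directly for the listed tile sizes. This is an illustrative sketch only: the function name `page_amplification` and the `TILE_BYTES` table are naming choices made here, the byte counts are copied from the caption, and the 4096-byte entry stands in for the "≥4096 B" category.

```python
PAGE_BYTES = 4096  # page size used in the caption's formula

# Tile sizes in bytes, copied from the Figure 2 caption.
TILE_BYTES = {
    "Batch Norm": 128,
    "DW Conv 3x3": 288,
    "PW Conv narrow": 512,
    "Conv 3x3 mid": 1024,
    "PW Conv wide": 2048,
    "Conv/Attn/FC": 4096,  # representative of the ">=4096 B" category
}


def page_amplification(tile_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Evaluate A(t) = ceil(page_bytes / t) with integer arithmetic."""
    if tile_bytes <= 0:
        raise ValueError("tile size must be positive")
    return (page_bytes + tile_bytes - 1) // tile_bytes


if __name__ == "__main__":
    for layer, size in TILE_BYTES.items():
        print(f"{layer:>15}: t = {size:4d} B, A = {page_amplification(size)}x")
```

The 128-byte Batch Norm tile gives A = 32, which matches the "up to 32x" penalty quoted for page-level encryption, and any tile of 4096 bytes or more gives A = 1.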
original abstract

Deploying proprietary Deep Neural Networks (DNNs) on commodity edge devices demands hardware-backed Digital Rights Management (DRM) capable of withstanding both software-level and physical adversaries. In Unified Memory Architecture (UMA) systems, the host CPU and Neural Processing Unit (NPU) share physical DRAM, leaving plaintext model weights directly readable by a compromised OS kernel. Existing defenses fail in this constrained setting: trusted execution environments monopolize scarce memory with permanently reserved regions, while full-memory encryption operates at page granularity. This forces the system to fetch massive 4 KB memory pages for sub-page tensor tiles, severely crippling bandwidth. We present Tessera, a reference architecture for inline, cache-line granularity weight decryption on UMA edge accelerators. The design intercepts 64-byte AXI bursts, computing AES-256-CTR keystreams in parallel with DRAM fetches. This streams plaintext directly into isolated NPU SRAM, creating a transient memory footprint confined to the active tile and eliminating the need for permanent memory carve-outs. Measurements across three distinct SoC platforms demonstrate that this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations. Consequently, Tessera is projected to achieve 98.4% of the theoretical memory bandwidth ceiling (a mere 1.6% overhead). Across standard vision and language models, page-level memory encryption suffers up to a 32x bandwidth penalty, whereas Tessera maintains an optimal 1x footprint for all layer geometries. Finally, Tessera neutralizes major UMA-specific attack vectors, including physical DRAM extraction, rogue DMA, and compute hijacking, and formally prevents plaintext leakage across sparse tensors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper presents Tessera, a reference architecture for inline cache-line (64-byte AXI burst) granularity weight decryption on UMA edge accelerators. It uses parallel AES-256-CTR keystream generation to hide cryptographic latency behind standard DRAM fetches, achieving a transient memory footprint limited to the active NPU tile. Measurements on three SoC platforms are claimed to show that this holds even under worst-case timing variations, yielding a projected 98.4% of theoretical memory bandwidth (1.6% overhead). Tessera is further claimed to avoid the up-to-32x bandwidth penalty of page-level encryption for vision and language models while neutralizing UMA-specific attacks and formally preventing plaintext leakage across sparse tensors.

Significance. If the core performance claim holds, Tessera would address a practical gap in hardware-backed DRM for proprietary DNNs on commodity UMA edge devices, where TEEs and full-memory encryption impose unacceptable memory or bandwidth costs. The parallel crypto-DRAM interleaving and transient-tile design are conceptually attractive for bandwidth-constrained NPUs.

major comments (3)
  1. [Abstract] The central claim that 'this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations' is load-bearing for the 98.4% bandwidth projection, yet no cycle counts, number of parallel AES engines, or explicit timing breakdown (crypto time ≤ DRAM fetch window) is supplied for the three platforms or for worst-case scenarios such as bus contention or address-dependent CTR increments.
  2. [Abstract] The 98.4% figure is labeled 'projected' with no description of the projection method, measurement baselines, error bars, or how real-platform data were extrapolated while isolating platform-specific effects from the AES-DRAM hiding assumption.
  3. [Abstract] The statement that Tessera 'formally prevents plaintext leakage across sparse tensors' is presented without a proof sketch, formal model, or security argument, leaving the formal claim unsupported in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive comments on our work. We respond to each major comment below and will incorporate revisions to address the concerns.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations' is load-bearing for the 98.4% bandwidth projection, yet no cycle counts, number of parallel AES engines, or explicit timing breakdown (crypto time ≤ DRAM fetch window) is supplied for the three platforms or for worst-case scenarios such as bus contention or address-dependent CTR increments.

    Authors: The full paper provides these details in the architecture and evaluation sections, including cycle counts from measurements on the three platforms and analysis of worst-case scenarios. To improve the abstract, we will add a short clause summarizing the parallel AES implementation and confirming the timing condition holds based on our measurements. revision: yes

  2. Referee: [Abstract] The 98.4% figure is labeled 'projected' with no description of the projection method, measurement baselines, error bars, or how real-platform data were extrapolated while isolating platform-specific effects from the AES-DRAM hiding assumption.

    Authors: We will revise the abstract to describe the projection method, which is based on the minimum bandwidth efficiency measured across the platforms after isolating the crypto overhead. The revision will state the theoretical-bandwidth baseline and note that the overhead was consistent, with no variation large enough to warrant error bars. revision: yes

  3. Referee: [Abstract] The statement that Tessera 'formally prevents plaintext leakage across sparse tensors' is presented without a proof sketch, formal model, or security argument, leaving the formal claim unsupported in the provided text.

    Authors: The manuscript contains a security analysis arguing that the transient memory footprint and per-cacheline decryption prevent leakage. We will add a brief security argument to the abstract to support the formal claim, or expand the main text if necessary. revision: yes

Circularity Check

0 steps flagged

No circularity: central claims rest on empirical measurements across real SoC platforms.

full rationale

The paper presents its key performance result (98.4% of theoretical bandwidth with 1.6% overhead) as the direct outcome of measurements on three distinct hardware platforms, not as a mathematical derivation, fitted parameter, or self-referential equation. No load-bearing steps reduce by construction to the paper's own inputs; the design choices (AXI-burst interception and parallel AES-256-CTR) are described architecturally and then validated externally via testing rather than justified via self-citation chains, uniqueness theorems, or ansatzes imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard cryptographic primitives and hardware interface assumptions. No free parameters or new entities are introduced in the abstract. Full audit requires the complete manuscript.

axioms (2)
  • [standard math] AES-256-CTR provides confidentiality against the stated adversaries.
    Relies on established properties of the AES-CTR mode for stream encryption.
  • [domain assumption] 64-byte AXI bursts can be intercepted and processed inline without altering the bus protocol.
    Hardware-level assumption about the memory interface in UMA accelerators.

pith-pipeline@v0.9.0 · 5605 in / 1210 out tokens · 37985 ms · 2026-05-08T07:52:09.335282+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  [1] K. Moriarty, B. Kaliski, J. Jonsson, A. Rusch: PKCS #1: RSA Cryptography Specifications Version 2.2. RFC 8017 (Nov. 2016)
  [2] E. Barkan, E. Biham, N. Keller: Instant ciphertext-only cryptanalysis of GSM encrypted communication. In: Proc. CRYPTO, pp. 600–616 (2003)
  [3] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, E. W. Felten: Lest we remember: Cold-boot attacks on encryption keys. Commun. ACM 52(5), 91–98 (2009)
  [4] A. Tatar, R. Krishnan, E. Bos, C. Giuffrida, H. Bos, K. Razavi: Throwhammer: Rowhammer attacks over the network and defenses. In: Proc. USENIX ATC, pp. 213–226 (2018)
  [5] F. Mo, H. Haddadi, K. Katevas, E. Matus, D. Perino, N. Kourtellis: PPFL: Privacy-preserving federated learning with trusted execution environments. In: Proc. ACM MobiSys, pp. 94–108 (2021)
  [6] AMD: AMD64 Architecture Programmer's Manual, Volume 2: System Programming, Secure Memory Encryption. AMD Pub. 24593, Rev. 3.41 (2023)
  [7] Intel Corporation: Intel Total Memory Encryption - Multi-Key (TME-MK) Architecture Specification, Rev. 1.3 (Mar. 2021)
  [8] V. Costan, S. Devadas: Intel SGX Explained. IACR Cryptology ePrint Archive, Report 2016/086 (2016)
  [9] ARM Limited: ARM Security Technology: Building a Secure System Using TrustZone Technology. ARM PRD29-GENC-009492C (Apr. 2009)
  [10] AMD: AMD Platform Security Processor (PSP) Architecture Overview. White Paper (2020)
  [11] NVIDIA Corporation: NVIDIA H100 Tensor Core GPU Architecture: Confidential Computing. White Paper (Apr. 2022)
  [12] F. Tramèr, D. Boneh: Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. In: Proc. ICLR (2019)
  [13] A. T. Markettos, C. Rothwell, B. F. Gutstein, A. Pearce, P. G. Neumann, S. W. Moore, R. N. M. Watson: Thunderclap: Exploring vulnerabilities in operating system IOMMU protection via DMA from untrustworthy peripherals. In: Proc. NDSS (2019)
  [14] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, H. Guan: MGX: Near-zero overhead memory protection for data-intensive accelerators. In: Proc. ISCA, pp. 726–741 (2022)
  [15] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, O. Mutlu: Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In: Proc. ISCA, pp. 361–372 (2014)
  [16] D. J. Bernstein: The Poly1305-AES message-authentication code. In: Proc. FSE, LNCS 3557, pp. 32–49 (2005)
  [17] S. Nikova, C. Rechberger, V. Rijmen: Threshold implementations against side-channel attacks and glitches. In: Proc. ICICS, LNCS 4307, pp. 529–545 (2006)
  [18] NVIDIA Corporation: TensorRT Developer Guide, version 8.6 (2023). https://docs.nvidia.com/deeplearning/tensorrt
  [19] H. Genc et al.: Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In: Proc. DAC (2021)