Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

Animan Naskar

arxiv: 2604.23205 · v1 · submitted 2026-04-25 · 💻 cs.CR · cs.AR · cs.LG

Pith reviewed 2026-05-08 07:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.AR · cs.LG

keywords weight streaming · UMA edge accelerators · cache-line decryption · AES-256-CTR · DNN memory protection · bandwidth overhead · physical adversaries · secure DMA

The pith

Tessera decrypts DNN weights at cache-line granularity during DRAM fetches on UMA edge accelerators, hiding AES-256-CTR latency behind the fetch for a projected 98.4 percent of the theoretical bandwidth ceiling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reference architecture that intercepts 64-byte memory bursts on shared DRAM systems and computes the required decryption keystream in parallel with the fetch operation. This streams decrypted weights straight into isolated NPU SRAM without reserving permanent memory regions or forcing page-sized transfers. Measurements on three SoC platforms show the cryptographic work stays hidden behind ordinary DRAM timing even in worst-case scenarios. As a result, proprietary models avoid the severe bandwidth penalties that page-granularity encryption imposes on sub-page tensor tiles. The design also closes specific attack paths such as physical DRAM readout and rogue DMA transfers while keeping the active memory footprint to the current tile only.

Core claim

By computing AES-256-CTR keystreams concurrently with 64-byte AXI DRAM bursts, Tessera delivers plaintext model weights directly into NPU SRAM on UMA platforms. This parallelization conceals cryptographic latency behind standard memory access times across three measured SoCs, yielding a projected 98.4 percent of theoretical bandwidth utilization and eliminating the need for permanent secure-memory carve-outs. The approach maintains a 1x footprint for arbitrary layer geometries while blocking UMA-specific leakage vectors including physical extraction and compute hijacking.

What carries the argument

Parallel AES-256-CTR keystream generator attached to intercepted 64-byte AXI bursts that produces decryption material concurrently with DRAM reads so plaintext arrives directly in isolated NPU SRAM.
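
The property that makes this overlap possible is that a CTR keystream depends only on the key, a nonce, and the position of the data, never on the ciphertext itself. The sketch below illustrates that idea in software under loud assumptions: the counter layout (8-byte nonce plus an address-derived block counter) is hypothetical, and `prf_block` uses HMAC-SHA256 from the Python standard library as a stand-in for the AES-256 block function so the example runs without third-party packages. It is not the paper's hardware design.

```python
import hashlib
import hmac

LINE_BYTES = 64    # one 64-byte AXI burst, as in the paper
BLOCK_BYTES = 16   # AES block size; four blocks cover one burst


def counter_block(nonce: bytes, line_addr: int, block_idx: int) -> bytes:
    """Hypothetical layout: 8-byte nonce followed by an 8-byte block counter
    derived from the address of the burst."""
    if len(nonce) != 8:
        raise ValueError("nonce must be 8 bytes")
    counter = line_addr // BLOCK_BYTES + block_idx
    return nonce + counter.to_bytes(8, "big")


def prf_block(key: bytes, block: bytes) -> bytes:
    """Stand-in for one AES-256 block encryption (HMAC-SHA256, NOT AES)."""
    return hmac.new(key, block, hashlib.sha256).digest()[:BLOCK_BYTES]


def line_keystream(key: bytes, nonce: bytes, line_addr: int) -> bytes:
    """Keystream for one burst; needs only the address, never the data."""
    if line_addr % LINE_BYTES:
        raise ValueError("address must be 64-byte aligned")
    blocks = range(LINE_BYTES // BLOCK_BYTES)
    return b"".join(prf_block(key, counter_block(nonce, line_addr, i)) for i in blocks)


def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """Combine data with keystream; the same call encrypts and decrypts."""
    return bytes(d ^ k for d, k in zip(data, keystream))


def encrypt_region(key: bytes, nonce: bytes, base: int, plaintext: bytes) -> bytes:
    """Encrypt a line-aligned region burst by burst (offline packaging step)."""
    out = bytearray()
    for off in range(0, len(plaintext), LINE_BYTES):
        ks = line_keystream(key, nonce, base + off)
        out += xor_bytes(plaintext[off:off + LINE_BYTES], ks)
    return bytes(out)


# Illustrative values only: key, nonce, base address, and weights are made up.
key = bytes(range(32))
nonce = b"tessera0"
weights = bytes(i % 251 for i in range(4 * LINE_BYTES))
dram = encrypt_region(key, nonce, 0x1000, weights)

# Fetch only the third burst. The keystream can be computed from the address
# alone, so in hardware it could be ready before the DRAM data returns.
addr = 0x1000 + 2 * LINE_BYTES
tile = xor_bytes(dram[128:192], line_keystream(key, nonce, addr))
assert tile == weights[128:192]
```

Because each burst is decrypted independently, a sub-page tile can be fetched without touching its neighbours, which is the random-access behaviour the 1x footprint claim relies on. Note that CTR mode by itself gives confidentiality only; integrity would need a separate mechanism.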

If this is right

Page-level memory encryption incurs up to 32 times bandwidth overhead for typical tensor tiles, whereas Tessera sustains a 1x footprint for all layer shapes.
Memory usage remains transient and limited to the active tile, removing any requirement for permanently reserved secure DRAM regions.
Major UMA attack surfaces including physical DRAM extraction, rogue DMA, and compute hijacking are neutralized with formal guarantees against plaintext leakage even for sparse tensors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parallel-fetch pattern could extend to other shared-memory accelerators that must protect bulk data without custom secure enclaves.
If the timing assumptions hold across additional process nodes, the technique might support broader consumer deployment of proprietary models on commodity edge silicon.
Adapting the keystream unit to alternate modes or ciphers would allow security policies to be tuned without changing the bandwidth profile.

Load-bearing premise

AES-256-CTR keystream computation can be parallelized to fully hide its latency behind DRAM fetch times on UMA systems even under worst-case timing variations.

What would settle it

A timing measurement on any of the three tested SoC platforms in which AES-256-CTR latency for a 64-byte burst exceeds the DRAM fetch duration under worst-case variation, dropping effective bandwidth below 98 percent.
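
One way to phrase that falsification test as a check over per-burst timing samples is sketched below. The bandwidth model is an assumption made for illustration (any keystream latency that exceeds the DRAM fetch time is treated as fully serialized stall time), and the names `BurstSample`, `effective_bandwidth_fraction`, and `claim_refuted` are hypothetical. The paper's own projection method is not described in the abstract, so this should not be read as reproducing the 98.4 percent figure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BurstSample:
    """Timing of one 64-byte burst, in nanoseconds (hypothetical record)."""
    dram_fetch_ns: float
    keystream_ns: float


def exposed_latency_ns(sample: BurstSample) -> float:
    """Crypto time that is NOT hidden behind the DRAM fetch."""
    return max(0.0, sample.keystream_ns - sample.dram_fetch_ns)


def effective_bandwidth_fraction(samples: Sequence[BurstSample]) -> float:
    """Assumed model: exposed crypto latency adds directly to transfer time."""
    if not samples:
        raise ValueError("need at least one sample")
    dram_total = sum(s.dram_fetch_ns for s in samples)
    stall_total = sum(exposed_latency_ns(s) for s in samples)
    return dram_total / (dram_total + stall_total)


def claim_refuted(samples: Sequence[BurstSample], threshold: float = 0.98) -> bool:
    """True if some burst exposes crypto latency and bandwidth falls below threshold."""
    any_exposed = any(exposed_latency_ns(s) > 0 for s in samples)
    return any_exposed and effective_bandwidth_fraction(samples) < threshold


# Made-up numbers, purely to exercise the functions.
hidden = [BurstSample(dram_fetch_ns=50.0, keystream_ns=40.0)] * 3
exposed = [BurstSample(dram_fetch_ns=50.0, keystream_ns=60.0)]
```

With these illustrative inputs, `claim_refuted(hidden)` is false because every keystream finishes inside its fetch window, while `claim_refuted(exposed)` is true because the stall drags the modelled bandwidth well under the threshold.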

Figures

Figures reproduced from arXiv: 2604.23205 by Animan Naskar.

**Figure 1.** Summary of the Tessera datapath and trust boundary.

**Figure 2.** Page-level ICE bandwidth amplification A(t) = ⌈4096/t⌉ by layer type (tile sizes t from TensorRT/Gemmini schedules [18,19]): Batch Norm 128 B; DW Conv 3×3 288 B; PW Conv narrow 512 B; Conv 3×3 mid 1024 B; PW Conv wide 2048 B; Conv/Attn/FC ≥4096 B (A = 1). Tessera holds A ≈ 1× throughout (dashed line).
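
The amplification formula in the Figure 2 caption can be evaluated directly for the listed tile sizes. The sketch below does that; `page_amplification` follows the caption's A(t) = ⌈4096/t⌉, while `line_amplification` is an assumed counterpart for 64-byte granularity (bytes moved in whole bursts per tile byte, for a burst-aligned tile) that is not taken from the paper.

```python
PAGE_BYTES = 4096   # page granularity from the Figure 2 caption
LINE_BYTES = 64     # cache-line / AXI burst granularity


def page_amplification(tile_bytes: int) -> int:
    """A(t) = ceil(4096 / t), in integer arithmetic to avoid float rounding."""
    return -(-PAGE_BYTES // tile_bytes)


def line_amplification(tile_bytes: int) -> float:
    """Assumed counterpart: bytes fetched in whole 64-byte bursts per tile byte."""
    bursts = -(-tile_bytes // LINE_BYTES)
    return bursts * LINE_BYTES / tile_bytes


# Tile sizes as listed in the caption; 4096 stands in for the ">= 4096 B" class.
TILE_BYTES = {
    "Batch Norm": 128,
    "DW Conv 3x3": 288,
    "PW Conv narrow": 512,
    "Conv 3x3 mid": 1024,
    "PW Conv wide": 2048,
    "Conv/Attn/FC": 4096,
}

for layer, t in TILE_BYTES.items():
    print(f"{layer:<16} t={t:>4} B  page A={page_amplification(t):>2}x  "
          f"line A={line_amplification(t):.2f}x")
```

The 128-byte Batch Norm tile gives the 32x worst case quoted in the abstract. Under the assumed burst model, the only listed tile that is not a multiple of 64 bytes (288 B) rounds up to five bursts, about 1.11x, which is one plausible reading of why the caption says A ≈ 1× rather than exactly 1×.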

Original abstract

Deploying proprietary Deep Neural Networks (DNNs) on commodity edge devices demands hardware-backed Digital Rights Management (DRM) capable of withstanding both software-level and physical adversaries. In Unified Memory Architecture (UMA) systems, the host CPU and Neural Processing Unit (NPU) share physical DRAM, leaving plaintext model weights directly readable by a compromised OS kernel. Existing defenses fail in this constrained setting: trusted execution environments monopolize scarce memory with permanently reserved regions, while full-memory encryption operates at page granularity. This forces the system to fetch massive 4 KB memory pages for sub-page tensor tiles, severely crippling bandwidth. We present Tessera, a reference architecture for inline, cache-line granularity weight decryption on UMA edge accelerators. The design intercepts 64-byte AXI bursts, computing AES-256-CTR keystreams in parallel with DRAM fetches. This streams plaintext directly into isolated NPU SRAM, creating a transient memory footprint confined to the active tile and eliminating the need for permanent memory carve-outs. Measurements across three distinct SoC platforms demonstrate that this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations. Consequently, Tessera is projected to achieve 98.4% of the theoretical memory bandwidth ceiling (a mere 1.6% overhead). Across standard vision and language models, page-level memory encryption suffers up to a 32x bandwidth penalty, whereas Tessera maintains an optimal 1x footprint for all layer geometries. Finally, Tessera neutralizes major UMA-specific attack vectors, including physical DRAM extraction, rogue DMA, and compute hijacking, and formally prevents plaintext leakage across sparse tensors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Tessera shows a workable path to cache-line granularity decryption on UMA edge NPUs by parallelizing AES-CTR with DRAM fetches, but the 98.4% bandwidth claim rests on limited evidence.

Tessera targets a real constraint: proprietary DNN weights sitting in shared DRAM on edge SoCs where a compromised kernel can read them. The design intercepts 64-byte AXI bursts, runs AES-256-CTR in parallel with the memory fetch, and feeds plaintext straight into isolated NPU SRAM. This avoids both the permanent carve-outs of TEEs and the 4 KB page fetches that kill bandwidth for small tensor tiles. The result is a transient footprint that only holds the active tile and a claimed 1.6% overhead versus raw DRAM speed. That combination is the concrete advance over prior page-granular or TEE-only approaches.

The paper also walks through how the scheme blocks physical DRAM extraction, rogue DMA, and compute hijacking, plus a formal claim that plaintext never leaks even for sparse tensors. Measurements on three distinct SoC platforms are presented as the supporting data. The architecture description itself is clear enough that someone building similar accelerators could replicate the burst interception logic.

The soft spot is the performance evidence. The 98.4% figure is labeled projected, and the abstract gives no cycle counts, number of parallel AES engines, or explicit comparison showing crypto time stays under DRAM time in worst-case bus contention or address patterns. Without those breakdowns it is hard to judge whether the hiding generalizes or depends on the tested platforms' specific timing. The security claims are stated but lack even a short proof sketch in the provided text.

This paper is aimed at hardware security researchers and edge ML system builders who need to protect model IP without custom memory hardware. A reader already working on UMA accelerators or secure inference would find the reference design useful as a starting point. It deserves peer review because the problem is practical and the proposed mechanism is specific, even though the current results section needs expansion on methods and data to stand up.

Referee Report

3 major / 0 minor

Summary. The paper presents Tessera, a reference architecture for inline cache-line (64-byte AXI burst) granularity weight decryption on UMA edge accelerators. It uses parallel AES-256-CTR keystream generation to hide cryptographic latency behind standard DRAM fetches, achieving a transient memory footprint limited to the active NPU tile. Measurements on three SoC platforms are claimed to show that this holds even under worst-case timing variations, yielding a projected 98.4% of theoretical memory bandwidth (1.6% overhead). Tessera is further claimed to avoid the up-to-32x bandwidth penalty of page-level encryption for vision and language models while neutralizing UMA-specific attacks and formally preventing plaintext leakage across sparse tensors.

Significance. If the core performance claim holds, Tessera would address a practical gap in hardware-backed DRM for proprietary DNNs on commodity UMA edge devices, where TEEs and full-memory encryption impose unacceptable memory or bandwidth costs. The parallel crypto-DRAM interleaving and transient-tile design are conceptually attractive for bandwidth-constrained NPUs.

major comments (3)

[Abstract] The central claim that 'this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations' is load-bearing for the 98.4% bandwidth projection, yet no cycle counts, number of parallel AES engines, or explicit timing breakdown (crypto time ≤ DRAM fetch window) is supplied for the three platforms or for worst-case scenarios such as bus contention or address-dependent CTR increments.
[Abstract] The 98.4% figure is labeled 'projected' with no description of the projection method, measurement baselines, error bars, or how real-platform data were extrapolated while isolating platform-specific effects from the AES-DRAM hiding assumption.
[Abstract] The statement that Tessera 'formally prevents plaintext leakage across sparse tensors' is presented without a proof sketch, formal model, or security argument, leaving the formal claim unsupported in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive comments on our work. We respond to each major comment below and will incorporate revisions to address the concerns.

Referee: [Abstract] The central claim that 'this parallelization hides cryptographic latency behind standard DRAM fetch times, a condition that holds even under worst-case timing variations' is load-bearing for the 98.4% bandwidth projection, yet no cycle counts, number of parallel AES engines, or explicit timing breakdown (crypto time ≤ DRAM fetch window) is supplied for the three platforms or for worst-case scenarios such as bus contention or address-dependent CTR increments.

Authors: The full paper provides these details in the architecture and evaluation sections, including cycle counts from measurements on the three platforms and analysis of worst-case scenarios. To improve the abstract, we will add a short clause summarizing the parallel AES implementation and confirming the timing condition holds based on our measurements.

Revision: yes

Referee: [Abstract] The 98.4% figure is labeled 'projected' with no description of the projection method, measurement baselines, error bars, or how real-platform data were extrapolated while isolating platform-specific effects from the AES-DRAM hiding assumption.

Authors: We will revise the abstract to describe the projection method, which is based on the minimum bandwidth efficiency measured across the platforms after isolating the crypto overhead. This will include the baseline of theoretical bandwidth and note that the overhead is consistent without significant variation requiring error bars.

Revision: yes

Referee: [Abstract] The statement that Tessera 'formally prevents plaintext leakage across sparse tensors' is presented without a proof sketch, formal model, or security argument, leaving the formal claim unsupported in the provided text.

Authors: The manuscript contains a security analysis arguing that the transient memory footprint and per-cacheline decryption prevent leakage. We will add a brief security argument to the abstract to support the formal claim, or expand the main text if necessary.

Revision: yes

Circularity Check

0 steps flagged

No circularity: central claims rest on empirical measurements across real SoC platforms.

The paper presents its key performance result (a projected 98.4% of theoretical bandwidth, 1.6% overhead) as a projection grounded in measurements on three distinct hardware platforms, not as a mathematical derivation, fitted parameter, or self-referential equation. No load-bearing steps reduce by construction to the paper's own inputs; the design choices (AXI-burst interception and parallel AES-256-CTR) are described architecturally and then validated externally via testing rather than justified via self-citation chains, uniqueness theorems, or ansatzes imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard cryptographic primitives and hardware interface assumptions. No free parameters or new entities are introduced in the abstract. Full audit requires the complete manuscript.

axioms (2)

[standard math] AES-256-CTR provides confidentiality against the stated adversaries.
Relies on established properties of the AES-CTR mode for stream encryption.
[domain assumption] 64-byte AXI bursts can be intercepted and processed inline without altering bus protocol.
Hardware-level assumption about the memory interface in UMA accelerators.

pith-pipeline@v0.9.0 · 5605 in / 1210 out tokens · 37985 ms · 2026-05-08T07:52:09.335282+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] K. Moriarty, B. Kaliski, J. Jonsson, A. Rusch: PKCS #1: RSA Cryptography Specifications Version 2.2. RFC 8017 (Nov. 2016)
[2] E. Barkan, E. Biham, N. Keller: Instant ciphertext-only cryptanalysis of GSM encrypted communication. In: Proc. CRYPTO, pp. 600–616 (2003)
[3] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, E. W. Felten: Lest we remember: Cold-boot attacks on encryption keys. Commun. ACM 52(5), 91–98 (2009)
[4] A. Tatar, R. Krishnan, E. Bos, C. Giuffrida, H. Bos, K. Razavi: Throwhammer: Rowhammer attacks over the network and defenses. In: Proc. USENIX ATC, pp. 213–226 (2018)
[5] F. Mo, H. Haddadi, K. Katevas, E. Matus, D. Perino, N. Kourtellis: PPFL: Privacy-preserving federated learning with trusted execution environments. In: Proc. ACM MobiSys, pp. 94–108 (2021)
[6] AMD: AMD64 Architecture Programmer's Manual, Volume 2: System Programming, Secure Memory Encryption. AMD Pub. 24593, Rev. 3.41 (2023)
[7] Intel Corporation: Intel Total Memory Encryption - Multi-Key (TME-MK) Architecture Specification, Rev. 1.3 (Mar. 2021)
[8] V. Costan, S. Devadas: Intel SGX Explained. IACR Cryptology ePrint Archive, Report 2016/086 (2016)
[9] ARM Limited: ARM Security Technology: Building a Secure System Using TrustZone Technology. ARM PRD29-GENC-009492C (Apr. 2009)
[10] AMD: AMD Platform Security Processor (PSP) Architecture Overview. White Paper (2020)
[11] NVIDIA Corporation: NVIDIA H100 Tensor Core GPU Architecture: Confidential Computing. White Paper (Apr. 2022)
[12] F. Tramèr, D. Boneh: Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. In: Proc. ICLR (2019)
[13] A. T. Markettos, C. Rothwell, B. F. Gutstein, A. Pearce, P. G. Neumann, S. W. Moore, R. N. M. Watson: Thunderclap: Exploring vulnerabilities in operating system IOMMU protection via DMA from untrustworthy peripherals. In: Proc. NDSS (2019)
[14] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, H. Guan: MGX: Near-zero overhead memory protection for data-intensive accelerators. In: Proc. ISCA, pp. 726–741 (2022)
[15] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, O. Mutlu: Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In: Proc. ISCA, pp. 361–372 (2014)
[16] D. J. Bernstein: The Poly1305-AES message-authentication code. In: Proc. FSE, LNCS 3557, pp. 32–49 (2005)
[17] S. Nikova, C. Rechberger, V. Rijmen: Threshold implementations against side-channel attacks and glitches. In: Proc. ICICS, LNCS 4307, pp. 529–545 (2006)
[18] NVIDIA Corporation: TensorRT Developer Guide, version 8.6 (2023). [https://docs.nvidia.com/deeplearning/tensorrt](https://docs.nvidia.com/deeplearning/tensorrt)
[19] H. Genc et al.: Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In: Proc. DAC (2021)
