DRAMatic Speedup: Accelerating HE Operations on a Processing-in-Memory System

Jonas Sander; Niklas Klinger; Pascal Felber; Peterson Yuhala; Thomas Eisenbarth

arxiv: 2602.12433 · v2 · submitted 2026-02-12 · 💻 cs.CR · cs.DC

Pith reviewed 2026-05-16 04:58 UTC · model grok-4.3

keywords homomorphic encryption · processing-in-memory · PIM acceleration · residue number system · number theoretic transform · UPMEM · DRAMatic · confidential computing

The pith

DRAMatic runs foundational homomorphic encryption operations 334 times faster on UPMEM processing-in-memory hardware than prior PIM implementations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DRAMatic to map core homomorphic encryption arithmetic onto UPMEM PIM, a programmable memory module with on-chip processing units. It applies residue number system representations and number-theoretic transforms to handle the large parameters needed for secure evaluations. This produces a 334-fold speedup over earlier HE work on the same hardware. Direct comparisons with Microsoft SEAL show reduced runtime and energy use, though data movement costs and multiplication throughput remain limiting. The authors outline possible hardware changes to UPMEM that could remove those constraints.
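To make the residue number system idea concrete, here is a minimal sketch in Python. It is not DRAMatic's code: the moduli are hypothetical toy primes chosen for readability, and the helper names `to_rns`, `rns_mul`, and `from_rns` are illustrative only. The point it shows is that one wide modular multiplication splits into several small, independent ones, which is the property that suits many small parallel processing units.

```python
from math import prod

# Hypothetical small pairwise-coprime moduli (assumption for illustration);
# real HE parameter sets use several much larger primes.
MODULI = [97, 193, 257]
Q = prod(MODULI)  # composite modulus represented by the basis


def to_rns(x, moduli=MODULI):
    """Split an integer into one small residue per modulus."""
    return [x % m for m in moduli]


def rns_mul(a, b, moduli=MODULI):
    """Multiply limb by limb; each limb is independent of the others."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]


def from_rns(residues, moduli=MODULI):
    """Recombine residues with the Chinese Remainder Theorem."""
    q = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        q_i = q // m
        # pow(q_i, -1, m) is the modular inverse (Python 3.8+).
        total += r * q_i * pow(q_i, -1, m)
    return total % q


x, y = 123456, 7890
product = from_rns(rns_mul(to_rns(x), to_rns(y)))
```

Here `product` equals `(x * y) % Q`, even though no single step ever multiplied two numbers larger than the largest modulus.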

Core claim

DRAMatic implements the basic arithmetic of homomorphic encryption on UPMEM PIM by combining residue number system encoding with number-theoretic transforms. It delivers a 334-fold speedup over previous PIM-based HE implementations and narrows the gap to conventional libraries such as SEAL, although data-transfer and multiplication bottlenecks remain.

What carries the argument

Residue number system combined with number-theoretic transforms, executed across the parallel processing units embedded in UPMEM PIM memory modules.
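The number-theoretic transform half of that machinery can be sketched as follows. This is a toy under stated assumptions, not the paper's kernel: the prime `Q = 17`, degree `N = 8`, and root `PSI = 3` are hypothetical values picked so the arithmetic is easy to check by hand, and the recursive transform favors clarity over the in-place iterative forms that optimized implementations typically use. It multiplies two polynomials modulo X^N + 1, the ring commonly used by polynomial-based HE schemes, and compares the result with a schoolbook reference.

```python
Q = 17   # hypothetical toy prime with 2*N dividing Q - 1
N = 8    # number of coefficients (power of two)
PSI = 3  # primitive 2N-th root of unity mod Q (3 has order 16 mod 17)
OMEGA = PSI * PSI % Q  # primitive N-th root of unity


def ntt(a, omega):
    """Recursive radix-2 transform: evaluates a at powers of omega mod Q."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % Q)
    odd = ntt(a[1::2], omega * omega % Q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * omega % Q
    return out


def negacyclic_mul(a, b):
    """Multiply a and b modulo (X^N + 1, Q) via a twisted NTT."""
    # Twisting by powers of PSI turns the negacyclic product into a cyclic one.
    a_t = [x * pow(PSI, i, Q) % Q for i, x in enumerate(a)]
    b_t = [x * pow(PSI, i, Q) % Q for i, x in enumerate(b)]
    fc = [x * y % Q for x, y in zip(ntt(a_t, OMEGA), ntt(b_t, OMEGA))]
    c_t = ntt(fc, pow(OMEGA, -1, Q))  # inverse transform, still unscaled
    n_inv, psi_inv = pow(N, -1, Q), pow(PSI, -1, Q)
    return [x * n_inv * pow(psi_inv, i, Q) % Q for i, x in enumerate(c_t)]


def schoolbook_negacyclic(a, b):
    """Quadratic-time reference: X^N wraps around with a sign flip."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + ai * bj) % Q
            else:
                c[k - N] = (c[k - N] - ai * bj) % Q
    return c


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
fast = negacyclic_mul(a, b)
slow = schoolbook_negacyclic(a, b)
```

In an RNS setting, a multiplication like this would run once per modulus, with each limb handled independently.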

If this is right

Foundational HE operations become substantially faster when executed on PIM hardware.
DRAMatic reduces the runtime and energy gap between PIM implementations and standard libraries such as SEAL.
Data-transfer overhead and limited multiplication throughput currently constrain further gains on UPMEM PIM.
Hardware extensions to UPMEM PIM could improve support for the large parameters and arithmetic patterns of HE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

PIM architectures could become practical for confidential cloud workloads if data movement costs are lowered.
The same residue-number and transform optimizations may accelerate other memory-bound cryptographic tasks beyond HE.
Future PIM designs with faster on-module multiplication units would better match the needs of polynomial-based encryption schemes.

Load-bearing premise

The UPMEM PIM hardware can support the large parameters required for secure homomorphic evaluations with acceptable data-transfer overhead.

What would settle it

Measure end-to-end runtime and energy for a complete secure homomorphic evaluation of a non-trivial circuit on UPMEM PIM using DRAMatic versus the identical circuit run with SEAL on a standard CPU.

Original abstract

Homomorphic encryption (HE) is a promising technology for confidential cloud computing, as it allows computations on encrypted data. However, HE is computationally expensive and often memory-bound on conventional computer architectures. Processing-in-Memory (PIM) is an alternative hardware architecture that integrates processing units and memory on the same chip or memory module. PIM enables higher memory bandwidth than conventional architectures and could thus be suitable for accelerating HE. We present DRAMatic, which implements operations foundational to HE on UPMEM PIM -- a programmable general-purpose PIM system developed by UPMEM. DRAMatic incorporates many arithmetic optimizations, including residue number system and number-theoretic transform techniques, and can support the large parameters required for secure homomorphic evaluations. It achieves a 334 times speed-up compared to previous HE implementations on UPMEM PIM. We also evaluate DRAMatic against Microsoft SEAL, a popular open-source HE library, regarding both runtime and energy efficiency. The results show that DRAMatic significantly closes the gap between Microsoft SEAL and HE implementations on UPMEM PIM. However, we also show that DRAMatic is currently constrained by data transfer overhead and limited multiplication performance on UPMEM PIM hardware. Finally, we discuss potential hardware extensions to UPMEM PIM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents DRAMatic, an implementation of foundational homomorphic encryption operations on UPMEM PIM hardware. It incorporates RNS and NTT optimizations to support large secure parameters, reports a 334x speedup over prior HE implementations on the same platform, and compares runtime and energy efficiency against Microsoft SEAL, while noting constraints from data-transfer overhead and limited multiplication performance.

Significance. If the performance numbers hold with full inclusion of overheads, the work demonstrates that PIM can accelerate memory-bound HE workloads and narrows the gap to conventional libraries such as SEAL, offering a concrete data point for hardware-software co-design in confidential computing.

major comments (1)

[Abstract] Abstract: the headline 334x speedup versus previous UPMEM PIM HE implementations is presented without a breakdown of compute time versus host-to-PIM data-transfer time for the evaluated parameter sets. Because the abstract itself identifies data transfer overhead as a binding constraint, it is unclear whether the reported figure already folds transfers in or whether they remain negligible; this directly affects whether the central acceleration claim is supported.

minor comments (1)

The experimental section should include error bars, standard deviations, or raw timing data for all runtime and energy figures to permit assessment of measurement variability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will revise the paper to improve clarity on the reported speedup.

Point-by-point responses

Referee: [Abstract] Abstract: the headline 334x speedup versus previous UPMEM PIM HE implementations is presented without a breakdown of compute time versus host-to-PIM data-transfer time for the evaluated parameter sets. Because the abstract itself identifies data transfer overhead as a binding constraint, it is unclear whether the reported figure already folds transfers in or whether they remain negligible; this directly affects whether the central acceleration claim is supported.

Authors: We agree that the abstract would benefit from greater precision on this point. The 334x speedup is measured for the full end-to-end HE operations on UPMEM PIM, which includes both on-PIM computation and the host-to-PIM data transfers required to move operands and results. This measurement approach matches the methodology used in the prior UPMEM HE implementations we compare against, ensuring an apples-to-apples comparison. To eliminate any ambiguity, we will revise the abstract to explicitly state that the reported speedup incorporates data-transfer overhead. We will also add a table or figure in the evaluation section that breaks down compute time versus transfer time for the evaluated parameter sets, directly supporting the central claim.

revision: yes
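The distinction at stake in this exchange, a speedup computed on kernel time alone versus one that folds in transfers, can be illustrated with a short sketch. Everything here is hypothetical: the `Timing` class, the `speedups` helper, and above all the numbers, which are placeholders chosen to exercise the code and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    compute_s: float   # time spent in the on-device kernel, in seconds
    transfer_s: float  # time spent copying operands and results, in seconds

    @property
    def end_to_end_s(self) -> float:
        return self.compute_s + self.transfer_s


def speedups(baseline: Timing, candidate: Timing) -> dict:
    """Return the speedup with and without transfer time included."""
    return {
        "compute_only": baseline.compute_s / candidate.compute_s,
        "end_to_end": baseline.end_to_end_s / candidate.end_to_end_s,
    }


# Placeholder values only (assumption); NOT results reported by the authors.
baseline = Timing(compute_s=100.0, transfer_s=2.0)
candidate = Timing(compute_s=0.25, transfer_s=1.75)
result = speedups(baseline, candidate)
```

With these placeholder inputs the compute-only ratio is 400 while the end-to-end ratio is 51, which shows why a headline figure needs to say which of the two it reports.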

Circularity Check

0 steps flagged

No significant circularity: empirical implementation with direct runtime measurements

Full rationale

The paper is an engineering report on implementing HE operations (NTT, RNS, etc.) on UPMEM PIM hardware. It reports measured speedups (334x vs prior UPMEM HE) and comparisons to SEAL via direct execution timings and energy figures. No mathematical derivations, fitted parameters, or predictions are presented; the central claims rest on hardware benchmarks against external baselines. The abstract notes data-transfer constraints explicitly, confirming the work does not hide or redefine its own inputs as outputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. This is a standard non-circular implementation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about UPMEM PIM bandwidth advantages and the memory-bound character of HE; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: UPMEM PIM hardware provides substantially higher effective memory bandwidth than conventional CPU/GPU architectures for the access patterns of HE arithmetic.
Invoked to justify the suitability of PIM for HE acceleration.

pith-pipeline@v0.9.0 · 5530 in / 1163 out tokens · 53965 ms · 2026-05-16T04:58:56.350966+00:00 · methodology
