arxiv: 1907.08641 · v1 · pith:RAX4GFPCnew · submitted 2019-07-19 · 💻 cs.AR

PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like Operations

Pith reviewed 2026-05-24 18:46 UTC · model grok-4.3

classification 💻 cs.AR
keywords processing in memory · in-memory accelerator · content-addressable memory · matrix-vector product · neural network acceleration · cryptography · forward error correction · standard-cell CMOS

The pith

PPAC performs matrix-vector-product operations inside associative memory to accelerate neural networks, hash lookups, cryptography, and error correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PPAC, a digital in-memory accelerator built around associative content-addressable memory that supports a family of matrix-vector-product-like computations. It shows that one hardware structure can handle low-precision neural networks, exact and approximate hash lookups, cryptography, and forward error correction without custom analog circuits. A reader would care because the design uses only standard-cell CMOS, which simplifies automated layout and allows the same accelerator to move across technology nodes. Post-layout results in 28 nm demonstrate throughput and energy numbers competitive with both digital and mixed-signal PIM designs while covering a broader set of tasks. The central argument is that moving these specific operations into memory yields efficiency gains without sacrificing versatility or design portability.

Core claim

PPAC integrates parallel processing directly into content-addressable memory arrays so that matrix-vector-product-like operations execute in place, delivering acceleration for low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction; because the architecture remains fully digital and standard-cell based, it supports automated design flows and straightforward porting between CMOS nodes.
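
To make "matrix-vector-product-like" concrete for the 1-bit case, the following is a minimal sketch of the well-known XNOR-and-popcount identity for ±1 vectors. It assumes the encoding bit 1 for +1 and bit 0 for -1; the function names and the integer-per-row representation are hypothetical and are not taken from the paper.

```python
def xnor_popcount_mvp(rows, x, n_bits):
    """Return the +/-1 inner product of every stored row with x.

    rows and x are integers whose low n_bits encode +/-1 entries
    (assumed encoding: bit 1 -> +1, bit 0 -> -1).
    """
    mask = (1 << n_bits) - 1
    out = []
    for row in rows:
        agree = ~(row ^ x) & mask          # XNOR: 1 wherever the bits agree
        matches = bin(agree).count("1")    # popcount = Hamming similarity
        out.append(2 * matches - n_bits)   # agreements minus disagreements
    return out


def reference_mvp(rows, x, n_bits):
    """Same product computed entry by entry, used only as a cross-check."""
    def to_pm1(word):
        return [1 if (word >> k) & 1 else -1 for k in range(n_bits)]
    xv = to_pm1(x)
    return [sum(a * b for a, b in zip(to_pm1(r), xv)) for r in rows]


rows = [0b1011, 0b0000, 0b1111, 0b0100]   # four stored 4-bit words
x = 0b1011                                # input vector
print(xnor_popcount_mvp(rows, x, 4))      # [4, -2, 2, -4]
```

In this formulation a conventional CAM match is the special case where the result equals n_bits, which is one way to see why an associative memory is a natural host for such products.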

What carries the argument

The Parallel Processor in Associative Content-addressable memory (PPAC) array, which embeds simple processing elements inside a content-addressable memory so that parallel multiply-accumulate-style operations occur locally without moving data off the memory array.
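
One common way to obtain multiply-accumulate-style results for low-precision operands from 1-bit row operations is bit-plane decomposition. The sketch below assumes unsigned k-bit matrix entries and a 1-bit {0,1} input vector; whether PPAC schedules multi-bit operands this way is an assumption here, and all names are hypothetical.

```python
def and_popcount(word, x):
    """1-bit {0,1} inner product of two bit-packed vectors."""
    return bin(word & x).count("1")


def bitplane_mvp(matrix, x_bits, k_bits):
    """Multiply an unsigned k_bits-precision matrix by a {0,1} vector.

    Each row is split into k_bits bit-planes; plane k contributes its
    1-bit inner product with x, weighted by 2**k.
    """
    x = sum(bit << j for j, bit in enumerate(x_bits))   # pack x into one word
    result = []
    for row in matrix:
        acc = 0
        for k in range(k_bits):
            # Pack bit k of every entry in this row into one word.
            plane = sum(((a >> k) & 1) << j for j, a in enumerate(row))
            acc += and_popcount(plane, x) << k
        result.append(acc)
    return result


matrix = [[3, 0, 2, 1], [1, 1, 1, 1], [0, 3, 3, 0]]   # 2-bit entries
x_bits = [1, 0, 1, 1]
print(bitplane_mvp(matrix, x_bits, k_bits=2))          # [6, 3, 3]
```

The cost of this approach grows linearly with the operand precision, which is consistent with the emphasis on low-precision workloads.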

If this is right

  • A single PPAC instance can replace several specialized accelerators for the listed workloads.
  • Design teams can use standard digital tools and libraries rather than custom analog or mixed-signal flows.
  • The same PPAC layout can be reused when moving to a new process node without redesigning analog components.
  • Throughput and energy efficiency remain competitive with recent PIM accelerators while supporting more application types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that already contain content-addressable memory could add PPAC-style processing with modest extra area.
  • The digital nature may simplify verification and testing compared with analog PIM approaches.
  • Edge devices that must run both inference and lightweight cryptography could share one accelerator block.

Load-bearing premise

That applications can be mapped to the PPAC array with low overhead, and that post-layout simulations in 28 nm CMOS accurately reflect the power, area, and speed of a fabricated chip.

What would settle it

Fabricate a PPAC test chip in 28 nm CMOS, map one of the claimed workloads (for example a small neural-network layer or a cryptographic primitive) onto it, and compare measured throughput and energy per operation against the post-layout predictions; a large gap would falsify the performance claims.

Figures

Figures reproduced from arXiv: 1907.08641 by Alexandra Gallyas-Sanhueza, Christoph Studer, Maria Bobbett, Oscar Castañeda.

Figure 1: Idealized efficiency-flexibility trade-off for different hardware
Figure 2: Parallel Processor in Associative CAM (PPAC) architecture. (a) High-level architecture. (b) Each bit-cell includes an XNOR and an AND gate to …
Figure 3: Layout of the 256 × 256 PPAC with B = Bs = 16. All banks but one are colored using different shades of blue. For the gray bank, one row is shown in green, while the row memory and row ALU of another row are shown in orange and red, respectively. … increasing the number of words M results in a higher area and power consumption than increasing the number of bits per word N by the same factor. This behavior is …
Original abstract

Processing in memory (PIM) moves computation into memories with the goal of improving throughput and energy-efficiency compared to traditional von Neumann-based architectures. Most existing PIM architectures are either general-purpose but only support atomistic operations, or are specialized to accelerate a single task. We propose the Parallel Processor in Associative Content-addressable memory (PPAC), a novel in-memory accelerator that supports a range of matrix-vector-product (MVP)-like operations that find use in traditional and emerging applications. PPAC is, for example, able to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. The fully-digital nature of PPAC enables its implementation with standard-cell-based CMOS, which facilitates automated design and portability among technology nodes. To demonstrate the efficacy of PPAC, we provide post-layout implementation results in 28nm CMOS for different array sizes. A comparison with recent digital and mixed-signal PIM accelerators reveals that PPAC is competitive in terms of throughput and energy-efficiency, while accelerating a wide range of applications and simplifying development.
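
Cryptography and forward error correction often reduce to matrix-vector products over GF(2), where each output bit is the parity of a bitwise AND. A minimal sketch of that arithmetic follows, using the textbook Hamming(7,4) parity-check matrix as the example; the code, the function names, and the choice of example are illustrative assumptions and do not come from the paper.

```python
def gf2_mvp(rows, x):
    """Matrix-vector product over GF(2): parity of popcount(row AND x)."""
    return [bin(row & x).count("1") & 1 for row in rows]


# Hamming(7,4) parity-check matrix: column p (p = 1..7) holds the binary
# expansion of p. Bit p-1 of each packed row corresponds to position p.
H = [sum(((p >> i) & 1) << (p - 1) for p in range(1, 8)) for i in range(3)]

codeword = 0b0000111             # positions 1, 2, 3 set; 1 ^ 2 ^ 3 == 0
received = codeword ^ (1 << 4)   # flip position 5 (bit index 4)

syndrome = gf2_mvp(H, received)
error_position = sum(bit << i for i, bit in enumerate(syndrome))

print(gf2_mvp(H, codeword))      # [0, 0, 0] -> valid codeword
print(syndrome, error_position)  # [1, 0, 1] 5
```

The only difference from an integer 1-bit product is the final reduction: a parity bit instead of a full popcount.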

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes PPAC, a fully digital in-memory accelerator based on associative content-addressable memory that supports a range of matrix-vector-product-like operations. It claims to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. The architecture is implemented using standard-cell CMOS, with post-layout results reported in 28 nm for multiple array sizes; these results are positioned as competitive in throughput and energy efficiency against recent digital and mixed-signal PIM accelerators while offering greater application versatility and design portability.

Significance. If the post-layout results and application mappings hold, the work is significant because it demonstrates a standard-cell-based PIM design that avoids mixed-signal complexities while spanning multiple domains. Explicit acknowledgment that only post-layout data are supplied and that mappings are shown for a subset of workloads is a strength; the paper thereby avoids overclaiming fabricated-silicon performance. The approach of using associative memory for MVP-like operations across domains is a concrete contribution to portable PIM research.

minor comments (3)
  1. §4 (Implementation): the post-layout area and power numbers for the 256×256 array are presented without an accompanying breakdown of the contribution from the associative CAM cells versus peripheral logic; adding this would strengthen the claim that the architecture scales efficiently.
  2. Table 2: the energy-efficiency comparison lists PPAC against prior work but does not state whether the prior-work numbers were obtained under identical activity factors or workload assumptions; a footnote clarifying this would improve fairness.
  3. §5.2 (Application mapping): the hash-lookup and FEC examples are illustrated at a high level; a short pseudocode or dataflow diagram for at least one mapping would make the “low overhead” claim easier to verify.
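
As an illustration of the kind of listing the third comment asks for, here is a minimal sketch of exact and approximate hash lookup expressed as a Hamming-similarity threshold over stored words. It is an editorial sketch under stated assumptions, not the paper's mapping: the function name, the `max_mismatches` parameter, and the sample signatures are hypothetical.

```python
def cam_lookup(stored, query, n_bits, max_mismatches=0):
    """Return indices of stored words within max_mismatches bits of query.

    max_mismatches=0 behaves like an exact CAM match; larger values give
    an approximate (similarity-based) lookup.
    """
    mask = (1 << n_bits) - 1
    hits = []
    for idx, word in enumerate(stored):
        similarity = bin(~(word ^ query) & mask).count("1")  # XNOR + popcount
        if similarity >= n_bits - max_mismatches:            # per-row threshold
            hits.append(idx)
    return hits


stored = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
query = 0b10110010

print(cam_lookup(stored, query, 8))                    # [0]
print(cam_lookup(stored, query, 8, max_mismatches=1))  # [0, 1]
print(cam_lookup(stored, query, 8, max_mismatches=2))  # [0, 1, 3]
```

In hardware the loop over rows would run in parallel, so the overhead question raised by the referee concerns mainly how signatures are loaded and how hits are read out.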

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of PPAC, the recognition of its versatility across applications, and the recommendation for minor revision. The strengths noted regarding post-layout results and avoidance of overclaiming are appreciated.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript is an architecture and implementation paper describing a standard-cell digital PIM design (PPAC) for MVP-like operations. It supports its claims via explicit hardware mappings for a subset of workloads and post-layout results in 28 nm CMOS rather than any derivation chain, fitted parameters, or self-referential predictions. No equations, ansatzes, or uniqueness theorems appear that could reduce to inputs by construction; the central thesis rests on the described circuit organization and the reported post-layout metrics, which are externally falsifiable via fabrication. Any self-citations are incidental and non-load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the feasibility of mapping multiple application classes onto a single digital associative-memory structure and on the accuracy of post-layout area/power estimates in a commercial 28nm process; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Standard-cell library and 28 nm CMOS process models are accurate for post-layout estimation
    Invoked when reporting post-layout results and comparisons.
invented entities (1)
  • PPAC architecture (no independent evidence)
    purpose: Hardware structure enabling versatile MVP-like operations inside memory
    New design introduced by the paper; no independent evidence outside the proposal itself.

