PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like Operations
Pith reviewed 2026-05-24 18:46 UTC · model grok-4.3
The pith
PPAC performs matrix-vector-product operations inside associative memory to accelerate neural networks, hash lookups, cryptography, and error correction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PPAC integrates parallel processing directly into content-addressable memory arrays so that matrix-vector-product-like operations execute in place, which accelerates low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. Because the architecture remains fully digital and standard-cell based, it supports automated design flows and straightforward porting between CMOS nodes.
What carries the argument
The Parallel Processor in Associative Content-addressable memory (PPAC) array, which embeds simple processing elements inside a content-addressable memory so that parallel multiply-accumulate-style operations occur locally without moving data off the memory array.
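As a rough illustration of what "multiply-accumulate-style operations inside a content-addressable memory" can mean for 1-bit data, the sketch below models each array row as comparing its stored bits with the input vector and counting the matches. The XNOR-and-count formulation, the function names, and the example matrix are assumptions made for illustration; they are not taken from the paper's circuit description.

```python
import numpy as np

def hamming_similarity_mvp(A, x):
    """Return, for each row of binary matrix A, the number of bits equal to binary vector x."""
    A = np.asarray(A, dtype=np.uint8)
    x = np.asarray(x, dtype=np.uint8)
    # Per-cell XNOR: 1 where the stored bit equals the input bit, else 0.
    xnor = 1 - (A ^ x)
    # Per-row population count of the matching cells.
    return xnor.sum(axis=1)

def pm1_inner_product_mvp(A, x):
    """Treat bits {0, 1} as {-1, +1}; the row inner product is then 2*h - N."""
    n = np.asarray(A).shape[1]
    h = hamming_similarity_mvp(A, x).astype(np.int64)
    return 2 * h - n

# Hypothetical 3x4 stored matrix and 4-bit input vector.
A = np.array([[1, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 1, 1]], dtype=np.uint8)
x = np.array([1, 0, 0, 1], dtype=np.uint8)

h = hamming_similarity_mvp(A, x)
y = pm1_inner_product_mvp(A, x)
print(h.tolist())  # [3, 1, 2]
print(y.tolist())  # [2, -2, 0]
```

The identity 2*h - N is why a Hamming-similarity row operation is enough to obtain a ±1 inner product: every matching bit contributes +1 and every mismatching bit contributes -1.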
If this is right
- A single PPAC instance can replace several specialized accelerators for the listed workloads.
- Design teams can use standard digital tools and libraries rather than custom analog or mixed-signal flows.
- The same PPAC layout can be reused when moving to a new process node without redesigning analog components.
- Throughput and energy efficiency remain competitive with recent PIM accelerators while supporting more application types.
Where Pith is reading between the lines
- Systems that already contain content-addressable memory could add PPAC-style processing with modest extra area.
- The digital nature may simplify verification and testing compared with analog PIM approaches.
- Edge devices that must run both inference and lightweight cryptography could share one accelerator block.
Load-bearing premise
That applications can be mapped to the PPAC array with low overhead, and that post-layout simulations in 28 nm CMOS accurately reflect the power, area, and speed of a fabricated chip.
What would settle it
Fabricate a PPAC test chip in 28 nm CMOS, map one of the claimed workloads (for example a small neural-network layer or a cryptographic primitive) onto it, and compare measured throughput and energy per operation against the post-layout predictions; a large gap would falsify the performance claims.
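The comparison described above comes down to simple arithmetic, sketched here under stated assumptions. The predicted figures are the post-layout numbers quoted further down this page for the 256×256 array (92 TOP/s at 4.15 fJ/OP). The "measured" values are hypothetical placeholders, not data from any chip, and the helper names are invented for this example.

```python
def implied_power_watts(throughput_ops_per_s, energy_joules_per_op):
    """Average power implied by sustaining a throughput at a given energy per operation."""
    return throughput_ops_per_s * energy_joules_per_op

def relative_gap(measured, predicted):
    """Signed fractional deviation of a measurement from its prediction."""
    return (measured - predicted) / predicted

# Post-layout figures quoted on this page for the 256x256 array.
PREDICTED_THROUGHPUT = 92e12   # operations per second (92 TOP/s)
PREDICTED_ENERGY = 4.15e-15    # joules per operation (4.15 fJ/OP)

predicted_power = implied_power_watts(PREDICTED_THROUGHPUT, PREDICTED_ENERGY)

# Hypothetical placeholders standing in for test-chip results. NOT measurements.
measured_throughput = 80e12
measured_energy = 5.0e-15

throughput_gap = relative_gap(measured_throughput, PREDICTED_THROUGHPUT)
energy_gap = relative_gap(measured_energy, PREDICTED_ENERGY)

print(f"implied power at predicted operating point: {predicted_power:.4f} W")
print(f"throughput gap: {throughput_gap:+.1%}, energy gap: {energy_gap:+.1%}")
```

The quoted operating point implies roughly 0.38 W of average power. How large a gap counts as falsifying is a judgment call that this sketch deliberately leaves open.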
Original abstract
Processing in memory (PIM) moves computation into memories with the goal of improving throughput and energy-efficiency compared to traditional von Neumann-based architectures. Most existing PIM architectures are either general-purpose but only support atomistic operations, or are specialized to accelerate a single task. We propose the Parallel Processor in Associative Content-addressable memory (PPAC), a novel in-memory accelerator that supports a range of matrix-vector-product (MVP)-like operations that find use in traditional and emerging applications. PPAC is, for example, able to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. The fully-digital nature of PPAC enables its implementation with standard-cell-based CMOS, which facilitates automated design and portability among technology nodes. To demonstrate the efficacy of PPAC, we provide post-layout implementation results in 28nm CMOS for different array sizes. A comparison with recent digital and mixed-signal PIM accelerators reveals that PPAC is competitive in terms of throughput and energy-efficiency, while accelerating a wide range of applications and simplifying development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PPAC, a fully digital in-memory accelerator based on associative content-addressable memory that supports a range of matrix-vector-product-like operations. It claims to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. The architecture is implemented using standard-cell CMOS, with post-layout results reported in 28 nm for multiple array sizes; these results are positioned as competitive in throughput and energy efficiency against recent digital and mixed-signal PIM accelerators while offering greater application versatility and design portability.
Significance. If the post-layout results and application mappings hold, the work is significant because it demonstrates a standard-cell-based PIM design that avoids mixed-signal complexities while spanning multiple domains. Explicit acknowledgment that only post-layout data are supplied and that mappings are shown for a subset of workloads is a strength; the paper thereby avoids overclaiming fabricated-silicon performance. The approach of using associative memory for MVP-like operations across domains is a concrete contribution to portable PIM research.
minor comments (3)
- [§4] §4 (Implementation): the post-layout area and power numbers for the 256×256 array are presented without an accompanying breakdown of the contribution from the associative CAM cells versus peripheral logic; adding this would strengthen the claim that the architecture scales efficiently.
- [Table 2] Table 2: the energy-efficiency comparison lists PPAC against prior work but does not state whether the prior-work numbers were obtained under identical activity factors or workload assumptions; a footnote clarifying this would improve fairness.
- [§5.2] §5.2 (Application mapping): the hash-lookup and FEC examples are illustrated at a high level; a short pseudocode or dataflow diagram for at least one mapping would make the “low overhead” claim easier to verify.
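The kind of pseudocode the last comment asks for might look like the following sketch. It assumes a hash lookup can be expressed as a row-wise Hamming-similarity test against stored signatures, and an FEC syndrome check as a matrix-vector product over GF(2). The function names, the stored signatures, and the choice of a Hamming (7,4) parity-check matrix are illustrative assumptions, not the paper's own mappings.

```python
import numpy as np

def hamming_similarity(stored, query):
    """Per-row count of bit positions where a stored signature equals the query."""
    stored = np.asarray(stored, dtype=np.uint8)
    query = np.asarray(query, dtype=np.uint8)
    return (1 - (stored ^ query)).sum(axis=1)

def hash_lookup(stored, query, max_mismatches=0):
    """Indices of rows within max_mismatches bits of the query (0 means exact match)."""
    n = np.asarray(stored).shape[1]
    sim = hamming_similarity(stored, query)
    return np.flatnonzero(sim >= n - max_mismatches).tolist()

def gf2_mvp(H, v):
    """Matrix-vector product over GF(2): AND per cell, then parity per row."""
    H = np.asarray(H, dtype=np.uint8)
    v = np.asarray(v, dtype=np.uint8)
    return ((H & v).sum(axis=1) % 2).astype(np.uint8)

# Hypothetical 6-bit signatures stored one per row.
signatures = [[1, 0, 1, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 1]]
query = [1, 0, 1, 1, 0, 0]
exact_hits = hash_lookup(signatures, query)                     # rows equal to the query
approx_hits = hash_lookup(signatures, query, max_mismatches=1)  # rows within 1 bit

# Hamming (7,4) parity-check matrix, used purely as a small FEC example.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
codeword = [1, 1, 1, 0, 0, 0, 0]
corrupted = [1, 1, 1, 0, 1, 0, 0]  # same codeword with the bit at index 4 flipped

print(exact_hits, approx_hits)
print(gf2_mvp(H, codeword).tolist(), gf2_mvp(H, corrupted).tolist())
```

Both mappings reduce to one pass over the stored rows followed by a threshold or parity step per row. That is the sense in which such a mapping could be low overhead, though the sketch says nothing about the cost of loading the array.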
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of PPAC, the recognition of its versatility across applications, and the recommendation for minor revision. The strengths noted regarding post-layout results and avoidance of overclaiming are appreciated.
Circularity Check
No significant circularity detected
The manuscript is an architecture and implementation paper describing a standard-cell digital PIM design (PPAC) for MVP-like operations. It supports its claims via explicit hardware mappings for a subset of workloads and post-layout results in 28 nm CMOS rather than any derivation chain, fitted parameters, or self-referential predictions. No equations, ansatzes, or uniqueness theorems appear that could reduce to inputs by construction; the central thesis rests on the described circuit organization and the reported post-layout metrics, which are externally falsifiable via fabrication. Any self-citations are incidental and non-load-bearing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard-cell library and 28 nm CMOS process models are accurate for post-layout estimation
invented entities (1)
- PPAC architecture: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem.
  PPAC ... supports a range of matrix-vector-product (MVP)-like operations ... Hamming similarity ... inner-product ... GF(2) MVPs, and programmable logic array (PLA) functionality.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction: unclear
  Relation between the paper passage and the cited Recognition theorem.
  post-layout implementation results in 28 nm CMOS ... 256×256 PPAC array achieves 92 TOP/s at 4.15 fJ/OP
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.