arxiv: 2606.22812 · v1 · pith:VVAIBSG4new · submitted 2026-06-22 · 💻 cs.AR · cs.DC

Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding

Pith reviewed 2026-06-26 06:33 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords vector-scalar comparison · processing-using-DRAM · temporal coding · chunked lookup tables · predicate evaluation · decision tree inference

The pith

Clutch uses chunked temporal coding to perform vector-scalar comparisons inside DRAM with far fewer commands than bit-serial methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Clutch, a data representation and algorithm that accelerates vector-scalar comparisons performed directly inside DRAM arrays. Each vector element is first encoded as a sequence of leading ones, so that comparison against a scalar reduces to activating the matching DRAM row. To keep the required lookup tables small at high precision, the operands are split into independent multi-bit chunks whose per-chunk results are merged afterward by a procedure designed to be efficient under PuD. Varying the number of chunks gives a tunable tradeoff between speed and memory footprint. For predicate evaluation and decision-tree inference, the paper reports average end-to-end throughput and energy-efficiency improvements of 12x and 69x over highly optimized CPU and GPU execution, and of 2.9x and 3.0x over the state-of-the-art bit-serial PuD implementation.
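
The leading-ones encoding described above can be illustrated with a small software model. This is a minimal sketch, not the paper's implementation: the names `temporal_encode`, `build_rows`, and `greater_than` are hypothetical, and plain Python lists stand in for DRAM rows (one list per row, one entry per column). It assumes unsigned values and a code of length 2^bits − 1.

```python
from typing import List


def temporal_encode(value: int, bits: int) -> List[int]:
    """Encode `value` as `value` leading ones followed by zeros.

    The code has 2**bits - 1 positions (an assumption of this sketch).
    """
    levels = (1 << bits) - 1
    if not 0 <= value <= levels:
        raise ValueError("value does not fit in the given bit width")
    return [1] * value + [0] * (levels - value)


def build_rows(vector: List[int], bits: int) -> List[List[int]]:
    """Lay the codes out so that rows[r][col] is position r of vector[col]'s code.

    Each inner list models one DRAM row spanning all columns.
    """
    codes = [temporal_encode(v, bits) for v in vector]
    levels = (1 << bits) - 1
    return [[code[r] for code in codes] for r in range(levels)]


def greater_than(rows: List[List[int]], scalar: int) -> List[int]:
    """Return a mask with 1 where vector[col] > scalar, by reading one row.

    Position `scalar` of a leading-ones code is 1 exactly when the encoded
    value exceeds `scalar`, so the comparison is a single row lookup.
    """
    width = len(rows[0]) if rows else 0
    if scalar >= len(rows):
        # No representable value exceeds the maximum; model an all-zero row.
        return [0] * width
    return list(rows[scalar])


rows = build_rows([0, 1, 2, 3], bits=2)
print(greater_than(rows, 1))  # [0, 0, 1, 1]
```

In this model the comparison cost is one row read regardless of the vector length, while the number of rows grows as 2^bits − 1, which is the footprint problem that chunking is meant to address.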

Core claim

Clutch encodes each operand value as a temporal sequence of leading ones, partitions the operands into multiple multi-bit chunks, performs each chunk comparison via a compact lookup table realized as a DRAM row access, and merges the per-chunk results with a PuD-efficient procedure whose command count does not grow with operand bit width.
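
A chunked comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's merge procedure: it uses the textbook lexicographic rule `gt = gt_hi OR (ge_hi AND gt_lo)`, which needs only AND and OR, and the helper names (`split_chunks`, `build_table`, `read_gt`, `read_ge`, `compare_gt`) are hypothetical. Lists model DRAM rows, and tables are rebuilt on every call for brevity, whereas a real system would store them once.

```python
from typing import List


def split_chunks(value: int, chunk_bits: int, num_chunks: int) -> List[int]:
    """Split `value` into `num_chunks` chunks, most significant chunk first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (chunk_bits * i)) & mask for i in reversed(range(num_chunks))]


def build_table(chunks: List[int], chunk_bits: int) -> List[List[int]]:
    """Leading-ones table for one chunk position: table[r][col] = 1 iff chunk > r."""
    levels = (1 << chunk_bits) - 1
    return [[1 if c > r else 0 for c in chunks] for r in range(levels)]


def read_gt(table: List[List[int]], s: int, width: int) -> List[int]:
    """Row lookup for chunk > s (all zeros when s is the maximum chunk value)."""
    return list(table[s]) if s < len(table) else [0] * width


def read_ge(table: List[List[int]], s: int, width: int) -> List[int]:
    """Row lookup for chunk >= s (an all-ones constant row when s is 0)."""
    return [1] * width if s == 0 else list(table[s - 1])


def compare_gt(vector: List[int], scalar: int, chunk_bits: int, num_chunks: int) -> List[int]:
    """Return a mask with 1 where vector[col] > scalar, merging per-chunk lookups."""
    width = len(vector)
    per_elem = [split_chunks(v, chunk_bits, num_chunks) for v in vector]
    s_chunks = split_chunks(scalar, chunk_bits, num_chunks)
    result = [0] * width
    # Walk from the least significant chunk up to the most significant one.
    for i in reversed(range(num_chunks)):
        table = build_table([chunks[i] for chunks in per_elem], chunk_bits)
        gt = read_gt(table, s_chunks[i], width)
        ge = read_ge(table, s_chunks[i], width)
        # Greater overall if this chunk is greater, or it is not smaller
        # and the lower chunks already compare greater.
        result = [g | (e & r) for g, e, r in zip(gt, ge, result)]
    return result


print(compare_gt([5, 16, 17, 200], scalar=16, chunk_bits=4, num_chunks=2))  # [0, 0, 1, 1]
```

In this sketch the number of merge steps follows the number of chunks, which is the knob the paper describes for trading throughput against memory footprint.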

What carries the argument

Temporal coding that represents each value by the position of its leading ones, combined with per-chunk lookup tables realized as DRAM row activations.

Load-bearing premise

The chunked lookup tables and result-merging steps can be realized inside real DRAM arrays without command overhead or extra memory use that cancels the performance benefit.

What would settle it

A cycle-accurate DRAM simulator or prototype that measures total commands and energy for 32-bit comparisons and shows whether the reported speedups survive after counting every row activation required by the chunk-merging step.

Figures

Figures reproduced from arXiv: 2606.22812 by Abdullah Giray Yağlıkçı, Ataberk Olgun, Daichi Tokuda, Geraldo F. Oliveira, Haocong Luo, Ismail Emir Yuksel, Mohammad Sadrosadati, Onur Mutlu, Shinya Takamaeda-Yamazaki, Tatsuya Kubo, Tomoya Nagatani.

Figure 1
Figure 1: DRAM Organization. 2.3 Processing-using-DRAM. Processing-using-DRAM (PuD) [2, 38, 43, 45–47, 59, 63, 64, 67, 81, 95, 109–111, 119, 124, 132, 133, 143, 157, 158, 160, 162, 164, 168, 185, 188, 192, 197] is a new computing paradigm that leverages the analog properties of DRAM to enable massively parallel in-DRAM computation. This work targets two representative PuD architectures. The first is SIMDRAM [81], …
Figure 2
Figure 2: Our FPGA-based PuD testing infrastructure (DRAM Bender [139]) with DDR4 modules. PuD computation consists of repeated invocations of two primitive PuD operations: RowCopy and MAJ3. Each PuD operation is realized by issuing a dedicated sequence of DRAM commands such as ACT and PRE [63, 81, 161, 197]. RowCopy transfers data between rows within the same subarray, enabling efficient bulk data movement inside…
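
As an aside on the MAJ3 primitive named in this caption: a three-input majority yields AND when the third input is an all-zero row and OR when it is an all-one row. The sketch below models that identity in software only; the function names are hypothetical, lists stand in for DRAM rows, and nothing here reflects the actual command sequences used on hardware.

```python
from typing import List


def maj3(a: List[int], b: List[int], c: List[int]) -> List[int]:
    """Column-wise majority of three equal-length bit rows."""
    return [(x & y) | (y & z) | (x & z) for x, y, z in zip(a, b, c)]


def row_and(a: List[int], b: List[int]) -> List[int]:
    """AND of two rows: majority with an all-zero constant row."""
    return maj3(a, b, [0] * len(a))


def row_or(a: List[int], b: List[int]) -> List[int]:
    """OR of two rows: majority with an all-one constant row."""
    return maj3(a, b, [1] * len(a))


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(row_and(a, b))  # [0, 0, 0, 1]
print(row_or(a, b))   # [0, 1, 1, 1]
```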
Figure 4
Figure 4: Fraction of execution time spent on comparison.
Figure 3
Figure 3: Bit-serial-based PuD arithmetic. 3 Motivation. 3.1 Bottleneck Profiling of Vector-Scalar Comparisons. Vector-scalar comparisons are widely used across a broad range of applications that support the foundation of modern society, from query processing in in-memory databases to inference in decision tree ensembles for machine learning. In many of these workloads, accelerating the comparison step is particularly…
Figure 5
Figure 5: PuD execution effectively reduces data movement for vector-scalar comparison. Instead, following prior approaches [43, 110], PuD can prepare constant rows filled entirely with 0s or 1s in advance across all columns, and then initialize each bit of 𝑎 by selecting the appropriate constant row and copying it using RowCopy [63, 138, 158, 161, 197]. For example, if 𝑎 = 3 (𝑎2𝑎1𝑎0 = 011 in binary), it copies th…
Figure 6
Figure 6: Execution time breakdown of vector–scalar com…
Figure 7
Figure 7: Lookup table-based comparison via temporal coding. 4.2 Divide-and-Conquer Approach for Comparisons. To support scalable and memory-efficient lookup table-based comparisons, Clutch adopts a divide-and-conquer approach. Our key idea is that a full-width comparison can be decomposed into smaller sub-comparisons over bit chunks. Specifically, Clutch partitions each binary operand into multiple multi-bit chunk…
Figure 8
Figure 8: Clutch encoding and algorithm example. Although…
Figure 9
Figure 9: Tradeoff between DRAM row usage and the number…
Figure 12
Figure 12: Mapping of GBDT trees to DRAM. Execution flow…
Figure 13
Figure 13: Proposed GBDT inference flow on PuD
Figure 14
Figure 14: Normalized throughput of GBDT inference at (a)…
Figure 15
Figure 15: Breakdown of (a) execution time and (b) energy…
Figure 18
Figure 18: Clutch overhead analyses. (a) Effective GBDT in…
Figure 17
Figure 17: Sensitivity of GBDT inference throughput to model…
Figure 19
Figure 19: Normalized throughput of Q2 at (a) small table, (b)…
Figure 20
Figure 20: Normalized energy efficiency of Q2 at (a) 8-bit…
Figure 23
Figure 23: Normalized throughput on CPU-based system at…
Figure 25
Figure 25: Breakdown of execution time on CPU-based sys…

Original abstract

Vector-scalar comparison is a fundamental computation primitive that compares each element in a vector against a single scalar value. It is widely used in various data-intensive workloads from databases to machine learning. Due to its low computational intensity, its execution tends to be memory-bound, limiting the utilization of compute resources. Processing-using-DRAM (PuD) is an emerging computing paradigm that performs massively parallel bitwise operations directly inside DRAM arrays, alleviating off-chip data movement. Existing PuD-based approaches require many DRAM commands because the comparison's algorithmic complexity grows with operand bit-width in the bit-serial execution model. This command overhead becomes the dominant bottleneck, limiting application-level speedup. We propose Clutch, a data representation and comparison algorithm that accelerates vector-scalar comparisons in PuD systems with high efficiency and scalability. Clutch first uses temporal coding, encoding each vector value as a sequence of leading ones, which enables lookup-based comparison against a scalar by accessing the corresponding DRAM row. To avoid the prohibitive memory footprint of lookup tables at high precision, Clutch partitions operands into multiple multi-bit chunks, compares chunks independently using compact lookup tables, and merges the per-chunk results with a PuD-efficient procedure. By adjusting the number of chunks, Clutch provides a flexible tradeoff between throughput and memory usage. Across predicate evaluation and decision tree inference, Clutch improves end-to-end application throughput and energy efficiency by an average of 12x and 69x over highly optimized CPU and GPU execution, and by 2.9x and 3.0x over the state-of-the-art bit-serial PuD implementation. We also present the first mapping of decision tree inference to PuD execution, extending PuD to a new application domain.
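
To make the bit-serial cost argument in the abstract concrete, the following sketch models a comparison over bit planes, processed from the most significant bit down. It is an illustration only: `bit_serial_gt` is a hypothetical name, lists stand in for DRAM rows, and `steps` counts passes over bit planes rather than DRAM commands (each pass would expand into several PuD operations on real hardware).

```python
from typing import List, Tuple


def bit_serial_gt(vector: List[int], scalar: int, bits: int) -> Tuple[List[int], int]:
    """Return (mask, steps) where mask[col] = 1 iff vector[col] > scalar.

    planes[i][col] holds bit i of vector[col], mimicking a bit-sliced layout.
    """
    width = len(vector)
    planes = [[(v >> i) & 1 for v in vector] for i in range(bits)]
    gt = [0] * width   # columns already known to be greater
    eq = [1] * width   # columns still equal on all higher bits
    steps = 0
    for i in reversed(range(bits)):
        x = planes[i]
        if (scalar >> i) & 1 == 0:
            # Scalar bit is 0: a 1 in the element decides "greater" where still equal.
            gt = [g | (e & xb) for g, e, xb in zip(gt, eq, x)]
            eq = [e & (1 - xb) for e, xb in zip(eq, x)]
        else:
            # Scalar bit is 1: a 0 in the element means "smaller", so equality ends.
            eq = [e & xb for e, xb in zip(eq, x)]
        steps += 1
    return gt, steps


mask, steps = bit_serial_gt([5, 16, 17, 200], scalar=16, bits=8)
print(mask, steps)  # [0, 0, 1, 1] 8
```

The loop runs once per operand bit, so the work in this model scales linearly with bit width, which is the dependence the abstract identifies as the bottleneck.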

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Clutch, a data representation and comparison algorithm for vector-scalar comparisons in Processing-using-DRAM (PuD) systems. It encodes vector values via temporal coding as sequences of leading ones to enable row-access lookup against a scalar, partitions operands into multi-bit chunks for compact per-chunk lookup tables, and merges results with a PuD-efficient procedure. The number of chunks provides a scalable tradeoff between throughput and memory footprint. The work claims average end-to-end improvements of 12x throughput and 69x energy efficiency over optimized CPU/GPU baselines and 2.9x/3.0x over state-of-the-art bit-serial PuD for predicate evaluation and decision-tree inference, while presenting the first PuD mapping for decision-tree inference.

Significance. If the claimed performance and energy gains are reproducible under realistic DRAM timing and command constraints, the approach could meaningfully advance PuD applicability to memory-bound primitives in databases and ML inference by mitigating the command-overhead bottleneck of bit-serial execution.

major comments (3)
  1. [Description of the scalable tradeoff] The central performance claims (2.9x/3.0x over bit-serial PuD) rest on the unverified assumption that chunked temporal coding plus per-chunk LUTs can be mapped to real DRAM row activations and merges with command counts and capacity costs that do not negate the advantage. No quantitative breakdown of ACT/PRE command totals, bank-conflict penalties, or working-set displacement is supplied to support this.
  2. [Abstract] The specific quantitative improvements (12x throughput, 69x energy efficiency) are stated without accompanying experimental methodology, hardware assumptions, baseline implementations, or error analysis in the provided text, preventing evaluation of whether the headline numbers hold under the stated DRAM constraints.
  3. [Section on temporal coding and chunk merging] The claim that chunk-wise comparisons incur fewer total commands than bit-serial execution is load-bearing for the speedup result, yet the manuscript supplies no command-sequence enumeration or cycle-accurate accounting that would allow verification of the reduction.
minor comments (2)
  1. Notation for chunk count and precision parameters is introduced without a consolidated table relating chunk number, LUT size, and reported throughput, making the tradeoff curve difficult to interpret.
  2. The manuscript would benefit from an explicit statement of the DRAM timing parameters and command scheduling model used to derive the energy and throughput figures.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for explicit verification of command overheads and experimental details. We address each point below and will revise the manuscript to strengthen these aspects.

  1. Referee: [Description of the scalable tradeoff] The central performance claims (2.9x/3.0x over bit-serial PuD) rest on the unverified assumption that chunked temporal coding plus per-chunk LUTs can be mapped to real DRAM row activations and merges with command counts and capacity costs that do not negate the advantage. No quantitative breakdown of ACT/PRE command totals, bank-conflict penalties, or working-set displacement is supplied to support this.

    Authors: We agree that an explicit breakdown strengthens the claims. The full manuscript (Section 4) derives the command reduction analytically from the chunked lookup and merge procedure, showing fewer total ACT/PRE operations than bit-serial for equivalent precision. In revision we will add a table with per-configuration ACT/PRE counts, bank-conflict analysis under open-page policy, and working-set sizes for the evaluated chunk counts (2–8), confirming the net advantage holds under JEDEC timing. revision: yes

  2. Referee: [Abstract] The specific quantitative improvements (12x throughput, 69x energy efficiency) are stated without accompanying experimental methodology, hardware assumptions, baseline implementations, or error analysis in the provided text, preventing evaluation of whether the headline numbers hold under the stated DRAM constraints.

    Authors: The methodology, DRAM timing parameters (from Micron DDR4 datasheet), CPU/GPU baseline implementations (AVX-512 and CUDA kernels), and error bars (std. dev. over 100 runs) are provided in Sections 5 and 6. We will insert a parenthetical reference to the evaluation section in the abstract and ensure the camera-ready version cross-references these details explicitly. revision: partial

  3. Referee: [Section on temporal coding and chunk merging] The claim that chunk-wise comparisons incur fewer total commands than bit-serial execution is load-bearing for the speedup result, yet the manuscript supplies no command-sequence enumeration or cycle-accurate accounting that would allow verification of the reduction.

    Authors: Section 3.3 enumerates the per-chunk lookup sequence and the PuD-efficient merge (using row-wise AND/OR), while Section 4 tabulates aggregate command counts versus bit-serial. We will expand this in revision with a side-by-side command trace for a representative 32-bit operand under 4-chunk configuration, including cycle counts under realistic tRC/tRAS constraints. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal with empirical claims, no derivation chain or fitted parameters.

The paper proposes Clutch as a data representation and comparison algorithm using temporal coding and chunked lookup tables for PuD. Central claims rest on described mechanisms and reported empirical throughput/energy gains versus baselines, not on any equations, fitted parameters, or self-citation chains that reduce to inputs by construction. No load-bearing steps match the enumerated circularity patterns; the contribution is self-contained as an engineering proposal whose validity is external to any internal derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that PuD hardware supports the required row accesses and bitwise operations; the number of chunks is presented as an adjustable parameter for the throughput-memory tradeoff.

free parameters (1)
  • number of chunks
    Abstract states that adjusting the number of chunks provides a flexible tradeoff between throughput and memory usage.
axioms (1)
  • domain assumption: Processing-using-DRAM hardware can efficiently perform the row accesses and merge operations required by the chunked lookup procedure
    The entire performance claim depends on this hardware capability being available and low-overhead.
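
The effect of the chunk-count parameter on table size can be estimated with a back-of-the-envelope model. This is a sketch under an explicit assumption, not the paper's accounting: each chunk of b bits is taken to need 2^b − 1 rows for its leading-ones code, so an n-bit operand split into c equal chunks needs c · (2^(n/c) − 1) rows. The function name `rows_needed` is hypothetical, and any rows for constants or intermediate results are ignored.

```python
def rows_needed(total_bits: int, num_chunks: int) -> int:
    """Rows for leading-ones codes when an operand is split into equal chunks.

    Assumes 2**chunk_bits - 1 rows per chunk; scratch and constant rows are ignored.
    """
    if num_chunks <= 0 or total_bits % num_chunks != 0:
        raise ValueError("total_bits must divide evenly into num_chunks")
    chunk_bits = total_bits // num_chunks
    return num_chunks * ((1 << chunk_bits) - 1)


for chunks in (1, 2, 4, 8):
    print(chunks, rows_needed(32, chunks))
# 1 4294967295
# 2 131070
# 4 1020
# 8 120
```

Under this model, more chunks shrink the footprint sharply while adding per-chunk lookups and merge steps, which is the direction of the tradeoff the ledger entry describes.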

Reference graph

Works this paper leans on

222 extracted references · 5 linked inside Pith

  1. [1]

    Akmalbek Abdusalomov, Mukhriddin Mukhiddinov, Oybek Djuraev, Utkir Khamdamov, and Taeg Keun Whangbo. 2020. Automatic salient object extraction based on locally adaptive thresholding to generate tactile graphics. Applied Sciences (2020)

  2. [2]

    Salma Afifi, Ishan Thakkar, and Sudeep Pasricha. 2024. ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks. IEEE TCAD (2024)

  3. [3]

    Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute Caches. In HPCA

  4. [4]

    Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

  5. [5]

    A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA

  6. [6]

    Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture. In ISCA

  7. [7]

    Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Data Reorganization in Memory Using 3D-Stacked DRAM. In ISCA

  8. [8]

    Berkin Akın, James C. Hoe, and Franz Franchetti. 2014. HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-Stacked DRAM. In HPEC

  9. [9]

    Adrián Alcolea, Mercedes E Paoletti, Juan M Haut, Javier Resano, and Antonio Plaza. 2020. Inference in supervised spectral classifiers for on-board hyperspectral imaging: An overview. Remote Sensing (2020)

  10. [10]

    Mustafa F Ali, Akhilesh Jaiswal, and Kaushik Roy. 2019. In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology. TCAS-I (2019)

  11. [11]

    M. A. Z. Alves, M. Diener, P. C. Santos, and L. Carro. 2016. Large Vector Extensions Inside the HMC. In DATE

  12. [12]

    Marco A. Z. Alves, Paulo C. Santos, Matthias Diener, and Luigi Carro. 2015. Opportunities and Challenges of Performing Vector Operations Inside the DRAM. In MEMSYS

  13. [13]

    M. A. Z. Alves, P. C. Santos, F. B. Moreira, et al. 2015. Saving Memory Movements Through Vector Processing in the DRAM. In CASES

  14. [14]

    American Statistical Association. 2009. Data Expo 2009: Airline On-Time Performance. https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009

  15. [15]

    Shaahin Angizi and Deliang Fan. 2019. GraphiDe: A Graph Processing Accelerator Leveraging In-DRAM-Computing. In GLSVLSI

  16. [16]

    Arm Ltd. 2016. Arm Cortex-A53 MPCore Processor Technical Reference Manual

  17. [17]

    Arm Ltd. 2024. Arm Neon Intrinsics Reference. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon

  18. [18]

    H. Asghari-Moghaddam, A. Farmahini-Farahani, K. Morrow, et al. 2016. Near-DRAM Acceleration with Single-ISA Heterogeneous Processing in Standard Memory Modules. IEEE Micro (2016)

    ICS ’26, July 06–09, 2026, Belfast, United Kingdom. Daichi Tokuda et al.

  19. [19]

    Hadi Asghari-Moghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim

  20. [20]

    Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems. In MICRO

  21. [21]

    Erfan Azarkhish, Christoph Pfister, Davide Rossi, Igor Loi, and Luca Benini

  22. [22]

    Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube. IEEE VLSI (2016)

  23. [23]

    Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. 2016. A Case for Near Memory Computation Inside the Smart Memory Cube. In EMS

  24. [24]

    Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. 2018. Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes. TPDS (2018)

  25. [25]

    Oreoluwatomiwa O. Babarinsa and Stratos Idreos. 2015. JAFAR: Near-Data Processing for Databases. In SIGMOD

  26. [26]

    Pierre Baldi, Peter Sadowski, and Daniel Whiteson. 2014. HIGGS Dataset. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/280/higgs

  27. [27]

    Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, et al. 2021. SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems. In MICRO

  28. [28]

    Jock A. Blackard and Denis J. Dean. 1999. Covertype. https://doi.org/10.24432/C50K5N https://archive.ics.uci.edu/dataset/31/covertype

  29. [29]

    Casper Solheim Bojer and Jens Peder Meldgaard. 2021. Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting (2021)

  30. [30]

    Amirali Boroumand. 2020. Practical Mechanisms for Reducing Processor-Memory Data Movement in Modern Workloads. Ph.D. Dissertation. Carnegie Mellon University

  31. [31]

    Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu. 2021. Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks. In PACT

  32. [32]

    Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu. 2021. Mitigating Edge Machine Learning Inference Bottlenecks: An Empirical Study on Accelerating Google Edge Models. arXiv preprint arXiv:2103.00768 (2021)

  33. [33]

    Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, et al. 2018. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. In ASPLOS

  34. [34]

    Amirali Boroumand, Saugata Ghose, Brandon Lucia, Kevin Hsieh, Krishna Malladi, Hongzhong Zheng, and Onur Mutlu. 2017. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. CAL (2017)

  35. [35]

    Amirali Boroumand, Saugata Ghose, Geraldo F. Oliveira, and Onur Mutlu. 2021. Polynesia: Enabling Effective Hybrid Transactional/Analytical Databases with Specialized Hardware/Software Co-Design. arXiv preprint arXiv:2103.00798 (2021)

  36. [36]

    Amirali Boroumand, Saugata Ghose, Geraldo F. Oliveira, and Onur Mutlu. 2022. Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design. In ICDE

  37. [37]

    Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna T. Malladi, Hongzhong Zheng, et al. 2019. CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators. In ISCA

  38. [38]

    Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Nastaran Hajinazar, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu. 2017. LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures. arXiv preprint arXiv:1706.03162 (2017)

  39. [39]

    F Nisa Bostancı, Ataberk Olgun, Lois Orosa, A Giray Yağlıkçı, Jeremie S Kim, Hasan Hassan, Oğuz Ergin, and Onur Mutlu. 2022. DR-STRaNGe: End-to-End System Design for DRAM-Based True Random Number Generators. In HPCA

  40. [40]

    Damla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, et al. 2020. GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis. In MICRO

  41. [41]

    Kevin K Chang, Prashant J Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K Qureshi, and Onur Mutlu. 2016. Low-cost inter-linked subarrays (LISA): Enabling fast inter-subarray data movement in DRAM. In HPCA

  42. [42]

    Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

  43. [43]

    Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In ISCA

  44. [44]

    Seunghwan Cho, Haerang Choi, Eunhyeok Park, Hyunsung Shin, and Sungjoo Yoo. 2020. McDRAM v2: In-Dynamic Random Access Memory Systolic Array Accelerator to Address the Large Model Problem in Deep Neural Networks on the Edge. IEEE Access (2020)

  45. [45]

    Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2018. GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing. TCAD (2018)

  46. [46]

    Joao Paulo C de Lima, Ben Morris, Asif Ali Khan, Jeronimo Castrillon, and Alex K Jones. 2026. Count2multiply: Reliable in-memory high-radix counting. In HPCA

  47. [47]

    João Paulo C. de Lima, Paulo Cesar Santos, Marco A. Z. Alves, Antonio Beck, and Luigi Carro. 2018. Design Space Exploration for PIM Architectures in 3D-Stacked Memories. In CF

  48. [48]

    Quan Deng, Lei Jiang, Youtao Zhang, Minxuan Zhang, and Jun Yang. 2018. DrAcc: A DRAM Based Accelerator for Accurate CNN Inference. In DAC

  49. [49]

    Quan Deng, Youtao Zhang, Minxuan Zhang, and Jun Yang. 2019. Lacc: Exploiting lookup table-based fast and accurate vector multiplication in dram-based cnn accelerator. In DAC

  50. [50]

    Wenya Deng, Zhi Wang, Yang Guo, Jian Zhang, Zhenyu Wu, and Yaohua Wang

  51. [51]

    DAS: A DRAM-Based Annealing System for Solving Large-Scale Combinatorial Optimization Problems. In ICA3P

  52. [52]

    Alain Denzler, Rahul Bera, Nastaran Hajinazar, Gagandeep Singh, Geraldo F. Oliveira, Juan Gómez-Luna, and Onur Mutlu. 2021. Casper: Accelerating Stencil Computation using Near-Cache Processing. arXiv preprint arXiv:2112.14216 (2021)

  53. [53]

    Fabrice Devaux. 2019. The True Processing in Memory Accelerator. In Hot Chips

  54. [54]

    Mario Drumond, Alexandros Daglis, Nooshin Mirzadeh, Dmitrii Ustiugov, Javier Picorel, Babak Falsafi, Boris Grot, and Dionisios Pnevmatikatos. 2017. The Mondrian Data Engine. In ISCA

  55. [55]

    Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. 2018. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. In ISCA

  56. [56]

    D. G. Elliott, M. Stumm, W. M. Snelgrove, et al. 1999. Computational RAM: Implementing Processors in Memory. D&T (1999)

  57. [57]

    F Gökhan Ergin. 2017. Dynamic masking techniques for particle image velocimetry. Isı Bilimi ve Tekniği Dergisi (2017)

  58. [58]

    Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA Database–An Architecture Overview. IEEE Data Eng. Bull. (2012)

  59. [59]

    A. Farmahini-Farahani, J. H. Ahn, K. Compton, and N. S. Kim. 2014. DRAMA: An Architecture for Accelerated Processing Near Memory. CAL (2014)

  60. [60]

    Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim

  61. [61]

    NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules. In HPCA

  62. [62]

    Ivan Fernandez, Ricardo Quislant, Eladio Gutiérrez, Oscar Plata, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, and Onur Mutlu. 2020. NATSA: A Near-Data Processing Accelerator for Time Series Analysis. In ICCD

  63. [63]

    João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S Kim, Geraldo F Oliveira, Taha Shahroodi, Anant Nori, et al. 2021. pLUTo: In-DRAM Lookup Tables to Enable Massively Parallel General-Purpose Computation. arXiv preprint arXiv:2104.07699 (2021)

  64. [64]

    João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S Kim, Geraldo F Oliveira, Taha Shahroodi, Anant Nori, et al. 2022. pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables. In MICRO

  65. [65]

    Daichi Fujiki. 2023. MVC: Enabling fully coherent multi-data-views through the memory hierarchy with processing in memory. In MICRO

  66. [66]

    Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2018. In-Memory Data Parallel Processor. In ASPLOS

  67. [67]

    Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2019. Duality Cache for Data Parallel Acceleration. In ISCA

  68. [68]

    Fei Gao, Georgios Tziantzioulis, and David Wentzlaff. 2019. ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs. In MICRO

  69. [69]

    Fei Gao, Georgios Tziantzioulis, and David Wentzlaff. 2022. FracDRAM: Fractional values in off-the-shelf DRAM. In MICRO

  70. [70]

    Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing. In HPCA

  71. [71]

    Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In ASPLOS

  72. [72]

    Esteban Garzón, Alexander Fish, and Leonid Yavits. 2026. CADM: Content addressable commodity off-the-shelf DRAM-based genome classifier. Journal of Systems Architecture (2026)

  73. [73]

    Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, et al. 2022. GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis. In ASPLOS

  74. [74]

    Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-in-Memory Architectures. In SIGMETRICS

  75. [75]

    Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2021. SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures. In HPCA

  76. [76]

    Maya Gokhale, Bill Holmes, and Ken Iobst. 1995. Processing in Memory: The Terasys Massively Parallel PIM Array. Computer (1995)

  77. [77]

    Goetz Graefe et al. 2011. Modern B-tree techniques. Foundations and Trends in Databases (2011)

  78. [78]

    Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. 2010. Hyrise: a main memory hybrid storage engine. Proceedings of the VLDB Endowment (2010)

  79. [79]

    Peng Gu, Shuangchen Li, Dylan Stow, Russell Barnes, Liu Liu, Yuan Xie, and Eren Kursun. 2016. Leveraging 3D Technologies for Hardware Security: Opportunities and Challenges. In GLSVLSI

  80. [80]

    Peng Gu, Xinfeng Xie, Yufei Ding, Guoyang Chen, Weifeng Zhang, Dimin Niu, and Yuan Xie. 2020. iPIM: Programmable In-Memory Image Processing Accelerator using Near-Bank Architecture. In ISCA

Showing first 80 references.