Platform Independent Software Analysis for Near Memory Computing

Ahsan Javed Awan; Gagandeep Singh; Henk Corporaal; Roel Jordans; Stefano Corda

arxiv: 1906.10037 · v1 · pith:IOXJZCU2new · submitted 2019-06-24 · 💻 cs.PF · cs.ET

Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, Henk Corporaal

Pith reviewed 2026-05-25 16:51 UTC · model grok-4.3

classification 💻 cs.PF cs.ET

keywords near memory computing · profiling · performance analysis · memory entropy · spatial locality · parallelism metrics · software analysis · 3D-stacked memory

The pith

PISA-NMC adds memory entropy, spatial locality, and parallelism metrics to existing profilers so developers can spot applications that gain from near-memory computing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PISA-NMC as an extension to a hardware-agnostic profiling tool. It incorporates four new metrics (memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism) that are expected to matter for near-memory computing. The authors run these metrics on a set of representative applications and compare the results against performance measured on a simulated near-memory system. They use the resulting correlations to verify the metrics' importance and to show which of them help identify workloads that benefit from the architecture.

Core claim

PISA-NMC shows that memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism can be measured in a platform-independent way and that these measurements correlate with application performance on simulated near-memory computing systems, allowing identification of suitable applications without hardware-specific tuning.

What carries the argument

PISA-NMC, the extended profiling tool that adds memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism metrics to standard analysis.

If this is right

Applications showing high memory entropy and low spatial locality, which the paper notes make poor use of conventional cache hierarchies, are expected to see larger gains on near-memory systems.
Workloads with higher data-level and basic-block-level parallelism become easier to select for near-memory deployment.
Platform-independent profiling can replace repeated hardware-specific measurements when screening many candidate applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same metric set could be tested on other memory-centric architectures such as processing-in-memory to see if the correlations transfer.
Early-stage code changes guided by these metrics might improve suitability before full simulation runs are needed.
Combining the metrics into a single suitability score would let developers rank applications automatically.
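The single-score idea above could be sketched as follows. This is a minimal illustration, not something the paper proposes: the function names, the min-max normalization, the application names, the metric values, and the weights are all hypothetical. In particular, the sign and size of each weight are placeholders that would have to be fitted against measured or simulated results.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A metric that does not vary cannot separate the applications.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_by_score(profiles, weights):
    """Return (application, score) pairs, highest score first.

    profiles: {application: {metric: value}}
    weights:  {metric: weight}; a negative weight means "lower is better".
    """
    apps = list(profiles)
    scores = {app: 0.0 for app in apps}
    for metric, weight in weights.items():
        normalized = min_max([profiles[app][metric] for app in apps])
        for app, value in zip(apps, normalized):
            scores[app] += weight * value
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical profiles and weights, chosen only to exercise the code.
profiles = {
    "app_a": {"entropy": 2.0, "bblp": 4.0},
    "app_b": {"entropy": 6.0, "bblp": 1.0},
    "app_c": {"entropy": 4.0, "bblp": 2.5},
}
weights = {"entropy": 2.0, "bblp": 1.0}

ranking = rank_by_score(profiles, weights)
for app, score in ranking:
    print(f"{app}: {score:.2f}")
```

Normalizing each metric first keeps one metric with a large numeric range from dominating the sum; whether a linear combination is adequate at all is an open question this sketch does not answer.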

Load-bearing premise

That the added metrics capture the main factors deciding whether an application will run faster on near-memory hardware, and that simulated near-memory performance stands in for real hardware behavior.

What would settle it

Measure the same applications on actual near-memory hardware and check whether the applications flagged by the new metrics show the predicted speedups while the others do not.

Figures

Figures reproduced from arXiv: 1906.10037 by Ahsan Javed Awan, Gagandeep Singh, Henk Corporaal, Roel Jordans, Stefano Corda.

**Figure 2.** Our NMC System. Adjacent text from the paper, captured with the caption: "… this spatial locality score is to detect a reduction in DTR when doubling the cache line size. Usually, applications with low spatial locality perform very badly on traditional systems with cache hierarchies because only a small portion of the data loaded from main memory into the caches is utilized. B. Parallelism metrics: Data-level parallelism (DLP) measures the average length of…"

**Figure 3.** b

**Figure 4.** The energy-delay product (EDP) ratio between the IBM Power 9 and the NMC system we simulated. We use EDP as our major metric of reference in this analysis because both energy and performance are critical criteria for evaluating NMC suitability. Applications with an EDP reduction of less than 1 are not suitable for NMC.
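The Figure 4 caption uses an EDP ratio with a threshold of 1. A small worked sketch of that arithmetic follows. It assumes EDP is energy multiplied by execution time and that the "EDP reduction" is the host EDP divided by the NMC EDP; the caption does not spell out the direction of the ratio, so that reading is an assumption. The function names and all numbers are illustrative, not values from the paper.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: energy consumed times execution time."""
    return energy_joules * delay_seconds


def edp_reduction(host_energy, host_delay, nmc_energy, nmc_delay):
    """Host EDP divided by NMC EDP (assumed direction of the ratio).

    A value above 1 means the NMC system has the lower EDP.
    """
    return edp(host_energy, host_delay) / edp(nmc_energy, nmc_delay)


def suitable_for_nmc(reduction):
    """Apply the caption's rule: a reduction below 1 is not suitable."""
    return reduction >= 1.0


# Made-up measurements for two hypothetical applications.
favourable = edp_reduction(host_energy=10.0, host_delay=2.0,
                           nmc_energy=4.0, nmc_delay=2.5)
unfavourable = edp_reduction(host_energy=5.0, host_delay=1.0,
                             nmc_energy=4.0, nmc_delay=2.0)

print(favourable, suitable_for_nmc(favourable))
print(unfavourable, suitable_for_nmc(unfavourable))
```

In the second case the NMC system uses less energy but runs twice as long, so its EDP is worse; this is why a combined metric can reject an application that looks attractive on energy alone.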

**Figure 5.** Metric derived from memory entropy

**Figure 6.** PCA using the added metrics. Blue arrows quantify…

Original abstract

Near-memory Computing (NMC) promises improved performance for the applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications and specialized tools are needed to identify them. In this paper, we present PISA-NMC, which extends a state-of-the-art hardware agnostic profiling tool with metrics concerning memory and parallelism, which are relevant for NMC. The metrics include memory entropy, spatial locality, data-level, and basic-block-level parallelism. By profiling a set of representative applications and correlating the metrics with the application's performance on a simulated NMC system, we verify the importance of those metrics. Finally, we demonstrate which metrics are useful in identifying applications suitable for NMC architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PISA-NMC adds four NMC-specific metrics to an existing profiler and checks them via simulation correlation, but supplies almost no quantitative evidence or real-hardware grounding.

The paper's core move is straightforward: it takes the hardware-agnostic PISA profiler and adds four metrics (memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism), then correlates those scores against performance on a simulated near-memory system. That combination of metrics for NMC screening is new, and the approach stays practical and platform-independent, which is useful for people who need to decide early whether an application is worth mapping to 3D-stacked memory without building hardware first. The authors also include a final step that identifies which of the four metrics actually flag suitable workloads, which gives the work a clear applied goal. Credit is due for keeping the base tool unchanged and focusing only on the added measurements.

The soft spot is the validation. The abstract claims the correlations verify the metrics' importance, yet it gives no correlation coefficients, no sensitivity checks on the simulator parameters, and no comparison to measured behavior on actual HBM or HMC parts. If the simulator's latency or bandwidth model is off, the ranking of which metrics matter becomes an artifact rather than a general result. That leaves the central claim resting on unexamined simulation assumptions.

The work is aimed at researchers building or tuning NMC profiling tools and at architects who need quick filters before full simulation campaigns. It is narrow but coherent, so it deserves a serious referee who can ask for the actual correlation tables and any cross-checks against real traces. I would send it to review rather than desk-reject, with the expectation that the authors will need to strengthen the empirical section.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PISA-NMC, an extension of the PISA hardware-agnostic profiling tool that adds four NMC-relevant metrics (memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism). It profiles a set of representative applications, correlates the metrics against application performance on a simulated NMC system to verify their importance, and identifies which metrics are useful for selecting applications suitable for NMC architectures.

Significance. If the reported correlations are robust and the simulator faithfully captures real 3D-stacked DRAM behavior, the work supplies a practical, platform-independent method for identifying NMC-friendly workloads. The empirical profiling approach is a clear strength and directly addresses the need for specialized tools noted in the abstract.

major comments (2)

[Abstract and results section] The central claim that 'correlating the metrics with the application's performance on a simulated NMC system' verifies their importance is given without quantitative details (Pearson r, R², p-values, number of profiled applications, or sensitivity to simulator parameters). This prevents assessment of how strongly the new metrics actually predict NMC performance.
[Methodology and validation sections] All importance verification rests on correlation against a single simulated NMC system. No comparison to measured traces from real HBM/HMC hardware, no sensitivity analysis to latency/bandwidth assumptions, and no cross-validation against alternative simulators are provided; therefore the ranking of 'useful' metrics may be an artifact of the simulator rather than a property of NMC.
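To make the first request concrete, the following is a minimal sketch of how Pearson r and R² could be computed for one metric against one performance measure. It uses only the Python standard library. The function names are hypothetical and the two data series are invented for illustration; they are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Share of variance explained by a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


# Invented values: one metric score and one performance ratio per application.
metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
performance_ratios = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson_r(metric_scores, performance_ratios)
print(f"r = {r:.3f}, R^2 = {r_squared(metric_scores, performance_ratios):.3f}")
```

With only a handful of applications a high r can arise by chance, which is why the referee also asks for the sample size and p-values alongside the coefficient.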

minor comments (2)

[Abstract] The final sentence states that the work 'demonstrate[s] which metrics are useful' but gives no hint of the outcome; a one-sentence summary of the key finding would improve clarity.
[Notation] Ensure DLP and BLP are expanded on first use and that 'memory entropy' is given a precise definition (e.g., Shannon entropy over address distribution) before any correlation plots.
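As an illustration of the kind of precise definition the notation comment asks for, here is a minimal sketch of Shannon entropy over the empirical distribution of accessed addresses. It follows the referee's example wording, not necessarily the paper's exact formula. The function name, the `drop_low_bits` parameter, and both traces are hypothetical.

```python
import math
from collections import Counter


def memory_entropy(addresses, drop_low_bits=0):
    """Shannon entropy, in bits, of the empirical address distribution.

    drop_low_bits is a hypothetical knob: shifting every address right by
    that many bits groups neighbouring addresses into one bucket before
    counting, which coarsens the granularity of the measurement.
    """
    buckets = Counter(address >> drop_low_bits for address in addresses)
    total = sum(buckets.values())
    if total == 0:
        return 0.0
    # H = sum over buckets of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in buckets.values())


# Illustrative traces, not data from the paper.
hot_trace = [0x1000] * 8                             # one address, reused
spread_trace = [0x1000 + 64 * i for i in range(8)]   # eight distinct addresses

print(memory_entropy(hot_trace))
print(memory_entropy(spread_trace))
print(memory_entropy(spread_trace, drop_low_bits=9))
```

Repeated access to one address gives zero entropy, while eight equally likely addresses give three bits. Coarsening to 512-byte buckets collapses the second trace into a single bucket, which shows how the chosen granularity changes the value being correlated.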

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Referee: [Abstract and results section] The central claim that 'correlating the metrics with the application's performance on a simulated NMC system' verifies their importance supplies no quantitative details (Pearson r, R², p-values, number of profiled applications, or sensitivity to simulator parameters). This prevents assessment of how strongly the new metrics actually predict NMC performance.

Authors: We agree that the abstract and results lack the requested quantitative details. In the revised manuscript we will report the number of profiled applications, Pearson r values, R², p-values for each metric-performance correlation, and a discussion of sensitivity to simulator parameters.

revision: yes

Referee: [Methodology and validation sections] All importance verification rests on correlation against a single simulated NMC system. No comparison to measured traces from real HBM/HMC hardware, no sensitivity analysis to latency/bandwidth assumptions, and no cross-validation against alternative simulators are provided; therefore the ranking of 'useful' metrics may be an artifact of the simulator rather than a property of NMC.

Authors: We agree that validation uses a single simulator and will add sensitivity analysis to latency/bandwidth assumptions. However, real-hardware trace comparisons and cross-validation with other simulators cannot be added, as the work is deliberately simulation-based to remain platform-independent.

revision: partial

standing simulated objections not resolved

Comparison to measured traces from real HBM/HMC hardware
Cross-validation against alternative simulators

Circularity Check

0 steps flagged

No circularity: empirical correlation against external simulator is independent of the profiled metrics.

The paper extends an existing profiling tool with new metrics (memory entropy, spatial locality, DLP, BLP) and correlates them against performance measured on a separate simulated NMC system. This is a standard empirical verification step with no equations, fitted parameters, or self-referential definitions that reduce the claimed importance of the metrics to the inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked in the provided text. The simulation acts as an external benchmark rather than a tautological re-expression of the metrics themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen metrics are relevant to NMC and that simulation results generalize to hardware; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Simulated NMC system performance is a valid proxy for real NMC hardware performance.
Invoked when correlating metrics to simulated performance to verify importance and usefulness for identification.

