arxiv: 1906.12066 · v1 · pith:IEB223OEnew · submitted 2019-06-28 · 💻 cs.PF · cs.PL · cs.SE

Pinpointing Performance Inefficiencies in Java

Pith reviewed 2026-05-25 13:35 UTC · model grok-4.3

classification 💻 cs.PF · cs.PL · cs.SE
keywords Java performance analysis · wasteful memory operations · hardware performance monitoring units · debug registers · profiling tool · production monitoring

The pith

JXPerf identifies wasteful memory operations in Java programs at the machine-code level by sampling with performance monitors and tracking repeats via debug registers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JXPerf as a tool to locate performance problems in Java that appear as wasteful memory operations, such as those stemming from algorithm or data structure choices and missed compiler optimizations. Bytecode instrumentation for this purpose incurs high overhead and overlooks generated machine code. JXPerf instead samples memory locations via hardware performance monitoring units and then uses debug registers to watch for later accesses to those same locations. This yields low-overhead measurements with attribution back to source code, machine code, and full calling contexts. The approach supports production use and has guided optimizations that produced measurable speedups in tested applications.

Core claim

JXPerf samples memory locations accessed by a Java program with hardware performance monitoring units and employs hardware debug registers to monitor subsequent accesses to the same memory, producing a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance in machine and source code within full calling contexts.

What carries the argument

JXPerf, the combination of hardware performance monitoring units for sampling memory accesses with hardware debug registers to detect and attribute repeated accesses to the same locations.
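
The sample-then-watch idea can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not JXPerf's implementation: the class `SilentStoreSketch`, its method names, and the fixed "every N-th store" sampling rule are hypothetical stand-ins for the PMU and a single debug register, which in reality interrupt the program only on samples and traps.

```java
/** Software simulation of sample-then-watch silent store detection (all names hypothetical). */
final class SilentStoreSketch {
    private final long samplePeriod;   // stands in for the PMU sampling period
    private long storesSeen = 0;
    private boolean armed = false;     // stands in for one hardware debug register
    private long watchedAddress;       // location M of the sampled store S1
    private long watchedValue;         // value that S1 wrote to M
    private int monitoredPairs = 0;
    private int silentPairs = 0;

    SilentStoreSketch(long samplePeriod) {
        this.samplePeriod = samplePeriod;
    }

    /** The simulation routes every store through here; real hardware would not. */
    void store(long address, long value) {
        storesSeen++;
        if (armed && address == watchedAddress) {
            // "Trap": this is the next store S2 to the watched location M.
            monitoredPairs++;
            if (value == watchedValue) {
                silentPairs++;         // S2 rewrote the value S1 had already stored
            }
            armed = false;             // release the watchpoint for a later sample
        } else if (!armed && storesSeen % samplePeriod == 0) {
            // "PMU sample" of store S1: remember M and its value, then arm.
            armed = true;
            watchedAddress = address;
            watchedValue = value;
        }
    }

    int monitoredPairs() { return monitoredPairs; }
    int silentPairs() { return silentPairs; }

    public static void main(String[] args) {
        SilentStoreSketch s = new SilentStoreSketch(2);
        for (int i = 0; i < 8; i++) {
            s.store(0x1000L, 42L);     // the same value written repeatedly
        }
        System.out.println(s.silentPairs() + " of " + s.monitoredPairs()
                + " monitored store pairs were silent");
    }
}
```

Only sampled stores are ever compared, so the reported fraction is an estimate whose quality depends on how representative the samples are.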

If this is right

  • Improvements to code generation can eliminate identified wasteful memory operations.
  • Switching to superior data structures and algorithms can produce significant speedups once the operations are located.
  • The 7 percent runtime and memory overhead allows the tool to run on production Java workloads.
  • Attribution to full calling contexts enables precise fixes at the responsible source locations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling-plus-monitoring pattern could be tested on performance problems that are not memory-related.
  • If the hardware mechanisms prove reliable across JVM implementations, the technique might generalize to other managed runtimes.

Load-bearing premise

Hardware performance monitoring units and debug registers can be programmed to capture and attribute wasteful memory operations accurately without significant sampling bias or program interference.

What would settle it

A controlled run on a Java program in which the operations flagged by JXPerf as wasteful are proven not to be avoidable, or in which measured overhead exceeds the stated 7 percent runtime and memory figures.

Figures

Figures reproduced from arXiv: 1906.12066 by Milind Chabbi, Pengfei Su, Qingsen Wang, Xu Liu.

Figure 1. The assembly code (AT&T style) of lines 153 and 155 in List…
Figure 2. JXPerf’s scheme for silent store detection. (1) The PMU samples a memory store S1 that touches location M. (2) In the PMU sample handler, a debug register is armed to monitor subsequent access to M. (3) The debug register traps on the next store S2 to M. (4) If S1 and S2 write the same values to M, JXPerf labels S2 as a silent store and ⟨S1, S2⟩ as a silent store pair. Silent stores and silent loads are value…
Figure 4. Fraction of wasteful memory operations on DaCapo 2006,…
Figure 5. Fraction of wasteful memory operations on DaCapo 2006,…
Figure 6. Runtime slowdown (×) and memory bloat (×) of JXPerf at the 5M sampling period on DaCapo 2006, DaCapo-9.12-MR1-bach and ScalaBench benchmark suites
Figure 7. A silent load pair with full calling contexts reported by…
Figure 9. A silent load pair reported by JXPerf in SableCC-3.7.

    public V put(K key, V value) {
        Entry<K,V> t = root;
        ...
        do {
            parent = t;
            cmp = k.compareTo(t.key);
            if (cmp < 0)
    ▶           t = t.left;
            else if (cmp > 0)
                t = t.right;
            ...
        } while (t != null);
        ...
    }

Listing 5. Method put() of the JDK TreeMap class. A put operation requires O(log n)…
Figure 8. The assembly code (AT&T style) of lines 5, 7 and 12 in List…
Original abstract

Many performance inefficiencies such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional byte-code instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance: machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead, making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents JXPerf, a tool that combines hardware performance monitoring units (PMUs) to sample memory locations with hardware debug registers to track subsequent accesses to those locations. This enables detection of wasteful memory operations in Java programs at the machine-code level with full calling-context attribution, while claiming only 7% runtime overhead and 7% memory overhead to support production use. The authors report using the tool to guide optimizations in several Java applications via improved code generation and better data structures/algorithms, yielding significant speedups.

Significance. If the low-overhead claims and attribution accuracy hold, the approach offers a practical alternative to high-overhead bytecode instrumentation for production Java profiling, potentially enabling more targeted optimizations. The hardware-assisted method for machine-code level insight is a notable strength for a tool paper.

major comments (2)
  1. [Abstract, method description paragraph] The description of sampling addresses via the PMU and arming debug registers for subsequent monitoring does not say how the tool copes with the typical limit of only four debug registers when a program has more than a handful of distinct hot memory locations. This leaves open the risk of systematic sampling bias, dropped monitors, or restricted active sets, any of which directly affects the accuracy of the reported inefficiencies and the load-bearing 7% overhead claim for production usefulness.
  2. [Abstract] The overhead figures (7% runtime, 7% memory) and the speedup claims are stated without reference to evaluation methodology, baselines, workloads, error bars, or statistical significance, so the central claim of usefulness in production cannot be verified from the given description, even if the full-text sections supply these details.
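
To make the second comment concrete, the following sketch shows one conventional way to turn repeated timings into a relative overhead figure with a spread. It is an illustration only: the class `OverheadSketch`, its method names, and the timing values in `main` are made up for the example and are not measurements or methodology from the paper.

```java
import java.util.Arrays;

/** Relative overhead from repeated runs (hypothetical helper, illustrative numbers only). */
final class OverheadSketch {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two runs. */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0.0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return Math.sqrt(sumSquares / (xs.length - 1));
    }

    /** Overhead in percent: mean profiled time relative to mean baseline time. */
    static double overheadPercent(double[] baseline, double[] profiled) {
        return (mean(profiled) / mean(baseline) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Made-up wall-clock times in seconds, NOT data from the paper.
        double[] baseline = {10.1, 9.9, 10.0, 10.2, 9.8};
        double[] profiled = {10.8, 10.6, 10.7, 10.9, 10.5};
        System.out.printf("overhead: %.1f%%, baseline stddev: %.2f s, profiled stddev: %.2f s%n",
                overheadPercent(baseline, profiled),
                sampleStdDev(baseline), sampleStdDev(profiled));
    }
}
```

A single percentage is only interpretable alongside the run-to-run spread, which is what the request for error bars amounts to.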

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below and indicate where revisions to the manuscript are warranted.

  1. Referee: [Abstract, method description paragraph] The description of sampling addresses via the PMU and arming debug registers for subsequent monitoring does not say how the tool copes with the typical limit of only four debug registers when a program has more than a handful of distinct hot memory locations. This leaves open the risk of systematic sampling bias, dropped monitors, or restricted active sets, any of which directly affects the accuracy of the reported inefficiencies and the load-bearing 7% overhead claim for production usefulness.

    Authors: The full manuscript (implementation and design sections) explains that JXPerf maintains a larger candidate set of hot locations from PMU sampling and uses a rotation policy to arm only the top-N locations (fitting the 4 debug registers) at any time, with the rotation frequency chosen to ensure coverage. This is intended to avoid systematic bias, and the reported overheads already incorporate the management cost. We agree the abstract is insufficiently explicit on this point and will revise it to include a concise description of the rotation mechanism. revision: partial

  2. Referee: [Abstract] The overhead figures (7% runtime, 7% memory) and the speedup claims are stated without reference to evaluation methodology, baselines, workloads, error bars, or statistical significance, so the central claim of usefulness in production cannot be verified from the given description, even if the full-text sections supply these details.

    Authors: The Evaluation section of the manuscript details the methodology (including DaCapo, SPECjvm, and application workloads), baselines, multiple-run statistics with error bars, and significance testing that support the 7% overhead and speedup numbers. To improve the abstract, we will add a brief clause indicating that these figures come from the comprehensive experiments reported later in the paper. revision: yes
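
The first exchange above turns on how a small, fixed number of debug registers can stand in for many sampled locations. One generic way to keep a bounded monitor set without favoring early samples is reservoir sampling. The sketch below uses it purely as an assumption for illustration; it does not reproduce the rotation policy described in the simulated response and is not presented as JXPerf's design. The class `WatchpointPoolSketch` and its methods are hypothetical.

```java
import java.util.Random;

/** Bounded pool of "armed" addresses maintained by reservoir sampling (hypothetical sketch). */
final class WatchpointPoolSketch {
    private final long[] slots;       // one entry per simulated debug register
    private final Random rng;
    private int used = 0;             // slots currently armed
    private int samplesSeen = 0;      // samples offered so far

    WatchpointPoolSketch(int capacity, long seed) {
        this.slots = new long[capacity];
        this.rng = new Random(seed);
    }

    /** Offers a sampled address; returns true if it is now being monitored. */
    boolean offer(long address) {
        samplesSeen++;
        if (used < slots.length) {
            slots[used++] = address;  // a free register is always taken
            return true;
        }
        // Reservoir step: the new sample survives with probability capacity / samplesSeen.
        int j = rng.nextInt(samplesSeen);
        if (j < slots.length) {
            slots[j] = address;       // evict the previous occupant of slot j
            return true;
        }
        return false;                 // sample dropped, existing monitors stay armed
    }

    int armedCount() { return used; }

    boolean isArmed(long address) {
        for (int i = 0; i < used; i++) {
            if (slots[i] == address) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        WatchpointPoolSketch pool = new WatchpointPoolSketch(4, 42L);
        int accepted = 0;
        for (long a = 0; a < 1000; a++) {
            if (pool.offer(a)) accepted++;
        }
        System.out.println("armed=" + pool.armedCount() + ", accepted=" + accepted + " of 1000");
    }
}
```

Under this policy every offered sample has the same chance of occupying a slot at any moment, which is the property the referee's sampling-bias concern is about. A real tool would also free a slot when its watchpoint traps, which this sketch omits.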

Circularity Check

0 steps flagged

No significant circularity in tool-implementation paper

The paper describes an engineering artifact (JXPerf) that samples via PMU and arms debug registers to attribute wasteful accesses, with overhead claims resting on direct runtime measurements rather than any derivation, fitted parameters, or equations. No self-citations, ansatzes, or uniqueness theorems appear in the provided text, and the central claims do not reduce to inputs by construction. The work is self-contained against external benchmarks via reported overheads and case-study speedups.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard hardware PMU and debug-register semantics.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. 2013. Toddler: Detecting Performance Problems via Similar Memory-Access Patterns. http://www.cs.fsu.edu/~nistor/toddler

  2. [2]

    Armin Rigo, Maciej Fijalkowski, Carl Friedrich Bolz, Antonio Cuni, Benjamin Peterson, Alex Gaynor, Holger Krekel, and Samuele Pedroni. 2018. A fast, compliant alternative implementation of the Python language. https://pypy.org

  3. [3]

    D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks - Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputin...

  4. [4]

    Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Bench...

  5. [5]

    Milind Chabbi and John Mellor-Crummey. 2012. DeadSpy: A Tool to Pinpoint Program Inefficiencies. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (CGO ’12). ACM, New York, NY, USA, 124–134

  6. [6]

    Intel Corp. 2010. Intel Microarchitecture Codename Nehalem Performance Monitoring Unit Programming Guide. https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf

  7. [7]

    Intel Corp. 2015. Intel X86 Encoder Decoder Software Library. https://software.intel.com/en-us/articles/xed-x86-encoder-decoder-software-library. ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia Pengfei Su, Qingsen Wang, Milind Chabbi, and Xu Liu

  8. [8]

    Oracle Corp. 2017. Oracle Developer Studio Performance Analyzer. https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf

  9. [9]

    Oracle Corp. 2018. JVM™ Tool Interface. https://docs.oracle.com/en/java/javase/11/docs/specs/jvmti.html

  10. [10]

    Oracle Corporation. 2018. All-in-One Java Troubleshooting Tool. https://visualvm.github.io

  11. [11]

    Luca Della Toffola, Michael Pradel, and Thomas R. Gross. 2015. Performance Problems You Can Fix: A Dynamic Analysis of Memoization Opportunities. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015) . ACM, New York, NY, USA, 607–622

  12. [12]

    Monika Dhok and Murali Krishna Ramanathan. 2016. Directed Test Generation to Detect Loop Inefficiencies. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016) . ACM, New York, NY, USA, 895–907

  13. [13]

    Paul J. Drongowski. 2007. Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors. https://pdfs.semanticscholar.org/5219/4b43b8385ce39b2b08ecd409c753e0efafe5.pdf

  14. [14]

    Ariel Eizenberg, Shiliang Hu, Gilles Pokam, and Joseph Devietti. 2016. Remix: Online Detection and Repair of Cache Contention for the JVM. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 251–265

  15. [15]

    ej-technologies GmbH. 2018. THE AWARD-WINNING ALL-IN-ONE JAVA PROFILER. https://www.ej-technologies.com/products/jprofiler/overview.html

  16. [16]

    Etienne Gagnon. 2018. The Sable Research Group’s Compiler Compiler. http://sablecc.org. May 2018

  17. [17]

    Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA ’07). ACM, New York, NY, USA, 57–76

  18. [18]

    David Gilbert. 2017. Welcome To JFree.org. http://www.jfree.org. November 2017

  19. [19]

    YourKit GmbH. 2018. The Industry Leader in .NET & Java Profiling. https://www.yourkit.com

  20. [20]

    Google Corp. 2018. Google V8 JavaScript Engine. https://v8.dev

  21. [21]

    Peter Hofer and Hanspeter Mössenböck. 2014. Fast Java Profiling with Scheduling-aware Stack Fragment Sampling and Asynchronous Analysis. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ ’14). ACM, New York, NY, USA, 145–156

  22. [22]

    IBM Corp. 2018. Monitoring and Post Mortem. https://developer.ibm.com/javasdk/tools

  23. [23]

    Mark Scott Johnson. 1982. Some Requirements for Architectural Support of Software Debugging. In Proceedings of the First International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS I). ACM, New York, NY, USA, 140–148

  24. [24]

    John Levon et al. 2017. OProfile. http://oprofile.sourceforge.net

  25. [25]

    Linux. 2012. perf_event_open - Linux man page. https://linux.die.net/man/2/perf_event_open

  26. [26]

    Linux. 2015. Linux Perf Tool. https://perf.wiki.kernel.org/index.php/Main_Page

  27. [27]

    R. E. McLear, D. M. Scheibelhut, and E. Tammaru. 1982. Guidelines for Creating a Debuggable Processor. In Proceedings of the First International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS I). ACM, New York, NY, USA, 100–106

  28. [28]

    Monika Dhok and Murali Krishna Ramanathan. 2016. Artifact: Directed Test Generation to Detect Loop Inefficiencies. https://drona.csa.iisc.ac.in/~sss/tools/glider

  29. [29]

    Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2010. Evaluating the Accuracy of Java Profilers. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10). ACM, New York, NY, USA, 187–197

  30. [30]

    Khanh Nguyen and Guoqing Xu. 2013. Cachetor: Detecting Cacheable Data to Remove Bloat. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013) . ACM, New York, NY, USA, 268–278

  31. [31]

    Adrian Nistor. 2012. fast return for SegmentedTimeline.getExceptionSegmentCount(). https://sourceforge.net/p/jfreechart/patches/300. November 2012

  32. [32]

    Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. 2013. Toddler: Detecting Performance Problems via Similar Memory-access Patterns. In Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13) . IEEE Press, Piscataway, NJ, USA, 562–571

  33. [33]

    Nitsan Wakart. 2016. The Pros and Cons of AsyncGetCallTrace Profilers. http://psy-lob-saw.blogspot.com/2016/06/the-pros-and-cons-of-agct.html

  34. [34]

    The University of Edinburgh. 2018. JAVA Grande Benchmark Suite. https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/java-grande-benchmark-suite. October 2018

  35. [35]

    Oswaldo Olivo, Isil Dillig, and Calvin Lin. 2015. Static Detection of Asymptotic Performance Bugs in Collection Traversals. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’15). ACM, New York, NY, USA, 369–378

  36. [36]

    Andrei Pangin. 2018. Async-profiler. https://github.com/jvm-profiling-tools/async-profiler

  37. [37]

    Bill Pugh and David Hovemeyer. 2015. Find Bugs in Java Programs. http://findbugs.sourceforge.net. March 2015

  38. [38]

    Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. 2011. Da Capo Con Scala: Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’11) . ACM, New York, NY, USA, 657–676

  39. [39]

    Linhai Song and Shan Lu. 2017. Performance Diagnosis for Inefficient Loops. In Proceedings of the 39th International Conference on Software Engineering (ICSE ’17). IEEE Press, Piscataway, NJ, USA, 370–380

  40. [40]

    SPEC Corporation. 2015. SPEC JVM2008 Benchmark Suite. https://www.spec.org/jvm2008. November 2015

  41. [41]

    M. Srinivas, B. Sinharoy, R. J. Eickemeyer, R. Raghavan, S. Kunkel, T. Chen, W. Maron, D. Flemming, A. Blanchard, P. Seshadri, J. W. Kellington, A. Mericas, A. E. Petruski, V. R. Indukuru, and S. Reyes. 2011. IBM POWER7 performance modeling, verification, and evaluation. IBM JRD 55, 3 (May-June 2011), 4:1–4:19

  42. [42]

    Pengfei Su, Shasha Wen, Hailong Yang, Milind Chabbi, and Xu Liu. 2019. Redundant Loads: A Software Inefficiency Indicator. In Proceedings of the 41st International Conference on Software Engineering (ICSE ’19). IEEE Press, Piscataway, NJ, USA, 982–993

  43. [43]

    The Sable Research Group. 2018. A framework for analyzing and transforming Java and Android applications. https://sable.github.io/soot

  44. [44]

    Jeffrey S. Vitter. 1985. Random Sampling with a Reservoir. ACM Trans. Math. Softw. 11, 1 (March 1985), 37–57

  45. [45]

    Qingsen Wang, Xu Liu, and Milind Chabbi. 2019. Featherlight Reuse-Distance Measurement. In Proceedings of The 25th IEEE International Symposium on High-Performance Computer Architecture. 440–453

  46. [46]

    Shasha Wen, Milind Chabbi, and Xu Liu. 2017. REDSPY: Exploring Value Locality in Software. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17). ACM, New York, NY, USA, 47–61

  47. [47]

    Shasha Wen, Xu Liu, John Byrne, and Milind Chabbi. 2018. Watching for Software Inefficiencies with Witch. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18). ACM, New York, NY, USA, 332–347

  48. [48]

    Guoqing Xu. 2013. Resurrector: A Tunable Object Lifetime Profiling Technique for Optimizing Real-world Programs. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’13). ACM, New York, NY, USA, 111–130

  49. [49]

    Guoqing Xu, Matthew Arnold, Nick Mitchell, Atanas Rountev, and Gary Sevitsky. 2009. Go with the Flow: Profiling Copies to Find Runtime Bloat. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’09). ACM, New York, NY, USA, 419–430

  51. [51]

    Guoqing Xu, Nick Mitchell, Matthew Arnold, Atanas Rountev, Edith Schonberg, and Gary Sevitsky. 2010. Finding Low-utility Data Structures. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10). ACM, New York, NY, USA, 174–186

  52. [52]

    Guoqing Xu and Atanas Rountev. 2010. Detecting Inefficiently-used Containers to Avoid Bloat. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10) . ACM, New York, NY, USA, 160–173

  53. [53]

    Shengqian Yang, Dacong Yan, Guoqing Xu, and Atanas Rountev. 2012. Dynamic Analysis of Inefficiently-used Containers. In Proceedings of the Ninth International Workshop on Dynamic Analysis (WODA 2012). ACM, New York, NY, USA, 30–35

  54. [54]

    Zhaomo Yang, Brian Johannesmeyer, Anders Trier Olesen, Sorin Lerner, and Kirill Levchenko. 2017. Dead Store Elimination (Still) Considered Harmful. In 26th USENIX Security Symposium. USENIX Association, Berkeley, CA, USA, 1025–1040

  55. [55]

    A. Yasin. 2014. A Top-Down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 35–44