Pinpointing Performance Inefficiencies in Java
Pith reviewed 2026-05-25 13:35 UTC · model grok-4.3
The pith
JXPerf identifies wasteful memory operations in Java programs at the machine-code level by sampling with performance monitors and tracking repeats via debug registers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JXPerf samples memory locations accessed by a Java program with hardware performance monitoring units and employs hardware debug registers to monitor subsequent accesses to the same memory, producing a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance in machine and source code within full calling contexts.
What carries the argument
JXPerf, the combination of hardware performance monitoring units for sampling memory accesses with hardware debug registers to detect and attribute repeated accesses to the same locations.
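The sample-then-watch idea can be illustrated with a toy, trace-driven model. This is a minimal sketch and not JXPerf's implementation: the `DeadStoreSketch` class, the `Access` type, the `period` parameter, and the single-watchpoint limit are assumptions made for illustration, and a plain loop over a synthetic trace stands in for the PMU and the debug registers. It flags a dead store, one kind of wasteful memory operation, when a sampled store is overwritten before it is read.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "sample an access, then watch the same address" (hypothetical names). */
final class DeadStoreSketch {
    enum Kind { LOAD, STORE }

    /** One access in a synthetic trace; `site` stands in for a calling context. */
    static final class Access {
        final Kind kind;
        final long address;
        final String site;

        Access(Kind kind, long address, String site) {
            this.kind = kind;
            this.address = address;
            this.site = site;
        }
    }

    /**
     * Samples every `period`-th store seen while no watchpoint is armed (a stand-in
     * for a PMU sample) and watches its address (a stand-in for a debug register).
     * If the next access to the watched address is another store, the sampled value
     * was never read, so the pair is counted as a dead store.
     * Returns counts keyed by "sampled site -> overwriting site".
     */
    static Map<String, Integer> findDeadStores(List<Access> trace, int period) {
        if (period <= 0) throw new IllegalArgumentException("period must be positive");
        Map<String, Integer> report = new HashMap<>();
        Access watched = null;   // at most one armed watchpoint in this sketch
        int storesSeen = 0;
        for (Access a : trace) {
            if (watched != null && a.address == watched.address) {
                if (a.kind == Kind.STORE) {
                    report.merge(watched.site + " -> " + a.site, 1, Integer::sum);
                }
                watched = null;  // the watchpoint fires once, then is disarmed
                continue;
            }
            if (watched == null && a.kind == Kind.STORE && ++storesSeen % period == 0) {
                watched = a;     // arm the watchpoint on the sampled address
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<Access> trace = Arrays.asList(
            new Access(Kind.STORE, 0x10L, "init"),
            new Access(Kind.STORE, 0x10L, "reset"),  // overwrites before any load
            new Access(Kind.STORE, 0x20L, "fill"),
            new Access(Kind.LOAD, 0x20L, "use"));    // the stored value is read
        System.out.println(findDeadStores(trace, 1));
    }
}
```

In this model each report entry pairs the site of the sampled store with the site of the overwriting store, a simplified stand-in for the attribution to calling contexts that the summary describes.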
If this is right
- Improvements to code generation can eliminate identified wasteful memory operations.
- Switching to superior data structures and algorithms can produce significant speedups once the operations are located.
- The 7 percent runtime and memory overhead allows the tool to run on production Java workloads.
- Attribution to full calling contexts enables precise fixes at the responsible source locations.
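As an illustration of the data-structure point above, the following sketch contrasts a membership test that rescans a list on every query with one that builds a hash set once. It is a hypothetical example, not one of the paper's case studies; `MembershipSketch` and its method names are invented here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical example of replacing repeated linear scans with a hash set. */
final class MembershipSketch {
    /** O(n*m): every contains() call rescans `allowed`, reloading the same elements. */
    static int countAllowedSlow(List<Integer> items, List<Integer> allowed) {
        int count = 0;
        for (Integer item : items) {
            if (allowed.contains(item)) {
                count++;
            }
        }
        return count;
    }

    /** O(n+m) expected: build the set once, then each lookup touches few locations. */
    static int countAllowedFast(List<Integer> items, List<Integer> allowed) {
        Set<Integer> allowedSet = new HashSet<>(allowed);
        int count = 0;
        for (Integer item : items) {
            if (allowedSet.contains(item)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> allowed = new ArrayList<>();
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            allowed.add(i * 2);  // even numbers 0..3998
            items.add(i);        // 0..1999
        }
        // Both variants must agree; only the number of memory loads differs.
        System.out.println(countAllowedSlow(items, allowed) == countAllowedFast(items, allowed));
    }
}
```

The slow variant repeatedly loads the same list elements, which is the kind of repeated access to the same memory that the tool is described as tracking.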
Where Pith is reading between the lines
- The same sampling-plus-monitoring pattern could be tested on performance problems that are not memory-related.
- If the hardware mechanisms prove reliable across JVM implementations, the technique might generalize to other managed runtimes.
Load-bearing premise
Hardware performance monitoring units and debug registers can be programmed to capture and attribute wasteful memory operations accurately without significant sampling bias or program interference.
What would settle it
A controlled run on a Java program in which the operations flagged by JXPerf as wasteful are proven not to be avoidable, or in which measured overhead exceeds the stated 7 percent runtime and memory figures.
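The overhead half of this test reduces to simple arithmetic: relative overhead is the extra cost of the profiled run divided by the baseline cost. The sketch below is illustrative only; `OverheadSketch`, its method names, and the sample numbers are hypothetical and are not measurements from the paper.

```java
/** Hypothetical helper for comparing measured overhead against a stated figure. */
final class OverheadSketch {
    /** Relative overhead in percent: 100 * (profiled - baseline) / baseline. */
    static double overheadPercent(double baseline, double profiled) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return 100.0 * (profiled - baseline) / baseline;
    }

    /** True when the measured overhead does not exceed the claimed percentage. */
    static boolean withinClaim(double baseline, double profiled, double claimedPercent) {
        return overheadPercent(baseline, profiled) <= claimedPercent;
    }

    public static void main(String[] args) {
        // Made-up timings in seconds: a 100 s baseline and a 110 s profiled run.
        System.out.println(overheadPercent(100.0, 110.0));
        System.out.println(withinClaim(100.0, 110.0, 7.0));
    }
}
```

The same function applies to memory if peak footprint is substituted for running time.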
Original abstract
Many performance inefficiencies, such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations, show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional byte-code instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance: machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead, making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents JXPerf, a tool that combines hardware performance monitoring units (PMUs) to sample memory locations with hardware debug registers to track subsequent accesses to those locations. This enables detection of wasteful memory operations in Java programs at the machine-code level with full calling-context attribution, while claiming only 7% runtime overhead and 7% memory overhead to support production use. The authors report using the tool to guide optimizations in several Java applications via improved code generation and better data structures/algorithms, yielding significant speedups.
Significance. If the low-overhead claims and attribution accuracy hold, the approach offers a practical alternative to high-overhead bytecode instrumentation for production Java profiling, potentially enabling more targeted optimizations. The hardware-assisted method for machine-code level insight is a notable strength for a tool paper.
major comments (2)
- [Abstract, method description paragraph] The mechanism of sampling addresses via the PMU and arming debug registers for subsequent monitoring does not address how the tool handles the typical limit of only four debug registers when a program has more than a handful of distinct hot memory locations. This leaves open the risk of systematic sampling bias, dropped monitors, or restricted active sets, which directly affects the accuracy of reported inefficiencies and the load-bearing 7% overhead claim for production usefulness.
- [Abstract] Overhead figures (7% runtime, 7% memory) and speedup claims are stated without reference to evaluation methodology, baselines, workloads, error bars, or statistical significance, making the central claim of usefulness in production unverifiable from the given description even if full-text sections exist.
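The register-limit concern in the first comment can be made concrete. One standard way to share a fixed number of slots among an unbounded stream of samples without favoring early or late ones is reservoir sampling (Algorithm R). The sketch below assumes four slots and is not taken from the paper: `WatchpointReservoir` and its methods are hypothetical, and whether JXPerf manages its debug registers this way is not established by the text reviewed here.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical reservoir over a fixed number of watchpoint slots (Algorithm R). */
final class WatchpointReservoir {
    private final long[] slots;   // addresses currently "armed"
    private int armed = 0;        // number of occupied slots
    private int samplesSeen = 0;  // samples offered so far
    private final Random rng;

    WatchpointReservoir(int registers, long seed) {
        if (registers <= 0) throw new IllegalArgumentException("registers must be positive");
        this.slots = new long[registers];
        this.rng = new Random(seed);
    }

    /** Offers a sampled address; returns true if it now occupies a slot. */
    boolean offer(long address) {
        samplesSeen++;
        if (armed < slots.length) {
            slots[armed++] = address;      // free slot: always arm
            return true;
        }
        // Keep the new sample with probability slots.length / samplesSeen,
        // evicting a uniformly chosen occupant.
        int j = rng.nextInt(samplesSeen);  // uniform in [0, samplesSeen)
        if (j < slots.length) {
            slots[j] = address;
            return true;
        }
        return false;
    }

    /** Copy of the addresses currently holding a slot. */
    long[] armedAddresses() {
        return Arrays.copyOf(slots, armed);
    }

    public static void main(String[] args) {
        WatchpointReservoir reservoir = new WatchpointReservoir(4, 42L);
        for (long address = 1; address <= 1000; address++) {
            reservoir.offer(address);
        }
        System.out.println(Arrays.toString(reservoir.armedAddresses()));
    }
}
```

Under this policy every sampled address has the same probability of holding a slot at any point, which addresses the bias question but not the separate question of how long a slot must stay armed to observe the next access.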
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond to each major comment below and indicate where revisions to the manuscript are warranted.
Point-by-point responses
- Referee: [Abstract, method description paragraph] The mechanism of sampling addresses via the PMU and arming debug registers for subsequent monitoring does not address how the tool handles the typical limit of only four debug registers when a program has more than a handful of distinct hot memory locations. This leaves open the risk of systematic sampling bias, dropped monitors, or restricted active sets, which directly affects the accuracy of reported inefficiencies and the load-bearing 7% overhead claim for production usefulness.
Authors: The full manuscript (implementation and design sections) explains that JXPerf maintains a larger candidate set of hot locations from PMU sampling and uses a rotation policy to arm only the top-N locations (fitting the four debug registers) at any time, with the rotation frequency chosen to ensure coverage. This is intended to avoid systematic bias, and the reported overheads already incorporate the management cost. We agree the abstract is insufficiently explicit on this point and will revise it to include a concise description of the rotation mechanism.
Revision: partial
- Referee: [Abstract] Overhead figures (7% runtime, 7% memory) and speedup claims are stated without reference to evaluation methodology, baselines, workloads, error bars, or statistical significance, making the central claim of usefulness in production unverifiable from the given description even if full-text sections exist.
Authors: The Evaluation section of the manuscript details the methodology (including DaCapo, SPECjvm, and application workloads), baselines, multiple-run statistics with error bars, and significance testing that support the 7% overhead and speedup numbers. To improve the abstract, we will add a brief clause indicating that these figures come from the comprehensive experiments reported later in the paper.
Revision: yes
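The "multiple-run statistics with error bars" mentioned in this response can be sketched as a mean with an approximate 95% confidence interval. This is an assumption about what such statistics might look like, not the paper's procedure: `RunStatsSketch` is a hypothetical name, the input values are made up, and the 1.96 multiplier relies on a normal approximation that is loose for a small number of runs.

```java
/** Hypothetical summary statistics for per-run overhead percentages. */
final class RunStatsSketch {
    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        if (xs.length == 0) throw new IllegalArgumentException("need at least one sample");
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        if (xs.length < 2) throw new IllegalArgumentException("need at least two samples");
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** Half-width of an approximate 95% confidence interval (normal approximation). */
    static double ci95HalfWidth(double[] xs) {
        return 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        double[] overheads = {6.0, 7.0, 8.0};  // made-up per-run overheads in percent
        System.out.println(mean(overheads) + " +/- " + ci95HalfWidth(overheads));
    }
}
```

Reporting the interval alongside the mean lets a reader judge whether a stated overhead figure is distinguishable from run-to-run noise.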
Circularity Check
No significant circularity in tool-implementation paper
The paper describes an engineering artifact (JXPerf) that samples via PMU and arms debug registers to attribute wasteful accesses, with overhead claims resting on direct runtime measurements rather than any derivation, fitted parameters, or equations. No self-citations, ansatzes, or uniqueness theorems appear in the provided text, and the central claims do not reduce to inputs by construction. The work is self-contained against external benchmarks via reported overheads and case-study speedups.
Source paper
Pengfei Su, Qingsen Wang, Milind Chabbi, and Xu Liu. Pinpointing Performance Inefficiencies in Java. ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia.