Misleading Microbenchmarks on the Java Virtual Machines
Pith reviewed 2026-05-25 02:28 UTC · model grok-4.3
The pith
Microbenchmarks on the JVM can produce misleading performance results by inducing unrealistic compiler profiles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. The speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution, which in isolation can include branch probabilities and receiver types that would not appear in a real application.
What carries the argument
Profile-driven dynamic compilation on the JVM, where early execution profiles determine aggressive optimizations that depend on context-specific data like branch probabilities and receiver types.
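A minimal Java sketch of the mechanism, assuming hypothetical `Shape`, `Square`, and `Rect` types that do not come from the paper. The same method sees a single receiver type and an always-taken branch under a uniform benchmark input, but two receiver types and a mixed branch under an application-like input. The sketch only builds the inputs and counts what the call site would observe; it does not measure time, and an actual measurement would go through a harness such as JMH.

```java
import java.util.HashSet;
import java.util.Set;

// All names below are hypothetical and chosen for illustration only.
final class ProfileShapeDemo {
    // Method under test: one virtual call site and one data-dependent branch.
    static double sumLargeAreas(Shape[] shapes, double threshold) {
        double sum = 0;
        for (Shape s : shapes) {
            double a = s.area();   // a JIT would record receiver types here
            if (a > threshold) {   // a JIT would record branch outcomes here
                sum += a;
            }
        }
        return sum;
    }

    // Isolated-benchmark style input: one receiver type, large areas only.
    static Shape[] monomorphicInput(int n) {
        Shape[] shapes = new Shape[n];
        for (int i = 0; i < n; i++) {
            shapes[i] = new Square(2.0);
        }
        return shapes;
    }

    // Application-style input: two receiver types, alternating large and small areas.
    static Shape[] mixedInput(int n) {
        Shape[] shapes = new Shape[n];
        for (int i = 0; i < n; i++) {
            shapes[i] = (i % 2 == 0) ? new Square(2.0) : new Rect(1.0, 0.5);
        }
        return shapes;
    }

    // Number of distinct receiver classes the call site would observe.
    static int distinctReceiverTypes(Shape[] shapes) {
        Set<Class<?>> seen = new HashSet<>();
        for (Shape s : shapes) {
            seen.add(s.getClass());
        }
        return seen.size();
    }
}

interface Shape {
    double area();
}

final class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

final class Rect implements Shape {
    private final double w;
    private final double h;
    Rect(double w, double h) { this.w = w; this.h = h; }
    public double area() { return w * h; }
}
```

With a threshold of 1.0, `monomorphicInput` exercises the call site with one class and always takes the branch, which is the kind of input under which a JIT compiler may speculate on a single call target. `mixedInput` breaks both assumptions.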
If this is right
- Microbenchmark results may not reflect real-world performance due to context isolation.
- Existing JMH guidelines are insufficient to guarantee representative profiles.
- Developers should apply additional steps to make microbenchmark results more representative of full applications.
- Performance choices based on such benchmarks risk selecting suboptimal implementations for actual use.
Where Pith is reading between the lines
- Similar profile issues could arise in other managed runtimes that use tiered JIT compilation.
- Benchmark frameworks might incorporate synthetic load from other threads to better mimic resource contention.
- Validation of collected profiles against production traces could become a standard check before trusting microbenchmark data.
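The last point could be sketched as a distance check between two receiver-type histograms. This is an assumption about how such a validation might look, not something the paper or JMH provides: the class name `ProfileDistance`, the use of total variation distance, and the tolerance parameter are all hypothetical choices.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: compares a benchmark profile with a production profile.
final class ProfileDistance {
    // Total variation distance between two count histograms, in the range [0, 1].
    // 0 means identical relative frequencies; 1 means no overlap at all.
    static double totalVariation(Map<String, Long> a, Map<String, Long> b) {
        double totalA = a.values().stream().mapToLong(Long::longValue).sum();
        double totalB = b.values().stream().mapToLong(Long::longValue).sum();
        if (totalA == 0 || totalB == 0) {
            throw new IllegalArgumentException("empty histogram");
        }
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0;
        for (String k : keys) {
            double pa = a.getOrDefault(k, 0L) / totalA;
            double pb = b.getOrDefault(k, 0L) / totalB;
            sum += Math.abs(pa - pb);
        }
        return sum / 2.0;
    }

    // True if the benchmark histogram is within the chosen tolerance of production.
    static boolean isRepresentative(Map<String, Long> bench,
                                    Map<String, Long> prod,
                                    double tolerance) {
        return totalVariation(bench, prod) <= tolerance;
    }
}
```

For example, a benchmark histogram containing only `ArrayList` receivers compared with a production histogram split 60/30/10 across three list types gives a distance of about 0.4, which a tolerance of 0.1 would reject. The appropriate tolerance is a judgment call that this sketch does not settle.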
Load-bearing premise
The unrealistic profiles induced in the demonstrated microbenchmarks are representative of common developer practices and lead to optimizations that materially differ from those in real applications.
What would settle it
Execute the same benchmarked methods inside a full application with competing threads and observe whether the JVM applies the same optimizations and produces the same relative performance ordering as in the isolated microbenchmark.
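The ordering part of that test can be sketched as follows, assuming the per-implementation timings have already been measured elsewhere. The class `OrderingCheck` and its method names are hypothetical, and the sketch ignores measurement noise and ties, which a real comparison would need to handle with confidence intervals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: checks whether two measurement sets rank implementations alike.
final class OrderingCheck {
    // Implementation names sorted from fastest to slowest by time per operation.
    static List<String> ranking(Map<String, Double> nsPerOp) {
        List<String> names = new ArrayList<>(nsPerOp.keySet());
        names.sort(Comparator.comparingDouble(name -> nsPerOp.get(name)));
        return names;
    }

    // True if both measurement sets cover the same implementations
    // and rank them in the same order.
    static boolean sameOrdering(Map<String, Double> micro, Map<String, Double> inApp) {
        return micro.keySet().equals(inApp.keySet())
                && ranking(micro).equals(ranking(inApp));
    }
}
```

If `sameOrdering` returns false for the isolated and in-application timings of the same methods, the microbenchmark would have pointed to a different winner than the application does.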
Original abstract
Developers often use microbenchmarks to choose the most performant implementation of a method or a class. On the Java Virtual Machine (JVM), this is commonly done using the Java Microbenchmark Harness (JMH), which addresses common pitfalls of measuring code performance on the JVM. However, even following JMH guidelines cannot overcome the fundamental issue of context. Microbenchmarks inherently execute code in isolation, without interference from other application code competing for CPU resources, such as cache or branch-predictor capacity. On managed runtimes with tiered dynamic compilation, such as the JVM, the speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution. Because profiles usually also include branch probabilities and receiver types (besides code hotness metrics), a badly designed microbenchmark may cause the JVM to collect an unrealistic profile, resulting in aggressive yet misleading optimizations that would not occur in a real application. In this paper, we demonstrate how using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. We also extend these guidelines by suggesting actions to make the microbenchmark results more representative.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that microbenchmarks on the JVM, even when using JMH and following existing guidelines, can produce misleading results because isolation causes the collection of unrealistic profiles (branch probabilities and receiver types) that trigger aggressive, non-representative optimizations. It demonstrates this issue through constructed examples and extends the guidelines with suggestions to improve representativeness of results.
Significance. If the demonstrations and comparisons hold, this would be a useful contribution to performance engineering on managed runtimes, as it identifies a fundamental context-sensitivity in profile-driven JIT compilation that current microbenchmarking practices do not fully mitigate. The extended guidelines provide concrete, actionable advice that could directly benefit developers.
major comments (1)
- [Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.
minor comments (1)
- The abstract states that guidelines are extended, but the specific suggestions would benefit from being listed explicitly in a dedicated subsection or table for easier reference by readers.
Simulated Author's Rebuttal
We thank the referee for the constructive review. The major comment identifies a valid point about strengthening the empirical support for the central claim. We respond point-by-point below and will revise the manuscript accordingly.
- Referee: [Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.
Authors: We agree that explicit side-by-side profile data would strengthen the presentation. Our experiments deliberately construct microbenchmarks that collect atypical profiles (e.g., 100% branch-taken probabilities or monomorphic receiver types) under JMH, triggering optimizations such as branch folding or devirtualization that are unlikely under mixed real-world inputs. The manuscript demonstrates the mechanism and resulting performance differences within the microbenchmark setting, but does not include direct comparisons against the same methods running inside larger applications. In revision we will add the collected profile statistics (branch taken/not-taken counts and type histograms) from the JMH runs to make the atypical nature of the profiles explicit. We will also expand the discussion to contrast these profiles with those expected in realistic workloads and acknowledge the practical difficulties of embedding the examples into full applications while preserving representative execution contexts. Full performance deltas inside real workloads remain outside the current scope but are noted as valuable future work.
Revision: partial
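The profile statistics mentioned in this exchange (branch taken/not-taken counts and type histograms) could be gathered at source level along the following lines. This is a hedged sketch with hypothetical names (`ProfileRecorder`, `profileOf`); it counts what the program observes and is not the JVM's internal profile. Such counters would also perturb timing, so they belong in a separate characterization run rather than in the measured benchmark.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical source-level recorder for branch outcomes and receiver types.
final class ProfileRecorder {
    long taken;
    long notTaken;
    final Map<String, Long> receiverTypes = new TreeMap<>();

    // Records the outcome of a branch condition and passes it through unchanged.
    boolean branch(boolean condition) {
        if (condition) {
            taken++;
        } else {
            notTaken++;
        }
        return condition;
    }

    // Records the runtime class of a call receiver and passes it through unchanged.
    <T> T receiver(T obj) {
        receiverTypes.merge(obj.getClass().getSimpleName(), 1L, Long::sum);
        return obj;
    }

    // Fraction of recorded branches that were taken; NaN if none were recorded.
    double takenRatio() {
        long total = taken + notTaken;
        return total == 0 ? Double.NaN : (double) taken / total;
    }

    // Example use: profile a length check over a list of CharSequence values.
    static ProfileRecorder profileOf(List<? extends CharSequence> inputs, int minLength) {
        ProfileRecorder r = new ProfileRecorder();
        for (CharSequence cs : inputs) {
            int len = r.receiver(cs).length(); // virtual call on CharSequence
            r.branch(len >= minLength);        // data-dependent branch
        }
        return r;
    }
}
```

Running `profileOf` once on the benchmark inputs and once on inputs sampled from an application would give the two sets of counts that the referee asks to see side by side.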
Circularity Check
No circularity: empirical demonstration without derivations or self-referential fits
The paper is an empirical study demonstrating misleading results from microbenchmarks that induce atypical JVM profiles. The provided abstract and text contain no equations, derivations, fitted parameters, or mathematical claims. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on constructed examples and guideline extensions rather than reducing to its own inputs by construction. This matches the default expectation for non-derivational empirical work, warranting a score of 0 with no circular steps.