arxiv: 2605.23570 · v1 · pith:S5DJVCVInew · submitted 2026-05-22 · 💻 cs.PL · cs.SE

Misleading Microbenchmarks on the Java Virtual Machines

Pith reviewed 2026-05-25 02:28 UTC · model grok-4.3

classification 💻 cs.PL cs.SE
keywords microbenchmarks · JVM · JMH · profile-driven compilation · performance measurement · dynamic optimization · misleading benchmarks

The pith

Microbenchmarks on the JVM can produce misleading performance results by inducing unrealistic compiler profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that microbenchmarks executed in isolation cause the JVM to collect profiles of branch probabilities and receiver types that differ from those in full applications. Even when developers follow JMH guidelines to avoid common measurement pitfalls, the benchmarked code runs without the rest of the application, so the collected profiles can trigger speculative optimizations that would not occur in practice. This matters because developers rely on such benchmarks to select implementations, so the results can guide choices that perform worse under real conditions. The authors demonstrate the issue through examples and propose additional actions to make the collected profiles more representative.

Core claim

Using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. The speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution, which in isolation can include branch probabilities and receiver types that would not appear in a real application.

What carries the argument

Profile-driven dynamic compilation on the JVM, where early execution profiles determine aggressive optimizations that depend on context-specific data like branch probabilities and receiver types.
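
To make the receiver-type part of this concrete, here is a minimal plain-Java sketch (no JMH dependency). It is not taken from the paper: the class `ReceiverProfileSketch`, the `Shape` hierarchy, and the helper names are all hypothetical. The single virtual call `s.area()` sees only one receiver type when fed the `monomorphic` input, which is the kind of profile an isolated microbenchmark tends to produce, and three types when fed the `mixed` input, which is closer to what a shared call site might see in an application. The sketch shows only how the two inputs differ; which optimizations a given JIT compiler actually applies is not something this code can demonstrate.

```java
// Hypothetical illustration; none of these names come from the paper.
final class ReceiverProfileSketch {

    interface Shape {
        int area();
    }

    static final class Square implements Shape {
        private final int side;
        Square(int side) { this.side = side; }
        @Override public int area() { return side * side; }
    }

    static final class Rect implements Shape {
        private final int w;
        private final int h;
        Rect(int w, int h) { this.w = w; this.h = h; }
        @Override public int area() { return w * h; }
    }

    static final class RightTriangle implements Shape {
        private final int base;
        private final int height;
        RightTriangle(int base, int height) { this.base = base; this.height = height; }
        @Override public int area() { return base * height / 2; }
    }

    // The virtual call s.area() is the call site whose receiver types a
    // profiling runtime would record.
    static long totalArea(Shape[] shapes) {
        long sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    // Input with a single receiver type, as an isolated benchmark might use.
    static Shape[] monomorphic(int n) {
        Shape[] out = new Shape[n];
        for (int i = 0; i < n; i++) {
            out[i] = new Square(2);
        }
        return out;
    }

    // Input that cycles through three receiver types at the same call site.
    static Shape[] mixed(int n) {
        Shape[] out = new Shape[n];
        for (int i = 0; i < n; i++) {
            switch (i % 3) {
                case 0: out[i] = new Square(2); break;
                case 1: out[i] = new Rect(2, 3); break;
                default: out[i] = new RightTriangle(4, 3); break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(totalArea(monomorphic(6)));
        System.out.println(totalArea(mixed(6)));
    }
}
```

One plausible use, under the same assumptions, is to run the call site on the `mixed` input before measuring it on the input of interest, so that the measured code was not compiled against a single-type profile.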

If this is right

  • Microbenchmark results may not reflect real-world performance due to context isolation.
  • Existing JMH guidelines are insufficient to guarantee representative profiles.
  • Developers should apply additional steps to make microbenchmark results more representative of full applications.
  • Performance choices based on such benchmarks risk selecting suboptimal implementations for actual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar profile issues could arise in other managed runtimes that use tiered JIT compilation.
  • Benchmark frameworks might incorporate synthetic load from other threads to better mimic resource contention.
  • Validation of collected profiles against production traces could become a standard check before trusting microbenchmark data.

Load-bearing premise

The unrealistic profiles induced in the demonstrated microbenchmarks are representative of common developer practices and lead to optimizations that materially differ from those in real applications.

What would settle it

Execute the same benchmarked methods inside a full application with competing threads and observe whether the JVM applies the same optimizations and produces the same relative performance ordering as in the isolated microbenchmark.
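
The "same relative performance ordering" criterion can be written down as a small check. The sketch below is an assumption-laden illustration, not part of the paper: `OrderingCheckSketch` and its methods are hypothetical names, the inputs are throughput figures (higher is better) keyed by implementation name, and measurement noise and confidence intervals are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not an API from the paper or from JMH.
final class OrderingCheckSketch {

    // Implementation names sorted from highest to lowest throughput.
    // Ties are broken by name so the result is deterministic.
    static List<String> ranking(Map<String, Double> throughput) {
        List<String> names = new ArrayList<>(throughput.keySet());
        names.sort(
            Comparator.comparingDouble((String n) -> throughput.get(n))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return names;
    }

    // True when the isolated microbenchmark and the in-application run
    // rank the implementations identically.
    static boolean sameOrdering(Map<String, Double> isolated,
                                Map<String, Double> embedded) {
        return ranking(isolated).equals(ranking(embedded));
    }
}
```

A real comparison would also need to decide when two throughput values are statistically indistinguishable; this sketch treats any difference as significant.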

Figures

Figures reproduced from arXiv: 2605.23570 by Filippo Schiavio, Lubomír Bulej, Walter Binder.

Figure 1: Baseline implementation of hash code.
Section 5, Case Study 1: Benchmarking a Single Function – hashCode(). This case study focuses on a simple, but very common scenario: optimizing the performance of a short, stateless function. As an example, we show the computation of the hash code of a byte array. The functionality is simple, easy to understand, widely used (e.g., to compute the hash codes of strings), and an ac…

Figure 2: Alternative implementation of hash code.

Figure 3: Throughput of baseline (…

Figure 4: Throughput of baseline (…

Figure 5: Realistic traces on Dacapo-Chopin (DC) and Re…

Figure 6: Speedup of the Do-Nothing execution strategy over the Manual-Pollute one for the JEDI [37] benchmarks. Values…
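
The figures themselves are not reproduced on this page, so the following is only a sketch of the kind of code Case Study 1 is about: computing the hash code of a byte array. It assumes the baseline resembles the polynomial hash used by `java.util.Arrays.hashCode(byte[])`. The `unrolled` variant is a hypothetical alternative chosen for illustration (a four-way unrolled loop with precomputed powers of 31) and is not claimed to be the paper's Figure 2. Both methods return the same value as `Arrays.hashCode` for non-null input.

```java
// Hypothetical sketch; class and method names are not from the paper.
final class ByteHashSketch {

    // Polynomial hash: h = 31 * h + b for each byte, starting from 1.
    // This matches the result of java.util.Arrays.hashCode(byte[]) for
    // non-null arrays.
    static int baseline(byte[] data) {
        int h = 1;
        for (byte b : data) {
            h = 31 * h + b;
        }
        return h;
    }

    // Same function, processing four bytes per iteration.
    // 31^2 = 961, 31^3 = 29791, 31^4 = 923521; int overflow wraps around
    // identically in both versions, so the results agree.
    static int unrolled(byte[] data) {
        int h = 1;
        int i = 0;
        int limit = data.length - 3;
        for (; i < limit; i += 4) {
            h = 923521 * h
                + 29791 * data[i]
                + 961 * data[i + 1]
                + 31 * data[i + 2]
                + data[i + 3];
        }
        // Handle the remaining zero to three bytes.
        for (; i < data.length; i++) {
            h = 31 * h + data[i];
        }
        return h;
    }
}
```

Two implementations like these are functionally interchangeable, which is what makes them a natural target for a microbenchmark comparison and, per the paper's argument, for misleading conclusions if the benchmark inputs are unrepresentative.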
Original abstract

Developers often use microbenchmarks to choose the most performant implementation of a method or a class. On the Java Virtual Machine (JVM), this is commonly done using the Java Microbenchmark Harness (JMH), which addresses common pitfalls of measuring code performance on the JVM. However, even following JMH guidelines cannot overcome the fundamental issue of context. Microbenchmarks inherently execute code in isolation, without interference from other application code competing for CPU resources, such as cache or branch-predictor capacity. On managed runtimes with tiered dynamic compilation, such as the JVM, the speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution. Because profiles usually also include branch probabilities and receiver types (besides code hotness metrics), a badly designed microbenchmark may cause the JVM to collect an unrealistic profile, resulting in aggressive, yet misleading, optimizations that would not occur in a real application. In this paper, we demonstrate how using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. We also extend these guidelines by suggesting actions to make the microbenchmark results more representative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that microbenchmarks on the JVM, even when using JMH and following existing guidelines, can produce misleading results because isolation causes the collection of unrealistic profiles (branch probabilities and receiver types) that trigger aggressive, non-representative optimizations. It demonstrates this issue through constructed examples and extends the guidelines with suggestions to improve representativeness of results.

Significance. If the demonstrations and comparisons hold, this would be a useful contribution to performance engineering on managed runtimes, as it identifies a fundamental context-sensitivity in profile-driven JIT compilation that current microbenchmarking practices do not fully mitigate. The extended guidelines provide concrete, actionable advice that could directly benefit developers.

major comments (1)
  1. [Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.
minor comments (1)
  1. The abstract states that guidelines are extended, but the specific suggestions would benefit from being listed explicitly in a dedicated subsection or table for easier reference by readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review. The major comment identifies a valid point about strengthening the empirical support for the central claim. We respond point-by-point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.

    Authors: We agree that explicit side-by-side profile data would strengthen the presentation. Our experiments deliberately construct microbenchmarks that collect atypical profiles (e.g., 100% branch-taken probabilities or monomorphic receiver types) under JMH, triggering optimizations such as branch folding or devirtualization that are unlikely under mixed real-world inputs. The manuscript demonstrates the mechanism and resulting performance differences within the microbenchmark setting, but does not include direct comparisons against the same methods running inside larger applications. In revision we will add the collected profile statistics (branch taken/not-taken counts and type histograms) from the JMH runs to make the atypical nature of the profiles explicit. We will also expand the discussion to contrast these profiles with those expected in realistic workloads and acknowledge the practical difficulties of embedding the examples into full applications while preserving representative execution contexts. Full performance deltas inside real workloads remain outside the current scope but are noted as valuable future work.

    Revision: partial
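
The branch taken/not-taken counts discussed in this exchange can be approximated at source level for a given input. The sketch below is an illustration under stated assumptions: `BranchProfileSketch` and the predicate `b >= 0` are hypothetical, and the counts are computed by explicit instrumentation, not read from the JVM's internal profile. It shows how an all-ASCII benchmark input yields a 100% branch-taken ratio while a mixed input does not.

```java
// Hypothetical instrumentation; not the JVM's own profiling mechanism.
final class BranchProfileSketch {

    // Returns {taken, notTaken} for the branch `b >= 0` over the input.
    static long[] branchCounts(byte[] input) {
        long taken = 0;
        long notTaken = 0;
        for (byte b : input) {
            if (b >= 0) {
                taken++;
            } else {
                notTaken++;
            }
        }
        return new long[] { taken, notTaken };
    }

    // Fraction of iterations in which the branch was taken (0.0 for empty input).
    static double takenRatio(byte[] input) {
        long[] counts = branchCounts(input);
        long total = counts[0] + counts[1];
        return total == 0 ? 0.0 : (double) counts[0] / total;
    }
}
```

Comparing such ratios between benchmark inputs and inputs recorded from a realistic workload would be one simple way to present the side-by-side statistics the referee asks for.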

Circularity Check

0 steps flagged

No circularity: empirical demonstration without derivations or self-referential fits

The paper is an empirical study demonstrating misleading results from microbenchmarks that induce atypical JVM profiles. The provided abstract and text contain no equations, derivations, fitted parameters, or mathematical claims. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on constructed examples and guideline extensions rather than reducing to its own inputs by construction. This matches the default expectation for non-derivational empirical work, warranting a score of 0 with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities; the work is an empirical observation of JVM behavior.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

  1. [1]

    Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. 2017. Virtual Machine Warmup Blows Hot and Cold. Proc. ACM Program. Lang., 1, 52:1–52:27, OOPSLA. doi:10.1145/3133876

  2. [2]

    S. M. Blackburn et al. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In Proc. 21st ACM/SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 169–190. doi:10.1145/1167473.1167488

  3. [3]

    Stephen M Blackburn, Zixian Cai, Rui Chen, Xi Yang, John Zhang, and John Zigman. 2025. Rethinking Java performance analysis. In Proc. 30th ACM Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM. doi:10.1145/3669940.3707217

  4. [4]

    Stephen M. Blackburn et al. 2016. The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations. ACM Trans. Program. Lang. Syst., 38, 4, 15:1–15:20. doi:10.1145/2983574

  5. [5]

    Stephen M. Blackburn et al. 2008. Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century. Commun. ACM, 51, 8, (Aug. 2008), 83–89. doi:10.1145/1378704.1378723

  6. [6]

    K Mani Chandy and Chittoor V Ramamoorthy. 2009. Rollback and recovery strategies for computer programs. IEEE Transactions on Computers, 6, 546–556

  7. [7]

    Raymond Chen. 2023. Inside STL: The string. (Aug. 3, 2023). Retrieved Sept. 2025 from https://devblogs.microsoft.com/oldnewthing/20230803-00/?p=108532

  8. [8]

    Yunji Chen, Shijin Zhang, Qi Guo, Ling Li, Ruiyang Wu, and Tianshi Chen. Deterministic replay: a survey. ACM Computing Surveys, 48, 2, 1–47

  10. [10]

    Diego Costa, Artur Andrzejak, Janos Seboek, and David Lo. 2017. Empirical study of usage and performance of Java collections. In Proc. 8th ACM/SPEC on Intl. Conf. on Performance Engineering (ICPE), 389–400

  11. [11]

    Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. 2019. What’s wrong with my benchmark results? Studying bad practices in JMH benchmarks. IEEE Transactions on Software Engineering, 47, 7, 1452–1467

  12. [12]

    Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. 2013. Graal IR: an extensible declarative intermediate representation. In Proc. Asia-Pacific Programming Languages and Compilers Workshop, 1–9

  13. [13]

    Eclipse Team. 2025. Eclipse Collections. https://github.com/eclipse-collections/eclipse-collections. (2025)

  14. [14]

    Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically rigorous Java performance evaluation. In Proc. 22nd ACM/SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 57–76. doi:10.1145/1297027.1297033

  15. [15]

    Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. 2011. A microbenchmark case study and lessons learned. In Proc. Co-Located Workshops on DSM’11, TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11 (SPLASH Workshops). ACM, 297–308. doi:10.1145/2095050.2095100

  16. [16]

    Brian Goetz. 2004. Java theory and practice: Dynamic compilation and performance measurement. The perils of benchmarking under dynamic compilation. IBM developerWorks. (Dec. 21, 2004). Retrieved Feb. 28, 2021 from http://www.ibm.com/developerworks/library/j-jtp12214/

  17. [17]

    Urs Hölzle, Craig Chambers, and David Ungar. 1992. Debugging optimized code with dynamic deoptimization. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 32–43

  18. [18]

    Urs Hölzle, Craig Chambers, and David Ungar. 1991. Optimizing dynamically-typed object-oriented languages with polymorphic inline caches. In European Conference on Object-oriented Programming (ECOOP). Springer, 21–38

  19. [19]

    Vojtěch Horký, Peter Libič, Antonín Steinhauser, and Petr Tůma. 2015. DOs and DON’Ts of conducting performance measurements in Java. In Proc. 6th ACM/SPEC Intl. Conf. on Performance Engineering (ICPE). ACM, 337–340. doi:10.1145/2668930.2688820

  20. [20]

    Nils Japke, Martin Grambow, Christoph Laaber, and David Bermbach. 2025. 𝜇OpTime: statically reducing the execution time of microbenchmark suites using stability metrics. Version 1. arXiv: 2501.12878 [cs]. Retrieved Sept. 2025 from http://arxiv.org/abs/2501.12878. Pre-published

  21. [21]

    Tomáš Kalibera and Richard Jones. 2013. Rigorous benchmarking in reasonable time. ACM SIGPLAN Notices, 48, 11, 63–74. doi:10.1145/2555670.2464160

  22. [22]

    Raffi Khatchadourian, Yiming Tang, Mehdi Bagherzadeh, and Baishakhi Ray. An empirical study on the use and misuse of Java 8 streams. In Fundamental Approaches to Software Engineering (LNCS). Vol. 12076. Springer, 97–118. doi:10.1007/978-3-030-45234-6_5

  24. [24]

    Donald E. Knuth. 1998. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing

  25. [25]

    Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software microbenchmarking in the cloud. How bad is it really? Empirical Software Engineering, 24, 4, 2469–2508. doi:10.1007/s10664-019-09681-1

  26. [26]

    Júnior Löff, Filippo Schiavio, Andrea Rosà, Matteo Basso, and Walter Binder. Vectorized intrinsics can be replaced with pure Java code without impairing steady-state performance. In Proc. 15th ACM/SPEC Intl. Conf. on Performance Engineering (ICPE). ACM, 14–24. doi:10.1145/3629526.3645051

  28. [28]

    Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! In Proc. 14th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 265–276. doi:10.1145/1508244.1508275

  30. [30]

    Venkata Krishna Suhas Nerella, Swetha Surapaneni, Sanjay K Madria, and Thomas Weigert. 2014. Exploring optimization and caching for efficient collection operations. Automated Software Engineering, 21, 1, 3–40

  31. [31]

    Joshua Nostas, Juan Pablo Sandoval Alcocer, Diego Elias Costa, and Alexandre Bergel. 2021. How do developers use the Java stream API? In Computational Science and its Applications (LNCS). Vol. 12955. Springer, 323–335. doi:10.1007/978-3-030-87007-2_23

  32. [32]

    Oracle. 2025. Ergonomics. https://docs.oracle.com/en/java/javase/24/gctuning/ergonomics.html. (2025)

  33. [33]

    Oracle. 2025. Initializing Fields. https://docs.oracle.com/javase/tutorial/java/javaOO/initial.html. (2025)

  34. [34]

    Oracle. 2022. Java Software | Oracle. https://www.oracle.com/java/. (2022)

  35. [35]

    Oracle. 2024. java.util.stream (Java SE 24; JDK 23). https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/stream/package-summary.html. (2024). Retrieved Jan. 31, 2024

  36. [36]

    Otmar Ertl (Dynatrace). 2025. Java hashing efficiency. https://www.dynatrace.com/news/blog/java-arrays-hashcode-byte-efficiency-techniques/. (2025)

  37. [37]

    Julian Ponge. 2014. Avoiding benchmarking pitfalls on the jvm. Java Magazine. (Aug. 2014). Retrieved Sept. 2025 from https://www.oracle.com/technical-resources/articles/java/architect-benchmarking.html

  38. [38]

    Aleksandar Prokopec, David Leopoldseder, Gilles Duboscq, and Thomas Würthinger. 2017. Making collection operations optimal with aggressive JIT compilation. In Proc. 8th ACM SIGPLAN Intl. Symp. on Scala, 29–40

  39. [39]

    Aleksandar Prokopec et al. 2020. Renaissance: benchmarking suite for parallel applications on the JVM. In Software Engineering 2020, Fachtagung des GI-Fachbereichs Softwaretechnik (LNI). Vol. P-300. Gesellschaft für Informatik e.V., 145–146. doi:10.18420/SE2020_44

  40. [40]

    Eduardo Rosales, Matteo Basso, Andrea Rosà, and Walter Binder. 2023. Large-scale characterization of Java streams. Softw. Pract. Exp., 53, 9, 1763–1792. doi:10.1002/SPE.3213

  41. [41]

    Filippo Schiavio and Walter Binder. 2026. JEDI: Java evaluation of declarative and imperative queries – benchmarking the Java Stream API. In Proceedings of The 48th International Conference on Software Engineering. doi:10.1145/3744916.3773165

  42. [42]

    Filippo Schiavio, Andrea Rosà, and Walter Binder. 2022. SQL to Stream with S2S: An Automatic Benchmark Generator for the Java Stream API. In Proc. 21st ACM/SIGPLAN Intl. Conf. on Generative Programming: Concepts and Experiences (GPCE). ACM, 179–186. doi:10.1145/3564719.3568699

  43. [43]

    Aleksey Shipilëv. 2013. Java microbenchmark harness - the lesser of two evils. (2013). Retrieved Sept. 2025 from https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf

  44. [44]

    Aleksey Shipilëv. 2014. Nanotrusting the Nanotime. (2014). Retrieved Sept. 2025 from https://shipilev.net/blog/2014/nanotrusting-nanotime/

  45. [45]

    Aleksey Shipilëv and OpenJDK Community. 2013. Java Microbenchmarking Harness. (Nov. 2013). Retrieved Sept. 2025 from http://openjdk.java.net/projects/code-tools/jmh/

  46. [46]

    SPEC. 1998. SpecJVM2008. https://www.spec.org/jvm2008/. (1998). Retrieved Jan. 31, 2024

  47. [47]

    SPEC. 2008. SpecJVM98. https://www.spec.org/jvm98/. (2008). Retrieved Jan. 31, 2024

  48. [48]

    Michael J Steindorfer and Jurgen J Vinju. 2015. Optimizing hash-array mapped tries for fast and lean immutable JVM collections. In Proc. ACM SIGPLAN Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 783–800

  49. [49]

    TPC. 2024. TPC-H - Homepage. http://www.tpc.org/tpch/. (2024). Retrieved Jan. 31, 2024

  50. [50]

    Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci. 2022. Towards effective assessment of steady state performance in Java software. Are we there yet? Empirical Software Engineering, 28, 1, 13. doi:10.1007/s10664-022-10247-x

  51. [51]

    Luca Traini, Federico Di Menna, and Vittorio Cortellessa. 2024. AI-driven Java performance testing: balancing result quality with testing time. In Proc. 39th IEEE/ACM Intl. Conf. on Automated Software Engineering (ASE). ACM, 443–454. doi:10.1145/3691620.3695017

  52. [52]

    Antonio Trovato, Luca Traini, Federico Di Menna, and Dario Di Nucci. 2025. AMBER: AI-enabled Java microbenchmark harness. In Proc. IEEE Conf. on Software Testing, Verification and Validation (ICST), 762–766. doi:10.1109/ICST62969.2025.10988925

  53. [53]

    Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. One vm to rule them all. In Proc. ACM Intl. Symp. on New ideas, New Paradigms, and Reflections on Programming & Software (Onward!), 187–204