Misleading Microbenchmarks on the Java Virtual Machines

Filippo Schiavio; Lubomír Bulej; Walter Binder

arxiv: 2605.23570 · v1 · pith:S5DJVCVInew · submitted 2026-05-22 · 💻 cs.PL · cs.SE

Pith reviewed 2026-05-25 02:28 UTC · model grok-4.3

classification 💻 cs.PL cs.SE

keywords microbenchmarks · JVM · JMH · profile-driven compilation · performance measurement · dynamic optimization · misleading benchmarks

The pith

Microbenchmarks on the JVM can produce misleading performance results by inducing unrealistic compiler profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that microbenchmarks executed in isolation cause the JVM to collect profiles of branch probabilities and receiver types that differ from those in full applications. Even when developers follow JMH guidelines to avoid common measurement pitfalls, the benchmarked code runs without other application code competing for CPU resources such as cache or branch-predictor capacity, and the profiles collected in isolation can trigger speculative optimizations that would not occur in practice. This matters because developers rely on such benchmarks to select implementations, so the results can guide choices that perform worse under real conditions. The authors demonstrate the issue through examples and propose additional actions to make the collected profiles more representative.

Core claim

Using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. The speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution, which in isolation can include branch probabilities and receiver types that would not appear in a real application.

What carries the argument

Profile-driven dynamic compilation on the JVM, where early execution profiles determine aggressive optimizations that depend on context-specific data like branch probabilities and receiver types.
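To make the mechanism concrete, here is a minimal plain-Java sketch (no JMH) of how an isolated driver and a mixed driver can feed the same hot method very different profiles. All class and method names are hypothetical and do not come from the paper; the sketch only shows the shape of the inputs, not the optimizations a particular JIT would actually apply.

```java
// Illustrative sketch; every name here is hypothetical, not taken from the paper.
interface Encoder {
    int encode(int value);
}

final class PlainEncoder implements Encoder {
    @Override public int encode(int value) { return value; }
}

final class DoublingEncoder implements Encoder {
    @Override public int encode(int value) { return value * 2; }
}

final class NegatingEncoder implements Encoder {
    @Override public int encode(int value) { return -value; }
}

final class ProfileSketch {
    // Hot method under test. A profiling JIT may record how often the branch
    // is taken and which receiver types reach the encode(...) call site.
    static long sum(Encoder encoder, int[] data) {
        long total = 0;
        for (int v : data) {
            if (v >= 0) {
                total += encoder.encode(v);
            } else {
                total -= 1;
            }
        }
        return total;
    }

    // Isolated driver: a single receiver type and, given non-negative data,
    // a single branch direction.
    static long isolatedDriver(int[] data, int rounds) {
        Encoder encoder = new PlainEncoder();
        long total = 0;
        for (int r = 0; r < rounds; r++) {
            total += sum(encoder, data);
        }
        return total;
    }

    // Mixed driver: rotates three receiver types through the same call site,
    // which is closer to what a larger application might do.
    static long mixedDriver(int[] data, int rounds) {
        Encoder[] encoders = {
            new PlainEncoder(), new DoublingEncoder(), new NegatingEncoder()
        };
        long total = 0;
        for (int r = 0; r < rounds; r++) {
            total += sum(encoders[r % encoders.length], data);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(isolatedDriver(new int[] {1, 2, 3}, 10));
        System.out.println(mixedDriver(new int[] {1, -2, 3}, 3));
    }
}
```

With the isolated driver, the call site in `sum` only ever sees one receiver type and the branch only goes one way, which is the kind of profile the paper describes as unrealistic. The mixed driver is one assumed way to present several types and both branch directions to the same method.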

If this is right

Microbenchmark results may not reflect real-world performance due to context isolation.
Existing JMH guidelines are insufficient to guarantee representative profiles.
Developers should apply additional steps to make microbenchmark results more representative of full applications.
Performance choices based on such benchmarks risk selecting suboptimal implementations for actual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar profile issues could arise in other managed runtimes that use tiered JIT compilation.
Benchmark frameworks might incorporate synthetic load from other threads to better mimic resource contention.
Validation of collected profiles against production traces could become a standard check before trusting microbenchmark data.
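As a rough illustration of the synthetic-load idea above, the following plain-Java sketch runs a background thread that keeps touching a buffer while the measured code executes. The class names, the stride, and the buffer size are all assumptions made for this example; nothing here is part of JMH or of the paper. Note that a separate thread mainly competes for shared resources, so how much contention it creates depends on the hardware and on thread placement.

```java
// Illustrative sketch; BackgroundLoad and LoadDemo are hypothetical names.
final class BackgroundLoad implements AutoCloseable {
    private volatile boolean running = true;
    private volatile long sink; // publishes the worker's result so the loop is not dead code
    private final Thread worker;

    BackgroundLoad(int arrayLength) {
        if (arrayLength <= 0) {
            throw new IllegalArgumentException("arrayLength must be positive");
        }
        final long[] buffer = new long[arrayLength];
        worker = new Thread(() -> {
            long acc = 0;
            int i = 0;
            while (running) {
                buffer[i]++;
                acc += buffer[i];
                // Large stride (an arbitrary choice) so successive accesses
                // land on different cache lines.
                i = (int) ((i + 4099L) % buffer.length);
            }
            sink = acc;
        }, "background-load");
        worker.setDaemon(true);
        worker.start();
    }

    boolean isAlive() {
        return worker.isAlive();
    }

    long touched() {
        return sink;
    }

    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class LoadDemo {
    // Stand-in for the code being measured.
    static long work(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += i;
        }
        return s;
    }

    public static void main(String[] args) {
        try (BackgroundLoad load = new BackgroundLoad(1 << 20)) {
            System.out.println(work(1_000_000));
        }
    }
}
```

The try-with-resources form keeps the load running exactly as long as the measured region and stops it afterwards, so the same wrapper could in principle surround a measurement loop.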

Load-bearing premise

The unrealistic profiles induced in the demonstrated microbenchmarks are representative of common developer practices and lead to optimizations that materially differ from those in real applications.

What would settle it

Execute the same benchmarked methods inside a full application with competing threads and observe whether the JVM applies the same optimizations and produces the same relative performance ordering as in the isolated microbenchmark.

Figures

Figures reproduced from arXiv: 2605.23570 by Filippo Schiavio, Lubomír Bulej, Walter Binder.

**Figure 1.** Baseline implementation of hash code. Excerpt from the accompanying text (Section 5, "Case Study 1: Benchmarking a Single Function – hashCode()"): This case study focuses on a simple, but very common scenario: optimizing the performance of a short, stateless function. As an example, we show the computation of the hash code of a byte array. The functionality is simple, easy to understand, widely used (e.g., to compute the hash codes of strings), and an ac…

**Figure 2.** Alternative implementation of hash code.

**Figure 3.** Throughput of baseline (…

**Figure 4.** Throughput of baseline (…

**Figure 5.** Realistic traces on Dacapo-Chopin (DC) and Re…

**Figure 6.** Speedup of the Do-Nothing execution strategy over the Manual-Pollute one for the JEDI [37] benchmarks. Values…
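Figure 1 is described as a baseline hash code implementation for a byte array, but the listing itself is not reproduced on this page. As a point of reference only, a conventional polynomial hash in the style of `java.util.Arrays.hashCode(byte[])` might look like the sketch below. The class and method names are hypothetical, and this is an assumption about the general shape of such a baseline, not the authors' code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch; not the listing from the paper's Figure 1.
final class HashCodeSketch {
    // Polynomial hash with multiplier 31, following the documented
    // behaviour of java.util.Arrays.hashCode(byte[]).
    static int baselineHash(byte[] data) {
        if (data == null) {
            return 0;
        }
        int h = 1;
        for (byte b : data) {
            h = 31 * h + b;
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.US_ASCII);
        // Compares the sketch against the JDK implementation.
        System.out.println(baselineHash(sample) == Arrays.hashCode(sample));
    }
}
```

A function like this is short and stateless, which is what makes it a natural microbenchmark target, and its loop trip count depends entirely on the input lengths the benchmark chooses to feed it.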

Original abstract

Developers often use microbenchmarks to choose the most performant implementation of a method or a class. On the Java Virtual Machine (JVM), this is commonly done using the Java Microbenchmark Harness (JMH) which addresses common pitfalls of measuring code performance on the JVM. However, even using JMH guidelines cannot overcome the fundamental issue of context. Microbenchmarks inherently execute code in isolation, without interference from other application code competing for CPU resources, such as cache or branch-predictor capacity. On managed runtimes with tiered dynamic compilation, such as the JVM, the speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution. Because profiles usually include also branch probabilities and receiver types (besides code hotness metrics), a badly designed microbenchmark may cause the JVM to collect an unrealistic profile, resulting in aggressive, yet misleading, optimizations, that would not occur in a real application. In this paper, we demonstrate how using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. We also extend these guidelines by suggesting actions to make the microbenchmark results more representative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper demonstrates that JMH microbenchmarks can still trigger unrealistic JVM profiles leading to misleading optimizations, and offers some practical adjustments, though without direct real-app comparisons.

The main thing to know is that even careful use of JMH can produce microbenchmark results that don't match what the same code would do inside a real application, because the isolated run gives the JIT atypical branch or type profiles.

The authors walk through how this happens and show concrete cases where following the standard guidelines still leads to aggressive optimizations that wouldn't trigger otherwise. They also add some suggestions for making benchmarks more representative, such as adjusting inputs or adding context to better match expected profiles. That part is straightforward and directly useful for anyone who measures Java performance this way. The examples make the mechanism clear without overclaiming.

The soft spot is the lack of side-by-side profile data or performance numbers from the same methods running in larger workloads. The paper shows that microbenchmarks can induce odd profiles, but it doesn't quantify how often those profiles actually appear in practice or how much the resulting JIT decisions differ from real usage. That leaves the strength of the 'misleading' claim a bit open.

The work is aimed at people who write or review JVM microbenchmarks. It is a practical note rather than a broad theoretical advance, but the issue it flags is real enough that the suggestions could help avoid common measurement errors. The reasoning holds together on its own terms and engages honestly with the existing JMH literature.

I would send this to peer review. The observation is worth checking with referees who know the JIT internals, even if the paper needs more comparative data to tighten the conclusions.

Referee Report

1 major / 1 minor

Summary. The paper claims that microbenchmarks on the JVM, even when using JMH and following existing guidelines, can produce misleading results because isolation causes the collection of unrealistic profiles (branch probabilities and receiver types) that trigger aggressive, non-representative optimizations. It demonstrates this issue through constructed examples and extends the guidelines with suggestions to improve representativeness of results.

Significance. If the demonstrations and comparisons hold, this would be a useful contribution to performance engineering on managed runtimes, as it identifies a fundamental context-sensitivity in profile-driven JIT compilation that current microbenchmarking practices do not fully mitigate. The extended guidelines provide concrete, actionable advice that could directly benefit developers.

major comments (1)

[Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.

minor comments (1)

The abstract states that guidelines are extended, but the specific suggestions would benefit from being listed explicitly in a dedicated subsection or table for easier reference by readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review. The major comment identifies a valid point about strengthening the empirical support for the central claim. We respond point-by-point below and will revise the manuscript accordingly.

Referee: [Demonstration/Experiments] The central claim that induced profiles yield 'misleading' results requires evidence that the profiles and resulting optimizations materially differ from those in real applications. The described experiments construct microbenchmarks that trigger atypical profiles but do not report side-by-side profile statistics (branch taken/not-taken counts, type histograms) or performance deltas for the same methods executed inside realistic workloads. This comparison is load-bearing for the claim.

Authors: We agree that explicit side-by-side profile data would strengthen the presentation. Our experiments deliberately construct microbenchmarks that collect atypical profiles (e.g., 100% branch-taken probabilities or monomorphic receiver types) under JMH, triggering optimizations such as branch folding or devirtualization that are unlikely under mixed real-world inputs. The manuscript demonstrates the mechanism and resulting performance differences within the microbenchmark setting, but does not include direct comparisons against the same methods running inside larger applications.

In revision we will add the collected profile statistics (branch taken/not-taken counts and type histograms) from the JMH runs to make the atypical nature of the profiles explicit. We will also expand the discussion to contrast these profiles with those expected in realistic workloads and acknowledge the practical difficulties of embedding the examples into full applications while preserving representative execution contexts. Full performance deltas inside real workloads remain outside the current scope but are noted as valuable future work.

Revision: partial
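The profile statistics mentioned in this exchange (branch taken/not-taken counts and receiver type histograms) can be approximated at the application level with manual instrumentation. The sketch below is one assumed way to do that in plain Java; `ProfileStats` and its methods are hypothetical names, and the counters describe the benchmark's inputs rather than the JVM's internal profile data, which this code does not read.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch; ProfileStats is a hypothetical helper, not a JVM or JMH API.
final class ProfileStats {
    long taken;
    long notTaken;
    // Histogram of receiver class names observed at one call site.
    final Map<String, Long> receiverTypes = new TreeMap<>();

    // Record the outcome of one evaluation of the branch condition.
    void recordBranch(boolean condition) {
        if (condition) {
            taken++;
        } else {
            notTaken++;
        }
    }

    // Record the dynamic type of one receiver object.
    void recordReceiver(Object receiver) {
        receiverTypes.merge(receiver.getClass().getName(), 1L, Long::sum);
    }

    // Fraction of evaluations in which the branch was taken (0.0 if none recorded).
    double takenRatio() {
        long total = taken + notTaken;
        return total == 0 ? 0.0 : (double) taken / total;
    }

    public static void main(String[] args) {
        ProfileStats stats = new ProfileStats();
        int[] inputs = {5, 8, -1, 3};
        for (int v : inputs) {
            stats.recordBranch(v >= 0);
        }
        stats.recordReceiver("text");
        stats.recordReceiver(Integer.valueOf(7));
        System.out.println(stats.takenRatio() + " " + stats.receiverTypes);
    }
}
```

Collecting the same counters once from the microbenchmark inputs and once from a representative workload would give the side-by-side comparison the referee asks for. A taken ratio of exactly 0 or 1, or a histogram with a single entry, would flag the kind of atypical profile described in the rebuttal.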

Circularity Check

0 steps flagged

No circularity: empirical demonstration without derivations or self-referential fits

The paper is an empirical study demonstrating misleading results from microbenchmarks that induce atypical JVM profiles. The provided abstract and text contain no equations, derivations, fitted parameters, or mathematical claims. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on constructed examples and guideline extensions rather than reducing to its own inputs by construction. This matches the default expectation for non-derivational empirical work, warranting a score of 0 with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities; the work is an empirical observation of JVM behavior.

Reference graph

Works this paper leans on

[1] Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. 2017. Virtual Machine Warmup Blows Hot and Cold. Proc. ACM Program. Lang., 1, 52:1–52:27, OOPSLA. doi:10.1145/3133876

[2] S. M. Blackburn et al. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In Proc. 21st ACM/SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 169–190. doi:10.1145/1167473.1167488

[3] Stephen M Blackburn, Zixian Cai, Rui Chen, Xi Yang, John Zhang, and John Zigman. 2025. Rethinking Java performance analysis. In Proc. 30th ACM Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM. doi:10.1145/3669940.3707217

[4] Stephen M. Blackburn et al. 2016. The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations. ACM Trans. Program. Lang. Syst., 38, 4, 15:1–15:20. doi:10.1145/2983574

[5] Stephen M. Blackburn et al. 2008. Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century. Commun. ACM, 51, 8, (Aug. 2008), 83–89. doi:10.1145/1378704.1378723

[6] K Mani Chandy and Chittoor V Ramamoorthy. 2009. Rollback and recovery strategies for computer programs. IEEE Transactions on Computers, 6, 546–556

[7] Raymond Chen. 2023. Inside STL: The string. (Aug. 3, 2023). Retrieved Sept. 2025 from https://devblogs.microsoft.com/oldnewthing/20230803-00/?p=108532

[8] Yunji Chen, Shijin Zhang, Qi Guo, Ling Li, Ruiyang Wu, and Tianshi Chen. 2015. Deterministic replay: a survey. ACM Computing Surveys, 48, 2, 1–47

[9] Diego Costa, Artur Andrzejak, Janos Seboek, and David Lo. 2017. Empirical study of usage and performance of Java collections. In Proc. 8th ACM/SPEC on Intl. Conf. on Performance Engineering (ICPE), 389–400

[10] Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. 2019. What’s wrong with my benchmark results? Studying bad practices in JMH benchmarks. IEEE Transactions on Software Engineering, 47, 7, 1452–1467

[11] Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. 2013. Graal IR: an extensible declarative intermediate representation. In Proc. Asia-Pacific Programming Languages and Compilers Workshop, 1–9

[12] Eclipse Team. 2025. Eclipse Collections. https://github.com/eclipse-collections/eclipse-collections. (2025)

[13] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically rigorous Java performance evaluation. In Proc. 22nd ACM/SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 57–76. doi:10.1145/1297027.1297033

[14] Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. 2011. A microbenchmark case study and lessons learned. In Proc. Co-Located Workshops on DSM’11, TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11 (SPLASH Workshops). ACM, 297–308. doi:10.1145/2095050.2095100

[15] Brian Goetz. 2004. Java theory and practice: Dynamic compilation and performance measurement. The perils of benchmarking under dynamic compilation. IBM developerWorks. (Dec. 21, 2004). Retrieved Feb. 28, 2021 from http://www.ibm.com/developerworks/library/j-jtp12214/

[16] Urs Hölzle, Craig Chambers, and David Ungar. 1992. Debugging optimized code with dynamic deoptimization. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 32–43

[17] Urs Hölzle, Craig Chambers, and David Ungar. 1991. Optimizing dynamically-typed object-oriented languages with polymorphic inline caches. In European Conference on Object-oriented Programming (ECOOP). Springer, 21–38

[18] Vojtěch Horký, Peter Libič, Antonín Steinhauser, and Petr Tůma. 2015. DOs and DON’Ts of conducting performance measurements in Java. In Proc. 6th ACM/SPEC Intl. Conf. on Performance Engineering (ICPE). ACM, 337–340. doi:10.1145/2668930.2688820

[19] Nils Japke, Martin Grambow, Christoph Laaber, and David Bermbach. 2025. 𝜇OpTime: statically reducing the execution time of microbenchmark suites using stability metrics. Version 1. arXiv: 2501.12878 [cs]. Retrieved Sept. 2025 from http://arxiv.org/abs/2501.12878. Pre-published

[20] Tomáš Kalibera and Richard Jones. 2013. Rigorous benchmarking in reasonable time. ACM SIGPLAN Notices, 48, 11, 63–74. doi:10.1145/2555670.2464160

[21] Raffi Khatchadourian, Yiming Tang, Mehdi Bagherzadeh, and Baishakhi Ray. 2020. An empirical study on the use and misuse of Java 8 streams. In Fundamental Approaches to Software Engineering (LNCS). Vol. 12076. Springer, 97–118. doi:10.1007/978-3-030-45234-6_5

[22] Donald E. Knuth. 1998. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing

[23] Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software microbenchmarking in the cloud. How bad is it really? Empirical Software Engineering, 24, 4, 2469–2508. doi:10.1007/s10664-019-09681-1

[24] Júnior Löff, Filippo Schiavio, Andrea Rosà, Matteo Basso, and Walter Binder. 2024. Vectorized intrinsics can be replaced with pure Java code without impairing steady-state performance. In Proc. 15th ACM/SPEC Intl. Conf. on Performance Engineering (ICPE). ACM, 14–24. doi:10.1145/3629526.3645051

[25] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2009. Producing wrong data without doing anything obviously wrong! In Proc. 14th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 265–276. doi:10.1145/1508244.1508275

[26] Venkata Krishna Suhas Nerella, Swetha Surapaneni, Sanjay K Madria, and Thomas Weigert. 2014. Exploring optimization and caching for efficient collection operations. Automated Software Engineering, 21, 1, 3–40

[27] Joshua Nostas, Juan Pablo Sandoval Alcocer, Diego Elias Costa, and Alexandre Bergel. 2021. How do developers use the Java stream API? In Computational Science and its Applications (LNCS). Vol. 12955. Springer, 323–335. doi:10.1007/978-3-030-87007-2_23

[28] Oracle. 2025. Ergonomics. https://docs.oracle.com/en/java/javase/24/gctuning/ergonomics.html. (2025)

[29] Oracle. 2025. Initializing Fields. https://docs.oracle.com/javase/tutorial/java/javaOO/initial.html. (2025)

[30] Oracle. 2022. Java Software | Oracle. https://www.oracle.com/java/. (2022)

[31] Oracle. 2024. java.util.stream (Java SE 24; JDK 23). https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/stream/package-summary.html. (2024). Retrieved Jan. 31, 2024.

[32] Otmar Ertl (Dynatrace). 2025. Java hashing efficiency. https://www.dynatrace.com/news/blog/java-arrays-hashcode-byte-efficiency-techniques/. (2025)

[33] Julian Ponge. 2014. Avoiding benchmarking pitfalls on the jvm. Java Magazine. (Aug. 2014). Retrieved Sept. 2025 from https://www.oracle.com/technical-resources/articles/java/architect-benchmarking.html

[34] Aleksandar Prokopec, David Leopoldseder, Gilles Duboscq, and Thomas Würthinger. 2017. Making collection operations optimal with aggressive JIT compilation. In Proc. 8th ACM SIGPLAN Intl. Symp. on Scala, 29–40

[35] Aleksandar Prokopec et al. 2020. Renaissance: benchmarking suite for parallel applications on the JVM. In Software Engineering 2020, Fachtagung des GI-Fachbereichs Softwaretechnik (LNI). Vol. P-300. Gesellschaft für Informatik e.V., 145–146. doi:10.18420/SE2020_44

[36] Eduardo Rosales, Matteo Basso, Andrea Rosà, and Walter Binder. 2023. Large-scale characterization of Java streams. Softw. Pract. Exp., 53, 9, 1763–1792. doi:10.1002/SPE.3213

[37] Filippo Schiavio and Walter Binder. 2026. JEDI: Java evaluation of declarative and imperative queries – benchmarking the Java Stream API. In Proceedings of The 48th International Conference on Software Engineering. doi:10.1145/3744916.3773165

[38] Filippo Schiavio, Andrea Rosà, and Walter Binder. 2022. SQL to Stream with S2S: An Automatic Benchmark Generator for the Java Stream API. In Proc. 21st ACM/SIGPLAN Intl. Conf. on Generative Programming: Concepts and Experiences (GPCE). ACM, 179–186. doi:10.1145/3564719.3568699

[39] Aleksey Shipilëv. 2013. Java microbenchmark harness - the lesser of two evils. (2013). Retrieved Sept. 2025 from https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf

[40] Aleksey Shipilëv. 2014. Nanotrusting the Nanotime. (2014). Retrieved Sept. 2025 from https://shipilev.net/blog/2014/nanotrusting-nanotime/

[41] Aleksey Shipilëv and OpenJDK Community. 2013. Java Microbenchmarking Harness. (Nov. 2013). Retrieved Sept. 2025 from http://openjdk.java.net/projects/code-tools/jmh/

[42] SPEC. 1998. SpecJVM2008. https://www.spec.org/jvm2008/. (1998). Retrieved Jan. 31, 2024.

[43] SPEC. 2008. SpecJVM98. https://www.spec.org/jvm98/. (2008). Retrieved Jan. 31, 2024.

[44] Michael J Steindorfer and Jurgen J Vinju. 2015. Optimizing hash-array mapped tries for fast and lean immutable JVM collections. In Proc. ACM SIGPLAN Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 783–800

[45] TPC. 2024. TPC-H - Homepage. http://www.tpc.org/tpch/. (2024). Retrieved Jan. 31, 2024.

[46] Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci. 2022. Towards effective assessment of steady state performance in Java software. Are we there yet? Empirical Software Engineering, 28, 1, 13. doi:10.1007/s10664-022-10247-x

[47] Luca Traini, Federico Di Menna, and Vittorio Cortellessa. 2024. AI-driven Java performance testing: balancing result quality with testing time. In Proc. 39th IEEE/ACM Intl. Conf. on Automated Software Engineering (ASE). ACM, 443–454. doi:10.1145/3691620.3695017

[48] Antonio Trovato, Luca Traini, Federico Di Menna, and Dario Di Nucci. 2025. AMBER: AI-enabled Java microbenchmark harness. In Proc. IEEE Conf. on Software Testing, Verification and Validation (ICST), 762–766. doi:10.1109/ICST62969.2025.10988925

[49] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to rule them all. In Proc. ACM Intl. Symp. on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward!), 187–204

[1] [1]

Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. 2017. Virtual Machine Warmup Blows Hot and Cold.Proc. ACM Program. Lang., 1, 52:1–52:27, OOPSLA. doi:10.1145/3133876

work page doi:10.1145/3133876 2017

[2] [2]

S. M. Blackburn et al. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. InProc. 21st ACM/SIGPLAN Conf. on Object-Oriented Programing, Systems, Languages, and Applications(OOPSLA). ACM, 169–190. doi:http://doi.acm.org/10.1145/1167473.1167488

work page doi:10.1145/1167473.1167488 2006

[3] [3]

Stephen M Blackburn, Zixian Cai, Rui Chen, Xi Yang, John Zhang, and John Zig- man. 2025. Rethinking Java performance analysis. InProc. 30th ACM Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM. doi:10.1145/3669940.3707217

work page doi:10.1145/3669940.3707217 2025

[4] [4]

Blackburn et al

Stephen M. Blackburn et al. 2016. The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations.ACM Trans. Program. Lang. Syst., 38, 4, 15:1–15:20. doi:10.1145/2983574

work page doi:10.1145/2983574 2016

[5] [5]

Blackburn et al

Stephen M. Blackburn et al. 2008. Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century.Commun. ACM, 51, 8, (Aug. 2008), 83–89. doi:10.1145/1378704.1378723

work page doi:10.1145/1378704.1378723 2008

[6] [6]

K Mani Chandy and Chittoor V Ramamoorthy. 2009. Rollback and recovery strategies for computer programs.IEEE Transactions on computers, 6, 546–556

work page 2009

[7] [7]

Raymond Chen. 2023. Inside STL: The string. (Aug. 3, 2023). Retrieved Sept. 2025 from https://devblogs.microsoft.com/oldnewthing/20230803-00/?p=108532

work page arXiv 2023

[8] [8]

Yunji Chen, Shijin Zhang, Qi Guo, Ling Li, Ruiyang Wu, and Tianshi Chen

work page

[9] [9]

Deterministic replay: a survey.ACM Computing Surveys, 48, 2, 1–47

work page

[10] [10]

Diego Costa, Artur Andrzejak, Janos Seboek, and David Lo. 2017. Empirical study of usage and performance of Java collections. InProc. 8th ACM/SPEC on Intl. Conf. on Performance Engineering(ICPE), 389–400

work page 2017

[11] [11]

Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. 2019. What’s wrong with my benchmark results? Studying bad practices in JMH benchmarks.IEEE Transactions on Software Engineering, 47, 7, 1452–1467

work page 2019

[12] [12]

Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. 2013. Graal IR: an extensible declarative intermediate representation. InProc. Asia-Pacific Programming Languages and Compilers Workshop, 1–9

work page 2013

[13] [13]

Eclipse Team. 2025. Eclipse Collections. https://github.com/eclipse-collections /eclipse-collections. (2025)

work page 2025

[14] [14]

Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically rigorous Java performance evaluation. InProc. 22nd ACM/SIGPLAN Conf. on Object- Oriented Programing, Systems, Languages, and Applications(OOPSLA). ACM, 57–76. doi:10.1145/1297027.1297033

work page doi:10.1145/1297027.1297033 2007

[15] [15]

Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. 2011. A microbenchmark case study and lessons learned. InProc. Co-Located Workshops on DSM’11, TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11(SPLASH Workshops). ACM, 297–308. doi:10.1145/2095050.2095100

work page doi:10.1145/2095050.2095100 2011

[16] [16]

Brian Goetz. 2004. Java theory and practice: Dynamic compilation and perfor- mance measurement. The perils of benchmarking under dynamic compilation. IBM developerWorks. (Dec. 21, 2004). Retrieved Feb. 28, 2021 from http://www .ibm.com/developerworks/library/j-jtp12214/

work page 2004

[17] [17]

Urs Hölzle, Craig Chambers, and David Ungar. 1992. Debugging optimized code with dynamic deoptimization. InProc. ACM SIGPLAN Conf. on Programming Language Design and Implementation(PLDI), 32–43

work page 1992

[18] [18]

Urs Hölzle, Craig Chambers, and David Ungar. 1991. Optimizing dynamically- typed object-oriented languages with polymorphic inline caches. InEuropean Conference on Object-oriented Programming(ECOOP). Springer, 21–38

work page 1991

[19] [19]

Vojtěch Horký, Peter Libič, Antonín Steinhauser, and Petr Tůma. 2015. DOs and DON’Ts of conducting performance measurements in Java. InProc. 6th ACM/SPEC Intl. Conf. on Performance Engineering(ICPE). ACM, 337–340. doi:1 0.1145/2668930.2688820

work page arXiv 2015

[20] [20]

Nils Japke, Martin Grambow, Christoph Laaber, and David Bermbach. 2025. 𝜇OpTime: statically reducing the execution time of microbenchmark suites using stability metrics. Version 1. arXiv: 2501.12878 [cs]. Retrieved Sept. 2025 from http://arxiv.org/abs/2501.12878. Pre-published

work page arXiv 2025

[21] [21]

Tomáš Kalibera and Richard Jones. 2013. Rigorous benchmarking in reasonable time.ACM SIGPLAN Notices, 48, 11, 63–74. doi:10.1145/2555670.2464160

work page doi:10.1145/2555670.2464160 2013

[22] [22]

Raffi Khatchadourian, Yiming Tang, Mehdi Bagherzadeh, and Baishakhi Ray

work page

[23] [23]

InFundamen- tal Approaches to Software Engineering(LNCS)

An empirical study on the use and misuse of Java 8 streams. InFundamen- tal Approaches to Software Engineering(LNCS). Vol. 12076. Springer, 97–118. doi:10.1007/978-3-030-45234-6_5

work page doi:10.1007/978-3-030-45234-6_5

[24] [24]

Donald E. Knuth. 1998.The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing

work page 1998

[25] [25]

Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software microbench- marking in the cloud. How bad is it really?Empirical Software Engineering, 24, 4, 2469–2508. doi:10.1007/s10664-019-09681-1

work page doi:10.1007/s10664-019-09681-1 2019

[26] [26]

Júnior Löff, Filippo Schiavio, Andrea Rosà, Matteo Basso, and Walter Binder

work page

[27] [27]

Vectorized intrinsics can be replaced with pure Java code without impair- ing steady-state performance. InProc. 15th ACM/SPEC Intl. Conf. on Performance Engineering(ICPE). ACM, 14–24. doi:10.1145/3629526.3645051

work page doi:10.1145/3629526.3645051

[28] [28]

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney

work page

[29] [29]

14th Intl

Producing wrong data without doing anything obviously wrong! In Proc. 14th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems(ASPLOS). ACM, 265–276. doi:10.1145/1508244.1508275

work page doi:10.1145/1508244.1508275

[30] [30]

Venkata Krishna Suhas Nerella, Swetha Surapaneni, Sanjay K Madria, and Thomas Weigert. 2014. Exploring optimization and caching for efficient collec- tion operations.Automated Software Engineering, 21, 1, 3–40

work page 2014

[31] [31]

Joshua Nostas, Juan Pablo Sandoval Alcocer, Diego Elias Costa, and Alexandre Bergel. 2021. How do developers use the Java stream API? InComputational Science and its Applications(LNCS). Vol. 12955. Springer, 323–335. doi:10.1007 /978-3-030-87007-2_23

work page 2021

[32] [32]

Oracle. 2025. Ergonomics. https://docs.oracle.com/en/java/javase/24/gctuning /ergonomics.html. (2025)

work page 2025

[33]

Oracle. 2025. Initializing Fields. https://docs.oracle.com/javase/tutorial/java/javaOO/initial.html. (2025)

[34]

Oracle. 2022. Java Software | Oracle. https://www.oracle.com/java/. (2022)

[35]

Oracle. 2024. java.util.stream (Java SE 24; JDK 23). (2024). Retrieved Jan. 31, 2024 from https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/stream/package-summary.html

[36]

Otmar Ertl (Dynatrace). 2025. Java hashing efficiency. https://www.dynatrace.com/news/blog/java-arrays-hashcode-byte-efficiency-techniques/. (2025)

[37]

Julien Ponge. 2014. Avoiding benchmarking pitfalls on the JVM. Java Magazine. (Aug. 2014). Retrieved Sept. 2025 from https://www.oracle.com/technical-resources/articles/java/architect-benchmarking.html

[38]

Aleksandar Prokopec, David Leopoldseder, Gilles Duboscq, and Thomas Würthinger. 2017. Making collection operations optimal with aggressive JIT compilation. In Proc. 8th ACM SIGPLAN Intl. Symp. on Scala, 29–40

[39]

Aleksandar Prokopec et al. 2020. Renaissance: benchmarking suite for parallel applications on the JVM. In Software Engineering 2020, Fachtagung des GI-Fachbereichs Softwaretechnik (LNI). Vol. P-300. Gesellschaft für Informatik e.V., 145–146. doi:10.18420/SE2020_44

[40]

Eduardo Rosales, Matteo Basso, Andrea Rosà, and Walter Binder. 2023. Large-scale characterization of Java streams. Softw. Pract. Exp., 53, 9, 1763–1792. doi:10.1002/SPE.3213

[41]

Filippo Schiavio and Walter Binder. 2026. JEDI: Java evaluation of declarative and imperative queries – benchmarking the Java Stream API. In Proceedings of The 48th International Conference on Software Engineering. doi:10.1145/3744916.3773165

[42]

Filippo Schiavio, Andrea Rosà, and Walter Binder. 2022. SQL to Stream with S2S: An Automatic Benchmark Generator for the Java Stream API. In Proc. 21st ACM/SIGPLAN Intl. Conf. on Generative Programming: Concepts and Experiences (GPCE). ACM, 179–186. doi:10.1145/3564719.3568699

[43]

Aleksey Shipilëv. 2013. Java microbenchmark harness - the lesser of two evils. (2013). Retrieved Sept. 2025 from https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf

[44]

Aleksey Shipilëv. 2014. Nanotrusting the Nanotime. (2014). Retrieved Sept. 2025 from https://shipilev.net/blog/2014/nanotrusting-nanotime/

[45]

Aleksey Shipilëv and OpenJDK Community. 2013. Java Microbenchmarking Harness. (Nov. 2013). Retrieved Sept. 2025 from http://openjdk.java.net/projects/code-tools/jmh/

[46]

SPEC. 2008. SpecJVM2008. (2008). Retrieved Jan. 31, 2024 from https://www.spec.org/jvm2008/

[47]

SPEC. 1998. SpecJVM98. (1998). Retrieved Jan. 31, 2024 from https://www.spec.org/jvm98/

[48]

Michael J Steindorfer and Jurgen J Vinju. 2015. Optimizing hash-array mapped tries for fast and lean immutable JVM collections. In Proc. ACM SIGPLAN Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 783–800

[49]

TPC. 2024. TPC-H - Homepage. (2024). Retrieved Jan. 31, 2024 from http://www.tpc.org/tpch/

[50]

Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci. 2022. Towards effective assessment of steady state performance in Java software. Are we there yet? Empirical Software Engineering, 28, 1, 13. doi:10.1007/s10664-022-10247-x

[51]

Luca Traini, Federico Di Menna, and Vittorio Cortellessa. 2024. AI-driven Java performance testing: balancing result quality with testing time. In Proc. 39th IEEE/ACM Intl. Conf. on Automated Software Engineering (ASE). ACM, 443–454. doi:10.1145/3691620.3695017

[52]

Antonio Trovato, Luca Traini, Federico Di Menna, and Dario Di Nucci. 2025. AMBER: AI-enabled Java microbenchmark harness. In Proc. IEEE Conf. on Software Testing, Verification and Validation (ICST), 762–766. doi:10.1109/ICST62969.2025.10988925

[53]

Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to rule them all. In Proc. ACM Intl. Symp. on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward!), 187–204