pith. sign in

arxiv: 2606.31238 · v1 · pith:44K47XQEnew · submitted 2026-06-30 · 💻 cs.SE · cs.PF· cs.PL

A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline

Pith reviewed 2026-07-01 04:50 UTC · model grok-4.3

classification 💻 cs.SE cs.PFcs.PL
keywords LLVM optimization pipelinephase orderingper-pass analysisPolyBench kernelsperformance regressionphase interferencePareto dominanceempirical measurement
0
0 comments X

The pith

The LLVM -O3 pipeline is non-monotone with regressions in 6.6-9.7% of transitions and strongly back-loaded on compute-bound kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes the LLVM -O3 pipeline into cumulative per-pass prefixes and measures their effects on execution time, size, energy, and hardware counters across 84,750 runs on 30 PolyBench/C kernels. It establishes that performance sometimes worsens after a pass is added and that the median kernel achieves 80% of its total speedup only after 84.8% of the pipeline has run. A small core of passes accounts for most gains while the final -O3 configuration is Pareto-dominated on the size-speedup plane for 29 of the 30 kernels. The authors also derive a search-free idealized-additive upper bound of 46.35% on performance losses caused by phase interference among passes.

Core claim

By evaluating 113 cumulative prefixes of the -O3 pipeline on 30 PolyBench/C kernels under controlled noise mitigation, the pipeline exhibits non-monotonicity (6.6-9.7% of transitions regress) and strong back-loading (median non-regressing kernel needs 84.8% of the pipeline for 80% of speedup). Most gains come from a small Pareto-dominant set of passes, the final -O3 point is Pareto-dominated on (size, speedup) for 29 of 30 kernels, IR instruction count fails to predict runtime, runtime passes also deliver 30-60% energy savings, and the idealized-additive upper bound on losses from phase interference is 46.35%.

What carries the argument

The cumulative per-pass prefix decomposition of the -O3 pipeline that isolates the marginal contribution of each pass through repeated execution-time, size, energy, and counter measurements.

If this is right

  • Pass selection and pruning can focus on the small Pareto-dominant core rather than the entire pipeline.
  • Cost models and autotuners must account for non-monotonic effects rather than assume additive gains.
  • IR instruction count cannot be used as a reliable proxy for runtime improvements.
  • Passes chosen for runtime also produce substantial energy reductions without separate targeting.
  • The 46.35% idealized upper bound on interference losses sets a concrete target for improved phase ordering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-pass prefix measurements could be applied to other compiler backends or to MLIR dialects to test whether the same back-loading pattern appears.
  • The Pareto dominance of intermediate points suggests that production compilers could expose earlier stopping points as additional optimization levels.
  • If the interference bound is tight, search-based phase ordering has a quantifiable headroom to improve on the current static -O3 sequence.

Load-bearing premise

The 30 PolyBench/C kernels and the chosen noise-mitigation protocol are representative enough that the observed non-monotonicity, back-loading, and 46.35% interference bound generalize beyond this benchmark set and measurement setup.

What would settle it

Re-running the full per-pass prefix experiment on a different benchmark suite or compiler version and observing monotonic improvement with no regressions or different back-loading percentages would falsify the reported non-monotonicity and back-loading claims.

Figures

Figures reproduced from arXiv: 2606.31238 by Federico Bruzzone, Walter Cazzola.

Figure 1
Figure 1. Figure 1: Runtime evolution and optimization saturation. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Marginal impact analysis and top-20 passes ranking. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cohen’s d between consecutive variants for the top-10 passes, clipped to [−3, +3]. ‡: |d| > 3. ∗: Wilcoxon p < 0.05. 0 20 40 60 80 100 Pass Index 14 15 16 17 Binary Size (KB) 50 55 60 65 70 75 Compile Time (ms) [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Compile time and binary size trajectories. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pareto views of the multi-objective trade-off, each point [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-benchmark 5×6 grid of hardware-counter signatures at the top-10 most impactful pass positions. by redundancy elimination. Loop-transformation/vectorization (41 licm, 42 loop-rotate, 48 indvars, 87 loop-vectorize, 95 loop-unroll) span the widest range: pass 41 is the most variable (cycles −47% to +70%; LLC +295%), while passes 42 and 48 are uniformly beneficial (instructions −7– 11%, energy −8–9%, 25–28… view at source ↗
Figure 8
Figure 8. Figure 8: Per-pass ∆-trajectories for runtime, IPC, D1 misses, LLC misses, CPU cycles, and energy along the full -O3 pipeline (median estimator). is driven by a handful of benchmarks. D1 is dominated by the dense linear-algebra solvers: cholesky spikes +4.6 × 107 at pass 72 and reverses (−4.8 × 107 ) at pass 73, while ludcmp adds +4.2 × 107 at pass 87. L1-I (≤0.17 × 107 ) is dominated by floyd_warshall, with deepest… view at source ↗
Figure 9
Figure 9. Figure 9: Per-pass evolution of D1, LLC, and L1-I miss rates, IPC, CPI, and energy. [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Bounded phase-interference loss L per benchmark. kernels. Conversely, loop-vectorize and loop-unroll lower L1-I misses (fewer instructions fetched); the binary￾bloat trade-off of §IV-B surfaces instead as data working￾set expansion, inflating LLC misses on the dense solvers (lu/ludcmp/cholesky). Branch-misprediction behavior is essentially flat: control-flow optimization is not on the critical path for Po… view at source ↗
Figure 11
Figure 11. Figure 11: Per-pass runtime standard deviation (top) and 95% [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
read the original abstract

Quantifying the marginal impact of individual optimization passes underpins phase ordering, pass selection, optimization design, and analysis of pass/hardware interactions. In LLVM -- the standard backend for C/C++, Rust, and ML stacks via MLIR -- interactions among optimization passes, measurement noise, and pipeline scale make this difficult. We present a systematic, empirical study of the LLVM -O3 optimization pipeline. We decompose the pipeline into cumulative per-pass prefixes. We then measure execution time, compile time, binary size, hardware counters, and RAPL energy across 84,750 measurements covering 113 cumulative prefixes of the -O3 pipeline evaluated on 30 PolyBench/C kernels under rigorous noise mitigation. On these compute-bound affine kernels, the pipeline is non-monotone (6.6-9.7% of transitions regress) and strongly back-loaded (the median non-regressing kernel needs 84.8% of the pipeline for 80% of its speedup). Most gains are driven by a small Pareto-dominant core of passes, while the final -O3 configuration is Pareto-dominated on (size, speedup) for 29 of 30 kernels. We further show that IR instruction count is an unreliable predictor of runtime, that runtime-targeted passes are de facto energy-targeted (30-60% savings), and that the search-free idealized-additive upper bound on losses due to phase interference is 46.35%. These findings enable more informed pass pruning, cost-model calibration, and autotuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a large-scale empirical study of the LLVM -O3 pipeline by decomposing it into 113 cumulative per-pass prefixes and collecting 84,750 measurements of execution time, compile time, binary size, hardware counters, and RAPL energy on 30 PolyBench/C kernels under noise mitigation. It reports that the pipeline is non-monotone (6.6-9.7% of transitions regress), strongly back-loaded (median non-regressing kernel requires 84.8% of the pipeline for 80% of speedup), driven by a small Pareto-dominant core of passes, with the final -O3 configuration Pareto-dominated on (size, speedup) for 29 of 30 kernels; it also finds IR instruction count unreliable for runtime prediction, runtime-targeted passes yield 30-60% energy savings, and computes a search-free idealized-additive upper bound of 46.35% on losses from phase interference.

Significance. If the reported measurements hold under the described protocol, the work supplies concrete, multi-dimensional data on pass interactions and ordering effects within LLVM that can directly inform pass pruning, cost-model calibration, and autotuning research. Notable strengths include the scale of the measurement campaign, the explicit scoping to the 30 kernels, the Pareto analysis, and the derivation of a search-free interference bound from the prefix data.

major comments (2)
  1. [Methods / Experimental Setup] The description of the noise-mitigation protocol, the exact statistical tests used to identify regressions, and the procedure for computing the 46.35% idealized-additive upper bound from the 113 prefix measurements are insufficiently detailed; these elements are load-bearing for all central quantitative claims (non-monotonicity rates, back-loading statistic, and interference bound).
  2. [Results / Pareto Analysis] The claim that the final -O3 configuration is Pareto-dominated on (size, speedup) for 29 of 30 kernels rests on the specific choice of the 30 PolyBench/C kernels and the measurement protocol; the manuscript should explicitly state the sensitivity of this dominance count to alternative noise thresholds or kernel subsets.
minor comments (2)
  1. [Figures] Several figures would benefit from explicit error bars or confidence intervals derived from the repeated measurements to convey variability.
  2. [Background] The manuscript should include a short table summarizing the exact pass sequence and the 113 prefix points for reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and recommendation for minor revision. The comments highlight opportunities to improve methodological transparency and contextualize key claims. We address each point below and will incorporate the suggested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Methods / Experimental Setup] The description of the noise-mitigation protocol, the exact statistical tests used to identify regressions, and the procedure for computing the 46.35% idealized-additive upper bound from the 113 prefix measurements are insufficiently detailed; these elements are load-bearing for all central quantitative claims (non-monotonicity rates, back-loading statistic, and interference bound).

    Authors: We agree that expanded methodological detail is warranted for reproducibility. In the revision we will augment the Methods section with a complete description of the noise-mitigation protocol (repetition counts, outlier handling, and hardware counter collection), the precise statistical tests and thresholds used to classify regressions, and the algebraic derivation of the idealized-additive interference bound directly from the 113 prefix measurements. These additions will not change any reported numbers but will make the central claims fully verifiable. revision: yes

  2. Referee: [Results / Pareto Analysis] The claim that the final -O3 configuration is Pareto-dominated on (size, speedup) for 29 of 30 kernels rests on the specific choice of the 30 PolyBench/C kernels and the measurement protocol; the manuscript should explicitly state the sensitivity of this dominance count to alternative noise thresholds or kernel subsets.

    Authors: The 29-of-30 dominance result is scoped to the PolyBench/C kernels and our measurement protocol. We will add an explicit caveat in the Results section stating that the count is protocol-dependent and that sensitivity to different noise thresholds or kernel subsets is not quantified in the present study. This addition provides the requested context without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurement study

full rationale

The paper is an empirical measurement study that decomposes the LLVM -O3 pipeline into 113 cumulative prefixes and reports direct observations (regression rates 6.6-9.7%, median back-loading 84.8%, Pareto dominance on 29/30 kernels, 46.35% idealized-additive interference bound) from 84,750 measurements on 30 PolyBench/C kernels. All central quantities are computed from the raw per-prefix execution times, sizes, and counters under the stated noise-mitigation protocol. No equations, fitted parameters, or predictions reduce by construction to quantities defined by the same measurements. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The claims are explicitly scoped to the measured set, making the derivation chain self-contained and externally replicable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be audited from the provided text.

pith-pipeline@v0.9.1-grok · 5801 in / 1227 out tokens · 29495 ms · 2026-07-01T04:50:56.555755+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

100 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    On Programming of Arithmetic Operations,

    A. P. Ershov, “On Programming of Arithmetic Operations,”Communica- tions of the ACM, vol. 1, no. 8, pp. 3–6, Aug. 1958

  2. [2]

    Sequencing Aspects of Multiprogramming,

    J. Heller, “Sequencing Aspects of Multiprogramming,”Journal of ACM, vol. 8, no. 3, pp. 426–439, Jul. 1961

  3. [3]

    Peephole Optimization,

    W. M. McKeeman, “Peephole Optimization,”Communications of the ACM, vol. 8, no. 7, pp. 443–444, Jul. 1965

  4. [4]

    High Speed Compilation of Efficient Object Code,

    C. W. Gear, “High Speed Compilation of Efficient Object Code,”Com- munications of the ACM, vol. 8, no. 8, pp. 483–488, Aug. 1965

  5. [5]

    Object Code Optimization,

    E. S. Lowry and C. W. Medlock, “Object Code Optimization,”Commu- nications of the ACM, vol. 12, no. 1, pp. 13–22, Jan. 1969

  6. [6]

    Optimization of Expressions in For- tran,

    V . A. Busam and D. E. Englund, “Optimization of Expressions in For- tran,”Commu. of the ACM, vol. 12, no. 12, pp. 666–674, Dec. 1969

  7. [7]

    Program Optimization,

    F. E. Allen, “Program Optimization,” inAnnual Review of Automatic Programming. Pergamon Press, 1969, vol. 5, pp. 239–307

  8. [8]

    W. A. Wulf,The Design of an Optimizing Compiler. CMU, Dec. 1973

  9. [9]

    K. D. Cooper and L. Torczon,Engineering a Compiler, Nov. 2022

  10. [10]

    Kennedy and J

    K. Kennedy and J. R. Allen,Optimizing Compilers for Modern Architec- tures: A Dependence-Based Approach. Morgan Kaufmann, Oct. 2001

  11. [11]

    Future Directions in Program Transformations,

    R. Paige, “Future Directions in Program Transformations,”ACM Com- puting Surveys, vol. 28, no. 4, pp. 170–174, Dec. 1996

  12. [12]

    Compiler Transformations for High-Performance Computing,

    D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler Transformations for High-Performance Computing,”ACM Computing Surveys, vol. 26, no. 4, pp. 345–420, Dec. 1994

  13. [13]

    Compositional Verification of Compiler Optimisations on Relaxed Memory,

    M. Dodds, M. Batty, and A. Gotsman, “Compositional Verification of Compiler Optimisations on Relaxed Memory,” inProc. ESOP’18, LNCS 10801, Thessaloniki, Greece: Springer, Apr. 2018, pp. 1027–1055

  14. [14]

    A. V . Aho, M. S. Lam, R. Sethi, and J. D. Ullman,Compilers: Principles, Techniques, and Tools, 2nd ed. Addison-Wesley, 2006

  15. [15]

    LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation,

    C. Lattner and V . Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation,” inProc. CGO’04, San José, CA, USA: IEEE, Mar. 2004, pp. 75–86

  16. [16]

    The Rust Language,

    N. D. Matsakis and F. S. Klock, “The Rust Language,”ACM SIGAda Letters, vol. 34, no. 3, pp. 103–104, Oct. 2014

  17. [17]

    Julia: A Fast Dynamic Language for Technical Computing

    J. Bezanson, S. Karpinski, V . B. Shah, and A. Edelman, “Julia: A Fast Dynamic Language for Technical Computing,”arXiv e-prints, arXiv:1209.5145, pp. 1–27, Sep. 2012

  18. [18]

    MLIR: Scal- ing Compiler Infrastructure for Domain Specific Computation,

    C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “MLIR: Scal- ing Compiler Infrastructure for Domain Specific Computation,” in Proc. CGO’21. Virtual: IEEE, Feb./Mar. 2021, pp. 2–14

  19. [19]

    PyTorch: An Imperative Style, High- Performance Deep Learning Library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High- Performance Deep Learning Library,” inProc. NeurIPS’19. Vancouver, Canada: Curran...

  20. [20]

    TensorFlow: A System for Large-Scale Machine Learning,

    M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V . Vasudevan, P. Warden, M. Wicke, Y . Yu, and X. Zheng, “TensorFlow: A System for Large-Scale Machine Learning,” inProc. OSDI’16. Savannah, GA, USA: USENIX, Nov. 2016, pp. 265–283

  21. [21]

    The Deep Learning Compiler: A Comprehensive Survey,

    M. Li, Y . Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, “The Deep Learning Compiler: A Comprehensive Survey,” IEEE Trans. Par. and Distr. Syst., vol. 32, no. 3, pp. 708–727, Mar. 2021

  22. [22]

    An In-Depth Comparison of Compilers for Deep Neural Networks on Hardware,

    Y . Xing, J. Weng, Y . Wang, L. Sui, Y . Shan, and Y . Wang, “An In-Depth Comparison of Compilers for Deep Neural Networks on Hardware,” in Proc. ICESS’19. Las Vegas, NV , USA: IEEE, Jun. 2019, pp. 1–8

  23. [23]

    Hardware Compi- lation of Deep Neural Networks: An Overview,

    R. Zhao, S. Liu, H.-C. Ng, E. Wang, J. J. Davis, X. Niu, X. Wang, H. Shi, G. A. Constantinides, P. Y . K. Cheung, and W. Luk, “Hardware Compi- lation of Deep Neural Networks: An Overview,” inProc. ASAP’18. Milan, Italy: IEEE, Jul. 2018, pp. 1–8

  24. [24]

    TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment,

    H.-I. C. Liu, M. Brehler, M. Ravishankar, N. Vasilache, B. Vanik, and S. Laurenzo, “TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment,”IEEE Micro, vol. 42, no. 5, pp. 9–16, Sep. 2022

  25. [25]

    TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

    T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y . Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,”arXiv e-prints, arXiv:1802.04799, pp. 1–16, Oct. 2018

  26. [26]

    Optimizing for Re- duced Code Space Using Genetic Algorithms,

    K. D. Cooper, P. J. Schielke, and D. Subramanian, “Optimizing for Re- duced Code Space Using Genetic Algorithms,” inProc. LCTES’99, Atlanta, GA, USA: ACM, May 1999, pp. 1–9

  27. [27]

    Adaptive Optimizing Compilers for the 21st Century,

    K. D. Cooper, D. Subramanian, and L. Torczon, “Adaptive Optimizing Compilers for the 21st Century,”J. Supercomputing, vol. 23, no. 1, pp. 7– 22, Aug. 2002

  28. [28]

    Compiler Optimization-Space Exploration,

    S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, “Compiler Optimization-Space Exploration,” inProc. CGO’03, San Francisco, CA, USA: IEEE, Mar. 2003, pp. 204–215

  29. [29]

    On the Decidability of Phase Ordering Problem in Optimizing Compilation,

    S.-A.-A. Touati and D. Barthou, “On the Decidability of Phase Ordering Problem in Optimizing Compilation,” inProc. CF’06, Ischia, Italy: ACM, May 2006, pp. 147–156

  30. [30]

    Phase Ordering of Register Allocation and Instruction Scheduling,

    S. M. Freudenberger and J. C. Ruttenberg, “Phase Ordering of Register Allocation and Instruction Scheduling,” inProc. WoCC’92. Daghstul, Germany: Springer, May 1991, pp. 146–170

  31. [31]

    Leveraging Compilation Statistics for Compiler Phase Ordering,

    J. Zhao, C. Xia, and Z. Wang, “Leveraging Compilation Statistics for Compiler Phase Ordering,” inProc. IPDPS’25. Milan, Italy: IEEE, Jun. 2025, pp. 533–545

  32. [32]

    MiCOMP: Mitigating the Compiler Phase-Ordering Prob- lem Using Optimization Sub-Sequences and Machine Learning,

    A. H. Ashouri, A. Bignoli, G. Palermo, C. Silvano, S. Kulkarni, and J. Cavazos, “MiCOMP: Mitigating the Compiler Phase-Ordering Prob- lem Using Optimization Sub-Sequences and Machine Learning,”ACM Trans. Arch. Code Opt., vol. 14, no. 3, pp. 29:1–29:28, Sep. 2017

  33. [33]

    Eval- uating Heuristic Optimization Phase Order Search Algorithms,

    J. W. Davidson, G. S. Tyson, D. B. Whalley, and P. A. Kulkarni, “Eval- uating Heuristic Optimization Phase Order Search Algorithms,” in Proc. CGO’07. San José, CA, USA: IEEE, Mar. 2007, pp. 157–169

  34. [34]

    Phase Coupling and Constant Generation in an Optimiz- ing Microcode Compiler,

    S. R. Vegdahl, “Phase Coupling and Constant Generation in an Optimiz- ing Microcode Compiler,” inProc. MICRO’82, Oct. 1982, pp. 125–133

  35. [35]

    A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach,

    R. Nobre, L. G. A. Martins, and J. a. M. P. Cardoso, “A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach,” in Proc. LCTES’16. Santa Barbara, USA: ACM, Jun. 2016, pp. 21–30

  36. [36]

    Using Machine Learning to Focus Iterative Optimization,

    F. Agakov, E. V . Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams, “Using Machine Learning to Focus Iterative Optimization,” in Proc. CGO’06, New York, NY , USA: IEEE, Mar. 2006, pp. 305–315

  37. [37]

    An Approach to Ordering Optimizing Transformations,

    D. L. Whitfield and M. L. Soffa, “An Approach to Ordering Optimizing Transformations,” inProc. PPoPP’90, ACM, Mar. 1990, pp. 137–146

  38. [38]

    Towards Efficient Compiler Auto-tuning: Leveraging Synergistic Search Spaces,

    H. Pan, Y . Wei, M. Xing, Y . Wu, and C. Zhao, “Towards Efficient Compiler Auto-tuning: Leveraging Synergistic Search Spaces,” in Proc. CGO’25. Las VEgas, NV , USA: IEEE, Mar. 2025, pp. 614–627

  39. [39]

    Compiler Autotuning through Multiple- phase Learning,

    M. Zhu, D. Hao, and J. Chen, “Compiler Autotuning through Multiple- phase Learning,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, pp. 1–38, May 2024

  40. [40]

    A Survey on Compiler Autotuning using Machine Learning,

    A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano, “A Survey on Compiler Autotuning using Machine Learning,”ACM Com- puting Surveys, vol. 51, no. 5, pp. 96:1–96:42, Sep. 2018

  41. [41]

    A Scalable Auto-Tuning Framework for Compiler Optimization,

    A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth, “A Scalable Auto-Tuning Framework for Compiler Optimization,” in Proc. IPDPS’09, Rome, Italy: IEEE, May 2009, pp. 1–12

  42. [42]

    PEAK—A Fast and Effective Perfor- mance Tuning System via Compiler Optimization Orchestration,

    Z. Pan and R. Eigenmann, “PEAK—A Fast and Effective Perfor- mance Tuning System via Compiler Optimization Orchestration,”ACM Transactions on Programming Languages and Systems, vol. 30, no. 3, pp. 17:1–17:43, May 2008

  43. [43]

    Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning,

    ——, “Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning,” inProc. CGO’06, 2006, pp. 319–332

  44. [44]

    Fast Searches for Effective Optimization Phase Sequences,

    P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones, “Fast Searches for Effective Optimization Phase Sequences,” in Proc. PLDI’04, Washington, DC, USA: ACM, Jun. 2004, pp. 171–182

  45. [45]

    Finding Good Optimization Sequences Covering Program Space,

    S. Purini and L. Jain, “Finding Good Optimization Sequences Covering Program Space,”ACM Trans. Arch. Code Opt., vol. 9, no. 4, Jan. 2013

  46. [46]

    Mitigating the Compiler Optimization Phase-Ordering Problem Using Machine Learning,

    S. Kulkarni and J. Cavazos, “Mitigating the Compiler Optimization Phase-Ordering Problem Using Machine Learning,” inProc. OOP- SLA’12. Tucson, AZ, USA: ACM, Oct. 2012, pp. 147–162

  47. [47]

    Finding Effective Optimization Phase Sequences,

    P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y . Paek, and K. Gallivan, “Finding Effective Optimization Phase Sequences,” inProc. LCTES’03, ACM, Jun. 2003, pp. 12–23

  48. [48]

    An Approach for Exploring Code Improving Transformations,

    D. L. Whitfield and M. L. Soffa, “An Approach for Exploring Code Improving Transformations,”ACM Transactions on Programming Lan- guages and Systems, vol. 19, no. 6, pp. 1053–1084, Nov. 1997

  49. [49]

    Finding Effective Compilation Se- quences,

    L. Almagor, K. Cooper, A. Grosul, T. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman, “Finding Effective Compilation Se- quences,” inProc. LCTES’04, ACM, Jun. 2004, pp. 231–239

  50. [50]

    An Approach Toward Profit- Driven Optimization,

    M. Zhao, B. R. Childers, and M. L. Soffa, “An Approach Toward Profit- Driven Optimization,”ACM Transactions on Architecture and Code Optimization, vol. 3, no. 3, pp. 231–262, Sep. 2006

  51. [51]

    Efficient Com- piler Optimization by Modeling Passes Dependence,

    J. Liu, J. Fang, T. Wang, J. Xie, C. Huang, and Z. Wang, “Efficient Com- piler Optimization by Modeling Passes Dependence,”CCF Transactions on High Performance Computing, vol. 6, no. 6, pp. 588–607, Dec. 2024

  52. [52]

    Less is More: Exploiting the Standard Compiler Optimization Levels for Better Performance and Energy Consumption,

    K. Georgiou, C. Blackmore, S. Xavier-de Souza, and K. Eder, “Less is More: Exploiting the Standard Compiler Optimization Levels for Better Performance and Energy Consumption,” inProc. SCOPES’18, Sankt Goar, Germany: ACM, May 2018, pp. 35–42

  53. [53]

    What Every Scientific Program- mer Should Know about Compiler Optimizations?

    J. Tan, S. Jiao, M. Chabbi, and X. Liu, “What Every Scientific Program- mer Should Know about Compiler Optimizations?” inProc. ICS’20. Barcelona, Spain: ACM, Jun./Jul. 2020, pp. 1–12

  54. [54]

    Exploiting Undefined Behavior in C/C++ Programs for Optimization: A Study on the Performance Impact,

    L. Popescu and N. P. Lopes, “Exploiting Undefined Behavior in C/C++ Programs for Optimization: A Study on the Performance Impact,” in Proc. PLDI’25, Seoul, South Korea: ACM, Jun. 2025, pp. 348–371

  55. [55]

    Analyzing the Effects of Compiler Optimizations on Application Reliability,

    M. Demertzi, M. Annavaram, and M. Hall, “Analyzing the Effects of Compiler Optimizations on Application Reliability,” inProc. IISWC’11, Austin, TX, USA: IEEE, Nov. 2011, pp. 184–193

  56. [56]

    Rapidly Selecting Good Compiler Optimizations using Performance Counters,

    J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. P. O’Boyle, and O. Temam, “Rapidly Selecting Good Compiler Optimizations using Performance Counters,” inProc. CGO’07, Mar. 2007, pp. 185–197

  57. [57]

    Analyzing the Influence of LLVM Code Optimization Passes on Software Perfor- mance,

    J. C. de la Torre, P. Ruiz, B. Dorronsoro, and P. L. Galindo, “Analyzing the Influence of LLVM Code Optimization Passes on Software Perfor- mance,” inProc. IPMU’18, CCIS 855. Springer, Jun. 2018, pp. 272–283

  58. [58]

    Evaluating End-to-End Optimization for Data Analytics Applications in Weld,

    S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe, S. Madden, and M. Zaharia, “Evaluating End-to-End Optimization for Data Analytics Applications in Weld,” inProc. VLDB’18, May 2018, pp. 1002–1015

  59. [59]

    It’s Not Easy Being Green: On the Energy Efficiency of Programming Lan- guages,

    N. van Kempen, H.-J. Kwon, D. T. Nguyen, and E. D. Berger, “It’s Not Easy Being Green: On the Energy Efficiency of Programming Lan- guages,” inProc. ASE’25, IEEE, Nov. 2025, pp. 1553–1565

  60. [60]

    A Survey of Energy Concerns for Software Engineering,

    S. U. Lee, N. Fernando, K. Lee, and J.-G. Schneider, “A Survey of Energy Concerns for Software Engineering,”J. Syst. Soft., vol. 210, Apr. 2024

  61. [61]

    Energy Efficiency Across Programming Languages,

    R. Pereira, M. Couto, F. Ribeiro, R. Rua, J. Cunha, J. P. Fernandes, and J. Saraiva, “Energy Efficiency Across Programming Languages,” in Proc. SLE’17. Vancouver, BC, Canada: ACM, Oct. 2017, pp. 256–267

  62. [62]

    Deconstructing Iterative Optimization,

    Y . Chen, S. Fang, Y . Huang, L. Eeckhout, G. Fursin, O. Temam, and C. Wu, “Deconstructing Iterative Optimization,”ACM Trans. Arch. Code Opt., vol. 9, no. 3, pp. 21:1–21:30, Oct. 2012

  63. [63]

    Global Value Numbers and Redundant Computations,

    B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Global Value Numbers and Redundant Computations,” inProc. PoPL’88, Jan. 1988, pp. 12–27

  64. [64]

    Detecting Equality of Variables in Programs,

    B. Alpern, M. N. Wegman, and F. K. Zadeck, “Detecting Equality of Variables in Programs,” inProc. PoPL’88, ACM, Jan. 1988, pp. 1–11

  65. [65]

    An Efficient Method of Computing Static Single Assignment Form,

    R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “An Efficient Method of Computing Static Single Assignment Form,” inProc. POPL’89, Austin, TX, USA: ACM, Jan. 1989, pp. 25–35

  66. [66]

    Interprocedural Constant Propagation,

    D. Callahan, K. D. Cooper, K. Kennedy, and L. Torczon, “Interprocedural Constant Propagation,” inProc. SCC’86, ACM, Jun. 1986, pp. 152–161

  67. [67]

    Constant Propagation with Condi- tional Branches,

    M. N. Wegman and F. K. Zadeck, “Constant Propagation with Condi- tional Branches,”ACM Transactions on Programming Languages and Systems, vol. 13, no. 2, pp. 181–210, Apr. 1991

  68. [68]

    Efficiently Computing Static Single Assignment Form and the Control Dependence Graph,

    R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph,”ACM Transactions on Programming Languages and Systems, vol. 13, no. 4, pp. 451–490, Oct. 1991

  69. [69]

    S. S. Muchnick,Advanced Compiler Design and Implementation, 1st ed. Morgan Kaufmann, Aug. 1997

  70. [70]

    Scalar Replacement in the Presence of Condi- tional Control Flow,

    S. Carr and K. Kennedy, “Scalar Replacement in the Presence of Condi- tional Control Flow,”Software—Practice and Experience, vol. 24, no. 1, pp. 51–77, Jan. 1994

  71. [71]

    Register Promotion in C Programs,

    J. Lu and K. D. Cooper, “Register Promotion in C Programs,” in Proc. PLDI’97, Las Vegas, NV , USA: ACM, Jun. 1997, pp. 308–319

  72. [72]

    Inline Function Expansion for Compiling C Programs,

    P. P. Chang and W.-W. Hwu, “Inline Function Expansion for Compiling C Programs,” inProc. PLDI’89, Portland, USA: Jun. 1989, pp. 246–257

  73. [73]

    Unrolling loops in Fortran,

    J. J. Dongarra and A. R. Hinds, “Unrolling loops in Fortran,”Software— Practice and Experience, vol. 9, no. 3, pp. 219–226, Mar. 1979

  74. [74]

    An Aggressive Approach to Loop Unrolling,

    J. W. Davidson and S. Jinturkar, “An Aggressive Approach to Loop Unrolling,” University of Virginia, V A, USA, TR-CS-95-26, Jan. 1995

  75. [75]

    Partial Dead Code Elimination,

    J. Knoop, O. Rüthing, and B. Steffen, “Partial Dead Code Elimination,” inProc. PLDI’94, Orlando, FL, USA: ACM, Jun. 1994, pp. 147–158

  76. [76]

    A Catalogue of Optimizing Transformations,

    F. E. Allen and J. Cocke, “A Catalogue of Optimizing Transformations,” inDesign and Optimization of Compilers, Prentice-Hall, 1972, pp. 1–30

  77. [77]

    Understanding PolyBench/C 3.2 Kernels,

    T. Yuki, “Understanding PolyBench/C 3.2 Kernels,” inProc. IM- PACT’14, Wien, Austria, Jan. 2014, pp. 1–5

  78. [78]

    Can Hardware Performance Counters Be Trusted?

    V . M. Weaver and S. A. McKee, “Can Hardware Performance Counters Be Trusted?” inProc. IISWC’08, IEEE, Sep. 2008, pp. 141–150

  79. [79]

    Statistically Rigorous Java Performance Evaluation,

    A. Georges, D. Buytaert, and L. Eeckhout, “Statistically Rigorous Java Performance Evaluation,” inProc. OOPSLA’07, Montréal, Canada: ACM, Oct. 2007, pp. 57–76

  80. [80]

    Produc- ing Wrong Data Without Doing Anything Obviously Wrong!

    T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney, “Produc- ing Wrong Data Without Doing Anything Obviously Wrong!” in Proc. ASPLOS’09, Washington, USA: ACM, Mar. 2009, pp. 265–276

Showing first 80 references.