Improving Branch Prediction By Modeling Global History with Convolutional Neural Networks
Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3
The pith
Convolutional neural networks can be mapped to global history to predict the small set of branches that cause most CPU mispredictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training convolutional neural networks on global history data, the authors produce reusable helper predictors that improve accuracy on the small number of hard-to-predict branches left after conventional predictors have done their work; these networks can be reduced to 2-bit inference suitable for current branch prediction units and retain their benefit across different program inputs.
What carries the argument
Convolutional neural networks used as helper predictors on the global history registers of hard-to-predict branches, reduced to 2-bit inference.
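As a rough illustration of what mapping a CNN onto global history could look like, the sketch below encodes each history entry as a small integer built from hashed PC bits and the branch direction, applies a width-1 convolution (which reduces to a table lookup for one-hot inputs), and sums position-weighted responses. The names `HASH_BITS`, `encode_history`, and `helper_predict`, the encoding, and the layer shapes are assumptions made for this example; the page does not specify the paper's actual architecture.

```python
from typing import List, Sequence, Tuple

HASH_BITS = 4  # hypothetical: number of low PC bits kept per history entry


def encode_history(history: Sequence[Tuple[int, bool]]) -> List[int]:
    """Map each (pc, taken) pair to a symbol in [0, 2**(HASH_BITS + 1))."""
    mask = (1 << HASH_BITS) - 1
    return [((pc & mask) << 1) | int(taken) for pc, taken in history]


def helper_predict(
    indices: Sequence[int],
    filters: Sequence[Sequence[float]],
    position_weights: Sequence[Sequence[float]],
    bias: float = 0.0,
) -> bool:
    """Width-1 convolution over one-hot symbols, then a linear layer.

    For one-hot inputs, filters[c][idx] is filter c's response to symbol idx,
    and position_weights[c][pos] weights that response at history position pos.
    Returns True for a predicted-taken branch.
    """
    score = bias
    for table, weights in zip(filters, position_weights):
        for idx, w in zip(indices, weights):
            score += w * table[idx]
    return score >= 0.0


# Illustrative filter: responds to the direction of branches whose low PC bits are 0x5.
vocab = 1 << (HASH_BITS + 1)
table = [0.0] * vocab
table[(0x5 << 1) | 1] = 1.0   # symbol for "PC hash 0x5, taken"
table[(0x5 << 1) | 0] = -1.0  # symbol for "PC hash 0x5, not taken"

history = [(0x405, True), (0x12, False), (0x405, True)]
prediction = helper_predict(encode_history(history), [table], [[1.0, 1.0, 1.0]])
```

In this toy setup the helper predicts taken whenever the correlated branch was taken more often than not in the recent history; a trained model would learn the filter tables and position weights offline.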
If this is right
- CNN helpers raise prediction accuracy on hard-to-predict branches in SPEC 2017 beyond what conventional predictors achieve.
- 2-bit CNN inference can be implemented inside existing branch prediction units without hardware redesign.
- The same trained models remain effective across different inputs to the same application.
- Offline training cost can be amortized, allowing machine-learning pattern matching to raise instructions per cycle at runtime.
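One way to picture low-precision inference of the kind the second consequence relies on is ternary quantization, where each weight takes a value in {-1, 0, +1} (representable in 2 bits) and a dot product over binary inputs reduces to two masked population counts. This is a sketch under that assumption only: the threshold rule, the mask packing, and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def quantize_ternary(weights: Sequence[float], threshold: float) -> List[int]:
    """Round each weight to -1, 0, or +1; small magnitudes become 0."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


def pack_masks(q_weights: Sequence[int]) -> Tuple[int, int]:
    """Pack ternary weights into two bitmasks: positions of +1 and of -1."""
    pos_mask = 0
    neg_mask = 0
    for i, q in enumerate(q_weights):
        if q == 1:
            pos_mask |= 1 << i
        elif q == -1:
            neg_mask |= 1 << i
    return pos_mask, neg_mask


def popcount(x: int) -> int:
    """Count set bits in a non-negative integer."""
    return bin(x).count("1")


def ternary_dot(pos_mask: int, neg_mask: int, input_bits: int) -> int:
    """Dot product of ternary weights with a binary input vector.

    Only AND and popcount are needed, with no multiplications.
    """
    return popcount(input_bits & pos_mask) - popcount(input_bits & neg_mask)


q = quantize_ternary([0.9, -0.05, -0.7, 0.2], threshold=0.3)
pos_mask, neg_mask = pack_masks(q)
score = ternary_dot(pos_mask, neg_mask, 0b0001)
```

The design point this illustrates is that a quantized model can be evaluated with logic much cheaper than floating-point multiply-accumulate, which is what makes fitting it inside a branch prediction unit plausible.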
Where Pith is reading between the lines
- The approach could be applied to other microarchitectural decisions that also depend on global history, such as cache prefetching or memory scheduling.
- In wider pipelines the relative IPC benefit would increase because mispredictions become more costly.
- Customer-specific models could be trained on representative workloads and loaded after silicon is manufactured.
Load-bearing premise
The same small set of hard-to-predict branches remains the dominant source of mispredictions after the networks are mapped to existing global history data and reduced to 2-bit inference.
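The premise can be checked on a trace by measuring how concentrated mispredictions are among static branches. The sketch below assumes a hypothetical trace format of `(pc, mispredicted)` pairs and a helper named `top_k_share`; neither comes from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def top_k_share(events: Iterable[Tuple[int, bool]], k: int) -> float:
    """Fraction of all mispredictions caused by the k worst static branches."""
    misses = Counter(pc for pc, mispredicted in events if mispredicted)
    total = sum(misses.values())
    if total == 0:
        return 0.0
    worst = sum(count for _, count in misses.most_common(k))
    return worst / total


# Toy trace: three branches mispredict 6, 3, and 1 times; one never mispredicts.
events = (
    [(0x10, True)] * 6
    + [(0x20, True)] * 3
    + [(0x30, True)]
    + [(0x40, False)] * 5
)
share_top1 = top_k_share(events, 1)
share_top2 = top_k_share(events, 2)
```

Comparing this share before and after the helper is deployed, on the same set of branches, would indicate whether the same H2Ps still dominate.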
What would settle it
Cycle-accurate simulation of the 2-bit CNN helpers on SPEC 2017 showing whether misprediction rates on the identified hard-to-predict branches fall enough to produce the claimed IPC gains.
Original abstract
CPU branch prediction has hit a wall--existing techniques achieve near-perfect accuracy on 99% of static branches, and yet the mispredictions that remain hide major performance gains. In a companion report, we show that a primary source of mispredictions is a handful of systematically hard-to-predict branches (H2Ps), e.g. just 10 static instructions per SimPoint phase in SPECint 2017. The lost opportunity posed by these mispredictions is significant to the CPU: 14.0% in instructions-per-cycle (IPC) on Intel SkyLake and 37.4% IPC when the pipeline is scaled four-fold, on par with gains from process technology. However, up to 80% of this upside is unreachable by the best known branch predictors, even when afforded exponentially more resources. New approaches are needed, and machine learning (ML) provides a palette of powerful predictors. A growing body of work has shown that ML models are deployable within the microarchitecture to optimize hardware at runtime, and are one way to customize CPUs post-silicon by training to customer applications. We develop this scenario for branch prediction using convolutional neural networks (CNNs) to boost accuracy for H2Ps. Step-by-step, we (1) map CNNs to the global history data used by existing branch predictors; (2) show how CNNs improve H2P prediction in SPEC 2017; (3) adapt 2-bit CNN inference to the constraints of current branch prediction units; and (4) establish that CNN helper predictors are reusable across application executions on different inputs, enabling us to amortize offline training and deploy ML pattern matching to improve IPC.
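The headline figures in the abstract can be combined with simple arithmetic. Reading "up to 80% of this upside is unreachable" literally, the sketch below computes the upper bound, in IPC percentage points, on the opportunity that the best known predictors leave untouched. The function name and this literal reading are assumptions; the abstract does not give the breakdown itself.

```python
def unreachable_ipc_points(opportunity_pct: float, unreachable_fraction: float = 0.8) -> float:
    """Upper bound on the IPC opportunity (in percentage points) that
    conventional predictors cannot reach, given the total opportunity."""
    return round(opportunity_pct * unreachable_fraction, 2)


skylake_points = unreachable_ipc_points(14.0)  # baseline pipeline
scaled_points = unreachable_ipc_points(37.4)   # pipeline scaled four-fold
```

Under this reading, up to about 11.2 of the 14.0 IPC points on the baseline pipeline, and about 29.9 of the 37.4 points on the scaled pipeline, are the target of the CNN helpers.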
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that convolutional neural networks (CNNs) can be mapped to global history data to improve prediction accuracy on a small set of hard-to-predict branches (H2Ps) in SPEC 2017, that a 2-bit inference version can be adapted to existing branch prediction hardware constraints, and that the resulting helper predictors are reusable across different inputs. Together these would amortize offline training costs and recover a substantial fraction of the IPC opportunity, up to 80% of which remains unreachable by conventional predictors even with more resources.
Significance. If the empirical results on accuracy preservation and reusability hold, the work would demonstrate a viable path for deploying ML-based pattern matching inside the branch predictor, addressing a concentrated source of mispredictions that limits IPC gains on the order of 14% on SkyLake (or 37% on a scaled pipeline). The focus on hardware-constrained 2-bit inference and cross-input reusability is a concrete strength that could influence future microarchitectural customization.
major comments (1)
- [Abstract] Abstract, steps 1-3: The central claim that the CNN helper predictor remains effective after mapping to existing global history and reduction to 2-bit inference is load-bearing for both the hardware feasibility and the reusability argument (step 4), yet the manuscript provides no quantitative accuracy or IPC numbers showing that the constrained model closes any measurable portion of the 80% unreachable opportunity on the same H2P set identified in step 2.
minor comments (1)
- [Abstract] The abstract refers to 'a companion report' for the H2P characterization without a citation, link, or arXiv identifier.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for quantitative support in the abstract for the constrained CNN model. We address the single major comment below.
- Referee: [Abstract] Abstract, steps 1-3: The central claim that the CNN helper predictor remains effective after mapping to existing global history and reduction to 2-bit inference is load-bearing for both the hardware feasibility and the reusability argument (step 4), yet the manuscript provides no quantitative accuracy or IPC numbers showing that the constrained model closes any measurable portion of the 80% unreachable opportunity on the same H2P set identified in step 2.
  Authors: We agree the abstract should explicitly quantify the effectiveness of the 2-bit inference model on the H2P set to support the load-bearing claims. The manuscript body reports accuracy results for both the mapped CNN and its 2-bit adaptation (Sections 4-5), but does not highlight in the abstract the specific fraction of the 80% IPC opportunity recovered by the constrained version. We will revise the abstract to include these accuracy preservation and IPC recovery metrics for the 2-bit model on the identified H2Ps.
  Revision: yes
Circularity Check
No circularity; empirical steps on external benchmarks
The paper presents four sequential empirical steps (mapping CNNs to global history, demonstrating H2P accuracy gains on SPEC 2017, adapting 2-bit inference, and establishing reusability across inputs) without any equations, derivations, or fitted parameters that reduce outputs to inputs by construction. The companion report citation supplies context on H2Ps but is not load-bearing for the CNN results themselves, which are benchmark-driven and externally falsifiable. No self-definitional, uniqueness, or ansatz patterns appear.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: CNNs can be mapped to the global history data already used by existing branch predictors
Reference graph
Works this paper leans on
- [1] A Fog. The microarchitecture of Intel, AMD, and VIA CPUs. An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering, 2018
- [2] A Seznec. TAGE-SC-L Branch Predictors Again. In Proc. 5th Championship on Branch Prediction, 2016
- [3] C-K Lin and SJ Tarsa. Branch Prediction is Not a Solved Problem: Measurements, Opportunities, and Future Directions. arXiv:1906.08170, 2019
- [4] CBP-5 Kit. In Proc. 5th Championship on Branch Prediction, 2016
- [5] GS Ravi and MH Lipasti. CHARSTAR: Clock Hierarchy Aware Resource Scaling in Tiled Architectures. ACM SIGARCH, 2017
- [6] SJ Tarsa, RBR Chowdhury, J Sebot, GN Chinya, J Gaur, K Sankaranarayanan, C-K Lin, R Chappell, R Singhal, and H Wang. Practical Post-Silicon CPU Adaptation Using Machine Learning. In ISCA, 2019
- [7] J Cleary and I Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans Comms, 1984
- [8] DA Jiménez. Multiperspective Perceptron Predictor. In Proc. 5th Championship on Branch Prediction, 2016
- [9] S Song, Q Wu, S Flolid, J Dean, et al. Experiments with SPEC CPU 2017: Similarity, Balance, Phase Behavior and Simpoints. Technical report, TR-180515-01, Dept. of ECE, UT-Austin, 2018
- [10] C-K Luk, R Cohn, R Muth, H Patil, A Klauser, G Lowney, S Wallace, VJ Reddi, and K Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. 2005
- [11] S Tokui, K Oono, S Hido, and J Clayton. Chainer: A Next-Generation Open Source Framework for Deep Learning. In LearnSys, 2015
- [12] D Kingma and J Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014
- [13] M Courbariaux, I Hubara, D Soudry, R El-Yaniv, and Y Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, 2016
- [14] M Rastegari, V Ordonez, J Redmon, and A Farhadi. XNOR-Net: Imagenet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016
- [15] C Zhu, S Han, H Mao, and WJ Dally. Trained Ternary Quantization. arXiv:1612.01064, 2016
- [16] R Ramanarayanan, S Mathew, V Erraguntla, R Krishnamurthy, and S Gueron. A 2.1GHz 6.5mW 64-bit Unified Popcount/Bitscan Datapath Unit for 65nm High-Performance Microprocessor Execution Cores. In VLSID, 2008