Improving Branch Prediction By Modeling Global History with Convolutional Neural Networks

Chit-Kwan Lin; Gautham Chinya; Gokce Keskin; Hong Wang; Stephen J Tarsa

arxiv: 1906.09889 · v1 · pith:M2VPHV53new · submitted 2019-06-20 · 💻 cs.DC · cs.LG

Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3

keywords: branch prediction, convolutional neural networks, hard-to-predict branches, global history, machine learning, microarchitecture, SPEC 2017, instructions per cycle

The pith

Convolutional neural networks can be mapped to global history to predict the small set of branches that cause most CPU mispredictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern CPUs already predict nearly all branches correctly, yet the remaining errors come from a handful of hard-to-predict branches that limit instructions per cycle. The paper maps convolutional neural networks onto the same global history registers used by conventional predictors, shows accuracy gains on these branches in SPEC 2017, and reduces the networks to 2-bit inference that fits existing hardware. It further demonstrates that the resulting helper predictors remain effective when the same application runs on new inputs, so training cost can be paid once and reused. If this holds, machine-learning pattern matching becomes practical for post-silicon customization of branch prediction without new silicon.

Core claim

By training convolutional neural networks on global history data, the authors produce reusable helper predictors that improve accuracy on the small number of hard-to-predict branches left after conventional predictors have done their work; these networks can be reduced to 2-bit inference suitable for current branch prediction units and retain their benefit across different program inputs.
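As a rough illustration of the helper arrangement described above, the sketch below sends a branch to a dedicated helper only when its address appears in a small table of hard-to-predict branches, and falls back to the baseline predictor otherwise. The names `make_hybrid`, `baseline`, and `helpers`, the toy predictors, and the override policy itself are assumptions made for illustration; this summary does not say how the paper arbitrates between the two.

```python
from typing import Callable, Dict, List

History = List[bool]                       # most recent outcome last
Predictor = Callable[[int, History], bool]


def make_hybrid(baseline: Predictor, helpers: Dict[int, Predictor]) -> Predictor:
    """Build a predictor that uses a per-branch helper when one exists.

    Assumption: a helper simply overrides the baseline for its branch.
    """
    def predict(pc: int, history: History) -> bool:
        helper = helpers.get(pc)
        if helper is not None:
            return helper(pc, history)
        return baseline(pc, history)
    return predict


# Toy stand-ins, not real predictors.
def always_taken(pc: int, history: History) -> bool:
    return True


def opposite_of_last(pc: int, history: History) -> bool:
    return not history[-1] if history else True


H2P_PC = 0x400                             # hypothetical hard-to-predict branch
hybrid = make_hybrid(always_taken, {H2P_PC: opposite_of_last})

print(hybrid(H2P_PC, [True, True]))        # helper path
print(hybrid(0x500, [True, True]))         # baseline path
```

Keeping the helper table small matches the premise that only a handful of static branches need special treatment; every other branch keeps its existing prediction path.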

What carries the argument

Convolutional neural networks used as helper predictors on the global history registers of hard-to-predict branches, reduced to 2-bit inference.
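A minimal sketch of what feeding global history to a CNN as a one-hot matrix could look like, followed by a 1-wide convolutional filter (the figure captions on this page mention both). The indexing scheme, which hashes each past branch to a row using a few low address bits plus its direction, and the constant `PC_BITS` are assumptions for illustration, not details confirmed by this summary.

```python
from typing import List, Tuple

PC_BITS = 3                       # assumed: low address bits kept per entry
ROWS = 2 ** (PC_BITS + 1)         # one row per (low PC bits, direction) pair


def encode_history(history: List[Tuple[int, bool]]) -> List[List[int]]:
    """Return a ROWS x len(history) matrix with exactly one 1 per column."""
    matrix = [[0] * len(history) for _ in range(ROWS)]
    for col, (pc, taken) in enumerate(history):
        row = ((pc & (2 ** PC_BITS - 1)) << 1) | int(taken)
        matrix[row][col] = 1
    return matrix


def conv_1wide(matrix: List[List[int]], weights: List[float]) -> List[float]:
    """Apply a 1-wide filter: dot each column with the weight vector.

    Because each column is one-hot, the result for a column is just the
    weight of the row that is set.
    """
    n_cols = len(matrix[0]) if matrix else 0
    return [
        sum(weights[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


# Three past branches as (address, taken) pairs; the values are made up.
history = [(0x40, True), (0x44, False), (0x40, True)]
m = encode_history(history)
print(conv_1wide(m, list(range(ROWS))))   # [1, 8, 1]
```

With one-hot columns, a 1-wide filter amounts to a table lookup per history position, which is one reason this style of layer is cheap to evaluate.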

If this is right

CNN helpers raise prediction accuracy on hard-to-predict branches in SPEC 2017 beyond what conventional predictors achieve.
2-bit CNN inference can be implemented inside existing branch prediction units without hardware redesign.
The same trained models remain effective across different inputs to the same application.
Offline training cost can be amortized, allowing machine-learning pattern matching to raise instructions per cycle at runtime.
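To make the 2-bit inference point in the list above more concrete, here is a sketch of one way low-precision inference can avoid multipliers. It uses ternary quantization (weights restricted to -1, 0, +1, which fit in 2 bits with one code unused) and evaluates a dot product against 0/1 inputs using bit masks and population counts. This is a swapped-in, generic technique chosen for illustration; the threshold, the function names, and the packing layout are assumptions, and the paper's actual 2-bit scheme may differ.

```python
from typing import List, Tuple


def quantize_ternary(weights: List[float], threshold: float = 0.5) -> List[int]:
    """Map each weight to -1, 0, or +1 (assumed threshold, for illustration)."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


def pack_masks(qweights: List[int]) -> Tuple[int, int]:
    """Pack ternary weights into two bit masks: positions of +1 and of -1."""
    pos = neg = 0
    for i, q in enumerate(qweights):
        if q == 1:
            pos |= 1 << i
        elif q == -1:
            neg |= 1 << i
    return pos, neg


def popcount(x: int) -> int:
    return bin(x).count("1")


def ternary_dot(input_bits: int, pos: int, neg: int) -> int:
    """Dot product of a 0/1 input vector with ternary weights, no multiplies."""
    return popcount(input_bits & pos) - popcount(input_bits & neg)


def predict_taken(input_bits: int, pos: int, neg: int, bias: int = 0) -> bool:
    return ternary_dot(input_bits, pos, neg) + bias >= 0


q = quantize_ternary([0.9, -0.7, 0.1, -0.2, 1.3])   # [1, -1, 0, 0, 1]
pos_mask, neg_mask = pack_masks(q)
print(ternary_dot(0b10011, pos_mask, neg_mask))      # 1
print(predict_taken(0b00010, pos_mask, neg_mask))    # False
```

The appeal of this style is that the inner loop reduces to AND and popcount operations, which are far cheaper in hardware than floating-point multiply-accumulate.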

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be applied to other microarchitectural decisions that also depend on global history, such as cache prefetching or memory scheduling.
In wider pipelines the relative IPC benefit would increase because mispredictions become more costly.
Customer-specific models could be trained on representative workloads and loaded after silicon is manufactured.

Load-bearing premise

The same small set of hard-to-predict branches remains the dominant source of mispredictions after the networks are mapped to existing global history data and reduced to 2-bit inference.

What would settle it

Cycle-accurate simulation of the 2-bit CNN helpers on SPEC 2017 showing whether misprediction rates on the identified hard-to-predict branches fall enough to produce the claimed IPC gains.

Figures

Figures reproduced from arXiv: 1906.09889 by Chit-Kwan Lin, Gautham Chinya, Gokce Keskin, Hong Wang, Stephen J Tarsa.

**Figure 2.** A CNN is fed H2P-1’s global history as a matrix of 1-hot ...

**Figure 3.** (Top 2) 1-wide convolutional filters trained on H2P-1’s ...

**Figure 4.** Layer 2 filter weights represent how much each history ...

**Figure 5.** 2-bit CNN helpers lose fidelity encoding the magnitude of ...

Original abstract

CPU branch prediction has hit a wall--existing techniques achieve near-perfect accuracy on 99% of static branches, and yet the mispredictions that remain hide major performance gains. In a companion report, we show that a primary source of mispredictions is a handful of systematically hard-to-predict branches (H2Ps), e.g. just 10 static instructions per SimPoint phase in SPECint 2017. The lost opportunity posed by these mispredictions is significant to the CPU: 14.0% in instructions-per-cycle (IPC) on Intel SkyLake and 37.4% IPC when the pipeline is scaled four-fold, on par with gains from process technology. However, up to 80% of this upside is unreachable by the best known branch predictors, even when afforded exponentially more resources. New approaches are needed, and machine learning (ML) provides a palette of powerful predictors. A growing body of work has shown that ML models are deployable within the microarchitecture to optimize hardware at runtime, and are one way to customize CPUs post-silicon by training to customer applications. We develop this scenario for branch prediction using convolutional neural networks (CNNs) to boost accuracy for H2Ps. Step-by-step, we (1) map CNNs to the global history data used by existing branch predictors; (2) show how CNNs improve H2P prediction in SPEC 2017; (3) adapt 2-bit CNN inference to the constraints of current branch prediction units; and (4) establish that CNN helper predictors are reusable across application executions on different inputs, enabling us to amortize offline training and deploy ML pattern matching to improve IPC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that convolutional neural networks (CNNs) can be mapped to global history data to improve prediction accuracy on a small set of hard-to-predict branches (H2Ps) in SPEC 2017, that a 2-bit inference version can be adapted to existing branch prediction hardware constraints, and that the resulting helper predictors are reusable across different inputs. Together these would amortize offline training costs and recover part of the IPC opportunity, up to 80% of which the abstract says is unreachable by the best known predictors even with exponentially more resources.

Significance. If the empirical results on accuracy preservation and reusability hold, the work would demonstrate a viable path for deploying ML-based pattern matching inside the branch predictor, addressing a concentrated source of mispredictions whose cost the abstract puts at 14.0% IPC on Intel SkyLake (37.4% when the pipeline is scaled four-fold). The focus on hardware-constrained 2-bit inference and cross-input reusability is a concrete strength that could influence future microarchitectural customization.
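The abstract's headline figures can be turned into rough bounds with a few lines of arithmetic. The sketch below splits each stated IPC opportunity into the share the abstract says may be out of reach for existing predictors (up to 80%) and the remainder. Applying the 80% bound to both the SkyLake and the four-fold-scaled figures is an assumption made here; the abstract does not say which configuration that bound was measured on, and `split_opportunity` is a name invented for this example.

```python
from typing import Tuple


def split_opportunity(ipc_gain_pct: float,
                      unreachable_frac: float = 0.80) -> Tuple[float, float]:
    """Return (within_reach, out_of_reach) in IPC percentage points.

    out_of_reach is an upper bound, since the abstract says "up to 80%".
    """
    out_of_reach = round(ipc_gain_pct * unreachable_frac, 2)
    within_reach = round(ipc_gain_pct - out_of_reach, 2)
    return within_reach, out_of_reach


for label, gain in [("SkyLake", 14.0), ("4x-scaled pipeline", 37.4)]:
    within, beyond = split_opportunity(gain)
    print(f"{label}: at least {within} points within reach of existing "
          f"predictors, at most {beyond} points needing a new approach")
```

Under these assumptions, as much as 11.2 of the 14.0 points on SkyLake would depend on new techniques, which is the gap the referee asks the authors to quantify for the 2-bit model.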

major comments (1)

[Abstract] Abstract, steps 1-3: The central claim that the CNN helper predictor remains effective after mapping to existing global history and reduction to 2-bit inference is load-bearing for both the hardware feasibility and the reusability argument (step 4), yet the manuscript provides no quantitative accuracy or IPC numbers showing that the constrained model closes any measurable portion of the 80% unreachable opportunity on the same H2P set identified in step 2.

minor comments (1)

[Abstract] The abstract refers to 'a companion report' for the H2P characterization without a citation, link, or arXiv identifier.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for quantitative support in the abstract for the constrained CNN model. We address the single major comment below.

Point-by-point responses

Referee: [Abstract] Abstract, steps 1-3: The central claim that the CNN helper predictor remains effective after mapping to existing global history and reduction to 2-bit inference is load-bearing for both the hardware feasibility and the reusability argument (step 4), yet the manuscript provides no quantitative accuracy or IPC numbers showing that the constrained model closes any measurable portion of the 80% unreachable opportunity on the same H2P set identified in step 2.

Authors: We agree the abstract should explicitly quantify the effectiveness of the 2-bit inference model on the H2P set to support the load-bearing claims. The manuscript body reports accuracy results for both the mapped CNN and its 2-bit adaptation (Sections 4-5), but does not highlight in the abstract the specific fraction of the 80% IPC opportunity recovered by the constrained version. We will revise the abstract to include these accuracy preservation and IPC recovery metrics for the 2-bit model on the identified H2Ps.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical steps on external benchmarks

full rationale

The paper presents four sequential empirical steps (mapping CNNs to global history, demonstrating H2P accuracy gains on SPEC 2017, adapting 2-bit inference, and establishing reusability across inputs) without any equations, derivations, or fitted parameters that reduce outputs to inputs by construction. The companion report citation supplies context on H2Ps but is not load-bearing for the CNN results themselves, which are benchmark-driven and externally falsifiable. No self-definitional, uniqueness, or ansatz patterns appear.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that H2Ps constitute the primary remaining misprediction source and that CNNs can be mapped to existing global history without requiring new data sources or losing pattern-capture power.

axioms (1)

Domain assumption: CNNs can be mapped to the global history data already used by existing branch predictors. Stated as step (1) in the abstract.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

[1] A Fog. The microarchitecture of Intel, AMD, and VIA CPUs. An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering, 2018
[2] A Seznec. TAGE-SC-L Branch Predictors Again. In Proc. 5th Championship on Branch Prediction, 2016
[3] C-K Lin and SJ Tarsa. Branch Prediction is Not a Solved Problem: Measurements, Opportunities, and Future Directions. arXiv:1906.08170, 2019
[4] CBP-5 Kit. In Proc. 5th Championship on Branch Prediction, 2016
[5] GS Ravi and MH Lipasti. CHARSTAR: Clock Hierarchy Aware Resource Scaling in Tiled Architectures. ACM SIGARCH, 2017
[6] SJ Tarsa, RBR Chowdhury, J Sebot, GN Chinya, J Gaur, K Sankaranarayanan, C-K Lin, R Chappell, R Singhal, and H Wang. Practical Post-Silicon CPU Adaptation Using Machine Learning. In ISCA, 2019
[7] J Cleary and I Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans Comms, 1984
[8] DA Jiménez. Multiperspective Perceptron Predictor. In Proc. 5th Championship on Branch Prediction, 2016
[9] S Song, Q Wu, S Flolid, J Dean, et al. Experiments with SPEC CPU 2017: Similarity, Balance, Phase Behavior and Simpoints. Technical report, TR-180515-01, Dept. of ECE, UT-Austin, 2018
[10] C-K Luk, R Cohn, R Muth, H Patil, A Klauser, G Lowney, S Wallace, VJ Reddi, and K Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. 2005
[11] S Tokui, K Oono, S Hido, and J Clayton. Chainer: A Next-Generation Open Source Framework for Deep Learning. In LearnSys, 2015
[12] D Kingma and J Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014
[13] M Courbariaux, I Hubara, D Soudry, R El-Yaniv, and Y Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, 2016
[14] M Rastegari, V Ordonez, J Redmon, and A Farhadi. XNOR-Net: Imagenet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016
[15] C Zhu, S Han, H Mao, and WJ Dally. Trained Ternary Quantization. arXiv:1612.01064, 2016
[16] R Ramanarayanan, S Mathew, V Erraguntla, R Krishnamurthy, and S Gueron. A 2.1Ghz 6.5mW 64-bit Unified Popcount/Bitscan Datapath Unit for 65nm High-Performance Microprocessor Execution Cores. In VLSID, 2008
