note that not all processors want branch prediction, especially if you want tight control over delay:

  • real-time things like video decoders
  • secure chips where you want all your operations to take the same amount of time

ARM10 branch prediction unit:

  • ARM10 allowed two instructions per cycle to be fetched

  • so the branch predictor could predict before that instruction was even computed

  • branch folding

  • static branch predictor

  • simple, one-level, one-bit predictor

    • hash branch address into table
    • predict that we do the same thing as we did last time
    • problem is that you usually get two mispredictions (e.g. on loop exit and entry)
  • simple, one-level, two-bit predictor

    • very simple: add a lag before you flip prediction (hysteresis)
    • 2-bit saturating counters
  • more than two bits (with more complex state machines) is not very useful

we can also add in some history, or a two-level predictor:

  • take advantage of local history, store it in a table, and then concat
    • so you query your table with (branch id, history)
  • you can also use global history (global in the sense of any branch taken rather than particular branch taken)

some further techniques:

  • tournament predictors
    • local and global predictors, and
    • a predictor that that chose whcih one
  • dealing with aliasing