chiplets:

  • yield for big dies is low, yield for small dies is high: defects land on the wafer at roughly constant density and one defect usually kills the die it hits, so a defect wastes far less silicon when the dies are small

predictors are everywhere in high-performance processors because transistors are plentiful

pipelining

so in a non-pipelined processor, we have a CPI of 1 (good!) but our clock period will be long, since it has to cover the worst-case instruction latency plus margin.

when we pipeline, we have some overhead for adding pipeline registers (flip-flops). each extra FF adds roughly a constant c to the stage delay, so with N stages the period goes from T to T/N + c — so there is actually a tradeoff. you also have to account for clock skew and the timing requirements (setup/hold) of your pipeline registers — important for very deep pipelines

when you stick pipeline registers in your machine, you kind of want:

  • pipeline registers that aren’t too wide (not too much state to carry between stages)
  • stage boundaries that divide the total delay evenly
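the second point can be sketched numerically — toy numbers (made up, not from the notes) showing that the slowest stage sets the period, so uneven division wastes time in every other stage:

```python
# Toy numbers: the clock period is set by the slowest stage plus the
# per-register overhead (setup + clk-to-q + skew margin).
stage_delays = [1.2, 0.9, 1.4, 1.0]   # ns per stage
ff_overhead = 0.15                    # ns of register/clocking overhead
period = max(stage_delays) + ff_overhead  # 1.4 ns stage dominates: 1.55 ns
```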

notably register files have many ports to support high-bandwidth access (esp on superscalar processors)

in the ideal pipeline case, you have a CPI of 1 (no stalls). in general:

pipelined CPI = ideal pipeline CPI + pipeline stalls/instruction
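the formula above as a quick calculation — the stall numbers here are illustrative, not from the notes:

```python
# Sketch of: pipelined CPI = ideal pipeline CPI + pipeline stalls/instruction
def pipelined_cpi(ideal_cpi, stall_events):
    """stall_events: list of (stalls per instruction, penalty in cycles)."""
    return ideal_cpi + sum(freq * penalty for freq, penalty in stall_events)

# e.g. 20% of instructions hit a 1-cycle load-use stall,
#      15% are branches paying a 2-cycle flush penalty
cpi = pipelined_cpi(1.0, [(0.20, 1), (0.15, 2)])  # 1.0 + 0.2 + 0.3 = 1.5
```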

pipelining hazards

in the non-superscalar case, we have pretty simple hazards:

  • data hazard (data dependencies)
  • control hazard (branch dependencies)
  • structural hazard (resource requirement)
    • notably, one might ask “why ever have structural hazards?” — the answer is tradeoffs: we could add more ports to our register file, but ports cost area and power

in the superscalar case (trying to execute things out of order), we have true and false data dependencies. false data dependencies are name dependencies, where instructions merely reuse the same registers. we normally solve this with register renaming
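a toy rename table (structure hypothetical, just to illustrate the idea): every write allocates a fresh physical register, so name dependences disappear while true dependences are preserved through the mappings.

```python
# Toy register renaming: fresh physical destination per write.
rat = {}          # rename alias table: architectural -> physical register
counter = 0

def rename(dest, srcs):
    global counter
    psrcs = [rat.get(s, s) for s in srcs]   # RAW: read the current mappings
    preg = f"p{counter}"; counter += 1
    rat[dest] = preg                        # WAW/WAR: brand-new destination
    return preg, psrcs

# two back-to-back writes to r1 no longer conflict:
i1 = rename("r1", ["r2"])   # -> ("p0", ["r2"])
i2 = rename("r1", ["r1"])   # -> ("p1", ["p0"]): reads i1's result, new dest
```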

a more complete taxonomy is:

  • read after write (RAW) — true dependence
  • write after write (WAW) — output dependence (a name dependence)
  • write after read (WAR) — anti-dependence (a name dependence)
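the taxonomy as a small helper (names hypothetical) that classifies the dependence of a later instruction on an earlier one, each given as (dest, [srcs]):

```python
# Classify dependences between two instructions in program order.
def classify(earlier, later):
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    found = []
    if e_dst in l_srcs:
        found.append("RAW")   # true dependence: later reads earlier's result
    if e_dst == l_dst:
        found.append("WAW")   # output (name) dependence
    if l_dst in e_srcs:
        found.append("WAR")   # anti (name) dependence
    return found

# add r1, r2, r3  followed by  sub r2, r1, r5
deps = classify(("r1", ["r2", "r3"]), ("r2", ["r1", "r5"]))  # RAW and WAR
```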

inserting stalls/bubbles also has overhead! to reduce it, you designate pipeline stages that can’t stall, and either:

  • signal only the first stage to stall, and buffer incoming instructions (skid buffer)
  • or replay instructions (see blackparrot)
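the skid-buffer option can be sketched at cycle level — a toy model with a made-up interface, not how BlackParrot actually implements it: when downstream stalls, upstream only hears about it a cycle later, so a one-entry buffer catches the instruction already in flight.

```python
class SkidBuffer:
    def __init__(self):
        self.buf = None                       # the one-entry skid slot

    def cycle(self, in_item, downstream_ready):
        """Advance one cycle; returns (item sent downstream, upstream ready)."""
        out = None
        if downstream_ready:
            if self.buf is not None:          # drain the skid slot first
                out, self.buf = self.buf, in_item
            else:
                out = in_item
        elif in_item is not None:             # downstream stalled mid-flight
            assert self.buf is None, "overrun: upstream ignored the stall"
            self.buf = in_item
        return out, self.buf is None

sb = SkidBuffer()
a = sb.cycle("i0", True)    # ("i0", True): flows straight through
b = sb.cycle("i1", False)   # (None, False): i1 caught in the skid slot
c = sb.cycle(None, True)    # ("i1", True): slot drains when the stall clears
```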

it’s difficult to guarantee that all instructions take the same number of clock cycles to execute, so we can extend our execute stage into multiple different (parallel) pipelines.

take the blackparrot:

TODO: look at blackparrot

diversified pipeline

instructions are issued in order but may complete out of order. we have parallel sub-pipelines

notes

  • have to worry about write-after-write hazards
  • have to worry about contention for register file write ports
  • and as such we get more structural hazards

examples

ARM10 pipeline has multiple execution paths:

  • ALU path (single-cycle)
  • MUL path
  • MEM path (for accessing memory)

notably this allows independent ALU instructions to bypass loads/stores and execute ahead of them. hazards are checked late


control hazards

  • branches are hard since they’re evaluated in the execute stage, so the instructions fetched after them have to be turned into NOPs (flushed)
  • assume branch not taken
    • easy to implement: keep fetching sequentially, and flush only if the branch turns out to be taken
  • evaluate the branch earlier
    • i.e. in the decode stage
    • necessitates very simple branch logic, and more forwarding paths (now from mem/wb to decode, since the branch might need a value that hasn’t been written back yet)
    • reduces branch penalty to single cycle
  • delayed branch
    • assert that the instruction after the branch is always executed — the branch delay slot
    • largely ineffectual these days because of branch predictors
  • branch prediction
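the schemes above can be compared with a quick bit of arithmetic (numbers illustrative, not from the notes): the extra CPI is branch frequency × flush penalty × fraction of branches that actually pay it.

```python
# Extra CPI contributed by control hazards under each scheme.
def branch_stall_cpi(branch_freq, penalty, flush_rate=1.0):
    return branch_freq * penalty * flush_rate

resolve_in_ex  = branch_stall_cpi(0.15, 2)         # resolved in execute
resolve_in_dec = branch_stall_cpi(0.15, 1)         # moved into decode
predicted      = branch_stall_cpi(0.15, 2, 0.05)   # 95%-accurate predictor
# 0.30 vs 0.15 vs 0.015 extra cycles per instruction
```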

analyzing pipelines

  • you have a critical path of delay T. divide it into N stages and add some clocking overhead c per stage, since each stage’s registers impose timing constraints plus possible clock skew — so you get T/N + c as the new clock period
  • assume further that stalls happen at frequency b per instruction, and assume each costs S cycles
  • the clocking overhead and the stalls mean that there is some optimal pipeline depth
  • this optimum is shallower if b or c is high
    • you can reduce b (fewer structural hazards, better branch prediction etc.) or c (more aggressive clock tree, faster registers), but this costs transistors
    • more transistors → more area, more power
  • optimize for area → shallower pipeline
  • optimize for performance → deeper pipeline (see pentium 4 in 2004, 31 stages!)
  • optimize for power → shallower pipeline
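the optimum can be made concrete under an extra assumption of mine: if the stall penalty grows linearly with depth, S(N) = k·N, then time per instruction is TPI(N) = (T/N + c)(1 + b·k·N) = T/N + c + b·k·T + b·k·c·N, which is minimized at N_opt = sqrt(T / (b·k·c)) — shallower when b or c grows, matching the bullets above.

```python
import math

# Time per instruction and its optimum, under S(N) = k*N (my assumption).
def tpi(n, T, c, b, k):
    return (T / n + c) * (1 + b * k * n)

def n_opt(T, c, b, k):
    return math.sqrt(T / (b * k * c))

base        = n_opt(T=10.0, c=0.1, b=0.2, k=0.5)  # ~31.6 stages
more_stalls = n_opt(T=10.0, c=0.1, b=0.4, k=0.5)  # ~22.4: shallower optimum
```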

supervision 4

parallelism:

  • instruction, data
  • domain accelerators
  • DDR logic (4 memory controllers)
  • power/efficiency cores

snoopy cache remarks:

  • provides a total order of writes
  • needs a broadcast medium (e.g. a shared bus)
  • to eliminate race conditions

memory consistency model:

  • defines the ordering of memory operations as observed from other hardware threads
  • sequential consistency:
    • the result of any execution is the same as if all operations happened in some total order (serializability), with each thread’s operations appearing in program order
  • total store order:
    • store→load order within a thread is relaxed (a later load may overtake an earlier store, e.g. via the store buffer)
  • ha — this is just a “description” of what the intel model actually does
  • coalescing write buffer
  • write-through vs write-back and multi-copy atomicity
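the TSO relaxation can be seen in the classic store-buffering litmus test. below is a toy store-buffer model (structure assumed, purely illustrative): each thread’s store sits in its private FIFO buffer while the loads read memory, so both loads return 0 — an outcome forbidden under sequential consistency but allowed under TSO.

```python
mem = {"x": 0, "y": 0}
sbuf = {0: [], 1: []}                 # per-hardware-thread FIFO store buffers

def store(tid, addr, val):
    sbuf[tid].append((addr, val))     # buffered: invisible to other threads

def load(tid, addr):
    for a, v in reversed(sbuf[tid]):  # store-to-load forwarding, own buffer
        if a == addr:
            return v
    return mem[addr]

def drain(tid):                       # buffer retires to memory in FIFO order
    for a, v in sbuf[tid]:
        mem[a] = v
    sbuf[tid].clear()

store(0, "x", 1); store(1, "y", 1)    # thread 0: x = 1;  thread 1: y = 1
r1 = load(0, "y")                     # thread 0 reads y -> 0 (store buffered)
r2 = load(1, "x")                     # thread 1 reads x -> 0 (store buffered)
drain(0); drain(1)                    # both stores finally hit memory
```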