# Computer Architecture ELE 475 / COS 475

Slide Deck 5: Exceptions and Superscalar 2

David Wentzlaff

Department of Electrical Engineering
Princeton University





# Agenda

- Interrupts
- Out-of-Order Processors

### Interrupts: altering the normal flow of control

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.

### Causes of Exceptions

Interrupt: an event that requests the attention of the processor

- Asynchronous: an external event
  - input/output device service request
  - timer expiration
  - power disruptions, hardware failure
- Synchronous: an internal event (a.k.a. exception or trap)
  - undefined opcode, privileged instruction
  - arithmetic overflow, FPU exception
  - misaligned memory access
  - virtual memory exceptions: page faults, TLB misses, protection violations
  - software exceptions: system calls, e.g., jumps into kernel

### Asynchronous Interrupts: invoking the interrupt handler

- An I/O device requests attention by asserting one of the prioritized interrupt request lines
- When the processor decides to process the interrupt
  - It stops the current program at instruction $I_i$, completing all the instructions up to $I_{i-1}$ (a *precise interrupt*)
  - It saves the PC of instruction $I_i$ in a special register (EPC)
  - It disables interrupts and transfers control to a designated interrupt handler running in kernel mode

### Interrupt Handler

- Saves EPC before re-enabling interrupts to allow nested interrupts ⇒
  - need an instruction to move EPC into GPRs
  - need a way to mask further interrupts at least until EPC can be saved
- Needs to read a status register that indicates the cause of the interrupt
- Uses a special indirect jump instruction RFE (return-from-exception) to resume user code; this:
  - enables interrupts
  - restores the processor to the user mode
  - restores hardware status and control state
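The entry and return steps above can be sketched as a tiny state model. This is a minimal illustration under stated assumptions, not a description of any real ISA: the `CPU` class, its field names, and the register name `k0` are hypothetical, and `rfe` is modeled as the indirect jump the slide describes.

```python
from dataclasses import dataclass, field


@dataclass
class CPU:
    """Hypothetical model of the state touched by interrupt entry and RFE."""
    pc: int = 0
    epc: int = 0
    cause: int = 0
    kernel_mode: bool = False
    interrupts_enabled: bool = True
    gprs: dict = field(default_factory=dict)

    def take_interrupt(self, cause: int, handler_pc: int) -> None:
        # Precise interrupt: everything before self.pc is assumed complete.
        self.epc = self.pc                 # save PC of the interrupted instruction
        self.cause = cause                 # status register read by the handler
        self.interrupts_enabled = False    # mask until EPC has been saved
        self.kernel_mode = True
        self.pc = handler_pc

    def move_from_epc(self, reg: str) -> None:
        # Copy EPC into a GPR so that nested interrupts can be re-enabled.
        self.gprs[reg] = self.epc

    def rfe(self, target: int) -> None:
        # Return from exception: jump back, drop to user mode, enable interrupts.
        self.pc = target
        self.kernel_mode = False
        self.interrupts_enabled = True


cpu = CPU(pc=0x104)
cpu.take_interrupt(cause=3, handler_pc=0x8000)
cpu.move_from_epc("k0")
cpu.rfe(cpu.gprs["k0"])
```

After the sequence, the model is back at the interrupted PC in user mode with interrupts enabled.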

### Synchronous Interrupts

- A synchronous interrupt (exception) is caused by a particular instruction
- In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
  - requires undoing the effect of one or more partially executed instructions
- In the case of a system call trap, the instruction is considered to have been completed
  - syscall is a special jump instruction involving a change to privileged kernel mode
  - Handler resumes at instruction after system call

### Exception Handling 5-Stage Pipeline



- Asynchronous Interrupts
- How to handle multiple simultaneous exceptions in different pipeline stages?
- How and where to handle external asynchronous interrupts?

### Exception Handling 5-Stage Pipeline

- Hold exception flags in pipeline until commit point (M stage)
- Exceptions in earlier pipe stages override later exceptions for a given instruction
- Inject external interrupts at commit point (override others)
- If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
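The commit-point rules above can be sketched in a few lines. This is an illustrative model only: the `InFlight` class, the `commit` function, the cause strings, and the handler address are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InFlight:
    """One instruction travelling down the pipeline (hypothetical model)."""
    pc: int
    exc: Optional[str] = None  # exception flag carried along with the instruction

    def flag(self, cause: str) -> None:
        # Stages run in order, so the first flag set comes from the earliest
        # stage; later flags for the same instruction are ignored.
        if self.exc is None:
            self.exc = cause


def commit(instr: InFlight, external_irq: bool, regs: dict,
           handler_pc: int) -> Optional[int]:
    """Commit point. Returns a redirect PC, or None if nothing was raised."""
    # External interrupts are injected here and override any held flag.
    cause = "external_interrupt" if external_irq else instr.exc
    if cause is None:
        return None          # instruction commits normally
    regs["Cause"] = cause
    regs["EPC"] = instr.pc
    return handler_pc        # caller kills all stages and fetches from here


add = InFlight(pc=96)
add.flag("overflow")             # raised in an earlier stage
add.flag("misaligned_access")    # raised later for the same instruction; ignored
regs = {}
redirect = commit(add, external_irq=False, regs=regs, handler_pc=0x8000)
```

Because nothing architectural is written before `commit`, the caller can discard every younger instruction when a redirect PC comes back.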

### Speculating on Exceptions

- Prediction mechanism
  - Exceptions are rare, so simply predicting no exceptions is very accurate!
- Check prediction mechanism
  - Exceptions detected at end of instruction execution pipeline, special hardware for various exception types
- Recovery mechanism
  - Only write architectural state at commit point, so can throw away partially executed instructions after exception
  - Launch exception handler after flushing pipeline
- Bypassing allows use of uncommitted instruction results by following instructions

### **Exception Pipeline Diagram**

```
time                    t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD           IF1  ID1  EX1  MA1  nop    overflow!
(I2) 100: XOR                IF2  ID2  EX2  nop  nop
(I3) 104: SUB                     IF3  ID3  nop  nop  nop
(I4) 108: ADD                          IF4  nop  nop  nop  nop
(I5) Exc. Handler code                      IF5  ID5  EX5  MA5  WB5

Resource usage          t0   t1   t2   t3   t4   t5   t6   t7   ....
IF                      I1   I2   I3   I4   I5
ID                           I1   I2   I3   nop  I5
EX                                I1   I2   nop  nop  I5
MA                                     I1   nop  nop  nop  I5
WB                                          nop  nop  nop  nop  I5
```

# Agenda

- Interrupts
- Out-of-Order Processors

### Out-Of-Order (OOO) Introduction

| Name | Frontend | Issue | Writeback | Commit | Notes                                                        |
|------|----------|-------|-----------|--------|--------------------------------------------------------------|
| I4   | IO       | IO    | IO        | IO     | Fixed Length Pipelines<br>Scoreboard                         |
| I2O2 | IO       | IO    | OOO       | OOO    | Scoreboard                                                   |
| I2OI | IO       | IO    | OOO       | IO     | Scoreboard,<br>Reorder Buffer, and Store Buffer              |
| IO3  | IO       | OOO   | OOO       | OOO    | Scoreboard and Issue Queue                                   |
| IO2I | IO       | OOO   | OOO       | IO     | Scoreboard, Issue Queue,<br>Reorder Buffer, and Store Buffer |

IO = in-order; OOO = out-of-order.

### **OOO** Motivating Code Sequence

```
0 MUL R1, R2, R3
1 ADDIU R11,R10,1
2 MUL R5, R1, R4
3 MUL R7, R5, R6
4 ADDIU R12,R11,1
5 ADDIU R13,R12,1
6 ADDIU R14,R12,2
```





- Two independent sequences of instructions enable flexibility in terms of how instructions are scheduled in total order
- We can schedule statically in software or dynamically in hardware
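The two independent sequences can be found mechanically by following read-after-write (RAW) dependencies. The sketch below is an illustration, not part of the original slides; the tuple encoding of the program and the helper names `raw_dependencies` and `chains` are assumptions made for the example.

```python
# Each entry: (opcode, destination register, source registers).
PROGRAM = [
    ("MUL",   "R1",  ["R2", "R3"]),   # 0
    ("ADDIU", "R11", ["R10"]),        # 1
    ("MUL",   "R5",  ["R1", "R4"]),   # 2
    ("MUL",   "R7",  ["R5", "R6"]),   # 3
    ("ADDIU", "R12", ["R11"]),        # 4
    ("ADDIU", "R13", ["R12"]),        # 5
    ("ADDIU", "R14", ["R12"]),        # 6
]


def raw_dependencies(program):
    """Map each instruction index to the indices of the producers it reads."""
    last_writer = {}
    deps = {}
    for idx, (_, dest, srcs) in enumerate(program):
        deps[idx] = sorted({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = idx
    return deps


def chains(deps):
    """Group instructions connected by RAW dependencies (union-find)."""
    parent = list(range(len(deps)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for consumer, producers in deps.items():
        for producer in producers:
            parent[find(consumer)] = find(producer)

    groups = {}
    for idx in deps:
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


deps = raw_dependencies(PROGRAM)
independent_chains = chains(deps)
```

For this sequence the grouping yields the MUL chain (instructions 0, 2, 3) and the ADDIU chain (instructions 1, 4, 5, 6), which is the freedom an out-of-order scheduler exploits.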

### I4: In-Order Front-End, Issue, Writeback, Commit

### I4: In-Order Front-End, Issue, Writeback, Commit (4-stage MUL)



To avoid increasing CPI, the pipeline needs full bypassing, which can be expensive. To help cycle time, add an Issue stage where the register file is read and the instruction is "issued" to a functional unit.


#### **Basic Scoreboard**

|     | P | F | 4 | 3 | 2 | 1 | 0 |
|-----|---|---|---|---|---|---|---|
| R1  |   |   |   |   |   |   |   |
| R2  |   |   |   |   |   |   |   |
| R3  |   |   |   |   |   |   |   |
| ... |   |   |   |   |   |   |   |
| R31 |   |   |   |   |   |   |   |

Columns 4 through 0 form the Data Avail. field.

**P**: Pending; a write to the destination is in flight

**F**: Which functional unit is writing the register

**Data Avail.**: Where the write data is in the functional unit pipeline

- A one in Data Avail. column *i* means that the result data is in stage *i* of functional unit F
- The F and Data Avail. fields together determine when to bypass and where to bypass from
- A one in column 0 means that, in that cycle, the functional unit is in the Writeback stage
- Bits in the Data Avail. field shift right every cycle
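A minimal sketch of this bookkeeping, assuming one row per register and a list indexed by column number. The class name `BasicScoreboard` and its methods are hypothetical, chosen only to illustrate the shift-right behavior described above.

```python
class BasicScoreboard:
    """Illustrative per-register scoreboard: P bit, F field, Data Avail. bits."""

    def __init__(self, num_regs: int = 32, columns: int = 5):
        self.pending = [False] * num_regs      # P
        self.unit = [None] * num_regs          # F
        # avail[r][c] == 1 means the result for register r is in column c.
        self.avail = [[0] * columns for _ in range(num_regs)]

    def issue(self, dest: int, unit: str, column: int) -> None:
        """Record a write in flight whose data starts in the given column."""
        self.pending[dest] = True
        self.unit[dest] = unit
        self.avail[dest][column] = 1

    def column(self, reg: int):
        """Column currently holding the result, or None if nothing is in flight."""
        row = self.avail[reg]
        return row.index(1) if 1 in row else None

    def tick(self) -> None:
        """Advance one cycle: every bit shifts right (toward column 0)."""
        for reg, row in enumerate(self.avail):
            if row[0]:                         # Writeback happened this cycle
                self.pending[reg] = False
                self.unit[reg] = None
            self.avail[reg] = row[1:] + [0]


sb = BasicScoreboard()
sb.issue(dest=1, unit="MUL", column=4)   # e.g. a 4-stage MUL writing R1
trace = []
while sb.pending[1]:
    trace.append(sb.column(1))
    sb.tick()
```

The recorded trace walks the bit from column 4 down to column 0, after which the pending bit clears.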


```
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
0 MUL   R1, R2, R3  F  D  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1      F  D  I  X0 X1 X2 X3 W
2 MUL   R5, R1, R4        F  D  I  I  I  Y0 Y1 Y2 Y3 W
3 MUL   R7, R5, R6           F  D  D  D  I  I  I  I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1               F  F  F  D  D  D  D  I  X0 X1 X2 X3 W
5 ADDIU R13,R12,1                        F  F  F  F  D  I  X0 X1 X2 X3 W
6 ADDIU R14,R12,2                                    F  D  I  X0 X1 X2 X3 W

Scoreboard trace (D, I: instruction in Decode / Issue; 4..0: Data Avail. columns)

Cyc  D  I   4 3 2 1 0
0
1    0
2    1  0
3    2  1   1
4    3  2   1 1
5    3  2     1 1
6    3  2       1 1
7    4  3   1     1 1
8    4  3     1     1
9    4  3       1
10   4  3         1
11   5  4   1       1
12   6  5   1 1
13      6   1 1 1
14          1 1 1 1
15            1 1 1 1
16              1 1 1
17                1 1
18                  1

In the slide, red marks cycles where, by looking at the F field, we can bypass.
```

# I2O2: In-order Frontend/Issue, Out-of-order Writeback/Commit




#### I2O2 Scoreboard

- Similar to I4, but we can now use it to track structural hazards on the Writeback port
- Set the bit in Data Avail. according to the length of the pipeline
- The architecture conservatively avoids WAW hazards by stalling in Decode, so the current scoreboard is sufficient. Handling WAW hazards without stalling would need a more complicated scoreboard
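One way to picture the Writeback-port check is as a shift register of reservations, where an instruction may issue only if the slot matching its pipeline length is free. This is a hedged sketch: the `WritebackPort` class and the latency constants are assumptions for illustration (a MUL is taken to reach W five cycles after its issue cycle, an ADDIU two cycles after).

```python
class WritebackPort:
    """Illustrative reservation table for a single Writeback port."""

    def __init__(self, max_latency: int = 5):
        # reserved[k] is True if the port is taken k cycles from now.
        self.reserved = [False] * (max_latency + 1)

    def can_issue(self, latency: int) -> bool:
        return not self.reserved[latency]

    def issue(self, latency: int) -> None:
        if not self.can_issue(latency):
            raise RuntimeError("structural hazard on the Writeback port")
        self.reserved[latency] = True

    def tick(self) -> None:
        # One cycle passes: every reservation moves one step closer to W.
        self.reserved = self.reserved[1:] + [False]


MUL_LATENCY = 5     # assumed: I, Y0, Y1, Y2, Y3, then W
ADDIU_LATENCY = 2   # assumed: I, X0, then W

port = WritebackPort()
port.issue(MUL_LATENCY)                       # a MUL issues
for _ in range(3):
    port.tick()                               # three cycles later
blocked = not port.can_issue(ADDIU_LATENCY)   # ADDIU would collide with the MUL at W
port.tick()
free_next_cycle = port.can_issue(ADDIU_LATENCY)
```

An ADDIU that tries to issue three cycles after a MUL would reach W in the same cycle, so it has to wait one cycle in Issue.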

```
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
0 MUL   R1, R2, R3  F  D  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1      F  D  I  X0 W
2 MUL   R5, R1, R4        F  D  I  I  I  Y0 Y1 Y2 Y3 W
3 MUL   R7, R5, R6           F  D  D  D  I  I  I  I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1               F  F  F  D  D  D  D  I  X0 W
5 ADDIU R13,R12,1                        F  F  F  F  D  I  X0 W
6 ADDIU R14,R12,2                                    F  D  I  I  X0 W

Scoreboard trace (D, I: instruction in Decode / Issue; 4..0: Data Avail. columns)

Cyc  D  I   4 3 2 1 0
0
1    0
2    1  0
3    2  1   1
4    3  2     1   1
5    3  2       1   1
6    3  2         1
7    4  3   1       1
8    4  3     1
9    4  3       1
10   4  3         1
11   5  4   1       1
12   6  5     1   1
13      6       1 1 1
14      6         1 1
15                1 1
16                  1

In the slide, red marks cycles where, by looking at the F field, we can bypass.
ADDIU writes with two-cycle latency. Structural hazard.
```

### **Early Commit Point?**

- Limits certain types of exceptions.

# I2OI: In-order Frontend/Issue, Out-of-order Writeback, In-order Commit





### Reorder Buffer (ROB)

| State | S | ST | V | Preg |
|-------|---|----|---|------|
|       |   |    |   |      |
| P     | 1 |    |   |      |
| F     | 1 |    |   |      |
| P     | 1 |    |   |      |
| P     |   |    |   |      |
| F     |   |    |   |      |
| P     |   |    |   |      |
| P     |   |    |   |      |
|       |   |    |   |      |
|       |   |    |   |      |

**State**: {Free, Pending, Finished}

**S**: Speculative

**ST**: Store bit

**V**: Physical Register File Specifier Valid

**Preg**: Physical Register File Specifier

Commit stage is waiting for Head of ROB to be finished
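The waiting behavior at the head of the ROB can be sketched with a small circular buffer. This is an illustrative model that tracks only the State field; the `ReorderBuffer` class and its method names are assumptions, and the S, ST, V, and Preg fields are left out for brevity.

```python
class ReorderBuffer:
    """Illustrative circular ROB tracking only the State field."""

    def __init__(self, size: int = 8):
        self.state = ["Free"] * size
        self.head = 0      # oldest in-flight instruction
        self.tail = 0      # next entry to allocate
        self.count = 0

    def allocate(self) -> int:
        """Allocate an entry in program order; returns its index."""
        if self.count == len(self.state):
            raise RuntimeError("ROB full")
        idx = self.tail
        self.state[idx] = "Pending"
        self.tail = (idx + 1) % len(self.state)
        self.count += 1
        return idx

    def finish(self, idx: int) -> None:
        """Writeback may mark entries finished in any order."""
        self.state[idx] = "Finished"

    def commit(self):
        """Commit only the head, and only once it is finished."""
        if self.count and self.state[self.head] == "Finished":
            idx = self.head
            self.state[idx] = "Free"
            self.head = (idx + 1) % len(self.state)
            self.count -= 1
            return idx
        return None


rob = ReorderBuffer(size=4)
older = rob.allocate()
younger = rob.allocate()
rob.finish(younger)              # younger instruction finishes first
stalled = rob.commit()           # head still Pending, so nothing commits
rob.finish(older)
first = rob.commit()
second = rob.commit()
```

Even though the younger instruction finishes first, it commits only after the older head entry does.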

### Finished Store Buffer (FSB)



- Only one entry is needed if we support only one memory instruction in flight at a time.
- A single-entry FSB makes allocation trivial.
- If we support more than one memory instruction in flight, we need to worry about load/store address aliasing.



# What if First Instruction Causes an Exception?

#### What About Branches?

```
Option 1: Squash instructions earlier. Has more complexity. ROB needs many ports.
0 BEQZ  R1, target  F  D  I  X0 W  C
1 ADDIU R11,R10,1      F  D  I
2 ADDIU R5, R1, R4        F  D
3 ADDIU R7, R5, R6
T ADDIU R12,R11,1               F  D  I  . . .

Option 2: Squash instructions in ROB when Branch commits
0 BEQZ  R1, target  F  D  I  X0 W  C
1 ADDIU R11,R10,1      F  D  I  X0 /
2 ADDIU R5, R1, R4        F  D  I  /
3 ADDIU R7, R5, R6
T ADDIU R12,R11,1               F  D  I  . . .

Option 3: Wait for speculative instructions to reach the Commit stage and squash in Commit stage
0 BEQZ  R1, target  F  D  I  X0 W  C
1 ADDIU R11,R10,1      F  D  I  X0 W  /
2 ADDIU R5, R1, R4        F  D  I  X0 W  /
3 ADDIU R7, R5, R6           F  D  I  X0 W  /
T ADDIU R12,R11,1
```

#### What About Branches?

- Three possible designs, in order of decreasing complexity, based on when to squash speculative instructions and de-allocate their ROB entries:
  1. As soon as the branch resolves
  2. When the branch commits
  3. When the speculative instructions reach commit
- The base design allows only one branch in flight at a time. A second branch stalls in Decode. More bits can be added to track multiple in-flight branches.

### **Avoiding Stalling Commit on Store Miss**



# IO3: In-order Frontend, Out-of-order Issue/Writeback/Commit



### Issue Queue (IQ)



**Op**: Opcode

**Imm.**: Immediate

**S**: Speculative Bit

**V**: Valid (Instruction has corresponding Src/Dest)

**P**: Pending (Waiting on operands to be produced)

Instruction Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1 || !Psrc1) && no structural hazards

For high performance, factor in bypassing
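The readiness condition above translates directly into code. In this sketch the `Operand` and `IQEntry` classes are hypothetical containers for the V and P bits, and the structural-hazard check is passed in as a flag rather than modeled.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    valid: bool     # V: the instruction actually has this source
    pending: bool   # P: still waiting for the value to be produced


@dataclass
class IQEntry:
    op: str
    src0: Operand
    src1: Operand


def is_ready(entry: IQEntry, structural_hazard: bool = False) -> bool:
    """Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1 || !Psrc1) && no structural hazards."""
    src0_ok = (not entry.src0.valid) or (not entry.src0.pending)
    src1_ok = (not entry.src1.valid) or (not entry.src1.pending)
    return src0_ok and src1_ok and not structural_hazard


# ADDIU has one register source; the unused slot is marked invalid.
addiu = IQEntry("ADDIU", Operand(valid=True, pending=False),
                Operand(valid=False, pending=False))
# MUL still waiting on its first source.
mul = IQEntry("MUL", Operand(valid=True, pending=True),
              Operand(valid=True, pending=False))

addiu_ready = is_ready(addiu)
mul_ready = is_ready(mul)
```

An invalid source never blocks issue, which is why the single-source ADDIU is ready while the MUL waits.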

### Centralized vs. Distributed Issue Queue



Centralized



Distributed

### Advanced Scoreboard

Data Avail.

|     | P | 4 | 3 | 2 | 1 | 0 |
|-----|---|---|---|---|---|---|
| R1  |   |   |   |   |   |   |
| R2  |   |   |   |   |   |   |
| R3  |   |   |   |   |   |   |
|     |   |   |   |   |   |   |
| R31 |   |   |   |   |   |   |

**P**: Pending; a write to the destination is in flight

**Data Avail.**: Where the write data is in the pipeline and which functional unit holds it

- Data Avail. now contains the functional unit identifier
- A non-empty value in column 0 means that, in that cycle, the functional unit is in the Writeback stage
- Values in the Data Avail. field shift right every cycle

```
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
0 MUL   R1, R2, R3  F  D  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1      F  D  I  X0 W
2 MUL   R5, R1, R4        F  D  i     I  Y0 Y1 Y2 Y3 W
3 MUL   R7, R5, R6           F  D  i              I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1               F  D  i  I  X0 W
5 ADDIU R13,R12,1                  F  D  i  I  X0 W
6 ADDIU R14,R12,2                     F  D  i        I  X0 W

Issue queue trace (D, I: instruction in Decode / Issue; IQ0..IQ2: issue queue entries)

Cyc  D  I   IQ0          IQ1          IQ2
0
1    0
2    1  0   R1/R2/R3
3    2  1   R11/R10
4    3      R5/R1/R4
5    4                   R7/R5/R6
6    5  2                             R12/R11
7    6  4   R13/R12
8       5                             R14/R12
9
10      3
11      6                             R14/
12
13
14

Entries are Dest/Src0/Src1. In the slide, a circle denotes a value present in the ARF.
Value bypassed so no circle, present bit.
Value set present by Instruction 1 in cycle 5, W Stage.
```

### Assume All Instructions in Issue Queue

```
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
0 MUL   R1, R2, R3  F  D  i  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1   F  D  i     I  X0 W
2 MUL   R5, R1, R4  F  D  i              I  Y0 Y1 Y2 Y3 W
3 MUL   R7, R5, R6  F  D  i                          I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1   F  D  i        I  X0 W
5 ADDIU R13,R12,1   F  D  i                 I  X0 W
6 ADDIU R14,R12,2   F  D  i                    I  X0 W
```

Better performance than previous?

# IO2I: In-order Frontend, Out-of-order Issue/Writeback, In-order Commit





```
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
0 MUL   R1, R2, R3  F  D  I  Y0 Y1 Y2 Y3 W  C
1 ADDIU R11,R10,1      F  D  I  X0 W  r        C
2 MUL   R5, R1, R4        F  D  i     I  Y0 Y1 Y2 Y3 W  C
3 MUL   R7, R5, R6           F  D  i              I  Y0 Y1 Y2 Y3 W  C
4 ADDIU R12,R11,1               F  D  i  I  X0 W  r                    C
5 ADDIU R13,R12,1                  F  D  i  I  X0 W  r                    C
6 ADDIU R14,R12,2                     F  D  i        I  X0 W  r              C
```

# Out-of-order 2-Wide Superscalar with 1 ALU

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C
```

### Acknowledgements

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - Christopher Batten (Cornell)
- MIT material derived from course 6.823
- UCB material derived from course CS252 & CS152
- Cornell material derived from course ECE 4750

Copyright © 2012 David Wentzlaff