# University of California, Berkeley <br> College of Engineering <br> Department of Electrical Engineering and Computer Science 

Spring 2000
Prof. Bob Brodersen
Midterm 1
March 15, 2000
CS152: Computer Architecture

This midterm consists of four problems, each of which has multiple parts, so budget your time accordingly. The exam is closed-book, but calculators and one sheet of notes are allowed. Good luck!

| Name | SOLUTIONS |
| :---: | :---: |
| SID |  |
| Discussion |  |


| Problem | Score |
| :---: | :---: |
| 1 |  |
| 2 |  |
| 3 |  |
| 4 |  |
| Total |  |

## Problem 1: Critical Path and Delay (25 points)

Throughout this problem, use the simple linear delay model presented in class. For the circuit below, assume the following delay parameters:

NAND: $\mathrm{t}_{\mathrm{plh}}=0.5 \mathrm{~ns}, \mathrm{t}_{\mathrm{phl}}=0.5 \mathrm{~ns}, \mathrm{t}_{\mathrm{plhf}}=0.002 \mathrm{~ns} / \mathrm{fF}, \mathrm{t}_{\mathrm{phlf}}=0.002 \mathrm{~ns} / \mathrm{fF}$
Input capacitance: 100 fF
Inverter: $\mathrm{t}_{\mathrm{plh}}=0.2 \mathrm{~ns}, \mathrm{t}_{\mathrm{phl}}=0.2 \mathrm{~ns}, \mathrm{t}_{\mathrm{plhf}}=0.001 \mathrm{~ns} / \mathrm{fF}, \mathrm{t}_{\mathrm{phlf}}=0.001 \mathrm{~ns} / \mathrm{fF}$
Input capacitance: 50 fF
Wiring Capacitance: (Equal for all nodes) 5 fF

a) What is the worst-case delay? Assume there is no delay at the inputs $\mathrm{X}$, $\mathrm{Y}$, and $\mathrm{Z}$.

The equation for the worst case delay is as follows:

| Gate | Delay |
| :--- | :--- |
| INVERTER | $0.2 \mathrm{~ns}+105 \mathrm{fF} \times 0.001 \mathrm{~ns} / \mathrm{fF}$ |
| NAND1 | $0.5 \mathrm{~ns}+205 \mathrm{fF} \times 0.002 \mathrm{~ns} / \mathrm{fF}$ |
| NAND2 | $0.5 \mathrm{~ns}+205 \mathrm{fF} \times 0.002 \mathrm{~ns} / \mathrm{fF}$ |
| NAND3 | $0.5 \mathrm{~ns}+105 \mathrm{fF} \times 0.002 \mathrm{~ns} / \mathrm{fF}$ |
| NAND4 | $0.5 \mathrm{~ns}+5 \mathrm{fF} \times 0.002 \mathrm{~ns} / \mathrm{fF}$ |
| Total | $3.345 \mathrm{~ns}$ |

Note: There is no delay at the input nodes, and remember to include fan-out and wiring delay!
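
The sum above can be checked with a short script. This is only a sketch of the linear delay model (stage delay = intrinsic delay + load × slope); the per-stage loads are copied from the solution table, and the names `STAGES`, `stage_delay`, and `path_delay` are illustrative, not part of the exam.

```python
# Linear delay model: stage delay = intrinsic delay + load capacitance * slope.
# Each entry: (gate name, intrinsic delay in ns, slope in ns/fF, load in fF).
# The load values are taken directly from the solution table above.
STAGES = [
    ("INVERTER", 0.2, 0.001, 105),
    ("NAND1", 0.5, 0.002, 205),
    ("NAND2", 0.5, 0.002, 205),
    ("NAND3", 0.5, 0.002, 105),
    ("NAND4", 0.5, 0.002, 5),
]


def stage_delay(t_intrinsic_ns, slope_ns_per_ff, load_ff):
    """Delay of one gate driving a given capacitive load."""
    return t_intrinsic_ns + slope_ns_per_ff * load_ff


def path_delay(stages):
    """Total delay along a path: the sum of its stage delays."""
    return sum(stage_delay(t, k, c) for _, t, k, c in stages)


if __name__ == "__main__":
    for name, t, k, c in STAGES:
        print(f"{name}: {stage_delay(t, k, c):.3f} ns")
    print(f"total: {path_delay(STAGES):.3f} ns")
```
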

b) Now assume that you want to generate a symbol for the circuit in part (a). Determine the following parameters for your symbol: $\mathrm{t}_{\mathrm{plh}}$, $\mathrm{t}_{\mathrm{phl}}$, and the load-dependent delay (in $\mathrm{ns} / \mathrm{fF}$).


First, the propagation delays: $\mathrm{t}_{\mathrm{plh}}=\mathrm{t}_{\mathrm{phl}}=3.345 \mathrm{~ns}$. This is the same as the critical path from the last part. For the load-dependent delay, since we only have a single NAND driving the output, it is the same as that of the NAND itself: $0.002 \mathrm{~ns} / \mathrm{fF}$.

Now consider the following circuit and the following parameters:
Register: $\mathrm{t}_{\mathrm{clk}-\mathrm{to}-\mathrm{Q}}=0.6 \mathrm{~ns}, \mathrm{t}_{\text {setup }}=0.5 \mathrm{~ns}, \mathrm{t}_{\text {hold }}=0.2 \mathrm{~ns}$
NAND: $\mathrm{t}_{\mathrm{plh}}=0.5 \mathrm{~ns}, \mathrm{t}_{\mathrm{phl}}=0.5 \mathrm{~ns}$

c) What is the maximum frequency at which this circuit will operate correctly? Ignore any load dependent delay.

The critical path is from either input register through all three NANDs, which corresponds to the clock-to-Q delay of the first register plus the propagation delay of the three NANDs plus the setup time of the last register. Therefore, $\mathrm{F}_{\max }=1 /(0.6 \mathrm{~ns}+1.5 \mathrm{~ns}+0.5 \mathrm{~ns}) \approx 385 \mathrm{MHz}$.

d) If the clock signals are skewed so that $\phi_{1}$ arrives 0.2 ns before $\phi_{2}$, which arrives 0.2 ns before $\phi_{3}$ (such that $\phi_{1}-\phi_{2}=\phi_{2}-\phi_{3}=0.2 \mathrm{~ns}$), what is the maximum frequency at which this circuit will operate correctly?

The key is the direction of the skew. In this case, since $\phi_{2}$ comes after $\phi_{1}$, the critical path needs to wait for $\phi_{2}$. However, since the skew is set up such that the clock reaches the input registers earlier than the output register, we can actually increase our $\mathrm{F}_{\max}$, since the skew buys us more time. (Draw the timing diagram to prove it for yourself.) Therefore, $\mathrm{F}_{\max }=1 /(0.6 \mathrm{~ns}+1.5 \mathrm{~ns}+0.5 \mathrm{~ns}-0.2 \mathrm{~ns}) \approx 417 \mathrm{MHz}$.
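
The two frequency calculations in parts (c) and (d) follow the same setup-time constraint, which can be sketched as below. The function name and its `skew_ns` parameter are illustrative; a positive skew here means the capturing register's clock arrives later than the launching register's clock, which is the assumption made in the solution for part (d).

```python
def max_frequency_mhz(t_clk_to_q_ns, t_logic_ns, t_setup_ns, skew_ns=0.0):
    """Maximum clock frequency allowed by the setup constraint.

    Minimum period = clock-to-Q + combinational delay + setup time - skew,
    where skew > 0 means the capturing clock edge arrives late.
    """
    period_ns = t_clk_to_q_ns + t_logic_ns + t_setup_ns - skew_ns
    return 1000.0 / period_ns  # a period of 1 ns corresponds to 1000 MHz


if __name__ == "__main__":
    three_nands_ns = 3 * 0.5
    print(f"part (c): {max_frequency_mhz(0.6, three_nands_ns, 0.5):.1f} MHz")
    print(f"part (d): {max_frequency_mhz(0.6, three_nands_ns, 0.5, skew_ns=0.2):.1f} MHz")
```
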

e) Now assume that $\phi_{1}$ and $\phi_{2}$ arrive at the same time, and that $\phi_{3}$ arrives later. What is the maximum tolerable skew to ensure that there are no hold time violations?

We must have $\mathrm{t}_{\mathrm{clk}-\mathrm{to}-\mathrm{Q}}+$ shortest path delay $>$ skew $+\mathrm{t}_{\text {hold }}$.

To prevent hold time violations, the data at the input of register 3 must remain stable for at least 0.2 ns after its (late) clock edge. This means that when the data from registers 1 or 2 changes at the early clock edge, we have to guarantee that, no matter what values occur, the new data does not reach register 3 before its hold window has closed. Thus, the skew needs to be less than $1.4 \mathrm{~ns}$.
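
The hold constraint can be evaluated numerically as sketched below. The circuit figure is not reproduced here, so the shortest register-to-register path is an assumption: a value of 1.0 ns (two NAND delays) is used because it is the value that reproduces the stated 1.4 ns limit. The function name is illustrative.

```python
def max_hold_skew_ns(t_clk_to_q_ns, t_shortest_path_ns, t_hold_ns):
    """Upper bound on skew from the hold constraint.

    Constraint: t_clk_to_q + t_shortest_path > skew + t_hold,
    so the skew must stay below t_clk_to_q + t_shortest_path - t_hold.
    """
    return t_clk_to_q_ns + t_shortest_path_ns - t_hold_ns


# ASSUMPTION: the shortest path passes through two NAND gates (2 * 0.5 ns).
limit_ns = max_hold_skew_ns(0.6, 2 * 0.5, 0.2)
print(f"skew must be below {limit_ns:.1f} ns")
```
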

## Problem 2: Single-cycle Processors (25 points)

The following MIPS code finds the maximum integer within a bounded array, where \$4 contains a pointer to the beginning of the array, \$5 contains the length of the array, and \$3 contains the address at which the result is stored at the end. Assume there is no branch delay slot.

```
        LW   $2, 0($4)       // assume the first number is the largest
        ADDI $4, $4, 4
        ADDI $5, $5, -1
max:    LW   $6, 0($4)       // load array element and increment pointer
        ADDI $4, $4, 4
        SLT  $7, $2, $6      // update $2 if $6 is larger
        BEQ  $7, $0, next
        ADD  $2, $0, $6
next:   ADDI $5, $5, -1      // continue the search until end of array
        BEQ  $5, $0, finish
        J    max
finish: SW   $2, 0($3)       // store result
```

The single-cycle datapath and control unit are shown on the next page. Assume that the delay and energy consumption per operation for each functional unit is as follows:

- Memory (read or write): $3 \mathrm{~ns}, 3 \mathrm{pJ}$
- ALU and adder: $2 \mathrm{~ns}, 2 \mathrm{pJ}$
- Register file (read or write): $1 \mathrm{~ns}, 1 \mathrm{pJ}$
- All other units: $0 \mathrm{~ns}, 0 \mathrm{pJ}$

a) What is the minimum clock cycle time for this processor?

The minimum cycle time of the processor is 10 ns (or a frequency of 100 MHz ).
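
The 10 ns figure is the delay of the slowest instruction. The sketch below assumes the standard single-cycle datapath, in which each instruction passes through the listed units in series (the datapath figure is not reproduced here, so the `PATHS` table is an assumption, not the exam's own breakdown).

```python
# Unit delays in ns, from the problem statement.
UNIT_DELAY_NS = {"mem": 3, "alu": 2, "reg": 1}

# ASSUMPTION: units traversed in series by each instruction class
# in a textbook single-cycle MIPS datapath.
PATHS = {
    "LW": ["mem", "reg", "alu", "mem", "reg"],   # fetch, read, address, load, write back
    "SW": ["mem", "reg", "alu", "mem"],          # fetch, read, address, store
    "ADDI/ADD/SLT": ["mem", "reg", "alu", "reg"],  # fetch, read, execute, write back
    "BEQ": ["mem", "reg", "alu"],                # fetch, read, compare
}


def path_delay_ns(units):
    return sum(UNIT_DELAY_NS[u] for u in units)


def min_cycle_time_ns(paths):
    """The clock period must cover the slowest instruction."""
    return max(path_delay_ns(units) for units in paths.values())


if __name__ == "__main__":
    for name, units in PATHS.items():
        print(f"{name}: {path_delay_ns(units)} ns")
    print(f"minimum cycle time: {min_cycle_time_ns(PATHS)} ns")
```
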

b) For an array of length N, what is the range of execution time for this program (i.e., the minimum possible execution time and the worst-case execution time)?

We asked for a range on this problem since the exact number of instructions is data dependent (a result of the first branch). The range of execution times is a minimum of $10(7 \mathrm{~N}-4) \mathrm{ns}$ and a maximum of $10(8 \mathrm{~N}-5) \mathrm{ns}$.
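
The instruction counts can be confirmed by stepping through the program. The sketch below is a hand-written Python model of the MIPS loop (not a MIPS simulator); it assumes an array of at least two elements and a 10 ns cycle, and the function name is illustrative.

```python
CYCLE_NS = 10  # single-cycle processor: one instruction per 10 ns cycle


def count_instructions(array):
    """Mirror the MIPS max-search loop and count executed instructions.

    Returns (maximum value, instruction count). Assumes len(array) >= 2.
    """
    count = 3                      # LW, ADDI, ADDI before the loop
    largest = array[0]
    index = 1
    remaining = len(array) - 1
    while True:
        element = array[index]
        index += 1
        count += 4                 # LW, ADDI, SLT, BEQ
        if largest < element:      # branch not taken: the ADD executes
            largest = element
            count += 1
        remaining -= 1
        count += 2                 # ADDI, BEQ
        if remaining == 0:
            break
        count += 1                 # J max
    count += 1                     # SW at finish
    return largest, count


if __name__ == "__main__":
    n = 5
    _, best = count_instructions(list(range(n, 0, -1)))   # descending: ADD never runs
    _, worst = count_instructions(list(range(1, n + 1)))  # ascending: ADD always runs
    print(f"N={n}: min {best * CYCLE_NS} ns, max {worst * CYCLE_NS} ns")
```

A descending array never executes the `ADD`, giving the minimum count, while an ascending array executes it on every iteration, giving the maximum.
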

c) What is the energy consumption (per instruction) for each type of instruction in the program? Assume that components are completely "turned off" and do not consume energy when they are not needed.

The energy consumption of each type of instruction is listed below.

```
LW: 12 pJ
ADDI/ADD/SLT: 9 pJ
BEQ: 10 pJ
SW: 11 pJ
```
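
These totals can be reproduced by summing the energy of every unit an instruction activates. The unit usage below is an assumption based on the standard single-cycle datapath (a PC+4 adder used by every instruction and a separate branch-target adder used by `BEQ`); it was chosen because it matches the listed values, and the datapath figure itself is not reproduced here.

```python
# Energy per operation in pJ, from the problem statement.
UNIT_ENERGY_PJ = {"mem": 3, "alu": 2, "adder": 2, "reg": 1}

# ASSUMPTION: units activated by each instruction class.
USAGE = {
    # fetch, PC+4, register read, ALU, data memory, register write
    "LW": ["mem", "adder", "reg", "alu", "mem", "reg"],
    # fetch, PC+4, register read, ALU, register write
    "ADDI/ADD/SLT": ["mem", "adder", "reg", "alu", "reg"],
    # fetch, PC+4, branch-target adder, register read, ALU compare
    "BEQ": ["mem", "adder", "adder", "reg", "alu"],
    # fetch, PC+4, register read, ALU, data memory
    "SW": ["mem", "adder", "reg", "alu", "mem"],
}


def energy_pj(units):
    return sum(UNIT_ENERGY_PJ[u] for u in units)


if __name__ == "__main__":
    for name, units in USAGE.items():
        print(f"{name}: {energy_pj(units)} pJ")
```
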

Diagram and scratch space for Problem 2:


## Problem 3: Single-cycle Datapath Design (25 points)

The task is to design a single cycle processor with the minimum number of functional units that can perform the following standard MIPS instructions plus a new rotate instruction. The rotate instruction rotates \$RS to the right by the IMMED value and stores the result in \$RD (e.g., a rotate by 2 of a 5 byte word would turn [a b c d e] into [d e a b c]). The blocks that you can use are given below along with their control signals and delay values. All blocks are similar to those we used in class, except for the addition of a 64 bit shifter.

Instructions:

- ADDIU \$RD \$RS IMMED
- ADD \$RD \$RS \$RT
- SRL \$RD \$RS IMMED
- Rotate \$RD \$RS IMMED

Components (and control signals):

- ALU (ALUcontrol, Zero) => 32 bit ALU with Zero status bit. Delay $=4$
  - ALUcontrol $=00$ for ADD, 01 for AND, 10 for SUB, 11 for OR
- EXTENDER (Sign/Zero) => Sign extender. Delay $=1$
- MUX (Select) => 2 input mux. Delay $=1$
- MEMORY (WrEnable, Addr) => Ideal memory. Delay $=1$
- REGISTER (Enable) => Clocked register. Clk-to-Q delay $=1$
- REGISTER FILE (RD, RS, RT, WrEnable) => Register file. Read delay $=1$, setup time $=1$, hold time $=1$
- CONSTANTS => A 32 bit constant can be defined as an input to any block (no delay)
- SHIFTER (ShAmt) => 64 bit shifter (see symbol below). Delay $=1$
  - Input and output are through 2 buses which connect to the upper and lower 32 bits of a 64 bit word.

a) Draw the datapath showing all interconnections and components (including the controller).

b) What is the critical path?

The critical path is stressed on the ADD and ADDIU instructions. It includes the PC, instruction memory, register file, the ALUSrc mux, the ALU, the WrSrc mux, and the setup time for writing to the register file.

c) What is the delay of the critical path?

The sum of all the delays above is 10. Don't forget the clock-to-Q of the PC and the setup time of the register file!
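
Written out term by term, using the component delays given in the problem statement and the path listed in part (b), the sum is:

$$
\underbrace{1}_{\text{PC clk-to-Q}}+\underbrace{1}_{\text{instruction memory}}+\underbrace{1}_{\text{register file read}}+\underbrace{1}_{\text{ALUSrc mux}}+\underbrace{4}_{\text{ALU}}+\underbrace{1}_{\text{WrSrc mux}}+\underbrace{1}_{\text{register file setup}}=10
$$
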

d) Show the values of all the control points for each instruction. (The Enable for the PC is given as an example.)

|  | PCEnable | ALUSrc | ALUOp | Sign/Zero | Rotate | WrSrc |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ADDIU | 1 | 0 | 00 | Sign | X | 0 |
| ADD | 1 | 1 | 00 | X | X | 0 |
| SRL | 1 | X | XX | X | 0 | 1 |
| Rotate | 1 | X | XX | X | 1 | 1 |
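
One way the single 64 bit shifter could serve both SRL and Rotate is sketched below. This is an assumption consistent with the Rotate control signal in the table, not a reproduction of the solution's datapath drawing: the lower 32 bit input is always \$RS, and the upper 32 bit input is taken to be the constant 0 for SRL and \$RS again for Rotate. All function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def shifter64(upper, lower, shamt):
    """Model of a 64 bit right shifter; returns the lower 32 bits of the result."""
    word64 = ((upper & MASK32) << 32) | (lower & MASK32)
    return (word64 >> shamt) & MASK32


def srl(rs, immed):
    """Rotate = 0: the upper half is the constant 0, so zeros shift in."""
    return shifter64(0, rs, immed)


def rotate_right(rs, immed):
    """Rotate = 1: the upper half is a copy of RS, so the low bits wrap around."""
    return shifter64(rs, rs, immed)


if __name__ == "__main__":
    value = 0x12345678
    print(f"SRL    by 8: {srl(value, 8):#010x}")
    print(f"Rotate by 8: {rotate_right(value, 8):#010x}")
```

The sketch is valid for shift amounts from 0 to 31, which is the range a 5 bit ShAmt field can express.
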

## Problem 4: Multi-cycle Processors (25 points)

For this problem you will be working with the multi-cycle datapath components on the next page. All inputs for the functional units are labeled, and registers only have one data input and one data output (you should not draw the clock lines). You will not need to deal with control in this problem, so the control inputs to each block are not shown.

a) Given the datapath components on the next page, determine the register transfer language description for each of the standard MIPS instructions in the table below. You do not need to fill in every row. Hint: Do not write to the register file at the end of a cycle (i.e., only write directly from a register, not a functional unit).

| Cycle | ADDU | LW | SW | JAL |
| :---: | :---: | :---: | :---: | :---: |
| 1 | $\mathrm{IR} \leftarrow \operatorname{Mem}(\mathrm{PC})$; $\mathrm{PC} \leftarrow \mathrm{PC}+4$; | $\mathrm{IR} \leftarrow \operatorname{Mem}(\mathrm{PC})$; $\mathrm{PC} \leftarrow \mathrm{PC}+4$; | $\mathrm{IR} \leftarrow \operatorname{Mem}(\mathrm{PC})$; $\mathrm{PC} \leftarrow \mathrm{PC}+4$; | $\mathrm{IR} \leftarrow \operatorname{Mem}(\mathrm{PC})$; $\mathrm{S} \leftarrow \mathrm{PC}+4$; |
| 2 | $\mathrm{A} \leftarrow \operatorname{Reg}(\operatorname{IR}[\mathrm{rs}])$; $\mathrm{B} \leftarrow \operatorname{Reg}(\operatorname{IR}[\mathrm{rt}])$; | $\mathrm{A} \leftarrow \operatorname{Reg}(\operatorname{IR}[\mathrm{rs}])$; | $\mathrm{A} \leftarrow \operatorname{Reg}(\operatorname{IR}[\mathrm{rs}])$; $\mathrm{B} \leftarrow \operatorname{Reg}(\operatorname{IR}[\mathrm{rt}])$; | $\operatorname{Reg}(31) \leftarrow \mathrm{S}$; $\mathrm{PC} \leftarrow \mathrm{PC}[31: 28]\\|\operatorname{IR}[\mathrm{j}]\\| 00$; |
| 3 | $\mathrm{S} \leftarrow \mathrm{A}+\mathrm{B}$; | $\mathrm{S} \leftarrow \mathrm{A}+\operatorname{Ext}(\mathrm{imm})$; | $\mathrm{S} \leftarrow \mathrm{A}+\operatorname{Ext}(\mathrm{imm})$; |  |
| 4 | $\operatorname{Reg}(\operatorname{IR}[\mathrm{rd}]) \leftarrow \mathrm{S}$; | $\mathrm{M} \leftarrow \operatorname{Mem}(\mathrm{S})$; | $\operatorname{Mem}(\mathrm{S}) \leftarrow \mathrm{B}$; |  |
| 5 |  | $\operatorname{Reg}(\operatorname{IR}[\mathrm{rt}]) \leftarrow \mathrm{M}$; |  |  |

b) You'll notice that some components need to be reused during execution of an instruction. Wire the datapath to support all four instructions, adding only muxes as needed. You may provide constants as inputs to any component. Be sure to label special buses, such as instruction fields. You do not need to draw any control signals (including mux select signals) - just assume they will be correctly generated in all cases.

c) For each instruction in part (a), calculate the CPI and indicate on the table above which operations occur during each cycle.

The register transfers in part (a) above are already separated into the operations that can be performed in each clock cycle. This corresponds to the following CPI: ADDU: 4, LW: 5, SW: 4, and JAL: 2.

d) The table below indicates the worst case delay through each of the functional units used in the datapath. Given these delays, calculate the execution time of this processor for a program consisting of 400,000 adds, 250,000 loads, 250,000 stores, and 100,000 branches.

| Functional Unit | Worst-case Delay |
| :---: | :---: |
| Memory | 50 ns |
| Register File (read) | 25 ns |
| Register File (write) | 15 ns |
| ALU | 30 ns |
| All others | 0 ns |

The trick to this question was realizing that cycle time is only affected by the longest path between registers in the datapath. In this case, since there is no setup time or clock-to-Q delay, the cycle time will be 50 ns, the delay of the memory, which is the slowest functional unit. There is no need to have two functional units in series in this datapath, and doing so will only reduce performance.

The execution time of the processor will be the total number of cycles required multiplied by the cycle time:

Time $=(400,000 \times 4+250,000 \times 5+250,000 \times 4+100,000 \times 2) \times 50 \mathrm{~ns}=4,050,000 \times 50 \mathrm{~ns}=202.5 \mathrm{~ms}$
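
The same calculation as a short script, for checking the arithmetic. The instruction mix and CPI values are those given above, with the 100,000 branches charged the 2 cycle JAL cost as in the solution; the variable names are illustrative.

```python
CYCLE_NS = 50  # set by the slowest functional unit (memory)

# instruction class -> (instruction count, cycles per instruction)
MIX = {
    "ADDU": (400_000, 4),
    "LW": (250_000, 5),
    "SW": (250_000, 4),
    "branch (JAL)": (100_000, 2),
}

total_instructions = sum(count for count, _ in MIX.values())
total_cycles = sum(count * cpi for count, cpi in MIX.values())
total_time_ms = total_cycles * CYCLE_NS / 1e6  # 1 ms = 1e6 ns
average_cpi = total_cycles / total_instructions

print(f"total cycles:   {total_cycles}")
print(f"execution time: {total_time_ms} ms")
print(f"average CPI:    {average_cpi}")
```
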


