> University of California, Berkeley
> College of Engineering
> Computer Science Division - EECS

Fall 1997
D.A. Patterson

## Midterm I

October 8, 1997
CS152 Computer Architecture and Engineering

You are allowed to use a calculator and one $8.5 " \times 1 "$ double-sided page of notes. You have 3 hours. Good luck!

| Your Name: |  |
| :---: | :--- |
| SID Number: |  |
| Discussion Section: |  |


| 1 | $/ 20$ |
| :---: | :---: |
| 2 | $/ 10$ |
| 3 | $/ 10$ |
| 4 | $/ 30$ |
| Total | $/ 70$ |

## Question 1

You are running a benchmark on your company's processor, which runs at 200 MHz , and has these characteristics:

| Instruction Type | Frequency (\%) | Cycles |
| :---: | :---: | :---: |
| Arithmetic and Logical | 40 | 1 |
| Load and Store | 30 | 2 |
| Branches | 20 | 3 |
| Floating Point | 10 | 4 |

Your company is considering offering a cheaper, lower-performance version of the processor. Their plan is to remove some of the floating point hardware to reduce the die size of the chip. The new chip will run at the same clock speed.

The wafer on which the chip is produced has a diameter of 10 cm , a cost of $\$ 1000$, and $1 /\left(\mathrm{cm}^{2}\right)$ defects. The manufacturing process results in a $90 \%$ wafer yield and a value of 2 for $\alpha$.

The current processor has a die size of $12 \mathrm{~mm} \times 12 \mathrm{~mm}$. After the changes, the die size will be $10 \mathrm{~mm} \times 10 \mathrm{~mm}$, and floating point instructions will take 12 cycles to execute.

Here are some equations you may find useful:

$$
\begin{gathered}
\text { dies } / \text { wafer }=\frac{\pi \times(\text { wafer diameter } / 2)^{2}}{\text { die area }} \Leftrightarrow \frac{\pi \times \text { wafer diameter }}{\sqrt{2 \times \text { die area }}} \\
\text { die yield }=\text { wafer yield } \times\left(1+\frac{\text { defects per unit area } \times \text { die area }}{\alpha}\right)^{-\alpha}
\end{gathered}
$$

a) What is the CPI and MIPS rating of the original processor?
b) What is the CPI and MIPS rating of the smaller processor?

## Question 1 (cont)

c) What is the old cost per (working) processor?
d) What is the new cost per (working) processor?
e) What is the percentage change in price per performance?

## Question 1 (cont)

Your competitors produce a chip that runs at 250 MHz and has the following characteristics for the benchmark:

| Instruction Type | Frequency (\%) | Cycles |
| :---: | :---: | :---: |
| Arithmetic and Logical | 40 | 1 |
| Load and Store | 30 | 3 |
| Branches | 20 | 3 |
| Floating Point | 10 | 5 |

f) What is the CPI and MIPS rating of your competitor's processor for this benchmark?
g) Your company's advertising department wants to defend your company's motto ("The Appearance of Excellence") by advertising a higher MIPS rating for your flagship processor (the one with the larger die, with the floating point hardware intact) than your competitor's processor. They want you to write a benchmark that gives this result. Describe an instruction mix that would accomplish this marketing goal (give specific percentages of each instruction type).

## Question 1 (cont)

h) Instead, you decide to improve the compiler that is used to compile this benchmark on your processor. Your compiler reduces the branches by $50 \%$, but it increases the number of arithmetic and logical instructions by $25 \%$; it does not affect the number of other instructions. What is your new CPI and MIPS?
i) Using the original instruction mix, which machine is faster? How much faster is it?

## Question 2

The ALU presented in the book supported set less than (slt) using the sign bit of the adder $(a<b \Leftrightarrow a \Leftrightarrow b<0)$. Let's try the set less than operation using the values $\Leftrightarrow 7_{\text {ten }}$ and $6_{\text {ten }}$. To make it simpler to follow the example, let's limit the binary representation to 4 bits: $1001_{\text {two }}$ and $0110_{\text {two }}$.
$1001_{t w o} \Leftrightarrow 0110_{t w o}=1001_{t w_{o}}+1010_{t w o}=0011_{t w o}$
The result suggests that $\Leftrightarrow 7>6$, which is clearly wrong. Hence we must factor in overflow in the decision. The schematic given below is the most significant bit of an ALU. Fix the set output to handle slt correctly. Explain your modifications and give the new function for the set output as well. Assume that the overflow signal is correct and can be used and that the gates available to you are: inverters, AND and OR. You can use the following blank page if you need more space.


Question 2 (cont)

## Question 3

The figure below presents the portion of the schematic of a full adder that calculates the carry out signal.


Assume the following characteristics for the gates:
AND2: Input load=100fF, propagation delay low-to-high TPlh $=0.2 \mathrm{~ns}$, propagation delay high-to-low TPhl $=0.5 \mathrm{~ns}$, load dependent delay $\mathrm{TPlhf}=\mathrm{TPhlf}=0.002 \mathrm{~ns} / \mathrm{fF}$.

OR2: Input load $=100 \mathrm{fF}$, propagation delay low-to-high TPlh $=0.5$ ns, propagation delay high-to-low $\mathrm{TPh}=0.1 \mathrm{~ns}$, load dependent delay $\mathrm{TPlhf}=\mathrm{TPhlf}=0.002 \mathrm{~ns} / \mathrm{fF}$.

Identify the critical path in this schematic and fully characterize its delay using the linear delay model. Assume that the last OR2 gate drives a capacitance of 300 fF . You can use the following blank page if you need more space.

Question 3 (cont)

## Question 4

In October of 1996, Silicon Graphics introduced a new set of instructions known as MIPS Digital Media Extensions (MDMX). Similar to the Intel MMX, the MDMX specification uses a single instruction multiple data (SIMD) data path to perform parallel narrow data operations on bytes and halfwords within a single instruction.

The MDMX has yet to be implemented on a commercially available microprocessor. We will explore some MDMX ideas by extending the single cycle datapath discussed in class. Where the real MDMX uses 64 -bit floating point registers, we will use the 32 -bit integer registers to perform parallel operations on two half words ("Dual Halfs" instructions).

Consider two pseudo-MDMX instructions (based on the real MDMX!), ADD.DH and MAX.DH. ADD.DH adds two 16 -bit signed integers in parallel. MAX.DH is more unusual. It performs two simultaneous comparisons and stores the larger results in a third register. The register transfer operations are given below. $\mathrm{R}[\mathrm{x}][0]$ refers to the half word in bits $15: 0$ of register x , and $\mathrm{R}[\mathrm{x}][1]$ refers to bits 31:16.

INSTRUCTION rd, rs, rt
ADD.DH $\$ \mathrm{r} 1, \$ \mathrm{r} 2, \$ \mathrm{r} 3 \mathrm{R}[\mathrm{rd}][0] \Leftarrow \mathrm{R}[\mathrm{rs}][0]+\mathrm{R}[\mathrm{rt}][0]$

$$
\mathrm{R}[\mathrm{rd}][1] \Leftarrow \mathrm{R}[\mathrm{rs}][1]+\mathrm{R}[\mathrm{rt}][1]
$$

$$
\mathrm{PC} \Leftarrow \mathrm{PC}+4
$$

MAX.DH $\$ r 1, \$ r 2, \$ r 3$ for $i=0,1$ begin

$$
\text { if }(\mathrm{R}[\mathrm{rs}][\mathrm{i}]<\mathrm{R}[\mathrm{rt}][\mathrm{i}]) \text { then }
$$

$$
\mathrm{R}[\mathrm{rd}][\mathrm{i}] \Leftarrow \mathrm{R}[\mathrm{rt}][\mathrm{i}]
$$

else

$$
\mathrm{R}[\mathrm{rd}][\mathrm{i}] \Leftarrow \mathrm{R}[\mathrm{rs}][\mathrm{i}]
$$

## end

$$
\mathrm{PC} \Leftarrow \mathrm{PC}+4
$$

## Question 4 (cont)

a) The single cycle processor developed in class is shown below. Make the necessary datapath modifications for the two MDMX instructions. You may use a 16 -bit version of the 32 -bit ALU. If you define your own component, be sure to specify its behavior. Label your control signals with descriptive names. To maximize your chances for partial credit, write down anything else that will help us evaluate your work (you do not need to specify the control functions until part b). You will be graded for correctness more than efficiency.

The meaning of some control signals are given below.
$n P C S e l: 0 \Rightarrow P C \leftarrow P C+4 ; 1 \Rightarrow P C \leftarrow P C+4+\operatorname{SignExt}(\operatorname{Imm} 16) \| 00$
ALUCtr : "add", "sub", "and", "or", "slt"(signed comparison)
ExtOp: "zero", "sign"

Use the diagram below for scratch and neatly draw your final answer on the next page.


## Question 4 (cont)



## Question 4 (cont)

b) For ADD.DH and MAX.DH, give the values of all control signals, including those you added in part (a). The control signals can be functions of other control signals, values labeled on the datapath, or don't cares. Use $b[i]$ to denote bit $i$ on bus $b$. You may use high level specifications such as $i f$-then-else. The number of grids below is not an indication of the number of control signals you will need.

| Control Line | ADD.DH | MAX.DH |
| :---: | :--- | :--- |
| nPCSel |  |  |
| RegDst |  |  |
| RegWr |  |  |
| ExtOp |  |  |
| ALUSrc |  |  |
| ALUCtr |  |  |
| MemWr |  |  |
| MemtoReg |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |

## Question 4 (cont)

Extra Credit. [5 points] 16 16-bit signed integers are stored in memory as follows:

```
0x00000000 (half-word 0)
0x00000002 (half-word 1)
0x00000004 (half-word 2)
.
0x0000001E (half-word 15)
```

The following MIPS code (assuming no delay slots) finds the largest integer and stores the result in lower 16 bits of $\$ \mathrm{v} 0$. Since we are dealing with half words, the final value of the upper 16 bits of $\$ \mathrm{v} 0$ is irrelevant.

```
    lh $v0, 0x001E($zero) #assume the last number is the largest
    addi $t0, $zero, 0x001C #initialize the half-word pointer
LARGEST:
    lh $s0, O($t0)
    slt $t1, $v0, $s0
    beq $t1, $zero, NEXT
    addi $v0, $zero, $s0
NEXT:
    addi $t0, $t0, -2 #continue the search until pointer becomes negative
    slt $t1, $t0, $zero
    beq $t1, $zero, LARGEST
END :
```

Take advantage of the parallelism in MAX.DH to write a faster version of this code. Exactly how many instructions are executed in your code? What are the minimum and maximum number of instructions executed in the non-MDMX code given above?

Question 4 (cont)

