Midterm II
October 19, 1997
CS152 Computer Architecture and Engineering

You are allowed to use a calculator and one 8.5” x 11” double-sided page of notes. You have 3 hours. Good luck!

<table>
<thead>
<tr>
<th>Your Name:</th>
</tr>
</thead>
<tbody>
<tr>
<td>SID Number:</td>
</tr>
<tr>
<td>Discussion Section:</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>1</th>
<th>/20</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>/20</td>
</tr>
<tr>
<td>3</td>
<td>/20</td>
</tr>
<tr>
<td>4</td>
<td>/20</td>
</tr>
<tr>
<td>Total</td>
<td>/80</td>
</tr>
</tbody>
</table>
Question 1

In addition to higher hardware costs, a larger cache can also have a longer access time. Beyond a certain size, an L1 cache will reduce the performance of the CPU. It is possible, however, to reduce the miss penalty without affecting cycle time. To do this, many modern computers use a second level cache. The L2 cache is not inside the pipeline and is accessed only when there is a miss in L1. Main memory access is required only when there is a miss in L2.

For this question, we will not distinguish reads from writes. Both the L1 and L2 caches are synchronized to the CPU clock. For both L1 and L2, a miss is handled with an access to the next lower level in the memory hierarchy, and after the miss the request is handled exactly like a hit. For example, in a read, L1 will first update its contents with data from L2, and then pass it on to the CPU.

Definitions:

- MR - Miss rate. Fraction of accesses to this cache that result in a miss.
- HT - Hit time. Access time during a cache hit.
- MP - Miss penalty. Additional access time incurred by a cache miss.

a) For a system with two levels of caching, L1 and L2, give the average access time of L1 in terms of $MR_{L1}$, $HT_{L1}$, $MR_{L2}$, $HT_{L2}$, and $MP_{L2}$, for the case where HT and MP are integral multiples of CPU cycle time.
Question 1 (cont)

You are evaluating some proposals for improving performance of a system still in the design stage. For the applications that you have in mind, the base (i.e. when there are no memory stalls) CPI is 1. Cost is no object to you.

The L1 and L2 caches available have the following characteristics:

- $HT_{L1} = 2\text{ns}$
- $MR_{L1} = 5\%$
- $HT_{L2} = 20\text{ns}$
- $MR_{L2} = 25\%$ (remember, L2 is accessed only on an L1 miss)
- $MP_{L2}$ = main memory access time = 100ns

The clock cycle time for the CPU is currently 2ns.
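
As a sanity check on the definitions, here is a minimal Python sketch of one common way to model a two-level hierarchy. It assumes that every access pays the L1 hit time, that an L1 miss additionally pays the L2 hit time, and that an L2 miss additionally pays the main memory time. The function name `amat_two_level` is illustrative and is not part of the exam.

```python
def amat_two_level(ht_l1, mr_l1, ht_l2, mr_l2, mp_l2):
    """Average L1 access time, in the same time unit as the inputs.

    Assumed model: every access pays HT_L1; an L1 miss additionally pays
    HT_L2; an L2 miss additionally pays MP_L2.
    """
    mp_l1 = ht_l2 + mr_l2 * mp_l2  # average penalty of an L1 miss
    return ht_l1 + mr_l1 * mp_l1


# Parameters listed above, in nanoseconds.
average_ns = amat_two_level(ht_l1=2.0, mr_l1=0.05, ht_l2=20.0,
                            mr_l2=0.25, mp_l2=100.0)
print(f"{average_ns:.2f} ns")
```

The sketch ignores the requirement that times be rounded to whole clock cycles, which matters once a hit time such as 2.4ns is no longer a multiple of the cycle time.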

b) Consider a design that only uses an L1 cache. It has been proposed that doubling the size of the cache will decrease the miss rate to 4%, while increasing hit time to 2.4ns. Is this a good idea? What will be the percentage change in performance?

c) An L2 cache was added to the system. Now is it a good idea to double the L1 cache from its original size? What will be the change in performance?

d) In general, you tend to care more about miss ratio for L1 caches and hit times for L2 caches. True or False?

e) Compare your answer to part b) with part c). Does the presence or absence of an L2 cache influence design decisions for L1? Explain why.
Question 2

You have recently been recruited by InDec Corp. to work on a second version of their “Alphium” processor. Alphium is a simple RISC processor with 5 pipeline stages, just like the one presented in class. Its clock frequency is 200MHz and the chip power supply (Vdd) is 3.3V. For the highly appreciated MisSpec benchmark, Alphium achieves an IPC of 0.8 instructions per cycle (\(IPC = \frac{1}{CPI}\)).

Your assignment is to evaluate the improvement in execution time, power and energy consumption for four proposed new versions of the Alphium processor. Here is a description of the alternatives:

- **Low Power Alphium (LPA):** This would be the same design as the original Alphium but it would be clocked at 133MHz to save power. The power supply (Vdd) would remain 3.3V and the achieved IPC for MisSpec would be 0.8 again.

- **Superscalar Alphium (SSA):** This would be a 4-way superscalar version of the Alphium. It would have the ability to issue up to 4 instructions per cycle. The power supply (Vdd) would remain 3.3V but the clock frequency would be reduced to 166MHz, due to the complexity of the issuing logic. The achieved IPC for MisSpec would be 2. Assume that the effective capacitance switched in the superscalar design would be 4 times that of the original.

- **Low Voltage Alphium (LVA):** This would be the same design as the original Alphium again but both the power supply and the clock frequency would be reduced. Power supply (Vdd) would be 2V and clock frequency would be 133MHz. The IPC for MisSpec would be 0.8 once again.

- **Low Voltage Superscalar Alphium (LVSSA):** This would be a 4-way superscalar version with reduced power supply. The clock frequency would be 100MHz, the power supply (Vdd) would be 2V and an IPC of 2 would be achieved for MisSpec. Assume that the effective capacitance switched in the superscalar design would be 4 times that of the original.

Here is some information that you may find useful:

**Power:** When we measure power for a system, we care about the maximum instantaneous power the system can consume. This is important as it determines the maximum current that the power supply must be able to supply to the system and the amount of heat that has to be removed from the system.

**Energy:** \(E = C \times Vdd^2\) is just the energy per transition. This is not interesting. We care about the energy consumed from the power supply to execute a task (or perform some computation). Once the task is executed, the processor can be turned off and no further energy is needed. The energy per task determines how many tasks you can execute before the battery runs out.
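
To make the distinction between power and energy per task concrete, here is a minimal Python sketch. It assumes the usual dynamic CMOS model, in which power is proportional to \(C \times Vdd^2 \times f\), and a fixed instruction count, so that execution time is proportional to \(1 / (IPC \times f)\). The function name `relative_metrics` and the half-clock variant are hypothetical and are not among the four proposals.

```python
def relative_metrics(ipc, freq_mhz, vdd, cap_scale=1.0,
                     base_ipc=0.8, base_freq_mhz=200.0, base_vdd=3.3):
    """Return time, power and energy relative to the original Alphium."""
    # Execution time for a fixed instruction count scales as 1 / (IPC * f).
    time = (base_ipc * base_freq_mhz) / (ipc * freq_mhz)
    # Assumed dynamic power model: P is proportional to C * Vdd^2 * f.
    power = cap_scale * (vdd / base_vdd) ** 2 * (freq_mhz / base_freq_mhz)
    # Energy per task is power multiplied by the time the task takes.
    energy = power * time
    return {"time": time, "power": power, "energy": energy}


# Hypothetical variant: half the clock rate, same Vdd, IPC and capacitance.
half_clock = relative_metrics(ipc=0.8, freq_mhz=100.0, vdd=3.3)
print(half_clock)
```

Under these assumptions the half-clock variant draws half the power but takes twice as long, so its energy per task is unchanged.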
Question 2 (cont)

a) Provide the basic formulas that will allow you to compare the 4 alternatives to the original Alphium. You will need the formulas for execution time, power consumption, energy consumption for the MisSpec benchmark, performance per power ratio for the MisSpec benchmark and performance per energy ratio for the MisSpec benchmark.

Execution Time =

Power =

Energy =

Performance per Power =

Performance per Energy =
Question 2 (cont)

b) Fill in the two following tables. In each box write how the proposed new version compares to the original Alphium for that feature (e.g. \( \frac{\text{ExecTime}_{\text{new}}}{\text{ExecTime}_{\text{original}}} \), \( \frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{original}}} \) etc). Two fractional digits per entry are enough. Use the following (blank) page as scratch paper.

<table>
<thead>
<tr>
<th></th>
<th>IPC</th>
<th>Freq</th>
<th>Vdd</th>
<th>Relative ExecTime</th>
<th>Relative Power</th>
<th>Relative Energy</th>
</tr>
</thead>
<tbody>
<tr>
<td>LPA</td>
<td>0.8</td>
<td>133MHz</td>
<td>3.3V</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SSA</td>
<td>2.0</td>
<td>166MHz</td>
<td>3.3V</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>LVA</td>
<td>0.8</td>
<td>133MHz</td>
<td>2.0V</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>LVSSA</td>
<td>2.0</td>
<td>100MHz</td>
<td>2.0V</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>IPC</th>
<th>Freq</th>
<th>Vdd</th>
<th>Relative Performance/Power</th>
<th>Relative Performance/Energy</th>
</tr>
</thead>
<tbody>
<tr>
<td>LPA</td>
<td>0.8</td>
<td>133MHz</td>
<td>3.3V</td>
<td></td>
<td></td>
</tr>
<tr>
<td>SSA</td>
<td>2.0</td>
<td>166MHz</td>
<td>3.3V</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LVA</td>
<td>0.8</td>
<td>133MHz</td>
<td>2.0V</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LVSSA</td>
<td>2.0</td>
<td>100MHz</td>
<td>2.0V</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Question 2 (cont)
c) Circle the right answer (true or false):

1. Without any other changes, lowering the clock frequency of a processor leads to energy savings  
   True  False

2. Increasing performance always leads to wasting energy  
   True  False

3. Lowering supply voltage can be combined with increasing performance  
   True  False

4. Comparing processors using performance per power is the same as using performance per energy  
   True  False


d) For a server, which of the above metrics (execution time, power, energy, performance/power, performance/energy) would you use to pick a processor? Why?

For a laptop computer, which of the above metrics (execution time, power, energy, performance/power, performance/energy) would you use to pick a processor? Why?

For a Personal Digital Assistant (PDA), which of the above metrics (execution time, power, energy, performance/power, performance/energy) would you use to pick a processor? Why?
Question 3

The DiSPlacement is a hypothetical DSP variation of the MIPS architecture. Here are the 3 changes from MIPS:

1. Load and store instructions are changed to have ONLY the following two addressing modes:
   
   (a) Register indirect: the address is the contents of the register. For example:
   
   \[
   \text{lwi } r5, r1 \quad \# r5 \leftarrow \text{Mem}[r1]
   \]
   
   (b) Register autoincrement (ai): the address is the contents of the register; as part of this instruction, increment this register by the size of the data in bytes. Note that the memory address is the original value of the register before incrementing. For example:
   
   \[
   \text{lwai } r5, r1 \quad \# r5 \leftarrow \text{Mem}[r1];\ r1 \leftarrow r1 + 4
   \]
   
2. There is a new 64-bit register called Acc, standing for accumulator.

3. There is a multiply accumulate instruction (MAC), which both adds the contents of Hi:Lo to Acc and multiplies two 32-bit registers and puts the 64-bit product into the existing registers Hi:Lo. For example:
   
   \[
   \text{mac } r3, r4 \quad \# \text{Acc} \leftarrow \text{Acc} + \text{Hi:Lo};\ \text{Hi:Lo} \leftarrow r3 \times r4
   \]
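
The semantics of the three changes above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the DiSPlacement hardware: memory is a dictionary keyed by byte address, operands are treated as unsigned, results wrap at 64 bits, and the class name `DspState` is illustrative.

```python
MASK64 = (1 << 64) - 1


class DspState:
    """Toy model of the state touched by lwi, lwai and mac."""

    def __init__(self, memory):
        self.mem = dict(memory)  # byte address -> 32-bit word (assumed layout)
        self.reg = [0] * 32      # general-purpose registers
        self.acc = 0             # 64-bit accumulator
        self.hilo = 0            # Hi:Lo viewed as one 64-bit value

    def lwi(self, rd, rs):
        # Register indirect: the address is the contents of rs.
        self.reg[rd] = self.mem[self.reg[rs]]

    def lwai(self, rd, rs):
        # Autoincrement: load from the original address, then add 4 to rs.
        addr = self.reg[rs]
        self.reg[rd] = self.mem[addr]
        self.reg[rs] = addr + 4

    def mac(self, ra, rb):
        # Acc receives the previous product before Hi:Lo is overwritten.
        self.acc = (self.acc + self.hilo) & MASK64
        self.hilo = (self.reg[ra] * self.reg[rb]) & MASK64


# Two taps of a filter: coefficient 2 in r2, samples 3 and 5 in memory.
state = DspState({0x100: 3, 0x104: 5})
state.reg[1] = 0x100
state.reg[2] = 2
for _ in range(2):
    state.lwai(5, 1)
    state.mac(2, 5)
print(state.acc, state.hilo, hex(state.reg[1]))
```

Note that after the last `mac` the most recent product is still in Hi:Lo and has not yet been added to Acc, which follows directly from the order of the two operations in the instruction definition.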

Putting these extensions together, the unrolled loop of the FIR filter looks like this (assume that Acc and Hi:Lo are initialized to 0):

\[
\begin{aligned}
\text{lwai } & r5, r1 \quad \# r5 \leftarrow \text{Mem}[r1];\ r1 \leftarrow r1 + 4 \\
\text{mac } & r2, r5 \quad \# \text{Acc} \leftarrow \text{Acc} + \text{Hi:Lo};\ \text{Hi:Lo} \leftarrow r2 \times r5 \\
\text{lwai } & r5, r1 \quad \# r5 \leftarrow \text{Mem}[r1];\ r1 \leftarrow r1 + 4 \\
\text{mac } & r2, r5 \quad \# \text{Acc} \leftarrow \text{Acc} + \text{Hi:Lo};\ \text{Hi:Lo} \leftarrow r2 \times r5 \\
\end{aligned}
\]

... Since the memory accesses are based on the contents of registers only, the designers of DiSPlacement decided to change the 5-stage pipeline by swapping the EX and MEM stages:

1. Instruction Fetch
2. Instruction Decode/Register Fetch
3. Memory Access
4. Execute
5. Write Back

Assume that the execute stage has a 1 clock cycle multiplier and that the ALU can perform 64-bit additions. The figure on the next page shows the modified pipeline datapath.

[Figure: modified DiSPlacement pipeline datapath, with the Memory Access stage before the Execute stage]

Question 3 (cont)

As an internationally famous computer designer (this is in 2 years), you are brought in to comment on DiSPlacement.

Here are the questions the management has about DiSPlacement. Please answer clearly and show how you got your answers.

a) What are the pipeline hazards in this modified pipeline?

1. Any structural hazards?

2. Any control hazards?

3. For data hazards, look at the following interactions. Which are hazards? When?

(a) Load then Load (same address register)

(b) Load then Branch

(c) Load then Arithmetic-logical

(d) Load then Store
(e) Arithmetic-logical then Arithmetic-logical

(f) Arithmetic-logical then Store

(g) Arithmetic-logical then Branch
b) Remove as many of these hazards as you can, but you are limited to changes in the datapath from the following list:

1. Change the number of read or write ports on the register file;
2. Add one more adder (of whatever width you need);
3. Add multiplexors to the inputs of memory, multipliers, ALUs, or adders.

Do not worry about the control of any changes. In the table below, list the original hazard, hardware changes, and why the change resolves the hazard.

<table>
<thead>
<tr>
<th>Hazard</th>
<th>Hardware Changes</th>
<th>Why It Resolves</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
c) What is the impact of these changes for non-DSP applications? Specifically, how would the changes affect the clock rate of DiSPlacement, instruction count of traditional programs running on DiSPlacement, or the CPI of the original MIPS instruction set. State assumptions, then estimate instruction count and CPI impact quantitatively.

<table>
<thead>
<tr>
<th>Reason For Change</th>
<th>Estimated Impact (Quantitative)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock rate</td>
<td></td>
</tr>
<tr>
<td>Instruction count</td>
<td></td>
</tr>
<tr>
<td>CPI</td>
<td></td>
</tr>
</tbody>
</table>
Question 4

The I/O bus and memory system of a computer are capable of sustaining 1000 MB/s without interfering with the performance of a 700-MIPS CPU (costing $50,000). This system will be used as a transaction processing (TP) system. TP involves many relatively small changes (transactions) to a large body of shared information (the database account file). For example, airline reservation systems as well as banks are traditional customers for TP. Here are the assumptions about the software on the system that will execute a TP benchmark:

- Each transaction requires 2 disk reads plus 2 disk writes.
- The operating system uses 50,000 instructions for each disk read or write.
- The database software executes 500,000 instructions to process a transaction.
- The amount of data transferred per transaction is 2048 bytes.

You have a choice of two different types of disks:

- A small disk (2.5”) that stores 1000 MB and costs $60.
- A big disk (3.5”) that stores 2500 MB and costs $150.

Either disk in the system can support on average 100 disk reads or writes per second.

You wish to evaluate different system configurations based on a transaction processing benchmark that uses a 20 GB database account file. Answer parts (a)–(e) based on this benchmark. Assume that the requests are spread evenly to all the disks, and that there is no waiting time due to busy disks. Show all work for all parts.
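
The bottleneck reasoning described above can be sketched in Python as follows. This is a minimal sketch that assumes 1 GB = 1000 MB, so that the 20 GB file is 20,000 MB, and that each subsystem's limit is its aggregate rate divided by the demand one transaction places on it. The helper names `tps_limit` and `disks_needed` are illustrative and are not part of the exam.

```python
import math


def tps_limit(units, rate_per_unit, demand_per_transaction):
    """Transactions per second one subsystem can sustain."""
    return units * rate_per_unit / demand_per_transaction


def disks_needed(database_mb, disk_mb):
    """Minimum number of disks that can hold the whole database."""
    return math.ceil(database_mb / disk_mb)


# Demand per transaction, taken from the software assumptions above.
IOS_PER_TRANSACTION = 2 + 2  # 2 disk reads plus 2 disk writes
INSTR_PER_TRANSACTION = IOS_PER_TRANSACTION * 50_000 + 500_000

DATABASE_MB = 20_000  # assumes 1 GB = 1000 MB

cpu_limit = tps_limit(1, 700e6, INSTR_PER_TRANSACTION)
small_disks = disks_needed(DATABASE_MB, 1000)
big_disks = disks_needed(DATABASE_MB, 2500)
small_limit = tps_limit(small_disks, 100, IOS_PER_TRANSACTION)
big_limit = tps_limit(big_disks, 100, IOS_PER_TRANSACTION)

print(cpu_limit, small_disks, small_limit, big_disks, big_limit)
```

The system throughput is then the minimum of the per-subsystem limits. A different convention for GB and MB would change the disk counts, so the unit assumption should be stated in any written answer.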

a) Complete the table below. “Number of Units” refers to the minimum number of that item required for each organization; “Demand per Transaction” refers to the demand (in MIPS, bytes, or I/Os) that each transaction places on that component; and “TP/s Limit” refers to the maximum number of transactions per second that each subsystem (processor, bus, or disks) could support.

<table>
<thead>
<tr>
<th>Units</th>
<th>Performance</th>
<th>Number of Units</th>
<th>Demand per Transaction</th>
<th>TP/s Limit</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>700 MIPS</td>
<td>1</td>
<td>MIPS</td>
<td></td>
</tr>
<tr>
<td>Bus</td>
<td>1000 MB/s</td>
<td>1</td>
<td>bytes</td>
<td></td>
</tr>
<tr>
<td>2.5” disks</td>
<td>100 I/Os/s</td>
<td></td>
<td>I/Os</td>
<td></td>
</tr>
<tr>
<td>3.5” disks</td>
<td>100 I/Os/s</td>
<td></td>
<td>I/Os</td>
<td></td>
</tr>
</tbody>
</table>
Question 4 (cont)

b) How many transactions per second are possible with each disk organization, assuming that each uses the minimum number of disks to hold the account file?

c) What is the system cost per transaction per second of each alternative for the benchmark?
Question 4 (cont)

d) How fast must a CPU be in order to make the 1000 MB/sec I/O bus a bottleneck for the benchmark? (Assume that you can continue to add disks.)

e) As manager of MTP (Mega TP), you are deciding whether to spend your development money building a faster CPU or improving the performance of the software. The database group says that they can reduce a transaction to 1 disk read and 1 disk write and cut the database instructions per transaction to 400,000. The hardware group can build a faster CPU that sells for the same amount as the slower CPU with the same development budget. (Assume you can add as many disks as needed to get higher performance.) How much faster does the CPU have to be to match the performance gain of the software improvement?