After the exam, indicate on the line above where you fall in the emotion spectrum between “sad” & “smiley”...

<table>
<thead>
<tr>
<th>Last Name</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>First Name</td>
<td></td>
</tr>
<tr>
<td>Student ID Number</td>
<td></td>
</tr>
<tr>
<td>CS61C Login</td>
<td>cs61c</td>
</tr>
</tbody>
</table>

The name of your SECTI0N TA (please circle) David | Donggyu | Fred | Jeffrey | Martin Nolan | Sagar | Shreyas | William

Name of the person to your Left

Name of the person to your Right

All the work is my own. I had no prior knowledge of the exam contents nor will I share the contents with others in CS61C who have not taken it yet. (please sign)

Instructions (Read Me!)

- This booklet contains 14 numbered pages including the cover page. The back of each page is blank and can be used for scratch-work, but will not be looked at for grading.
- Please turn off all cell phones, smartwatches, and other mobile devices. Remove all hats & headphones. Place your backpacks, laptops and jackets under your seat.
- You have 180 minutes to complete this exam. The exam is closed book; no computers, phones, or calculators are allowed. You may use three handwritten 8.5”x11” pages (front and back) of notes in addition to the provided green sheet.
- There may be partial credit for incomplete answers; write as much of the solution as you can. We will deduct points if your solution is far more complicated than necessary. When we provide a blank, please fit your answer within the space provided. “IEC format” refers to the mebi, tebi, etc prefixes.
- You must complete ALL THE QUESTIONS, regardless of your score on the midterm. Clobbering only works from the Final to the Midterm, not vice versa. You have 3 hours... relax.

<table>
<thead>
<tr>
<th></th>
<th>M1-1</th>
<th>M1-2</th>
<th>M1-3</th>
<th>M2-1</th>
<th>M2-2</th>
<th>M2-3</th>
<th>M2-4</th>
<th>F1</th>
<th>F2</th>
<th>F3</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Points</td>
<td>9</td>
<td>9</td>
<td>9</td>
<td>10</td>
<td>4</td>
<td>8</td>
<td>12</td>
<td>9</td>
<td>10</td>
<td>10</td>
<td>90</td>
</tr>
<tr>
<td>Possible</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Points</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Earned</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
M1-1: I smell a potpourri section covering midterm one… (9 points)

a) Which of the following number representations give 0xFFFFFFFE the most positive value when converted to decimal?

A) Bias (with standard bias)  B) Unsigned  C) Two’s complement  D) Sign and Magnitude

b) Consider a plot that shows the mapping between 8-bit two’s complement binary numbers and their decimal equivalents (i.e. binary is on the x-axis and decimal is on the y-axis). Fill in the plot to the left and answer the following questions.

i) Fill in the plot to the left.
ii) Describe (in binary) where discontinuities occur in the plot, if any:
   The discontinuity occurs between 01111111 to 10000000, as shown by the graph.
iii) What are the most positive and most negative decimal values that this representation can store?

Most Positive: 127  
Most Negative: -128

c) Consider the C code below. Indicate where the values on the right live in memory (using (S)stack, (H)heap, (T)atic, or (C)ode). Assume no registers are used:

```c
#define a 10
int b = 0;

int main(int argc, char** argv) {
    int c = a;
    char d[10];
    int* e = malloc(sizeof(int));
}
```

d) Convert the following instructions from TAL to hex or vice versa. Use register names when possible.

i) lw $s0, 0($a0)
   - opcode: 000 0011
   - rd : $s0 -> 8
   - funct 3: 010
   - rs1: $a0 -> 10
   - imm: 0
   Ignore, out question

   Consider lw $s0, 0($a0)  I-Type

   ```
   0000 0000 0000 0101 0100 0100 0000 0011
   ```

   0x000052403

   Explanation on NEXT PAGE
1A) In both signed and two's complement, the number 0xFFFFFFFFE is negative. In unsigned format, 0xFFFFFFFFE = \(2^{32} - 2\). In Biased format, the bias will be \(2^{31} - 1\), so 0xFFFFFFFFE = \((2^{32} - 2) - (2^{31} - 1)\). This is less than our unsigned representation, yielding our solution.

1C)

a \to this is a macro, so the symbol "a" is replaced by 10 by the compiler. Thus, "a" is stored in the code section.

b \to this is a global variable, so it is stored on static

d \to d is an array that is locally allocated (within the "main" function), so d refers to a block of memory on the stack

e \to e is a local variable, but its value is an address on the heap. Thus, when we dereference e, the block of memory e refers to is on the heap.

f \to e itself is a local variable, so it is located on the stack
M1-2: I’ll believe it when I C it (9 points)

Your friend wants to take 61C next semester, and is learning C early to get ahead. They try to implement the ROT13 function as practice, but the code they wrote has some bugs, so you’ve been called in, as the C expert, to help debug their program, reproduced below:

```c
/* Applies the ROT13 cipher. rot13("happytimes") == "unclgkvzrf" */
void rot13(char *str) {
    while (*str) {
        if (str >= 'a' && str <= 'z') {
            *str = (*str + 13) % 26;
        }
        str++;
    }
}

int main(int argc, char *argv[]) {
    char a[] = "happy"; // 5 characters
    char b[] = "times"; // 5 characters
    char *s = "XXXXXXXXXXXX"; // 12 X's
    // Apply cipher to a and b.
    rot13(a);
    rot13(b);
    printf("%s%s\n", a, b);
    // Concatenate and place in s.
    int i = 0;
    for (int j = 0; a[j]; ) s[i++] = a[j++];
    for (int j = 0; b[j]; ) s[i++] = b[j++];
    printf("%s\n", s);
}
```

You want to impress your friend, so you predict the result of executing the program as it is written, just by looking at it. If the program is guaranteed to execute without crashing, describe what it prints, otherwise explain the bug that may cause a crash.

b) Now, fix all the errors in the program so that it executes correctly. Fill in the corrections you made in the table below. You may not need all the rows.

<table>
<thead>
<tr>
<th>Line #</th>
<th>Insert Before / Replace / Delete</th>
<th>Change (Explanation or Code)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>Replace</td>
<td>*str = 'a' &amp; *str += 13</td>
</tr>
<tr>
<td>5</td>
<td>Replace</td>
<td>*str = (*str + 13) % 26 + 'a'</td>
</tr>
<tr>
<td>14</td>
<td>Replace</td>
<td>char s[1] = &quot;XXXXXXXXXXXXXX&quot;</td>
</tr>
<tr>
<td>25</td>
<td>Insert</td>
<td>s[2] = '\0'</td>
</tr>
</tbody>
</table>
M1-3: I don’t want to MIPS a thing (9 points)

The following C code recursively sums the elements in an array of length n.

```c
int32_t sum_arr(int32_t *arr, size_t n) {
    if (n) {
        return sum_arr(arr + 1, n - 1) + arr[0];
    }
    return 0;
}
```

Translate `sum_arr` into MIPS below. Your code must follow all function calling conventions, and you may not use any pseudoinstructions. You may not need every blank.

sum_arr: __________ non_zero
    addu $v0, $0, $0
    jr $ra

non_zero: ________________

__________________________
__________________________
__________________________
addiu $s0, $a0, 0  # Store arr into $s0
__________________________
__________________________
__________________________
__________________________
sum_arr  # Make the recursive call
__________________________
__________________________  # Then add arr[0] to the result
__________________________
__________________________
jr $ra

RISC-V answers on next page
sum_arr: bne ao, xo, non-zero
        addi ao, xo, 0  \} base case
        jr ra
non_zero:  addiu sp, sp, -8
        sw so, 0(sp)  \} Backup saved register & return addr
        sw ra, 4(sp) \} we are making a function call
        addi so, ao, 0  \} back up arr
        addi ao, ao, 4  \} Prepare parameters (arr is an int array, so we ao +4)
        addi al, al, -1  \} Prepare parameters (arr is an int array, so we ao +4)
        jalr ra, sum_arr  \} recursive call
        lw t0, 0(so)  \} get arr[0]
        add ao, ao, t0  \} add output of recursive call to arr[0]
        lw so, 0(sp)  \} put in return value register
        lw ra, 4(sp)  \} Restore stack and
        addi sp, sp, 8  \} get back return address
        jr ra
M2-1: I couldn’t come up with a clever title for SDS. (10 points)

a) Give the simplest Boolean expression for the following circuit in terms of A and B, using the minimum number of AND, OR, and NOT gates:

\[(A+B)(\overline{A+B})+(A+B)\]

\[= A\overline{B} + \overline{A+B} + AB \quad \text{ (De Morgan's)}\]

\[= \overline{A}B + A\overline{B} + AB \quad \text{ (De Morgan's)}\]

\[= \overline{B} + AB = A + \overline{B} \quad \text{This is because } \overline{B} + AB = A + \overline{B} \]

\[C = A + \overline{B}\]

(You must show your work above to earn points.)

b) Using as few states as possible, complete the transition table for an FSM that takes an input with 3 values: 0, 1, or 2. The machine will output a 1 when the sum of the inputs seen so far is divisible by 3. Otherwise it should output a 0.

Assume you have seen no digits at the start state. You might not need all of the states, and you should not draw additional states. You must represent your FSM using the table to the left, the table is the only part that will be graded. The first transition has been filled in for you.

<table>
<thead>
<tr>
<th>Current State</th>
<th>Input/Output</th>
<th>Next State</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>1/0</td>
<td>B</td>
</tr>
<tr>
<td>A</td>
<td>0/1</td>
<td>A</td>
</tr>
<tr>
<td>A</td>
<td>2/0</td>
<td>C</td>
</tr>
<tr>
<td>B</td>
<td>0/0</td>
<td>B</td>
</tr>
<tr>
<td>B</td>
<td>1/0</td>
<td>C</td>
</tr>
<tr>
<td>B</td>
<td>2/1</td>
<td>A</td>
</tr>
<tr>
<td>C</td>
<td>0/0</td>
<td>C</td>
</tr>
<tr>
<td>C</td>
<td>1/1</td>
<td>A</td>
</tr>
<tr>
<td>C</td>
<td>2/0</td>
<td>B</td>
</tr>
</tbody>
</table>
M2-1: (continued)

c) Suppose we add registers to the unoptimized circuit in part A to increase the clock rate (this modification is shown below). What is the longest clock-to-Q that the registers on inputs A and B can have that will result in correct behavior when the circuit is clocked at 10 MHz?

Assumptions:
• Assume that clock-to-Q > hold time
• All registers have a setup time of 2 ns
• All logic gates have a delay of 25 ns
• Bubbles on gates do not introduce additional delay

\[ t_{\text{clk}} = 100 \text{ ns} \]
\[ t_{\text{LOGIC}} = 50 \text{ ns} \text{ b/c critical path has 2 logic gates (as shown on path above)} \]
\[ t_{\text{clk}} = t_{\text{clk-10-\Omega}} + t_{\text{LOGIC}} + t_{\text{setup}} \]
\[ 100 = t_{\text{clk-10-\Omega}} + 50 + 2 \]
\[ t_{\text{clk-10-\Omega}} = 48 \text{ ns} \implies \text{max is 48 ns} \]

Answer: 48 ns

M2-2: Float like a butterfly and sting like an IEEE (4 points)

Let’s take another look at the IEEE754 standard for single-precision floating-point numbers. \([x, y)\) represents a range where \(x\) is included and \(y\) is not.

a) How many floats are representable in the interval \([0.5, 1)\)?
   Answer: \(1 \cdot 2^{23}\)

All numbers like \(0.1\overline{XXXX}...X\) \(\rightarrow\) each ‘X’ can be 0 or 1, so we have \(2^{23}\) in total. Any normalized float of form \(1.XX...X \cdot 2^n\) for \(n \geq 2\)

b) How many floats are representable in the interval \([0, 0.5)\)?
   Answer: \(126 \cdot 2^{23}\)

De-normal numbers \(\rightarrow 2^{23}\) of these; for each bit of significant field being 0 or 1. Any normalized float of form \(1.XX...X \cdot 2^n\) for \(n \geq 2\)

As maximum exponent is 127, we have 125 such values of \(n\), and each has \(2^{23}\) value (each significant bit can be 0 or 1)

Thus, we have \((1 \cdot 2^{23}) + (125 \cdot 2^{23}) = 126 \cdot 2^{23}\)
M2-3: If this exam were a CPU, you’d be halfway through the pipeline (8 points)

We found that the instruction fetch and memory stages are the critical path of our 5-stage pipelined MIPS CPU. Therefore, we changed the IF and MEM stages to take **two** cycles while increasing the clock rate. You can assume that the register file is written at the falling edge of the clock.

Assume that no pipelining optimizations have been made, and that branch comparisons are made by the ALU. Here’s how our pipeline looks when executing two add instructions:

<table>
<thead>
<tr>
<th>Clock Cycle #</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td>add $t0, $t1, $t2</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>WB</td>
<td></td>
</tr>
<tr>
<td>add $t3, $t4, $t5</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>WB</td>
<td></td>
</tr>
</tbody>
</table>

Make sure you take a careful look at the above diagram before answering the following questions:

a. How many stalls would a data hazard between back-to-back instructions require?

b. How many stalls would be needed after a branch instruction?

c. Suppose the old clock period was 150 ns and the new clock period is now 100ns. Would our processor have a significant speedup executing a large chunk of code...
   i. Without any pipelining hazards? Explain your answer in 1-2 sentences.
   ii. With 50% of the code containing back-to-back data hazards? Explain your answer in 1-2 sentences.
M2-4: Some say there’s nothing better than cold, hard cache (12 points)
a) What shape do the following trade-off curves have? Select a shape and enter its number into the box for each of the graphs. Unless they are the parameters being varied, assume that associativity, capacity and block size are constant. You should assume that the axes are linear.

b) Consider a system with inclusive L1 and L2 caches with 4B cache block size. Assume we have 1 MiB of on-chip memory available and want to determine how much of this memory we should give to the L1 cache and how much to the L2 cache. We will try to minimize the AMAT to do so.

Assume both caches are fully associative with LRU replacement. Their combined capacity is 1MiB (excluding tags and meta-data). You can consider all miss rates approximate.

Say you are running the following program starting from cold L1 and L2 caches:

```c
#define ARRAY_SIZE 256*1024
int a[ARRAY_SIZE];
int sum = 0; // assume sum, i, and j are stored in registers

for (int i = 0; i < 100000; i++) {
    for (int j = 0; j < ARRAY_SIZE; j++) sum += a[j];
    for (int j = ARRAY_SIZE-1; j >= 0; j--) sum += a[j];
}
```

1) How would we compute AMAT if we had the local L1 miss rate ("L1Miss"), the local L2 miss rate ("L2Miss") and the memory access time ("Memory")? Use “H1” and “H2” to represent the L1 and L2 hit times respectively. (We will compute these quantities later in the question)

\[ \text{AMAT} = H1 + L1Miss (H2 + L2Miss (Memory)) \]

Just use the original AMAT formula with the variables here...
Tag Bits - Associativity: The number of tag bits increases by 1 every time the associativity doubles, as doubling the associativity halves the # of sets, thus decreasing the # of index bits by 1. This decrease leads to 1 more tag bit. This relationship is logarithmic.

Capacity - Set Size: Assuming the # of sets (rows) remains constant, doubling the cache capacity will make each set twice as large because we cannot add sets to our cache. This is a linear relationship.

Number of Sets - Block Size: If we double the number of sets, but keep the overall capacity and associativity the same, we must have half as much data per set. This equates to halving the block size, creating a new linear relationship.

Offset Bits - Block Size: We have $2^{\text{offset bits}} = \text{Block Size}$ by definition, so offset bits = $\log_2(\text{Block Size})$
M2-4: (continued)

2) For the program above, express the local miss rate for the L1 cache in general terms as a function of the L1 cache size (write L1 for the size of L1 in bytes). Hint: The miss rate is 0 for a 1 MiB cache, 0.5 for a 0.5 MiB cache and 1 for a 0 MiB (i.e., no) L1 cache.

\[
L1\text{Miss} = 1 - \frac{L1}{1\text{MiB}}
\]

If we have 1 MiB L1 cache, the hit rate is 1, hence we always hit. As we decrease this size, our miss rate increases proportionally to the \%age of the 1 MiB memory we have in L1.

3) What is the global miss rate for the L2 cache as a function of the L1 cache size? Hint: Start by expressing the global miss rate as a function of the L2 cache size.

\[
\text{Global Miss} = 1 - \frac{L2}{1\text{MiB}} = \frac{1\text{MiB} - L2}{1\text{MiB}} = \frac{L1}{1\text{MiB}}
\]

Global miss rate is \%age of all accesses that miss at L2, same as \% of non-hit misses in L1.

4) What is the local miss rate for the L2 cache as a function of the L1 and L2 sizes? Hint: Use your results from questions 2 and 3.

\[
\text{Local Miss}_2 = \frac{\text{Global Miss}}{L1\text{Miss}} = \frac{\frac{L1}{1\text{MiB}}}{1 - \frac{L1}{1\text{MiB}}} = \frac{L1}{1\text{MiB} - L1} = \frac{L1}{L2}
\]

5) Assume the hit time of the L1 cache is 10 cycles, the hit time of the L2 cache is 20 cycles and the memory access time is 100 cycles. Using the formula from question 1, what is the AMAT for this system as a function of only the L1 size?

\[
\text{AMAT} = 10 + \left( \frac{L2}{1\text{MiB}} \right) \left( 20 + \frac{L1}{L2} \cdot 100 \right) \quad (\text{from parts 2 & 4})
\]

\[
= 10 + \frac{20 \cdot L2}{1\text{MiB}} + \frac{L1}{1\text{MiB}} \cdot 100 = 10 + \frac{20\text{MiB} - L1}{1\text{MiB}} + \frac{100L1}{1\text{MiB}}
\]

\[
= 10 + 20 + \frac{100L1 - 20L1}{1\text{MiB}} = 30 + \frac{80L1}{1\text{MiB}}
\]

6) What sizes of L1 and L2 caches should we pick to minimize the AMAT? (assume the caches have non-zero size, i.e., both of them exist)

We want \(30 + \frac{80L1}{1\text{MiB}}\) to be small, so \(L1\) must be as small as possible. Thus, the answer is \(L1: 4\text{B}, L2: 1\text{MiB}-4\text{B}\), as neither is zero and both have blocks of 4\text{B} (min size is 16\text{B}).
**F1: Paging all CS61C students (9 points)**

Consider a byte-addressed machine with a 13-bit physical address space that can hold two pages in memory. Every process is given 16 MiB of virtual memory and pages are evicted with an LRU replacement scheme.

a) What are the sizes of the following fields in bits?

1. Virtual Page Number: 12
2. Physical Page Number: 12
3. Virtual Address Offset: 12
4. Physical Address Offset: 12
5. PPN bits = \( \log_2(2) = 1 \), so we have 2 pages

b) Consider the following code snippet:

```c
// a and b are both valid pointers to // different arrays of length ARRAY_SIZE
void enumerate(int* a, int* b) {
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = i;
        b[i] = ARRAY_SIZE - i;
    }
}
```

The compiled binary for the program containing this code snippet weighs in at 4096B. If this code was executed on the machine, what is the maximum value of `ARRAY_SIZE` that would allow this code to execute with 0 page faults in the best-case scenario? (Answer in IEC prefix: 8Gi, 32Ti, etc)

Each page is \( 2^{12} = 4096 \)B. Thus, the code fits in one page, so to have no faults, the data must be in the other. Each array is the same size, so \( 2 \times ARRAY_SIZE \leq PAGE_SIZE \Rightarrow \) max

\[ ARRAY_SIZE = \frac{4096}{2} = 2048 \text{ B} = 512 \text{ int} \]

c) How could we modify the above code snippet to allow a larger `ARRAY_SIZE` and execute with the fewest page faults in the best-case scenario? Write the new code below:

```c
for (int i = 0; i < ARRAY_SIZE; i++)
    a[i] = i;
for (int i = 0; i < ARRAY_SIZE; i++)
    b[i] = ARRAY_SIZE - i;
```

We can make `ARRAY_SIZE = \text{page size}` because we no longer worry about `a` & `b` knocking each other, or the code, out of memory.
F2: Why can’t you use parallelism at a gas station? It might cause a spark. (10 points)

1. Optimize factorial() using SIMD intrinsic(AVX).

```c
double factorial(int k) {
    int i;
    double f = 1.0;
    for (i = 1 ; i <= k ; i++) {
        f *= (double) i;
    }
    return f;
}
```

You might find the following intrinsics useful:

<table>
<thead>
<tr>
<th>Intrinsic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>__m256d __mm256_loadu_pd(double *s)</td>
<td>returns vector(s[0], s[1], s[2], s[3])</td>
</tr>
<tr>
<td>void __mm256_store_pd(double *s, __m256d v)</td>
<td>stores p[i] = v, where i = 0, 1, 2, 3</td>
</tr>
<tr>
<td>__m256d __mm256_mul_pd(__m256d a, __m256d b)</td>
<td>returns vector(a0b0, a1b1, a2b2, a3b3)</td>
</tr>
</tbody>
</table>

```c
double factorial(int k) {
    int i, j;
    double f_init[] = {1.0, 1.0, 1.0, 1.0};
    double f_res[4];
    double f = 1.0;
    // initialize f_vec
    __m256d f_vec = __mm256_loadu_pd(f_init);

    // vectorize factorial
    for (i = 1 ; i <= k ; i++) {
        double l[] = {i, i + 1, i + 2, i + 3};
        __m256d data = __mm256_loadu_pd(l);
        __m256d f_vec = __mm256_mul_pd(f_vec, data);
    }

    // reduce vector
    __mm256_store_pd(f_res, f_vec);

    for (j = 0 ; j < 4 ; j++) {
        f = f + f_res[j];
    }

    // handle tails
    for (; i <= k ; i++) {
        f *= (double) i;
    }

    return f;
}
```
F2: (continued)

2. Cache Coherence:
We are given the task of counting the number of even and odd numbers in an array, A, which only holds integers greater than 0. Using a single thread is too slow, so we have decided to parallelize it with the following code:

```c
#include <stdio.h>
#include "omp.h"

void count_eo (int *A, int size, int threads) {
    int result[2] = {0, 0};
    int i, j;

    omp_set_num_threads(threads);

    #pragma omp parallel for
    for (j=0; j<size; j++)
        result[(A[j] % 2 == 0) ? 0 : 1] += 1;

    printf("Even: %d\n", result[0]);
    printf("Odd: %d\n", result[1]);
}
```

As we increase the number of threads running this code:

a) Will it print the correct values for Even and Odd? If not, explain the error.

b) Can there be false sharing if the cache block size is 8 bytes?

   Yes, the result array can be stored in a single block, so writing to any element will remove the block from all other caches.

c) What about 4 bytes?

   No, false sharing is only applicable when the data being accessed by the threads are distinct. If the block is 4 bytes, then a single element of result is in a cache block, so we would invalidate blocks in other caches only if they had the SAME element of the result array. Thus, false sharing is technically not possible.
F3: This isn’t a bathroom. Why is there potpourri? (10 points)

1. T/F Questions (Circle one. If the circling is unclear, you will receive no credit.)

1) CPUs need separate instructions to access I/O devices. (True / False)
2) Segmentation (base + bound) has fragmentation problems. (True / False)
3) Exceptions in early pipeline stages override exceptions in later stages for a given instruction. (True / False)
4) Exceptions are handled in the pipeline stage where they occur. (True / False)

2. Polling, Interrupts, and DMA

1) Choose polling or interrupt for the following devices.

<table>
<thead>
<tr>
<th>Device</th>
<th>Data Rate</th>
<th>Transfer Block Size</th>
<th>Polling/Interrupt?</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>80B/s</td>
<td>4B</td>
<td>Polling</td>
</tr>
<tr>
<td>B</td>
<td>400MB/s</td>
<td>4B</td>
<td>Interrupt</td>
</tr>
<tr>
<td>C</td>
<td>400MB/s</td>
<td>2KB</td>
<td>Interrupt</td>
</tr>
</tbody>
</table>

2) To support interrupts, the CPU should be able to save and restore the current state. Which of the following should be saved before handling interrupts to ensure correct execution?

a. Program Counter
b. User Registers
c. TLB
d. Caches

3) To which device in 1) is direct memory access (DMA) most beneficial? Explain briefly.

3. Warehouse Scale Computing and Amdahl’s Law.

1) We are going to train convolutional neural networks on Amazon EC2. It turns out that 90% of training can be parallelized but the rest takes twice as long due to the communication overhead among the instances. What is the maximum speedup we can achieve?

\[
\text{Speedup} = \frac{1}{0.9 + \frac{0.9}{0.2}} = \frac{1}{0.2 + 0.45} \leq 5 \quad \text{(as } \frac{0.9}{0.2} \rightarrow 0.2)\]

2) Which of the following can increase the maximum speedup in 1)?

a. Use more instances
b. Deploy the application across multiple arrays.
c. Reduce the number of reduce operations.
d. Increase network bandwidth.
e. Increase the capacity of disks.

We need to improve the NON parallelizable part, which is the communication done. This communication is boosted by a better network OR if we do it less frequently. These options are d & c respectively.
F3: (continued)

4. Hamming Error-Correction Code (ECC)
1) Suppose we have a one-byte data value, 01101101\textsubscript{two}. Fill in the encoded data in the following table.

<table>
<thead>
<tr>
<th>Bit</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Encoded Data</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>P1</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>P2</td>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>P4</td>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>X</td>
</tr>
</tbody>
</table>

2) Assume that we have an encoded value, 1001110\textsubscript{two} with a single-bit error. Indicate below each parity bit if it has an error:

<table>
<thead>
<tr>
<th>Parity Bit</th>
<th>P1</th>
<th>P2</th>
<th>P4</th>
</tr>
</thead>
<tbody>
<tr>
<td>OK/ERROR</td>
<td>OK</td>
<td>ERROR</td>
<td>ERROR</td>
</tr>
</tbody>
</table>

As both P2 & P4 were errors, there is an error in bit \(2+4=6\) (shared with 0).

Incorrect bit position: 6

Correct data: 0100

Fixed: 10 01100 → 0100 is data

5. Dependability and RAID

1) Which of the following can increase the availability?
   a. Increasing MTTF
   b. Decreasing MTTF
   c. Increasing MTTR
   d. Decreasing MTTR
   e. Redundant data copies

   We want to reduce failures or decrease repair time. a, b, e follow the first reason, while d follows the second.

2) Explain very briefly why RAID1 is the most expensive form of RAID.

   It requires a full copy of disks, so the overhead is 100%.

3) How many check disks are needed for RAID3?

   One. RAID3 has a dedicated parity disk and uses byte-stripping.

4) Explain why RAID5 has a higher write throughput than RAID4.

   The check information is distributed across disks, so not a single disk is queried for all changes. Thus, the load is balanced and the parity disk bottleneck is mitigated.