CS 61C Fall 1994 Final exam solutions

1. Representation formats

A. Binary to hex.

Chop the binary number into groups of four bits, starting at the right.  That gives

     01 1100 1010 0101 1100 1010

Then convert each group of four bits to a hex digit, giving

    0x1    c    a    5    c    a

so the correct answer is (C).

B. Signed integers.

We can see from the sign bits that all four numbers are negative, so let's negate each of them.

(a) Twos complement 11110111: First invert all the bits (00001000), then add 1 to the result (00001001).  So this is -9.

(b) Ones complement 11110110: Just invert all the bits (00001001), also -9.

(c) Sign-magnitude 10001000: Just turn off the sign bit (00001000), so this one is -8, the different one.

(d) Bias-127 01110110: Subtract it from 127 (01111111), giving 00001001, so this one is -9 too.

So again the correct answer is (C).

C. States to bits.

There are 47 graphic keys times 3 possible modifier states (neither, SHIFT, or CONTROL).  47 * 3 = 141 possible characters.  To represent that we need the ceiling of lg 141, which is 8 bits.  (7 bits would give us only 128 possibilities, not enough; 8 bits gives 256 possibilities, more than enough.)  Answer: 141, 8.

P.S.  This isn't quite the true story about the ASCII keyboard, because not every graphic key can be used with CONTROL, and also the keyboard includes the spacebar and the delete key, each of which generates a special code but isn't shiftable.  The other formatting keys (return, tab, etc.) duplicate codes that are available with CONTROL-something.  The real story:

    47  unshifted graphics
    47  SHIFT-graphics
    32  CONTROL-graphics
     1  space
     1  delete

which adds up to exactly 128, which fits into 7 bits.

P.P.S.  You're thinking "but I can type control-space!"  Actually, when you do that, your keyboard is generating the code control-atsign (\0).

P.P.P.S.  Of course if you have a META key then you can apply that to any of the 128 ordinary ASCII codes, filling up the 8-bit byte.

D. MAL to hex.

ADDIU is an I-format instruction:

    +------+-----+-----+----------------+
    +opcode+ rs  + rt  +   immediate    +
    +------+-----+-----+----------------+

We look up addiu and find that its opcode is 001001 binary.  In MAL, when only one register is given in an I-format instruction, it's both the source register and the target register; $29 is 11101 binary.  The immediate value is -12, represented in twos complement:

     12 = 0000...01100   (8+4)
    !12 = 1111...10011   (invert the bits)
    -12 = 1111...10100   (add 1)

So putting these together we have

    +------+-----+-----+----------------+
    +001001+11101+11101+1111111111110100+
    +------+-----+-----+----------------+

To convert to hex, split this into groups of four bits:

    0010 0111 1011 1101 1111 1111 1111 0100
    0x 2    7    b    d    f    f    f    4

Answer: 0x27bdfff4
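To make the field packing concrete, here is a small C sketch (an illustration, not part of the original solutions; the variable names are arbitrary) that assembles the fields computed above and prints the machine word:

    #include <stdio.h>

    int main(void) {
        unsigned opcode = 0x09;                  /* addiu = 001001 binary */
        unsigned rs = 29, rt = 29;               /* $29 is both source and target */
        unsigned imm = (unsigned)(-12) & 0xffff; /* twos-complement immediate, low 16 bits */
        unsigned word = (opcode << 26) | (rs << 21) | (rt << 16) | imm;

        printf("0x%08x\n", word);                /* prints 0x27bdfff4 */
        return 0;
    }

Shifting the opcode left 26 places, rs left 21, and rt left 16 mirrors the field boundaries in the diagram above.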
Scoring: One point each, all or nothing.

2. Design criteria.

Many people lost points on this question for saying things like "the RISC philosophy is to make things as simple as possible and this isn't simple."  Well, yes, but you could say that about almost anything!  The point of the question was to see if you could focus on the *specific* bottlenecks in particular proposed instructions.

A. Block Transfer.

The main answer we were expecting here was "it would delay the pipeline to have multiple memory references in a single instruction cycle," and indeed many people said that.

Some people, copying from an answer in the sample exam, said "it would require either multiple bus cycles, which would delay the pipeline, or widening the bus."  Come on!  The question from which you copied that answer was about a proposed 64-bit memory transfer instruction.  This one is a proposed any-number-of-bits transfer!  There is no way we could widen the bus to accommodate, for example, a megabyte memory-to-memory copy.  People who did this got some credit for mentioning the pipeline, but didn't leave us with a strong feeling of confidence that you understand what a bus is, or a pipeline, or a memory, or a transfer, or a computer.

Several people suggested that this instruction would be too wide to fit in a word, because it would require three pointer-length operands.  But this objection is easily overcome: the three operands could be registers containing the pointers and the length.  This answer got half credit.

Some people mentioned that an instruction that might take a very long time would leave the machine uninterruptable for that long time, which could lead to I/O data loss.  This is a very good answer, showing some real thinking -- connecting the instruction semantics with the not-obviously-related question of interrupt handling.  Of course this issue came up in the PDP-10 also, and here's how it was solved: if an interrupt is requested during a BLT instruction, the processor updates the pointers in the register operands to reflect how far the copying has gone, then remembers the address of the BLT itself, rather than the address of the next instruction, as the exception PC.  After the interrupt is handled, the BLT continues.  (That solution would run into difficulties in the MIPS architecture if the BLT happened to occupy the delay slot of a branch.)

A few people suggested that block transfer is an unnecessary idea because arrays are assigned by reference (copying a pointer) rather than by value.  That isn't a very good answer.  For one thing, it concerns the design of the C language, not of the MIPS architecture.  (Pascal passes arrays by value.)  Second, even in C, structs are passed by value, so block copies are still useful.

B. Jump to Subroutine.

Here are the two main answers we were expecting:

1. A procedure called by JSR can't be called recursively.

2. Allowing write access to code pages would prevent the use of shared code.

We gave half credit for the response "this would block the pipeline because two memory references would be required."  In fact only one memory reference would be required; a branch or jump instruction does *not* fetch the next instruction, but merely puts a new address into the processor's PC register.  So a JSR would combine a jump (no memory access required) with a store.  On the other hand, the proposed jump-indirect instruction for returning from a procedure *would* block the pipe, not because of doing two loads (it only does one) but because the load would have to happen too early in the processing of the instruction.  (An ordinary load actually doesn't complete until after the next instruction!)  We gave full credit for saying that jump-indirect would block the pipe, even if you didn't fully understand why.

One student mentioned that storing the return address near the procedure's code would require setting the dirty bit for that page when calling the procedure, which would increase the paging load on the system.  Good answer.

Several students suggested that using memory to save the return address instead of a register would require additional memory allocation.  Pfui.
First of all, the amount of memory required to hold return addresses is a tiny fraction of a program's memory requirement; second, in most cases the return address ends up in memory anyway, because $31 is saved on the stack.

In the apple-pie-motherhood category was the answer "a RISC instruction shouldn't have two purposes."  Consider that the MIPS architecture *does* include the instructions BLTZAL and BGEZAL, which combine test-and-branch with procedure calling.

A couple of people said "this would violate stack discipline."  Like the answer about passing arrays by reference earlier, this confuses a language design issue with an architecture issue.  The MIPS JAL instruction doesn't have anything to do with using a stack or not using a stack, and neither does the PDP-10 JSR, really; the difference is between storing the old PC in a register and storing the old PC in memory.  And in any case, "stack discipline" doesn't mean having a stack.  Scheme, which violates stack discipline, is still generally implemented using a stack for most purposes.  Logo is an example of a language that *does* obey stack discipline, but is generally implemented without the use of a hardware stack.  (The environment is instead maintained using linked lists, as in the 61A metacircular evaluator.)

Several people gave arguments that, if true, would also prove that JAL doesn't fit the MIPS architecture!  For example, several people thought it problematic that (in the actual words of the problem) "The instruction stored the PC at that address..." because what has to be stored is the address of the next instruction, not the address of the JSR instruction.  But of course exactly the same argument applies to the PC that JAL stores in $31!  The stored PC is the address following the procedure call, and the same is true for JSR.

A few people argued that the JSR mechanism wouldn't work for exception handling, for various reasons.  This is irrelevant, because the MIPS doesn't use JAL for exception handling either!  Since there is already a separate mechanism for exceptions, changing the ordinary procedure calling mechanism is independent of it.

Actually this objection was particularly ironic because, in the PDP-10, JSR was used precisely for exception handling.  The idea is that an interrupt might occur when no stack is set up, and no registers may be available for use by the exception handler.  The MIPS solves this problem by building a special piece of hardware, the Exception PC register, for the sole purpose of keeping the old PC value.  The PDP-10 solves the problem by saving the old PC in memory, making the assumption that the exception handler won't be reentrant.  (It can be made reentrant by copying the old PC from the place where JSR put it to some newly allocated space.)  On the PDP-10, instead of jumping to a fixed address, an exception executed the instruction at a fixed address.  Typically that instruction would be a JSR, which would save the PC without relying on any assumptions about registers.

(The normal bread-and-butter procedure calling instruction on the PDP-10, by the way, was PUSHJ, Push and Jump.  It took two operands, a register and a jump address.  The register must have been previously set up to point to a stack, like $29.  "Push" means to move the pointer by one word and store something there.  So the stack was allocated a word at a time, rather than a frame at a time.)

Scoring: One point for A, two points for B.  One point for certain half-okay answers as mentioned above.

3. C to MIPS.
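The function being translated applies fun to each element of b and stores the results in a; in C it looks essentially like this (a reconstruction inferred from the solution below, not the exam's exact text, with the parameter types assumed to be int):

    int fun(int);                    /* provided elsewhere by the exam */

    void mapArray(int a[], int b[], int dim) {
        int i;
        for (i = 0; i < dim; i++)
            a[i] = fun(b[i]);        /* the loop the assembly below implements */
    }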
mapArray: addiu $sp, -20      # prologue
          sw $31, 0($sp)
          sw $16, 4($sp)
          sw $17, 8($sp)
          sw $18, 12($sp)
          sw $19, 16($sp)
          move $17, $4        # move args to saved registers
          move $18, $5
          move $19, $6
          move $16, $0        # i = 0
loop:     bge $16, $19, fini  # done if i >= dim
          sll $8, $16, 2      # i * 4
          add $8, $18         # &b[i]
          lw $4, 0($8)        # b[i]
          jal fun             # fun(b[i])
          sll $8, $16, 2      # i * 4
          add $8, $17         # &a[i]
          sw $2, 0($8)        # a[i] = fun(b[i])
          addiu $16, 1        # i++
          j loop
fini:     lw $31, 0($sp)      # epilogue
          lw $16, 4($sp)
          lw $17, 8($sp)
          lw $18, 12($sp)
          lw $19, 16($sp)
          addiu $sp, 20
          jr $31

Of course this isn't the only possible solution.  One common variant is to store the arguments (a, b, dim) on the stack instead of in saved registers.  Another is to use an additional saved register to maintain the value of i*4 instead of multiplying each time.  And of course many people used MUL instead of SLL to do the multiplying, although that's a lot slower.

Scoring:  6    correct
          4-5  has the idea
          2-3  has an idea
          0-1  other

Minimum for "has an idea": No inventing nonexistent addressing modes such as LW $4, $8($18).  Also, there must be an overall loop structure.  (It's hard to list all the ways you could not have an idea, but those are the main ones.)

Minimum for "has the idea":
* attempt to follow register-saving conventions (maybe you miss one).
* fun is given an argument in $4 and returns a value in $2.
* the arrays are in memory (there's a LW before, and a SW after, the call to fun).
* indexing is done more or less right (you don't use a[0] every time).

Typical 4-point solution: elements of A and B are one byte instead of four.
Typical 3-point solution: no LW instruction.

4. Count bits.

numBitsSet: move $2, $0      # this will be the count
loop:       beqz $4, fini    # no bits left to count
            andi $9, $4, 1   # test low bit
            beqz $9, next    # branch if zero
            addi $2, 1       # one, so count it
next:       srl $4, 1        # shift off low bit
            j loop
fini:       jr $31

Variants are to test the high bit (by testing whether $4 is negative!) and shift bits off to the left; to shift the tested bit instead of shifting the argument; to notice that the result of the ANDI above is either 0 or 1 and just add it to the count instead of branching.  By moving the end test to the bottom of the loop you can save an instruction or two:

numBitsSet: move $2, $0
loop:       andi $9, $4, 1
            add $2, $9
            srl $4, 1
            bnez $4, loop
            jr $31

Can't get much more elegant than that.  Many students made this much more complicated than necessary by saving things on the stack.  (Note that question 3's solution would have been much shorter if it hadn't invoked another procedure FUN.)

For your amusement, the following much faster solution, because it doesn't involve a loop, was pointed out to me by one of the students:

    int numBitsSet(unsigned n) {
        n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
        n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
        n = (n & 0x0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f);
        n = (n & 0x00ff00ff) + ((n >> 8) & 0x00ff00ff);
        return (n & 0x0000ffff) + ((n >> 16) & 0x0000ffff);
    }

Why this works is left as an exercise.  :-)

A common subtle mistake was to use division instead of shifting.  The problem with this is that for negative numbers you end up extending the sign bit, so you count more ones than were in the original number.
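In C, the straightforward loop looks like this (a sketch for comparison, not part of the original solutions; the name countBits is arbitrary).  Declaring the argument unsigned makes the right shift logical, which avoids exactly that sign-extension problem:

    int countBits(unsigned n) {
        int count = 0;
        while (n != 0) {
            count += n & 1;   /* add the low bit, as in the shorter MIPS version */
            n >>= 1;          /* logical shift, because n is unsigned */
        }
        return count;
    }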
Scoring:  5    correct
          3-4  has the idea
          1-2  has an idea
          0    other

Minimum for "has the idea" is masking off all but one bit, and looping for all the bits.

5. Memory organization

A. Find 0x7f55a356 in the cache.

Since each cache block is 16 bytes wide, the lowest 4 bits (4 = lg 16) of the address are the offset within a block.  That's the last hex digit, the 6.  So we're looking for the cache block that would contain block number 0x7f55a35.

The cache contains 4K bytes, and one block is 16 bytes, so there are 4K/16 = (2^12)/(2^4) = 2^8 blocks in the cache.  Since the cache is direct mapped, every block is a set of one, and the address uniquely determines which block (i.e., which set) to use.  We find out by looking at the lowest 8 bits (because there are 2^8 blocks) of the block number, which is 0x35.  What's left, 0x7f55a, is the tag.  So (B) is the correct answer.

B. How many sets?

A block is 32 bytes; there are five blocks to a set.  So the size of a set is 32*5 = 160 bytes.  The entire cache size is 80K = 80*(2^10) bytes, so the number of sets is 80K/160 = 2^9 = 512.

C. TLB mapping.

A page is 4KB = 2^12 bytes.  That means that the bottom 12 bits of the address (3 hex digits) are the offset, and the rest is the virtual page number.  For this virtual address, the offset is 0x678 and the virtual page number is 0x12345.  We look up the latter in the TLB and find it in slot number 2, corresponding to physical page number 0x200.  Combining this with the offset, the physical address is 0x200678.

D. What can't be paged?

The answer is (B); if the disk I/O handler were paged out, then it could never be brought back into memory, and neither could anything else.

It's okay to page out the page tables; they are in the operating system's virtual address space, and the same mechanism that pages other stuff in and out can deal with missing page table entries.  (Of course there has to be access to SOME memory, so maybe part of the system's own page table will be in the unmapped portion of memory.)

It's okay to page out the system call handler.  System calls are instructions executed by a user process; if the handler needed for a certain system call isn't available, it could be paged in later, just as anything else that a user process needs can be.

And of course if user programs couldn't be paged out, that would defeat the whole purpose of paging!

Scoring: One point each.

6. Cache probing.

What is the cache size?  All of the times for an array size of 256 bytes are small (about 100 ns), whereas for an array of 512 bytes most of the times suddenly are around 900 ns.  So probably the cache size is 256 bytes.

What is the block size?  For array sizes at or above 512, a stride of 4 gives access times around 450 ns, whereas larger strides take about 900 ns.  So a stride of 8 moves to the next block, whereas a stride of 4 is within the same block half the time.  Therefore the block size is 8 bytes.

What is the allocation policy?  For an array size of 512, strides of 128 and 256 lead to lower access times (around 300 ns).  For an array size of 1024, a stride of 256 leads to a similar low time.  In other words, if the stride is 1/2 or 1/4 of the array size, then not every reference leads to a cache miss, whereas if the stride is 1/8 or less of the array size, every reference misses.  This suggests a 4-way set associative cache.
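Timings like these come from a probe loop along the following lines (a sketch under assumptions, not the exam's actual program; it measures whatever machine it runs on, so a modern machine's much larger caches would give very different numbers).  For each array size and stride, it sweeps the array touching every stride-th byte and reports the average time per access:

    #include <stdio.h>
    #include <time.h>

    static char data[1024];

    /* Average time, in ns, to read every stride-th byte of the first
       "size" bytes of the array, repeated enough times to measure. */
    double probe(int size, int stride) {
        volatile char sink;     /* keeps the reads from being optimized away */
        long accesses = 0;
        int sweep, i;
        clock_t start = clock();
        for (sweep = 0; sweep < 200000; sweep++)
            for (i = 0; i < size; i += stride) {
                sink = data[i];
                accesses++;
            }
        return (double)(clock() - start) / CLOCKS_PER_SEC * 1e9 / accesses;
    }

    int main(void) {
        int size, stride;
        for (size = 256; size <= 1024; size *= 2)
            for (stride = 4; stride <= size / 2; stride *= 2)
                printf("size %4d  stride %4d  %6.1f ns\n",
                       size, stride, probe(size, stride));
        return 0;
    }

The sudden jump in access time once the array outgrows the cache, and the cheaper accesses at small or very large strides, are the patterns the question asks you to interpret.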
Scoring: Two points each.  All or nothing, except that saying "set associative" (NOT fully associative!) with the wrong value of N got one point.

7. Clock interrupts.

intrp:  lui $26, 0xffff
        lw $27, 0x10($26)   # get control register
        andi $27, 1         # just ready bit
        beqz $27, nope      # not ready, do nothing
        .set noat
        move $26, $1        # save $1 in case needed for LW
        .set at
        lw $27, SleepCount
        addiu $27, -1       # ready, decrease count
        sw $27, SleepCount
        .set noat
        move $1, $26
        .set at
        bgtz $27, nope      # counted down to zero?
        lui $26, 0xffff
        sw $0, 0x10($26)    # yes, turn off timer
nope:   mfc0 $26, $14
        rfe
        jr $26

We didn't take off for this, but it's actually important to use ADDIU and not ADDI, because if the count happened to be 0x80000000 then subtracting 1 would overflow, which would cause another exception, which would get us in trouble since this handler is not reentrant.

Most people saved $1 on the stack.  Just to show it can be done, the solution above doesn't assume $sp is set up: since we don't need to save anything but $1, we can use $26 for that.  (The LW and SW to SleepCount are the only instructions that might require the use of $1.)  We didn't take off if you assumed that SleepCount was reachable without using $1, though.  (But we did take off for any other use of $1 without saving it.)

Scoring:  5    OK
          3-4  the idea
          1-2  an idea
          0    other

"Having the idea" requires all of the following: saving registers correctly, using RFE, checking the ready bit, updating the count, and not making any assumptions about the user's registers.  (In particular, it's wrong to assume that the timer will interrupt only from inside the sleep procedure given in the question!!!)

Typical 3-point solution: forgetting to turn off the interrupt enable bit when the count reaches zero.

Typical 0-point solution: looping in the exception handler.  The whole idea of interrupts is that they happen quickly, without waiting for a device to become ready!

8. Exception handling.

A. When an exception occurs, the MIPS processor saves the old PC (d), switches to kernel mode and disables interrupts (c), and jumps to location 0x80000080 (b).  It does *not* read the Cause register (a).  Perhaps the exception handler software will read the Cause register, but the processor doesn't do that on its own.  (The processor will *set* the Cause register to a value indicating the reason for the exception.)

B. The main advantage of interrupts is (a), allowing the processor to do other stuff while waiting for I/O.  The others may be advantages of having an operating system, but not specifically of interrupts.

C. What are the differences between kernel mode and user mode?  There are two differences:

1. Kernel mode allows access to virtual addresses >= 0x80000000.

2. Kernel mode allows the use of the coprocessor instructions.  (Alternatively: allows access to coprocessor registers.)

Scoring: One point for A, one for B, and one for each of the two differences in C.

Note: "I/O registers" counts as the same answer as #1, since the I/O device registers are in kernel virtual memory, so if you said "access to high virtual addresses and access to I/O registers" you only got one point.

A somewhat common wrong answer was "kernel mode can use $26 and $27."  No!  It is a software convention that those registers are reserved for the exception handler, but that has nothing to do with kernel MODE and user MODE in the hardware.

Lots of people explained why having two modes was a good idea, a question asked in an earlier exam.  We didn't take off for that, as long as you also answered *this* question.