Simple CPU v4

Home

I had to stop work on the TTL version of the SimpleCPU, as where I work there have been changes in Health and Safety rules. To cut a very long story short i'm no longer allowed to solder, so building things at work is a bit of an issue :(. Therefore, my thoughts moved to another implementations of the SimpleCPU that ive been avoiding, a pipelined version. Again, the emphasis here is not to design the world's best processor, rather to create a design that is focused on teaching i.e. a pipelined cpu that demonstrates the advantages of pipelining, but also exposes its issues in terms of its hardware implementation. Therefore, this CPU will not be the fastest, but it will demonstrate key design considerations and how these affects the CPUs architecture, its instruction-set and the software run upon it. As always this will be a working design, implemented as a schematic that can be downloaded onto an FPGA, not just a simulation :).

Figure 1 : a Basic MIPS architecture

The starting point of this design is going to be based around a basic MIPS (Link) / DLX (Link), as shown in figure 1. This is a VERY common design used in a lot of existing teaching material. In fact this architecture was the first pipelined processor i studied all those years ago from the classic reference book "Computer Architecture: A Quantitative Approach 2nd Edition", by Hennessy and Patterson. However, although this an excellent book it does abstract away from the low level hardware, hiding implementation details in a block diagrams etc. The example in figure 1 is a good point in case, consider the PC data path, if the next PC was truly passed through the pipeline registers as shown, this CPU isnt actually pipelined i.e. it does not fetch a new instruction on every clock cycle :(. These block diagrams are excellent at getting key points about pipelining across to students, at times you do need to abstract away from all that hardware, but block diagrams are also misleading, or just lack the details needed to truly hammer home these ideas. Therefore, i think there is a hole in the literature to be filled, and what better way to do this than a pipelined version of the SimpleCPU based around this well established architecture. No ISA copyright violations here :).

Design / Problems to solve
Instruction Set Architecture (ISA)
VHDL Prototype
SCH Prototype

Design

The MIPS architecture is from the RISC (Link) class of processor. The design emphasis for this family of processors is to maximise instruction throughput, such that the processor can complete an instruction on every clock cycle (once the pipeline is full). Therefore, complex, multi-cycle instructions used in CISC (Link) machines have to be removed from the MIPS instruction-set. Hence this architecture's name: Microprocessor without Interlocked Pipelined Stages i.e. on each clock cycle the processor moves an instruction onto the next processing (pipeline) stage. This allows processing stages to be overlapped i.e. the fetch, decode and execute phases can operate in parallel, as shown in figure 2.

Figure 2 : pipelined instructions

In this version of the MIPS architecture the instruction processing is broken down into five separate stages:

Instruction Fetch (IF) : get instruction from memory, update PC.
Instruction Decode (ID) : decode instruction, get operands.
Execute (EX) : process operands, or generate address (data or instruction), test branch condition.
Memory Access (MA) : access memory if LOAD or STORE instruction.
Write Back (WB) : write result to register file.

By allowing each processing stage to be overlapped a new instruction is fetched from memory on each clock cycle. Therefore, after five clock cycles the processor will be processing five different instructions i.e. all stages are busy, the processing pipeline is full. Once full one instruction will then complete on each following clock cycle, giving a speed up of 5, when compared to a machine that waits until an instruction completes before fetching the next. There are exceptions to this basic rule, but the key to maximising performance on this architecture is to keep the pipeline full, as shown in figure 3. A second advantage of removing complex functions (hardware) is that the size of the logic in each stage is reduced, reducing its critical path delay i.e. how long it takes a block of combinatorial logic to settle down to a final value. Therefore, smaller, faster logic allows the processor's clock speed to be increased, further improving instruction throughput.

Figure 3 : non-pipelined (top) and pipelined (bottom)instructions

This CPU design emphasis triggered a reconsideration of how processors were designed, what functions should be implemented in hardware and what should be done in software i.e. as a sequence of simple instructions, rather than a single complex instruction. However, the requirement to allow an instruction to complete on each clock cycle has a number of undesirable consequences that limit the types of addressing modes and instructions that can be used on these CPUs. In the SimpleCPU_v1d the absolute addressing mode could be used with the ADD and SUB instructions:

ADD RA 0x100      RA <- RA + M[0x100]
SUB RB 0x55       RB <- RB - M[0x55]

However, the downside of these instructions is that they require two memory accesses i.e. when the instruction is first fetched and then its accesses its operand data. Multiple memory accesses would prevent the next instruction being accessed when pipelining i.e a basic memory architecture only has a single address and data bus, restricting it to a single memory transaction per clock cycle. Therefore, to minimise the occurrence of this issue the only instructions allowed to access operand data from memory are the LOAD and STORE instructions. All other instruction have to access their operands from registers. This gives the RISC architecture its second commonly used name i.e. a LOAD / STORE architecture. This shift in where operands are stored necessitates an increase in the register-file size i.e. we need space to store our temporary variables somewhere, so you typically you see 8, 16 or more general purpose registers in this type of CPU. A downside of this change is the increase in instruction length (bits) owing to the associated increases in register bit-field sizes (RegSel), as shown in figure 4.

Figure 4 : instruction formats

As the register-file size is increased we need more bits for the register select bit fields (RegSel). This leaves less bits available for the opcode and operand bit fields. If we stick with the same fixed length instruction (16bits) these new instructions will not be capable of representing the desired number ranges etc. Therefore, the instruction length would need to be increased from 16bits to 24bits or 32bits. Increasing memory size / cost etc. In general the increased use of registers to store working variables is a good design optimisation, as it does minimises the number of memory accesses and allows the pipeline to operate correctly. This optimisation also increases processing performance when compared to accessing data from external memory as registers are typically quicker to access.

To state the obvious increasing the register-file size does not remove the memory access problem, we still need the LOAD and STORE instructions to move data between memory and registers. Therefore, we will need to break the golden rule of the MIPS and add an interlock to stop an instruction-fetch when processing the LOAD and STORE instructions i.e. a stall, as shown in figure 5.

Figure 5 : STORE instruction stall

The issue shown in this example is called a structural hazard, the processor is trying to read two different memory locations on the same clock cycle i.e. the RED blocks, the MA stage for the LOAD instruction and the IF stage for instruction 4. This problem occurs as memory typically only has a single address bus, as shown in figure 6 i.e. a hardware limitation. Typically memory only has a single port i.e. a single address and data bus. However, the memory devices in the FPGA do also support dual port memory i.e. a memory with two separate address and data buses, allowing two memory transactions per clock cycle (slight complexities if read / write to the same memory location). Therefore, we could remove this structural hazard by replacing the single port memory used by the simpleCPU_v1d with a dual port version. Now one memory port would be used for accessing instructions, the other for data accesses. However, i decided not to do this, as dual port memory is not typically used for main memory, also its a good teaching example of a structural hazard in action :).

Note, in the example shown in figure 5 we will assume that instructions 2 - 4 are independent register based instructions i.e. the data accessed by the LOAD instruction is not used by instructions 2 - 4, and their operands are stored in different registers. If not other hazards will occur. Also assume that instructions 2 - 4 are not LOAD / STORE instructions :).

Figure 6 : Memory devices, single port (left), dual port (right)

Therefore, to over come this structural hazard we need to delay the fetch of instruction 4 i.e. delay the issue of instruction 3 in the ID stage until the LOAD instruction has accessed its data. This delay is called a pipeline stall, the act of stopping instructions moving onto the next pipeline stage. This effectively creates "bubbles" into the pipeline i.e. gaps in the instruction flow, where no instruction are being executed, as we allow instructions already in the pipeline to complete. There are two common methods of implementing this stall:

Insert a NOP : when a bubble is created insert an instruction that does not affect the processor's state e.g. a no-operation (NOP) instruction. When these "fake" instructions are executed the program's state is not affected, as they perform no operation :).
Tag as not valid : allow the pipeline stage to operate as normal. This will duplicate the instruction being held in the ID stage. However, to prevent these instructions from being executed they are tagged as not valid i.e a VALID flag is added to the EX, MA and WB stages, such that as these duplicated instructions pass through the pipeline they do not change the program's state.

In the MIPS architecture a NOP instruction was created by using the R0 register. This register was not really a register, rather a hardwired constant i.e. if you read R0 you would always get zero and if you wrote a value to R0 it would be lost. Therefore, the instruction ADD R0, R0 i.e. R0 <- R0+R0 does nothing, as the operands fetched are constants and the result of the ADD is discarded. We could do something similar in this processor, however, to avoid adding new instruction or altering the existing register file, i decided to go for the second approach and add a VALID flag to each stage.

Another common problem in a pipelined processor are RAW hazards i.e Read After Write hazards. These occur in a pipelined CPU as an instruction does not complete before we issue the next. Therefore, if the next instruction uses the result of the previous instruction i.e. the result of the instruction ahead of it in the pipeline, when it accesses its operands it will access old operand values, not the new result, as shown in figure 7.

Figure 7 : RAW hazard

In this example the first two MOVE instructions initialise registers RA and RB with the values 1 and 2. These values will be written to the register-file at the end of clock cycle 5. However, the ADD instruction accesses its operands (these values) during its ID stage i.e. clock cycle 3. Therefore, it will access the previous values of RA and RB. Therefore, to prevent this error within the ID stage, the instruction decoder compares the source register fields of the current instruction to the destination register fields of the instructions in the EX, MA and WB stages. If these match, the current instruction is held in the ID phase i.e. stalled, until the matching instructions within the pipeline have written their results to the register file.

There is one final design problem we need to consider for this processor: Control Hazards. Although the SimpleCPU_v1d and the MIPS processor are architecturally similar i.e. they are RISCy rather than CISCy, they have fundamental differences in how they process their instructions. The SimpleCPU_v1d allows an instruction to complete before starting the next, whereas the MIPS overlaps the execution of five instructions within its processing pipeline. This difference is significant when it comes to flow-control instructions i.e. conditional and unconditional jump instructions. A MIPS example is shown in figure 8.

Figure 8: Control hazard

Here, the jump instruction's (JUMP) target address is not selected until the MA phase i.e. cycle 5. Therefore, the pipeline has to stall for three clock cycles until the address of the next instruction is known. On the SimpleCPU_v1d this issue does not occur as each instruction completes before the next instruction is fetched. This issue is further complicated when you consider conditional jum instructions e.g. JUMPZ, JUMPNZ etc. These are typically controlled via flags stored in the processor's' status register:

Z : zero
C : carry
O : overflow
P : positive
N : negative

These flags store the outcome of the last arithmetic or logic instruction, and are globally visible across the processor. The SimpleCPU_v1d's branch instructions shown in figure 9. The top 4bits representing the opcode, the lower 12bit (AAAAAAAAAAAA) the target address. These instructions implementing the IF-THEN-ELSE or FOR/WHILE software structures.

Instr.	Description	IR15	IR13	IR12	IR11	IR10	IR09	IR08	IR07	IR06	IR05	IR04	IR03	IR02	IR01	IR00
JUMPU	unconditional jump	1	0	0	A	A	A	A	A	A	A	A	A	A	A	A
JUMPZ	jump if zero	1	0	1	A	A	A	A	A	A	A	A	A	A	A	A
JUMPNZ	jump if not zero	1	1	0	A	A	A	A	A	A	A	A	A	A	A	A
JUMPC	jump if carry	1	1	1	A	A	A	A	A	A	A	A	A	A	A	A

Figure 9 : SimpleCPU_v1d branch instructions

The unconditional jump instruction JUMPU (or just JUMP) will always jump to the branch target address. For the conditional jump instructions: JUMPZ, JUMPNZ and JUMPC, the branch target address is dependant on the flags stored in the status register. For the SimpleCPU_v1d these indicate the result of the last arithmetic or logic instruction. However, for the MIPS determining this status register's state is a little more complicated, as it could be the result of three different instructions within the pipeline, as shown in figure 10 i.e. the result from the overlapped EX stages.

Figure 10: Status register updates

There a different approaches to overcome the issues involved of updating the status register. However, like the previous issue of access operand data from memory, pipelined processors tend to overcome this problem by removing the status register from the processor's architecture and implementing conditional jumps using the value stored in a general purpose register, as shown in the next section (figure 12).

Note, there are other better solutions to overcome structural, RAW and Control hazards. The ones previously discussed i.e. based on stalls, were selected as they are simple to implement in terms of software and hardware.

Instruction Set Architecture (ISA)

Owing to the problems (hazards) highlighted in the previous section a pipelined processor''s instruction-set tends to follow a common format. A good example is shown below based on the DLX processor (Link).

Format	Bits
	31 26	25 21	20 16	15 11	10 6	5 0
R-type	0x0	Rs1	Rs2	Rd	unused	opcode
I-type	opcode	Rs1	Rd	immediate
J-type	opcode	value

Figure 11: DLX instruction format

Instr.	Description	Format	Opcode	Operation (C-style coding)
ADD	add	R	0x20	Rd = Rs1 + Rs2
ADDI	add immediate	I	0x08	Rd = Rs1 + extend(immediate)
AND	and	R	0x24	Rd = Rs1 & Rs2
ANDI	and immediate	I	0x0c	Rd = Rs1 & immediate
BEQZ	branch if equal to zero	I	0x04	PC += (Rs1 == 0 ? extend(immediate) : 0)
BNEZ	branch if not equal to zero	I	0x05	PC += (Rs1 != 0 ? extend(immediate) : 0)
J	jump	J	0x02	PC += extend(value)
JAL	jump and link	J	0x03	R31 = PC + 4 ; PC += extend(value)
JALR	jump and link register	I	0x13	R31 = PC + 4 ; PC = Rs1
JR	jump register	I	0x12	PC = Rs1
LHI	load high bits	I	0x0f	Rd = immediate << 16
LW	load word	I	0x23	Rd = MEM[Rs1 + extend(immediate)]
OR	or	R	0x25	Rd = Rs1 \| Rs2
ORI	or immediate	I	0x0d	Rd = Rs1 \| immediate
SEQ	set if equal	R	0x28	Rd = (Rs1 == Rs2 ? 1 : 0)
SEQI	set if equal to immediate	I	0x18	Rd = (Rs1 == extend(immediate) ? 1 : 0)
SLE	set if less than or equal	R	0x2c	Rd = (Rs1 <= Rs2 ? 1 : 0)
SLEI	set if less than or equal to immediate	I	0x1c	Rd = (Rs1 <= extend(immediate) ? 1 : 0)
SLL	shift left logical	R	0x04	Rd = Rs1 << (Rs2 % 8)
SLLI	shift left logical immediate	I	0x14	Rd = Rs1 << (immediate % 8)
SLT	set if less than	R	0x2a	Rd = (Rs1 < Rs2 ? 1 : 0)
SLTI	set if less than immediate	I	0x1a	Rd = (Rs1 < extend(immediate) ? 1 : 0)
SNE	set if not equal	R	0x29	Rd = (Rs1 != Rs2 ? 1 : 0)
SNEI	set if not equal to immediate	I	0x19	Rd = (Rs1 != extend(immediate) ? 1 : 0)
SRA	shift right arithmetic	R	0x07	as SRL & see below
SRAI	shift right arithmetic immediate	I	0x17	as SRLI & see below
SRL	shift right logical	R	0x06	Rd = Rs1 >> (Rs2 % 8)
SRLI	shift right logical immediate	I	0x16	Rd = Rs1 >> (immediate % 8)
SUB	subtract	R	0x22	Rd = Rs1 - Rs2
SUBI	subtract immediate	I	0x0a	Rd = Rs1 - extend(immediate)
SW	store word	I	0x2b	MEM[Rs1 + extend(immediate)] = Rd
XOR	exclusive or	R	0x26	Rd = Rs1 ^ Rs2
XORI	exclusive or immediate	I	0x0e	Rd = Rs1 ^ immediate

Figure 12: DLX instruction-set

The instruction set of the new pipelined SimpleCPU will be a cut down version of the DLX processor's. However, do want to keep a similar style to the original simpleCPU_v1d e.g. the LOAD and STORE instructions have to use register RA for the source/destination register. However, like the MIPS processor these instructions use a new addressing mode: displacement. As previously discussed only the LOAD and STORE instructions can now access data in memory, so i have expanded the register-file to 8 general purpose data registers: RA,RB,RC,RD,RE,RF,RG and RH. I have also removed the status register, making conditional JUMP instructions conditional on a general purpose register, the same as the MIPS processor. This also requires a new addressing mode: relative, as these conditional jump instruction no longer have enough bits to represent the 12bit target address, now PC offset by +254 to -255. The CALL and RET instructions have also been replaced with the MIPS style JAL and JR i.e. the processor no longer has an internal CALL/RET stack, rather the return address is placed in register RH. The CALL/RET stack is now implemented in external memory using software. For initial testing the ISA is limited to:

 INSTR   IR15 IR14 IR13 IR12 IR11 IR10 IR09 IR08 IR07 IR06 IR05 IR04 IR03 IR02 IR01 IR00    Operation (C-style coding)

 MOVE    0    0    0    0    Rd   Rd   Rd    X    K    K    K    K    K    K    K    K      Rd = sign_extend( immediate )     
 ADD     0    0    0    1    Rd   Rd   Rd    X    K    K    K    K    K    K    K    K      Rd = Rd + sign_extend( immediate )  
 SUB     0    0    1    0    Rd   Rd   Rd    X    K    K    K    K    K    K    K    K      Rd = Rd - sign_extend( immediate )  
 AND     0    0    1    1    Rd   Rd   Rd    X    K    K    K    K    K    K    K    K      Rd = Rd & ( immediate )  

 LOAD    0    1    0    0    Rs   Rs   Rs    A    A    A    A    A    A    A    A    A      RA = MEM[Rs + sign_extend( value )]
 STORE   0    1    0    1    Rs   Rs   Rs    A    A    A    A    A    A    A    A    A      MEM[Rs + sign_extend( value )] = RA

 JUMPU   1    0    0    0     A    A    A    A    A    A    A    A    A    A    A    A      PC = value
 JUMPZ   1    0    0    1    Rs   Rs   Rs    A    A    A    A    A    A    A    A    A      PC = PC + (Rs == 0 ? sign_extend( value ) : 0)
 JUMPNZ  1    0    1    0    Rs   Rs   Rs    A    A    A    A    A    A    A    A    A      PC = PC + (Rs != 0 ? sign_extend( value ) : 0)

 JAL     1    1    0    0     A    A    A    A    A    A    A    A    A    A    A    A      RH = PC+1; PC = value
 JR      1    1    1    1    Rs   Rs   Rs    X    X    X    X    X    0    0    0    0      PC = Rs

 MOVE    1    1    1    1    Rd   Rd   Rd   Rs   Rs   Rs    X    X    0    0    0    1      Rd = Rs
 ADD     1    1    1    1    Rd   Rd   Rd   Rs   Rs   Rs    X    X    0    0    1    0      Rd = Rd + Rs     
 SUB     1    1    1    1    Rd   Rd   Rd   Rs   Rs   Rs    X    X    0    0    1    1      Rd = Rd - Rs
 AND     1    1    1    1    Rd   Rd   Rd   Rs   Rs   Rs    X    X    0    1    0    0      Rd = Rd & Rs

Figure 13: Pipelined simpleCPU instruction-set

VHDL Prototype

To speed up initial develop i implemented the prototype in VHDL i.e. a lot quicker to write some HDL than to draw a schematic. This processor was based on the simpleCPU_v1d with a few "small" alterations. The automatic RTL schematic (for what its worth) is shown in figure 14.

Figure 14: pipelined simplecpu schematic

To test if this processor is working correctly we need to devise a series of testcases to test how this hardware handles: Structural, RAW and Control hazards.

Test Case 1 : Structural hazards

ASSEMBLY CODE              COMMENTS                        ADDR     MACHINE CODE
-------------              --------                        ----     ------------
LOAD RA 0(RA)    # RA = 0100 0000 0000 0000 = 0x4000       0000   0100000000000000
MOVE RB 1        # RB = 1 = 0x0001                         0001   0000001000000001
MOVE RC 2        # RC = 2 = 0x0010                         0002   0000010000000010
MOVE RD 3        # RD = 3 = 0x0011                         0003   0000011000000011 
MOVE RE 4        # RE = 4 = 0x0100                         0004   0000100000000100

Figure 15: test case 1a, code (top), correct execution (middle), simulation (bottom)

The initial value of RA=0, therefore the LOAD instruction will load itself i.e. the bit pattern that represents LOAD RA 0(RA) into RA. The remaining MOVE instruction load the values 1 - 4 into registers RB - RE. If all is working correctly the above values will be loaded into registers RA - RE. Unfortunately this first test case broke the hardware :). Not in a way i would guess, the hardware did detect the structural hazard correctly, rather that the first instruction caused the processor to lock i.e. the LOAD RA 0(RA) incorrectly detected that RA was being read as the pipeline registers were reset to 0. This incorrectly signalled instructions were in the pipeline that were going to write to register RA. Therefore, the instruction LOAD RA 0(RA) was never released from the decode stage. Solution, needed a hardware flag to signal if the EX, MEM and WB stages were empty. Went for a quick and dirty solution i.e. an extra shift register (stage_enable in waveform diagram), maybe will come back to this one, but it works.

ASSEMBLY CODE        COMMENTS        ADDR     MACHINE CODE
-------------        --------        ----     ------------
LOAD RA 5(RA)    # RA = 0x000A       0000   0100000000000101
LOAD RA 6(RB)    # RA = 0x000B       0001   0100001000000110 
LOAD RA 7(RC)    # RA = 0x000C       0002   0100001000000111
LOAD RA 8(RD)    # RA = 0x000D       0003   0100001000001000
LOAD RA 9(RE)    # RA = 0x000E       0004   0100001000001001

DATA 0xA                             0005   0000000000001010
DATA 0xB                             0006   0000000000001011
DATA 0xC                             0007   0000000000001100
DATA 0xD                             0008   0000000000001101
DATA 0xE                             0009   0000000000001110

Figure 16: test case 1b, code (top), correct execution (top middle), simulation (bottom middle), actual execution (bottom)

Owing to the limitation of the LOAD instruction the value read from memory has to be stored in register RA, so the above program does not make any practical sense as the values read from memory are being overwritten i.e. only the last LOAD result can be accessed by other instruction. However, i thought this would be an interesting test case to test if the displacement addressing mode was functioning correctly i.e. changing the offset. Note, initial values in RA - RE are assumed to be 0 i.e. reset. Like the first test case this code also failed :(. This issue with this testcase was that the series of LOAD instructions resulted in the structural hazard detection hardware locking the pipeline i.e. the hardware stopped fetching instructions after the third LOAD. This error does not occur for a single LOAD i.e. when there is a mix of memory and register based instructions, as used in figure 15. To solve this issue i made some adjustments to the hazard detection hardware. This "solved" the issue, however, the instruction timing was not as expected, not the same as shown in figure 16 (top). Now the third LOAD instruction's stall behaviour is slightly different, as shown in figure 16 (bottom). The current solution does "work", but the ID phase is held for one cycle more than expected. Looking at the memory transactions there are no wasted cycles, so can be argued this is just as efficient, but i could not find an easy fit to produce the original behaviour. Again, this is something that i will need to have a second look at.

Test Case 2 : RAW hazards

RAW hazards occur when you try to read data (register) before it has been written to a register. These types of hazards occur a lot in pipelined processors as we don't allow an instruction to complete before we issue the next.

ASSEMBLY CODE        COMMENTS        ADDR     MACHINE CODE
-------------        --------        ----     ------------
MOVE RA 0x0F     # RA = 0x000F       0000   0000000000001111
STORE RA 7(RB)   # RB = 0x0000       0001   0101001000000111 

DATA 0x9                             0002   0000000000001001
DATA 0xA                             0003   0000000000001010
DATA 0xB                             0004   0000000000001011
DATA 0xC                             0005   0000000000001100
DATA 0xD                             0006   0000000000001101

Figure 17: test case 2a, code (top), correct execution (middle), simulation (bottom)

This test code tests for RAW hazards and to again test the displacement addressing mode i.e. the write address should be 7+0=7. Here the value in RA is initialised to 15, therefore, the STORE instruction has to be stalled until the WB phase has completed. The value in RA is then written to memory address 7 (assuming RB=0). This program seemed to work OK :). The instruction timing could be improved by allowing delayed reads i.e. WB write is performed early and ID reads are delayed, allowing the WB and ID phases to overlapped. However, to keep this optimisation back for a later design, i kept in the each ID read cycle. Again, something that i will have a second look at later.

ASSEMBLY CODE        COMMENTS        ADDR     MACHINE CODE
-------------        --------        ----     ------------
MOVE RA 0x0F     # RA = 0x000F       0000   0000000000001111
MOVE RB 0x05     # RB = 0x0005       0001   0000001000000101
STORE RA 7(RB)   #                   0002   0101001000000111 

DATA 0x9                             0003   0000000000001001
DATA 0xA                             0004   0000000000001010
DATA 0xB                             0005   0000000000001011
DATA 0xC                             0006   0000000000001100
DATA 0xD                             0007   0000000000001101

Figure 18: test case 2b, code (top), correct execution (middle), simulation (bottom)

This test code tests for two RAW hazards and to again test the displacement addressing mode i.e. the write address should be 7+5=12. Here the value in RB is initialised to 5, therefore, the STORE instruction has to be stalled until the WB phase for RB has completed. The value in RA is then written to memory address 12. This program seemed to work OK :).

ASSEMBLY CODE        COMMENTS        ADDR     MACHINE CODE
-------------        --------        ----     ------------
MOVE RA 0
MOVE RB 10
MOVE RC 11
MOVE RD 12

LOAD RA 0(RA)
STORE RA 0(RB) 
STORE RA 0(RC)  
STORE RA 0(RD)  
STORE RA 0(RE)

ASSEMBLY CODE        COMMENTS        ADDR     MACHINE CODE
-------------        --------        ----     ------------
LOAD RA 4(RA)    # RA = 0x000A       0000   0100000000000100
STORE RA 6(RB)   #                   0001   0101001000000101
LOAD RA 5(RC)    # RA = 0x000B       0002   0100001000000110
STORE RA 7(RD)   #                   0003   0101001000000111

DATA 0xA                             0004   0000000000001010
DATA 0xB                             0005   0000000000001011
DATA 0xC                             0006   0000000000001100
DATA 0xD                             0007   0000000000001101

Figure 19: test case 2c, code (top), correct execution (top middle), simulation (bottom middle), actual execution (bottom)

This code would be more typical of how LOAD/STORE instructions would work, could see this block of code in a loop with RA,RB,RC and RD being pointers used to copy blocks of data stored in memory.

Figure 19: test case 2b, code (top), correct execution (middle), simulation (bottom)

Again this test case used to test RAW hazards i.e. does the processor what for RA - RD to be updated before performing the LOAD and STORE instructions. Also tests that structural hazards are handled correctly i.e. between the LOAD and STORE instructions.

Test Case 3 : Control hazards

start:
    JUMPU   skip
    MOVE RB 1        
 
skip:       
    MOVE RB 2

Figure 19: test case 3a, code (top), correct execution (middle), simulation (bottom)

start:
    MOVE RA 0
    MOVE RB 1  
    JUMPZ RA skip1     
    MOVE RC 2 

skip1:  
    JUMPZ RB skip2     
    MOVE RC 3     

skip2:   
    MOVE RC 4

Figure 20: test case 3b, code (top), correct execution (middle), simulation (bottom)

start:
    MOVE RA 0
    MOVE RB 1  
    JUMPNZ RA skip1     
    MOVE RC 2 

skip1:  
    JUMPNZ RB skip2     
    MOVE RC 3     

skip2:   
    MOVE RC 4

Figure 21: test case 3c, code (top), correct execution (middle), simulation (bottom)

start:
    MOVE RA 0
    JAL func     
    MOVE RC 2 

func:  
    MOVE RC 3   
    JR RH

Figure 22: test case 3d, code (top), correct execution (middle), simulation (bottom)

WORK IN PROGRESS

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Contact email: mike@simplecpudesign.com

Back