Simple CPU v1e: FPGA


Figure 1 : SimpleCPU_v1e block diagram

The design philosophy for all the simpleCPU processors is to develop a processor for teaching: a processor architecture that helps highlight specific instruction-set or architectural features. As a result, the SimpleCPU architectures have built-in limitations, intentional known flaws. This is because some things are best explained by doing, rather than reading. You don't understand why a computer's architecture looks the way it does until you hit one of these limitations, i.e. when you try to do something and realise that it's not possible on a given processor unless changes are made to its instruction-set, hardware, or typically both. To illustrate this, a fundamental limitation was built into the simpleCPU_v1d's CALL/RET stack. This hardware was based on the CALL/RET stack implemented in the Xilinx PicoBlaze processor (Link). This processor uses a hardware stack, rather than a data stack stored in main memory, i.e. the PicoBlaze uses a separate memory space that cannot be accessed by the processor. The advantage of this approach is that it significantly simplifies the hardware, i.e. it removes a shared resource, allowing the stack and external memory to work in parallel. However, it does limit the PicoBlaze to 15 nested CALL instructions. For this small embedded processor this is typically more than sufficient, unless you are using recursive algorithms. To save hardware, on the simpleCPU_v1d this depth is limited to 4 nested CALL instructions, but again for basic programming examples this is typically fine, that is until you try to code a more complex program :). Therefore, the simpleCPU_v1e continues our journey into processor development by implementing an architecture capable of holding its stack in main memory. We now have an architecture capable of executing code that uses recursive algorithms, which are always fun :). This new processor is shown in figure 1, from which you can clearly see the increased hardware costs of implementing this functionality.
However, I must admit there is an elegance in how stack and frame pointers work, which more than justifies this hardware investment, as we will see later.

Table of Contents

CALL/RET + Data Stack
Stack operations
Implementing the Stack

Register file
Control logic
The classic recursive programming example: Fibonacci sequence :)

CALL/RET + Data Stack

Figure 2 : SimpleCPU memory map

Before we can build this new processor we need to understand how a typical processor's stack functions. The SimpleCPU's external memory can be considered to be a contiguous block of storage locations, from address 0x000 to address 0xFFF, as shown in figure 2. In the SimpleCPU family of architectures the first instruction is always fetched from address 0x000, i.e. the reset or boot vector. Note, this may vary for different processors. Therefore, instruction and data memory will typically grow upwards from address 0x000 as program sizes increase. Also in this memory space are memory mapped peripheral devices, e.g. GPIO, serial ports etc. These are allocated fixed memory locations; their specific position will vary for a given system. It's not uncommon for these special memory locations to be allocated at the top of memory, e.g. 0xFFF, well away from the instruction and data memory regions. However, as shown in figure 2, they can also be allocated in the "middle-ish". This is a design decision based on the system's requirements. Finally we have the stack: this data structure grows and shrinks as data is pushed onto it or popped off it. Conceptually the stack is allocated at the top of memory and grows down as data is added to it. Therefore, the simpleCPUv1e's stack will typically start at address 0xFFF, but its location is user defined.

A stack is a Last In First Out (LIFO) data structure; more information on this abstract data type is available here: Link. To control the stack's operation we need a stack pointer (SP), i.e. a piece of hardware that points to the top of the stack: the next free memory location where data can be stored. Typically, this hardware is implemented as an application specific register within the processor; however, it is also common to implement the SP in a general purpose register within the register file. This pointer is used by machine-level instructions to push (write) and pop (read) data to and from this data structure. For the simpleCPUv1e these instructions are:

Push

Example            :    push rb
Addressing mode    :    register
Opcode             :    1111 + 01101
RTL                :    M[SP] <- RX
                        SP <- SP - 1 
Flags set          :    None

Store the value held in a general purpose register to the top of the stack, as defined by the stack pointer (SP) register. In this example RX is register RB; therefore, if RB=10 and SP=2000, at the end of the execution phase M[2000]=10 and SP=1999.

Pop

Example            :    pop rb
Addressing mode    :    register
Opcode             :    1111 + 01110
RTL                :    SP <- SP + 1 
                        RX <- M[SP] 
Flags set          :    None

Copy the value stored at the top of the stack, as defined by the stack pointer (SP) register, to a general purpose register. In this example RX is register RB; therefore, if SP=1999 and M[2000]=10, at the end of the execution phase RB=10 and SP=2000.

Call

Example            :    call 300
Addressing mode    :    direct
Opcode             :    1101
RTL                :    M[SP]<- PC + 1
                   :    SP <- SP - 1
                   :    PC <- AAA
Flags set          :    None

Call subroutine. In this example the subroutine address is 300. If PC=100 and SP=2000, at the end of the execution phase the PC is updated to 300 and SP=1999. The return address 101 is written to the top of the stack, as defined by the stack pointer (SP) register.

Ret

Example            :    ret
Addressing mode    :    implicit
Opcode             :    1111 + 00000
RTL                :    SP <- SP + 1    
                   :    PC <- STACK[SP]
Flags set          :    None

Return from subroutine. At the end of the execution phase the PC will be updated to the return address at stack address SP+1. If M[SP+1]=101 and SP=1999, at the end of the execution phase the PC is updated to 101 and SP=2000.
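The RTL above can be sketched as a small Python model of the stack semantics (a behavioural sketch only, not the hardware). The worked examples from the PUSH, POP, CALL and RET descriptions are reproduced as assertions:

```python
# Minimal Python model of the simpleCPU_v1e stack instructions,
# following the RTL above: a sketch of the semantics, not the hardware.

MEM_SIZE = 0x1000                 # addresses 0x000 - 0xFFF

class StackModel:
    def __init__(self, sp=0xFFF):
        self.mem = [0] * MEM_SIZE
        self.sp = sp              # stack pointer: next free location
        self.pc = 0               # program counter

    def push(self, value):        # M[SP] <- RX ; SP <- SP - 1
        self.mem[self.sp] = value
        self.sp -= 1

    def pop(self):                # SP <- SP + 1 ; RX <- M[SP]
        self.sp += 1
        return self.mem[self.sp]

    def call(self, addr):         # M[SP] <- PC + 1 ; SP <- SP - 1 ; PC <- AAA
        self.mem[self.sp] = self.pc + 1
        self.sp -= 1
        self.pc = addr

    def ret(self):                # SP <- SP + 1 ; PC <- M[SP]
        self.sp += 1
        self.pc = self.mem[self.sp]

# Reproduce the worked examples above:
cpu = StackModel(sp=2000)
cpu.push(10)                      # push rb with RB=10: M[2000]=10, SP=1999
assert cpu.sp == 1999 and cpu.mem[2000] == 10
assert cpu.pop() == 10 and cpu.sp == 2000

cpu.pc = 100
cpu.call(300)                     # return address 101 saved on the stack
assert cpu.pc == 300 and cpu.mem[2000] == 101 and cpu.sp == 1999
cpu.ret()
assert cpu.pc == 101 and cpu.sp == 2000
```

Note how PUSH/CALL write before decrementing SP, while POP/RET increment before reading: SP always points to the next free location, not the last value written.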

Stack operations

In this version of the processor the CALL-RETURN stack is implemented in external memory rather than the hardware LIFO used in the simpleCPU_v1d. Note, on the one hand this simplifies the hardware, but in reality it adds significant complexities, as we shall see later. In addition to storing the return addresses needed for subroutine calls, the stack is now also used to pass parameters and return results to and from a subroutine. The key point to note is that a subroutine no longer passes data via registers or absolute memory locations; rather, this data is now stored on the stack. This decoupling allows recursive programming techniques, as each subroutine call's data is stored in a separate stack frame.

In addition to the stack pointer we also need a frame pointer (FP). This allows a known, constant offset to be used to calculate the position of parameters and results stored on the stack. Typically, this register is implemented using a general purpose register within the processor's register file. To illustrate these pointers in action consider the example code below:

start:
    move sp 0xFF          # for this small example set top of stack (SP) to be 0xFF
    move rh 0xFF          # use RH as the frame pointer (FP), set to top of stack 

    move RA 0x0A          # push onto stack data to be processed: 0x0A, 0x50
    push RA
    move RA 0x50
    push RA

    call mult             # call multiplication subroutine
    pop RG                # remove processed data from stack 
    pop RG                

trap:
    jump trap             # stop code 

mult:                     # subroutine - multiplication using repeated addition
    push RH               # save old FP to stack
    move RH SP            # update FP to top of stack 
    move RA 0             # initialise local variable
    push RA

loop:
    move RG RH            # generate address to access parameter 0
    add RG 3
    load RA (RG)

    load RB (RH)          # generate address to access local variable
    add RA RB             # add parameter 0 to local variable and update
    store RA (RH)

    move RG RH            # generate address to access parameter 1
    add RG 0x04
    load RA (RG)          # decrement parameter 1 and update
    sub RA 1
    store RA (RG)
    jumpnz loop           # repeat if not zero

    pop RA                # store local variable in RA
    pop RH                # restore old FP
    ret                   # return back to main program
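Before stepping through the figures, the same sequence can be sketched in Python using a plain list as memory. This is a model of the stack behaviour only: the CALL/RET side effects are emulated with explicit pushes and pops, and the return address 0x006 is the one shown in the walkthrough figures.

```python
# Python sketch of the multiply-by-repeated-addition example above:
# memory is a plain list, SP points to the next free location, and the
# frame pointer (FP, register RH in the assembly) anchors the stack frame.

mem = [0] * 0x100
sp = 0xFF                         # move sp 0xFF
fp = 0xFF                         # move rh 0xFF

def push(value):
    global sp
    mem[sp] = value               # M[SP] <- value ; SP <- SP - 1
    sp -= 1

def pop():
    global sp
    sp += 1                       # SP <- SP + 1 ; value <- M[SP]
    return mem[sp]

push(0x0A)                        # parameter: loop counter
push(0x50)                        # parameter: value to accumulate

# call mult: the CALL instruction would push the return address itself
push(0x006)                       # return address (from figure 6)

# --- mult subroutine ---
push(fp)                          # push RH : save old FP
fp = sp                           # move RH SP : FP -> start of this frame
push(0)                           # local variable: accumulated result

while True:
    mem[fp] += mem[fp + 3]        # add parameter at FP+3 to local var at FP
    mem[fp + 4] -= 1              # decrement loop counter at FP+4
    if mem[fp + 4] == 0:          # jumpnz loop
        break

result = pop()                    # pop RA : local variable holds the result
fp = pop()                        # pop RH : restore old FP
ret_addr = pop()                  # ret : return address back into the PC

pop(); pop()                      # caller cleans up the two parameters

assert result == 0x320            # 0x50 * 0x0A
assert sp == 0xFF and fp == 0xFF and ret_addr == 0x006
```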

The execution of this program can be broken down into the following steps.

Initialise the stack pointer (SP) and frame pointer (FP). SP=FP=0xFF. Note, this value was chosen to simplify this example, typically these would be set to 0xFFF.

Figure 3 : initialise SP and FP

The example code implements a simple subroutine that performs multiplication through repeated addition. The parameter representing the multiplier is pushed onto the stack i.e. the value 0x0A.

Figure 4 : push parameter 0

The second parameter is pushed onto the stack i.e. the multiplicand, the value 0x50. The final result will therefore be 0x50 * 0x0A = 0x320.

Figure 5 : push parameter 1

The CALL instruction pushes onto the stack the return address i.e. PC+1, in this example the address 0x006. Control then passes to the MULT subroutine.

Figure 6 : CALL subroutine

On entering a subroutine, the subroutine is responsible for checkpointing the old FP and updating the FP to the starting address of the current stack frame. The subroutine code is also responsible for preserving the state of any general purpose registers used by the calling program, i.e. these registers need to be pushed onto the stack (checkpointed) at the start of the subroutine and popped off the stack (restored) at the end of the subroutine, such that the subroutine is free to use these registers without corrupting the calling program's state. In this simple example no registers need to be checkpointed.

Figure 7 : checkpoint and update FP

Space on the stack can be reserved for local variables i.e. variables used in the subroutine. In this example one variable is allocated for the accumulated result.

Figure 8 : local variables

The subroutine's stack frame is now complete. Stack locations before the FP contain the passed parameters and the return address. Stack locations at or after the FP contain local variables, or other stack frames.

Figure 9 : a stack frame

Main loop. Generate the address of parameter 0, i.e. the multiplicand (the value 0x50). This is a fixed, known offset from the FP (FP+3).

Figure 10 : access parameter 0

Access the local variable stored on the stack, used to accumulate the multiplication result.

Figure 11 : access local variable

Add multiplicand to accumulating result and write back to stack.

Figure 12 : process data

Generate the address of parameter 1, i.e. the multiplier (the value 0x0A). This is a fixed, known offset from the FP (FP+4).

Figure 13 : access parameter 1

Multiplication is performed via repeated addition, i.e. the multiplier is used as a loop counter: decrement the multiplier and write it back to the stack.

Figure 14 : decrement multiplier

If multiplier is not zero repeat loop.

Figure 15 : conditional jump

Repeat previous steps ...

Figure 16 : repeat loop

Eventually, the multiplier will reach zero, i.e. the required number of repeated additions has been performed; at this point the conditional jump is not taken and the program exits the main loop.

Figure 17 : exit main loop

Pop the local variable, i.e. the multiplication result, off the stack; this will be stored in register RA and passed back to the main program in this register. An alternative approach would be to overwrite one of the passed parameters with this data, such that the main program pops the result off the stack when it is cleaning up the stack frame.

Figure 18 : pop local variable

Pop the old FP off the stack, i.e. restore the FP back to the value used in the main program.

Figure 19 : restore old FP

Return back to the main program, i.e. pop the return address off the stack, then update the PC with this value.

Figure 20 : return from subroutine

Clean up the stack frame i.e. remove parameters 0 and 1.

Figure 21 : clean up stack frame

Finally, for this simple program, enter an infinite loop to stop the program re-entering the subroutine code. As there is no OS to return to, we need to stop the processor from executing data, or random values stored in memory that are not part of the main program.

Figure 22 : infinite loop

From this example and the RTL descriptions of the PUSH, POP, CALL and RET instructions we can identify the operations and data values required to implement the stack data structure in external memory. The hardware needed to support these operations is described in the following sections.

Implementing the Stack

Figure 23 : processor

From one point of view the stack comes with zero hardware costs, i.e. it's an abstract data type stored in external memory. However, this isn't the case; implementing a CALL/RET + data stack needs a lot of additional hardware, as illustrated by the significantly more complex circuit diagram shown in figure 23. One additional complexity is that this version of the simpleCPU is based around a hardwired controller, i.e. the execution phase is performed in a single clock cycle. Therefore, each machine-level instruction must be implemented using one or more parallel micro-instructions, i.e. sequential micro-instructions cannot be used, as this would require a variable length execution phase. Note, an example of an architecture with a variable length execution phase is the micro-programmed simpleCPU_v2; for more information on this processor: Link.

Stack pointer

Figure 24 : stack pointer symbol

The stack pointer (SP) is implemented as a new component, shown in figure 24; its internal architecture is shown in figure 25. To allow the processor to access the previous, current and next stack pointer values, this component includes an incrementer and a decrementer. Note, the double register implementation is used to ensure that these stack values are stable during an instruction's execution phase.
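A behavioural sketch of this component in Python: the registered SP value is presented together with incremented and decremented copies, and only updates on the clock edge, keeping all three outputs stable during the execution phase. Signal and method names here are illustrative, not taken from the schematic.

```python
# Behavioural sketch of the stack pointer component (figures 24/25):
# a 12-bit register whose current value is presented alongside
# incremented and decremented copies, so an instruction can select
# SP, SP+1 or SP-1 within a single execution cycle.

class StackPointer:
    WIDTH_MASK = 0xFFF                     # 12-bit pointer

    def __init__(self):
        self.q = 0                         # registered (stable) SP value
        self.d = 0                         # next value, captured on clock

    # combinational outputs, all derived from the stable register
    @property
    def current(self):
        return self.q

    @property
    def plus_one(self):                    # incrementer output
        return (self.q + 1) & self.WIDTH_MASK

    @property
    def minus_one(self):                   # decrementer output
        return (self.q - 1) & self.WIDTH_MASK

    def load(self, value):                 # e.g. move SP 0xFFF
        self.d = value & self.WIDTH_MASK

    def clock(self):                       # update on the clock edge only,
        self.q = self.d                    # so outputs are stable mid-phase

sp = StackPointer()
sp.load(0xFFF); sp.clock()
# a PUSH would select minus_one as the next SP value:
sp.load(sp.minus_one); sp.clock()
assert sp.current == 0xFFE and sp.plus_one == 0xFFF
```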

Figure 25 : stack pointer schematic

Program counter

Figure 26 : program counter symbol

The program counter (PC) used in the simpleCPUv1d contained a hardware LIFO to implement its CALL/RET stack. The simpleCPU_v1e processor implements its stack in external memory. To implement this functionality a 12-bit loadable counter is used. Input data sources include:

Note, the register used to drive the PC's Y output is used to hold the PC stable during stack updates.

Figure 27 : program counter schematic

Figure 28 : counter_12 component

Address multiplexer

Figure 29 : Address multiplexer

The address multiplexer selects between the following sources:

Data-out multiplexer

Figure 30 : Data-out multiplexer

The data-out multiplexer selects between the following sources:

Data-in multiplexer

Figure 31 : Data-in multiplexer

The data-in multiplexer selects between the following sources:

Figure 32 : Data_mux component.

Register file

Figure 33 : Register file.

An upgrade from the simpleCPUv1d: an increase in the number of general purpose registers from 4 to 8. However, RH is reserved for the FP, so in practice there are only 7 general purpose registers. Also, absolute addressing modes are still limited to register RA, those built in limitations :).

Figure 34 : register_file_8 component.

Control logic

Figure 35 : Control logic

The instruction bit-field formats for this processor have been modified to reflect the changes to its register file, e.g. an increase from a 2-bit register select bit-field to 3 bits. Another change is the increase to 5 bits for the lower opcode bit-field, allowing 32 register addressing mode instructions, as shown in figure 36.
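As a rough illustration of these bit-fields, the sketch below packs a register-addressing instruction into 16 bits: a 4-bit high opcode (1111), a 5-bit low opcode and a 3-bit register select. Note, the exact bit positions used here are an assumption for illustration only; figure 36 defines the real layout.

```python
# Hedged sketch of a register-addressing instruction encoding.
# Assumed layout (NOT confirmed by the text, see figure 36):
#   [15:12] high opcode, [11:7] low opcode, [6:4] register select.

REGS = {name: i for i, name in enumerate("RA RB RC RD RE RF RG RH".split())}

def encode_reg_op(low_opcode, rx):
    """Pack a register-addressing instruction into a 16-bit word."""
    return (0b1111 << 12) | (low_opcode << 7) | (REGS[rx] << 4)

PUSH, POP = 0b01101, 0b01110      # low opcodes from the instruction tables

word = encode_reg_op(PUSH, "RB")
assert word >> 12 == 0b1111       # 4-bit high opcode field
assert (word >> 7) & 0x1F == PUSH # 5-bit low opcode field
assert (word >> 4) & 0b111 == 1   # 3-bit register select: RB
```

The 5-bit low opcode field is what gives the 32 register addressing mode instructions mentioned above (2^5 = 32).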

Figure 36 : register and register-indirect instructions

Absolute addressing mode instructions remain the same, apart from the addition of a new MOVE instruction to allow the user to load a 12-bit value into the stack pointer (SP).

Figure 37 : absolute and immediate instructions

The processor's status register and interrupt control logic remain the same as the simpleCPUv1d1's. The final control logic is shown in figure 39.

Figure 38 : status register

Figure 39 : control logic

As with other versions of the simpleCPU, this processor uses a 16-bit fixed length instruction format, processed within the processor over three clock cycles. The instruction phase is represented by a 3-bit ring counter. The first level of instruction decoding is comparable to previous processors, but with the expansion of the lower opcode field from 4 bits to 5 bits. Therefore, the lower opcode binary-to-onehot decoder is replaced with a onehot_decoder_32 component. Note, this provides spare capacity for 16 further instructions. The interrupt hardware is the same as that used in the simpleCPU_v1d1; more information on this hardware can be found here: Link. The decoded outputs from the high and low opcode decoders are passed to the VHDL decoder component, which generates the required control signals needed during each instruction phase. An instruction is executed over three clock cycles, the fetch, decode and execute phases, each being performed in one clock cycle. Note, to keep the processor's internal operations comparable to previous processors, this processor does not support variable length instruction formats, i.e. instructions stored across multiple memory locations, or variable length instruction execution times, i.e. instructions that need more than one clock cycle, such as divide functions. As a result there are limitations placed on supported instruction functionality and addressing modes. Therefore, "complex" functions and addressing modes need to be implemented from multiple simple instructions.
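The fetch/decode/execute sequencing described above can be sketched as a one-hot ring counter: exactly one of the three phase bits is high in any clock cycle. A minimal behavioural sketch (names are illustrative):

```python
# 3-bit one-hot ring counter tracking the instruction phases:
# exactly one of FETCH / DECODE / EXECUTE is high per clock cycle.

FETCH, DECODE, EXECUTE = 0b001, 0b010, 0b100

def ring_counter(state):
    """Rotate the one-hot phase register left, wrapping after EXECUTE."""
    return ((state << 1) | (state >> 2)) & 0b111

state = FETCH
phases = []
for _ in range(6):                # two full instructions = six cycles
    phases.append(state)
    state = ring_counter(state)

assert phases == [FETCH, DECODE, EXECUTE, FETCH, DECODE, EXECUTE]
```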

The classic recursive programming example: a Fibonacci sequence :)

To test out the simpleCPUv1e's stack we will use the classic recursive programming example: generating a Fibonacci sequence. This program can be described by the following Python code:

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

for i in range(11):
    print(fib(i))

A nice discussion of what the Fibonacci sequence is and how to code this solution in Java can be found here: Link (basically the same-ish). For more information on what the Fibonacci sequence is: Link. To keep things simple (to draw) consider the example below where n=4, this would result in the following subroutine calls:

Figure 40 : subroutine calls for n=4

In this example the maximum call depth is 4; for larger values of n the depth grows accordingly, e.g. for n=10 the depth will be 10. To implement this code on the simpleCPUv1e the following assembly code can be used:

######################
# Fibonacci sequence #
######################

#   Stack frame
# ---------------
# |      N      |
# ---------------
# | Return ADDR |
# ---------------
# |      FP     | 
# ---------------
# |      RB     | <- FP
# ---------------
# |  Local VAR  |
# ---------------
  
start:
    move SP 0xFFF         # set stack pointer
    move RH SP            # set frame pointer 
    load RA N             # get N
    move RB RA
loop:
    push RB               # push N onto stack
    call fib              # call function

    pop RA                # get result
    store RA STDOUT       # write to STDOUT
    sub RB 1              # decrement N 
    jumpnz loop           # repeat if not zero 

trap:
    jump trap             # stop program

fib:
    push RH               # save old FP to stack
    move RH SP            # update FP to top of stack 

    push RB               # save registers used by calling program

    move RA 0             # allocate space for local variable
    push RA

    move RB RH            # generate address to access N
    add RB 3
    load RA (RB)

    and RA 0xFF           # test exit condition (non-JUMPC)
    jumpz exit
    sub RA 1
    jumpz exit

    load RA (RB)          # load N 
    sub RA 1              # N-1
    push RA               # push new N onto stack, call fib()
    call fib

    pop RA                # get fib() result

    move RB RH            # generate address of local VAR
    sub RB 1
    store RA (RB)         # save in local variable

    move RB RH            # generate address to access N
    add RB 3
    load RA (RB)          # load N
    sub RA 2              # N-2
    push RA               # push new N onto stack, call fib()
    call fib

    pop RA                # get fib() result

    move RB RH            # generate address of local VAR
    sub RB 1
    load RC (RB)
    add RA RC             # add first and second fib() result

    move RB RH            # generate address of return result
    add RB 3
    store RA (RB)         # overwrite N with result
    
exit:
    pop RA                # remove local variable
    pop RB                # restore RB
    pop RH                # restore RH i.e. FP
    ret
   
N:
    .data 10 

STDOUT:
    .data 0

A couple of points to note about this code and how it uses the stack. As you would expect, data is added to and removed from the stack using PUSH and POP instructions. However, you can also access data from within the stack, i.e. using LOAD/STORE instructions and the frame pointer (FP). This allows you to modify data already on the stack, which for a true Last-In-First-Out (LIFO) data structure would not be possible, e.g. on a hardware stack you can only access the data on the top of the stack. That's why you need a FP: it gives you a fixed reference into the stack, i.e. a stack frame, allowing you to generate the required address pointers needed to access passed parameters and save results.

In addition to holding the processor's CALL/RET return addresses, the stack is also used to pass parameters to, and results back from, subroutines. As previously discussed, a subroutine should preserve the state of the calling code's registers, e.g. pushing the old FP onto the stack. It should also preserve the state of any general purpose registers, such that when the subroutine returns back to the calling code any registers that contained data still needed by this code are restored back to their original values. To illustrate this, the main program uses register RB to hold the value of N. However, the subroutine also uses registers RA, RB and RC. Therefore, on entry to the subroutine the original value of RB is pushed onto the stack, so that it can be overwritten by the subroutine; then, before the subroutine exits (RET), the original value of RB is popped off the stack and restored. Note, register RA is also used by the main program, but it is not used to hold a static variable (data), i.e. the value of RA is overwritten by the main program on each loop, so it does not need to be preserved (checkpointed). A simple optimisation would be to limit register usage, i.e. allocate different registers to the main program and subroutine. This would work for this example, but in general would be a pain to manage.

There is a small "bug" in this assembly language program: it doesn't print out 0, i.e. the first number in the Fibonacci sequence, but it's close enough for me :). The aim of this program was to test that the stack and its associated instructions were functioning correctly, which they seem to be doing. I'm not sure what the maximum Fibonacci value is that can be calculated by this algorithm on the simpleCPUv1e, i.e. the point where the stack grows too big and starts to overwrite program space or the assigned address space for memory mapped peripherals. However, if each stack frame is 5 elements this would be roughly:

(0xFFF-CODE_SIZE)/5 = (0xFFF-50)/5 = 809
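This back-of-envelope figure can be checked with a couple of lines of Python (assuming, as above, roughly 50 words of code/data at the bottom of memory and 5 stack words per fib() frame):

```python
# Check of the stack-depth estimate above: the stack starts at 0xFFF and
# grows down towards roughly 50 words of code/data, 5 words per frame.
TOP_OF_STACK = 0xFFF
CODE_SIZE = 50
FRAME_WORDS = 5

max_frames = (TOP_OF_STACK - CODE_SIZE) // FRAME_WORDS
assert max_frames == 809
```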

n=809 will take a very long time to simulate :). The waveform simulation for n=10 is shown in figure 41 below. In this simulation a debug module was used to indicate when the program was executing code from the main program or the subroutine, or writing results to DATA, i.e. the STDOUT bus. This bus is highlighted in the waveform diagram, displaying the Fibonacci sequence: 55, 34, 21, 13, 8, 5, 3, 2, 1, 1.

Figure 41 : simulation fun for n=10

WORK IN PROGRESS

Creative Commons Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Contact email: mike@simplecpudesign.com
