RISC-V Piplelined Processor

Overview

In lecture we examined the RV32I 5-stage pipeline used in Ripes — how instructions flow through Fetch, Decode, Execute, Memory, and Writeback in overlapping waves, and how the hazard unit keeps that flow correct in the face of data and control dependencies. In this lab you will use Ripes to measure that behavior directly.

By the end of this lab you should be able to:


Submission

After completing the lab, upload the following to Gradescope:

File Contents
measurements.txt Your completed copy of every measurement table in this lab
part3.s Your stall-free reordered program from Part 3
part5.s Your CPI-optimized program from Part 5

The two .s files will be run through Ripes automatically — make sure they assemble cleanly, produce the correct output, and terminate with an ecall instruction like the example programs do.


Part 1 — Pipeline Orientation

1.1 Switch to the Pipelined Processor

  1. Open Ripes: ripes from the terminal.
  2. Click the Select Processor button (CPU chip icon) in the toolbar.
  3. Choose RISC-V 32-bit 5-Stage Processor.
  4. Under Layout, select Extended.
  5. Click OK.

1.2 Load a Warmup Program

Clear the editor and enter the following:

.text
main:
    addi t0, x0, 3
    addi t1, x0, 4
    add  t2, t0, t1
    addi t3, t2, 1
    addi t4, t3, 1

    li   a7, 10
    ecall

Click Reset, switch to the Processor tab, and step through the program one clock at a time using the Clock (forward arrow) button.

1.3 Locate Key Components

Before moving on, find each of the following in the Extended layout and confirm you know what it does. You will need to recognize these on sight throughout the lab.

Component What to look for
Five pipeline registers The vertical register bars labeled IF/ID, ID/EX, EX/MEM, MEM/WB that separate the stages
Hazard unit The block that drives the stall and bubble control signals
Forwarding MUXes The multiplexers in the Decode stage that select between register file values and forwarded results
Branch comparator The comparator in Execute that determines whether a branch is taken

Tip: Click any wire to highlight its full route. Hover to see its name and current value. Use Ctrl + scroll to zoom.

1.4 Observe Forwarding

Step slowly through the warmup program and watch the forwarding MUXes. The add t2, t0, t1 instruction reads t0 and t1, which were written by the two preceding addi instructions. At the clock cycle when add is in the Execute (EX) stage, look at the forwarding MUXes. Verity that their selected input is the forwarded value from the EX/MEM and MEM/WB registers, not the (stale) register file output.

Confirm that no stalls are inserted for this dependency and that forwarding resolves it with zero penalty.


Part 2 — Hazards and Forwarding

In this part you will verify that the Hazard/Forwardign units handle RAW (read after write) dependencies. All programs below compute the same final value (t3 = 9); they differ only in how many instructions separate a write from its dependent read.

2.1 The Programs

Enter and run each of the four programs below. For each one, record the total cycle count from the Execution info panel.

Program A — no dependency (3 instructions between write and read)

.text
main:
    addi t0, x0, 4       # writes t0
    addi t1, x0, 1       # independent
    addi t2, x0, 2       # independent
    addi t3, x0, 3       # independent
    add  t3, t0, t3      # reads t0  (3 instructions later)

    li   a7, 10
    ecall

Program B — one instruction between write and read

.text
main:
    addi t0, x0, 4       # writes t0
    addi t1, x0, 1       # independent
    add  t3, t0, t1      # reads t0  (1 instruction later)

    li   a7, 10
    ecall

Program C — immediately adjacent (write then read)

.text
main:
    addi t0, x0, 4       # writes t0
    add  t3, t0, t0      # reads t0  (0 instructions later)

    li   a7, 10
    ecall

Program D — back-to-back chain (each instruction depends on the previous)

.text
main:
    addi t0, x0, 1
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above

    li   a7, 10
    ecall

2.2 Forwarding Table

Record the numbers of bubbles and cycles for each program. The forwarding rule is: a RAW hazard is resolved by forwarding with zero stalls as long as the producing instruction has reached or passed the Execute stage by the time the consuming instruction reaches Decode. A load/use hazard (covered in Part 3) is the one exception.

Fill in this table in measurements.txt:

-------------------------
Program | Measured cycles | Measured bubbles
--------|-----------------|------------------
A       |                 |
B       |                 |
C       |                 |
D       |                 |

Bubbles = (cycles) - (instructions retired) - (pipeline drain) - (ecall stalls)
Pipeline drain = 4 extra cycles for startup/teardown
ecall stalls = 2 cycle stall for every ecall
  
For this table, use: Bubbles = cycles - (instructions retired) - 6

Was the Forwarding Unit able to prevent stalls (bubbles) in all programs? Put your answer in measurements.txt along with your table.


Part 3 — Eliminating Load/Use Stalls

A lw instruction does not produce its result until the Memory stage. If the immediately following instruction reads the loaded register, forwarding cannot help as the data simply does not exist yet at the point where Decode stage needs it. The hazard unit must stall the instruction in the Decode stage, and insert a Bubble in the Execute stage for one cycle.

3.1 The Baseline Program

Enter the following program, which has several load/use hazards deliberately baked in:

.data
vals: .word 10, 20, 30

.text
main:
    la   t0, vals
    lw   t1, 0(t0)       # t1 = 10
    add  t4, t1, t1      # HAZARD: uses t1 immediately
    lw   t2, 4(t0)       # t2 = 20
    add  t5, t2, t1      # HAZARD: uses t2 immediately
    lw   t3, 8(t0)       # t3 = 30
    add  t6, t3, t2      # HAZARD: uses t3 immediately

    # Print t4 + t5 + t6 (should be 20 + 30 + 50 = 100)
    add  a0, t4, t5
    add  a0, a0, t6
    li   a7, 1
    ecall

    li   a7, 10
    ecall

Run it and record the cycle count. Put your answer in measurements.txt. Verify the printed result is 100.

3.2 Reorder to Remove Stalls

The three load/use hazards above can each be eliminated by moving an independent instruction into the slot between the load and its consumer. You must not change the final computed values.

Rewrite the program with the instructions reordered so that zero load/use stall cycles are inserted. Save your reordered version as part3.s.

Rules:

Record the before and after cycle counts in measurements.txt.

Part 3 — Load/Use Stall Elimination
-------------------------------------
Baseline cycle count   :
Reordered cycle count  :
Stall cycles eliminated:

Check your work: step through your reordered program. You should see no stall (hold) signals asserted anywhere in the load-to-use sequence.


Part 4 — Branch Misprediction Penalty

Ripes implements a predict-not-taken strategy: it always fetches PC+4 after a branch. When a branch turns out to be taken, the two instructions that entered the pipeline behind it are wrong and the hazard unit converts them to bubbles. That costs 2 cycles per misprediction.

4.1 Summing Loop

.text
main:
    addi t0, x0, 0       # sum = 0
    addi t1, x0, 1       # i = 1
    addi t2, x0, 6       # limit = 6

loop:
    add  t0, t0, t1      # sum += i
    addi t1, t1, 1       # i++
    blt  t1, t2, loop    # if i < 6, loop back

    mv   a0, t0          # print sum (should be 1+2+3+4+5 = 15)
    li   a7, 1
    ecall

    li   a7, 10
    ecall

4.2 Prediction Table

Work out the branch behavior before running:

Then run the program and record the measured cycle count.

Part 4 — Branch Misprediction
-------------------------------
Times branch evaluated          :
Times branch taken (mispredict) :
Times branch not taken (correct):
Predicted branch penalty cycles :
Measured total cycle count      :
Expected cycle count*           :

Expected = (instruction count × 1) + (pipeline drain) + (branch penalty cycles) + (ecall stalls)

For this program: 
  instruction count = Instructions retired
  pipeline drain = 4 cycles startup/teardown
  ecall stalls = 2 cycle stall for each ecall
  branch penalty = 2 × mispredictions

4.3 Modify the Loop Bound

Change t2 to 11 (so the loop runs 10 iterations, summing 1 through 10). Run the modified program, verify the printed result is 55, and record:

Part 4 (modified — 10 iterations)
------------------------------------
Times branch taken (mispredict)  :
Total branch penalty cycles      :
Measured total cycle count       :

Add these results to your measurements.txt file.


Part 5 — Cycle Challenge

Your task is to modify a RISC-V program that:

  1. Computes the dot product of two hard-coded 8-element integer vectors and prints the result.
  2. Reduces the total number of cycles for program execution. How low can you get it?

The two vectors are:

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]

The dot product is A[0]×B[0] + A[1]×B[1] + … + A[7]×B[7]. For these vectors the result is 1×8 + 2×7 + … + 8×1 = 120.

An unoptimized version of the code is shown below.

# Dot product of A and B (8 elements each)
# A = [1, 2, 3, 4, 5, 6, 7, 8]
# B = [8, 7, 6, 5, 4, 3, 2, 1]
# Expected result: 120

.data
A: .word 1, 2, 3, 4, 5, 6, 7, 8
B: .word 8, 7, 6, 5, 4, 3, 2, 1

.text
main:
    la   t0, A            # t0 = pointer into A
    la   t1, B            # t1 = pointer into B
    addi t2, x0, 0        # t2 = sum = 0
    addi t3, x0, 8        # t3 = loop counter

loop:
    lw   t4, 0(t0)        # t4 = A[i]
    lw   t5, 0(t1)        # t5 = B[i]
    mul  t4, t4, t5       # t4 = A[i] * B[i]
    add  t2, t2, t4       # sum += t4
    addi t0, t0, 4        # advance A pointer
    addi t1, t1, 4        # advance B pointer
    addi t3, t3, -1       # counter--
    bne  t3, x0, loop     # if counter != 0, repeat

    mv   a0, t2
    li   a7, 1
    ecall

    li   a7, 10
    ecall

5.1 Requirements

5.2 Loop Unrolling

Loop unrolling is a technique where you process multiple elements per loop iteration instead of one. Rather than looping 8 times and handling one element each time, you might loop 4 times and handle two elements per iteration, or eliminate the loop entirely by writing out all 8 elements sequentially.

A loop with a branch has two costs: the branch instruction itself, and a 2-cycle penalty every time the branch is taken (i.e., every iteration except the last). Unrolling removes both. It also exposes more opportunities to reorder instructions to cover load/use stalls, since you have a larger window of independent instructions to work with.

The tradeoff is code size, an unrolled loop is larger. For a short, fixed-iteration loop like the dot product here, that’s a reasonable cost. First, consider processing two elements of the vectors during each iteration of the loop. You will only loop four times instead of eight, reducing the number of mispredicted branches. Once you get that working, you can try a loop unroll of four, or even try to replace the loop entirely.

5.3 Strategy Notes

Extra cycles comes from two sources: load/use stalls and branch mispredictions. Before writing any code, think through:

Describe your approach and document your reasoning briefly at the top of part5.s in a comment block.


Appendix A — Ripes Quick Reference

Processor Controls

Button Action
Reset Reload program; clear registers and memory
Clock (single step) Advance one cycle
Reverse Undo one cycle
Auto-clock Step continuously at a chosen rate
Run Execute to completion (fastest; no per-cycle GUI update)

Reading Execution Statistics

After clicking Run, the “Execution info” pane has the statistics on the program. The key fields are:

Field Meaning
Cycles Total clock cycles elapsed
Instructions retired Dynamic instruction count (not static)
CPI Cycles Per Instruction (Cycles / instructions retired)
IPC Instructions Per Cycle (instructions retired / Cycles)

ecall Quick Reference

a7 Effect
1 Print integer in a0
4 Print null-terminated string whose address is in a0
10 Exit (halt simulator)
11 Print character in a0

Appendix B — Hazard Reference Card

Hazard Cause Penalty Resolution
RAW (non-load) Instruction reads register written 1–2 steps earlier 0 cycles Forwarded from EX or MEM stage
Load/Use lw followed immediately by instruction reading loaded register 1 cycle Stall + forward from MEM
Branch misprediction Taken branch under predict-not-taken policy 2 cycles Bubbles injected into ID and EX