RISC-V Piplelined Processor

Overview

In lecture we examined the RV32I 5-stage pipeline used in Ripes — how instructions flow through Fetch, Decode, Execute, Memory, and Writeback in overlapping waves, and how the hazard unit keeps that flow correct in the face of data and control dependencies. In this lab you will use Ripes to measure that behavior directly.

By the end of this lab you should be able to:

Switch Ripes to the pipelined processor model and navigate its interface
Identify forwarding paths and the hazard unit in the extended layout
Predict the number of stall and bubble cycles introduced by data and control hazards
Eliminate avoidable load/use stalls by reordering instructions
Measure CPI from Ripes execution statistics and reason about what drives it
Explore loop unrolling to reduce the number of pipeline stalls

Submission

After completing the lab, upload the following to Gradescope:

File	Contents
`measurements.txt`	Your completed copy of every measurement table in this lab
`part3.s`	Your stall-free reordered program from Part 3
`part5.s`	Your CPI-optimized program from Part 5

The two .s files will be run through Ripes automatically — make sure they assemble cleanly, produce the correct output, and terminate with an ecall instruction like the example programs do.

Part 1 — Pipeline Orientation

1.1 Switch to the Pipelined Processor

Open Ripes: ripes from the terminal.
Click the Select Processor button (CPU chip icon) in the toolbar.
Choose RISC-V 32-bit 5-Stage Processor.
Under Layout, select Extended.
Click OK.

1.2 Load a Warmup Program

Clear the editor and enter the following:

.text
main:
    addi t0, x0, 3
    addi t1, x0, 4
    add  t2, t0, t1
    addi t3, t2, 1
    addi t4, t3, 1

    li   a7, 10
    ecall

Click Reset, switch to the Processor tab, and step through the program one clock at a time using the Clock (forward arrow) button.

1.3 Locate Key Components

Before moving on, find each of the following in the Extended layout and confirm you know what it does. You will need to recognize these on sight throughout the lab.

Component	What to look for
Five pipeline registers	The vertical register bars labeled IF/ID, ID/EX, EX/MEM, MEM/WB that separate the stages
Hazard unit	The block that drives the stall and bubble control signals
Forwarding MUXes	The multiplexers in the Decode stage that select between register file values and forwarded results
Branch comparator	The comparator in Execute that determines whether a branch is taken

Tip: Click any wire to highlight its full route. Hover to see its name and current value. Use Ctrl + scroll to zoom.

1.4 Observe Forwarding

Step slowly through the warmup program and watch the forwarding MUXes. The add t2, t0, t1 instruction reads t0 and t1, which were written by the two preceding addi instructions. At the clock cycle when add is in the Execute (EX) stage, look at the forwarding MUXes. Verity that their selected input is the forwarded value from the EX/MEM and MEM/WB registers, not the (stale) register file output.

Confirm that no stalls are inserted for this dependency and that forwarding resolves it with zero penalty.

Part 2 — Hazards and Forwarding

In this part you will verify that the Hazard/Forwardign units handle RAW (read after write) dependencies. All programs below compute the same final value (t3 = 9); they differ only in how many instructions separate a write from its dependent read.

2.1 The Programs

Enter and run each of the four programs below. For each one, record the total cycle count from the Execution info panel.

Program A — no dependency (3 instructions between write and read)

.text
main:
    addi t0, x0, 4       # writes t0
    addi t1, x0, 1       # independent
    addi t2, x0, 2       # independent
    addi t3, x0, 3       # independent
    add  t3, t0, t3      # reads t0  (3 instructions later)

    li   a7, 10
    ecall

Program B — one instruction between write and read

.text
main:
    addi t0, x0, 4       # writes t0
    addi t1, x0, 1       # independent
    add  t3, t0, t1      # reads t0  (1 instruction later)

    li   a7, 10
    ecall

Program C — immediately adjacent (write then read)

.text
main:
    addi t0, x0, 4       # writes t0
    add  t3, t0, t0      # reads t0  (0 instructions later)

    li   a7, 10
    ecall

Program D — back-to-back chain (each instruction depends on the previous)

.text
main:
    addi t0, x0, 1
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above
    addi t0, t0, 1       # depends on line above

    li   a7, 10
    ecall

2.2 Forwarding Table

Record the numbers of bubbles and cycles for each program. The forwarding rule is: a RAW hazard is resolved by forwarding with zero stalls as long as the producing instruction has reached or passed the Execute stage by the time the consuming instruction reaches Decode. A load/use hazard (covered in Part 3) is the one exception.

Fill in this table in measurements.txt:

-------------------------
Program | Measured cycles | Measured bubbles
--------|-----------------|------------------
A       |                 |
B       |                 |
C       |                 |
D       |                 |

Bubbles = (cycles) - (instructions retired) - (pipeline drain) - (ecall stalls)
Pipeline drain = 4 extra cycles for startup/teardown
ecall stalls = 2 cycle stall for every ecall
  
For this table, use: Bubbles = cycles - (instructions retired) - 6

Was the Forwarding Unit able to prevent stalls (bubbles) in all programs? Put your answer in measurements.txt along with your table.

Part 3 — Eliminating Load/Use Stalls

A lw instruction does not produce its result until the Memory stage. If the immediately following instruction reads the loaded register, forwarding cannot help as the data simply does not exist yet at the point where Decode stage needs it. The hazard unit must stall the instruction in the Decode stage, and insert a Bubble in the Execute stage for one cycle.

3.1 The Baseline Program

Enter the following program, which has several load/use hazards deliberately baked in:

.data
vals: .word 10, 20, 30

.text
main:
    la   t0, vals
    lw   t1, 0(t0)       # t1 = 10
    add  t4, t1, t1      # HAZARD: uses t1 immediately
    lw   t2, 4(t0)       # t2 = 20
    add  t5, t2, t1      # HAZARD: uses t2 immediately
    lw   t3, 8(t0)       # t3 = 30
    add  t6, t3, t2      # HAZARD: uses t3 immediately

    # Print t4 + t5 + t6 (should be 20 + 30 + 50 = 100)
    add  a0, t4, t5
    add  a0, a0, t6
    li   a7, 1
    ecall

    li   a7, 10
    ecall

Run it and record the cycle count. Put your answer in measurements.txt. Verify the printed result is 100.

3.2 Reorder to Remove Stalls

The three load/use hazards above can each be eliminated by moving an independent instruction into the slot between the load and its consumer. You must not change the final computed values.

Rewrite the program with the instructions reordered so that zero load/use stall cycles are inserted. Save your reordered version as part3.s.

Rules:

The program must still print 100 and terminate cleanly.
You may not add, remove, or change any instruction — only reorder them.
The la t0, vals must remain first (it sets up the base address).

Record the before and after cycle counts in measurements.txt.

Part 3 — Load/Use Stall Elimination
-------------------------------------
Baseline cycle count   :
Reordered cycle count  :
Stall cycles eliminated:

Check your work: step through your reordered program. You should see no stall (hold) signals asserted anywhere in the load-to-use sequence.

Part 4 — Branch Misprediction Penalty

Ripes implements a predict-not-taken strategy: it always fetches PC+4 after a branch. When a branch turns out to be taken, the two instructions that entered the pipeline behind it are wrong and the hazard unit converts them to bubbles. That costs 2 cycles per misprediction.

4.1 Summing Loop

.text
main:
    addi t0, x0, 0       # sum = 0
    addi t1, x0, 1       # i = 1
    addi t2, x0, 6       # limit = 6

loop:
    add  t0, t0, t1      # sum += i
    addi t1, t1, 1       # i++
    blt  t1, t2, loop    # if i < 6, loop back

    mv   a0, t0          # print sum (should be 1+2+3+4+5 = 15)
    li   a7, 1
    ecall

    li   a7, 10
    ecall

4.2 Prediction Table

Work out the branch behavior before running:

How many times is the blt evaluated?
How many times is it taken (mispredicted under predict-not-taken)?
How many times is it not taken (predicted correctly)?
How many total branch penalty cycles does that produce?

Then run the program and record the measured cycle count.

Part 4 — Branch Misprediction
-------------------------------
Times branch evaluated          :
Times branch taken (mispredict) :
Times branch not taken (correct):
Predicted branch penalty cycles :
Measured total cycle count      :
Expected cycle count*           :

Expected = (instruction count × 1) + (pipeline drain) + (branch penalty cycles) + (ecall stalls)

For this program: 
  instruction count = Instructions retired
  pipeline drain = 4 cycles startup/teardown
  ecall stalls = 2 cycle stall for each ecall
  branch penalty = 2 × mispredictions

4.3 Modify the Loop Bound

Change t2 to 11 (so the loop runs 10 iterations, summing 1 through 10). Run the modified program, verify the printed result is 55, and record:

Part 4 (modified — 10 iterations)
------------------------------------
Times branch taken (mispredict)  :
Total branch penalty cycles      :
Measured total cycle count       :

Add these results to your measurements.txt file.

Part 5 — Cycle Challenge

Your task is to modify a RISC-V program that:

Computes the dot product of two hard-coded 8-element integer vectors and prints the result.
Reduces the total number of cycles for program execution. How low can you get it?

The two vectors are:

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]

The dot product is A[0]×B[0] + A[1]×B[1] + … + A[7]×B[7]. For these vectors the result is 1×8 + 2×7 + … + 8×1 = 120.

An unoptimized version of the code is shown below.

# Dot product of A and B (8 elements each)
# A = [1, 2, 3, 4, 5, 6, 7, 8]
# B = [8, 7, 6, 5, 4, 3, 2, 1]
# Expected result: 120

.data
A: .word 1, 2, 3, 4, 5, 6, 7, 8
B: .word 8, 7, 6, 5, 4, 3, 2, 1

.text
main:
    la   t0, A            # t0 = pointer into A
    la   t1, B            # t1 = pointer into B
    addi t2, x0, 0        # t2 = sum = 0
    addi t3, x0, 8        # t3 = loop counter

loop:
    lw   t4, 0(t0)        # t4 = A[i]
    lw   t5, 0(t1)        # t5 = B[i]
    mul  t4, t4, t5       # t4 = A[i] * B[i]
    add  t2, t2, t4       # sum += t4
    addi t0, t0, 4        # advance A pointer
    addi t1, t1, 4        # advance B pointer
    addi t3, t3, -1       # counter--
    bne  t3, x0, loop     # if counter != 0, repeat

    mv   a0, t2
    li   a7, 1
    ecall

    li   a7, 10
    ecall

5.1 Requirements

Your program must print 120 using ecall with the value 1 in the a7 register and terminate with ecall with the value 10 in the a7 register.
The number of cycles is reported in the Ripes Execution info panel during execution.
You may unroll the loop, interleave instructions, or use any legal RISC-V technique. You may not use instructions outside RV32I.
You many use any of the temporary registers (t0-t6) and any argument registers (a0-a7). You are not allowed to use any of the saved registers (s0-s11).

5.2 Loop Unrolling

Loop unrolling is a technique where you process multiple elements per loop iteration instead of one. Rather than looping 8 times and handling one element each time, you might loop 4 times and handle two elements per iteration, or eliminate the loop entirely by writing out all 8 elements sequentially.

A loop with a branch has two costs: the branch instruction itself, and a 2-cycle penalty every time the branch is taken (i.e., every iteration except the last). Unrolling removes both. It also exposes more opportunities to reorder instructions to cover load/use stalls, since you have a larger window of independent instructions to work with.

The tradeoff is code size, an unrolled loop is larger. For a short, fixed-iteration loop like the dot product here, that’s a reasonable cost. First, consider processing two elements of the vectors during each iteration of the loop. You will only loop four times instead of eight, reducing the number of mispredicted branches. Once you get that working, you can try a loop unroll of four, or even try to replace the loop entirely.

5.3 Strategy Notes

Extra cycles comes from two sources: load/use stalls and branch mispredictions. Before writing any code, think through:

If you use a loop with lw instructions, how many load/use stalls will each iteration incur, and can they be avoided by reordering?
How many times will your loop branch, and how many of those will be mispredictions?
Would fully unrolling the loop (eliminating branches entirely) give you a a lower cycle count than a tight loop with careful scheduling?

Describe your approach and document your reasoning briefly at the top of part5.s in a comment block.

Appendix A — Ripes Quick Reference

Processor Controls

Button	Action
Reset	Reload program; clear registers and memory
Clock (single step)	Advance one cycle
Reverse	Undo one cycle
Auto-clock	Step continuously at a chosen rate
Run	Execute to completion (fastest; no per-cycle GUI update)

Reading Execution Statistics

After clicking Run, the “Execution info” pane has the statistics on the program. The key fields are:

Field	Meaning
Cycles	Total clock cycles elapsed
Instructions retired	Dynamic instruction count (not static)
CPI	Cycles Per Instruction (Cycles / instructions retired)
IPC	Instructions Per Cycle (instructions retired / Cycles)

ecall Quick Reference

a7	Effect
1	Print integer in `a0`
4	Print null-terminated string whose address is in `a0`
10	Exit (halt simulator)
11	Print character in `a0`

Appendix B — Hazard Reference Card

Hazard	Cause	Penalty	Resolution
RAW (non-load)	Instruction reads register written 1–2 steps earlier	0 cycles	Forwarded from EX or MEM stage
Load/Use	`lw` followed immediately by instruction reading loaded register	1 cycle	Stall + forward from MEM
Branch misprediction	Taken branch under predict-not-taken policy	2 cycles	Bubbles injected into ID and EX