RISC-V Piplelined Processor
Overview
In lecture we examined the RV32I 5-stage pipeline used in Ripes — how instructions flow through Fetch, Decode, Execute, Memory, and Writeback in overlapping waves, and how the hazard unit keeps that flow correct in the face of data and control dependencies. In this lab you will use Ripes to measure that behavior directly.
By the end of this lab you should be able to:
- Switch Ripes to the pipelined processor model and navigate its interface
- Identify forwarding paths and the hazard unit in the extended layout
- Predict the number of stall and bubble cycles introduced by data and control hazards
- Eliminate avoidable load/use stalls by reordering instructions
- Measure CPI from Ripes execution statistics and reason about what drives it
- Explore loop unrolling to reduce the number of pipeline stalls
Submission
After completing the lab, upload the following to Gradescope:
| File | Contents |
|---|---|
measurements.txt |
Your completed copy of every measurement table in this lab |
part3.s |
Your stall-free reordered program from Part 3 |
part5.s |
Your CPI-optimized program from Part 5 |
The two .s files will be run through Ripes automatically — make sure they
assemble cleanly, produce the correct output, and terminate with an ecall
instruction like the example programs do.
Part 1 — Pipeline Orientation
1.1 Switch to the Pipelined Processor
- Open Ripes:
ripesfrom the terminal. - Click the Select Processor button (CPU chip icon) in the toolbar.
- Choose RISC-V 32-bit 5-Stage Processor.
- Under Layout, select Extended.
- Click OK.
1.2 Load a Warmup Program
Clear the editor and enter the following:
.text
main:
addi t0, x0, 3
addi t1, x0, 4
add t2, t0, t1
addi t3, t2, 1
addi t4, t3, 1
li a7, 10
ecall
Click Reset, switch to the Processor tab, and step through the program one clock at a time using the Clock (forward arrow) button.
1.3 Locate Key Components
Before moving on, find each of the following in the Extended layout and confirm you know what it does. You will need to recognize these on sight throughout the lab.
| Component | What to look for |
|---|---|
| Five pipeline registers | The vertical register bars labeled IF/ID, ID/EX, EX/MEM, MEM/WB that separate the stages |
| Hazard unit | The block that drives the stall and bubble control signals |
| Forwarding MUXes | The multiplexers in the Decode stage that select between register file values and forwarded results |
| Branch comparator | The comparator in Execute that determines whether a branch is taken |
Tip: Click any wire to highlight its full route. Hover to see its name and current value. Use Ctrl + scroll to zoom.
1.4 Observe Forwarding
Step slowly through the warmup program and watch the forwarding MUXes. The
add t2, t0, t1 instruction reads t0 and t1, which were written by the two
preceding addi instructions. At the clock cycle when add is in the Execute (EX)
stage, look at the forwarding MUXes. Verity that their selected input is the
forwarded value from the EX/MEM and MEM/WB registers, not the (stale) register
file output.
Confirm that no stalls are inserted for this dependency and that forwarding resolves it with zero penalty.
Part 2 — Hazards and Forwarding
In this part you will verify that the Hazard/Forwardign units handle RAW (read
after write) dependencies. All programs below compute the same final value (t3 = 9); they differ only in how many instructions separate a write from its
dependent read.
2.1 The Programs
Enter and run each of the four programs below. For each one, record the total cycle count from the Execution info panel.
Program A — no dependency (3 instructions between write and read)
.text
main:
addi t0, x0, 4 # writes t0
addi t1, x0, 1 # independent
addi t2, x0, 2 # independent
addi t3, x0, 3 # independent
add t3, t0, t3 # reads t0 (3 instructions later)
li a7, 10
ecall
Program B — one instruction between write and read
.text
main:
addi t0, x0, 4 # writes t0
addi t1, x0, 1 # independent
add t3, t0, t1 # reads t0 (1 instruction later)
li a7, 10
ecall
Program C — immediately adjacent (write then read)
.text
main:
addi t0, x0, 4 # writes t0
add t3, t0, t0 # reads t0 (0 instructions later)
li a7, 10
ecall
Program D — back-to-back chain (each instruction depends on the previous)
.text
main:
addi t0, x0, 1
addi t0, t0, 1 # depends on line above
addi t0, t0, 1 # depends on line above
addi t0, t0, 1 # depends on line above
addi t0, t0, 1 # depends on line above
li a7, 10
ecall
2.2 Forwarding Table
Record the numbers of bubbles and cycles for each program. The forwarding rule is: a RAW hazard is resolved by forwarding with zero stalls as long as the producing instruction has reached or passed the Execute stage by the time the consuming instruction reaches Decode. A load/use hazard (covered in Part 3) is the one exception.
Fill in this table in measurements.txt:
-------------------------
Program | Measured cycles | Measured bubbles
--------|-----------------|------------------
A | |
B | |
C | |
D | |
Bubbles = (cycles) - (instructions retired) - (pipeline drain) - (ecall stalls)
Pipeline drain = 4 extra cycles for startup/teardown
ecall stalls = 2 cycle stall for every ecall
For this table, use: Bubbles = cycles - (instructions retired) - 6
Was the Forwarding Unit able to prevent stalls (bubbles) in all programs? Put
your answer in measurements.txt along with your table.
Part 3 — Eliminating Load/Use Stalls
A lw instruction does not produce its result until the Memory stage. If
the immediately following instruction reads the loaded register, forwarding
cannot help as the data simply does not exist yet at the point where Decode
stage needs it. The hazard unit must stall the instruction in the Decode stage,
and insert a Bubble in the Execute stage for one cycle.
3.1 The Baseline Program
Enter the following program, which has several load/use hazards deliberately baked in:
.data
vals: .word 10, 20, 30
.text
main:
la t0, vals
lw t1, 0(t0) # t1 = 10
add t4, t1, t1 # HAZARD: uses t1 immediately
lw t2, 4(t0) # t2 = 20
add t5, t2, t1 # HAZARD: uses t2 immediately
lw t3, 8(t0) # t3 = 30
add t6, t3, t2 # HAZARD: uses t3 immediately
# Print t4 + t5 + t6 (should be 20 + 30 + 50 = 100)
add a0, t4, t5
add a0, a0, t6
li a7, 1
ecall
li a7, 10
ecall
Run it and record the cycle count. Put your answer in measurements.txt.
Verify the printed result is 100.
3.2 Reorder to Remove Stalls
The three load/use hazards above can each be eliminated by moving an independent instruction into the slot between the load and its consumer. You must not change the final computed values.
Rewrite the program with the instructions reordered so that zero load/use
stall cycles are inserted. Save your reordered version as part3.s.
Rules:
- The program must still print 100 and terminate cleanly.
- You may not add, remove, or change any instruction — only reorder them.
- The
la t0, valsmust remain first (it sets up the base address).
Record the before and after cycle counts in measurements.txt.
Part 3 — Load/Use Stall Elimination
-------------------------------------
Baseline cycle count :
Reordered cycle count :
Stall cycles eliminated:
Check your work: step through your reordered program. You should see no stall (hold) signals asserted anywhere in the load-to-use sequence.
Part 4 — Branch Misprediction Penalty
Ripes implements a predict-not-taken strategy: it always fetches PC+4 after a branch. When a branch turns out to be taken, the two instructions that entered the pipeline behind it are wrong and the hazard unit converts them to bubbles. That costs 2 cycles per misprediction.
4.1 Summing Loop
.text
main:
addi t0, x0, 0 # sum = 0
addi t1, x0, 1 # i = 1
addi t2, x0, 6 # limit = 6
loop:
add t0, t0, t1 # sum += i
addi t1, t1, 1 # i++
blt t1, t2, loop # if i < 6, loop back
mv a0, t0 # print sum (should be 1+2+3+4+5 = 15)
li a7, 1
ecall
li a7, 10
ecall
4.2 Prediction Table
Work out the branch behavior before running:
- How many times is the
bltevaluated? - How many times is it taken (mispredicted under predict-not-taken)?
- How many times is it not taken (predicted correctly)?
- How many total branch penalty cycles does that produce?
Then run the program and record the measured cycle count.
Part 4 — Branch Misprediction
-------------------------------
Times branch evaluated :
Times branch taken (mispredict) :
Times branch not taken (correct):
Predicted branch penalty cycles :
Measured total cycle count :
Expected cycle count* :
Expected = (instruction count × 1) + (pipeline drain) + (branch penalty cycles) + (ecall stalls)
For this program:
instruction count = Instructions retired
pipeline drain = 4 cycles startup/teardown
ecall stalls = 2 cycle stall for each ecall
branch penalty = 2 × mispredictions
4.3 Modify the Loop Bound
Change t2 to 11 (so the loop runs 10 iterations, summing 1 through 10).
Run the modified program, verify the printed result is 55, and record:
Part 4 (modified — 10 iterations)
------------------------------------
Times branch taken (mispredict) :
Total branch penalty cycles :
Measured total cycle count :
Add these results to your measurements.txt file.
Part 5 — Cycle Challenge
Your task is to modify a RISC-V program that:
- Computes the dot product of two hard-coded 8-element integer vectors and prints the result.
- Reduces the total number of cycles for program execution. How low can you get it?
The two vectors are:
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]
The dot product is A[0]×B[0] + A[1]×B[1] + … + A[7]×B[7].
For these vectors the result is 1×8 + 2×7 + … + 8×1 = 120.
An unoptimized version of the code is shown below.
# Dot product of A and B (8 elements each)
# A = [1, 2, 3, 4, 5, 6, 7, 8]
# B = [8, 7, 6, 5, 4, 3, 2, 1]
# Expected result: 120
.data
A: .word 1, 2, 3, 4, 5, 6, 7, 8
B: .word 8, 7, 6, 5, 4, 3, 2, 1
.text
main:
la t0, A # t0 = pointer into A
la t1, B # t1 = pointer into B
addi t2, x0, 0 # t2 = sum = 0
addi t3, x0, 8 # t3 = loop counter
loop:
lw t4, 0(t0) # t4 = A[i]
lw t5, 0(t1) # t5 = B[i]
mul t4, t4, t5 # t4 = A[i] * B[i]
add t2, t2, t4 # sum += t4
addi t0, t0, 4 # advance A pointer
addi t1, t1, 4 # advance B pointer
addi t3, t3, -1 # counter--
bne t3, x0, loop # if counter != 0, repeat
mv a0, t2
li a7, 1
ecall
li a7, 10
ecall
5.1 Requirements
- Your program must print 120 using
ecallwith the value 1 in thea7register and terminate withecallwith the value 10 in thea7register. - The number of cycles is reported in the Ripes Execution info panel during execution.
- You may unroll the loop, interleave instructions, or use any legal RISC-V technique. You may not use instructions outside RV32I.
- You many use any of the temporary registers (t0-t6) and any argument registers (a0-a7). You are not allowed to use any of the saved registers (s0-s11).
5.2 Loop Unrolling
Loop unrolling is a technique where you process multiple elements per loop iteration instead of one. Rather than looping 8 times and handling one element each time, you might loop 4 times and handle two elements per iteration, or eliminate the loop entirely by writing out all 8 elements sequentially.
A loop with a branch has two costs: the branch instruction itself, and a 2-cycle penalty every time the branch is taken (i.e., every iteration except the last). Unrolling removes both. It also exposes more opportunities to reorder instructions to cover load/use stalls, since you have a larger window of independent instructions to work with.
The tradeoff is code size, an unrolled loop is larger. For a short, fixed-iteration loop like the dot product here, that’s a reasonable cost. First, consider processing two elements of the vectors during each iteration of the loop. You will only loop four times instead of eight, reducing the number of mispredicted branches. Once you get that working, you can try a loop unroll of four, or even try to replace the loop entirely.
5.3 Strategy Notes
Extra cycles comes from two sources: load/use stalls and branch mispredictions. Before writing any code, think through:
- If you use a loop with
lwinstructions, how many load/use stalls will each iteration incur, and can they be avoided by reordering? - How many times will your loop branch, and how many of those will be mispredictions?
- Would fully unrolling the loop (eliminating branches entirely) give you a a lower cycle count than a tight loop with careful scheduling?
Describe your approach and document your reasoning briefly at the top of part5.s in a comment
block.
Appendix A — Ripes Quick Reference
Processor Controls
| Button | Action |
|---|---|
| Reset | Reload program; clear registers and memory |
| Clock (single step) | Advance one cycle |
| Reverse | Undo one cycle |
| Auto-clock | Step continuously at a chosen rate |
| Run | Execute to completion (fastest; no per-cycle GUI update) |
Reading Execution Statistics
After clicking Run, the “Execution info” pane has the statistics on the program. The key fields are:
| Field | Meaning |
|---|---|
| Cycles | Total clock cycles elapsed |
| Instructions retired | Dynamic instruction count (not static) |
| CPI | Cycles Per Instruction (Cycles / instructions retired) |
| IPC | Instructions Per Cycle (instructions retired / Cycles) |
ecall Quick Reference
| a7 | Effect |
|---|---|
| 1 | Print integer in a0 |
| 4 | Print null-terminated string whose address is in a0 |
| 10 | Exit (halt simulator) |
| 11 | Print character in a0 |
Appendix B — Hazard Reference Card
| Hazard | Cause | Penalty | Resolution |
|---|---|---|---|
| RAW (non-load) | Instruction reads register written 1–2 steps earlier | 0 cycles | Forwarded from EX or MEM stage |
| Load/Use | lw followed immediately by instruction reading loaded register |
1 cycle | Stall + forward from MEM |
| Branch misprediction | Taken branch under predict-not-taken policy | 2 cycles | Bubbles injected into ID and EX |