Y86 Program Optimization
Introduction
In this assignment, you will learn about the design and implementation of a pipelined Y86-64 processor buy optimizing a benchmark program to maximize performance. You are allowed to make any semantics-preserving transformations to the benchmark program. When you have completed the lab, you will have a keen appreciation for the interactions between code and hardware that affect the performance of your programs.
Logistics
You will work on this lab alone. Any clarifications and revisions to the assignment will be posted to the class website.
You will use the same starter code that you did for part A of this lab. You will
be working in directory archlab/pipe for this part of the lab.
Optimizing ncopy
In this assignment, you will optimize the ncopy function in the file
ncopy.ys to run as efficiently as possible our pipeline processor
implementation.
The ncopy function found in archlab/pipe/ncopy.c and shown below, copies a
len sized integer array src to a non-overlapping array dst, returning a
count of the number of positive integers contained in src.
/*
* ncopy - copy src to dst, returning number of positive numbers
* contained in src array.
*/
long ncopy(long *src, long *dst, long len) {
long count = 0;
long val;
while (len > 0) {
val = *src++;
*dst++ = val;
if (val > 0)
count++;
len--;
}
return count;
}
The baseline Y86-64 version of ncopy is shown below.
##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:
##################################################################
# You can modify this portion
# Loop header
xorq %rax,%rax # count = 0;
andq %rdx,%rdx # len <= 0?
jle Done # if so, goto Done:
Loop: mrmovq (%rdi), %r10 # read val from src...
rmmovq %r10, (%rsi) # ...and store it to dst
andq %r10, %r10 # val <= 0?
jle Npos # if so, goto Npos:
irmovq $1, %r10
addq %r10, %rax # count++
Npos: irmovq $1, %r10
subq %r10, %rdx # len--
irmovq $8, %r10
addq %r10, %rdi # src++
addq %r10, %rsi # dst++
andq %rdx,%rdx # len > 0?
jg Loop # if so, goto Loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
ret
##################################################################
# Keep the following label at the end of your function
Your job is to modify ncopy.ys to run as efficiently as possible on our
pipelined processors for a provided set of benchmarks.
Your ncopy.ys should begin with a header comment with the following information:
- Your name.
- A high-level description of your code describing how and why you modified your code.
Coding Rules
You are free to make any modifications you wish, with the following constraints:
-
If you use any callee saved registers (
%r12,%r13,%r14,%rbx,%rsp, and%rbp) you must save and restore the value of these registers in your function. You may find it faster to avoid using these registers altogether rather than occurring the overhead of saving and restoring them. -
Your
ncopy.ysfunction must work for arbitrary array sizes. You might be tempted to hardwire your solution for 64-element arrays by simply coding 64 copy instructions, but this would be a bad idea because we will be grading your solution based on its performance on arbitrary arrays. -
Your
ncopy.ysfunction must run correctly withyis. By correctly, we mean that it must correctly copy thesrcblock and return (in%rax) the correct number of positive integers. -
The assembled version of your
ncopyfile must not be more than 1000 bytes long. You can check the length of any program with thencopyfunction embedded using the provided scriptcheck-len.pl:# in the archlab/pipe directory ../misc/yas ncopy.ys ./check-len.pl < ncopy.yo
You may make any semantics preserving transformations to the ncopy.ys function, such as reordering instructions, replacing groups of instructions with single instructions, deleting some instructions, and adding other instructions. You may find it useful to read about loop unrolling in Section 5.8 of CS:APP3e.
Building and Testing Your Solution
In order to test your solution, you will need to build a driver program that calls your ncopy function. We have provided you with the gen-driver.pl program that generates a driver program for arbitrary sized input arrays.
The command make drivers will construct the following two useful driver programs:
sdriver.yo: A small driver program that tests yourncopyfunction on small arrays with 4 elements. If your solution is correct, then this program will halt with a value of 2 in register%raxafter copying thesrcarray.ldriver.yo: A large driver program that tests yourncopyfunction on larger arrays with 63 elements. If your solution is correct, then this program will halt with a value of 31 (0x1f) in register%raxafter copying thesrcarray.
Each time you modify your ncopy.ys program, you can type
make drivers
to rebuild the driver programs.
To test your solution in GUI mode on a small 4-element array, type the following.
./psim -g sdriver.yo
The code below will your solution on a larger 63-element array.
./psim -g ldriver.yo
Once your simulator correctly runs your version of ncopy.ys on these two block lengths, you will want to perform the following additional tests:
-
Testing your driver files on the ISA simulator. Make sure that your
ncopy.ysfunction works properly withyis:make drivers ../misc/yis sdriver.yo -
Testing your code on a range of block lengths with the ISA simulator. The Perl script
correctness.plgenerates driver files with block lengths from 0 up to some limit (default 64), plus some larger sizes. It simulates them (by default withyis), and checks the results. It generates a report showing the status for each block length:./correctness.plThis script generates test programs where the result count varies randomly from one run to another, and so it provides a more stringent test than the standard drivers. If you get incorrect results for some length K, you can generate a driver file for that length that includes checking code, and where the result varies randomly:
./gen-driver.pl -f ncopy.ys -n K -rc > driver.ys make driver.yo ../misc/yis driver.yoThe program will end with register %rax having the following value:
0xaaaa : All tests pass. 0xbbbb : Incorrect count 0xcccc : Function ncopy is more than 1000 bytes long. 0xdddd : Some of the source data was not copied to its destination. 0xeeee : Some word just before or just after the destination region was corrupted. -
Testing your code on a range of block lengths with the pipeline simulator. Finally, you can run the same code tests on the pipeline simulator that you did earlier with the ISA simulator:
./correctness.pl -p
Evaluation
This lab is worth 75 points, You will not receive credit if either your code for ncopy.ys or your modified simulator fails any of the tests described earlier.
-
15 points for your description in the header of ncopy.ys and overall coding style of your implementations (i.e., having well commented code).
-
60 points for performance. To receive credit here, your solution must be correct, as defined earlier. That is,
ncopyruns correctly withyis, andpipe-full.hclpasses all tests iny86-codeandptest.We will express the performance of your function in units of cycles per element (CPE). That is, if the simulated code requires C cycles to copy a block of N elements, then the CPE is C/N. The PIPE simulator displays the total number of cycles required to complete the program. The baseline version of the
ncopyfunction running on the standard PIPE simulator with a large 63-element array requires 897 cycles to copy 63 elements, for a CPE of 897/63 = 14.24.Since some cycles are used to set up the call to
ncopyand to set up the loop withinncopy, you will find that you will get different values of the CPE for different block lengths (generally the CPE will drop asNincreases). We will therefore evaluate the performance of your function by computing the average of the CPEs for blocks ranging from 1 to 64 elements. You can use the Perl scriptbenchmark.plin the pipe directory to run simulations of yourncopy.yscode over a range of block lengths and compute the average CPE. To see your performance, run the command:./benchmark.plFor example, the baseline version of the ncopy function has CPE values ranging between 29.00 and 14.27, with an average of 15.18. Note that this Perl script does not check for the correctness of the answer. Use the script
correctness.plfor this and run the correctness tests before you run any benchmarking tests. If your program has errors, it may give an unrealistically low benchmark score.You should be able to achieve an average CPE of less than 9.00. If your average CPE is c, then your score for this portion of the lab will be:
- 0 for c > 11.0
- 20 * (11.0 - c) for 8.5 <= c <= 11.0
- 60 for c < 8.5
By default,
benchmark.plandcorrectness.plcompile and testncopy.ys. Use the-fargument to specify a different file name. The-hflag gives a complete list of the command line arguments.
Tips
- Before you begin loop unrolling, first optimize the unrolled case. For example, I was able to achieve a CPE of 10.7 before doing any loop unrolling.
- Before you try a new optimization, save your previous version in case you need to go back to it.
- Remember to run
./correctness.plbefore you run./benchmark.pl. If you code doesn’t work correctly, your CPU scores may not be accurate.
Submitting your work
Upload your ncopy.ys to Gradescope.