Y86 Program Optimization

Due: Thursday April 28th, 11:59PM

Introduction

In this lab, you will learn about the design and implementation of a pipelined Y86-64 processor by optimizing a benchmark program to maximize its performance. You are allowed to make any semantics-preserving transformation to the benchmark program. When you have completed the lab, you will have a keen appreciation for the interactions between code and hardware that affect the performance of your programs.

Logistics

You will work on this lab alone. Any clarifications and revisions to the assignment will be posted to the class website.

You will use the same starter code that you used for Part A of this lab. For this part, you will work in the directory archlab/pipe.

Optimizing ncopy

For this assignment, you will optimize the ncopy function in the file ncopy.ys to run as efficiently as possible on our pipelined processor implementation.

The ncopy function, found in archlab/pipe/ncopy.c and shown below, copies an integer array src of len elements to a non-overlapping array dst and returns the number of positive integers contained in src.

/*
 * ncopy - copy src to dst, returning number of positive numbers
 * contained in src array.
 */
long ncopy(long *src, long *dst, long len) {
    long count = 0;
    long val;
    while (len > 0) {
        val = *src++;
        *dst++ = val;
        if (val > 0)
            count++;
        len--;
    }
    return count;
}

The baseline Y86-64 version of ncopy is shown below.

##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:

##################################################################
# You can modify this portion
        # Loop header
        xorq %rax,%rax          # count = 0;
        andq %rdx,%rdx          # len <= 0?
        jle Done                # if so, goto Done:

Loop:   mrmovq (%rdi), %r10     # read val from src...
        rmmovq %r10, (%rsi)     # ...and store it to dst
        andq %r10, %r10         # val <= 0?
        jle Npos                # if so, goto Npos:
        irmovq $1, %r10
        addq %r10, %rax         # count++
Npos:   irmovq $1, %r10
        subq %r10, %rdx         # len--
        irmovq $8, %r10
        addq %r10, %rdi         # src++
        addq %r10, %rsi         # dst++
        andq %rdx,%rdx          # len > 0?
        jg Loop                 # if so, goto Loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
        ret
##################################################################
# Keep the following label at the end of your function

Your job is to modify ncopy.ys to run as efficiently as possible on our pipelined processors for a provided set of benchmarks.
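
Before changing anything, it can help to see how the baseline assembly maps onto the C code. The following is a rough sketch, not part of the starter code: it rewrites ncopy in goto form so that each C statement lines up with the baseline Y86-64 instructions named in the comments. The names ncopy_goto and check_goto_form are made up for this illustration.

```c
#include <assert.h>

/* Goto-form C that mirrors the control flow of the baseline ncopy.ys.
 * Hypothetical helper for illustration only; not part of the lab files. */
long ncopy_goto(long *src, long *dst, long len) {
    long count = 0;             /* xorq %rax,%rax                     */
    long val;
    if (len <= 0) goto Done;    /* andq %rdx,%rdx ; jle Done          */
Loop:
    val = *src;                 /* mrmovq (%rdi), %r10                */
    *dst = val;                 /* rmmovq %r10, (%rsi)                */
    if (val <= 0) goto Npos;    /* andq %r10,%r10 ; jle Npos          */
    count++;                    /* irmovq $1,%r10 ; addq %r10,%rax    */
Npos:
    len--;                      /* irmovq $1,%r10 ; subq %r10,%rdx    */
    src++;                      /* irmovq $8,%r10 ; addq %r10,%rdi    */
    dst++;                      /* addq %r10,%rsi                     */
    if (len > 0) goto Loop;     /* andq %rdx,%rdx ; jg Loop           */
Done:
    return count;               /* ret                                */
}

/* Small self-check on a 4-element array with mixed signs. */
void check_goto_form(void) {
    long src[4] = {1, -1, 0, 9};
    long dst[4] = {0, 0, 0, 0};
    assert(ncopy_goto(src, dst, 4) == 2);   /* 1 and 9 are positive */
    for (int i = 0; i < 4; i++)
        assert(dst[i] == src[i]);
    assert(ncopy_goto(src, dst, 0) == 0);   /* empty copy counts nothing */
}
```

Reading the comments top to bottom shows that the baseline reloads a constant into %r10 three times per iteration and tests len twice (once by subq, once by andq), which are natural places to start looking for savings.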

Your ncopy.ys should begin with a header comment that includes your name and a description of how and why you modified the baseline code.

Coding Rules

You are free to make any modifications you wish, with the following constraints:

You may make any semantics-preserving transformation to the ncopy function in ncopy.ys, such as reordering instructions, replacing groups of instructions with single instructions, deleting some instructions, and adding other instructions. You may find it useful to read about loop unrolling in Section 5.8 of CS:APP3e.
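
As a rough illustration of what a semantics-preserving loop unrolling looks like, here is a minimal C sketch of a 2-way unrolled ncopy, together with the reference version from ncopy.c for comparison. The names ncopy_unroll2 and check_unroll2 are hypothetical, and the unrolling factor of 2 is chosen only to keep the example short; it is not a recommendation for your Y86-64 solution.

```c
#include <assert.h>

/* Reference version, same logic as ncopy.c in this handout. */
long ncopy(long *src, long *dst, long len) {
    long count = 0;
    long val;
    while (len > 0) {
        val = *src++;
        *dst++ = val;
        if (val > 0)
            count++;
        len--;
    }
    return count;
}

/* Hypothetical 2-way unrolled variant (illustration only). */
long ncopy_unroll2(long *src, long *dst, long len) {
    long count = 0;
    /* Main loop: handle two elements per iteration while two remain. */
    while (len > 1) {
        long v0 = src[0];
        long v1 = src[1];
        dst[0] = v0;
        dst[1] = v1;
        if (v0 > 0) count++;
        if (v1 > 0) count++;
        src += 2;
        dst += 2;
        len -= 2;
    }
    /* Cleanup: at most one element is left when len is odd. */
    if (len > 0) {
        long v = *src;
        *dst = v;
        if (v > 0) count++;
    }
    return count;
}

/* Compare both versions for every length from 0 to 64. */
void check_unroll2(void) {
    enum { MAXLEN = 64 };
    long src[MAXLEN];
    for (long i = 0; i < MAXLEN; i++)
        src[i] = (i % 3) - 1;               /* pattern: -1, 0, 1, ... */
    for (long len = 0; len <= MAXLEN; len++) {
        long a[MAXLEN + 1], b[MAXLEN + 1];
        for (long i = 0; i <= MAXLEN; i++)
            a[i] = b[i] = -99;              /* sentinel to catch overruns */
        assert(ncopy(src, a, len) == ncopy_unroll2(src, b, len));
        for (long i = 0; i <= MAXLEN; i++)
            assert(a[i] == b[i]);
    }
}
```

The unrolled loop updates src, dst, and len once per two elements instead of once per element, which is where the savings come from. The cleanup step is what keeps the transformation correct for odd and zero lengths, so any unrolled Y86-64 version needs an equivalent.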

Building and Testing Your Solution

To test your solution, you will need to build a driver program that calls your ncopy function. We have provided the gen-driver.pl program, which generates a driver program for input arrays of arbitrary size.

The command make drivers will construct two useful driver programs: sdriver.yo, which tests ncopy on a small 4-element array, and ldriver.yo, which tests it on a larger 63-element array.

Each time you modify your ncopy.ys program, you can type

 make drivers

to rebuild the driver programs.

To test your solution in GUI mode on a small 4-element array, type the following.

./psim -g sdriver.yo

The command below will test your solution on a larger 63-element array.

./psim -g ldriver.yo

Once your simulator correctly runs your version of ncopy.ys on these two block lengths, you will want to perform the following additional tests:

Evaluation

This lab is worth 75 points. You will not receive any credit if either your code for ncopy.ys or your modified simulator fails any of the tests described earlier.

Submitting your work

Upload your ncopy.ys to Gradescope.