Final Study Guide

The final is closed book, closed note, no electronic devices allowed.

The final is comprehensive, so you should be familiar with the material in the exam1 study guide and the exam2 study guide, in addition to the material below.

You will be provided with a Reference Card with a list of RISC-V instructions and other useful information for the quiz.

On the course website there are blank copies of both exams and all the quizzes, along with their solutions. Questions on the final will be similar to those that you have seen in previous exams and quizzes. About 30% of the final is material since exam2, and about 70% is from previous material.

Concurrency

POSIX Threads (pthreads)

Creating Threads

pthread_t tid;
pthread_create(&tid, NULL, thread_func, (void*)arg);

Waiting for Threads to Finish

void* result;
pthread_join(tid, &result);

Passing Data to and from Threads

Pass a pointer to a struct (wrap all needed values)
Ensure the data remains valid for the thread’s lifetime

typedef struct {
    int id;
    char* msg;
} thread_data_t;

Locks

Basic Usage

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
// alternative init
pthread_mutex_init(&lock, NULL);

pthread_mutex_lock(&lock);
// critical section
pthread_mutex_unlock(&lock);

Use Cases

Protect shared variables
Prevent race conditions

Condition Variables

Usage Pattern

#include <pthread.h>

pthread_cond_t cond;
pthread_mutex_t mutex;

// Initialize with default attributes
pthread_cond_init(&cond, NULL);
pthread_mutex_init(&mutex, NULL);

// ... use the condition variable ...
pthread_mutex_lock(&lock);
while (!condition_met) {
    pthread_cond_wait(&cond, &lock);
}
// do work
pthread_mutex_unlock(&lock);

// signal another thread
pthread_cond_signal(&cond);

// Clean up when done
pthread_cond_destroy(&cond);
pthread_mutex_destroy(&mutex);

Use Case

Thread waits until a condition becomes true (e.g. buffer not empty)

Semaphores

Initialization

sem_t sem;
sem_init(&sem, 0, initial_value);

Wait and Signal

sem_wait(&sem);   // decrement semaphore and wait if semaphore is negative
sem_post(&sem);   // increment semaphore and wake a single thread waiting on the semaphore

Use Cases

Resource management (e.g. N-slot buffer)
Thread synchronization

Producer-Consumer Problem

Shared bounded buffer
Synchronization using:
- Mutex (for mutual exclusion)
- Condition variables or semaphores (to wait/signal)

Key Conditions:

Producer waits if buffer full
Consumer waits if buffer empty

Readers-Writers Locks

Goals

Allow multiple concurrent readers
Writers get exclusive access

Concurrent Data Structures

Examples:

Approximate Counters
Concurrent lists
Michael and Scott Concurrent Queues
Concurrent hash map (mutex per bucket)
Circular buffer (for producer-consumer)

Key Concepts to Know

Concept	Know How To…
Race condition	Explain what causes it and how to fix it
Deadlock	Identify and prevent
Critical section	Define and protect with locks
Bounded buffer	Describe and implement
Signal vs broadcast	Differentiate in condition variables
Mutex vs Semaphore	Compare and choose correctly

Optimizing Program Performance

Why Optimize?

Goal: Map programs efficiently to machine resources.
Importance: Can lead to significant performance improvements (e.g., 10x).
Factors: Both asymptotic complexity (Big-O) and constant factors matter. Choose the best algorithm first, then optimize implementation details.

Measuring Performance

Cycles Per Element (CPE): A common metric to express performance for programs operating on sequences of data. Lower CPE is better.

Optimization Blockers

Memory Aliasing: When two different pointers might refer to the same memory location. Compilers must be conservative and often cannot optimize memory accesses aggressively.
Procedure Calls: Compilers often treat procedure calls as “black boxes” due to potential side effects, limiting optimizations across function boundaries.
- Inlining: Replacing a function call with the function’s body (e.g., GCC -O1) can help, but is often limited to calls within the same file.

How to maximize performance

Write Compiler-Friendly Code: Understand limitations like aliasing and procedure calls.
Reduce Memory Accesses: Use temporary variables within loops to accumulate results instead of reading/writing to memory repeatedly.
Loop Optimizations:
- Code Motion: Move calculations or function calls (like strlen on a non-changing string) out of loops if their result doesn’t change per iteration.
- Loop Unrolling: Perform more work per iteration to reduce loop overhead and expose more instructions for parallel execution.
- Reassociation: Change the order of operations (if mathematically valid) to break sequential dependencies.
- Separate Accumulators: Use multiple temporary variables to accumulate results in parallel, breaking data dependencies.

Exploiting Modern Hardware (Microarchitecture)

Instruction Pipelining: Processors break instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.
Instruction-Level Parallelism (ILP): Modern processors can execute multiple instructions per clock cycle (superscalar).
- Data Dependencies: The primary limiter of ILP. Optimizations often focus on breaking these dependencies.
SIMD (Single Instruction, Multiple Data): Vector instructions (like AVX2) perform the same operation on multiple data elements simultaneously.

Takeaways

Optimization requires understanding both the code and the underlying hardware.
Focus on bottlenecks.
Compilers are powerful but have limitations (aliasing, procedure calls).
Programmers can enable better performance by:
- Choosing efficient algorithms.
- Writing compiler-friendly code.
- Applying manual optimizations (loop unrolling, reassociation, etc.).
- Understanding and leveraging hardware features (ILP, SIMD).
Tune code for the specific target machine for best results.