Introduction to pthreads and locks
Overview
In this lab you will write a multithreaded C program that counts how often each letter appears in a large block of text. Along the way you will build up every major skill introduced in lecture:
| Part | Concept | What you build |
|---|---|---|
| 1 | Thread lifecycle | Spawn threads, pass arguments, join |
| 2 | Divide-and-conquer | Parallel array sum |
| 3 | Race conditions & mutexes | Shared counter, broken then fixed |
| 4 | Parallel histogram | Racy → mutex-safe → local-merge |
Logistics. Submit threadlab.c and a short
answers.md (questions are flagged Q below) to Gradescope.
To get started, save this C program as threadlab.c
/*
* CMPU-224: Computer Organization
*
* Parallel Character Frequency Analysis
*
* Compile: gcc -Wall -O2 -pthread -o threadlab threadlab.c
* Run: ./threadlab
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <time.h>
/* ------------------------------------------------------------------ */
/* Constants — feel free to experiment with these */
/* ------------------------------------------------------------------ */
#define NUM_THREADS 4
#define TEXT_SIZE (1 << 23) /* 8 MB of text */
#define ARRAY_SIZE (1 << 24) /* 16 M integers */
#define NUM_BINS 26 /* a–z */
#define ITERS_PER_THREAD 1000000 /* for Part 3 counter */
/* ------------------------------------------------------------------ */
/* Shared data (allocated in main, passed to threads via structs) */
/* ------------------------------------------------------------------ */
static char *g_text = NULL;
static int *g_array = NULL;
/* ------------------------------------------------------------------ */
/* Utilities */
/* ------------------------------------------------------------------ */
/* Fill text with random lowercase letters */
static void fill_text(char *buf, size_t n) {
srand(42);
for (size_t i = 0; i < n; i++)
buf[i] = 'a' + (rand() % 26);
buf[n] = '\0';
}
/* Fill array with small positive integers */
static void fill_array(int *arr, size_t n) {
srand(42);
for (size_t i = 0; i < n; i++)
arr[i] = rand() % 100;
}
/* Print a histogram of letter counts */
static void print_histogram(const char *label, long hist[NUM_BINS]) {
printf("\n--- %s ---\n", label);
for (int i = 0; i < NUM_BINS; i++)
printf(" %c : %ld\n", 'a' + i, hist[i]);
}
/* Sequential reference histogram (known-correct) */
static void seq_histogram(const char *text, size_t n, long hist[NUM_BINS]) {
    /* clear the hist array */
memset(hist, 0, NUM_BINS * sizeof(long));
for (size_t i = 0; i < n; i++) {
char c = text[i];
if (c >= 'a' && c <= 'z')
hist[c - 'a']++;
}
}
/* Compare two histograms; return 1 if equal */
static int hist_equal(long a[NUM_BINS], long b[NUM_BINS]) {
for (int i = 0; i < NUM_BINS; i++)
if (a[i] != b[i]) return 0;
return 1;
}
/* ================================================================== */
/* PART 1 — Thread Lifecycle */
/* ================================================================== */
typedef struct {
int thread_id; /* which thread am I? (0-indexed) */
char first_char; /* first character of my text slice */
int slice_len; /* how many characters in my slice */
} p1_args_t;
/*
* TODO (Part 1A): Implement this thread function.
*
* It receives a p1_args_t* cast from `arg`.
* Print a line like:
* [Thread 2] Hello! My slice starts with 'k' and is 2097152 chars long.
* Then return NULL.
*/
void *p1_hello(void *arg) {
/* --- your code here --- */
return NULL;
}
/*
* TODO (Part 1B): Implement this function.
*
* Divide g_text evenly among NUM_THREADS threads.
* For each thread, fill in a p1_args_t and call pthread_create.
* After all threads are created, call pthread_join on each one.
*/
void part1_run(void) {
printf("\n========== Part 1: Thread Lifecycle ==========\n");
pthread_t tids[NUM_THREADS];
p1_args_t args[NUM_THREADS];
int slice_len = TEXT_SIZE / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
args[i].thread_id = i;
args[i].slice_len = slice_len;
args[i].first_char = g_text[i * slice_len];
/* TODO: pthread_create for tids[i] running p1_hello */
}
for (int i = 0; i < NUM_THREADS; i++) {
/* TODO: pthread_join tids[i] */
}
printf("All threads finished.\n");
}
/* ================================================================== */
/* PART 2 — Divide-and-Conquer Parallelism */
/* ================================================================== */
typedef struct {
int *array; /* pointer to the global array */
int start; /* inclusive start index */
int end; /* exclusive end index */
long result; /* thread writes its partial sum here */
} p2_args_t;
/* Sequential reference sum */
static long seq_sum(void) {
long s = 0;
for (int i = 0; i < ARRAY_SIZE; i++)
s += g_array[i];
return s;
}
/*
* TODO (Part 2A): Implement this thread function.
*
* Compute the sum of array[start..end) and store it in args->result.
*/
void *p2_sum_worker(void *arg) {
p2_args_t *myarg = (p2_args_t *)arg;
/* --- your code here --- */
return NULL;
}
/*
* TODO (Part 2B): Launch NUM_THREADS threads, each summing a slice
* of g_array. Join them, accumulate results, and print:
* Sequential sum : <value>
* Parallel sum : <value>
* Match : YES (or NO)
*/
void part2_run(void) {
printf("\n========== Part 2: Parallel Sum ==========\n");
pthread_t tids[NUM_THREADS];
p2_args_t args[NUM_THREADS];
int slice = ARRAY_SIZE / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
args[i].array = g_array;
args[i].start = i * slice;
args[i].end = (i + 1) * slice;
args[i].result = 0;
/* TODO: pthread_create */
}
long parallel_total = 0;
for (int i = 0; i < NUM_THREADS; i++) {
/* TODO: pthread_join, then add args[i].result to parallel_total */
}
long seq_total = seq_sum();
printf("Sequential sum : %ld\n", seq_total);
printf("Parallel sum : %ld\n", parallel_total);
printf("Match : %s\n", seq_total == parallel_total ? "YES" : "NO");
}
/* ================================================================== */
/* PART 3 — Race Conditions & Mutexes */
/* ================================================================== */
/* Shared counter — both versions use this */
static volatile long g_counter = 0;
/* Mutex for the safe version */
static pthread_mutex_t g_counter_mutex = PTHREAD_MUTEX_INITIALIZER;
/* --- 3A: Unsafe (racy) version --- */
void *p3_counter_unsafe(void *arg) {
for (int i = 0; i < ITERS_PER_THREAD; i++) {
g_counter++; /* data race: read-modify-write is not atomic */
}
return NULL;
}
/*
* TODO (Part 3B): Implement the safe version.
*
* Same loop, but protect the increment with g_counter_mutex.
* Use pthread_mutex_lock / pthread_mutex_unlock.
*/
void *p3_counter_safe(void *arg) {
for (int i = 0; i < ITERS_PER_THREAD; i++) {
/* --- your code here --- */
g_counter++;
/* --- your code here --- */
}
return NULL;
}
void part3_run(void) {
printf("\n========== Part 3: Race Conditions & Mutexes ==========\n");
pthread_t tids[NUM_THREADS];
long expected = (long)NUM_THREADS * ITERS_PER_THREAD;
/* -- Unsafe run -- */
g_counter = 0;
for (int i = 0; i < NUM_THREADS; i++)
pthread_create(&tids[i], NULL, p3_counter_unsafe, NULL);
for (int i = 0; i < NUM_THREADS; i++)
pthread_join(tids[i], NULL);
printf("Unsafe counter — expected: %ld got: %ld (%s)\n",
expected, g_counter,
g_counter == expected ? "correct (got lucky?)" : "WRONG — data race!");
/* -- Safe run -- */
g_counter = 0;
for (int i = 0; i < NUM_THREADS; i++)
pthread_create(&tids[i], NULL, p3_counter_safe, NULL);
for (int i = 0; i < NUM_THREADS; i++)
pthread_join(tids[i], NULL);
printf("Safe counter — expected: %ld got: %ld (%s)\n",
expected, g_counter,
g_counter == expected ? "correct" : "WRONG");
}
/* ================================================================== */
/* PART 4 — Parallel Histogram */
/* ================================================================== */
/* --- Shared histogram and its mutex --- */
static long g_hist[NUM_BINS];
static pthread_mutex_t g_hist_mutex = PTHREAD_MUTEX_INITIALIZER;
typedef struct {
char *text;
int start;
int end;
long local_hist[NUM_BINS]; /* used in Part 4C */
} p4_args_t;
/* ---- 4A: Unsafe (racy) ---- */
void *p4_hist_unsafe(void *arg) {
p4_args_t *a = (p4_args_t *)arg;
for (int i = a->start; i < a->end; i++) {
char c = a->text[i];
if (c >= 'a' && c <= 'z')
g_hist[c - 'a']++; /* data race */
}
return NULL;
}
/*
* TODO (Part 4B): Implement a mutex-protected histogram worker.
*
* Same logic as p4_hist_unsafe, but lock g_hist_mutex before
* updating the bin and unlock immediately after.
*/
void *p4_hist_mutex(void *arg) {
p4_args_t *a = (p4_args_t *)arg;
for (int i = a->start; i < a->end; i++) {
char c = a->text[i];
if (c >= 'a' && c <= 'z') {
/* --- your code here --- */
g_hist[c - 'a']++;
/* --- your code here --- */
}
}
return NULL;
}
/*
* TODO (Part 4C — Challenge): Implement a local-histogram worker.
*
* Instead of touching g_hist inside the main loop, each thread
* accumulates counts in a->local_hist[].
*
* After the loop, acquire g_hist_mutex ONCE, merge local_hist
* into g_hist, then release the mutex.
*
* Why might this be faster than 4B? Answer in your write-up.
*/
void *p4_hist_local(void *arg) {
p4_args_t *a = (p4_args_t *)arg;
    /* clear the hist array */
memset(a->local_hist, 0, sizeof(a->local_hist));
/* --- your code here --- */
return NULL;
}
/* Helper to run one histogram experiment */
static void run_hist_experiment(const char *label,
void *(*worker)(void *),
long ref[NUM_BINS]) {
pthread_t tids[NUM_THREADS];
p4_args_t args[NUM_THREADS];
int slice = TEXT_SIZE / NUM_THREADS;
memset(g_hist, 0, sizeof(g_hist));
for (int i = 0; i < NUM_THREADS; i++) {
args[i].text = g_text;
args[i].start = i * slice;
args[i].end = (i == NUM_THREADS - 1) ? TEXT_SIZE : (i + 1) * slice;
pthread_create(&tids[i], NULL, worker, &args[i]);
}
for (int i = 0; i < NUM_THREADS; i++)
pthread_join(tids[i], NULL);
int ok = hist_equal(g_hist, ref);
printf(" %-30s : %s\n", label, ok ? "correct" : "WRONG — data race");
}
void part4_run(void) {
printf("\n========== Part 4: Parallel Histogram ==========\n");
/* Compute reference */
long ref[NUM_BINS];
seq_histogram(g_text, TEXT_SIZE, ref);
run_hist_experiment("Unsafe (racy)", p4_hist_unsafe, ref);
run_hist_experiment("Mutex-protected", p4_hist_mutex, ref);
run_hist_experiment("Local + merge (4C)", p4_hist_local, ref);
}
/* ================================================================== */
/* main */
/* ================================================================== */
int main(void) {
printf("Allocating and filling data...\n");
g_text = malloc(TEXT_SIZE + 1);
g_array = malloc(ARRAY_SIZE * sizeof(int));
if (!g_text || !g_array) { perror("malloc"); return 1; }
fill_text(g_text, TEXT_SIZE);
fill_array(g_array, ARRAY_SIZE);
printf("Done. Running lab parts...\n");
part1_run();
part2_run();
part3_run();
part4_run();
free(g_text);
free(g_array);
return 0;
}
Compile and run:
gcc -Wall -O2 -pthread -o threadlab threadlab.c
./threadlab
You will get some warnings about unused variables and functions; these will go away as you complete the lab.
Background
Why threads?
A process has one thread of execution by default. Adding threads lets a program do real work on multiple CPU cores simultaneously. The POSIX threads library (pthreads) is the standard way to do this in C.
The four functions you need today:
// Create a thread running start_fn(arg); store its ID in *tid
int pthread_create(pthread_t *tid,
const pthread_attr_t *attr, // NULL = defaults
void *(*start_fn)(void *),
void *arg);
// Block until thread tid finishes; optionally collect its return value
int pthread_join(pthread_t tid, void **retval); // retval = NULL to ignore
// Mutual exclusion — only one thread can hold the lock at a time
int pthread_mutex_lock (pthread_mutex_t *m);
int pthread_mutex_unlock(pthread_mutex_t *m);
A mutex is initialized statically with:
pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;
Passing arguments to threads
pthread_create passes a single void * to your thread function. The standard pattern is to fill in a struct, pass its address, and cast the void * back inside the thread function:
typedef struct {
int id;
double value;
} my_args_t;
void *worker(void *arg) {
my_args_t *a = (my_args_t *)arg;
printf("Thread %d, value %.2f\n", a->id, a->value);
return NULL;
}
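The creating side mirrors this. Below is a minimal, self-contained sketch that reuses the hypothetical my_args_t and worker from the example above (the values filled in are illustrative). Note that each thread gets its own struct, and the structs must outlive the threads, which the joins guarantee here:

```c
#include <pthread.h>
#include <stdio.h>

#define N 4

typedef struct {
    int id;
    double value;
} my_args_t;

void *worker(void *arg) {
    my_args_t *a = (my_args_t *)arg;
    printf("Thread %d, value %.2f\n", a->id, a->value);
    return NULL;
}

int main(void) {
    pthread_t tids[N];
    my_args_t args[N];              /* one struct per thread */
    for (int i = 0; i < N; i++) {
        args[i].id = i;
        args[i].value = i * 0.5;
        pthread_create(&tids[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < N; i++)     /* args[] must stay alive until here */
        pthread_join(tids[i], NULL);
    return 0;
}
```

Compile it the same way as the lab: gcc -Wall -pthread.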
Part 1 — Thread Lifecycle
The starter allocates g_text: 8 MB of random lowercase letters. Your job is to divide it evenly among NUM_THREADS (4) threads and have each thread print a greeting identifying itself and its slice.
1A — Implement p1_hello
Open the starter and find p1_hello. The function receives a p1_args_t * (cast from void *). Print:
[Thread 2] Hello! My slice starts with 'k' and is 2097152 chars long.
Then return NULL.
1B — Implement part1_run
The loop skeleton is already there. Add:
- A `pthread_create` call for each thread.
- A `pthread_join` call for each thread.
Q1. What happens if you comment out all the pthread_join calls and run? Try it. Explain what you see and why.
Q2. The p1_args_t array is declared on the stack inside part1_run. Would it be safe to declare a single p1_args_t and reuse it for every pthread_create call? Why or why not?
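To make Q2 concrete, the variant being asked about would look roughly like this sketch (shown only to clarify the question, not as a recommendation):

```c
p1_args_t a;    /* ONE struct shared by every create call */
for (int i = 0; i < NUM_THREADS; i++) {
    a.thread_id  = i;                      /* overwritten each iteration */
    a.first_char = g_text[i * slice_len];
    a.slice_len  = slice_len;
    pthread_create(&tids[i], NULL, p1_hello, &a);  /* all threads get &a */
}
```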
Part 2 — Divide-and-Conquer Parallelism
The starter allocates g_array: 16 million random integers. You will sum them in parallel and compare with the sequential result.
2A — Implement p2_sum_worker
The function receives a p2_args_t *. Compute the sum of array[start..end) and store it in a->result.
2B — Implement part2_run
The struct initialization loop is already written for you. Add:
- A `pthread_create` call per thread.
- A `pthread_join` call per thread that accumulates `args[i].result` into `parallel_total`.
Expected output:
Sequential sum : 824503296
Parallel sum : 824503296
Match : YES
(Exact number depends on your platform’s rand.)
Q3. Where does each thread store its partial sum? Why is it safe for threads to write to args[i].result concurrently, even though they are all writing to the same array?
Q4 (Think ahead). Suppose you instead had all threads increment a single long long total = 0 directly, without the per-thread result fields. Would that work correctly? Save your answer — Part 3 will let you test your intuition.
Try setting NUM_THREADS to 3. Compile and run your code again. Does it still work? What is going on here? Fix your implementation so it works correctly when the size of the array is not evenly divisible by the number of threads.
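One standard fix, and the same idiom the Part 4 harness (run_hist_experiment) already uses, is to let the last thread absorb the leftover elements. A sketch against the Part 2 setup loop:

```c
int slice = ARRAY_SIZE / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
    args[i].start = i * slice;
    /* the last thread ends at ARRAY_SIZE, covering any remainder */
    args[i].end = (i == NUM_THREADS - 1) ? ARRAY_SIZE : (i + 1) * slice;
    /* ... fill in the rest and pthread_create as before ... */
}
```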
Part 3 — Race Conditions & Mutexes
A race condition occurs when two threads access the same memory location and at least one of them is writing, without any synchronization. The result is undefined.
`g_counter++` looks like a single operation, but it compiles to three machine-level steps:
1. Load `g_counter` from memory into a register.
2. Add 1 to the register.
3. Store the result back to memory.
If two threads interleave these steps, increments are silently lost.
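In C terms, the increment behaves as if it were expanded like this (the helper name increment_once is illustrative); the comments trace one two-thread interleaving that loses an update:

```c
/* g_counter++ expanded into its three underlying steps */
void increment_once(void) {
    long tmp = g_counter;   /* 1. load:  thread A reads, say, 41      */
                            /*    (thread B may also read 41 here)    */
    tmp = tmp + 1;          /* 2. add:   both threads now hold 42     */
    g_counter = tmp;        /* 3. store: A writes 42, B writes 42,    */
                            /*    so two increments bump it by one    */
}
```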
3A — Observe the race
part3_run already runs the unsafe version. Build and run the program. You
should see the counter come out wrong. Run it a few times.
Q5. Record the expected value and the value you actually observed. How many increments were lost? Does the number of lost increments depend on NUM_THREADS? On ITERS_PER_THREAD?
3B — Fix it with a mutex
Find p3_counter_safe. Wrap the g_counter++ increment with pthread_mutex_lock and pthread_mutex_unlock on g_counter_mutex.
After your fix the safe run should always print correct.
Q6. The mutex version is correct but much slower than the racy version. Why? (Think about what happens at the hardware level when many threads are all fighting for the same lock.)
Part 4 — Parallel Histogram
Now let's combine everything you have learned so far. You want to count how many times each letter a–z appears in g_text. The sequential reference is already implemented (seq_histogram).
4A — Observe the race (provided)
p4_hist_unsafe is already written. Notice that it increments g_hist[c - 'a'] without any synchronization. The harness (run_hist_experiment) runs it and checks correctness for you.
Q7. The race in Part 3 lost increments on a single counter. Now there are 26 shared counters. Does having more counters make the race condition better or worse? What does run_hist_experiment report?
4B — Mutex-protected histogram
Find p4_hist_mutex. Protect the update with g_hist_mutex:
pthread_mutex_lock(&g_hist_mutex);
g_hist[c - 'a']++;
pthread_mutex_unlock(&g_hist_mutex);
This should pass the correctness check.
Q8. You are locking and unlocking the mutex once per character in your slice. That is about 2 million lock/unlock pairs per thread (roughly 8 million in total). Think about the ratio of time spent doing useful work (the array index + increment) versus time spent in the mutex. Is this a good design?
4C — Local histograms + single merge (challenge)
Find p4_hist_local. The key insight: threads don’t need to share a histogram during counting. Each thread keeps its own private local_hist[26] array. Only at the very end does it need to merge into g_hist — and that merge only requires the mutex once per thread, not once per character.
Thread 0: count locally → lock → merge → unlock
Thread 1: count locally → lock → merge → unlock
...
Implement this and verify it passes correctness.
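If the merge step is the sticking point, here is a sketch of just that part (the counting loop, which mirrors p4_hist_unsafe except that it targets a->local_hist, is left to you):

```c
/* after the counting loop: one lock per thread, not one per character */
pthread_mutex_lock(&g_hist_mutex);
for (int b = 0; b < NUM_BINS; b++)
    g_hist[b] += a->local_hist[b];
pthread_mutex_unlock(&g_hist_mutex);
```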
Q9. How many times does each thread acquire the mutex in 4B vs. 4C? Write the exact numbers (not “fewer”).
Q10. Sketch in words why 4C is faster than 4B. Your answer should mention the words contention and critical section.
Submission Checklist
- `threadlab.c` — all four parts implemented and compiling cleanly with `-Wall`
- `answers.md` — answers to Q1–Q10
- Both files submitted to Gradescope
pthreads Quick Reference
/* Thread management */
pthread_t tid;
pthread_create(&tid, NULL, my_function, my_arg);
pthread_join(tid, NULL);
/* Mutex */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&m);
/* ... critical section ... */
pthread_mutex_unlock(&m);