Quiz 3 Study Guide
The focus of the quiz will be on the material since the second quiz, but you should be familiar with the material in the quiz1 study guide and the quiz2 study guide.
Logic Design
Digital Signals & Abstraction
- Concept: Computers use voltage thresholds to represent discrete binary values (0 and 1) from continuous analog signals.
- Why Digital?:
- Noise Immunity: Uses “high” and “low” ranges with a guard range in between. Small fluctuations (noise) do not change the bit value.
- Simplicity: Allows for simple, small, and fast circuits.
- Representation:
- High Voltage: Represents logical 1 (True).
- Low Voltage: Represents logical 0 (False).
Logic Gates and Combinational Circuits
Logic gates are the fundamental building blocks that perform Boolean functions. They respond to input changes continuously (with a small propagation delay).
Combinational Circuits are an acyclic (no loops) network of logic gates.
- Properties:
- Memoryless: The output depends solely on the current input values.
- Continuous: Outputs update continuously based on inputs (after a delay).
- Truth Tables: A table listing all possible input combinations ($2^n$ rows for $n$ inputs) and their corresponding outputs.
- Boolean Equations:
- Used to express logic functions as equations.
- Sum of Products (SOP): A standard way to represent any logic truth table.
- Look at an output variable column of the truth table. Ignore the rows where the output is 0.
- Where the output is 1, create Minterms (the “Products”): write a product term (AND operation) combining all input variables. If an input is 1, write the variable itself (e.g., $A$); if the input is 0, write its complement (e.g., $A'$).
- Combine Them (The “Sum”): Add all the minterms together using OR operators to get the final expression.
- Note: SOP doesn’t always yield the simplest circuit.
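A quick worked example (XOR of two inputs, not from the guide itself): the output is 1 only for the rows $(A{=}0, B{=}1)$ and $(A{=}1, B{=}0)$. Those rows give the minterms $A'B$ and $AB'$, so the SOP expression is $A'B + AB'$.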
Standard Digital Components
These are common circuits built from logic gates.
- Equality Comparator:
- Bit-level: Checks if two bits are identical.
- Word-level: To check if two words are equal, logically you compare each bit pair and AND all the results together.
- Multiplexer (MUX):
- Function: A “selector” that chooses one of multiple data inputs based on a control signal.
- Usage: Critical for building ALUs (e.g., choosing between the result of an AND operation and an ADD operation).
- Adder:
- We built a 32-bit adder by chaining 32 1-bit full adders together. This is known as a ripple-carry adder. It is simple to build but gives poor performance, because the carry-out bits must propagate from the LSB to the MSB.
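The ripple-carry structure can be sketched in software. This is a minimal C model (names like `full_adder` and `ripple_add` are my own, not from the course): each iteration is one full adder, and the carry variable is the wire that ripples from the LSB to the MSB.

```c
#include <stdint.h>

/* One-bit full adder: sum = a ^ b ^ cin, carry-out = majority(a, b, cin). */
static void full_adder(int a, int b, int cin, int *sum, int *cout) {
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (a & cin) | (b & cin);
}

/* 32-bit ripple-carry adder: chain 32 full adders; the carry ripples
 * from bit 0 (LSB) up to bit 31 (MSB). */
static uint32_t ripple_add(uint32_t x, uint32_t y) {
    uint32_t result = 0;
    int carry = 0;
    for (int i = 0; i < 32; i++) {
        int s;
        full_adder((x >> i) & 1, (y >> i) & 1, carry, &s, &carry);
        result |= (uint32_t)s << i;
    }
    return result;
}
```

The loop's sequential dependence on `carry` mirrors exactly why the hardware version is slow: bit 31 cannot be computed until the carry has passed through all 31 lower stages.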
Sequential Logic
1. Combinational vs. Sequential Logic
- Combinational Logic (The “Now”):
- Acyclic: No loops or feedback.
- Deterministic: Output depends only on current inputs (e.g., ALU, Adder).
- Timeless: Outputs settle a propagation delay after inputs change; there is no clock and no history.
- Sequential Logic (The “History”):
- Cyclic: Contains feedback loops where output feeds back into input.
- Memory: Output depends on current inputs AND the sequence of past inputs.
- Analogy: A “Black Box” containing combinational logic + a feedback cycle to retain state.
2. The Hierarchy of Storage Elements
The Bistable Element (The Physics)
- Construction: Two inverters (NOT gates) connected in a loop.
- Signal Conditioning: Inverters “strengthen” weak signals, pushing them toward valid 0 or 1 voltages.
- Metastability (“Here be Dragons”):
- The Point: If the voltage sits exactly halfway between the valid 0 and 1 ranges (e.g., 0.5V on a 1V supply), the circuit doesn’t know which way to settle.
- The Analogy: An Inverted Pendulum. It can balance at the top momentarily, but any noise (thermal, vibration) causes it to fall randomly to Stable 0 or Stable 1.
SR Latch (Set-Reset)
- Inputs: S (Set to 1), R (Reset to 0).
- Mechanism: Uses OR gates to inject control signals into the feedback loop.
- Modes:
- $R=1, S=0$: Force 0.
- $S=1, R=0$: Force 1.
- $R=0, S=0$: Storing Mode (keeps the previous value).
- $R=1, S=1$: Disallowed; the two requests conflict, and the stored value after both are released is unpredictable.
D Latch (Level-Sensitive)
- Inputs: D (Data), L (Latch/Enable).
- Behavior:
- Transparent (L=1): “Open for business.” Output follows input immediately. If D changes, Output changes.
- Opaque (L=0): Input is ignored. Output is frozen at the last value.
D Flip-Flop (Edge-Triggered)
- The Problem with Latches: Transparency is dangerous; signals racing through the circuit can cause errors. We want to snapshot data at a precise moment.
- The Solution: The Edge-Triggered Flip-Flop.
- Behavior: Samples D only on the Rising Edge (0 $\to$ 1 transition) of the clock.
- Stability: At all other times, the output is stable, even if D changes. This allows the rest of the cycle for computation.
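The edge-triggered behavior can be modeled in a few lines of C (an illustrative sketch; `dff_t` and `dff_tick` are invented names): the flip-flop samples D only when it sees the clock go from 0 to 1, and holds its output at every other moment.

```c
/* Software model of a positive-edge-triggered D flip-flop.
 * q updates to d only when clk transitions 0 -> 1; otherwise q holds. */
typedef struct {
    int q;        /* stored output */
    int prev_clk; /* last clock level seen, for edge detection */
} dff_t;

static void dff_tick(dff_t *ff, int d, int clk) {
    if (clk && !ff->prev_clk)  /* rising edge: sample D */
        ff->q = d;
    ff->prev_clk = clk;        /* at all other times the output is frozen */
}
```

Note how changing `d` while the clock is held high has no effect, which is exactly the stability property that frees the rest of the cycle for combinational computation.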
Registers and Clocks
- The Clock: The “Conductor” of the CPU.
- Frequency: e.g., 2.4 GHz.
- Period: The time between ticks (inverse of frequency). Determines how much work (logic depth) can be done in one cycle.
- Register: A collection of D Flip-Flops (e.g., 32 for a 32-bit word) sharing one clock.
- Acts as a barrier. Holds the fort (stable output) until the next rising edge, then loads new input instantly.
The RISC-V Register File
- Structure: An array of 32 registers (x0–x31).
- Reading (Combinational): Asynchronous. You supply an address (e.g., rs1), and the data comes out after a slight delay. Can read two registers simultaneously (Dual Ported).
- Writing (Sequential): Synchronous. You supply an address (rd) and data, but the write only happens on the Rising Edge of the clock.
The Memory Subsystem
SRAM vs DRAM
| Feature | SRAM | DRAM |
|---|---|---|
| Speed | Faster | Slower |
| Cost | More expensive | Cheaper |
| Structure | Uses flip-flops (6 transistors) | Uses capacitors (1 transistor + capacitor) |
| Volatility | Volatile | Volatile |
| Refreshing | No need | Requires periodic refreshing |
| Use Case | Cache memory | Main memory |
Principle of Locality
Locality refers to the tendency of programs to access the same set of memory locations repetitively over a short period of time.
Temporal Locality
- Recently accessed data is likely to be accessed again soon.
- Example: Loop counters, frequently called functions.
Spatial Locality
- Data near recently accessed memory is likely to be accessed soon.
- Example: Array elements accessed in sequence.
Role of Caches
- Caches act as a fast buffer between CPU and main memory.
- Store frequently accessed data to speed up execution.
- Reduce average memory access time.
Types of Cache Misses
| Miss Type | Description |
|---|---|
| Cold (Compulsory) | The first time data is accessed; it’s not yet in the cache. |
| Conflict | Multiple memory blocks map to the same cache set (in direct-mapped or set-associative caches) and evict one another, even when other sets have free space. |
| Capacity | Cache cannot hold all the data being used; some data is evicted and reloaded. |
LRU Replacement Algorithm
- LRU (Least Recently Used) evicts the line that has not been accessed for the longest time.
- Implements temporal locality assumption.
- Commonly used in n-way set-associative caches.
Performing a Cache Lookup
To perform a lookup:
- Extract the set index from the address.
- Search the set for a line where the valid bit is set and the tag matches.
- If found, it’s a hit. If not, it’s a miss.
- On a miss, fetch from lower memory and update the cache using the replacement policy.
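The lookup steps above can be sketched for the simplest case, a direct-mapped cache. The geometry here (16 sets, 64-byte blocks) and the names (`lookup`, `cache_line_t`) are assumptions for illustration, not the course's specific parameters:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical direct-mapped cache: 16 sets, 64-byte blocks.
 * Address split: [ tag | 4-bit set index | 6-bit block offset ]. */
#define NUM_SETS    16
#define SET_BITS    4
#define OFFSET_BITS 6

typedef struct {
    bool     valid;
    uint32_t tag;
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit; on a miss, installs the block's line
 * (modeling the fetch from lower memory). */
static bool lookup(uint32_t addr) {
    uint32_t set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
    if (cache[set].valid && cache[set].tag == tag)
        return true;             /* valid bit set and tag matches: hit */
    cache[set].valid = true;     /* miss: fetch block, update the line */
    cache[set].tag = tag;
    return false;
}
```

Two addresses in the same 64-byte block hit after the first miss (spatial locality); two addresses with the same set index but different tags keep evicting each other (a conflict miss).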
Cache Terminology
| Term | Definition |
|---|---|
| Line | Smallest unit of data storage in a cache. |
| Set | A group of one or more cache lines. |
| Block | A chunk of data transferred between memory and cache. |
| Tag | Part of address used to identify memory blocks stored in cache lines. |
| Valid Bit | Indicates whether the data in the cache line is valid. |
Cache Mapping Techniques
| Type | Description |
|---|---|
| Direct-Mapped | Each memory block maps to exactly one cache line. Fast but prone to conflict misses. |
| Fully Associative | Any memory block can go into any line. Low conflict misses but slower and more hardware. |
| n-Way Set Associative | Cache is divided into sets, each with E lines. A compromise between the two extremes. |
Write Policies
| Policy | Description |
|---|---|
| Write-Through | Updates are written to both the cache and main memory immediately. Easy to implement, slower writes. |
| Write-Back | Updates are only written to cache initially. Written back to memory when line is evicted. Faster, but requires modified (dirty) bit tracking. |
Summary
- SRAM is fast and used in caches; DRAM is cheaper and used for main memory.
- Locality (temporal/spatial) motivates cache usage.
- Caches reduce latency and increase speed but have miss types to manage.
- Cache mapping strategies and write policies have performance tradeoffs.
Concurrency
POSIX Threads (pthreads)
Creating Threads
```c
pthread_t tid;
pthread_create(&tid, NULL, thread_func, (void *)arg);
```
Waiting for Threads to Finish
```c
void *result;
pthread_join(tid, &result);
```
Passing Data to and from Threads
- Pass a pointer to a struct (wrap all needed values)
- Ensure the data remains valid for the thread’s lifetime
```c
typedef struct {
    int id;
    char *msg;
} thread_data_t;
```
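Putting these pieces together, here is a minimal end-to-end sketch. The thread body and its doubling "work" are invented for illustration; the pattern to note is that `data` lives on the creator's stack but remains valid until after `pthread_join`, and the result travels back through the thread's return value.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int   id;
    char *msg;
} thread_data_t;

/* Thread body: reads its argument struct, returns a heap-allocated result. */
static void *thread_func(void *arg) {
    thread_data_t *data = arg;
    int *result = malloc(sizeof *result);
    *result = data->id * 2;           /* arbitrary work for the example */
    return result;
}

static int run_one_thread(void) {
    thread_data_t data = { .id = 21, .msg = "hello" };
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, &data); /* data outlives thread */
    void *ret;
    pthread_join(tid, &ret);          /* blocks until thread_func returns */
    int value = *(int *)ret;
    free(ret);                        /* caller owns the returned buffer */
    return value;
}
```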
Locks
Basic Usage
```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&lock);
// critical section
pthread_mutex_unlock(&lock);
```
Use Cases
- Protect shared variables
- Prevent race conditions
Condition Variables
Usage Pattern
```c
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// waiting thread
pthread_mutex_lock(&lock);
while (!condition_met) {
    pthread_cond_wait(&cond, &lock);  // atomically releases lock while asleep
}
// do work
pthread_mutex_unlock(&lock);

// signaling thread (after making the condition true)
pthread_cond_signal(&cond);
```
Use Case
- Thread waits until a condition becomes true (e.g. buffer not empty)
Semaphores
Initialization
```c
sem_t sem;
sem_init(&sem, 0, initial_value);  // 0 = shared among threads of this process
```
Wait and Signal
```c
sem_wait(&sem);  // decrement; blocks if the count is zero
sem_post(&sem);  // increment; may wake a waiting thread
```
Use Cases
- Resource management (e.g. N-slot buffer)
- Thread synchronization
Producer-Consumer Problem
- Shared bounded buffer
- Synchronization using:
- Mutex (for mutual exclusion)
- Condition variables or semaphores (to wait/signal)
Key Conditions:
- Producer waits if buffer full
- Consumer waits if buffer empty
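A common semaphore-based sketch of the bounded buffer (a textbook pattern, with names like `empty_slots`/`full_slots` chosen here for clarity): `empty_slots` makes the producer wait when the buffer is full, `full_slots` makes the consumer wait when it is empty, and the mutex protects the indices.

```c
#include <pthread.h>
#include <semaphore.h>

#define N 8                          /* buffer capacity */
static int buf[N];
static int in = 0, out = 0;
static sem_t empty_slots;            /* counts free slots; init to N */
static sem_t full_slots;             /* counts filled slots; init to 0 */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void put(int item) {
    sem_wait(&empty_slots);          /* producer blocks if buffer is full */
    pthread_mutex_lock(&mutex);
    buf[in] = item;
    in = (in + 1) % N;
    pthread_mutex_unlock(&mutex);
    sem_post(&full_slots);           /* one more item available */
}

static int get(void) {
    sem_wait(&full_slots);           /* consumer blocks if buffer is empty */
    pthread_mutex_lock(&mutex);
    int item = buf[out];
    out = (out + 1) % N;
    pthread_mutex_unlock(&mutex);
    sem_post(&empty_slots);          /* one more free slot */
    return item;
}
```

The same structure can be built with a mutex plus two condition variables (`not_full`, `not_empty`) instead of semaphores.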
Readers-Writers Locks
Goals
- Allow multiple concurrent readers
- Writers get exclusive access
Concurrent Data Structures
Examples:
- Approximate Counters
- Concurrent lists
- Michael and Scott Concurrent Queues
- Concurrent hash map (mutex per bucket)
- Circular buffer (for producer-consumer)
Key Concepts to Know
| Concept | Know How To… |
|---|---|
| Race condition | Explain what causes it and how to fix it |
| Deadlock | Identify and prevent |
| Critical section | Define and protect with locks |
| Bounded buffer | Describe and implement |
| Signal vs broadcast | Differentiate in condition variables |
| Mutex vs Semaphore | Compare and choose correctly |
Optimizing Program Performance
Why Optimize?
- Goal: Map programs efficiently to machine resources.
- Importance: Can lead to significant performance improvements (e.g., 10x).
- Factors: Both asymptotic complexity (Big-O) and constant factors matter. Choose the best algorithm first, then optimize implementation details.
Measuring Performance
- Cycles Per Element (CPE): A common metric to express performance for programs operating on sequences of data. Lower CPE is better.
Optimization Blockers
- Memory Aliasing: When two different pointers might refer to the same memory location. Compilers must be conservative and often cannot optimize memory accesses aggressively.
- Procedure Calls: Compilers often treat procedure calls as “black boxes” due to potential side effects, limiting optimizations across function boundaries.
- Inlining: Replacing a function call with the function’s body (e.g., GCC -O1) can help, but is often limited to calls within the same file.
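To see why aliasing blocks optimization, consider this small example (my own, for illustration). If `dst` and `src` never alias, the compiler could read `*src` once; because they might be the same pointer, it must reload `*src` after every store to `*dst`:

```c
/* With distinct pointers this adds *src to *dst twice. If dst and src
 * alias, each store to *dst changes the value later read through *src,
 * so the compiler cannot cache *src in a register across the stores. */
static void add_twice(long *dst, long *src) {
    *dst += *src;
    *dst += *src;
}
```

Using a local temporary (`long s = *src;`) or C's `restrict` qualifier tells the compiler the reload is unnecessary.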
How to maximize performance
- Write Compiler-Friendly Code: Understand limitations like aliasing and procedure calls.
- Reduce Memory Accesses: Use temporary variables within loops to accumulate results instead of reading/writing to memory repeatedly.
- Loop Optimizations:
- Code Motion: Move calculations or function calls (like `strlen` on a non-changing string) out of loops if their result doesn’t change per iteration.
- Loop Unrolling: Perform more work per iteration to reduce loop overhead and expose more instructions for parallel execution.
- Reassociation: Change the order of operations (if mathematically valid) to break sequential dependencies.
- Separate Accumulators: Use multiple temporary variables to accumulate results in parallel, breaking data dependencies.
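Unrolling and separate accumulators are usually combined. A minimal sketch (my own example, not the course's benchmark code): the loop is unrolled by 2, and each unrolled slot feeds its own accumulator so the two additions per iteration are independent and can issue in parallel on a superscalar core.

```c
/* 2x-unrolled sum with two accumulators: acc0 and acc1 form independent
 * dependency chains, combined only once after the loop. */
static long sum_unrolled(const long *a, int n) {
    long acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];        /* chain 1 */
        acc1 += a[i + 1];    /* chain 2: independent of chain 1 */
    }
    for (; i < n; i++)       /* handle a leftover element if n is odd */
        acc0 += a[i];
    return acc0 + acc1;
}
```

For integer addition this reassociation is exact; for floating point it changes rounding order, which is why compilers will not do it without flags like `-ffast-math`.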
Exploiting Modern Hardware (Microarchitecture)
- Instruction Pipelining: Processors break instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.
- Instruction-Level Parallelism (ILP): Modern processors can execute multiple instructions per clock cycle (superscalar).
- Data Dependencies: The primary limiter of ILP. Optimizations often focus on breaking these dependencies.
- SIMD (Single Instruction, Multiple Data): Vector instructions (like AVX2) perform the same operation on multiple data elements simultaneously.
Takeaways
- Optimization requires understanding both the code and the underlying hardware.
- Focus on bottlenecks.
- Compilers are powerful but have limitations (aliasing, procedure calls).
- Programmers can enable better performance by:
- Choosing efficient algorithms.
- Writing compiler-friendly code.
- Applying manual optimizations (loop unrolling, reassociation, etc.).
- Understanding and leveraging hardware features (ILP, SIMD).
- Tune code for the specific target machine for best results.