Quiz 3 Study Guide
The focus of the quiz will be on the material since the second quiz, but you should be familiar with the material in the quiz1 study guide and the quiz2 study guide.
Logic Design
Digital Signals & Abstraction
- Concept: Computers use voltage thresholds to represent discrete binary values (0 and 1) from continuous analog signals.
- Why Digital?:
- Noise Immunity: Uses “high” and “low” ranges with a guard range in between. Small fluctuations (noise) do not change the bit value.
- Simplicity: Allows for simple, small, and fast circuits.
- Representation:
- High Voltage: Represents logical 1 (True).
- Low Voltage: Represents logical 0 (False).
Logic Gates and Combinational Circuits
Logic gates are the fundamental building blocks that perform Boolean functions. They respond to input changes continuously (with a small propagation delay).
Combinational Circuits are an acyclic (no loops) network of logic gates.
- Properties:
- Memoryless: The output depends solely on the current input values.
- Continuous: Outputs update continuously based on inputs (after a delay).
- Truth Tables: A table listing all possible input combinations ($2^n$ rows for $n$ inputs) and their corresponding outputs.
- Boolean Equations:
- Used to express logic functions as equations.
- Sum of Products (SOP): A standard way to represent any logic truth table.
- Look at an output variable column of the truth table. Ignore the rows where the output is 0.
- Where the output is 1, create Minterms (the “Products”): write a product term (AND operation) combining all input variables. If an input is 1, write the variable itself (e.g., $A$); if the input is 0, write its complement (e.g., $A'$).
- Combine Them (The “Sum”): Add all the minterms together using OR operators to get the final expression.
- Note: SOP doesn’t always yield the simplest circuit.
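A quick worked example (XOR of two inputs, not from the guide itself): the output is 1 only for the rows $(A{=}0, B{=}1)$ and $(A{=}1, B{=}0)$. Those rows give the minterms $A'B$ and $AB'$, so the SOP expression is $A'B + AB'$.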
Standard Digital Components
These are common circuits built from logic gates.
- Equality Comparator:
- Bit-level: Checks if two bits are identical.
- Word-level: To check if two words are equal, logically you compare each bit pair and AND all the results together.
- Multiplexer (MUX):
- Function: A “selector” that chooses one of multiple data inputs based on a control signal.
- Usage: Critical for building ALUs (e.g., choosing between the result of an AND operation and an ADD operation).
- Adder:
- We built a 32-bit adder by chaining 32 1-bit full adders together. This is known as a ripple-carry adder. It is simple to build but gives poor performance, because the carry-out bits must propagate from the LSB to the MSB.
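The ripple-carry structure can be sketched in software. This is a minimal C model (names like `full_adder` and `ripple_add` are my own, not from the course): each iteration is one full adder, and the carry variable is the wire that ripples from the LSB to the MSB.

```c
#include <stdint.h>

/* One-bit full adder: sum = a ^ b ^ cin, carry-out = majority(a, b, cin). */
static void full_adder(int a, int b, int cin, int *sum, int *cout) {
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (a & cin) | (b & cin);
}

/* 32-bit ripple-carry adder: chain 32 full adders; the carry ripples
 * from bit 0 (LSB) up to bit 31 (MSB). */
static uint32_t ripple_add(uint32_t x, uint32_t y) {
    uint32_t result = 0;
    int carry = 0;
    for (int i = 0; i < 32; i++) {
        int s;
        full_adder((x >> i) & 1, (y >> i) & 1, carry, &s, &carry);
        result |= (uint32_t)s << i;
    }
    return result;
}
```

The loop's sequential dependence on `carry` mirrors exactly why the hardware version is slow: bit 31 cannot be computed until the carry has passed through all 31 lower stages.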
Sequential Logic
1. Combinational vs. Sequential Logic
- Combinational Logic (The “Now”):
- Acyclic: No loops or feedback.
- Deterministic: Output depends only on current inputs (e.g., ALU, Adder).
- Timeless: Outputs settle a propagation delay after inputs change; there is no clock and no history.
- Sequential Logic (The “History”):
- Cyclic: Contains feedback loops where output feeds back into input.
- Memory: Output depends on current inputs AND the sequence of past inputs.
- Analogy: A “Black Box” containing combinational logic + a feedback cycle to retain state.
2. The Hierarchy of Storage Elements
The Bistable Element (The Physics)
- Construction: Two inverters (NOT gates) connected in a loop.
- Signal Conditioning: Inverters “strengthen” weak signals, pushing them toward valid 0 or 1 voltages.
- Metastability (“Here be Dragons”):
- The Point: If the voltage sits exactly halfway between the valid 0 and 1 ranges (e.g., 0.5V on a 1V supply), the circuit doesn’t know which way to settle.
- The Analogy: An Inverted Pendulum. It can balance at the top momentarily, but any noise (thermal, vibration) causes it to fall randomly to Stable 0 or Stable 1.
SR Latch (Set-Reset)
- Inputs: S (Set to 1), R (Reset to 0).
- Mechanism: Uses OR gates to inject control signals into the feedback loop.
- Modes:
- $R=1, S=0$: Force 0.
- $S=1, R=0$: Force 1.
- $R=0, S=0$: Storing Mode (keeps the previous value).
- $R=1, S=1$: Disallowed; the two requests conflict, and the stored value after both are released is unpredictable.
D Latch (Level-Sensitive)
- Inputs: D (Data), L (Latch/Enable).
- Behavior:
- Transparent (L=1): “Open for business.” Output follows input immediately. If D changes, Output changes.
- Opaque (L=0): Input is ignored. Output is frozen at the last value.
D Flip-Flop (Edge-Triggered)
- The Problem with Latches: Transparency is dangerous; signals racing through the circuit can cause errors. We want to snapshot data at a precise moment.
- The Solution: The Edge-Triggered Flip-Flop.
- Behavior: Samples D only on the Rising Edge (0 $\to$ 1 transition) of the clock.
- Stability: At all other times, the output is stable, even if D changes. This allows the rest of the cycle for computation.
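The edge-triggered behavior can be modeled in a few lines of C (an illustrative sketch; `dff_t` and `dff_tick` are invented names): the flip-flop samples D only when it sees the clock go from 0 to 1, and holds its output at every other moment.

```c
/* Software model of a positive-edge-triggered D flip-flop.
 * q updates to d only when clk transitions 0 -> 1; otherwise q holds. */
typedef struct {
    int q;        /* stored output */
    int prev_clk; /* last clock level seen, for edge detection */
} dff_t;

static void dff_tick(dff_t *ff, int d, int clk) {
    if (clk && !ff->prev_clk)  /* rising edge: sample D */
        ff->q = d;
    ff->prev_clk = clk;        /* at all other times the output is frozen */
}
```

Note how changing `d` while the clock is held high has no effect, which is exactly the stability property that frees the rest of the cycle for combinational computation.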
Registers and Clocks
- The Clock: The “Conductor” of the CPU.
- Frequency: e.g., 2.4 GHz.
- Period: The time between ticks (inverse of frequency). Determines how much work (logic depth) can be done in one cycle.
- Register: A collection of D Flip-Flops (e.g., 32 for a 32-bit word) sharing one clock.
- Acts as a barrier. Holds the fort (stable output) until the next rising edge, then loads new input instantly.
The RISC-V Register File
- Structure: An array of 32 registers (x0–x31).
- Reading (Combinational): Asynchronous. You supply an address (e.g., rs1), and the data comes out after a slight delay. Can read two registers simultaneously (Dual Ported).
- Writing (Sequential): Synchronous. You supply an address (rd) and data, but the write only happens on the Rising Edge of the clock.
The Memory Subsystem
SRAM vs DRAM
| Feature | SRAM | DRAM |
|---|---|---|
| Speed | Faster | Slower |
| Cost | More expensive | Cheaper |
| Structure | Uses flip-flops (6 transistors) | Uses capacitors (1 transistor + capacitor) |
| Volatility | Volatile | Volatile |
| Refreshing | No need | Requires periodic refreshing |
| Use Case | Cache memory | Main memory |
Principle of Locality
Locality refers to the tendency of programs to access the same set of memory locations repetitively over a short period of time.
Temporal Locality
- Recently accessed data is likely to be accessed again soon.
- Example: Loop counters, frequently called functions.
Spatial Locality
- Data near recently accessed memory is likely to be accessed soon.
- Example: Array elements accessed in sequence.
Role of Caches
- Caches act as a fast buffer between CPU and main memory.
- Store frequently accessed data to speed up execution.
- Reduce average memory access time.
Types of Cache Misses
| Miss Type | Description |
|---|---|
| Cold (Compulsory) | The first time data is accessed; it’s not yet in the cache. |
| Conflict | Multiple memory blocks map to the same cache set (in direct-mapped or set-associative caches) and evict one another, even when other sets have free space. |
| Capacity | Cache cannot hold all the data being used; some data is evicted and reloaded. |
LRU Replacement Algorithm
- LRU (Least Recently Used) evicts the line that has not been accessed for the longest time.
- Implements temporal locality assumption.
- Commonly used in n-way set-associative caches.
Performing a Cache Lookup
To perform a lookup:
- Extract the set index from the address.
- Search the set for a line where the valid bit is set and the tag matches.
- If found, it’s a hit. If not, it’s a miss.
- On a miss, fetch from lower memory and update the cache using the replacement policy.
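The lookup steps above can be sketched for the simplest case, a direct-mapped cache. The geometry here (16 sets, 64-byte blocks) and the names (`lookup`, `cache_line_t`) are assumptions for illustration, not the course's specific parameters:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical direct-mapped cache: 16 sets, 64-byte blocks.
 * Address split: [ tag | 4-bit set index | 6-bit block offset ]. */
#define NUM_SETS    16
#define SET_BITS    4
#define OFFSET_BITS 6

typedef struct {
    bool     valid;
    uint32_t tag;
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit; on a miss, installs the block's line
 * (modeling the fetch from lower memory). */
static bool lookup(uint32_t addr) {
    uint32_t set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
    if (cache[set].valid && cache[set].tag == tag)
        return true;             /* valid bit set and tag matches: hit */
    cache[set].valid = true;     /* miss: fetch block, update the line */
    cache[set].tag = tag;
    return false;
}
```

Two addresses in the same 64-byte block hit after the first miss (spatial locality); two addresses with the same set index but different tags keep evicting each other (a conflict miss).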
Cache Terminology
| Term | Definition |
|---|---|
| Line | Smallest unit of data storage in a cache. |
| Set | A group of one or more cache lines. |
| Block | A chunk of data transferred between memory and cache. |
| Tag | Part of address used to identify memory blocks stored in cache lines. |
| Valid Bit | Indicates whether the data in the cache line is valid. |
Cache Mapping Techniques
| Type | Description |
|---|---|
| Direct-Mapped | Each memory block maps to exactly one cache line. Fast but prone to conflict misses. |
| Fully Associative | Any memory block can go into any line. Low conflict misses but slower and more hardware. |
| n-Way Set Associative | Cache is divided into sets, each with E lines. A compromise between the two extremes. |
Write Policies
| Policy | Description |
|---|---|
| Write-Through | Updates are written to both the cache and main memory immediately. Easy to implement, slower writes. |
| Write-Back | Updates are only written to cache initially. Written back to memory when line is evicted. Faster, but requires modified (dirty) bit tracking. |
Summary
- SRAM is fast and used in caches; DRAM is cheaper and used for main memory.
- Locality (temporal/spatial) motivates cache usage.
- Caches reduce latency and increase speed but have miss types to manage.
- Cache mapping strategies and write policies have performance tradeoffs.
Concurrency
POSIX Threads (pthreads)
Creating Threads
```c
pthread_t tid;
pthread_create(&tid, NULL, thread_func, (void *)arg);
```
Waiting for Threads to Finish
```c
void *result;
pthread_join(tid, &result);
```
Passing Data to and from Threads
- Pass a pointer to a struct (wrap all needed values)
- Ensure the data remains valid for the thread’s lifetime
```c
typedef struct {
    int id;
    char *msg;
} thread_data_t;
```
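Putting these pieces together, here is a minimal end-to-end sketch. The thread body and its doubling "work" are invented for illustration; the pattern to note is that `data` lives on the creator's stack but remains valid until after `pthread_join`, and the result travels back through the thread's return value.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int   id;
    char *msg;
} thread_data_t;

/* Thread body: reads its argument struct, returns a heap-allocated result. */
static void *thread_func(void *arg) {
    thread_data_t *data = arg;
    int *result = malloc(sizeof *result);
    *result = data->id * 2;           /* arbitrary work for the example */
    return result;
}

static int run_one_thread(void) {
    thread_data_t data = { .id = 21, .msg = "hello" };
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, &data); /* data outlives thread */
    void *ret;
    pthread_join(tid, &ret);          /* blocks until thread_func returns */
    int value = *(int *)ret;
    free(ret);                        /* caller owns the returned buffer */
    return value;
}
```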
Locks
Basic Usage
```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&lock);
// critical section
pthread_mutex_unlock(&lock);
```
Use Cases
- Protect shared variables
- Prevent race conditions
Condition Variables
Usage Pattern
```c
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// waiting thread
pthread_mutex_lock(&lock);
while (!condition_met) {
    pthread_cond_wait(&cond, &lock);  // atomically releases lock while asleep
}
// do work
pthread_mutex_unlock(&lock);

// signaling thread (after making the condition true)
pthread_cond_signal(&cond);
```
Use Case
- Thread waits until a condition becomes true (e.g. buffer not empty)
Semaphores
Initialization
```c
sem_t sem;
sem_init(&sem, 0, initial_value);  // 0 = shared among threads of this process
```
Wait and Signal
```c
sem_wait(&sem);  // decrement; blocks if the count is zero
sem_post(&sem);  // increment; may wake a waiting thread
```
Use Cases
- Resource management (e.g. N-slot buffer)
- Thread synchronization
Producer-Consumer Problem
- Shared bounded buffer
- Synchronization using:
- Mutex (for mutual exclusion)
- Condition variables or semaphores (to wait/signal)
Key Conditions:
- Producer waits if buffer full
- Consumer waits if buffer empty
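A common semaphore-based sketch of the bounded buffer (a textbook pattern, with names like `empty_slots`/`full_slots` chosen here for clarity): `empty_slots` makes the producer wait when the buffer is full, `full_slots` makes the consumer wait when it is empty, and the mutex protects the indices.

```c
#include <pthread.h>
#include <semaphore.h>

#define N 8                          /* buffer capacity */
static int buf[N];
static int in = 0, out = 0;
static sem_t empty_slots;            /* counts free slots; init to N */
static sem_t full_slots;             /* counts filled slots; init to 0 */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void put(int item) {
    sem_wait(&empty_slots);          /* producer blocks if buffer is full */
    pthread_mutex_lock(&mutex);
    buf[in] = item;
    in = (in + 1) % N;
    pthread_mutex_unlock(&mutex);
    sem_post(&full_slots);           /* one more item available */
}

static int get(void) {
    sem_wait(&full_slots);           /* consumer blocks if buffer is empty */
    pthread_mutex_lock(&mutex);
    int item = buf[out];
    out = (out + 1) % N;
    pthread_mutex_unlock(&mutex);
    sem_post(&empty_slots);          /* one more free slot */
    return item;
}
```

The same structure can be built with a mutex plus two condition variables (`not_full`, `not_empty`) instead of semaphores.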
Readers-Writers Locks
Goals
- Allow multiple concurrent readers
- Writers get exclusive access
Concurrent Data Structures
Examples:
- Approximate Counters
- Concurrent lists
- Michael and Scott Concurrent Queues
- Concurrent hash map (mutex per bucket)
- Circular buffer (for producer-consumer)
Key Concepts to Know
| Concept | Know How To… |
|---|---|
| Race condition | Explain what causes it and how to fix it |
| Deadlock | Identify and prevent |
| Critical section | Define and protect with locks |
| Bounded buffer | Describe and implement |
| Signal vs broadcast | Differentiate in condition variables |
| Mutex vs Semaphore | Compare and choose correctly |
Optimizing Program Performance
Why Optimize?
- Goal: Map programs efficiently to machine resources.
- Importance: Can lead to significant performance improvements (e.g., 10x).
- Factors: Both asymptotic complexity (Big-O) and constant factors matter. Choose the best algorithm first, then optimize implementation details.
Measuring Performance
- Cycles Per Element (CPE): A common metric to express performance for programs operating on sequences of data. Lower CPE is better.
Optimization Blockers
- Memory Aliasing: When two different pointers might refer to the same memory location. Compilers must be conservative and often cannot optimize memory accesses aggressively.
- Procedure Calls: Compilers often treat procedure calls as “black boxes” due to potential side effects, limiting optimizations across function boundaries.
- Inlining: Replacing a function call with the function’s body (e.g., GCC -O1) can help, but is often limited to calls within the same file.
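To see why aliasing blocks optimization, consider this small example (my own, for illustration). If `dst` and `src` never alias, the compiler could read `*src` once; because they might be the same pointer, it must reload `*src` after every store to `*dst`:

```c
/* With distinct pointers this adds *src to *dst twice. If dst and src
 * alias, each store to *dst changes the value later read through *src,
 * so the compiler cannot cache *src in a register across the stores. */
static void add_twice(long *dst, long *src) {
    *dst += *src;
    *dst += *src;
}
```

Using a local temporary (`long s = *src;`) or C's `restrict` qualifier tells the compiler the reload is unnecessary.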
How to maximize performance
- Write Compiler-Friendly Code: Understand limitations like aliasing and procedure calls.
- Reduce Memory Accesses: Use temporary variables within loops to accumulate results instead of reading/writing to memory repeatedly.
- Loop Optimizations:
- Code Motion: Move calculations or function calls (like `strlen` on a non-changing string) out of loops if their result doesn’t change per iteration.
- Loop Unrolling: Perform more work per iteration to reduce loop overhead and expose more instructions for parallel execution.
- Reassociation: Change the order of operations (if mathematically valid) to break sequential dependencies.
- Separate Accumulators: Use multiple temporary variables to accumulate results in parallel, breaking data dependencies.
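Unrolling and separate accumulators are usually combined. A minimal sketch (my own example, not the course's benchmark code): the loop is unrolled by 2, and each unrolled slot feeds its own accumulator so the two additions per iteration are independent and can issue in parallel on a superscalar core.

```c
/* 2x-unrolled sum with two accumulators: acc0 and acc1 form independent
 * dependency chains, combined only once after the loop. */
static long sum_unrolled(const long *a, int n) {
    long acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];        /* chain 1 */
        acc1 += a[i + 1];    /* chain 2: independent of chain 1 */
    }
    for (; i < n; i++)       /* handle a leftover element if n is odd */
        acc0 += a[i];
    return acc0 + acc1;
}
```

For integer addition this reassociation is exact; for floating point it changes rounding order, which is why compilers will not do it without flags like `-ffast-math`.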
Exploiting Modern Hardware (Microarchitecture)
- Instruction Pipelining: Processors break instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.
- Instruction-Level Parallelism (ILP): Modern processors can execute multiple instructions per clock cycle (superscalar).
- Data Dependencies: The primary limiter of ILP. Optimizations often focus on breaking these dependencies.
- SIMD (Single Instruction, Multiple Data): Vector instructions (like AVX2) perform the same operation on multiple data elements simultaneously.
Takeaways
- Optimization requires understanding both the code and the underlying hardware.
- Focus on bottlenecks.
- Compilers are powerful but have limitations (aliasing, procedure calls).
- Programmers can enable better performance by:
- Choosing efficient algorithms.
- Writing compiler-friendly code.
- Applying manual optimizations (loop unrolling, reassociation, etc.).
- Understanding and leveraging hardware features (ILP, SIMD).
- Tune code for the specific target machine for best results.