A CPU does not see:
- source files
- variables
- functions
- threads
Instead, it sees binary instructions flowing through hardware stages, driven relentlessly by a clock.
The CPU Is the Smallest Executing Unit
The CPU is the smallest hardware entity capable of executing instructions independently.
Everything else in a system exists only to feed instructions to the CPU or store results from it.
The CPU does not understand intent:
- not _why_ a function exists
- not _what_ a task represents
- not _who_ owns memory
It only reacts to:
- clock edges
- control signals
- binary instruction encodings
This is why CPU bugs are so unforgiving:
the processor will do _exactly_ what you told it to do, not what you meant.
┌────────────────────┐
│ CPU │
├────────────────────┤
│ Registers │
│ ALU │
│ Control Unit │
└────────────────────┘
From the CPU’s perspective, time does not exist as milliseconds or deadlines.
Time exists only as clock cycles.
A missed real-time deadline is simply:
“Too many cycles were consumed before a critical instruction executed.”
A CPU consists of:
- ALU
- Control unit
- Registers
- Interconnections
- Data bus: wires that carry data
- Address bus: wires that carry addresses
- Control bus: wires that carry control signals (e.g. read/write)
ALU
The Arithmetic Logic Unit (ALU) is a digital circuit within a CPU or GPU (Graphics Processing Unit) that performs arithmetic and logic operations.
- Arithmetic:
- Addition
- Subtraction
- Shifting
- Logic:
- NOT
- AND
- OR
- XOR
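These operations can be sketched in C on raw 8-bit patterns (the `alu_*` names are illustrative, not from any library); the ALU sees only bits, never types or signedness:

```c
#include <stdint.h>

/* Sketch: the kinds of transformations an ALU performs, expressed on
   raw 8-bit patterns. There is no notion of "int" or "signed" at this
   level - only wires and bit patterns. */
static inline uint8_t alu_add(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }
static inline uint8_t alu_sub(uint8_t a, uint8_t b) { return (uint8_t)(a - b); }
static inline uint8_t alu_shl(uint8_t a)            { return (uint8_t)(a << 1); }
static inline uint8_t alu_and(uint8_t a, uint8_t b) { return a & b; }
static inline uint8_t alu_or (uint8_t a, uint8_t b) { return a | b; }
static inline uint8_t alu_xor(uint8_t a, uint8_t b) { return a ^ b; }
static inline uint8_t alu_not(uint8_t a)            { return (uint8_t)~a; }
```

Note how addition silently wraps at the register width: `alu_add(250, 10)` yields 4, because the ALU has no idea the programmer meant 260.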
Registers
Registers are the fastest type of memory; they exist to facilitate CPU operations.
There are two types:
- General Purpose Registers (GPR) (R0, R1, ..., R31)
- Special purpose registers:
- Program counter (PC)
- Instruction Register (IR)
- Instruction decoder (ID)
- Status Register (SREG)
- Accumulator Register (ACC)
- Stack Pointer (SP)
- Index Register (X, Y)
Control Unit
The control unit is the circuitry that controls the flow of data through the processor, and coordinates the activities of the other units within it.
In other words, it is in charge of the entire instruction (machine) cycle.
ALU: The Only Place Math Happens
The ALU performs all arithmetic and logic.
It does not:
- allocate memory
- manage stacks
- understand data types
It sees only bit patterns.
Arithmetic operations
- ADD, SUB
- SHIFT left/right
- INC, DEC
Logic operations
- AND, OR, XOR
- NOT
- comparisons (via subtraction)
Example insight:
A comparison like:
if (a == b)
Is actually:
SUB Rtmp, Ra, Rb
CHECK Zero Flag
There is no “compare instruction” conceptually — just math plus flags.
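A minimal C sketch of this idea, with an invented helper name: the "comparison" is a subtraction, and the branch decision is just a test of the resulting zero flag.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: how `if (a == b)` maps to hardware. The subtraction is the
   comparison; the branch only inspects the resulting zero flag. */
static bool zero_flag_after_sub(uint8_t a, uint8_t b) {
    uint8_t result = (uint8_t)(a - b); /* SUB Rtmp, Ra, Rb */
    return result == 0;                /* Z flag: set when all result bits are 0 */
}
```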
Control Unit: Hardware Sequencing Brain
The Control Unit is often misunderstood because it is invisible in software.
It:
- sequences micro-operations
- enables buses
- selects ALU operations
- controls register writes
Think of it as a finite state machine driven by:
- opcode bits
- clock
- current CPU state
Without the control unit, the ALU would be a powerful but useless calculator.
Registers: Where Reality Happens
Registers are not “fast memory.”
They are the only place computation is physically possible.
The ALU cannot operate on RAM or Flash.
There is no physical path for that.
That means:
- values must be loaded into registers
- operations happen only on registers
- results must be written back
This creates a harsh rule:
**If a value is not in a register, it does not exist for computation.**
Why this matters
Many performance issues are not algorithmic — they are register pressure problems.
When registers are exhausted:
- the compiler spills values to memory
- extra load/store instructions appear
- latency explodes
- power consumption increases
This is why “simple-looking” C code can generate terrible assembly.
Register File Overview
A typical CPU register file looks conceptually like this:
Register File
+----+----+----+----+
| R0 | R1 | R2 | R3 |
+----+----+----+----+
| SP | PC | LR | PS |
+----+----+----+----+
Registers are:
- limited in number
- extremely fast
- explicitly managed by compiler + ABI
They are the CPU’s _only working memory_.
Register Types: Roles, Not Just Names
Registers are not equal.
They exist to support control flow, data flow, and execution flow.
General Purpose Registers (GPR)
General Purpose Registers are designed to store transient data required during execution.
They hold:
- operands for arithmetic and logic
- intermediate results
- function arguments
- return values
- temporary addresses
In general, **the more registers a CPU has, the faster it can work**.
Why?
Because more registers mean:
- fewer memory accesses
- less spilling to stack
- fewer pipeline stalls
- better instruction-level parallelism
This is one of the fundamental reasons why:
- ARM scales well for embedded
- x86-64 added more registers
- RISC architectures favor large register files
Example flow:
LOAD R0, a
LOAD R1, b
ADD R2, R0, R1
From the CPU’s perspective, variables like a and b do not exist — only R0 and R1.
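The same flow in C terms: the compiler lowers one statement into loads, a register-only add, and a write-back; the variable names never reach the hardware.

```c
/* The C view of the LOAD/LOAD/ADD flow above. The compiler turns this
   into "load a into a register, load b into a register, add registers,
   write back the result" - the names a, b, c exist only for humans. */
int add_variables(int a, int b) {
    int c = a + b;   /* LOAD R0,a ; LOAD R1,b ; ADD R2,R0,R1 */
    return c;
}
```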
Program Counter (PC): Control Flow Anchor
The Program Counter (PC) is the most important register in the CPU.
It holds:
**The address of the next instruction to be executed**
Every control-flow operation is fundamentally a PC modification:
- loops
- function calls
- jumps
- branches
- returns
- interrupts and exceptions
The size of the PC is directly related to the addressable program memory:
- 16-bit PC → 64 KB address space
- 32-bit PC → 4 GB address space
- 64-bit PC → massive virtual memory
From the CPU’s perspective, _changing behavior = changing PC_.
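The address-space arithmetic above is simply a power of two, sketched here as a small helper (the name is illustrative):

```c
#include <stdint.h>

/* Sketch: a PC of width n bits can address 2^n locations,
   so a 16-bit PC reaches 64 KB and a 32-bit PC reaches 4 GB. */
static uint64_t addressable_bytes(unsigned pc_bits) {
    return (uint64_t)1 << pc_bits;
}
```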
Instruction Register (IR)
The Instruction Register (IR) holds the instruction currently being processed.
The flow is:
- Instruction fetched from memory
- Stored into IR
- Passed to the Instruction Decoder
The IR isolates:
- memory timing
- decoding logic
- execution control
This separation allows pipelining and parallelism to exist.
Status Register (SREG)
The Status Register contains flags describing the outcome of the last operation.
These flags directly influence subsequent instructions, especially branches.
Common flags include:
- Overflow Flag: the result exceeded the register width
- Negative Flag: the result is negative (sign bit set)
- Zero Flag: the result is zero
- Carry Flag: a carry/borrow occurred in arithmetic or logical operations
- Half-Carry Flag: used mainly for BCD and lower-nibble arithmetic
- Global Interrupt Mask: enables or disables interrupts globally
Important insight:
Branch instructions rarely “compare” values — they **inspect flags**.
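A C sketch of that insight (struct and function names are invented for illustration): an 8-bit SUB produces flag bits, and only those bits survive for the next branch to inspect.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: the flags a status register captures after an 8-bit SUB a - b.
   A later branch instruction looks at these bits, not at a or b. */
typedef struct {
    bool zero;      /* result == 0            -> a == b              */
    bool negative;  /* sign bit (bit 7) set   -> result "negative"   */
    bool carry;     /* borrow occurred        -> a < b (unsigned)    */
} flags_t;

static flags_t flags_after_sub(uint8_t a, uint8_t b) {
    uint8_t r = (uint8_t)(a - b);
    flags_t f = {
        .zero     = (r == 0),
        .negative = (r & 0x80) != 0,
        .carry    = (a < b),
    };
    return f;
}
```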
Accumulator Register (ACC)
The Accumulator is a special-purpose register historically used to store intermediate results.
Even in modern CPUs where ACC is less explicit, the concept remains:
- results stay in registers
- reusing them avoids memory traffic
- chained operations become faster
Keeping values in registers instead of memory:
- reduces latency
- reduces power
- improves determinism
This is why tight loops and DSP code rely heavily on register reuse.
Stack Pointer (SP): Execution Context Boundary
The Stack Pointer holds the memory address of:
- the last stored value (full stack)
- or the next free location (empty stack)
It defines the execution context boundary.
The stack pointer defines:
- where local variables live
- where return addresses are stored
- where saved registers go
High Address
+------------------+
| local variables |
+------------------+
| saved registers |
+------------------+
| return address | ← SP
+------------------+
Low Address
Behavior (full descending stack, as in the diagram):
- PUSH → SP moves down to the next free location, and the value is stored there
- POP → the value is read, and SP moves back up to the previous entry
RTOS context switching is simply:
“Save SP + registers, load another SP + registers”
Nothing more.
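A minimal sketch of that idea in C (names are illustrative, not from any particular RTOS): the "switch" is nothing but saving one SP and loading another.

```c
#include <stdint.h>

/* Sketch: the saved state of a suspended task, modeled as a plain
   struct. A real context switch is a handful of assembly instructions
   doing exactly these stores and loads. */
typedef struct {
    uint32_t regs[8];  /* saved general purpose registers */
    uint32_t sp;       /* saved stack pointer */
    uint32_t pc;       /* address where this task resumes */
} task_context_t;

/* Suspend the task owning `old`, resume the one described by `next`. */
static void context_switch(task_context_t *old, task_context_t *next,
                           uint32_t *live_sp) {
    old->sp  = *live_sp;   /* save the SP of the outgoing task */
    *live_sp = next->sp;   /* load the SP of the incoming task */
    /* (real code also stores/reloads the general purpose registers) */
}
```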
Index Registers (X, Y)
Index registers support indirect and indexed addressing modes.
Accessed address formula:
Effective Address = Index Register + Offset
This mode is essential for:
- arrays
- buffers
- structs
- stack frames
Without index registers, modern programming would be impractical.
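In C terms, an indexed load is exactly this formula (the helper name is illustrative):

```c
#include <stdint.h>

/* Sketch: an array access a[i] is "effective address =
   base + index * element size" - the indexed addressing mode. */
static int load_indexed(const int *base, uint32_t index) {
    return base[index];   /* EA = base + index * sizeof(int) */
}
```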
Microprocessor Unit (MPU): Why CPU Alone Is Useless
An MPU contains only the CPU core.
No RAM.
No Flash.
No GPIO.
This means the CPU has:
- nowhere to fetch instructions from
- nowhere to store results
- nothing to interact with
┌───────────────┐
│ CPU │
│───────────────│
│ ALU │
│ Registers │
│ Control Unit │
└──────┬────────┘
│
┌───────────┼───────────┐
│ │ │
Data Bus Address Bus Control Bus
This is why MPUs require external components:
- RAM
- Flash
- bus controllers
- peripherals
Understanding this explains why:
- bootloaders exist
- memory controllers are critical
- board bring-up is non-trivial
Instruction Lifecycle: The Unbreakable Loop
Every instruction obeys the same lifecycle.
Fetch → Decode → Execute → Write Back
This is not a software concept.
It is hardwired behavior.
Expanded view
┌───────────┐
│ FETCH │ Read instruction from memory
└─────┬─────┘
↓
┌───────────┐
│ DECODE │ Identify opcode & operands
└─────┬─────┘
↓
┌───────────┐
│ EXECUTE │ ALU / branch / load-store
└─────┬─────┘
↓
┌───────────┐
│ WRITEBACK │ Update registers or memory
└─────┬─────┘
↓
PC++
The PC moves whether you want it to or not.
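The loop can be sketched as a toy interpreter in C (the opcodes are invented for illustration): fetch into an instruction register, advance the PC unconditionally, then decode and execute.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy sketch of the hardwired loop: fetch from "memory", decode the
   opcode, execute, write back, advance the PC. Opcodes are made up. */
enum { OP_NOP = 0, OP_INC = 1, OP_DEC = 2, OP_HALT = 3 };

static int run(const uint8_t *program, size_t len) {
    int acc = 0;        /* a single "register" */
    size_t pc = 0;      /* program counter */
    while (pc < len) {
        uint8_t ir = program[pc];   /* FETCH into the IR */
        pc++;                       /* the PC moves whether you want it or not */
        switch (ir) {               /* DECODE + EXECUTE */
        case OP_INC:  acc++; break; /* WRITEBACK to the register */
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      break;        /* NOP */
        }
    }
    return acc;
}
```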
Instruction Structure: Minimal but Sufficient
An instruction contains:
- opcode → what action
- operands → where data is
The CPU does not infer intent.
Everything must be explicitly encoded.
This explains why:
- RISC instructions are simple
- complex operations are decomposed
- compilers emit many instructions for simple code
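A sketch of such an encoding in C, using an invented 16-bit format (4 opcode bits plus three 4-bit register fields; the layout is not from a real ISA):

```c
#include <stdint.h>

/* Sketch: pack "ADD Rd, Ra, Rb" into a 16-bit word -
   everything the CPU needs is explicitly encoded in the bits. */
static uint16_t encode(uint8_t opcode, uint8_t rd, uint8_t ra, uint8_t rb) {
    return (uint16_t)((opcode & 0xF) << 12 | (rd & 0xF) << 8
                      | (ra & 0xF) << 4 | (rb & 0xF));
}

/* The decoder just slices those bits back out. */
static uint8_t decode_opcode(uint16_t instr) { return instr >> 12; }
static uint8_t decode_rd(uint16_t instr)     { return (instr >> 8) & 0xF; }
```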
Memory Addressing Modes: Finding Data Efficiently
Addressing modes exist to reduce instruction count and cycles.
Direct addressing
Simple, explicit, but inflexible.
Indirect addressing
Pointer-based access.
Foundation of dynamic data structures.
Base + Offset
Critical for:
- stack frames
- structs
- arrays
This mode enables efficient access patterns without recomputing addresses manually.
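In C, a struct field access is precisely base + constant offset (the type and helper names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: accessing p->y compiles to a load from
   (address of p) + (constant offset of y) - the base+offset mode. */
struct point { int32_t x; int32_t y; };

static int32_t load_y(const struct point *p) {
    return p->y;   /* load from p + offsetof(struct point, y) */
}
```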
Why Sequential Execution Is Slow
Without overlap:
- each instruction blocks the next
- hardware resources sit idle
The CPU waits while:
- memory responds
- decoding finishes
- results propagate
This inefficiency led to pipelining.
Pipelining: Overlapping Time
Executed strictly one at a time, the instruction lifecycle is slow:
if fetch, decode, and execute each take one clock cycle, then three instructions need nine clock cycles.
Pipelining arranges the CPU’s hardware stages so that, during the same clock cycle, one instruction executes while the next is decoded and a third is fetched, increasing overall performance.
Cycle 1: IF1
Cycle 2: ID1, IF2
Cycle 3: EX1, ID2, IF3
Cycle 4: STORE1, EX2, ID3, IF4
Cycle 5: STORE2, EX3, ID4, IF5
....
- Increased Throughput: Multiple instructions are processed simultaneously, increasing overall instruction throughput.
- Higher Utilization: every pipeline stage is busy each cycle; a single instruction still takes multiple cycles, but one instruction completes per cycle once the pipeline is full.
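The cycle counts can be sketched as two small formulas in C (assuming k one-cycle stages and no hazards):

```c
/* Sketch: cycle counts with and without pipelining, assuming
   k stages of one cycle each and no hazards. */
static unsigned cycles_sequential(unsigned k_stages, unsigned n_instr) {
    return k_stages * n_instr;       /* each instruction runs alone */
}

static unsigned cycles_pipelined(unsigned k_stages, unsigned n_instr) {
    return k_stages + n_instr - 1;   /* fill the pipe once, then 1 per cycle */
}
```

For the example above (3 stages, 3 instructions): 9 cycles sequentially, 5 cycles pipelined.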
Pipeline Hazards: When Reality Pushes Back
Pipelines are fragile.
- Structural hazards: a resource conflict, when more than one instruction tries to use the same hardware resource in the same cycle.
- Data hazards: when an instruction depends on the result of a prior instruction that is still in the pipeline.
- Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps).
These hazards explain:
- why instruction reordering matters
- why volatile exists
- why timing analysis is hard
- why speculative execution exists
Solving these hazards requires:
- stalls (inserted delay cycles)
- forwarding
- prediction
- speculation
Each solution trades:
performance ↔ power ↔ complexity ↔ security
Final Mental Compression
If you compress everything above into one mental image:
Clock
↓
Control Unit
↓
Registers ↔ ALU
↓
Memory
Everything else is software illusion layered on top.