A CPU does not see:
- source files
- variables
- functions
- threads
Instead, it sees binary instructions flowing through hardware stages, driven relentlessly by a clock.
The CPU Is the Smallest Executing Unit
The CPU is the smallest hardware entity capable of executing instructions independently.
Everything else in a system exists only to feed instructions to the CPU or store results from it.
The CPU does not understand intent:
- not _why_ a function exists
- not _what_ a task represents
- not _who_ owns memory
It only reacts to:
- clock edges
- control signals
- binary instruction encodings
This is why CPU bugs are so unforgiving:
the processor will do _exactly_ what you told it to do, not what you meant.
┌────────────────────┐
│ CPU │
├────────────────────┤
│ Registers │
│ ALU │
│ Control Unit │
└────────────────────┘
From the CPU’s perspective, time does not exist as milliseconds or deadlines.
Time exists only as clock cycles.
A missed real-time deadline is simply:
“Too many cycles were consumed before a critical instruction executed.”
A CPU consists of:
- ALU
- Control unit
- Registers
- Interconnections
- Data bus: wires that carry data
- Address bus: wires that carry addresses
- Control bus: wires that carry control signals (e.g. read/write)
ALU
The Arithmetic Logic Unit (ALU) is a digital circuit within a CPU or GPU (Graphics Processing Unit) that performs arithmetic and logic operations.
- Arithmetic:
- Addition
- Subtraction
- Shifting
- Logic:
- NOT
- AND
- OR
- XOR
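These operations can be sketched in C on raw 8-bit patterns (the `alu_*` names are illustrative, not from any library); the ALU sees only bits, never types or signedness:

```c
#include <stdint.h>

/* Sketch: the kinds of transformations an ALU performs, expressed on
   raw 8-bit patterns. There is no notion of "int" or "signed" at this
   level - only wires and bit patterns. */
static inline uint8_t alu_add(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }
static inline uint8_t alu_sub(uint8_t a, uint8_t b) { return (uint8_t)(a - b); }
static inline uint8_t alu_shl(uint8_t a)            { return (uint8_t)(a << 1); }
static inline uint8_t alu_and(uint8_t a, uint8_t b) { return a & b; }
static inline uint8_t alu_or (uint8_t a, uint8_t b) { return a | b; }
static inline uint8_t alu_xor(uint8_t a, uint8_t b) { return a ^ b; }
static inline uint8_t alu_not(uint8_t a)            { return (uint8_t)~a; }
```

Note how addition silently wraps at the register width: `alu_add(250, 10)` yields 4, because the ALU has no idea the programmer meant 260.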
Registers
Registers are the fastest type of memory; they exist to facilitate CPU operations.
There are two types:
- General Purpose Registers (GPR) (R0, R1, ..., R31)
- Special purpose registers:
- Program counter (PC)
- Instruction Register (IR)
- Instruction decoder (ID)
- Status Register (SREG)
- Accumulator Register (ACC)
- Stack Pointer (SP)
- Index Register (X, Y)
Control Unit
The control unit is the circuitry that controls the flow of data through the processor, and coordinates the activities of the other units within it.
In other words, it is in charge of the entire instruction (machine) cycle.
ALU: The Only Place Math Happens
The ALU performs all arithmetic and logic.
It does not:
- allocate memory
- manage stacks
- understand data types
It sees only bit patterns.
Arithmetic operations
- ADD, SUB
- SHIFT left/right
- INC, DEC
Logic operations
- AND, OR, XOR
- NOT
- comparisons (via subtraction)
Example insight:
A comparison like:
if (a == b)
Is actually:
SUB Rtmp, Ra, Rb
CHECK Zero Flag
There is no “compare instruction” conceptually — just math plus flags.
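A minimal C sketch of this idea, with an invented helper name: the "comparison" is a subtraction, and the branch decision is just a test of the resulting zero flag.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: how `if (a == b)` maps to hardware. The subtraction is the
   comparison; the branch only inspects the resulting zero flag. */
static bool zero_flag_after_sub(uint8_t a, uint8_t b) {
    uint8_t result = (uint8_t)(a - b); /* SUB Rtmp, Ra, Rb */
    return result == 0;                /* Z flag: set when all result bits are 0 */
}
```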
Control Unit: Hardware Sequencing Brain
The Control Unit is often misunderstood because it is invisible in software.
It:
- sequences micro-operations
- enables buses
- selects ALU operations
- controls register writes
Think of it as a finite state machine driven by:
- opcode bits
- clock
- current CPU state
Without the control unit, the ALU would be a powerful but useless calculator.
Registers: Where Reality Happens
Registers are not “fast memory.”
They are the only place computation is physically possible.
The ALU cannot operate on RAM or Flash.
There is no physical path for that.
That means:
- values must be loaded into registers
- operations happen only on registers
- results must be written back
This creates a harsh rule:
**If a value is not in a register, it does not exist for computation.**
Why this matters
Many performance issues are not algorithmic — they are register pressure problems.
When registers are exhausted:
- the compiler spills values to memory
- extra load/store instructions appear
- latency explodes
- power consumption increases
This is why “simple-looking” C code can generate terrible assembly.
Register File Overview
A typical CPU register file looks conceptually like this:
Register File
+----+----+----+----+
| R0 | R1 | R2 | R3 |
+----+----+----+----+
| SP | PC | LR | PS |
+----+----+----+----+
Registers are:
- limited in number
- extremely fast
- explicitly managed by compiler + ABI
They are the CPU’s _only working memory_.
Register Types: Roles, Not Just Names
Registers are not equal.
They exist to support control flow, data flow, and execution flow.
General Purpose Registers (GPR)
General Purpose Registers are designed to store transient data required during execution.
They hold:
- operands for arithmetic and logic
- intermediate results
- function arguments
- return values
- temporary addresses
In general, **the more registers a CPU has, the faster it can work**.
Why?
Because more registers mean:
- fewer memory accesses
- less spilling to stack
- fewer pipeline stalls
- better instruction-level parallelism
This is one of the fundamental reasons why:
- ARM scales well for embedded
- x86-64 added more registers
- RISC architectures favor large register files
Example flow:
LOAD R0, a
LOAD R1, b
ADD R2, R0, R1
From the CPU’s perspective, variables like a and b do not exist — only R0 and R1.
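The same flow in C terms: the compiler lowers one statement into loads, a register-only add, and a write-back; the variable names never reach the hardware.

```c
/* The C view of the LOAD/LOAD/ADD flow above. The compiler turns this
   into "load a into a register, load b into a register, add registers,
   write back the result" - the names a, b, c exist only for humans. */
int add_variables(int a, int b) {
    int c = a + b;   /* LOAD R0,a ; LOAD R1,b ; ADD R2,R0,R1 */
    return c;
}
```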
Program Counter (PC): Control Flow Anchor
The Program Counter (PC) is the most important register in the CPU.
It holds:
**The address of the next instruction to be executed**
Every control-flow operation is fundamentally a PC modification:
- loops
- function calls
- jumps
- branches
- returns
- interrupts and exceptions
The size of the PC is directly related to the addressable program memory:
- 16-bit PC → 64 KB address space
- 32-bit PC → 4 GB address space
- 64-bit PC → massive virtual memory
From the CPU’s perspective, _changing behavior = changing PC_.
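The address-space arithmetic above is simply a power of two, sketched here as a small helper (the name is illustrative):

```c
#include <stdint.h>

/* Sketch: a PC of width n bits can address 2^n locations,
   so a 16-bit PC reaches 64 KB and a 32-bit PC reaches 4 GB. */
static uint64_t addressable_bytes(unsigned pc_bits) {
    return (uint64_t)1 << pc_bits;
}
```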
Instruction Register (IR)
The Instruction Register (IR) holds the instruction currently being processed.
The flow is:
- Instruction fetched from memory
- Stored into IR
- Passed to the Instruction Decoder
The IR isolates:
- memory timing
- decoding logic
- execution control
This separation allows pipelining and parallelism to exist.
Status Register (SREG)
The Status Register contains flags describing the outcome of the last operation.
These flags directly influence subsequent instructions, especially branches.
Common flags include:
- Overflow Flag: the result exceeded the register width
- Negative Flag: the result is negative (sign bit set)
- Zero Flag: the result is zero
- Carry Flag: a carry/borrow occurred in arithmetic or logical operations
- Half-Carry Flag: used mainly for BCD and lower-nibble arithmetic
- Global Interrupt Mask: enables or disables interrupts globally
Important insight:
Branch instructions rarely “compare” values — they **inspect flags**.
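A C sketch of that insight (struct and function names are invented for illustration): an 8-bit SUB produces flag bits, and only those bits survive for the next branch to inspect.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: the flags a status register captures after an 8-bit SUB a - b.
   A later branch instruction looks at these bits, not at a or b. */
typedef struct {
    bool zero;      /* result == 0            -> a == b              */
    bool negative;  /* sign bit (bit 7) set   -> result "negative"   */
    bool carry;     /* borrow occurred        -> a < b (unsigned)    */
} flags_t;

static flags_t flags_after_sub(uint8_t a, uint8_t b) {
    uint8_t r = (uint8_t)(a - b);
    flags_t f = {
        .zero     = (r == 0),
        .negative = (r & 0x80) != 0,
        .carry    = (a < b),
    };
    return f;
}
```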
Accumulator Register (ACC)
The Accumulator is a special-purpose register historically used to store intermediate results.
Even in modern CPUs where ACC is less explicit, the concept remains:
- results stay in registers
- reusing them avoids memory traffic
- chained operations become faster
Keeping values in registers instead of memory:
- reduces latency
- reduces power
- improves determinism
This is why tight loops and DSP code rely heavily on register reuse.
Stack Pointer (SP): Execution Context Boundary
The Stack Pointer holds the memory address of:
- the last stored value (full stack)
- or the next free location (empty stack)
It defines the execution context boundary.
The stack pointer defines:
- where local variables live
- where return addresses are stored
- where saved registers go
High Address
+------------------+
| local variables |
+------------------+
| saved registers |
+------------------+
| return address | ← SP
+------------------+
Low Address
Behavior (full descending stack, as in the diagram):
- PUSH → SP moves down to the next free location, and the value is stored there
- POP → the value is read, and SP moves back up to the previous entry
RTOS context switching is simply:
“Save SP + registers, load another SP + registers”
Nothing more.
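A minimal sketch of that idea in C (names are illustrative, not from any particular RTOS): the "switch" is nothing but saving one SP and loading another.

```c
#include <stdint.h>

/* Sketch: the saved state of a suspended task, modeled as a plain
   struct. A real context switch is a handful of assembly instructions
   doing exactly these stores and loads. */
typedef struct {
    uint32_t regs[8];  /* saved general purpose registers */
    uint32_t sp;       /* saved stack pointer */
    uint32_t pc;       /* address where this task resumes */
} task_context_t;

/* Suspend the task owning `old`, resume the one described by `next`. */
static void context_switch(task_context_t *old, task_context_t *next,
                           uint32_t *live_sp) {
    old->sp  = *live_sp;   /* save the SP of the outgoing task */
    *live_sp = next->sp;   /* load the SP of the incoming task */
    /* (real code also stores/reloads the general purpose registers) */
}
```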
Index Registers (X, Y)
Index registers support indirect and indexed addressing modes.
Accessed address formula:
Effective Address = Index Register + Offset
This mode is essential for:
- arrays
- buffers
- structs
- stack frames
Without index registers, modern programming would be impractical.
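In C terms, an indexed load is exactly this formula (the helper name is illustrative):

```c
#include <stdint.h>

/* Sketch: an array access a[i] is "effective address =
   base + index * element size" - the indexed addressing mode. */
static int load_indexed(const int *base, uint32_t index) {
    return base[index];   /* EA = base + index * sizeof(int) */
}
```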
Microprocessor Unit (MPU): Why CPU Alone Is Useless
An MPU contains only the CPU core.
No RAM.
No Flash.
No GPIO.
This means the CPU has:
- nowhere to fetch instructions from
- nowhere to store results
- nothing to interact with
┌───────────────┐
│ CPU │
│───────────────│
│ ALU │
│ Registers │
│ Control Unit │
└──────┬────────┘
│
┌───────────┼───────────┐
│ │ │
Data Bus Address Bus Control Bus
This is why MPUs require external components:
- RAM
- Flash
- bus controllers
- peripherals
Understanding this explains why:
- bootloaders exist
- memory controllers are critical
- board bring-up is non-trivial
Instruction Lifecycle: The Unbreakable Loop
Every instruction obeys the same lifecycle.
Fetch → Decode → Execute → Write Back
This is not a software concept.
It is hardwired behavior.
Expanded view
┌───────────┐
│ FETCH │ Read instruction from memory
└─────┬─────┘
↓
┌───────────┐
│ DECODE │ Identify opcode & operands
└─────┬─────┘
↓
┌───────────┐
│ EXECUTE │ ALU / branch / load-store
└─────┬─────┘
↓
┌───────────┐
│ WRITEBACK │ Update registers or memory
└─────┬─────┘
↓
PC++
The PC moves whether you want it to or not.
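The loop can be sketched as a toy interpreter in C (the opcodes are invented for illustration): fetch into an instruction register, advance the PC unconditionally, then decode and execute.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy sketch of the hardwired loop: fetch from "memory", decode the
   opcode, execute, write back, advance the PC. Opcodes are made up. */
enum { OP_NOP = 0, OP_INC = 1, OP_DEC = 2, OP_HALT = 3 };

static int run(const uint8_t *program, size_t len) {
    int acc = 0;        /* a single "register" */
    size_t pc = 0;      /* program counter */
    while (pc < len) {
        uint8_t ir = program[pc];   /* FETCH into the IR */
        pc++;                       /* the PC moves whether you want it or not */
        switch (ir) {               /* DECODE + EXECUTE */
        case OP_INC:  acc++; break; /* WRITEBACK to the register */
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      break;        /* NOP */
        }
    }
    return acc;
}
```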
Instruction Structure: Minimal but Sufficient
An instruction contains:
- opcode → what action
- operands → where data is
The CPU does not infer intent.
Everything must be explicitly encoded.
This explains why:
- RISC instructions are simple
- complex operations are decomposed
- compilers emit many instructions for simple code
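A sketch of such an encoding in C, using an invented 16-bit format (4 opcode bits plus three 4-bit register fields; the layout is not from a real ISA):

```c
#include <stdint.h>

/* Sketch: pack "ADD Rd, Ra, Rb" into a 16-bit word -
   everything the CPU needs is explicitly encoded in the bits. */
static uint16_t encode(uint8_t opcode, uint8_t rd, uint8_t ra, uint8_t rb) {
    return (uint16_t)((opcode & 0xF) << 12 | (rd & 0xF) << 8
                      | (ra & 0xF) << 4 | (rb & 0xF));
}

/* The decoder just slices those bits back out. */
static uint8_t decode_opcode(uint16_t instr) { return instr >> 12; }
static uint8_t decode_rd(uint16_t instr)     { return (instr >> 8) & 0xF; }
```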
Memory Addressing Modes: Finding Data Efficiently
Addressing modes exist to reduce instruction count and cycles.
Direct addressing
Simple, explicit, but inflexible.
Indirect addressing
Pointer-based access.
Foundation of dynamic data structures.
Base + Offset
Critical for:
- stack frames
- structs
- arrays
This mode enables efficient access patterns without recomputing addresses manually.
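In C, a struct field access is precisely base + constant offset (the type and helper names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: accessing p->y compiles to a load from
   (address of p) + (constant offset of y) - the base+offset mode. */
struct point { int32_t x; int32_t y; };

static int32_t load_y(const struct point *p) {
    return p->y;   /* load from p + offsetof(struct point, y) */
}
```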
Why Sequential Execution Is Slow
Without overlap:
- each instruction blocks the next
- hardware resources sit idle
The CPU waits while:
- memory responds
- decoding finishes
- results propagate
This inefficiency led to pipelining.
Pipelining: Overlapping Time
Executed strictly one at a time, the instruction lifecycle is slow:
if fetch, decode, and execute each take one clock cycle, then three instructions need nine clock cycles.
Pipelining arranges the CPU’s hardware stages so that, during the same clock cycle, one instruction executes while the next is decoded and a third is fetched, increasing overall performance.
Cycle 1: IF1
Cycle 2: ID1, IF2
Cycle 3: EX1, ID2, IF3
Cycle 4: STORE1, EX2, ID3, IF4
Cycle 5: STORE2, EX3, ID4, IF5
....
- Increased Throughput: Multiple instructions are processed simultaneously, increasing overall instruction throughput.
- Higher Utilization: every pipeline stage is busy each cycle; a single instruction still takes multiple cycles, but one instruction completes per cycle once the pipeline is full.
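The cycle counts can be sketched as two small formulas in C (assuming k one-cycle stages and no hazards):

```c
/* Sketch: cycle counts with and without pipelining, assuming
   k stages of one cycle each and no hazards. */
static unsigned cycles_sequential(unsigned k_stages, unsigned n_instr) {
    return k_stages * n_instr;       /* each instruction runs alone */
}

static unsigned cycles_pipelined(unsigned k_stages, unsigned n_instr) {
    return k_stages + n_instr - 1;   /* fill the pipe once, then 1 per cycle */
}
```

For the example above (3 stages, 3 instructions): 9 cycles sequentially, 5 cycles pipelined.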
Pipeline Hazards: When Reality Pushes Back
Pipelines are fragile.
- Structural hazards: a resource conflict, when more than one instruction tries to use the same hardware resource in the same cycle.
- Data hazards: when an instruction depends on the result of a prior instruction that is still in the pipeline.
- Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps).
These hazards explain:
- why instruction reordering matters
- why volatile exists
- why timing analysis is hard
- why speculative execution exists
Solving these hazards requires:
- stalls (inserted delay cycles)
- forwarding
- prediction
- speculation
Each solution trades:
performance ↔ power ↔ complexity ↔ security
Final Mental Compression
If you compress everything above into one mental image:
Clock
↓
Control Unit
↓
Registers ↔ ALU
↓
Memory
Everything else is software illusion layered on top.