ES: CPU Fundamentals: Thinking Like the Processor

To truly master embedded systems, performance engineering, or hardware security, you must stop thinking like a programmer and start thinking like a processor.

A CPU does not see:

  • source files
  • variables
  • functions
  • threads

Instead, it sees binary instructions flowing through hardware stages, driven relentlessly by a clock.


The CPU Is the Smallest Executing Unit

The CPU is the smallest hardware entity capable of executing instructions independently.

Everything else in a system exists only to feed instructions to the CPU or store results from it.

The CPU does not understand intent:

  • not _why_ a function exists
  • not _what_ a task represents
  • not _who_ owns memory

It only reacts to:

  • clock edges
  • control signals
  • binary instruction encodings

This is why CPU bugs are so unforgiving:

the processor will do _exactly_ what you told it to do, not what you meant.

```md
┌────────────────────┐
│        CPU         │
├────────────────────┤
│ Registers          │
│ ALU                │
│ Control Unit       │
└────────────────────┘
```

From the CPU’s perspective, time does not exist as milliseconds or deadlines.

Time exists only as clock cycles.

A missed real-time deadline is simply:

“Too many cycles were consumed before a critical instruction executed.”

A CPU consists of:

  • ALU
  • Control unit
  • Registers
  • Interconnections (buses):
    • Data bus: wires that carry data
    • Address bus: wires that carry addresses
    • Control bus: wires that carry read/write control signals

ALU

The Arithmetic Logic Unit (ALU) is the circuit within a CPU or GPU (Graphics Processing Unit) that performs arithmetic and logic operations.

  • Arithmetic:
    • Addition
    • Subtraction
    • Shifting
  • Logic:
    • NOT
    • AND
    • OR
    • XOR

Registers

Registers are the fastest type of memory in the system; they exist to support the CPU’s operations directly.

There are two types:

  • General Purpose Registers (GPR) (R0, R1, …, R31)
  • Special purpose registers:
    • Program Counter (PC)
    • Instruction Register (IR)
    • Instruction Decoder (ID)
    • Status Register (SREG)
    • Accumulator Register (ACC)
    • Stack Pointer (SP)
    • Index Registers (X, Y)

Control Unit

The control unit is the circuitry that controls the flow of data through the processor, and coordinates the activities of the other units within it.

In other words, it is in charge of the entire instruction (machine) cycle.


ALU: The Only Place Math Happens

The ALU performs all arithmetic and logic.

It does not:

  • allocate memory
  • manage stacks
  • understand data types

It sees only bit patterns.

Arithmetic operations

  • ADD, SUB
  • SHIFT left/right
  • INC, DEC

Logic operations

  • AND, OR, XOR
  • NOT
  • comparisons (via subtraction)

Example insight:

A comparison like:

```cpp
if (a == b)
```

Is actually:

```md
SUB Rtmp, Ra, Rb
CHECK Zero Flag
```

There is no “compare instruction” conceptually — just math plus flags.


Control Unit: Hardware Sequencing Brain

The Control Unit is often misunderstood because it is invisible in software.

It:

  • sequences micro-operations
  • enables buses
  • selects ALU operations
  • controls register writes

Think of it as a finite state machine driven by:

  • opcode bits
  • clock
  • current CPU state

Without the control unit, the ALU would be a powerful but useless calculator.


Registers: Where Reality Happens

Registers are not “fast memory.”

They are the only place computation is physically possible.

The ALU cannot operate on RAM or Flash.

There is no physical path for that.

That means:

  • values must be loaded into registers
  • operations happen only on registers
  • results must be written back

This creates a harsh rule:

**If a value is not in a register, it does not exist for computation.**

Why this matters

Many performance issues are not algorithmic — they are register pressure problems.

When registers are exhausted:

  • the compiler spills values to memory
  • extra load/store instructions appear
  • latency explodes
  • power consumption increases

This is why “simple-looking” C code can generate terrible assembly.


Register File Overview

A typical CPU register file looks conceptually like this:

```md
Register File
+----+----+----+----+
| R0 | R1 | R2 | R3 |
+----+----+----+----+
| SP | PC | LR | PS |
+----+----+----+----+
```

Registers are:

  • limited in number
  • extremely fast
  • explicitly managed by compiler + ABI

They are the CPU’s _only working memory_.


Register Types: Roles, Not Just Names

Registers are not equal.

They exist to support control flow, data flow, and execution flow.

General Purpose Registers (GPR)

General Purpose Registers are designed to store transient data required during execution.

They hold:

  • operands for arithmetic and logic
  • intermediate results
  • function arguments
  • return values
  • temporary addresses

In general, **the more registers a CPU has, the faster it can work**.

Why?

Because more registers mean:

  • fewer memory accesses
  • less spilling to stack
  • fewer pipeline stalls
  • better instruction-level parallelism

This is one of the fundamental reasons why:

  • ARM scales well for embedded
  • x86-64 added more registers
  • RISC architectures favor large register files

Example flow:

```md
LOAD R0, a
LOAD R1, b
ADD  R2, R0, R1
```

From the CPU’s perspective, variables like `a` and `b` do not exist — only R0 and R1.


Program Counter (PC): Control Flow Anchor

The Program Counter (PC) is the most important register in the CPU.

It holds:

**The address of the next instruction to be executed**

Every control-flow operation is fundamentally a PC modification:

  • loops work
  • functions are called
  • jumps
  • branches
  • returns
  • interrupts
  • exceptions occur

The size of the PC is directly related to the addressable program memory:

  • 16-bit PC → 64 KB address space
  • 32-bit PC → 4 GB address space
  • 64-bit PC → massive virtual memory

From the CPU’s perspective, _changing behavior = changing PC_.


Instruction Register (IR)

The Instruction Register (IR) holds the instruction currently being processed.

The flow is:

  1. Instruction fetched from memory
  2. Stored into IR
  3. Passed to the Instruction Decoder

The IR isolates:

  • memory timing
  • decoding logic
  • execution control

This separation allows pipelining and parallelism to exist.


Status Register (SREG)

The Status Register contains flags describing the outcome of the last operation.

These flags directly influence subsequent instructions, especially branches.

Common flags include:

  • Overflow Flag: the result exceeded the register width
  • Negative Flag: the result is negative (sign bit set)
  • Zero Flag: the result is zero
  • Carry Flag: a carry or borrow occurred in arithmetic or logical operations
  • Half-Carry Flag: used mainly for BCD and lower-nibble arithmetic
  • Global Interrupt Mask: enables or disables interrupts globally

Important insight:

Branch instructions rarely “compare” values — they **inspect flags**.

Accumulator Register (ACC)

The Accumulator is a special-purpose register historically used to store intermediate results.

Even in modern CPUs where ACC is less explicit, the concept remains:

  • results stay in registers
  • reusing them avoids memory traffic
  • chained operations become faster

Keeping values in registers instead of memory:

  • reduces latency
  • reduces power
  • improves determinism

This is why tight loops and DSP code rely heavily on register reuse.


Stack Pointer (SP): Execution Context Boundary

The Stack Pointer holds the memory address of:

  • the last stored value (full stack)
  • or the next free location (empty stack)

It defines the execution context boundary. Concretely, the stack pointer determines:

  • where local variables live
  • where return addresses are stored
  • where saved registers go

```md
High Address
+------------------+
| local variables  |
+------------------+
| saved registers  |
+------------------+
| return address   | ← SP
+------------------+
Low Address
```

Behavior:

  • PUSH → SP moves to next empty location
  • POP → SP moves to previous value

RTOS context switching is simply:

“Save SP + registers, load another SP + registers”

Nothing more.


Index Registers (X, Y)

Index registers support indirect and indexed addressing modes.

Accessed address formula:

```md
Effective Address = Index Register + Offset
```

This mode is essential for:

  • arrays
  • buffers
  • structs
  • stack frames

Without index registers, modern programming would be impractical.


Microprocessor Unit (MPU): Why CPU Alone Is Useless

An MPU contains only the CPU core.

No RAM.

No Flash.

No GPIO.

This means the CPU has:

  • nowhere to fetch instructions from
  • nowhere to store results
  • nothing to interact with

```md
        ┌───────────────┐
        │      CPU      │
        │───────────────│
        │ ALU           │
        │ Registers     │
        │ Control Unit  │
        └──────┬────────┘
               │
   ┌───────────┼───────────┐
   │           │           │
Data Bus   Address Bus  Control Bus
```

This is why MPUs require external components:

  • RAM
  • Flash
  • bus controllers
  • peripherals

Understanding this explains why:

  • bootloaders exist
  • memory controllers are critical
  • board bring-up is non-trivial

Instruction Lifecycle: The Unbreakable Loop

Every instruction obeys the same lifecycle.

```md
Fetch → Decode → Execute → Write Back
```

This is not a software concept.

It is hardwired behavior.

Expanded view

```md
┌───────────┐
│   FETCH   │  Read instruction from memory
└─────┬─────┘
      ↓
┌───────────┐
│  DECODE   │  Identify opcode & operands
└─────┬─────┘
      ↓
┌───────────┐
│  EXECUTE  │  ALU / branch / load-store
└─────┬─────┘
      ↓
┌───────────┐
│ WRITEBACK │  Update registers or memory
└─────┬─────┘
      ↓
     PC++
```

The PC moves whether you want it to or not.


Instruction Structure: Minimal but Sufficient

An instruction contains:

  • opcode → what action
  • operands → where data is

The CPU does not infer intent.

Everything must be explicitly encoded.

This explains why:

  • RISC instructions are simple
  • complex operations are decomposed
  • compilers emit many instructions for simple code

Memory Addressing Modes: Finding Data Efficiently

Addressing modes exist to reduce instruction count and cycles.

Direct addressing

Simple, explicit, but inflexible.

Indirect addressing

Pointer-based access.

Foundation of dynamic data structures.

Base + Offset

Critical for:

  • stack frames
  • structs
  • arrays

This mode enables efficient access patterns without recomputing addresses manually.


Why Sequential Execution Is Slow

Without overlap:

  • each instruction blocks the next
  • hardware resources sit idle

The CPU waits while:

  • memory responds
  • decoding finishes
  • results propagate

This inefficiency led to pipelining.


Pipelining: Overlapping Time

The instruction lifecycle is sequential, and purely sequential execution is slow.

If fetch, decode, and execute each take one clock cycle, then three instructions executed strictly one after another need nine clock cycles.

Pipelining arranges the CPU’s hardware stages so that they overlap in time: in a single clock cycle, one instruction is executing while the next is being decoded and a third is being fetched.

Cycle 1: IF1
Cycle 2: ID1, IF2
Cycle 3: EX1, ID2, IF3
Cycle 4: WB1, EX2, ID3, IF4
Cycle 5: WB2, EX3, ID4, IF5

....

  • Increased throughput: multiple instructions are in flight at once, so in the ideal case one instruction completes every cycle.
  • Unchanged latency: a single instruction still passes through every stage; pipelining improves throughput, not the completion time of any individual instruction.

Pipeline Hazards: When Reality Pushes Back

Pipelines are fragile.

  • Structural hazards: a resource conflict, where more than one instruction tries to use the same hardware resource in the same cycle.
  • Data hazards: an instruction depends on the result of a prior instruction that is still in the pipeline.
  • Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps).

These hazards explain:

  • why instruction reordering matters
  • why volatile exists
  • why timing analysis is hard
  • why speculative execution exists

The simplest mitigation is to insert delays (stalls), but stalls give back the very cycles pipelining was meant to save.

Solving these requires:

  • stalls
  • forwarding
  • prediction
  • speculation

Each solution trades:

performance ↔ power ↔ complexity ↔ security

Final Mental Compression

If you compress everything above into one mental image:

```md
Clock
 ↓
Control Unit
 ↓
Registers ↔ ALU
 ↓
Memory
```

Everything else is software illusion layered on top.