CPU cycles are gold. If the CPU wastes time copying bytes from UART to RAM or from RAM to an SPI LCD, your real-time guarantees collapse.
This is where DMA (Direct Memory Access) becomes a core architectural element.
DMA is not just a peripheral feature — it is a system-level performance and determinism enabler.
What Is DMA?
Direct Memory Access (DMA) is a hardware mechanism that allows peripherals to transfer data directly to or from memory without continuous CPU intervention.
Instead of:
Peripheral → CPU → Memory → CPU → Peripheral
DMA allows:
Peripheral → Memory → Peripheral
The CPU only:
- Configures the transfer
- Starts it
- Gets notified when it finishes (usually via interrupt)
After that, hardware takes over.
The Classical CPU-Based Transfer Problem
Let’s first see what happens without DMA.
Example: UART receiving 1 KB of data.
1. UART RX interrupt fires
2. CPU reads 1 byte
3. CPU writes byte to RAM
4. Repeat 1024 times
The per-byte data path looks like this:
+--------+
UART -->| CPU |--> RAM
+--------+
(1024 times)
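The interrupt-per-byte pattern above can be sketched in a few lines. This is a host-runnable simulation, not a real driver: `uart_dr` stands in for the UART data register and `uart_rx_isr()` for the RX-not-empty interrupt handler.

```c
#include <stdint.h>
#include <stddef.h>

#define RX_LEN 1024

static volatile uint8_t uart_dr;   /* simulated UART data register */
static uint8_t rx_buf[RX_LEN];
static size_t  rx_idx;

/* Fires once per byte: 1024 interrupts to receive 1 KB. */
void uart_rx_isr(void)
{
    if (rx_idx < RX_LEN)
        rx_buf[rx_idx++] = uart_dr;  /* CPU reads register, writes RAM */
}
```

Every one of those 1024 invocations costs context-save, handler entry, and context-restore, which is exactly the overhead DMA removes.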
Problems:
- CPU overhead
- High interrupt rate
- Increased latency
- Power consumption
- RTOS scheduling disturbance
Now imagine doing this while:
- Driving an RGB LCD
- Running BLE
- Playing audio
- Handling sensors
You lose determinism.
DMA Architecture Inside the MCU
In modern MCUs (e.g., STM32, NXP, ESP32-S3, ARM SoCs), DMA is implemented as a bus master.
It can access system memory like the CPU.
+-------------+
| CPU |
+-------------+
|
-----------------------------
| Bus |
-----------------------------
| | |
+--------+ +--------+ +--------+
| RAM | | UART | | SPI |
+--------+ +--------+ +--------+
^
|
+-------------+
| DMA Engine |
+-------------+
The DMA controller:
- Reads from source
- Writes to destination
- Arbitrates bus access
- Generates interrupt when done
Now the CPU is free to execute tasks.
How a DMA Transfer Works Step by Step
Let’s walk through it as an embedded engineer.
Example: SPI transmitting framebuffer to LCD.
- CPU configures DMA registers:
- Source address (framebuffer in RAM)
- Destination address (SPI data register)
- Transfer size
- Mode (normal/circular)
- CPU enables DMA channel
- DMA waits for SPI request
- SPI generates DMA request
- DMA moves data word-by-word
- Transfer complete interrupt fires
CPU: [Config DMA]------------------[Working on other task]
DMA: [Transfer Transfer Transfer Transfer]
SPI: [Request][Request][Request][Request]
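The configuration steps above can be captured in a minimal channel model. The struct fields and function names here are hypothetical, but they map one-to-one onto the registers real controllers (STM32 DMA, ESP32-S3 GDMA) expose; the host-side `dma_start()` performs the whole transfer at once, whereas hardware moves data word-by-word on peripheral request.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DMA_MODE_NORMAL, DMA_MODE_CIRCULAR } dma_mode_t;

/* Hypothetical, simplified model of one DMA channel. */
typedef struct {
    const uint8_t *src;                /* source: framebuffer in RAM       */
    uint8_t       *dst;                /* destination: a buffer here; the
                                          SPI data register on hardware    */
    size_t         count;              /* transfer size                    */
    dma_mode_t     mode;               /* normal / circular                */
    void         (*on_complete)(void); /* transfer-complete interrupt      */
} dma_channel_t;

/* Host-side stand-in for the hardware engine: move the data,
 * then "raise" the completion interrupt. */
static void dma_start(dma_channel_t *ch)
{
    memcpy(ch->dst, ch->src, ch->count);
    if (ch->on_complete)
        ch->on_complete();
}
```

The CPU's entire job is filling in that struct and calling `dma_start()`; everything after that happens without it.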
The key idea:
CPU becomes control plane, DMA becomes data plane.
Core DMA Modes
Normal Mode
Transfers a fixed number of elements once.
Memory ---> Peripheral
1024 bytes
Stop
Used for:
- File read
- LCD frame update
- Block transfer
Circular Mode
After finishing, DMA restarts automatically.
Buffer Start ------------------ Buffer End
^ |
|______________________________|
Used for:
- ADC continuous sampling
- Audio streaming
- Sensor acquisition
In RTOS systems, this is powerful for real-time pipelines.
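Circular mode is usually paired with half-complete and full-complete interrupts so the CPU processes one half of the buffer while DMA fills the other (ping-pong buffering). A sketch of that callback structure, with illustrative names rather than a specific vendor API:

```c
#include <stddef.h>
#include <stdint.h>

#define ADC_BUF_LEN 8
static uint16_t adc_buf[ADC_BUF_LEN];
static uint32_t samples_processed;

static void process_samples(const uint16_t *s, size_t n)
{
    (void)s;
    samples_processed += n;  /* filtering / sensor fusion would go here */
}

/* DMA has filled the first half and is now writing the second. */
void dma_half_complete_isr(void)
{
    process_samples(&adc_buf[0], ADC_BUF_LEN / 2);
}

/* DMA has wrapped around; the second half is ready while it
 * rewrites the first. */
void dma_full_complete_isr(void)
{
    process_samples(&adc_buf[ADC_BUF_LEN / 2], ADC_BUF_LEN / 2);
}
```

As long as each half is processed before DMA wraps back into it, the stream never drops a sample.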
Memory-to-Memory Mode
In many cases, DMA can copy memory faster than the CPU can.
RAM A --------> RAM B
Used for:
- Framebuffer copy
- Data duplication
- Crypto preprocessing
DMA in Real Embedded Examples
UART RX with Circular DMA
Instead of interrupt per byte:
UART --> DMA --> Circular Buffer in RAM
CPU periodically processes:
while (new_data_available)
parse_packet();
This dramatically reduces interrupt load.
+-------+
UART --->| DMA |----> [Ring Buffer in RAM]
+-------+
^ ^
| |
Head Ptr Tail Ptr
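The consumer side of that ring buffer can be sketched as follows. `dma_write_index()` is a stand-in for reading the hardware write position (on many controllers, buffer length minus the channel's remaining-transfer counter); here it is simulated so the logic runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>

#define RB_LEN 256

static uint8_t rb[RB_LEN];    /* ring buffer filled by circular DMA */
static size_t  tail;          /* CPU read position (Tail Ptr)       */
static size_t  sim_head;      /* simulated DMA write position       */

/* On hardware: derive this from the DMA channel's counter register. */
static size_t dma_write_index(void) { return sim_head; }

/* Drain everything DMA has written since the last call, with wrap. */
size_t rb_read(uint8_t *out, size_t max)
{
    size_t head = dma_write_index();   /* Head Ptr, owned by hardware */
    size_t n = 0;
    while (tail != head && n < max) {
        out[n++] = rb[tail];
        tail = (tail + 1) % RB_LEN;
    }
    return n;
}
```

The CPU never touches the head pointer; it only advances the tail, so no locking is needed between the DMA engine and the task.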
LCD Framebuffer Streaming (ESP32-S3 Case)
In RGB panels:
Framebuffer in PSRAM
↓
DMA
↓
RGB Peripheral
↓
LCD Panel
If the CPU tried this manually:
- UI animations would tear
- Wi-Fi would jitter
- Audio would glitch
DMA ensures:
- Continuous stream
- Deterministic timing
- Minimal CPU load
ADC + DMA for Sensor Systems
In wearables or industrial systems:
ADC --> DMA --> RAM Buffer
Then:
Task processes samples
→ Filtering
→ Sensor Fusion
→ Metrics
Without DMA, high sample rates break real-time constraints.
DMA and RTOS Interaction
In an RTOS system:
Task A configures DMA
DMA runs independently
DMA interrupt wakes Task A
Flow:
Task A:
start_dma();
wait_for_notification();
ISR:
notify_task();
Diagram:
Task A ----[Start DMA]----> (Blocked)
|
v
DMA Engine
|
[Transfer Complete]
|
v
ISR
|
v
Task A resumes
This creates:
- Non-blocking I/O
- Efficient multitasking
- Deterministic response
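A minimal model of that handshake, with the RTOS primitive replaced by a plain flag so the flow runs anywhere. Under FreeRTOS, the flag would be a task notification: the ISR calls `vTaskNotifyGiveFromISR()` and the task blocks in `ulTaskNotifyTake()`.

```c
#include <stdbool.h>

static volatile bool dma_done;

/* ISR side: the transfer-complete interrupt notifies the task. */
void dma_complete_isr(void)
{
    dma_done = true;
}

/* Task side: in an RTOS this call blocks; here it polls once. */
bool task_poll_dma_done(void)
{
    if (dma_done) {
        dma_done = false;   /* consume the notification */
        return true;
    }
    return false;
}
```

The important property is that the task consumes zero CPU while the transfer is in flight; it only runs again when the ISR releases it.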
Bus Arbitration and Performance
DMA and CPU share the same bus.
If both try to access RAM:
- Bus arbitration decides priority
- Can cause latency spikes
In high-performance systems:
- DMA priority levels exist
- Burst mode improves throughput
- AXI/AHB bus matrix allows parallel access
If not configured correctly:
- CPU stalls
- Cache thrashing
- Real-time jitter
Cache Coherency Problem (Critical in Cortex-A / ESP32-S3)
When the data cache is enabled:
CPU writes land in the cache
DMA reads straight from RAM
Problem: CPU cache != RAM
DMA reads stale data. In the reverse direction, the CPU can read stale cache lines after a DMA write.
Solution:
- Cache clean (write-back)
- Cache invalidate
- Use non-cacheable memory region
This is critical in:
- Embedded Linux
- Cortex-A
- High-speed peripherals
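The two maintenance operations slot into the driver at fixed points. This sketch stubs the cache calls so it runs anywhere; on a real Cortex-M7 they would be the CMSIS functions `SCB_CleanDCache_by_Addr` and `SCB_InvalidateDCache_by_Addr`, and the stubs count calls only so the sequence can be checked.

```c
#include <stddef.h>

static int cleans, invalidates;

/* Stubs for the real cache-maintenance primitives. */
static void cache_clean(const void *addr, size_t len)
{
    (void)addr; (void)len; cleans++;
}
static void cache_invalidate(const void *addr, size_t len)
{
    (void)addr; (void)len; invalidates++;
}

/* CPU -> DMA -> peripheral: flush cached writes to RAM
 * before the DMA engine reads the buffer. */
void dma_tx_prepare(const void *buf, size_t len)
{
    cache_clean(buf, len);       /* write-back: RAM now matches cache */
    /* ... start the DMA read from buf here ... */
}

/* Peripheral -> DMA -> CPU: drop stale cache lines before the CPU
 * reads what DMA just wrote to RAM. */
void dma_rx_finish(const void *buf, size_t len)
{
    cache_invalidate(buf, len);  /* next CPU access fetches fresh RAM */
    /* ... process buf here ... */
}
```

Getting the direction wrong (invalidating before TX, cleaning after RX) is one of the most common DMA bugs on cached cores.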
Comparing CPU vs DMA Transfer
| Feature | CPU Copy | DMA |
|---|---|---|
| CPU Load | High | Low |
| Interrupt Rate | High | Low |
| Power Consumption | High | Lower |
| Determinism | Worse | Better |
| Setup Complexity | Simple | Moderate |
When NOT to Use DMA
DMA has overhead:
- Setup time
- Interrupt latency
- Bus contention
For very small transfers (e.g., 4 bytes), CPU copy is faster.
Rule of thumb:
Use DMA for medium to large data blocks or continuous streams.
In a modern system:
+----------------+
| CPU |
+----------------+
| ^
| |
-------------------
| Bus Matrix |
-------------------
| | |
+-----+ +-----+ +-----+
| RAM | | DMA | | Per |
+-----+ +-----+ +-----+
The CPU becomes orchestrator.
DMA becomes transporter.
Peripherals become producers/consumers.