ES: Direct Memory Access (DMA)

Direct Memory Access (DMA) is a hardware mechanism that allows peripherals to transfer data directly to or from memory without continuous CPU involvement. It improves performance, reduces CPU load, and enables efficient real-time data streaming in embedded systems.

CPU cycles are gold. If the CPU wastes time copying bytes from UART to RAM or from RAM to an SPI LCD, your real-time guarantees collapse.

This is where DMA (Direct Memory Access) becomes a core architectural element.

DMA is not just a peripheral feature; it is a system-level enabler of performance and determinism.

What is DMA

Direct Memory Access (DMA) is a hardware mechanism that allows peripherals to transfer data directly to or from memory without continuous CPU intervention.

Instead of:

Peripheral → CPU → Memory → CPU → Peripheral

DMA allows:

Peripheral → Memory → Peripheral

The CPU only:

  • Configures the transfer
  • Starts it
  • Gets notified when it finishes (usually via interrupt)

After that, hardware takes over.


The Classical CPU-Based Transfer Problem

Let’s first see what happens without DMA.

Example: UART receiving 1 KB of data.

1. UART RX interrupt fires
2. CPU reads 1 byte
3. CPU writes byte to RAM
4. Repeat 1024 times

So every byte passes through the CPU:

        +--------+
UART -->|  CPU   |--> RAM
        +--------+
      (1024 times)

Problems:

  • CPU overhead
  • High interrupt rate
  • Increased latency
  • Higher power consumption
  • RTOS scheduling disturbance
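
To put a number on the interrupt load, here is a small back-of-the-envelope sketch. It assumes a UART running at 115200 baud with 8N1 framing (10 bit times per byte); the baud rate, the framing and the function names are illustrative assumptions, not values from a specific board.

```cpp
#include <cstdint>

// Bytes per second on a UART line. Each byte costs `bits_per_frame` bit times
// (10 for 8N1: 1 start + 8 data + 1 stop).
constexpr std::uint32_t uart_bytes_per_second(std::uint32_t baud,
                                              std::uint32_t bits_per_frame = 10) {
    return baud / bits_per_frame;
}

// Interrupts per second when the CPU is interrupted once per `bytes_per_irq` bytes.
// bytes_per_irq == 1 models the per-byte RX interrupt; a larger value models
// one interrupt per filled block.
constexpr std::uint32_t irqs_per_second(std::uint32_t baud, std::uint32_t bytes_per_irq) {
    return uart_bytes_per_second(baud) / bytes_per_irq;
}

static_assert(irqs_per_second(115200, 1) == 11520, "one interrupt per byte");
static_assert(irqs_per_second(115200, 1024) == 11, "one interrupt per 1 KB block");
```

Under these assumptions the per-byte approach costs roughly eleven thousand interrupts per second, while one interrupt per 1 KB block costs about eleven.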

Now imagine doing this while:

  • Driving an RGB LCD
  • Running BLE
  • Playing audio
  • Handling sensors

You lose determinism.


DMA Architecture Inside the MCU

In modern MCUs (e.g., STM32, NXP, ESP32-S3, ARM SoCs), DMA is implemented as a bus master.

It can access system memory like the CPU.

          +-------------+
          |     CPU     |
          +-------------+
                 |
------------------------------------
|               Bus                |
------------------------------------
    |            |            |
+--------+   +--------+   +--------+
|  RAM   |   |  UART  |   |  SPI   |
+--------+   +--------+   +--------+
    ^
    |
+-------------+
| DMA Engine  |
+-------------+

The DMA controller:

  • Reads from source
  • Writes to destination
  • Competes with the CPU for bus access (the bus arbiter decides who goes first)
  • Generates interrupt when done

Now the CPU is free to execute other tasks.


How a DMA Transfer Works Step by Step

Let’s walk through it as an embedded engineer.

Example: SPI transmitting framebuffer to LCD.

  1. CPU configures DMA registers:
       • Source address (framebuffer in RAM)
       • Destination address (SPI data register)
       • Transfer size
       • Mode (normal/circular)
  2. CPU enables DMA channel
  3. DMA waits for SPI request
  4. SPI generates DMA request
  5. DMA moves data word-by-word
  6. Transfer complete interrupt fires

CPU:  [Config DMA]------------------[Working on other task]

DMA:           [Transfer Transfer Transfer Transfer]

SPI:           [Request][Request][Request][Request]

The key idea:

The CPU becomes the control plane; DMA becomes the data plane.
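
The walkthrough above can be sketched as a small host-side model. This is not a driver for any real MCU: `DmaChannel`, `run_transfer` and the `spi_dr` variable standing in for the SPI data register are all hypothetical names, and a real controller does this in hardware through memory-mapped registers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of one memory-to-peripheral DMA channel (hypothetical names).
struct DmaChannel {
    const std::uint8_t* src = nullptr;   // source address (framebuffer in RAM)
    std::uint8_t* dst_reg = nullptr;     // destination (stand-in for the SPI data register)
    std::size_t remaining = 0;           // transfer size, counts down to zero
    bool enabled = false;
    bool complete_irq = false;           // set when the last element has moved

    // Control plane: the CPU programs source, destination and length.
    void configure(const std::uint8_t* source, std::uint8_t* data_reg, std::size_t count) {
        src = source; dst_reg = data_reg; remaining = count;
        enabled = false; complete_irq = false;
    }
    // Control plane: the CPU enables the channel; after this it only reacts to requests.
    void enable() { enabled = remaining > 0; }

    // Data plane: the peripheral raises a request and the channel moves one element.
    void on_peripheral_request() {
        if (!enabled) return;
        *dst_reg = *src++;               // fixed destination, incrementing source
        if (--remaining == 0) {          // last element: signal transfer complete
            enabled = false;
            complete_irq = true;
        }
    }
};

// Drives the model and returns every byte that reached the "SPI data register".
std::vector<std::uint8_t> run_transfer(const std::vector<std::uint8_t>& framebuffer) {
    std::uint8_t spi_dr = 0;
    std::vector<std::uint8_t> wire;
    DmaChannel ch;
    ch.configure(framebuffer.data(), &spi_dr, framebuffer.size());
    ch.enable();
    while (ch.enabled) {                 // each iteration models one SPI request
        ch.on_peripheral_request();
        wire.push_back(spi_dr);          // the SPI block shifts the byte out
    }
    return wire;
}
```

Note that the CPU-side code only touches `configure()` and `enable()`; everything in `on_peripheral_request()` is what the hardware would do on its own.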

Core DMA Modes

Normal Mode

Transfers a fixed number of elements once.

Memory ---> Peripheral
   1024 bytes
   Stop

Used for:

  • File read
  • LCD frame update
  • Block transfer

Circular Mode

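After the last element, the DMA controller reloads its transfer counter and start address and begins again at the start of the buffer, with no CPU intervention.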

Buffer Start ------------------ Buffer End
     ^                              |
     |______________________________|

Used for:

  • ADC continuous sampling
  • Audio streaming
  • Sensor acquisition

In RTOS systems, this is powerful for real-time pipelines.
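
A minimal host-side sketch of the wrap-around behaviour follows. It assumes the controller can signal both a half-full and a full buffer (many controllers offer a half-transfer interrupt, but check your part's reference manual); `CircularDma` and its members are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Host-side model of a circular DMA channel writing into a fixed buffer.
template <std::size_t N>
struct CircularDma {
    static_assert(N >= 2 && N % 2 == 0, "buffer must split into two halves");
    std::array<std::uint16_t, N> buffer{};
    std::size_t index = 0;         // next write position
    unsigned half_events = 0;      // first half just filled
    unsigned full_events = 0;      // second half just filled, index wrapped to 0

    // One peripheral request: store a sample, advance, wrap at the end.
    void on_sample(std::uint16_t sample) {
        buffer[index++] = sample;
        if (index == N / 2) {
            ++half_events;         // CPU may now process buffer[0 .. N/2)
        } else if (index == N) {
            ++full_events;         // CPU may now process buffer[N/2 .. N)
            index = 0;             // automatic restart, no CPU reconfiguration
        }
    }
};
```

The two events give a ping-pong scheme: while DMA fills one half, a task processes the other, as long as the task finishes before DMA comes back around.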

Memory-to-Memory Mode

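DMA can copy a block of memory while the CPU does other work. On some MCUs it is also faster than a CPU copy loop, but that depends on the bus and the DMA engine.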

RAM A  -------->  RAM B

Used for:

  • Framebuffer copy
  • Data duplication
  • Crypto preprocessing

DMA in Real Embedded Examples

UART RX with Circular DMA

Instead of one interrupt per byte:

UART --> DMA --> Circular Buffer in RAM

CPU periodically processes:

while (new_data_available)
    parse_packet();

This dramatically reduces interrupt load.

         +-------+
UART --->| DMA   |----> [Ring Buffer in RAM]
         +-------+
                        ^        ^
                        |        |
                    Head Ptr   Tail Ptr
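
The consumer side of that ring buffer might look like the sketch below. It assumes the DMA controller exposes a down-counter of elements remaining before it wraps, from which the head (write) position can be derived; `RxRing` and its methods are hypothetical helpers, written so they can be tested on a host.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Consumer side of a DMA-filled ring buffer (hypothetical helper, host-testable).
// `head` is where DMA will write next; the internal tail is where the CPU reads next.
class RxRing {
public:
    RxRing(const std::uint8_t* buf, std::size_t size) : buf_(buf), size_(size) {}

    // Assuming a down-counter of elements left before wrap,
    // the DMA write position is size - remaining (0 when the counter has just reloaded).
    std::size_t head_from_remaining(std::size_t remaining) const {
        return (size_ - remaining) % size_;
    }

    // Copy out everything between tail and head, handling wrap-around.
    std::vector<std::uint8_t> drain(std::size_t head) {
        std::vector<std::uint8_t> out;
        while (tail_ != head) {
            out.push_back(buf_[tail_]);
            tail_ = (tail_ + 1) % size_;
        }
        return out;
    }

private:
    const std::uint8_t* buf_;
    std::size_t size_;
    std::size_t tail_ = 0;
};
```

This sketch cannot detect an overrun: if DMA laps the tail before the CPU drains the buffer, old bytes are silently overwritten, so the buffer must be sized for the worst-case processing delay. On a real target the buffer would also need to be treated as volatile or kept coherent with the cache.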

RGB LCD Framebuffer (ESP32-S3 Case)

In RGB panels:

Framebuffer in PSRAM
        ↓
      DMA
        ↓
RGB Peripheral
        ↓
     LCD Panel

If the CPU tried to do this manually:

  • UI animations would tear
  • Wi-Fi would jitter
  • Audio would glitch

DMA ensures:

  • Continuous stream
  • Deterministic timing
  • Minimal CPU load

ADC + DMA for Sensor Systems

In wearables or industrial systems:

ADC --> DMA --> RAM Buffer

Then:

Task processes samples
→ Filtering
→ Sensor Fusion
→ Metrics

Without DMA, every sample costs an interrupt or a polling loop, so high sample rates quickly break real-time constraints.
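
As a sketch of the processing task, the functions below average one block of samples that DMA has already placed in RAM and convert the result to millivolts. The 12-bit resolution and the 3300 mV reference are illustrative assumptions, and `block_mean` and `to_millivolts` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Average one block of raw ADC samples that DMA has already placed in RAM.
// Integer math only; the 32-bit sum cannot overflow for blocks of up to
// 65536 16-bit samples.
std::uint16_t block_mean(const std::uint16_t* samples, std::size_t count) {
    if (count == 0) return 0;
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += samples[i];
    }
    return static_cast<std::uint16_t>(sum / count);
}

// Convert a raw code to millivolts. Assumes a 12-bit ADC (codes 0..4095) and
// the given reference voltage; both are illustrative values.
std::uint32_t to_millivolts(std::uint16_t raw, std::uint32_t vref_mv = 3300) {
    return (static_cast<std::uint32_t>(raw) * vref_mv) / 4095u;
}
```

The task runs once per block rather than once per sample, which is what keeps the CPU cost flat as the sample rate rises.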


DMA and RTOS Interaction

In an RTOS system:

Task A configures DMA
DMA runs independently
DMA interrupt wakes Task A

Flow:

Task A:
   start_dma();
   wait_for_notification();

ISR:
   notify_task();

Diagram:

Task A ----[Start DMA]----> (Blocked)
                   |
                   v
                 DMA Engine
                   |
             [Transfer Complete]
                   |
                   v
                 ISR
                   |
                   v
                Task A resumes

This creates:

  • Non-blocking I/O
  • Efficient multitasking
  • Deterministic response
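
The handshake can be modelled on a host without a real RTOS. The sketch below is a single-threaded state model, not scheduler code: `Task`, `Dma` and the three functions are hypothetical stand-ins for the pseudocode names used above. On FreeRTOS, the same pattern is commonly written with `ulTaskNotifyTake()` in the task and `vTaskNotifyGiveFromISR()` in the interrupt handler.

```cpp
#include <cstdint>

// Host-side model of the task/ISR handshake (not a real RTOS).
enum class TaskState { Ready, Blocked };

struct Task {
    TaskState state = TaskState::Ready;
    std::uint32_t pending = 0;     // notifications delivered but not yet consumed
};

struct Dma {
    bool busy = false;
};

void start_dma(Dma& dma) { dma.busy = true; }

// Task side: consume a notification if one is pending, otherwise block.
// Returns true when the task may continue running.
bool wait_for_notification(Task& task) {
    if (task.pending > 0) {
        --task.pending;
        task.state = TaskState::Ready;
        return true;
    }
    task.state = TaskState::Blocked;   // a scheduler would now run other tasks
    return false;
}

// ISR side: runs on transfer complete, does minimal work, wakes the task.
void dma_complete_isr(Dma& dma, Task& task) {
    dma.busy = false;
    ++task.pending;
    task.state = TaskState::Ready;     // task becomes runnable again
}
```

Counting notifications instead of using a plain flag means a completion that arrives before the task starts waiting is not lost.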

Bus Arbitration and Performance

DMA and CPU share the same bus.

If both try to access RAM:

  • Bus arbitration decides priority
  • Can cause latency spikes

In high-performance systems:

  • DMA priority levels exist
  • Burst mode improves throughput
  • AXI/AHB bus matrix allows parallel access

If not configured correctly:

  • CPU stalls
  • Cache thrashing
  • Real-time jitter

Cache Coherency Problem (Critical in Cortex-A / ESP32-S3)

When cache is enabled:

CPU writes to cache
DMA reads from RAM

Problem: CPU Cache != RAM

DMA reads stale data.

Solution:

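  • Cache clean (write-back) before DMA reads a buffer the CPU has written
  • Cache invalidate before the CPU reads a buffer DMA has written
  • Use a non-cacheable memory region for DMA buffers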

This is critical in:

  • Embedded Linux
  • Cortex-A
  • High-speed peripherals
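
The stale-data problem and the two maintenance operations can be shown with a toy model of a single cached word. This is a deliberately simplified sketch: real caches work on whole lines and the maintenance calls are platform-specific, so `CachedWord` and its methods are hypothetical.

```cpp
#include <cstdint>

// Toy model of one cached word: a RAM copy, a cache copy and two status flags.
struct CachedWord {
    std::uint32_t ram = 0;
    std::uint32_t cache = 0;
    bool valid = false;   // cache holds a copy
    bool dirty = false;   // cache copy is newer than RAM

    // Write-back cache: a CPU write lands in the cache only.
    void cpu_write(std::uint32_t v) { cache = v; valid = true; dirty = true; }

    // CPU read: hit returns the cached copy, miss fills from RAM first.
    std::uint32_t cpu_read() {
        if (!valid) { cache = ram; valid = true; dirty = false; }
        return cache;
    }

    // DMA bypasses the cache in both directions.
    std::uint32_t dma_read() const { return ram; }
    void dma_write(std::uint32_t v) { ram = v; }

    // Clean: write a dirty cached copy back to RAM.
    void clean() { if (valid && dirty) { ram = cache; dirty = false; } }

    // Invalidate: drop the cached copy so the next CPU read goes to RAM.
    void invalidate() { valid = false; dirty = false; }
};
```

In this model, `dma_read()` after `cpu_write()` returns the old RAM value until `clean()` is called, and `cpu_read()` after `dma_write()` returns the old cached value until `invalidate()` is called.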

Comparing CPU vs DMA Transfer

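| Feature           | CPU Copy | DMA      |
|-------------------|----------|----------|
| CPU Load          | High     | Low      |
| Interrupt Rate    | High     | Low      |
| Power Consumption | High     | Lower    |
| Determinism       | Worse    | Better   |
| Setup Complexity  | Simple   | Moderate |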

When NOT to Use DMA

DMA has overhead:

  • Setup time
  • Interrupt latency
  • Bus contention

For very small transfers (e.g., 4 bytes), a CPU copy is usually faster.

Rule of thumb:

Use DMA for medium to large data blocks or continuous streams.
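
The rule of thumb can be turned into a rough break-even estimate. The cost model below only counts cycles the CPU itself spends, and every constant is an illustrative assumption; `CopyCost` and the helper functions are hypothetical, so measure on the target before relying on any threshold.

```cpp
#include <cstdint>

// Rough cost model in CPU cycles. All values passed in are illustrative
// assumptions, not measurements from a particular MCU.
struct CopyCost {
    std::uint32_t dma_setup_cycles;      // register writes to configure and start DMA
    std::uint32_t dma_irq_cycles;        // completion interrupt entry, handler, exit
    std::uint32_t cpu_cycles_per_byte;   // cost of the CPU copy loop per byte
};

// Cycles the CPU spends copying `bytes` itself.
constexpr std::uint32_t cpu_copy_cycles(const CopyCost& c, std::uint32_t bytes) {
    return c.cpu_cycles_per_byte * bytes;
}

// Cycles the CPU spends on a DMA transfer: fixed overhead only.
constexpr std::uint32_t dma_cpu_cycles(const CopyCost& c) {
    return c.dma_setup_cycles + c.dma_irq_cycles;
}

// DMA pays off once the CPU copy would cost more than the fixed overhead.
constexpr bool dma_worthwhile(const CopyCost& c, std::uint32_t bytes) {
    return cpu_copy_cycles(c, bytes) > dma_cpu_cycles(c);
}
```

With an assumed 60 cycles of setup, 40 cycles of interrupt handling and 4 cycles per byte, the break-even sits at 25 bytes: a 4-byte transfer is cheaper on the CPU, while a 1 KB block clearly favours DMA.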

In a modern system:

  +----------------+
  |      CPU       |
  +----------------+
        |     ^
        |     |
-----------------------
|     Bus Matrix      |
-----------------------
   |       |       |
+-----+ +-----+ +-----+
| RAM | | DMA | | Per |
+-----+ +-----+ +-----+

The CPU becomes the orchestrator.

DMA becomes the transporter.

Peripherals become producers and consumers.