CPU cycles are gold. If the CPU wastes time copying bytes from UART to RAM or from RAM to an SPI LCD, your real-time guarantees collapse.
This is where DMA (Direct Memory Access) becomes a core architectural element.
DMA is not just a peripheral feature — it is a system-level performance and determinism enabler.
What Is DMA?
Direct Memory Access (DMA) is a hardware mechanism that allows peripherals to transfer data directly to or from memory without continuous CPU intervention.
Instead of:
Peripheral → CPU → Memory → CPU → Peripheral
DMA allows:
Peripheral → Memory → Peripheral
The CPU only:
- Configures the transfer
- Starts it
- Gets notified when it finishes (usually via interrupt)
After that, hardware takes over.
The Classical CPU-Based Transfer Problem
Let’s first see what happens without DMA.
Example: UART receiving 1 KB of data.
1. UART RX interrupt fires
2. CPU reads 1 byte
3. CPU writes byte to RAM
4. Repeat 1024 times
The per-byte data path looks like this:
+--------+
UART -->| CPU |--> RAM
+--------+
(1024 times)
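The interrupt-per-byte pattern above can be sketched in a few lines. This is a host-runnable simulation, not a real driver: `uart_dr` stands in for the UART data register and `uart_rx_isr()` for the RX-not-empty interrupt handler.

```c
#include <stdint.h>
#include <stddef.h>

#define RX_LEN 1024

static volatile uint8_t uart_dr;   /* simulated UART data register */
static uint8_t rx_buf[RX_LEN];
static size_t  rx_idx;

/* Fires once per byte: 1024 interrupts to receive 1 KB. */
void uart_rx_isr(void)
{
    if (rx_idx < RX_LEN)
        rx_buf[rx_idx++] = uart_dr;  /* CPU reads register, writes RAM */
}
```

Every one of those 1024 invocations costs context-save, handler entry, and context-restore, which is exactly the overhead DMA removes.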
Problems:
- CPU overhead
- High interrupt rate
- Increased latency
- Power consumption
- RTOS scheduling disturbance
Now imagine doing this while:
- Driving an RGB LCD
- Running BLE
- Playing audio
- Handling sensors
You lose determinism.
DMA Architecture Inside the MCU
In modern MCUs (e.g., STM32, NXP, ESP32-S3, ARM SoCs), DMA is implemented as a bus master.
It can access system memory like the CPU.
+-------------+
| CPU |
+-------------+
|
-----------------------------
| Bus |
-----------------------------
| | |
+--------+ +--------+ +--------+
| RAM | | UART | | SPI |
+--------+ +--------+ +--------+
^
|
+-------------+
| DMA Engine |
+-------------+
The DMA controller:
- Reads from source
- Writes to destination
- Arbitrates bus access
- Generates interrupt when done
Now the CPU is free to execute tasks.
How a DMA Transfer Works Step by Step
Let’s walk through it as an embedded engineer.
Example: SPI transmitting framebuffer to LCD.
- CPU configures DMA registers:
- Source address (framebuffer in RAM)
- Destination address (SPI data register)
- Transfer size
- Mode (normal/circular)
- CPU enables DMA channel
- DMA waits for SPI request
- SPI generates DMA request
- DMA moves data word-by-word
- Transfer complete interrupt fires
CPU: [Config DMA]------------------[Working on other task]
DMA: [Transfer Transfer Transfer Transfer]
SPI: [Request][Request][Request][Request]
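The configuration steps above can be captured in a minimal channel model. The struct fields and function names here are hypothetical, but they map one-to-one onto the registers real controllers (STM32 DMA, ESP32-S3 GDMA) expose; the host-side `dma_start()` performs the whole transfer at once, whereas hardware moves data word-by-word on peripheral request.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DMA_MODE_NORMAL, DMA_MODE_CIRCULAR } dma_mode_t;

/* Hypothetical, simplified model of one DMA channel. */
typedef struct {
    const uint8_t *src;                /* source: framebuffer in RAM       */
    uint8_t       *dst;                /* destination: a buffer here; the
                                          SPI data register on hardware    */
    size_t         count;              /* transfer size                    */
    dma_mode_t     mode;               /* normal / circular                */
    void         (*on_complete)(void); /* transfer-complete interrupt      */
} dma_channel_t;

/* Host-side stand-in for the hardware engine: move the data,
 * then "raise" the completion interrupt. */
static void dma_start(dma_channel_t *ch)
{
    memcpy(ch->dst, ch->src, ch->count);
    if (ch->on_complete)
        ch->on_complete();
}
```

The CPU's entire job is filling in that struct and calling `dma_start()`; everything after that happens without it.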
The key idea:
CPU becomes control plane, DMA becomes data plane.
Core DMA Modes
Normal Mode
Transfers a fixed number of elements once.
Memory ---> Peripheral
1024 bytes
Stop
Used for:
- File read
- LCD frame update
- Block transfer
Circular Mode
After finishing, DMA restarts automatically.
Buffer Start ------------------ Buffer End
^ |
|______________________________|
Used for:
- ADC continuous sampling
- Audio streaming
- Sensor acquisition
In RTOS systems, this is powerful for real-time pipelines.
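Circular mode is usually paired with half-complete and full-complete interrupts so the CPU processes one half of the buffer while DMA fills the other (ping-pong buffering). A sketch of that callback structure, with illustrative names rather than a specific vendor API:

```c
#include <stddef.h>
#include <stdint.h>

#define ADC_BUF_LEN 8
static uint16_t adc_buf[ADC_BUF_LEN];
static uint32_t samples_processed;

static void process_samples(const uint16_t *s, size_t n)
{
    (void)s;
    samples_processed += n;  /* filtering / sensor fusion would go here */
}

/* DMA has filled the first half and is now writing the second. */
void dma_half_complete_isr(void)
{
    process_samples(&adc_buf[0], ADC_BUF_LEN / 2);
}

/* DMA has wrapped around; the second half is ready while it
 * rewrites the first. */
void dma_full_complete_isr(void)
{
    process_samples(&adc_buf[ADC_BUF_LEN / 2], ADC_BUF_LEN / 2);
}
```

As long as each half is processed before DMA wraps back into it, the stream never drops a sample.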
Memory-to-Memory Mode
In many cases, DMA can copy memory faster than the CPU can.
RAM A --------> RAM B
Used for:
- Framebuffer copy
- Data duplication
- Crypto preprocessing
DMA in Real Embedded Examples
UART RX with Circular DMA
Instead of interrupt per byte:
UART --> DMA --> Circular Buffer in RAM
CPU periodically processes:
while (new_data_available)
parse_packet();
This dramatically reduces interrupt load.
+-------+
UART --->| DMA |----> [Ring Buffer in RAM]
+-------+
^ ^
| |
Head Ptr Tail Ptr
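The consumer side of that ring buffer can be sketched as follows. `dma_write_index()` is a stand-in for reading the hardware write position (on many controllers, buffer length minus the channel's remaining-transfer counter); here it is simulated so the logic runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>

#define RB_LEN 256

static uint8_t rb[RB_LEN];    /* ring buffer filled by circular DMA */
static size_t  tail;          /* CPU read position (Tail Ptr)       */
static size_t  sim_head;      /* simulated DMA write position       */

/* On hardware: derive this from the DMA channel's counter register. */
static size_t dma_write_index(void) { return sim_head; }

/* Drain everything DMA has written since the last call, with wrap. */
size_t rb_read(uint8_t *out, size_t max)
{
    size_t head = dma_write_index();   /* Head Ptr, owned by hardware */
    size_t n = 0;
    while (tail != head && n < max) {
        out[n++] = rb[tail];
        tail = (tail + 1) % RB_LEN;
    }
    return n;
}
```

The CPU never touches the head pointer; it only advances the tail, so no locking is needed between the DMA engine and the task.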
LCD Framebuffer Streaming (ESP32-S3 Case)
In RGB panels:
Framebuffer in PSRAM
↓
DMA
↓
RGB Peripheral
↓
LCD Panel
If the CPU tried this manually:
- UI animations would tear
- Wi-Fi would jitter
- Audio would glitch
DMA ensures:
- Continuous stream
- Deterministic timing
- Minimal CPU load
ADC + DMA for Sensor Systems
In wearables or industrial systems:
ADC --> DMA --> RAM Buffer
Then:
Task processes samples
→ Filtering
→ Sensor Fusion
→ Metrics
Without DMA, high sample rates break real-time constraints.
DMA and RTOS Interaction
In an RTOS system:
Task A configures DMA
DMA runs independently
DMA interrupt wakes Task A
Flow:
Task A:
start_dma();
wait_for_notification();
ISR:
notify_task();
Diagram:
Task A ----[Start DMA]----> (Blocked)
|
v
DMA Engine
|
[Transfer Complete]
|
v
ISR
|
v
Task A resumes
This creates:
- Non-blocking I/O
- Efficient multitasking
- Deterministic response
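A minimal model of that handshake, with the RTOS primitive replaced by a plain flag so the flow runs anywhere. Under FreeRTOS, the flag would be a task notification: the ISR calls `vTaskNotifyGiveFromISR()` and the task blocks in `ulTaskNotifyTake()`.

```c
#include <stdbool.h>

static volatile bool dma_done;

/* ISR side: the transfer-complete interrupt notifies the task. */
void dma_complete_isr(void)
{
    dma_done = true;
}

/* Task side: in an RTOS this call blocks; here it polls once. */
bool task_poll_dma_done(void)
{
    if (dma_done) {
        dma_done = false;   /* consume the notification */
        return true;
    }
    return false;
}
```

The important property is that the task consumes zero CPU while the transfer is in flight; it only runs again when the ISR releases it.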
Bus Arbitration and Performance
DMA and CPU share the same bus.
If both try to access RAM:
- Bus arbitration decides priority
- Can cause latency spikes
In high-performance systems:
- DMA priority levels exist
- Burst mode improves throughput
- AXI/AHB bus matrix allows parallel access
If not configured correctly:
- CPU stalls
- Cache thrashing
- Real-time jitter
Cache Coherency Problem (Critical in Cortex-A / ESP32-S3)
When the data cache is enabled:
CPU writes land in the cache
DMA reads straight from RAM
Problem: CPU cache != RAM
DMA reads stale data. In the reverse direction, the CPU can read stale cache lines after a DMA write.
Solution:
- Cache clean (write-back)
- Cache invalidate
- Use non-cacheable memory region
This is critical in:
- Embedded Linux
- Cortex-A
- High-speed peripherals
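The two maintenance operations slot into the driver at fixed points. This sketch stubs the cache calls so it runs anywhere; on a real Cortex-M7 they would be the CMSIS functions `SCB_CleanDCache_by_Addr` and `SCB_InvalidateDCache_by_Addr`, and the stubs count calls only so the sequence can be checked.

```c
#include <stddef.h>

static int cleans, invalidates;

/* Stubs for the real cache-maintenance primitives. */
static void cache_clean(const void *addr, size_t len)
{
    (void)addr; (void)len; cleans++;
}
static void cache_invalidate(const void *addr, size_t len)
{
    (void)addr; (void)len; invalidates++;
}

/* CPU -> DMA -> peripheral: flush cached writes to RAM
 * before the DMA engine reads the buffer. */
void dma_tx_prepare(const void *buf, size_t len)
{
    cache_clean(buf, len);       /* write-back: RAM now matches cache */
    /* ... start the DMA read from buf here ... */
}

/* Peripheral -> DMA -> CPU: drop stale cache lines before the CPU
 * reads what DMA just wrote to RAM. */
void dma_rx_finish(const void *buf, size_t len)
{
    cache_invalidate(buf, len);  /* next CPU access fetches fresh RAM */
    /* ... process buf here ... */
}
```

Getting the direction wrong (invalidating before TX, cleaning after RX) is one of the most common DMA bugs on cached cores.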
Comparing CPU vs DMA Transfer
| Feature | CPU Copy | DMA |
|---|---|---|
| CPU Load | High | Low |
| Interrupt Rate | High | Low |
| Power Consumption | High | Lower |
| Determinism | Worse | Better |
| Setup Complexity | Simple | Moderate |
When NOT to Use DMA
DMA has overhead:
- Setup time
- Interrupt latency
- Bus contention
For very small transfers (e.g., 4 bytes), CPU copy is faster.
Rule of thumb:
Use DMA for medium to large data blocks or continuous streams.
In a modern system:
+----------------+
| CPU |
+----------------+
| ^
| |
-------------------
| Bus Matrix |
-------------------
| | |
+-----+ +-----+ +-----+
| RAM | | DMA | | Per |
+-----+ +-----+ +-----+
The CPU becomes orchestrator.
DMA becomes transporter.
Peripherals become producers/consumers.