Audio DSP Without an FPU: Inside an Embedded Rust Guitar Trainer

Orhun Parmaksız, the developer behind tools like git-cliff and a core contributor to ratatui, recently published tuitar, a guitar trainer built on a Raspberry Pi Pico with embedded Rust. The project runs on roughly $15 of hardware: an RP2040 microcontroller, a MAX4466 electret microphone module, and a 128x64 SSD1306 OLED display. It samples audio from the mic, detects the fundamental pitch of a plucked string, and shows the note name and cent deviation on screen in real time.

The immediate reaction to projects like this is often that the hardware is underpowered for the task. That reaction is worth examining carefully, because it leads directly to the most interesting parts of the design.

What the RP2040 Gives You, and What It Doesn’t

The RP2040 is a dual-core Cortex-M0+ running at up to 133 MHz with 264 KB of SRAM and 2 MB of flash. The M0+ core is compact and power-efficient; it is not a DSP powerhouse. There is no floating-point unit. Every f32 addition, multiplication, or division runs as a software routine, typically costing tens of cycles or more per operation where a hardware FPU would need one to a few. For audio DSP, which often involves tight loops over thousands of samples, this is the central constraint around which everything else must be designed.

The ADC is a 12-bit successive approximation converter, shared across four channels. It is adequate but not exceptional; the noise floor is high enough that raw samples need filtering before they are useful for pitch detection.

Why Pitch Detection Is Harder Than It Looks

Guitar strings range from E2 at 82.4 Hz up to roughly the 12th fret of the high E string at about 660 Hz, with standard tuning placing the open strings at 82.4, 110, 146.8, 196, 246.9, and 329.6 Hz. A tuner needs to detect these frequencies to within about ±1 cent (a cent is 1/100th of a semitone; adjacent semitones differ by about 6% in frequency).
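To make the cent arithmetic concrete, here is a minimal host-side sketch that maps a frequency to its nearest equal-tempered note and the deviation in cents. It assumes A4 = 440 Hz and MIDI note numbering; the function names are hypothetical and not taken from tuitar. It uses std float methods, so firmware would need micromath or libm for log2.

```rust
/// Nearest equal-tempered note (as a MIDI number) and the deviation from it
/// in cents. Assumes A4 = 440 Hz, which is MIDI note 69.
pub fn nearest_note(freq_hz: f64) -> (i32, f64) {
    // Distance from A4 in fractional semitones: 12 * log2(f / 440).
    let semitones = 12.0 * (freq_hz / 440.0).log2();
    let nearest = semitones.round();
    // One semitone is 100 cents, so the remainder scales by 100.
    let cents = (semitones - nearest) * 100.0;
    (69 + nearest as i32, cents)
}

/// Note name with octave for a MIDI number, e.g. 40 -> "E2".
pub fn note_name(midi: i32) -> String {
    const NAMES: [&str; 12] = [
        "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B",
    ];
    let octave = midi.div_euclid(12) - 1;
    format!("{}{}", NAMES[midi.rem_euclid(12) as usize], octave)
}
```

Calling nearest_note(82.4) yields MIDI note 40, which note_name renders as "E2", with a deviation of a fraction of a cent. A string sounding at 84 Hz maps to the same note but reads roughly 33 cents sharp.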

The obvious approach is FFT. Capture a buffer of samples, run a fast Fourier transform, find the peak bin, convert to frequency. The problem is frequency resolution. Resolution equals sample_rate / N where N is the buffer size. At an 8 kHz sample rate with a 512-sample buffer, the resolution is ~15.6 Hz per bin. At E2 (82.4 Hz), one bin spans roughly 300 cents, about three semitones, which is useless. To make a bin only 2 cents wide (±1 cent) at E2, you need a buffer of roughly 84,000 samples, which at 8 kHz takes about 10.5 seconds to collect. Raising the sample rate does not help, because resolution depends only on capture duration (N / sample_rate). Zero-padding or interpolating around the peak bin helps somewhat, but FFT-based approaches require awkward compromises for low-frequency instruments.
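The resolution argument is easy to check numerically. The following host-side sketch computes the bin width, the musical interval one bin covers at a given frequency, and the buffer length needed for a target bin width in cents. The helper names are hypothetical.

```rust
/// Width of one FFT bin in Hz: sample_rate / N.
pub fn bin_width_hz(sample_rate_hz: f64, n: usize) -> f64 {
    sample_rate_hz / n as f64
}

/// Interval in cents between `freq_hz` and `freq_hz + delta_hz`.
pub fn cents_between(freq_hz: f64, delta_hz: f64) -> f64 {
    1200.0 * ((freq_hz + delta_hz) / freq_hz).log2()
}

/// Buffer length needed so that one bin above `freq_hz` spans at most `cents`.
pub fn samples_for_cents(sample_rate_hz: f64, freq_hz: f64, cents: f64) -> usize {
    // Frequency step corresponding to `cents` above `freq_hz`.
    let delta_hz = freq_hz * (2f64.powf(cents / 1200.0) - 1.0);
    (sample_rate_hz / delta_hz).ceil() as usize
}
```

For example, bin_width_hz(8000.0, 512) is 15.625, and cents_between(82.4, 15.625) shows how many cents that single bin covers at E2. Doubling both the sample rate and N in bin_width_hz leaves the result unchanged, which is why only a longer capture improves resolution.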

The YIN algorithm, published by de Cheveigné and Kawahara in 2002, sidesteps this by working in the time domain. It computes the squared difference function across lags:

d(τ) = Σ (x[j] - x[j+τ])²  for j in 1..W, where W is the integration window

The pitch period corresponds to the lag τ where the signal most closely repeats itself, producing a minimum in d(τ). YIN then applies a cumulative mean normalization step to suppress spurious minima:

d'(τ) = d(τ) / [(1/τ) Σ d(j) for j in 1..τ], with d'(0) = 1

This normalized function is thresholded (typically at 0.1 to 0.15) to find the first deep minimum, which gives the fundamental period. Parabolic interpolation around the minimum refines the estimate to sub-sample accuracy. The resulting frequency resolution depends on the signal itself rather than the buffer length, and accuracy in practice is well within 1 cent for clean guitar tones.
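The four steps (difference function, cumulative mean normalization, thresholding, parabolic interpolation) can be sketched as follows. This is a simplified illustration rather than tuitar’s implementation: the window size W, the function name, and the choice of i64 accumulators are all assumptions, and the original paper includes refinements that are omitted here. It uses fixed-size arrays only, so it needs no allocator.

```rust
/// Integration window; lags 1..W are searched (assumed value).
pub const W: usize = 256;

/// Simplified YIN pitch estimate over a buffer of 2*W samples.
/// Returns None when no lag falls below `threshold`.
pub fn yin_pitch(x: &[i16; 2 * W], sample_rate_hz: f32, threshold: f32) -> Option<f32> {
    // Step 1: difference function d(tau), accumulated in integers.
    let mut d = [0i64; W];
    for tau in 1..W {
        let mut acc = 0i64;
        for j in 0..W {
            let diff = x[j] as i64 - x[j + tau] as i64;
            acc += diff * diff;
        }
        d[tau] = acc;
    }

    // Step 2: cumulative mean normalized difference, with d'(0) = 1.
    let mut cmnd = [1.0f32; W];
    let mut running = 0i64;
    for tau in 1..W {
        running += d[tau];
        if running != 0 {
            cmnd[tau] = d[tau] as f32 * tau as f32 / running as f32;
        }
    }

    // Step 3: first lag under the threshold, then walk down to the local minimum.
    let mut tau = 2;
    while tau < W - 1 {
        if cmnd[tau] < threshold {
            while tau + 1 < W - 1 && cmnd[tau + 1] < cmnd[tau] {
                tau += 1;
            }
            // Step 4: parabolic interpolation through the three points
            // around the minimum gives a sub-sample period estimate.
            let (a, b, c) = (cmnd[tau - 1], cmnd[tau], cmnd[tau + 1]);
            let denom = a - 2.0 * b + c;
            let shift = if denom != 0.0 { 0.5 * (a - c) / denom } else { 0.0 };
            return Some(sample_rate_hz / (tau as f32 + shift));
        }
        tau += 1;
    }
    None
}
```

With W = 256 at 8 kHz the longest searchable period is 255 samples, or about 31 Hz, which leaves headroom below E2. The inner loop of step 1 is the hot path: it is pure integer arithmetic, while the float work in steps 2 to 4 runs only once per lag.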

For the RP2040, YIN has a significant practical advantage: its inner loop is expressible in integer arithmetic. Instead of multiplying floats, you can work in Q15 or Q31 fixed-point format, which uses 16-bit or 32-bit integers with an implied binary point. Multiplications become integer multiplies followed by a right shift. The RP2040’s M0+ cores include the single-cycle 32-bit multiplier, so an integer multiply costs one cycle; a software float multiply costs far more. This can reduce the per-sample processing time by an order of magnitude in the hot path.
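As a minimal sketch of what "integer multiply followed by a right shift" means in Q15, the helpers below convert to the format and multiply two values. The names are hypothetical; the conversion helper uses float methods and is meant for building tables or tests on a host, while q15_mul is integer-only.

```rust
/// Q15 stores value * 32768 in an i16, covering [-1.0, 1.0).
pub fn q15_from_f32(v: f32) -> i16 {
    let scaled = (v * 32768.0).round();
    // +1.0 itself is not representable, so saturate at the limits.
    scaled.clamp(-32768.0, 32767.0) as i16
}

/// Multiply two Q15 values using only integer operations.
pub fn q15_mul(a: i16, b: i16) -> i16 {
    // Widen to 32 bits so the product cannot overflow, then shift the
    // binary point back down by 15 places.
    let wide = (a as i32 * b as i32) >> 15;
    // -1.0 * -1.0 would give +1.0, which does not fit; saturate instead.
    wide.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}
```

Here q15_from_f32(0.5) is 16384, and q15_mul(16384, 16384) is 8192, which represents 0.25. The shift truncates rather than rounds, a common trade-off when the error of one least-significant bit is acceptable.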

The Embassy Architecture

tuitar uses Embassy, the async embedded Rust framework, to structure the application as three concurrent tasks.

The adc_task owns the microphone pin and reads samples continuously via DMA transfer into a fixed buffer. When a buffer fills, it sends the data to a Channel. The pitch_task receives that buffer, runs YIN, and publishes the detected note through a Signal. The display_task consumes the signal and renders the note name, a tuning bar, and a waveform visualization on the OLED at around 30 frames per second.

#[embassy_executor::task]
async fn adc_task(
    mut adc: Adc<'static, Async>,
    mut pin: Channel0,
    // DMA channel that read_many drives (DMA_CH0 is an assumed choice)
    mut dma: DMA_CH0,
    tx: Sender<'static, NoopRawMutex, [i16; SAMPLES], 2>,
) {
    let mut buf = [0i16; SAMPLES];
    loop {
        // Suspend until DMA has filled the buffer, then send a copy to pitch_task
        adc.read_many(&mut pin, &mut buf, 1, &mut dma).await.unwrap();
        tx.send(buf).await;
    }
}
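The third argument to read_many is a clock divider that sets the sample rate. As a hedged sketch, assuming the value is written directly to the INT field of the RP2040’s ADC DIV register and that the ADC runs from its nominal 48 MHz clock, the divider for a target rate could be computed like this. The helper name is hypothetical.

```rust
/// Nominal RP2040 ADC clock in Hz (assumed to be left at its default).
pub const ADC_CLK_HZ: u32 = 48_000_000;

/// Cycles one conversion takes on the RP2040 ADC.
pub const CONVERSION_CYCLES: u32 = 96;

/// Integer divider for a target sample rate. With FRAC = 0 the ADC
/// triggers once every (1 + INT) clock cycles.
pub fn adc_div_for_rate(sample_rate_hz: u32) -> Option<u16> {
    if sample_rate_hz == 0 {
        return None;
    }
    let cycles = ADC_CLK_HZ / sample_rate_hz;
    // A period shorter than one conversion cannot be honoured.
    if cycles < CONVERSION_CYCLES {
        return None;
    }
    // The INT field is 16 bits wide, so very low rates do not fit.
    u16::try_from(cycles - 1).ok()
}
```

Under these assumptions an 8 kHz sample rate corresponds to a divider of 5999, and the maximum rate of 500 kS/s to a divider of 95.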

The key property here is that none of this involves heap allocation. Embassy’s task system statically allocates each task at compile time using a TaskStorage struct generated by the #[embassy_executor::task] macro. The Channel type from embassy-sync is backed by a heapless::Deque, not a Box or Arc. The entire application runs in no_std + no_main mode with no allocator.

This is fundamentally different from both bare-metal interrupt-driven code and a traditional RTOS like FreeRTOS. Bare-metal approaches put all the coordination logic into interrupt service routines, which interact through shared mutable globals guarded by critical sections. The resulting code is correct when written carefully, but the flow of data is implicit and the borrow checker cannot verify it. FreeRTOS gives you real threads with preemptive scheduling, but each thread requires its own stack allocation (typically 1 to 4 KB minimum), and the C API is not Rust-friendly.

Embassy gives you cooperative async tasks driven by hardware interrupts. When adc.read_many().await suspends, the executor runs other ready tasks or sleeps, and the DMA completion interrupt wakes the task so the executor polls it again. No busy-waiting, no wasted cycles, and because adc_task owns the ADC and its pin by value, no other task can touch them.

The no_std DSP Stack

The supporting crate ecosystem is mature enough that building this project does not require writing everything from scratch. heapless provides fixed-capacity Vec and String types backed by inline arrays, making it possible to work with collections in no_std without an allocator. micromath provides fast approximate implementations of sqrt, sin, cos, and atan2 without pulling in libm; the approximations are close enough for pitch detection but avoid the code size and latency of full IEEE 754 compliance. The ssd1306 crate provides a complete I2C/SPI driver for the display, and embedded-graphics handles text rendering and line drawing for the waveform visualization.

For debugging, defmt provides structured logging over RTT (Real-Time Transfer) at very low overhead. Log messages are encoded as integer indices into a string table that stays in the ELF file on the host rather than in flash, so the logging call itself transmits only a few bytes over the debug wire and the host reconstructs the formatted string. probe-rs handles flashing and running; a second Pico running picoprobe firmware serves as the debug adapter, eliminating the need for a separate J-Link or ST-Link.

Comparing with the C/Arduino Approach

The most common RP2040 guitar tuner projects in C use the Arduino-Pico core with either the ArduinoFFT library or a hand-written autocorrelation loop. They are shorter and simpler to set up. They are also harder to reason about at the system level: shared state between the ADC interrupt and the main loop is guarded by disabling interrupts rather than by the type system, and there is no static verification that the audio buffer is not read while the DMA is writing to it.

The Rust version is more verbose in the setup code, but the coordination between tasks is enforced at compile time. The Channel API makes the data flow explicit: the compiler rejects code that tries to access the sample buffer from two tasks without synchronization. For a hobby project running on a $4 microcontroller, that might seem like overkill. For anyone planning to extend the firmware, it makes the system much easier to modify safely.

The performance characteristics end up comparable. Both approaches can keep the YIN computation under 5 ms per buffer at an 8 kHz sample rate. End-to-end latency from pluck to display update is necessarily longer, because the buffer must fill first (512 samples at 8 kHz take 64 ms) and the display refreshes at around 30 frames per second. The Rust version requires more initial investment in understanding the Embassy model, but the payoff is a codebase where the concurrency structure is explicit and the borrow checker has verified the memory access patterns.

Where This Leaves Embedded Rust Audio

The embedded Rust ecosystem for audio is functional but still has gaps. FFT support in no_std exists via crates like microfft, but the options are narrower than in C where ARM’s CMSIS-DSP library provides hardware-accelerated transforms for Cortex-M4F and above. High-quality audio capture generally still requires moving to a platform with an FPU and I2S support; the RP2040’s ADC is good enough for pitch detection but not for recording or playback at CD quality.

The async model in Embassy handles the concurrency patterns in audio pipelines, specifically producer-consumer buffer passing between DMA and processing tasks, better than most alternatives. As the ecosystem matures around the RP2350 and more capable Cortex-M33 and M55 targets, the gap between what embedded Rust can do and what embedded C can do in audio applications will narrow considerably. tuitar is a concrete demonstration that the toolchain is already capable enough to build something genuinely useful.