
Running Gemma 4 Audio Locally with MLX on Apple Silicon

Source: simonwillison

Google’s Gemma 4 release in April 2026 brought a lot of attention to its expanded multimodal capabilities, but the audio story got less coverage than it deserved. Simon Willison’s quick note on running Gemma 4 audio through mlx-audio on Apple Silicon is a useful pointer, and it opens up a broader conversation worth having: what does local audio AI inference actually look like right now, and how does the MLX stack change the calculus for developers on Macs?

What mlx-audio Is

mlx-audio is a library built on top of Apple’s MLX framework that brings audio generation and speech synthesis to Apple Silicon. MLX itself is Apple ML Research’s array computation framework, designed specifically for the unified memory architecture in M-series chips. The key design insight in MLX is that the CPU and GPU share the same memory pool, so tensors never have to be copied across a PCIe bus. For a pipeline that passes data between a language model backbone and an audio decoder, that zero-copy property is a meaningful advantage rather than a minor optimization.

mlx-audio exposes a Python API for text-to-speech and audio generation tasks, supporting a growing list of models. The library uses MLX’s lazy evaluation model, which means operations are not executed until their output is needed, allowing the runtime to fuse and optimize computation graphs before they hit the hardware.
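The idea of deferred evaluation can be illustrated with a toy sketch in plain Python. This is not MLX's implementation or its API: the `Lazy` class and the `constant` helper are hypothetical names invented for this illustration, and the sketch shows only deferral, not graph fusion. Building an expression records the work to be done; nothing is computed until a result is requested.

```python
class Lazy:
    """A deferred computation: remembers a function and its inputs, runs on demand."""

    evaluations = 0  # how many graph nodes have actually been computed

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs
        self._done = False
        self._value = None

    def eval(self):
        # Compute this node (and, recursively, its inputs) only on first request.
        if not self._done:
            args = [x.eval() if isinstance(x, Lazy) else x for x in self.inputs]
            self._value = self.fn(*args)
            self._done = True
            Lazy.evaluations += 1
        return self._value

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)


def constant(value):
    """Wrap a plain value as a leaf node of the graph."""
    return Lazy(lambda: value)


x = constant(3)
y = x * 2 + 1                # builds a three-node graph; no arithmetic runs yet
pending = Lazy.evaluations   # still 0 at this point
result = y.eval()            # forces the whole graph to run
print(pending, result)
```

A real framework can use the gap between building the graph and evaluating it to reorder or combine operations, which is the opportunity the paragraph above describes.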

Gemma 4 and Its Audio Capabilities

Gemma 4 is Google DeepMind’s fourth generation of the Gemma open model family. The release expanded multimodal support substantially, including audio understanding and generation capabilities that were not present in earlier versions. The model family ships in several sizes, with the smaller variants (the 4B range) being particularly relevant for local inference because they fit comfortably in the 16GB of unified memory found on base M-series MacBook configurations.

The audio features in Gemma 4 fall into two broad categories. The first is audio understanding: the model can take audio input and respond to it, transcribe it, or reason about it alongside text. The second, and the more interesting one for this context, is the ability to generate speech output, either as a standalone capability or as part of a pipeline where a language model drives a speech synthesizer.

When mlx-audio is used as the synthesis backend, Gemma 4 handles the language understanding and response generation, and mlx-audio handles converting that response into waveform audio. The handoff is clean because both components live in the same MLX memory space.

The Unified Memory Advantage in Practice

To understand why this matters, it helps to look at how the same pipeline would work on a CUDA-based machine. On a typical x86 system with a discrete GPU, you have system RAM and VRAM as separate physical pools. Running a language model on the GPU and an audio model on the CPU means serializing tensors, copying them across the PCIe bus, and deserializing on the other side. If both models are on the GPU, you are constrained to VRAM capacity, which on consumer hardware is typically 8 to 24GB.

Apple Silicon’s unified memory architecture collapses this distinction. The M4 Max, for instance, offers up to 128GB of unified memory that both the CPU cores and the GPU can address directly. In practice for local AI use, this means an M2 MacBook Air with 24GB can hold a 12B parameter model at 8-bit precision (roughly 12GB of weights) and still leave room for the operating system and an audio model, while a discrete GPU with 8GB of VRAM would need to drop to a smaller or more heavily quantized model to fit.

MLX takes advantage of this with its Metal backend. By default, operations run on the GPU via Metal, and any individual operation can be directed to the CPU instead by specifying a device or stream. Because both devices see the same memory, moving work between them does not require copying arrays.

Getting It Running

The setup is straightforward for anyone already in the MLX ecosystem:

pip install mlx-audio

From there, running audio generation involves a few lines of Python:

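import mlx_audio

# Load the model weights and tokenizer (pulled from Hugging Face on first use).
model, tokenizer = mlx_audio.load("google/gemma-4-audio")

# Generate a spoken response to the prompt.
output = mlx_audio.generate(
    model,
    tokenizer,
    prompt="Explain how attention mechanisms work in transformers.",
    voice="en_default",
    verbose=True,  # report token generation and audio synthesis timings
)

# Write the generated waveform to disk.
mlx_audio.save_audio("output.wav", output)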

The verbose=True flag surfaces token generation speed and audio synthesis time separately, which is useful for understanding where latency is coming from. On an M3 Pro, the language generation phase for a medium-length response typically runs at 35 to 55 tokens per second, and audio synthesis adds roughly 1 to 2 seconds of processing time on top of that for a paragraph-length output.
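To turn those two figures into an end-to-end number, a back-of-the-envelope estimate helps. The sketch below assumes the two phases run one after the other with no streaming overlap, and the 150-token response length is an illustrative assumption, not a measurement; `response_latency_s` is a hypothetical helper, not part of mlx-audio.

```python
def response_latency_s(n_tokens, tokens_per_s, synthesis_s):
    """Estimate total time for text generation followed by audio synthesis."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return n_tokens / tokens_per_s + synthesis_s


# Bracket the estimate with the fast and slow ends of the ranges quoted above.
best = response_latency_s(150, tokens_per_s=55, synthesis_s=1.0)
worst = response_latency_s(150, tokens_per_s=35, synthesis_s=2.0)
print(f"{best:.1f}s to {worst:.1f}s")
```

Under these assumptions text generation dominates the total, so streaming text into the synthesizer as it is produced would matter more than shaving time off synthesis itself.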

How This Compares to Cloud TTS Pipelines

Cloud-based text-to-speech has gotten very good. ElevenLabs, OpenAI’s TTS API, and Google’s own Cloud TTS all produce high-quality output with low latency for short inputs. But they share structural limitations that local inference avoids.

First, there is the round-trip latency. Even with streaming, you are adding network overhead; for interactive applications, 200 to 400ms of network latency before the first audio chunk arrives is the baseline. Second, there is cost at scale. Running TTS for every message in a chat application, or generating voiced responses in a game, adds up quickly at per-character pricing. Third, and increasingly important for enterprise applications, there is the data residency question. Audio synthesis that processes sensitive conversations should not be sending that data to third-party cloud endpoints.
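The cost point is easy to make concrete with a small calculation. In the sketch below, the per-character rate and the traffic figures are placeholder assumptions chosen for illustration, not any provider's actual pricing; `monthly_tts_cost_usd` is a hypothetical helper.

```python
def monthly_tts_cost_usd(messages_per_day, chars_per_message,
                         usd_per_million_chars, days=30):
    """Estimate monthly spend for a TTS service billed per character."""
    total_chars = messages_per_day * chars_per_message * days
    return total_chars / 1_000_000 * usd_per_million_chars


# Placeholder inputs: 10,000 messages a day, 300 characters each,
# at an assumed rate of 15 USD per million characters.
cost = monthly_tts_cost_usd(10_000, 300, usd_per_million_chars=15.0)
print(f"${cost:,.2f} per month")
```

Substituting a provider's published rate and your own traffic gives the figure to weigh against the one-time hardware cost of running locally.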

Local inference via MLX addresses all three. First-token latency on Apple Silicon is primarily determined by model load time (which can be amortized if the model is kept warm) and memory bandwidth. Once the model is loaded, inference starts fast. Marginal cost is zero beyond electricity. Data never leaves the device.

The trade-off is hardware. This workflow requires a Mac with an M-series chip and enough unified memory for the model. The 4B variant of Gemma 4 needs roughly 8GB for the language model component at 16-bit precision, and mlx-audio’s synthesis models add another 1 to 2GB depending on the voice quality setting. An M2 MacBook Air with 8GB is borderline even with a quantized model; 16GB is comfortable.
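A quick budget check makes the "borderline versus comfortable" judgment explicit. The sketch below uses this article's figures (about 8GB for the language model, 1 to 2GB for synthesis, and the roughly 2.5GB cited for the 4-bit model in the quantization section). The `usable_fraction` of 0.75 is an assumption standing in for memory reserved by the operating system and other apps, not a documented limit, and `fits_in_unified_memory` is a hypothetical helper.

```python
def fits_in_unified_memory(total_gb, lm_gb, synth_gb, usable_fraction=0.75):
    """Return True if both models fit in the share of memory assumed usable."""
    budget_gb = total_gb * usable_fraction
    return lm_gb + synth_gb <= budget_gb


# 8GB machine, language model at about 8GB: does not fit.
air_8gb_full = fits_in_unified_memory(8, lm_gb=8.0, synth_gb=1.0)
# 8GB machine, 4-bit language model at about 2.5GB: fits, with little headroom.
air_8gb_q4 = fits_in_unified_memory(8, lm_gb=2.5, synth_gb=1.0)
# 16GB machine, language model at about 8GB plus the larger synthesis model: fits.
mac_16gb_full = fits_in_unified_memory(16, lm_gb=8.0, synth_gb=2.0)
print(air_8gb_full, air_8gb_q4, mac_16gb_full)
```

The check covers weights only; the KV cache and activation buffers grow with context length and eat into whatever headroom remains.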

The Broader MLX Ecosystem

mlx-audio is part of a growing set of MLX-native libraries that Apple’s ML Research team and the open source community have built around the core framework. mlx-lm handles text generation and supports most major model architectures. mlx-vlm covers vision-language models. mlx-audio rounds out the multimodal coverage with audio.

The libraries share conventions: similar load and generate patterns appear across them, and models are pulled from Hugging Face using standard Hub conventions. Quantized MLX-format versions of popular models are published directly to the mlx-community organization on Hugging Face, typically converted from the original weights using the mlx_lm.convert utility.

This ecosystem coherence matters for developers. You can build a multimodal pipeline that accepts an image, generates a text description using a vision-language model, and voices that description using mlx-audio, all within a single Python process, all backed by the same unified memory pool, without any serialization boundaries between components.

Quantization and Quality

One decision point when using mlx-audio with Gemma 4 is the quantization level. MLX supports 4-bit, 6-bit, and 8-bit quantization, among other bit widths. The 4-bit version of the 4B Gemma 4 model fits in about 2.5GB of memory; the 8-bit version is around 4.5GB. For audio synthesis specifically, quantization quality has a perceptual dimension: degradation in the language model backbone can produce slightly odd word choices in responses, and quantization in the synthesis model can introduce audible artifacts in the generated speech.
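Those sizes are a little larger than a naive "parameters times bits" calculation suggests, because group-wise quantization stores extra metadata alongside the packed weights. The sketch below assumes one 16-bit scale and one 16-bit bias per group of 64 weights; I believe this matches MLX's default settings, but treat the group size and metadata width as assumptions. `quantized_weights_gb` is a hypothetical helper, and it ignores any layers left unquantized as well as runtime memory.

```python
def quantized_weights_gb(params_billion, bits, group_size=64, scale_bias_bits=32):
    """Estimate weight storage in GB for group-wise quantized parameters."""
    # Each group of weights carries its scale and bias, amortized per weight.
    bits_per_weight = bits + scale_bias_bits / group_size
    return params_billion * bits_per_weight / 8


q4 = quantized_weights_gb(4, bits=4)  # 4.5 effective bits per weight
q8 = quantized_weights_gb(4, bits=8)  # 8.5 effective bits per weight
print(q4, q8)
```

Under these assumptions the estimates land slightly below the 2.5GB and 4.5GB figures above, which is the direction you would expect if some tensors are kept at higher precision.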

For most conversational use cases, the 4-bit quantized versions perform well enough. For voice applications where naturalness matters, 6-bit or 8-bit quantization at the synthesis stage is worth the memory cost. The language model can typically afford more aggressive quantization because you are less sensitive to minor variation in the text output than you are to audible artifacts in the speech.

Where This Is Heading

The combination of capable open audio models and a mature local inference framework is significant for a specific class of developers: those building applications that need to run entirely on-device, or that serve users in environments where cloud round-trips are unacceptable. Accessibility tools, embedded voice interfaces, offline-capable assistants, and applications in regulated industries are all areas where this stack is directly relevant.

Apple’s decision to invest in MLX as a first-party ML framework, rather than simply shipping CUDA compatibility layers or relying on Core ML alone, is paying dividends in ecosystem terms. The community around MLX has produced a set of high-quality model ports and utilities that make the on-device story genuinely competitive with GPU server inference for models in the sub-30B parameter range.

Gemma 4’s audio capabilities, running via mlx-audio on Apple Silicon, are a good example of what that competition looks like in practice. It is not perfect: the setup assumes you are on a Mac, the memory requirements are real, and the model quality gap with the best cloud providers for voice naturalness is still there. But for developers who need local, private, low-latency audio generation, the gap has closed considerably.
