The Infrastructure Layer That Made Local Voice Assistants Actually Work
Source: hackernews
The gap between “it works” and “I actually use it every day” is enormous for local voice assistants. People have been building them since Jasper in 2014, through Rhasspy, through a parade of wake-word libraries that got abandoned or shut down. The models improved. The hardware got cheap enough. But until fairly recently, “locally hosted voice assistant” meant a system you demo to guests and then quietly go back to yelling at Alexa.
That is what makes the wave of detailed 2025 setups appearing on the Home Assistant community forums meaningful. A well-documented journey to a reliable daily driver signals something more than a hobbyist achievement. It signals that the infrastructure has matured to the point where the effort is predictable, the failure modes are known, and the result is stable enough to stake daily life on.
The reason is not better AI models in isolation. OpenAI Whisper has been available since September 2022. Piper TTS has been producing natural-sounding speech for years. The reason is protocol-level infrastructure, specifically the Wyoming protocol and the pipeline architecture Home Assistant built around it during its 2023 Year of Voice initiative.
What Wyoming Actually Is
Wyoming is a simple, event-based protocol for connecting voice assistant components over TCP or Unix sockets. It was created by Michael Hansen, who also wrote Rhasspy and Piper, and it solves a problem that kept previous systems fragile: tight coupling between components.
Before Wyoming, setting up a local voice assistant typically meant either using a monolithic system with limited configurability, or wiring together components through MQTT with custom glue code that broke every time anything updated. Rhasspy 2 used the Hermes MQTT protocol, which worked but required a broker and had awkward semantics for streaming audio.
Wyoming uses line-delimited JSON for events, followed by binary payloads when needed. Each event has a type field and optional data. Audio flows as a sequence of audio-start, audio-chunk, and audio-stop events, where each chunk carries raw PCM data. A wake word detector, an ASR service, and a TTS service each speak the same protocol. They can run on the same machine, separate machines, or in containers, and swapping one out requires changing a single address in your configuration.
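To make the framing concrete, here is a minimal sketch of the pattern described above: a JSON header line, an optional binary payload, and an audio stream expressed as audio-start, audio-chunk, and audio-stop events. The header field names (payload_length, rate, width, channels) and the helper functions are assumptions made for illustration, not a verified copy of the wire format, so check the Wyoming specification before relying on them.

```python
import io
import json


def write_event(stream, event_type, data=None, payload=b""):
    """Write one event: a JSON header line, then the raw payload bytes (if any)."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        # Assumed field name: tells the reader how many binary bytes follow.
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(payload)


def read_event(stream):
    """Read one event back; returns (type, data, payload) or None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], header.get("data", {}), payload


def send_audio(stream, pcm, rate=16000, width=2, channels=1, chunk_bytes=2048):
    """Send raw PCM as audio-start, a run of audio-chunk events, then audio-stop."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    write_event(stream, "audio-start", fmt)
    for offset in range(0, len(pcm), chunk_bytes):
        write_event(stream, "audio-chunk", fmt, pcm[offset:offset + chunk_bytes])
    write_event(stream, "audio-stop")


def roundtrip(pcm):
    """Encode PCM into an in-memory stream and decode every event again."""
    buffer = io.BytesIO()
    send_audio(buffer, pcm)
    buffer.seek(0)
    events = []
    while (event := read_event(buffer)) is not None:
        events.append(event)
    return events
```

The in-memory buffer stands in for a TCP or Unix socket. Because the header announces the payload length, the reader never has to scan binary audio for a delimiter, which is what keeps a framing like this easy to implement in any language.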
The protocol is deliberately minimal. There is no service discovery, no broker, no schema registry. You point a client at a server address and it works. This simplicity is what makes it composable in practice rather than just in theory.
The Modern Stack
A reliable 2025 local voice assistant setup typically chains four discrete components.
Wake word detection is usually handled by openWakeWord, a Python library using ONNX Runtime that ships with pre-trained models for common phrases. It runs continuously on your satellite hardware, consuming roughly 5-15% CPU on a Raspberry Pi 4. The library supports custom model training with synthetic data augmentation, which matters because home environments vary significantly in acoustic character. A model trained on clean recordings performs poorly in a kitchen with a running dishwasher. openWakeWord’s approach of using speech synthesis to generate training data, then fine-tuning on real samples, produces models that generalize much better.
Speech-to-text in most serious setups uses faster-whisper, a reimplementation of Whisper using CTranslate2. On a Raspberry Pi 4, the tiny.en model with int8 quantization transcribes a typical short command in roughly 200-400ms. The small.en model trades speed for accuracy, at around 700-900ms. On an x86 mini PC with an N100 or similar processor, the small multilingual model stays under 400ms for most utterances. If you have a GPU, even a modest one, the medium model becomes practical at under 200ms.
whisper.cpp by Georgi Gerganov is the usual alternative on more constrained hardware. Its C++ implementation supports Metal acceleration on Apple Silicon, OpenCL on various GPUs, and highly optimized CPU paths. For Raspberry Pi deployments, whisper.cpp and faster-whisper perform roughly comparably on small models, though faster-whisper integrates more easily with the Python-based Wyoming server implementations.
Text-to-speech via Piper is where the quality leap happened. Piper uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which produces naturally paced, naturally stressed speech rather than the robotic output you got from older Festival- or eSpeak-based systems. The en_US-lessac-medium and en_US-ryan-high models are particularly good. Generation latency on a Pi 4 is 50-150ms for typical response lengths, which is fast enough that it is never the bottleneck in the pipeline. Piper also supports streaming output, so audio can start playing before the full synthesis is complete, dropping perceived latency further.
Intent processing sits between STT and TTS and is still where setups diverge most. Home Assistant’s Assist pipeline uses template-based intent matching for home automation commands: turning lights on, setting temperatures, querying sensor states. This matching is fast, under 50ms, and reliable for the commands it knows. For anything beyond home automation, you either need a conversation agent backed by a local LLM or have to accept that the assistant has a fixed command vocabulary.
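A rough sketch of what template-based matching looks like, and why it is both fast and rigid: each sentence template compiles to a regular expression, and an utterance either matches one or falls through. The template syntax, intent names, and functions here are hypothetical and much simpler than what Home Assistant actually ships.

```python
import re

# Hypothetical templates: [word] is optional, {slot} captures free text.
TEMPLATES = [
    ("turn_on", "turn on [the] {name}"),
    ("turn_off", "turn off [the] {name}"),
    ("set_temperature", "set [the] {name} to {value} degrees"),
]


def compile_template(template):
    """Translate one template into a regex; every token consumes trailing whitespace."""
    parts = []
    for token in template.split():
        if token.startswith("[") and token.endswith("]"):
            parts.append("(?:" + re.escape(token[1:-1]) + r"\s+)?")
        elif token.startswith("{") and token.endswith("}"):
            parts.append("(?P<" + token[1:-1] + r">.+?)\s+")
        else:
            parts.append(re.escape(token) + r"\s+")
    return re.compile("".join(parts))


COMPILED = [(intent, compile_template(t)) for intent, t in TEMPLATES]


def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace, and add one trailing space."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(cleaned.split()) + " "


def match_intent(text):
    """Return (intent, slots) for the first matching template, else None."""
    cleaned = normalize(text)
    for intent, pattern in COMPILED:
        match = pattern.fullmatch(cleaned)
        if match:
            return intent, match.groupdict()
    return None


print(match_intent("Turn on the kitchen light."))
print(match_intent("What's the weather on Mars?"))
```

The second call returns None, which is the fixed-vocabulary boundary in action: anything the templates do not cover has to be rejected or handed to a different agent.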
Running a local LLM for the conversation layer adds 1-3 seconds of latency on typical consumer hardware, which noticeably changes the feel. Ollama with a quantized 7B model on an M-series Mac or a GPU-equipped mini PC is the common approach for people who want general-purpose responses without sending data to OpenAI.
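To see how these figures add up, here is a back-of-the-envelope budget using the Pi 4 numbers quoted in this section. It rests on several assumptions: the stages run strictly one after another, "under 50ms" is treated as a 0-50ms range, the LLM time is added on top of the other stages, and wake word detection, end-of-speech detection, and network transfer are ignored.

```python
# Per-stage latency ranges in milliseconds (low, high), from the figures in the text.
STAGES_MS = {
    "stt: faster-whisper tiny.en on a Pi 4": (200, 400),
    "intent: template matching": (0, 50),  # "under 50ms", lower bound assumed
    "tts: Piper on a Pi 4": (50, 150),
}

# Extra latency of a local LLM conversation agent on consumer hardware.
LLM_MS = (1000, 3000)


def budget(stages, extra=None):
    """Sum the best-case and worst-case latency across sequential stages."""
    ranges = list(stages.values())
    if extra is not None:
        ranges.append(extra)
    return sum(low for low, _ in ranges), sum(high for _, high in ranges)


low, high = budget(STAGES_MS)
print(f"template path: {low}-{high} ms")

low, high = budget(STAGES_MS, LLM_MS)
print(f"with local LLM: {low}-{high} ms")
```

Under these assumptions the template path lands at 250-600ms end to end, while the LLM path lands at 1250-3600ms. The LLM alone accounts for most of the difference, which is why it changes the feel of the interaction so much.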
The Hardware Picture
The satellite hardware, the device with the microphone you actually talk to, is typically separate from the server running inference. This split matters because it lets you run demanding models on better hardware while keeping the satellites cheap and low-power.
The most common satellite configurations are a Raspberry Pi Zero 2 W or Pi 4 with a ReSpeaker HAT for microphone array support, or an ESP32-S3 running ESPHome’s voice assistant firmware with an INMP441 MEMS microphone. The ESP32 approach is appealing because the hardware costs under fifteen dollars and the firmware handles wake word detection locally using a built-in model, sending audio upstream only after activation.
For the server, a Pi 4 with 4GB RAM handles the tiny Whisper model comfortably. A Pi 5 handles small. An N100-based mini PC running at about ten watts handles small or medium with room to spare and can also run a small LLM. The Hailo-8 M.2 accelerator, which Home Assistant is integrating support for, targets the vision pipeline but demonstrates the category of dedicated inference silicon that is making local AI more accessible.
Where It Still Falls Short
Noise robustness remains the biggest practical gap with cloud alternatives. Alexa and Google Home use beamforming microphone arrays, continuous cloud-side noise cancellation trained on billions of utterances, and per-user acoustic models. A single-mic satellite in a noisy room simply cannot match that. Multi-mic arrays with webrtc-audio-processing help significantly, but require more hardware and configuration.
Wake word false-positive and false-negative rates are also still noticeably worse than in commercial products. openWakeWord has improved, and you can tune the detection threshold to trade one for the other, but in a house with a television playing, the assistant will occasionally wake up and act when nobody addressed it.
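The threshold trade-off is easy to see with a toy example. The scores below are made up for illustration; in practice you would collect the detector's scores for recordings from your own rooms, labelled by whether the wake word was really spoken. The error_rates function is a hypothetical helper, not part of openWakeWord.

```python
# Made-up detector scores: (score, wake_word_was_actually_spoken)
CLIPS = [
    (0.95, True), (0.80, True), (0.62, True), (0.45, True),
    (0.70, False), (0.55, False), (0.30, False), (0.10, False),
]


def error_rates(scored_clips, threshold):
    """Return (false_positive_rate, false_negative_rate) at a given threshold."""
    positives = [score for score, is_wake in scored_clips if is_wake]
    negatives = [score for score, is_wake in scored_clips if not is_wake]
    # A real wake word scoring below the threshold is missed.
    false_negatives = sum(score < threshold for score in positives)
    # Background audio scoring at or above the threshold triggers the assistant.
    false_positives = sum(score >= threshold for score in negatives)
    return false_positives / len(negatives), false_negatives / len(positives)


for threshold in (0.5, 0.6, 0.75):
    fp_rate, fn_rate = error_rates(CLIPS, threshold)
    print(f"threshold {threshold}: false positives {fp_rate:.0%}, false negatives {fn_rate:.0%}")
```

Raising the threshold from 0.5 to 0.75 removes every false activation in this toy data but doubles the share of missed wake words. No setting fixes both at once, so the right value depends on which failure annoys your household more.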
The intent layer boundary is explicit and visible to users in a way that cloud assistants obscure. When you ask Alexa something outside its training distribution, it routes to a skill or falls back gracefully. When you ask a template-matched Assist pipeline something it does not know, it says so flatly. The LLM-backed conversation agent softens this, but the latency cost is real.
Why This Moment Matters
The people documenting their setups in 2025 are not reporting on research prototypes. They are reporting on systems that handle requests reliably enough to replace commercial alternatives in their homes. That shift happened because the component ecosystem converged on a shared protocol, the Wyoming server implementations became first-class citizens in Home Assistant’s add-on store, and the model quality for both ASR and TTS crossed a perceptibility threshold where the output stopped sounding like a computer.
The underlying models did not change dramatically between 2022 and 2025. Whisper is still Whisper. Piper is an evolution of earlier VITS work. What changed is that running these models got easier, faster, and more reliably composable. The infrastructure caught up to the AI.
For anyone who has tried and given up on local voice, the current stack is worth revisiting. The Wyoming protocol in particular represents the kind of boring, well-designed infrastructure that makes complex systems maintainable long-term. When your TTS voice sounds dated in two years, you swap in a new Piper model. When faster-whisper ships a better quantization, you update a container. Nothing else breaks. That composability is what “reliable and enjoyable” actually means in practice.