Someone in the Home Assistant community wrote up a detailed journey through building a reliable local voice assistant that picked up over 400 upvotes on Hacker News. The specifics are worth reading, but the more interesting story is in the architecture underneath it.
A couple of years ago, local voice assistants were a frustrating project. Rhasspy, the main open-source option at the time, required considerable configuration, trained poorly on non-standard speech patterns, and used the Hermes MQTT protocol, which meant building a small distributed system just to route audio between components. The stack worked, but not reliably enough to replace the muscle memory of tapping a phone.
What changed is a combination of better model weights, better inference code, and a cleaner protocol layer. No single improvement would have been sufficient on its own.
Wyoming: The Protocol That Made Modularity Practical
The piece of infrastructure that quietly enabled all of this is Wyoming, a lightweight TCP-based protocol designed by Michael Hansen, the author of Rhasspy who now works at Nabu Casa. Wyoming replaced Hermes MQTT with something considerably simpler: newline-delimited JSON events over a plain TCP socket, with optional binary payloads for raw audio.
The structure of a message is minimal:
```
{"type": "audio-chunk", "data": {"rate": 16000, "width": 2, "channels": 1}, "payload_length": 4096}
<4096 bytes of raw PCM>
```
Audio format is fixed across all Wyoming components: 16kHz, 16-bit signed little-endian PCM, mono. Every component speaks the same audio dialect, so you can swap STT engines without touching anything else. The default ports are conventional: OpenWakeWord on 10400, faster-whisper on 10300, Piper on 10200. Three separate processes, each independently upgradeable, each addressable from Home Assistant’s Wyoming integration.
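The three-process layout can be sketched as a compose file. This is a sketch, not a definitive deployment: the image names and command-line flags follow the rhasspy project READMEs and may have changed, so verify them against the current documentation.

```yaml
# Sketch: the three Wyoming servers as separate, independently upgradeable containers.
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model small.en --language en
    ports: ["10300:10300"]
  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium
    ports: ["10200:10200"]
  openwakeword:
    image: rhasspy/wyoming-openwakeword
    command: --preload-model ok_nabu
    ports: ["10400:10400"]
```

Each service is then added in Home Assistant via the Wyoming integration, pointed at the host and the matching port.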
You can run your wake word detector on a Pi Zero and your STT model on a mini-PC across the room. You can replace Piper with a different TTS engine tomorrow without touching the rest of the pipeline. The modularity works because the protocol boundary is clean and the audio contract is fixed. There is no discovery mechanism, no serialization framework, no broker to manage; components either speak Wyoming or they do not.
This is a deliberate simplification over the Hermes MQTT approach, where messages traveled through a broker, topics proliferated, and every new component required careful subscription management. Wyoming moves to direct TCP connections with a fixed message schema and calls it done.
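The framing really is simple enough to implement in a few lines. The sketch below follows the simplified header shown above (the real protocol adds a few more header fields, so treat this as illustrative rather than a spec-complete implementation):

```python
import io
import json

def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    """Frame an event: one newline-delimited JSON header, then the raw payload."""
    header = {"type": event_type, "data": data, "payload_length": len(payload)}
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def decode_event(stream):
    """Read one event from a binary file-like object (e.g. socket.makefile('rb'))."""
    header = json.loads(stream.readline())
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], header.get("data"), payload

# Round-trip an audio chunk in memory instead of over a socket
chunk = encode_event(
    "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00" * 4096
)
etype, data, payload = decode_event(io.BytesIO(chunk))
```

Because the payload length is declared up front, a reader never has to guess where binary audio ends and the next JSON header begins.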
The STT Model Selection Problem
Choosing a Whisper model for a local voice assistant is mostly a latency problem disguised as an accuracy problem. The LibriSpeech benchmark numbers for OpenAI’s Whisper variants:
| Model | Parameters | WER (clean) |
|---|---|---|
| tiny.en | 39M | ~5.7% |
| small.en | 244M | ~3.0% |
| medium.en | 769M | ~2.7% |
| large-v3 | 1550M | ~2.1% |
The accuracy gap between small and large is narrow in absolute terms; the latency difference on consumer CPU hardware is substantial. On an Intel N100 mini-PC running faster-whisper with int8 quantization, small.en processes a typical three-second utterance in around 400 milliseconds. On a Raspberry Pi 4, the same model takes closer to 1.5 seconds. Switch to tiny.en on the Pi and you are back under a second.
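Those figures are easier to compare across hardware as a real-time factor, processing time divided by audio duration. Anything comfortably below 1.0 keeps up with speech; the numbers below simply restate the timings from the text:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1.0 means transcription runs faster than real time."""
    return processing_s / audio_s

# Figures quoted above for a 3-second utterance with small.en:
n100_rtf = real_time_factor(0.4, 3.0)  # Intel N100 mini-PC, int8
pi4_rtf = real_time_factor(1.5, 3.0)   # Raspberry Pi 4, int8
```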
faster-whisper is the standard inference backend for Home Assistant’s wyoming-faster-whisper add-on. It uses CTranslate2 under the hood instead of PyTorch, delivering roughly four to five times the CPU throughput of the original whisper package with int8 quantization, at a minimal accuracy cost of roughly 0.1 to 0.3 percentage points of additional WER. For home automation commands, a phrase like “turn off the kitchen lights” will not be misheard in either configuration; the tens of milliseconds saved in processing matter more than the fractional accuracy gap.
The English-only .en model variants are meaningfully faster and more accurate than their multilingual equivalents at the same parameter count, since the model capacity is entirely focused on English phonetics rather than distributed across 99 languages. If your household speaks only English, there is no reason to use the multilingual variant.
For hardware with a GPU, distil-whisper is worth considering. The distil-large-v2 variant matches large-v2 accuracy at roughly three times the speed with half the VRAM, which changes the calculus considerably if you have an NVIDIA card available. For pure CPU deployments, faster-whisper with tiny.en or small.en and int8 quantization remains the practical recommendation.
Piper and the TTS Latency Floor
The text-to-speech side of the stack is handled by Piper, another Hansen project, which uses the VITS architecture to generate speech from text. It comes in four quality tiers (x_low through high) targeting different hardware budgets.
The medium tier, at 22.05kHz output, takes 150 to 250 milliseconds to synthesize a typical short response on a Pi 4. The high tier stretches to 500 to 800 milliseconds, which becomes perceptible if you are used to cloud TTS services. For most home automation responses, medium is the practical choice; the voice en_US-lessac-medium in particular has neutral American pronunciation and good intelligibility. The en_US-ryan-high voice has more natural prosody, but the speed penalty is real on constrained hardware.
Piper 1.2 added streaming output, which starts playing audio before synthesis is complete. This reduces perceived latency by 200 to 400 milliseconds for longer responses, since the satellite speaker begins producing audio while the TTS server is still generating the tail of the sentence. For short responses like “turning on the lights,” the difference is less pronounced, but for anything more than a few words it changes the feel of the interaction.
Compared to cloud TTS services like Google Cloud or AWS Polly, Piper medium is noticeably synthetic. Compared to Festival or eSpeak, it is considerably better. The practical question is not whether it matches cloud quality; it is whether it is good enough that you stop noticing it, and at medium quality with a well-chosen voice, most people reach that threshold within a day or two.
Wake Word Reliability
Wake word detection is where many setups fall apart in practice. OpenWakeWord, used in the wyoming-openwakeword add-on, builds a small classifier head on top of Google’s pre-trained audio embedding model. The ok_nabu model, trained specifically for Home Assistant, produces roughly 0.5 to 1 false activations per hour in a typical home environment, which is tolerable for daily use. Detection runs continuously with about 50 to 80 milliseconds per inference window, so the CPU overhead is negligible.
The more recent alternative is microWakeWord, a TFLite Micro model that runs directly on ESP32-S3 hardware. When it works, it eliminates the server round-trip entirely and reduces detection latency to near-zero from the user’s perspective. The requirement is the S3 variant of the ESP32; the M5Stack ATOM Echo, the cheapest common satellite device, uses the original ESP32 and cannot run microWakeWord.
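On supported S3 hardware, enabling on-device detection is a small ESPHome change. A sketch of the configuration, with the model id and keys taken from the ESPHome micro_wake_word component docs as I understand them; verify against the current documentation before flashing:

```yaml
# Sketch: on-device wake word on an ESP32-S3 via ESPHome's micro_wake_word
micro_wake_word:
  models:
    - model: okay_nabu
  on_wake_word_detected:
    # Hand off to the voice assistant pipeline only after a local detection
    - voice_assistant.start:
```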
Hardware Constraints Worth Knowing
The M5Stack ATOM Echo is a $13 device that does most of what you want. Its SPM1423 MEMS microphone is single-channel, which means no acoustic echo cancellation. When the TTS response plays through the built-in speaker, the microphone picks it up. The conventional solution is to mute the mic during TTS playback, which ESPHome handles via the on_tts_start and on_end callbacks in the voice_assistant: component. This prevents false wake word triggers from TTS audio, but it means you cannot interrupt a response mid-sentence.
```yaml
voice_assistant:
  microphone: mic_id
  speaker: spk_id
  noise_suppression_level: 2
  auto_gain: 31dBFS
  on_tts_start:
    - light.turn_on:
        id: led
        blue: 100%
  on_end:
    - light.turn_off: led
```
The noise_suppression_level parameter (0 through 4) invokes ESP-ADF’s built-in noise suppression algorithm on the ESP32 itself, before audio is even sent to the Wyoming pipeline. Combined with auto_gain, this helps compensate for varying distances and ambient noise without any server-side processing.
The ESP32-S3-BOX-3, at around $50, adds a dual microphone array with hardware AEC, which solves the interruption problem cleanly. It also supports microWakeWord on-device and includes a 2.4-inch touchscreen for visual feedback. For a device that lives in a kitchen permanently, the price difference reflects a real capability improvement.
What Makes a Setup Reliable
The community post that surfaced on Hacker News is a useful guide to the distance between a setup that technically works and one that works reliably. The changes that close that gap are mostly about reducing variance: picking a Whisper model sized for your hardware so STT does not time out under load, placing the satellite within one to two meters of the speaker for consistent microphone input, tuning OpenWakeWord’s threshold to reduce false triggers without missing real ones.
A tuned pipeline runs as three Wyoming servers plus the ESPHome satellite. Wake word detection takes 50 to 80 milliseconds. STT takes 400 milliseconds to 1.5 seconds depending on hardware. TTS takes 150 to 250 milliseconds. The satellite adds roughly 100 milliseconds of network round-trip. End-to-end latency of 1 to 2.5 seconds is achievable on modest hardware, which is not instant but is good enough to stop reaching for a phone.
The architecture that enabled this, modular components connected by a clean protocol, is more durable than any single model improvement. When a better STT model ships, you swap out one container. When Piper adds a new voice, the rest of the stack does not change. That composability is what the Wyoming protocol was designed to provide, and it is why the Home Assistant voice ecosystem converged on it rather than continuing to build monolithic solutions.