· 6 min read ·

Building a Local Voice Assistant That Actually Works in 2025

Source: hackernews

A Home Assistant community post about getting a reliable, enjoyable locally-hosted voice assistant picked up over 400 points on Hacker News this week. That kind of traction is worth paying attention to. Local voice has been technically possible for years, but “possible” and “good enough for daily use” are different things. What changed is a combination of better models, a cleaner protocol architecture, and cheap fanless x86 hardware that finally provides enough compute headroom.

The architecture that holds all of this together is the Wyoming protocol, an open TCP-based microservice protocol developed by Michael Hansen and adopted as the official inter-service layer for Home Assistant voice components starting in 2023.5. Wyoming uses a simple framing format: a 4-byte little-endian header length, followed by a UTF-8 JSON header describing the message type, followed by an optional binary payload for raw audio. Each stage of the pipeline, speech-to-text, text-to-speech, and wake word detection, runs as an independent process exposing a Wyoming endpoint. Home Assistant connects to each as a client and orchestrates the pipeline.

This modularity is the critical design decision. It means you can run wyoming-faster-whisper on a box with a GPU and wyoming-piper on the same Raspberry Pi as your Home Assistant instance. It means you can swap the STT engine entirely without touching your TTS configuration. It also means the failure domain for each component is isolated, which makes debugging substantially easier than the old approach of monolithic voice stacks.

Speech Recognition: The Model Size Trap

faster-whisper is a CTranslate2-based reimplementation of OpenAI Whisper that runs 2-4x faster than the original on CPU and uses less memory through INT8 quantization. The wyoming-faster-whisper add-on wraps it with a Wyoming endpoint. The model selection decision is where many setups go wrong.

The tiny.en model (39M parameters, ~390 MB RAM) runs faster than real-time on a Raspberry Pi 4, but its word error rate of roughly 12% is high enough to cause consistent failures on proper nouns and home-automation vocabulary. The small.en model (244M parameters, ~1 GB RAM) cuts that error rate nearly in half, but on a Pi 4 it takes 4-6 seconds to transcribe a 5-second audio clip, which is longer than the audio itself. That is unusable.

On an Intel N100 mini PC, the picture changes. The small.en model with compute_type: int8 and beam_size: 1 transcribes a typical command in 500-700 ms, which puts you at 1.5-2 seconds total pipeline latency for a non-LLM command. That is competitive with Alexa on a bad day. N100 machines like the Beelink EQ12 or Minisforum UN100 run under $150 with 16 GB RAM and an NVMe slot, making them the community sweet spot for all-local setups.

One underused configuration option in faster-whisper is initial_prompt. Whisper was trained on general speech, so it will hallucinate transcriptions that sound phonetically similar to the actual utterance. Seeding the initial prompt with home-automation vocabulary, phrases like “turn on the lights in the living room” or a comma-separated list of your entity names, biases the beam search toward relevant vocabulary. The effect is measurable. Entity names that previously failed consistently start working.

Text-to-Speech: Piper’s Quality Tiers

Piper TTS uses a VITS (Variational Inference with adversarial learning for end-to-end TTS) architecture with per-voice ONNX models. It converts text to phonemes via espeak-ng, runs those through the ONNX model to produce a mel-spectrogram, and synthesizes the waveform. ONNX Runtime means it runs efficiently on CPU without requiring CUDA.

Voices come in quality tiers: x_low, low, medium, and high, trading model size and synthesis time for naturalness. On an N100, even high quality voices synthesize in 50-150 ms for a typical response sentence, so quality tier is not a latency concern on adequate hardware. On a Pi 4, medium takes 200-400 ms and high takes 400-800 ms, still acceptable since it is at the end of the pipeline. en_US-lessac-medium is the most commonly recommended starting point: clear, neutral, 63 MB. en_US-ryan-high is a solid male alternative at 120 MB. For British English, en_GB-cori-high is noticeably more natural than the lower-quality options.

wyoming-piper exposes Piper on port 10200 by default. The voice selection is configured in the pipeline, not in the add-on, which means you can have different pipelines using different voices without running multiple Piper instances.

Wake Words: On-Device vs. Server-Side

The wake word layer has two meaningfully different options with different architectural trade-offs.

openWakeWord uses a two-stage pipeline: Google’s pre-trained audio embedding model produces 96-dimensional embeddings at 20 Hz, and a small per-wake-word classifier runs on top of those embeddings. It achieves 85-97% true positive rate with around 0.3-1.0 false activations per hour at default threshold settings. It runs on the server, typically via the wyoming-openwakeword add-on on port 10400. The trade-off is that your satellite device must stream audio continuously to the server, even when idle.

microWakeWord is designed specifically for the ESP32-S3. It uses a MobileNetV2-based feature extractor small enough to run on the microcontroller, with inference time of 5-20 ms per 30 ms audio frame and RAM usage under 100 KB. No server required for the wake word stage. The false positive rate is higher, around 2-5 per hour at default settings, but for many deployments the zero-infrastructure cost is worth the trade-off. ESPHome’s voice_assistant component integrates microWakeWord directly.

The ESP32-S3-BOX-3, Espressif’s reference voice device, is officially supported by Home Assistant and ships with firmware that configures the full satellite pipeline out of the box. Custom builds on XIAO ESP32S3 Sense or bare ESP32-S3 modules with an INMP441 I2S microphone and MAX98357A I2S amplifier are well-documented in the ESPHome community.

LLM Integration: When to Use It

Home Assistant integrates LLMs as conversation agents via the ollama integration (added 2024.1) or via the openai_conversation integration pointed at Ollama’s OpenAI-compatible endpoint on port 11434. The LLM replaces the default intent-matching conversation agent for free-form queries.

The latency cost is real. llama3.2:3b on an N100 CPU takes 2-5 seconds to generate a response. llama3.1:8b takes 5-15 seconds on CPU, making it impractical unless you have a discrete GPU. With an Nvidia RTX card, llama3.1:8b responds in 1-2 seconds and becomes a reasonable choice. qwen2.5:3b and gemma2:2b offer similar capability to llama3.2:3b with slightly different latency profiles worth benchmarking on your specific hardware.

The practical recommendation most experienced users converge on: use the built-in HA NLU for standard home control commands and configure the LLM only as a fallback for queries that fail intent matching, not as the primary agent. Standard home control commands, lights on, thermostat up, lock the front door, do not benefit from LLM processing and pay a 4-8 second latency tax unnecessarily when routed through one.

The Actual Argument for Going Local

The latency comparison between local and cloud is often presented as the central question, but it is not. On an N100 with small.en and no LLM, you get 1.2-2.0 seconds from wake word to response start. Alexa and Google Home run 0.8-1.8 seconds over a good connection. The gap is small and occasionally reversed.

The real arguments for local are reliability and privacy. A local pipeline works when your internet connection is down, which matters for home automation more than for web search. It works when a cloud provider has an outage or decides to deprecate an API. No audio ever leaves your network. You can train custom wake words, configure custom intents, give the assistant knowledge of your home’s specific layout and naming conventions, and change any part of the stack independently.

The community thread that surfaced this week is a good record of what the journey actually looks like: iterating through hardware, model sizes, configuration options, and satellite designs over months until the pieces fit together. The fact that it generated 132 comments and 400+ points suggests that a lot of people are at some stage of that same journey. The stack is good enough now that the effort pays off, provided you treat hardware selection as a first-class decision rather than an afterthought.

Was this interesting?