The Full Latency Budget for a Local Voice Pipeline

The Home Assistant community thread that surfaced on Hacker News documents something a lot of people have attempted and abandoned: building a locally hosted voice assistant that works reliably enough to use without second-guessing it. The 400+ point reception reflects how widespread the frustration is, and how meaningful it is when someone publishes a configuration that actually holds together in daily use.

The thread is thorough on what works. It says less about why the setup feels the way it does, and about which configuration decisions do the most for perceived responsiveness. The instinct on a first pass is to focus on Whisper model selection. The latency budget tells a different story.

The Pipeline

A Home Assistant voice interaction follows a fixed sequence. A satellite device, whether an ESP32-S3-based board, a Raspberry Pi, or a laptop, streams microphone audio to the server. An on-device or on-server wake word model listens for the trigger phrase. When it detects one, the server begins buffering audio and forwarding it to a speech-to-text model. When VAD detects trailing silence, the stream closes, STT inference completes, the transcript goes to the conversation agent, a response is generated, and TTS audio streams back to the satellite for playback.

Every component in this chain runs as an independent process, connected by the Wyoming protocol: a line-delimited JSON format where each message is a header ending with a newline, optionally followed by raw binary payload. A typical audio chunk event looks like this:

{"type": "audio-chunk", "data": {"rate": 16000, "width": 2, "channels": 1}, "payload_length": 4096}
<4096 bytes of raw PCM>
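As a rough illustration of that framing, here is a minimal Python sketch that writes and reads events shaped like the example above. It follows only the simplified layout described here (one JSON header line, then payload_length raw bytes). The write_event and read_event names are made up for this sketch, and the reference wyoming library may carry additional header fields that are not modeled.

```python
import io
import json


def write_event(stream, event_type, data, payload=b""):
    """Write one event: a JSON header line, then the raw payload bytes."""
    header = {"type": event_type, "data": data}
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(payload)


def read_event(stream):
    """Read one event; returns (header, payload), or None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    length = header.get("payload_length") or 0
    # The payload is read by length, so newline bytes inside PCM data are harmless.
    payload = stream.read(length) if length else b""
    if len(payload) != length:
        raise EOFError("stream ended in the middle of a payload")
    return header, payload


# Round trip through an in-memory buffer standing in for a TCP connection.
buf = io.BytesIO()
write_event(buf, "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, bytes(4096))
buf.seek(0)
header, payload = read_event(buf)
print(header["type"], len(payload))  # audio-chunk 4096
```

The same two functions would work on a socket wrapped with makefile("rwb"), which is part of why the protocol is easy to poke at by hand.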

There is no broker, no schema registry, no subscription management. Each service listens on a fixed TCP port, and Home Assistant connects to each as a client: wyoming-piper on 10200, wyoming-faster-whisper on 10300, wyoming-openwakeword on 10400. The simplicity is deliberate, and it makes the pipeline debuggable with basic networking tools rather than specialized observability infrastructure.

On the speech input path, the components share one audio format: 16 kHz sample rate, 16-bit signed little-endian PCM, mono. Each chunk declares its format in the header, as in the example above, but nothing is negotiated per connection; wake word and STT services simply expect this format. The consequence is that any component conforming to it is interchangeable with any other, which is the property that lets people mix and match server hardware, satellite hardware, and software components without rewriting configuration.
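The arithmetic behind that format is worth having at hand when reading packet captures. The short sketch below works out the data rate and the duration of the 4096-byte chunk from the earlier example; the function names are invented for illustration.

```python
RATE_HZ = 16_000   # samples per second
WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1       # mono


def bytes_per_second(rate=RATE_HZ, width=WIDTH_BYTES, channels=CHANNELS):
    """Raw PCM data rate for the given format."""
    return rate * width * channels


def chunk_duration_ms(payload_length, rate=RATE_HZ, width=WIDTH_BYTES, channels=CHANNELS):
    """How much audio a payload of the given size holds, in milliseconds."""
    samples = payload_length // (width * channels)
    return 1000 * samples / rate


print(bytes_per_second())        # 32000
print(chunk_duration_ms(4096))   # 128.0
```

At 32,000 bytes per second, a five-second utterance is about 160 KB of audio, which is trivial on any home network.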

Where the Time Goes

The conventional advice is to pick a Whisper model based on hardware capability. On an Intel N100 mini-PC, the platform that has become standard for home server builds at around $130-150, faster-whisper with INT8 quantization produces these approximate inference times for a 3-5 second utterance:

Model	N100 INT8 latency	Raspberry Pi 4 latency
tiny.en	80-130 ms	180-300 ms
base.en	130-200 ms	350-550 ms
small.en	280-450 ms	900-1500 ms
medium.en	900-1400 ms	4-8+ seconds

These numbers are real. What gets omitted from the discussion is that VAD silence detection adds 500 to 800 milliseconds of delay before STT inference even begins. The server has to wait for trailing silence to confirm the utterance is complete. The minimum viable VAD window is around 300 ms; below that, the system clips sentence endings when speakers pause between words. In practice, 500 ms is the operational floor for reliable detection.
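The effect of the trailing-silence window can be sketched with a toy endpointing loop. This is not Home Assistant's VAD; it assumes some detector has already labeled each 20 ms frame as speech or not, and the function name, frame size, and sample utterance are invented for illustration.

```python
def end_of_utterance_ms(speech_flags, frame_ms=20, silence_ms=500):
    """Return the time (ms) at which the utterance is declared finished, or None.

    speech_flags holds one boolean per audio frame: True if the frame was
    judged to contain speech.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * frame_ms
    return None


# 1.0 s of speech, a 200 ms pause between words, 0.5 s more speech, then silence.
flags = [True] * 50 + [False] * 10 + [True] * 25 + [False] * 40

print(end_of_utterance_ms(flags, silence_ms=500))  # 2200
print(end_of_utterance_ms(flags, silence_ms=150))  # 1140
```

With a 500 ms window the stream closes 500 ms after the speaker actually stops at 1700 ms. With a 150 ms window it closes inside the mid-sentence pause and the second half of the command is lost, which is the clipping failure described above.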

The full latency budget for an N100 setup with small.en and Hassil template matching:

Stage	Duration
Wake word detection	50-150 ms
VAD silence detection	500-800 ms
STT inference (small.en, INT8)	280-450 ms
Conversation agent (Hassil)	<5 ms
TTS synthesis (lessac-medium)	40-80 ms
Total	~1.0-1.5 seconds

Pairing like ends of each range, the VAD stage accounts for roughly 55% of total perceived latency and STT inference for about 30%. Switching from tiny.en to small.en adds roughly 200-300 ms of inference time to a pipeline that already carries 500-800 ms of unavoidable VAD delay. The user-perceptible difference is smaller than the raw numbers suggest; the pipeline has a floor that model selection cannot push below.
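The shares can be checked directly from the table. The sketch below sums the low and high ends of each stage and computes each stage's share, pairing low ends with low ends and high ends with high ends; mixing a slow VAD window with a fast STT pass would give different percentages. The stage labels and helper names are made up for this example, and the "<5 ms" entry is treated as 0-5 ms.

```python
# (stage, low_ms, high_ms) taken from the latency table above.
BUDGET = [
    ("wake word", 50, 150),
    ("vad silence", 500, 800),
    ("stt small.en", 280, 450),
    ("hassil", 0, 5),
    ("tts", 40, 80),
]


def totals(budget):
    """Sum of the low ends and sum of the high ends, in milliseconds."""
    return sum(lo for _, lo, _ in budget), sum(hi for _, _, hi in budget)


def share(budget, stage):
    """Fraction of the total taken by one stage at the low and high ends."""
    lo_total, hi_total = totals(budget)
    _, lo, hi = next(row for row in budget if row[0] == stage)
    return lo / lo_total, hi / hi_total


# Same pipeline with tiny.en swapped in for small.en.
TINY = [row if row[0] != "stt small.en" else ("stt tiny.en", 80, 130) for row in BUDGET]

print(totals(BUDGET))                                          # (870, 1485)
print([round(100 * s) for s in share(BUDGET, "vad silence")])  # [57, 54]
print([round(100 * s) for s in share(BUDGET, "stt small.en")]) # [32, 30]
print(totals(TINY))                                            # (670, 1165)
```

Dropping to tiny.en saves 200-320 ms, but the VAD wait stays in the total either way.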

The Hassil template-matching conversation agent adds under 5 ms and is essentially free for the latency budget. For the large majority of home automation use cases, covering entity control, scene activation, and status queries, Hassil gets you to the response generation step without any meaningful overhead.

Audio Quality Upstream of STT

faster-whisper’s word error rate on clean speech with small.en runs around 3-5%. At 10 dB SNR, which corresponds to a television playing in the same room or an HVAC system running, the same model produces 15-25% WER. At that error rate, home automation commands succeed roughly 80% of the time, which registers as unreliability in daily use because failures cluster unpredictably.

Switching to a larger model addresses the symptom rather than the cause. medium.en improves WER by a few percentage points on degraded audio, but going by the table above it adds roughly 600-950 ms of inference time over small.en on an N100. Improving audio quality upstream produces better results at lower compute cost.

On ESP32-S3-based satellites, the voice_assistant ESPHome component provides on-device noise suppression via ESP-ADF before audio is transmitted to the server:

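voice_assistant:
  microphone: my_microphone
  speaker: my_speaker
  noise_suppression_level: 2  # 0 disables suppression, 4 is the most aggressive
  auto_gain: 31dBFS           # 0dBFS disables auto gain, 31dBFS is the maximum
  volume_multiplier: 4.0      # 1.0 leaves the level unchanged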

noise_suppression_level: 2 runs the ESP-ADF noise suppression algorithm on the ESP32-S3 itself, before the audio leaves the device. The processing costs nothing in server compute. Effective SNR improvement in consistent background noise scenarios is typically 8-12 dB, which closes most of the WER gap between small.en and medium.en without the inference time penalty.

The other lever is Silero VAD, available as a preprocessing filter inside wyoming-faster-whisper via --vad-filter. Silero is a 1 MB neural network that distinguishes speech from noise, music, and silence at 100x real-time on CPU. Without it, noise transients like door slams or appliance starts can trigger spurious transcriptions that produce unintended device actions. With it, the pipeline ignores non-speech audio segments before they reach the STT model.

Hardware Echo Cancellation

Single-microphone satellites without hardware AEC require muting the microphone during TTS playback, or playback audio feeds back into STT and triggers secondary transcriptions. The M5Stack ATOM Echo, which uses the original ESP32 rather than S3, requires this pattern in ESPHome:

on_tts_start:
  - micro_wake_word.stop:
on_tts_end:
  - micro_wake_word.start:

The ESP32-S3-BOX-3 has dual digital MEMS microphones through an ES7210 codec with hardware AEC, performing echo cancellation at the hardware level. The practical consequence is that the BOX-3 can be addressed while it is speaking, the same behavior as commercial smart speakers. For a device intended to replace a cloud assistant in daily use, this matters more than most software configuration decisions. Mute-during-playback satellites feel fundamentally different to interact with.

The openWakeWord Architecture

openWakeWord uses a two-stage design. The first stage is a frozen audio embedding model derived from Google’s speech embedding research: a 1D CNN that produces 96-dimensional embedding vectors from roughly 975 ms windows of 16 kHz mono audio. The second stage is a tiny per-wake-word classifier, typically under 10,000 parameters, operating on a sliding window of those embedding vectors.

The first stage is shared and runs once per audio window. The second stage is cheap enough that adding wake words costs almost nothing in compute. Custom wake words can be trained by generating synthetic TTS pronunciations of a target phrase, extracting embeddings offline, and training the small classifier head. A classifier trained on 500-1000 TTS-synthesized samples achieves roughly 85-90% of the true positive rate of one trained on recorded speech.

At threshold 0.5, the ok_nabu wake word produces 0.3-1.5 false activations per hour in quiet environments and 2-5 per hour with background television. Commercial wake words stay below 0.1 false activations per hour; the gap reflects training data scale and custom beamforming hardware in commercial devices. For home automation where a false activation at worst triggers an unintended light toggle, the default threshold is a reasonable starting point. Raising it to 0.7 roughly halves false positives at the cost of missing around 10% of genuine activations.
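The threshold trade-off is easy to see on a toy score stream. The sketch below counts activations from per-frame classifier scores, with a refractory period so one spoken phrase is not counted several times. The scores are invented, and the 80 ms frame step and 2-second refractory period are assumptions made for illustration rather than openWakeWord defaults.

```python
def count_activations(scores, threshold=0.5, frame_ms=80, refractory_ms=2000):
    """Count detections in a stream of per-frame scores in the range 0 to 1."""
    hold_frames = refractory_ms // frame_ms
    activations = 0
    cooldown = 0
    for score in scores:
        if cooldown > 0:
            # Still inside the refractory period of the previous detection.
            cooldown -= 1
            continue
        if score >= threshold:
            activations += 1
            cooldown = hold_frames
    return activations


# Made-up scores: a marginal event, a clear wake word, and another marginal event.
scores = (
    [0.1, 0.55, 0.6, 0.2] + [0.0] * 30
    + [0.3, 0.8, 0.9, 0.1] + [0.0] * 30
    + [0.65]
)

print(count_activations(scores, threshold=0.5))  # 3
print(count_activations(scores, threshold=0.7))  # 1
```

Raising the threshold removes the two marginal events here, but a genuine activation that only scored 0.65 would be dropped just the same.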

LLM Integration and Where the Gap Remains

Hassil handles the large majority of home automation commands, but its vocabulary is bounded by defined templates. Routing to a local Ollama instance covers ambiguous phrasing and multi-step requests, with the system prompt including all exposed entity states, typically 2-4 KB for a well-configured home.

On an N100 CPU, llama3.1:8b at 4-bit quantization runs at 15-25 tokens per second, adding 2-4 seconds to the interaction. An RTX 3060 raises throughput to 60-90 tokens per second, dropping inference time under one second for typical responses. That hardware distinction is where the cloud vs. local performance gap is still real in 2025. A Hassil-only setup covering 90% of household voice use cases competes with cloud assistants on latency; an LLM-backed setup on CPU-only hardware does not.
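Those figures follow from simple division. The sketch below estimates generation time from the throughput ranges quoted above, assuming a short spoken reply of about 50 tokens (an illustrative number, not one from the thread) and ignoring the time spent processing the system prompt.

```python
# Throughput ranges in tokens per second, as quoted in the text above.
THROUGHPUT = {"N100 CPU": (15, 25), "RTX 3060": (60, 90)}

# Assumption for illustration: a short spoken reply of about 50 tokens.
REPLY_TOKENS = 50


def generation_seconds(response_tokens, tokens_per_second):
    """Time spent generating the reply, ignoring prompt processing."""
    return response_tokens / tokens_per_second


for label, (slow, fast) in THROUGHPUT.items():
    best = generation_seconds(REPLY_TOKENS, fast)
    worst = generation_seconds(REPLY_TOKENS, slow)
    print(f"{label}: {best:.1f}-{worst:.1f} s")
# N100 CPU: 2.0-3.3 s
# RTX 3060: 0.6-0.8 s
```

Longer replies scale linearly, so a chatty model on CPU falls further behind than these numbers suggest.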

The home-llm HACS integration provides fine-tuned LoRA adapters for models like llama3.2:3b, trained specifically on Home Assistant service call data. JSON formatting reliability improves compared to general-purpose models with prompt engineering, which matters because a malformed service call fails silently while the user waits for a response that never arrives.
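One way to avoid the silent-failure case is to validate the model's output before dispatching it, and to speak a fallback message when validation fails. The sketch below is a minimal version of that check. The JSON shape (a service field plus an entity_id field) is a hypothetical format chosen for illustration, not the one home-llm actually emits, and parse_service_call is an invented helper.

```python
import json


def parse_service_call(raw):
    """Return (call, None) for a usable call, or (None, reason) otherwise."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "expected a JSON object"
    service = call.get("service")
    parts = service.split(".") if isinstance(service, str) else []
    if len(parts) != 2 or not all(parts):
        return None, "'service' must look like 'domain.action'"
    entity_id = call.get("entity_id")
    if not isinstance(entity_id, str) or not entity_id:
        return None, "'entity_id' is missing"
    return call, None


good = '{"service": "light.turn_on", "entity_id": "light.kitchen"}'
truncated = '{"service": "light.turn_on", "entity_id": "light.kit'

print(parse_service_call(good)[1])       # None
print(parse_service_call(truncated)[0])  # None
```

When the second element is not None, the pipeline can hand TTS a short apology instead of leaving the user waiting.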

The Convergence

The thread describes a multi-year effort, and that timeline reflects how the prerequisites for a working local voice pipeline arrived gradually rather than all at once.

faster-whisper made Whisper inference viable on commodity CPU hardware by replacing PyTorch with CTranslate2, a C++ inference engine with INT8 quantization for linear layers and KV cache management in a contiguous memory pool. The speedup over vanilla Whisper on CPU is typically 2-4x for the encoder pass. Piper replaced eSpeak’s parametric formant synthesis with VITS neural TTS running on ONNX Runtime, producing speech that does not register as synthetic on short home automation responses. Wyoming replaced Rhasspy’s MQTT-based component coupling with a protocol simple enough to implement in a few hundred lines and inspect with a TCP client.

The hardware economics closed the remaining gap. N100-class mini-PCs brought enough CPU performance for small.en inference and local TTS down to under $150. The ESP32-S3 made on-device wake word detection and audio preprocessing available in satellite hardware at $10-50, with the BOX-3 providing hardware AEC at the $50 price point.

The pieces were individually insufficient; the timeline that produced a working setup was the timeline on which they arrived together. The thread that surfaced on Hacker News is documentation of that convergence from someone who kept trying through the period when it did not yet work.