The Intent Layer Is Where Local Voice Assistants Actually Struggle
Source: hackernews
A Home Assistant community member wrote up a detailed account of building a reliable local voice assistant that collected over 400 upvotes on Hacker News. The discussion there predictably focused on hardware choices, microphone quality, and Whisper model sizes. What got less attention is the stage that sits quietly in the middle of the pipeline: the conversation agent that takes a transcribed sentence and decides what to do with it.
Get the STT and TTS right and you have a system that hears you and speaks back. Get the conversation agent wrong and you have a system that confidently mishears intent even when transcription was perfect. The two failure modes look identical to anyone standing in the kitchen asking why the lights did not turn off.
What Hassil Actually Does
Home Assistant’s default conversation agent uses Hassil, a purpose-built intent matching library written in Python. It does not use a neural network. It uses a domain-specific language for defining sentence templates, against which incoming transcriptions are matched via a recursive parser.
A template looks like this:
language: en
intents:
  HassTurnOn:
    data:
      - sentences:
          - "turn on [the] {name}"
          - "switch on [the] {name}"
          - "(activate|enable) [the] {name}"
The square brackets denote optional words. Parentheses with pipe separators are alternations. Curly braces are slot references that map to entity names, area names, or other lists. Hassil expands all combinations at load time into a trie-like structure and does exact matching against that structure during inference.
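The expansion step can be sketched in a few lines of Python. This is a toy illustration of the idea, not Hassil's actual implementation: optional words and alternations multiply out into every concrete sentence before matching, and slot references like `{name}` are left in place here.

```python
import itertools
import re

def expand_template(template: str) -> list[str]:
    """Expand [optional] words and (a|b) alternations into all
    concrete sentences. A toy version of the load-time expansion;
    slot references like {name} are not resolved here."""
    # Tokenize into optional groups, alternation groups, and plain words.
    tokens = re.findall(r"\[[^\]]+\]|\([^)]+\)|\S+", template)
    choices = []
    for tok in tokens:
        if tok.startswith("["):      # optional: present or absent
            choices.append([tok[1:-1], ""])
        elif tok.startswith("("):    # alternation: exactly one option
            choices.append(tok[1:-1].split("|"))
        else:
            choices.append([tok])
    out = []
    for combo in itertools.product(*choices):
        out.append(" ".join(w for w in combo if w))
    return out

print(expand_template("(activate|enable) [the] {name}"))
# 2 verbs x 2 article choices = 4 concrete sentences
```

Each optional group doubles the count and each alternation multiplies it by the number of options, which is why the fully expanded template set benefits from a shared trie rather than a flat list.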
The performance characteristics are excellent. Matching a transcription against the full expanded template set takes under one millisecond on any modern hardware. There is no GPU required, no model to load, and no external service to call. For the sentences it covers, it is perfectly reliable: the same phrase always produces the same intent with no stochasticity.
The limitation is what that phrase “the sentences it covers” implies. Hassil matches exactly the phrases it was given templates for. It does not generalize. “Turn off the kitchen” works if there is a template that allows area-only references. “Is anything on in the kitchen” requires a separate intent with its own template set. “What did I leave running upstairs” is not representable without significant work. The Home Assistant intent templates repository, which ships the default English intent coverage, has grown substantially over the Year of Voice initiative, but it remains a fixed vocabulary of supported commands.
For a typical home automation use case, this is fine. Most daily interactions are simple imperatives: turn on, turn off, set to brightness, lock the door, play music. Hassil handles all of these reliably. The issue surfaces in two places: commands that fall slightly outside the template vocabulary, and questions about home state.
The Local LLM Path
Home Assistant 2023.12 introduced the ability to replace the Hassil conversation agent with an LLM-backed agent via the OpenAI Conversation integration. This initially required a cloud API key. The more interesting development came with the Ollama integration, added in HA 2024.1, which routes the same conversation requests to a locally running Ollama instance.
The pipeline change is minimal from a configuration standpoint. In Settings > Voice Assistants, you select a different conversation agent for the pipeline. The Wyoming protocol handling STT and TTS stays unchanged. The swap is clean.
What changes is the inference model. Instead of Hassil performing template matching, the conversation agent constructs a system prompt, appends the user’s transcribed command, calls the local LLM, and parses the response for HA service calls. The system prompt includes the current state of all exposed entities, their friendly names, and instructions for how to format action requests.
The system prompt HA sends to the LLM looks roughly like this:
You are a home automation assistant. Only control devices and
provide information the user has made available.
Current time: 2026-03-17 19:42:15
Current date: Tuesday
Available devices:
- light.kitchen_overhead (Kitchen Overhead Light) - on, brightness: 180
- light.living_room_lamp (Living Room Lamp) - off
- switch.coffee_maker (Coffee Maker) - off
- climate.thermostat (Thermostat) - heat, 68°F, current: 65°F
...
Respond with service calls in the following format:
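The mechanics around that prompt can be sketched as follows. The function names, entity-dict shape, and reply format here are hypothetical illustrations, not HA's actual internals: the agent renders exposed entity states into the prompt, sends the result to the local model, and then parses service calls out of whatever comes back.

```python
import json

def format_states(entities: dict[str, dict]) -> str:
    """Render exposed entities into an 'Available devices' section,
    modeled on the prompt example above (hypothetical format)."""
    lines = []
    for entity_id, info in entities.items():
        attrs = ", ".join(f"{k}: {v}" for k, v in info.get("attrs", {}).items())
        line = f"- {entity_id} ({info['name']}) - {info['state']}"
        if attrs:
            line += f", {attrs}"
        lines.append(line)
    return "\n".join(lines)

def parse_service_calls(reply: str) -> list[dict]:
    """Pull a JSON array of service calls out of a model reply,
    tolerating surrounding prose. Returns [] if nothing parses."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        calls = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    return [c for c in calls if isinstance(c, dict) and "service" in c]

entities = {
    "light.kitchen_overhead": {"name": "Kitchen Overhead Light",
                               "state": "on", "attrs": {"brightness": 180}},
    "switch.coffee_maker": {"name": "Coffee Maker", "state": "off"},
}
reply = 'Turning it off. [{"service": "light.turn_off", ' \
        '"data": {"entity_id": "light.kitchen_overhead"}}]'
print(parse_service_calls(reply))
```

The defensive parsing in `parse_service_calls` is the part Hassil never needs: a template matcher cannot return malformed output, while an unconstrained LLM can.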
This is where it gets interesting. An LLM reading that prompt can answer “is the coffee maker on” because it sees the current state directly in context. It can handle “make it a bit warmer” by interpreting “a bit” as a reasonable temperature delta and calling climate.set_temperature. It can respond sensibly to “good night” with a sequence of service calls that turns off lights and locks doors, if you have defined what a goodnight routine means in the system prompt.
The trade-off is latency and determinism.
The Latency Cost
Hassil takes under a millisecond. An LLM conversation turn does not.
On an Intel N100 mini-PC with 16GB RAM running Ollama with llama3.1:8b-instruct-q4_K_M, a typical home automation query takes 2 to 4 seconds for the LLM inference step alone, on top of the STT and TTS time. The context is small (the system prompt plus one user message), but 8B parameter models at q4 quantization do roughly 15 to 25 tokens per second on CPU, and the response generation adds up.
Smaller models help. phi3:mini (3.8B parameters) runs at 35 to 50 tokens per second on the same hardware and generates adequate responses for simple home automation commands, bringing the LLM step closer to 1 to 2 seconds. mistral:7b-instruct-q4_K_M sits between the two in both quality and speed.
For comparison (all figures on the N100 CPU unless noted):
| Model | Tokens/sec | Typical inference | Notes |
|---|---|---|---|
| phi3:mini | 40-50 | 0.8-1.5s | Good for simple commands, occasionally misformats |
| mistral:7b-q4 | 20-30 | 1.5-2.5s | Reliable formatting, handles ambiguous phrasing well |
| llama3.1:8b-q4 | 15-25 | 2-4s | Best reasoning, highest latency |
| llama3.1:8b-q4 (GPU) | 60-90 | 0.5-1s | Needs VRAM; changes the calculus significantly |
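The arithmetic behind those inference times is straightforward: a service call plus a short confirmation is typically a few dozen tokens, and decode time is roughly tokens divided by tokens per second. The sketch below ignores prompt-processing time, which also matters once the entity list in the system prompt gets long.

```python
def est_generation_seconds(response_tokens: int, tokens_per_sec: float) -> float:
    """Back-of-envelope decode time; excludes prompt processing."""
    return response_tokens / tokens_per_sec

# A JSON service call plus a short spoken confirmation is roughly
# 40-60 tokens; throughput figures are midpoints from the table above.
for model, tps in [("phi3:mini", 45), ("mistral:7b-q4", 25),
                   ("llama3.1:8b-q4", 20), ("llama3.1:8b-q4 (GPU)", 75)]:
    print(f"{model}: ~{est_generation_seconds(50, tps):.1f}s for a 50-token reply")
```

At 20 tokens per second, a 50-token reply alone costs 2.5 seconds, which is why trimming the response format is as effective as swapping models.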
With a GPU, the picture is different. An NVIDIA RTX 3060 (12GB VRAM) runs llama3.1:8b-q4 at 60 to 80 tokens per second, dropping the inference step to under a second. If you already have a server with a consumer GPU, the latency penalty largely disappears.
Total pipeline latency with Hassil on an N100: roughly 1 to 1.5 seconds wake-to-response. With a local LLM on the same CPU hardware: 3 to 5 seconds. The difference is perceptible and, for someone who expected the LLM to feel like a smarter assistant, sometimes disappointing.
The Nondeterminism Problem
Latency is the obvious cost. Nondeterminism is subtler and more operationally significant.
Hassil always maps “turn off the kitchen lights” to HassTurnOff with area kitchen and domain light. Every time. You can write a test for it. You can reason about failure modes.
An LLM may occasionally misformat the service call JSON. It may hallucinate an entity name. It may decide to turn off every device in the kitchen rather than just the lights because “kitchen” was ambiguous in the context it built. On a good day it handles edge cases better than Hassil. On a bad day it does something unexpected and you do not know why without digging into the raw API call.
The practical response to this is constrained output. The llama.cpp server supports grammar-constrained generation via its grammar parameter, which forces the model to produce output that conforms to a BNF grammar. You can write a grammar that only permits valid HA service call JSON, eliminating the misformatting failure mode entirely:
root ::= call-list
call-list ::= "[" call ("," call)* "]"
call ::= "{" "\"service\"" ":" service-name "," "\"data\"" ":" data "}"
...
Ollama does not yet expose grammar-constrained generation directly, but the underlying llama.cpp server it wraps does. There is ongoing discussion in the Ollama project about exposing this as a first-class feature. Until then, this approach requires running llama.cpp directly rather than through Ollama.
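A request to the llama.cpp server's /completion endpoint carries the grammar as a string field alongside the prompt. The sketch below builds such a request; the grammar shown is an illustrative, simplified completion of the fragment above, not a production-ready HA service call grammar.

```python
import json
import urllib.request

# Minimal GBNF grammar permitting only a JSON array of service calls
# (an illustrative, simplified stand-in for a full HA grammar).
GRAMMAR = r'''
root ::= "[" call ("," call)* "]"
call ::= "{" "\"service\"" ":" string "," "\"data\"" ":" "{" "\"entity_id\"" ":" string "}" "}"
string ::= "\"" [a-z._]+ "\""
'''

def build_request(prompt: str,
                  url: str = "http://localhost:8080/completion") -> urllib.request.Request:
    """Build a grammar-constrained completion request for a local
    llama.cpp server (fields per llama.cpp's HTTP API)."""
    payload = {
        "prompt": prompt,
        "grammar": GRAMMAR,   # output must conform to the grammar
        "n_predict": 128,
        "temperature": 0.0,   # determinism helps structured output
    }
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})

req = build_request("Turn off the kitchen overhead light.")
# urllib.request.urlopen(req) would return JSON with a "content" field
# that is guaranteed to match the grammar.
```

With the grammar in place, the misformatting failure mode disappears; hallucinated entity names do not, since the grammar constrains syntax, not which entities exist.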
When to Use Which
The practical split for a home automation voice assistant:
Use Hassil when your command vocabulary is stable and you want reliability. The overwhelming majority of daily interactions are covered: switching devices, adjusting climate, querying simple state, running named scenes. Hassil is fast, deterministic, and gets better as the community intent templates grow. New templates ship with each HA release.
Add an LLM backend when you want to handle questions about home state, multi-step requests, or natural language that falls outside the template vocabulary. The Ollama integration in HA supports using the LLM as a fallback after Hassil fails, though this requires custom configuration. A hybrid approach where Hassil handles the common case and the LLM handles unknown intents gives you low latency for routine commands and flexibility for edge cases.
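The hybrid dispatch can be sketched as a two-stage function. The names and dict shapes here are hypothetical, and HA's actual fallback wiring differs, but the structure is the point: try the deterministic matcher first and only pay the LLM cost on a miss.

```python
from typing import Callable, Optional

def hybrid_agent(text: str,
                 template_match: Callable[[str], Optional[dict]],
                 llm_fallback: Callable[[str], dict]) -> dict:
    """Deterministic matcher first (~1 ms path); LLM only on a miss
    (seconds path). Hypothetical sketch, not HA's implementation."""
    intent = template_match(text)
    if intent is not None:
        return {"source": "hassil", **intent}
    return {"source": "llm", **llm_fallback(text)}

# Toy stand-ins for the two stages:
TEMPLATES = {"turn off the kitchen lights":
             {"intent": "HassTurnOff", "area": "kitchen", "domain": "light"}}

def toy_match(text):
    return TEMPLATES.get(text.lower())

def toy_llm(text):
    return {"intent": "HassTurnOff", "area": "kitchen", "domain": "light"}

print(hybrid_agent("Turn off the kitchen lights", toy_match, toy_llm)["source"])  # hassil
print(hybrid_agent("kill the kitchen glow", toy_match, toy_llm)["source"])        # llm
```

Tagging each result with its source, as above, also makes the nondeterminism auditable: any surprising action can be traced to the stage that produced it.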
The extended Ollama conversation integration project, a custom HACS integration called home-llm, takes this further by fine-tuning small models specifically on home automation data. A fine-tuned Phi-2 or Llama 3.2 3B model trained on HA-specific interaction patterns generates correctly-formatted service calls more reliably than a general-purpose model and runs faster due to the smaller parameter count. The project ships pre-trained LoRA adapters you can load through an Ollama Modelfile.
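Loading a LoRA adapter in Ollama uses the Modelfile ADAPTER directive. The sketch below shows the shape of such a file; the base model tag, adapter filename, and system prompt are placeholders, not the project's actual artifact names.

```
FROM llama3.2:3b
# Path to a LoRA adapter fine-tuned on HA interaction data (placeholder)
ADAPTER ./home-llm-lora.gguf
PARAMETER temperature 0
SYSTEM """You are a home automation assistant. Respond only with
service calls for the devices listed in the prompt."""
```

Building it with `ollama create home-llm -f Modelfile` yields a model tag that can be selected in the HA Ollama integration like any other.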
The Broader Point
The Home Assistant community thread that surfaced on Hacker News documents a real milestone: local voice assistants are reliable enough for daily use in 2025. The Wyoming protocol, faster-whisper, and Piper together solved the audio pipeline. The conversation agent layer is the next frontier.
What makes this interesting from a systems perspective is that the two conversation agent approaches optimize for fundamentally different things. Hassil is a compiler: it transforms templates into a matching structure at load time, and runtime lookup cost is tiny and essentially independent of the number of templates. An LLM conversation agent is a generative model: it is slower, nondeterministic, and flexible in ways that are hard to bound. Neither is the right answer in all cases.
For voice interfaces in particular, where feedback from the assistant is limited to audio and the user has no visual indication of what went wrong, reliability matters more than expressiveness. A system that fails silently or gives unexpected results trains users to stop using it. Hassil’s determinism is a feature, not a limitation, for the 90 percent of interactions it handles. The LLM path earns its latency cost only for the interactions that Hassil genuinely cannot handle.
The trajectory is clear enough: smaller, faster, locally-runnable models will continue closing the latency gap, and grammar-constrained generation will improve reliability for structured output tasks like HA service calls. The current pain points are real but not permanent.