Headless, Local, and Useful: Wiring Gemma 4 Into Claude Code with LM Studio's CLI
Source: hackernews
The setup for running a fully local AI coding assistant has always been achievable in theory. In practice, the experience has involved enough manual steps, format mismatches, and documentation gaps that most developers ended up either abandoning it or automating around the rough parts with scripts they’d have to maintain themselves. The combination of LM Studio’s new headless CLI, Google’s Gemma 4, and Claude Code’s configurable base URL changes the math.
George Liu’s walkthrough describes the specific workflow, and the HN discussion around it has people sharing variations. What’s worth going deeper on is the plumbing: what each piece actually does, where the protocol mismatches are, and what the honest tradeoffs look like for day-to-day use.
LM Studio’s CLI: From GUI to Scriptable Server
LM Studio has been the most approachable tool for running local models, especially on Apple Silicon where its Metal support is well-tuned. The limitation has always been that it was fundamentally a GUI application. You could start it, load a model, and use its built-in API server, but triggering any of that from a script or a remote machine required keeping the app open and the display running.
The lms CLI resolves this. It ships as part of LM Studio and provides a headless interface to the same underlying model management and inference engine:
# Start the inference server
lms server start
# Pull a model from LM Studio's catalog or Hugging Face
lms get google/gemma-4-it-GGUF
# Load a specific quantization into memory
lms load gemma-4-it-q4_k_m
# List currently loaded models
lms ps
# Stream server logs
lms log stream
Once running, the server exposes an OpenAI-compatible HTTP API at http://localhost:1234/v1. The key endpoints are /v1/chat/completions, /v1/completions, and /v1/models. Anything in the OpenAI tooling ecosystem, whether that’s curl, Continue.dev, OpenWebUI, or a custom script, can talk to it without any additional setup.
The significance of headless operation goes beyond convenience. It means LM Studio can run as a background service, be started on boot via systemd or launchd, be invoked from shell scripts as part of a larger pipeline, or be deployed on headless Linux servers without a display. The tool’s value proposition shifts from “easy model management for individual developers” to “local inference infrastructure you can actually automate.”
Gemma 4 as a Local Coding Model
The Gemma family from Google has been worth tracking. Gemma 3, released in early 2025, introduced multimodal capabilities and improved instruction following across a range from 1B to 27B parameters. The 27B variant in particular scored well on coding benchmarks relative to what you’d expect from a model that size, and the instruction-tuned variants handled structured output formats well enough to be genuinely useful for tool-use-heavy workflows.
Gemma 4 extends that foundation with stronger reasoning and a more consistent instruction-following profile. It distributes in GGUF format, the quantized format that LM Studio, Ollama, and llama.cpp all use for local inference.
The practical quantization choices are Q4_K_M and Q8_0. Q4_K_M uses K-quants, a technique that applies higher precision to weight matrices with outsized influence on output quality while compressing others more aggressively. Compared to the older Q4_0 format, Q4_K_M delivers meaningfully better output at roughly the same file size. On a 16GB unified memory machine, a Gemma 4 12B at Q4_K_M fits comfortably with memory left over for the OS and tooling. For 27B at Q4_K_M, 24GB or more is the practical floor. Q8_0 is closer to full model precision but roughly doubles the memory footprint at the same parameter count.
Inference throughput on Apple Silicon via LM Studio’s Metal backend is genuinely usable. A 12B model at Q4_K_M on recent M-series hardware will generate somewhere in the range of 30 to 50 tokens per second depending on context length and batch size. That is slower than Anthropic’s infrastructure for burst tasks, but it is fast enough that interactive use does not feel painful for most queries.
Routing Claude Code to a Local Endpoint
Claude Code, Anthropic’s terminal-based coding assistant, is built around the Anthropic Messages API. It respects the ANTHROPIC_BASE_URL environment variable, which replaces the base URL for all API calls. The complication is that LM Studio serves an OpenAI-compatible API, not an Anthropic-compatible one, and the two formats diverge in enough places that you cannot point Claude Code at LM Studio directly.
The differences are concrete. Anthropic separates the system prompt from the messages array and uses a top-level system field, while OpenAI folds the system prompt into the messages array as a role: system entry. Tool use schemas are different between the two protocols. The streaming event structure differs in field names and payload shape. Sending Claude Code’s Anthropic-formatted requests directly to LM Studio’s OpenAI endpoint produces errors or malformed responses.
The bridge is a local proxy that performs the format translation. LiteLLM is the standard option for this:
pip install 'litellm[proxy]'
litellm \
--model openai/gemma-4-it \
--api_base http://localhost:1234/v1 \
--api_key lm-studio
LiteLLM starts on port 4000 by default and exposes an Anthropic-compatible API on that port. With the proxy running, point Claude Code at it:
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=lm-studio
claude
The ANTHROPIC_API_KEY value does not need to be valid since authentication is not enforced against your own localhost proxy. Any non-empty string works. Claude Code’s full feature set, including file editing, bash execution, and the multi-step tool-use loop, continues to function because LiteLLM handles tool schema translation as part of the protocol conversion, not just message reformatting.
What the Tradeoffs Actually Look Like
The primary use cases for this setup are privacy and cost. If your work involves proprietary code that you would rather not send to external APIs, local inference addresses the constraint directly without requiring a change to the tooling you use day-to-day. For developers or teams running high volumes of code generation, the marginal cost of local inference after hardware is zero, which matters when you’re doing intensive refactoring sessions over many hours or running automated reviews on every commit.
The limitation that matters most is capability. Gemma 4 is a strong model at its size class, but it does not match Claude Sonnet at multi-step reasoning, long context coherence, or complex refactoring that requires holding large amounts of state across many files simultaneously. For narrow, well-defined tasks such as generating a function from a docstring, translating a snippet between languages, or explaining a short block of code, the gap is modest. For the kind of ambitious structural changes that Claude Code’s agentic loop handles well with a Sonnet backend, local models fall short in ways that become apparent within a few turns.
Latency compounds the capability issue. Claude Code’s agentic mode issues many sequential model calls during a single session. Each call involves Claude Code sending the full conversation context, the model generating a response, and the tool-use loop executing the result before the next call begins. The 30 to 50 token/second throughput that feels acceptable for a single interactive query starts to feel slow when you are watching Claude Code work through a multi-step plan with ten or fifteen tool calls. The time adds up in a way that the raw throughput numbers do not fully convey.
The Composability Pattern
What is worth noting beyond the specific tools is the structural pattern that makes this work. LM Studio exposes an OpenAI-compatible API. LiteLLM translates that to Anthropic-compatible format. Claude Code points at the translation layer via an environment variable. None of this required patching any of the three tools; it is pure composition at the protocol level, and each component remains independently replaceable.
Ollama can substitute for LM Studio in the same slot and exposes the same OpenAI-compatible endpoint. A different model can substitute for Gemma 4 with only a model name change. The HN discussion includes people running variations with different models for different task types, routing by context length requirements or task category. Some are running it inside dev containers so the entire local inference stack is portable with the project.
None of these variations are complicated once you understand where the protocol boundary sits. The inference server speaks OpenAI format, the proxy translates to whatever format the client needs, and the client configuration points at the proxy. The API format, not the specific model or serving infrastructure, is the stable integration point. That is the part worth internalizing if you plan to build on top of any of these tools.