
When Running AI Locally Is Worth It, and When It Is Not

Source: hackernews

Tools like canirun.ai answer the hardware compatibility question: given your GPU or CPU, which models fit in memory. That is a useful starting point. The question underneath it is whether running locally is the right choice at all. The answer depends on token throughput, API economics, data sensitivity, and what capability level you need, none of which surface in a hardware compatibility table.

The Economics of API versus Local

Cloud inference costs money per token. The exact rates shift as providers compete, but as a rough reference: GPT-4-class models run in the range of $5-15 per million output tokens, while GPT-4o-mini-class models sit around $0.60 per million output tokens. Running a capable open model locally brings marginal inference cost down to electricity.

The math depends heavily on volume. A developer running a coding assistant that generates 200 tokens per prompt, 500 times a day, produces 100,000 output tokens daily. At a mid-range API price of $5 per million output tokens, that is $0.50 per day, or about $182 per year: light enough that API costs are irrelevant.

Double those numbers across a team of five developers and the annual cost approaches $1,800. At that scale, a single RTX 4090 at roughly $1,600, drawing around 450W during inference, pays for itself in hardware cost within its first year before factoring in electricity. The crossover point is lower than it might seem.
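The arithmetic above is simple enough to sketch directly. The figures below are the article's rough estimates, not current prices:

```python
# Break-even arithmetic for API vs. local inference, using the
# article's rough figures (not current prices).

TOKENS_PER_PROMPT = 200
PROMPTS_PER_DAY = 500
PRICE_PER_MILLION = 5.00   # mid-range API output price, USD
GPU_COST = 1600            # approximate RTX 4090 price, USD

daily_tokens = TOKENS_PER_PROMPT * PROMPTS_PER_DAY            # 100,000
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION     # $0.50
annual_cost_one_dev = daily_cost * 365                        # ~$182

# Doubled usage across a team of five developers:
annual_cost_team = annual_cost_one_dev * 2 * 5                # ~$1,825

print(f"one dev:   ${annual_cost_one_dev:,.2f}/yr")
print(f"team:      ${annual_cost_team:,.2f}/yr")
print(f"GPU payback: {GPU_COST / annual_cost_team:.1f} years")
```

The payback comes out to under a year at team scale, matching the claim above, and the same three constants can be swapped for your own usage pattern.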

The calculation shifts unfavorably for local inference when quality matters enough. If a task genuinely requires a frontier model and the local alternative produces materially worse results, the cost comparison is moot. Local inference saves money only when the local model is good enough for the job.

Token Speed and the Usability Floor

The compatibility question has a binary answer; the usability question does not. A model can technically run on hardware that produces 1.5 tokens per second. That is usable for overnight batch processing and too slow for a chat interface.

The practical floor for interactive use is around 3-5 tokens per second. Human reading speed is roughly 4 tokens per second at a comfortable pace, so generation below that rate creates an experience where text appears more slowly than you could read it. Above roughly 15-20 tokens per second, the model runs well ahead of you.

On an RTX 4090 with a Q4_K_M Llama 3.1 8B model, llama.cpp generates around 120-150 tokens per second. On a 16 GB Apple M2 Pro with the same model, you get roughly 30-50 tokens per second. Both are comfortable. On a CPU-only run with a 70B model and no GPU offloading, you might see 1-3 tokens per second on a fast desktop CPU.

Knowing the expected throughput for your hardware is part of what a tool like canirun.ai surfaces alongside the memory compatibility check. The question is not only whether the model loads, but whether it generates at a rate that fits your use case.
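You can also measure your actual throughput directly. Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which tokens per second follows; a sketch, assuming a local Ollama server on the default port:

```python
# Measure generation throughput from Ollama's eval statistics.
# The /api/generate response (stream=False) reports eval_count and
# eval_duration (in nanoseconds) per Ollama's API documentation.

import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval stats into tokens per second."""
    return eval_count / (eval_duration_ns / 1e9)

def measure(model: str, prompt: str) -> float:
    """Run one generation against a local Ollama server and report tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    stats = json.loads(urllib.request.urlopen(req).read())
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])

# Example with sample numbers: 600 tokens over 4.8s of eval time
print(tokens_per_second(600, 4_800_000_000))  # 125.0 tok/s
```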

Where Local Models Compete Well

The quality gap between local and frontier models is real but task-dependent, and it has been narrowing.

Coding assistance: Models like Qwen2.5-Coder 7B and DeepSeek-Coder 6.7B handle code completion, simple bug fixes, and test generation at a level useful for day-to-day development. The gap to GPT-4o or Claude Sonnet widens on architectural questions, complex refactors, and novel algorithm design.

Summarization and extraction: When the document is in the context window, the model does not need broad world knowledge, just the ability to follow instructions. 7B-13B models handle factual extraction from provided text competently, and local models perform well here relative to their parameter count.

Structured output: With careful prompting or JSON mode, local models reliably produce structured output from well-defined schemas. Local models work well for automation pipelines where you control the input format.

Multi-step reasoning: The gap is most pronounced for tasks requiring sustained reasoning chains, broad world knowledge, or creative synthesis across domains. A 70B local model narrows the gap compared to a 7B, but frontier models retain a meaningful lead.
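Part of why structured output works well in pipelines is that the consuming code can enforce the schema, so a malformed generation fails loudly rather than corrupting downstream data. A minimal validation sketch; the invoice schema and field names here are illustrative, not from any particular tool:

```python
import json

# Hypothetical extraction schema for illustration; in a real pipeline
# this would match whatever JSON shape you prompt the model for.
REQUIRED = {"vendor", "total", "date"}

def parse_invoice(raw: str) -> dict:
    """Validate a model's JSON output against the expected schema."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

# A well-formed generation passes straight through:
print(parse_invoice('{"vendor": "Acme", "total": 41.5, "date": "2024-11-02"}'))
```

Rejecting and retrying a bad generation is cheap locally, which is another reason this workload suits local models.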

The parameter count threshold matters less than it did a year ago. Phi-4 at 14 billion parameters matches or exceeds earlier-generation 70B models on several standard benchmarks, reflecting architectural improvements and better training data rather than raw scale. Running a current-generation 14B model on hardware that previously handled only 7B models is a genuine capability increase.

Privacy and Offline Use

API inference involves sending prompts to a third-party server. For personal side projects that is usually acceptable. For production applications handling medical records, legal documents, financial data, or proprietary source code, it frequently is not. Regulatory requirements, contractual obligations, or security posture may require that data not leave the local environment.

Local inference removes that constraint. The model runs on hardware you control and no data traverses the network. This matters for enterprise security requirements, air-gapped deployments, and applications where users reasonably expect their inputs to remain local.

The offline case is simpler but practical. A coding assistant that works on a flight, a translation tool that runs without cellular data, development workflows that function during an internet outage. Dependency on API availability is a reliability constraint that disappears with local inference.

Workflow Integration

The friction of switching to local inference has dropped considerably. Ollama runs an HTTP server with an endpoint format compatible with the OpenAI SDK. Any code already using the OpenAI client works against a local model by changing two configuration values:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client but ignored locally
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Review this function for bugs"}],
)
print(response.choices[0].message.content)

LM Studio offers the same OpenAI-compatible server in a GUI wrapper, which is useful for exploring and benchmarking models interactively before committing to a setup. Jan.ai provides a similar interface with an extension model for custom integrations. All three rely on llama.cpp under most configurations.

Switching an existing application from a cloud API to a local model is a configuration change, and that work does not need repeating as you swap models or adjust quantization levels.

Making the Decision

For most developers, the practical choice is between a capable 7-8B model on consumer hardware and a cloud API for tasks that exceed local capability. Running both, routing by task type, is a workable pattern: local model for code completion and document summarization, API fallback for complex reasoning or research synthesis.
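The routing pattern above can be sketched in a few lines. The task labels, model names, and routing table here are illustrative assumptions, not a prescribed setup:

```python
# Minimal task-based routing between a local Ollama endpoint and a
# cloud API. Task labels and model choices are illustrative.

LOCAL_TASKS = {"completion", "summarization", "extraction"}

def pick_backend(task: str) -> tuple[str, str]:
    """Return (base_url, model) for a given task type."""
    if task in LOCAL_TASKS:
        return "http://localhost:11434/v1", "qwen2.5-coder:7b"
    # Fall back to a frontier model for reasoning-heavy work.
    return "https://api.openai.com/v1", "gpt-4o"

print(pick_backend("summarization"))
print(pick_backend("research"))
```

Because both backends speak the same OpenAI-compatible protocol, the returned pair can be fed straight into the client constructor shown earlier, and the routing table is the only thing that changes as local models improve.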

canirun.ai answers the prerequisite question for that setup. Once you know what your hardware supports, the decision of whether to run locally is about task requirements, not hardware specs. The hardware sets the ceiling; whether that ceiling is high enough depends entirely on what you are building.
