LLM APIs in Practice: The Implementation Details That Benchmarks Miss
Source: simonwillison
The LLM provider ecosystem has expanded well beyond the point where a single comparison table can tell you which API to use. What started as a manageable choice between a handful of providers has become a sprawling landscape where the meaningful differences live in implementation details rather than headline benchmark numbers.
Simon Willison’s recent piece on researching LLM APIs captures this problem well. His approach, built around years of maintaining the llm CLI tool, is systematic experimentation: run the same prompt against multiple providers, observe the differences, log everything, and let the data accumulate into something useful. The machinery required to do that consistently reveals a lot about how fractured the underlying API landscape has become.
The API Surface Problem
Every major provider has converged on similar capabilities while diverging on the implementation details that matter in production.
Take structured output. OpenAI introduced JSON mode in late 2023 and later added Structured Outputs with strict JSON Schema enforcement via the response_format parameter. Anthropic’s Claude handles structured data through its tool use mechanism, which operates under a completely different mental model. Google’s Gemini supports structured output through response_mime_type: "application/json" combined with a response_schema field. These solve the same problem with incompatible interfaces, and their failure modes are different enough to matter.
For a Discord bot that routes requests to different backends depending on context or cost, this means you cannot simply swap providers. You need an adapter layer, and that adapter layer is where complexity accumulates.
Here is what the OpenAI structured output interface looks like:
from openai import OpenAI
from pydantic import BaseModel

class TaskExtract(BaseModel):
    title: str
    priority: int
    tags: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract task: Fix rate limiter ASAP, high priority, backend"}],
    response_format=TaskExtract,
)
result = completion.choices[0].message.parsed
The Anthropic equivalent uses tool use, with a completely different call structure:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "extract_task",
        "description": "Extract task details from user message",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "integer"},
                "tags": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["title", "priority", "tags"]
        }
    }],
    # Force the model to call extract_task rather than reply in prose
    tool_choice={"type": "tool", "name": "extract_task"},
    messages=[{"role": "user", "content": "Extract task: Fix rate limiter ASAP, high priority, backend"}]
)
# The tool_use block's input is already a dict, so no JSON parsing is needed
result = response.content[0].input
The extracted data is the same, but the code paths are completely different, and so are the return types: one call hands back a Pydantic object, the other a plain dict. If you want to A/B test providers or build a fallback chain, you are writing adapters whether you intended to or not.
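A minimal sketch of what such an adapter might look like follows. It assumes the OpenAI path yields a Pydantic v2 object (which exposes model_dump()) and the Anthropic path yields the tool call's input as a plain dict. The names Task, from_openai_parsed, and from_anthropic_tool_input are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Task:
    """Provider-neutral result type (hypothetical, for illustration)."""
    title: str
    priority: int
    tags: tuple[str, ...]


def _from_mapping(data: Mapping[str, Any]) -> Task:
    # Validate at the boundary so both providers fail in the same way.
    missing = {"title", "priority", "tags"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Task(
        title=str(data["title"]),
        priority=int(data["priority"]),
        tags=tuple(str(tag) for tag in data["tags"]),
    )


def from_openai_parsed(parsed: Any) -> Task:
    # Assumption: `parsed` is a Pydantic v2 model instance.
    return _from_mapping(parsed.model_dump())


def from_anthropic_tool_input(tool_input: Mapping[str, Any]) -> Task:
    # Assumption: `tool_input` is the `input` of a tool_use content block.
    return _from_mapping(tool_input)
```

Everything downstream of these two functions can then work with Task and never needs to know which provider produced it.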
Where Simon’s llm Tool Fits In
The llm package addresses this directly. It provides a unified Python API and CLI that abstracts over provider differences through a plugin system. It supports OpenAI, Anthropic, Google, Mistral, Cohere, and a long tail of others through community-maintained plugins.
# Install providers
pip install llm llm-anthropic llm-gemini

# Run the same prompt across providers
# (model aliases depend on plugin versions; `llm models` lists what is installed)
llm -m gpt-4o "Explain RAFT consensus in one paragraph"
llm -m claude-3.5-sonnet "Explain RAFT consensus in one paragraph"
llm -m gemini-1.5-pro-latest "Explain RAFT consensus in one paragraph"
For research workflows, this is genuinely useful. You can pipe the same input through multiple models, log results to the built-in SQLite store, and query that database later. Having all API calls logged locally with metadata including model name, prompt, response, latency, and token counts is more useful than any one-time benchmark comparison. The data accumulates into a picture of where each provider actually performs for your specific workload.
The Python API is equally clean:
import llm

# The alias comes from the llm-anthropic plugin; check `llm models` for yours
model = llm.get_model("claude-3.5-sonnet")
response = model.prompt("What are the trade-offs between B-trees and LSM trees?")
print(response.text())
Swapping the model name is the only change required to test a different provider. The CLI logs every call automatically; from Python, hold on to the responses you want to compare. For anyone doing iterative research across providers, that reduction in friction compounds quickly.
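One way to use that API for side-by-side comparison is sketched below. It assumes the relevant plugins and API keys are already configured. The function name compare_models and its get_model parameter are hypothetical; the injectable loader is there only so the loop can be exercised without network access.

```python
from typing import Callable, Iterable, Optional


def compare_models(
    prompt: str,
    model_ids: Iterable[str],
    get_model: Optional[Callable] = None,
) -> dict[str, str]:
    """Run one prompt against several models and collect the text replies."""
    if get_model is None:
        import llm  # imported lazily; needs `pip install llm` plus plugins

        get_model = llm.get_model
    results = {}
    for model_id in model_ids:
        try:
            results[model_id] = get_model(model_id).prompt(prompt).text()
        except Exception as exc:  # unknown model, missing key, provider error
            results[model_id] = f"ERROR: {exc}"
    return results
```

Recording failures as values rather than raising keeps one misconfigured provider from aborting the whole comparison run.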
Context Windows and Their Actual Implications
Another axis where providers diverge significantly is context window handling. OpenAI’s GPT-4o supports 128k tokens. Gemini 1.5 Pro supports up to 2M, and Gemini 2.0 Flash supports 1M. Claude 3.5 supports 200k. These numbers matter less than how models behave as context fills up, which is something benchmark tables do not capture.
Long-context performance varies considerably across providers. Retrieving a specific fact buried in the middle of a 200k token document is harder than retrieval from the beginning or end of that document. This is the so-called lost-in-the-middle problem, documented in research from 2023 and still not fully resolved across the industry. Providers have made progress on it, but the extent of that progress varies by model and task type.
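A rough way to probe this on your own data is to plant a known fact at different depths in a long document and ask each model to retrieve it. The helper below is a sketch under that assumption; build_haystack and needle_cases are hypothetical names, and the filler would come from your real documents.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return "\n".join(filler[:position] + [needle] + filler[position:])


def needle_cases(
    filler: list[str],
    needle: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, str]:
    """Build one test document per depth, keyed by that depth."""
    return {depth: build_haystack(filler, needle, depth) for depth in depths}
```

Sending each generated document with the same retrieval question, and recording whether the answer contains the planted fact, gives a per-depth hit rate for each model.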
The practical implication for bot development: injecting a large codebase into context and asking for analysis is not equivalent to retrieval-augmented generation with a well-chunked index. The former is expensive and gives inconsistent results; the latter is cheaper and more predictable. Knowing which approach fits your workload requires testing on your actual data, not reading the context window specification.
Pricing Models Are Not Comparable
Every provider quotes pricing per million tokens, but the comparison is murkier than it looks. Token sizes differ between tokenizers, rate limits vary significantly, and caching behavior is provider-specific.
OpenAI offers prompt caching with automatic detection for repeated prefixes. Anthropic uses explicit cache_control markers that you place deliberately on content blocks in your request. Google’s Gemini has context caching for large repeated inputs. These mechanisms matter substantially for bots that send a large system prompt with every conversation turn.
A bot injecting a 10k token system prompt into every request gets a very different effective cost per conversation depending on which caching mechanism is available and whether the implementation actually uses it. This is the kind of detail that is invisible in top-level pricing comparisons.
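To make that concrete, here is a back-of-the-envelope calculation. The price and the cache multipliers are placeholder assumptions, not any provider's actual rates, and the model assumes every turn after the first hits the cache, which ignores cache expiry.

```python
def system_prompt_cost(
    system_tokens: int,
    turns: int,
    price_per_mtok: float,
    cache_write_factor: float = 1.0,
    cached_read_factor: float = 1.0,
) -> float:
    """Dollar cost of resending a system prompt on every turn.

    The first turn pays the cache-write rate and later turns pay the
    cached-read rate. With both factors at 1.0 this is the uncached cost.
    """
    if turns < 1:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    first_turn = system_tokens * per_token * cache_write_factor
    later_turns = system_tokens * per_token * cached_read_factor * (turns - 1)
    return first_turn + later_turns


# Hypothetical numbers: 10k-token prompt, 20 turns, $3 per million input tokens
uncached = system_prompt_cost(10_000, 20, 3.0)
cached = system_prompt_cost(
    10_000, 20, 3.0, cache_write_factor=1.25, cached_read_factor=0.1
)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

Under these assumed rates the system prompt costs about $0.60 per conversation without caching and about $0.09 with it, a gap that no per-million-token price table shows.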
How to Research This Systematically
Simon’s approach, and the one that holds up in practice, is empirical. Define a concrete task, run it against several providers with the same prompt and parameters, measure latency and cost, evaluate output quality for that task specifically, and build up data over multiple runs. Single-sample comparisons produce misleading conclusions because LLM outputs have variance.
The llm tool’s SQLite logging makes the data collection part automatic. For evaluation, you need to define what better means for your use case before running the tests. Comparing outputs by eye after the fact tends to anchor on whichever result you read first.
A minimal research loop:
# Log responses from multiple providers for the same task
for model in gpt-4o claude-3.5-sonnet gemini-2.0-flash; do
    llm -m "$model" "$(cat test_prompt.txt)"
done

# Query the logged results
llm logs list --json | jq '.[] | {model, duration_ms, response: .response[:200]}'
From there you can layer in evaluation logic, run statistical comparisons across sample sets, and build a grounded picture of where each provider performs for your workload.
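As one example of that layering, the sketch below summarizes latency per model straight from the log database. It assumes a responses table with model and duration_ms columns; verify this against your own database (the llm logs path command prints its location). The function name latency_by_model is hypothetical.

```python
import sqlite3
import statistics
from collections import defaultdict


def latency_by_model(conn: sqlite3.Connection) -> dict[str, dict[str, float]]:
    """Summarize logged latency per model.

    Assumes a `responses` table with `model` and `duration_ms` columns.
    """
    durations = defaultdict(list)
    rows = conn.execute(
        "SELECT model, duration_ms FROM responses WHERE duration_ms IS NOT NULL"
    )
    for model, duration_ms in rows:
        durations[model].append(duration_ms)
    return {
        model: {
            "n": len(values),
            "median_ms": statistics.median(values),
            # A standard deviation needs at least two samples
            "stdev_ms": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for model, values in durations.items()
    }
```

Reporting the sample count alongside the median makes it obvious when a comparison rests on too few runs to mean anything.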
What This Means for Bot Development
Building against multiple LLM backends simultaneously changes the architecture in specific ways. You need a routing layer that selects a provider based on task type, cost constraints, or availability. You need adapters that normalize structured output across provider conventions. You need fallback logic for rate limit scenarios.
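The routing and fallback pieces can be sketched in a few lines. Everything here is hypothetical: RateLimited stands in for whatever error your adapters raise on a 429, and each provider is modeled as a plain function from prompt to completion text.

```python
from typing import Callable, Mapping, Sequence

Provider = Callable[[str], str]  # prompt in, completion text out


class RateLimited(Exception):
    """Hypothetical error an adapter raises when its provider throttles us."""


def complete(
    prompt: str,
    task_type: str,
    routes: Mapping[str, Sequence[str]],
    registry: Mapping[str, Provider],
) -> tuple[str, str]:
    """Try providers in the order configured for this task type.

    Returns (provider_name, completion). Falls through to the next
    provider only on RateLimited; any other error propagates.
    """
    order = routes.get(task_type, routes["default"])
    errors = []
    for name in order:
        try:
            return name, registry[name](prompt)
        except RateLimited as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers rate limited: " + "; ".join(errors))
```

Returning the provider name with the completion lets the caller log which backend actually served each request, which matters once fallbacks start firing.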
None of this is theoretically complicated, but it adds surface area that needs testing and maintaining. The tooling that reduces that surface area, whether that is llm’s abstraction layer or the individual SDKs’ retry and streaming utilities, is worth understanding deeply before committing to an approach.
Benchmark leaderboards will tell you which model scored highest on MMLU this quarter. They will not tell you which provider’s rate limits will cause you problems under production load, which streaming implementation handles backpressure correctly, or which tool-calling interface is easiest to validate at runtime. That knowledge comes from building things and logging what happens, which is more or less what systematic API research looks like in practice.