How to Actually Research LLM APIs Without Getting Lost in Marketing Copy

The number of LLM APIs a developer can plausibly integrate into a project has grown from a handful to something resembling a phone book: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Together AI, Fireworks, Groq, and a long tail of OpenAI-compatible wrappers serving open-weights models. Each ships with documentation that is technically accurate and strategically incomplete.

Simon Willison, who built the llm CLI tool and has been systematically cataloguing this space longer than most, published a research post on LLM APIs that captures something worth expanding on: the gap between what LLM provider docs tell you and what you actually need to know when picking an API for real work.

The research problem with LLM APIs is not that information is unavailable. It is that the information is scattered, inconsistently structured, and often framed around the provider’s strongest features rather than a neutral feature inventory. When I was evaluating which model to use for my Discord bot’s context-aware responses, I spent more time reading between the lines of documentation than reading it directly.

What the `llm` Tool Reveals About the Research Problem

Willison’s llm Python library and CLI are a useful lens here. The tool abstracts over dozens of providers through a plugin system, meaning he has had to model the common capabilities and divergences of each API to make them interchangeable at the command line. That abstraction is itself a kind of research output.

# Install the llm tool
pip install llm

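# Add Anthropic and Gemini plugins
llm install llm-anthropic llm-gemini

# Store an API key for each provider (each command prompts for the key)
llm keys set openai
llm keys set anthropic
llm keys set gemini

# Run the same prompt across three models
# (model IDs vary by plugin version; `llm models` lists what is installed)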
echo "Explain token counting" | llm -m gpt-4o
echo "Explain token counting" | llm -m claude-3-7-sonnet-latest
echo "Explain token counting" | llm -m gemini-2.0-flash

The plugin surface exposes which capabilities are universal and which are provider-specific. Tool calling now works across all the major APIs, but the schema formats differ in ways that matter. Anthropic uses a tools array in which each entry carries an input_schema field that follows JSON Schema closely. OpenAI started with a functions parameter, then replaced it with a tools array whose entries wrap each definition in a function object with a parameters schema. Google’s Gemini API has its own function_declarations structure, which accepts a subset of JSON Schema and excludes some features the others support.
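To make the divergence concrete, here is a minimal sketch that renders one neutral tool description into the three shapes named above. The get_weather tool and the helper function names are hypothetical, and the field layouts reflect the formats as described here, so check each provider's current reference before relying on them.

```python
# One neutral tool description, rendered into three provider-specific shapes.

WEATHER_TOOL = {
    "name": "get_weather",  # hypothetical tool, for illustration only
    "description": "Look up the current weather for a city.",
    "schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def to_anthropic(tool: dict) -> dict:
    """Anthropic: a flat entry in `tools`, with the schema under `input_schema`."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["schema"],
    }


def to_openai(tool: dict) -> dict:
    """OpenAI Chat Completions: the entry is wrapped in a `function` object."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["schema"],
        },
    }


def to_gemini(tool: dict) -> dict:
    """Gemini: declarations are grouped under `function_declarations`."""
    return {
        "function_declarations": [
            {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["schema"],
            }
        ]
    }
```

Keeping a single neutral definition and converting at the edge is one way to avoid maintaining three hand-written copies of every tool.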

Structured output has a similar story. OpenAI’s response_format parameter, set to json_schema with strict: true, enforces the schema through constrained decoding. Anthropic’s approach, at least at the time of writing, relies on the model following instructions rather than on constrained sampling, which means you can specify more complex schemas but get slightly less reliable adherence. Google’s Gemini API also supports schema-constrained output, but it is configured with different syntax.
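The practical consequence is that the same schema gets used in two ways: sent to the provider where constrained decoding is available, and checked locally where it is not. The sketch below assumes the OpenAI response_format layout described above; PERSON_SCHEMA and both helper names are made up for illustration, and check_reply only verifies required keys rather than performing full JSON Schema validation.

```python
import json

# Hypothetical schema used for illustration.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
    "additionalProperties": False,
}


def openai_response_format(name: str, schema: dict) -> dict:
    """Build the `response_format` value for schema-enforced output."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


def check_reply(raw: str, schema: dict) -> dict:
    """Parse a reply and confirm the required keys are present.

    Intended for providers that follow the schema by instruction only,
    where a malformed or incomplete reply is possible.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"reply is missing required keys: {missing}")
    return data
```

A failed check is usually handled by retrying the request, which is worth counting when comparing cost and latency between the two approaches.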

These are the details that benchmarks do not capture and that marketing copy glosses over.

Context Windows and What They Actually Mean

Context window sizes are stated clearly in documentation but their practical implications are not. GPT-4o supports 128k tokens. Claude 3.7 Sonnet supports 200k. Gemini 1.5 Pro’s 1M token window gets used in a lot of headlines. What does not get stated clearly is the relationship between context length and latency, cost, and effective retrieval quality.

Retrieval quality in long contexts degrades for some models and tasks. The “lost in the middle” phenomenon documented by Liu et al. in 2023 showed that transformer models tend to attend more reliably to content at the beginning and end of very long contexts. Subsequent model improvements have addressed this partially, but the effect has not been eliminated. A 200k context window does not mean 200k tokens of equally reliable recall.

For a Discord bot handling conversation threads, I found that maintaining a 20-30 message history in context was more effective than stuffing in the full channel history, regardless of what the context limit technically allowed. Beyond a certain point, you are paying for tokens that the model is not using well.
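A sliding window of this kind takes only a few lines. The sketch below assumes messages are dicts with a role key, as in the common chat format; the function name and the default of 25 are illustrative choices, not values from any library.

```python
def trim_history(messages: list[dict], max_messages: int = 25) -> list[dict]:
    """Keep every system message plus the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + turns[-max_messages:]
```

Counting messages is a rough proxy. A stricter version would budget by tokens, since one long message can outweigh twenty short ones.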

Pricing compounds this. Input tokens are billed on every request, so a conversation that resends 100k tokens of context at Anthropic’s Claude 3.7 Sonnet pricing pays for that context again on each call, and costs substantially more than breaking the problem into smaller, more focused calls. Research into LLM APIs needs to account for the cost curve across different context sizes, not just the headline per-million-token rate.
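A quick calculation shows the shape of that curve. The rates below (3 and 15 dollars per million input and output tokens) are assumptions for illustration, as are the token counts; substitute the figures from the provider's current price list.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Cost of one call in USD, with rates given per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed rates in USD per million tokens; check the current price list.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00

# One call carrying 100k tokens of context and a 500-token reply.
one_big_call = call_cost(100_000, 500, INPUT_RATE, OUTPUT_RATE)

# Four focused calls, each with 5k tokens of context and a 500-token reply.
four_small_calls = 4 * call_cost(5_000, 500, INPUT_RATE, OUTPUT_RATE)

print(f"one large call:   ${one_big_call:.4f}")
print(f"four small calls: ${four_small_calls:.4f}")
```

Under these assumptions the single large call costs roughly three times as much as the four focused ones, and the gap widens with every additional turn that resends the same context.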

The OpenAI Compatibility Layer Question

A significant portion of the LLM API ecosystem now presents an OpenAI-compatible interface. Groq, Together AI, Fireworks, Ollama, and most services running open-weights models like Llama 3.3, Qwen 2.5, and DeepSeek V3 all accept requests structured like OpenAI API calls.

from openai import OpenAI

# The same client class can point at different providers.
groq_client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-key",
)

# Or a local Ollama instance
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # ignored by Ollama, but the client requires a value
)

This portability is useful but creates a false sense of fungibility. The OpenAI compatibility layer covers the basic chat completions endpoint, but feature parity stops there. Batch processing, fine-tuning, embeddings, and the newer Responses API with built-in tools are not uniformly replicated. Some providers implement streaming differently enough to cause subtle bugs in clients that assume specific chunk ordering.

Researching an OpenAI-compatible API means testing the actual implementation, not assuming the spec is fully implemented. The compatibility label is a starting point, not a guarantee.
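Streaming is a good place to start that testing. The sketch below is a defensive way to assemble streamed text, assuming chunks shaped like OpenAI chat completion chunks and already converted to plain dicts (for example with the SDK objects' model_dump method). The function name is hypothetical. It tolerates chunks with no choices, deltas with no content, and null content, which are the variations most likely to differ between implementations.

```python
from typing import Iterable


def collect_text(chunks: Iterable[dict]) -> str:
    """Join streamed text deltas without assuming a fixed chunk layout."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # for example, a trailing chunk that carries only usage data
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:  # skips missing, None, and empty-string content
            parts.append(content)
    return "".join(parts)
```

Recording the raw chunks from each provider and replaying them through a function like this turns "streaming works" into something that can be checked again after every provider update.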

Rate Limits as a Research Target

Rate limits are frequently the least accurate part of an LLM provider’s documentation. Published limits reflect the defaults for new accounts, not what you will actually experience at scale, and changes to them are announced far less prominently than new features.

For production use, the only reliable approach is to test rate limits empirically and build retry logic that handles 429 responses with exponential backoff. The specific headers that carry rate limit information differ across providers:

OpenAI uses x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests
Anthropic uses anthropic-ratelimit-tokens-remaining and anthropic-ratelimit-requests-remaining
Google Gemini communicates limits through error response bodies rather than headers in most cases
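A minimal sketch of that retry logic follows. It is transport-agnostic: RateLimited is a hypothetical exception that your own HTTP wrapper would raise on a 429 response, and the delay values are arbitrary defaults rather than provider recommendations. The header names in REMAINING_HEADERS are the ones listed above.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by your client wrapper on an HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited (HTTP 429)")
        self.retry_after = retry_after  # seconds, if the server supplied a hint


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Run `call`, retrying on RateLimited with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


REMAINING_HEADERS = {
    "openai": ("x-ratelimit-remaining-requests", "x-ratelimit-remaining-tokens"),
    "anthropic": ("anthropic-ratelimit-requests-remaining",
                  "anthropic-ratelimit-tokens-remaining"),
}


def remaining_quota(provider: str, headers: dict) -> dict:
    """Pick the remaining-quota headers out of a response, case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name) for name in REMAINING_HEADERS[provider]}
```

Logging the output of remaining_quota alongside each request is a cheap way to build the empirical picture of your real limits that the documentation does not provide.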

For a bot that needs to handle burst traffic from a Discord server, understanding the per-minute versus per-day limit structure matters more than the headline limit number. OpenAI and Anthropic both have per-minute token limits that can be exhausted even when you are well within daily quotas.

Multimodal Inputs Across Providers

Vision capabilities are now nearly universal across major providers, but the details diverge in ways that affect what you can build. OpenAI, Anthropic, and Google all accept image inputs as base64-encoded data or URLs. The practical differences emerge around file size limits, supported formats, and whether URL-based images are fetched by the provider or processed as references.
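Even the shared base64 path is shaped differently per provider. As a sketch, the two helpers below build an inline image part for Anthropic's Messages API and for OpenAI's Chat Completions API. The function names are hypothetical, and the layouts should be checked against the current references, since size limits and accepted media types are not encoded here.

```python
import base64


def anthropic_image_block(data: bytes, media_type: str) -> dict:
    """Anthropic: an `image` block with a base64 `source` object."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }


def openai_image_part(data: bytes, media_type: str) -> dict:
    """OpenAI Chat Completions: an `image_url` part holding a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}"},
    }
```

Either result goes into the content list of a user message, next to a text part carrying the actual question about the image.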

Anthropic’s Claude accepts PDFs directly as document blocks, which is genuinely useful for document processing because it removes a pre-processing step. OpenAI’s file handling goes through the Files API with a separate upload step. Gemini 2.0 accepts audio and video as well, with larger files uploaded through its Files API, which the others do not offer at the same level of integration.

For my bot, the ability to process user-uploaded images inline without a separate storage step simplified the architecture considerably. That capability existed across providers, but the implementation details around size limits and accepted MIME types were only discoverable through testing, not documentation.

Building a Personal Research Process

Willison’s broader contribution here is modeling a systematic approach: write small test scripts, run them against multiple providers with the same inputs, document what you find, and keep the test scripts around so you can re-run them as providers update their implementations. The llm tool’s logging functionality stores every prompt and response locally, which makes that kind of comparative testing much less tedious.

For anyone evaluating LLM APIs seriously, the research process should include:

A standard test prompt suite covering your actual use cases
Latency measurements at different context lengths, not just empty-context baselines
Cost calculations at your projected usage volumes, including the difference between input and output token pricing
Failure mode testing: what happens at rate limits, with malformed inputs, with very long outputs
A check of which features are in preview versus generally available, since preview features can disappear or change without the same notice as stable features
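The first two items on that list can share one small harness. The sketch below assumes each provider is wrapped in a plain function that takes a prompt and returns text; run_suite and the record fields are hypothetical names, and the timing covers the whole call rather than time to first token.

```python
import time
from typing import Callable


def run_suite(providers: dict[str, Callable[[str], str]],
              prompts: list[str],
              clock: Callable[[], float] = time.perf_counter) -> list[dict]:
    """Run every prompt against every provider, recording latency and errors."""
    results = []
    for name, ask in providers.items():
        for prompt in prompts:
            start = clock()
            try:
                reply, error = ask(prompt), None
            except Exception as exc:  # record the failure and keep going
                reply, error = None, repr(exc)
            results.append({
                "provider": name,
                "prompt": prompt,
                "reply": reply,
                "error": error,
                "seconds": clock() - start,
            })
    return results
```

Usage amounts to passing a dict such as {"provider-a": ask_a, "provider-b": ask_b}, where each value is your own wrapper around that provider's client, and saving the returned records so later runs can be compared against them.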

LLM APIs are genuinely useful and have improved substantially over the past two years. What has not improved at the same rate is the quality of comparative information available to developers trying to make informed choices. Building your own research process, however lightweight, pays off faster than you expect.