The past year has seen most major AI providers ship some version of a “research” mode, where you point the model at a question and it browses the web for several minutes, reads through sources, and comes back with a synthesized report. Simon Willison recently catalogued several of these research LLM APIs across providers, which prompted me to think harder about what makes these interfaces distinct from the chat completions and tool-calling APIs developers have been building on for the past few years.
These APIs carry a different contract from chat completions: a different cost model, a different latency profile, and a different set of failure modes. Approaching them the same way you would approach a GPT-4o call produces fragile applications.
What Research APIs Actually Do Under the Hood
Before getting into the engineering implications, it helps to be precise about what “research mode” means mechanically. The general pattern across providers is:
- The model receives a question or task
- It generates a series of search queries
- It fetches and reads web pages, sometimes following links several levels deep
- It reasons about what it has found and generates more queries if needed
- It synthesizes everything into a structured report
This is an agentic loop running inside the provider’s infrastructure. You are not calling a model once. You are kicking off a workflow that might involve dozens of internal steps. The model response you receive is just the final synthesis layer; the retrieval and reasoning scaffolding runs invisibly before it.
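To make the shape of that loop concrete, here is a minimal sketch in Python. It is not any provider's actual implementation: `plan_queries`, `search_and_read`, and `synthesize` are hypothetical callables standing in for the model's planning step, the browsing step, and the final write-up, and `max_queries` is an assumed safety cap.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResearchState:
    question: str
    notes: list[str] = field(default_factory=list)
    queries_issued: int = 0


def research_loop(
    question: str,
    plan_queries: Callable[[ResearchState], list[str]],  # hypothetical planner
    search_and_read: Callable[[str], str],               # hypothetical browser
    synthesize: Callable[[ResearchState], str],          # hypothetical writer
    max_queries: int = 10,                               # assumed safety cap
) -> str:
    """Plan queries, read results, re-plan until done, then synthesize."""
    state = ResearchState(question)
    while state.queries_issued < max_queries:
        queries = plan_queries(state)
        if not queries:  # the planner decided it has enough material
            break
        for query in queries:
            if state.queries_issued >= max_queries:
                break
            state.notes.append(search_and_read(query))
            state.queries_issued += 1
    return synthesize(state)


# Stub implementations, only to show the control flow.
def demo_plan(state: ResearchState) -> list[str]:
    if state.queries_issued == 0:
        return [f"{state.question} overview", f"{state.question} benchmarks"]
    if state.queries_issued == 2:
        return [f"{state.question} criticism"]  # one follow-up round
    return []


def demo_read(query: str) -> str:
    return f"notes for: {query}"


def demo_synth(state: ResearchState) -> str:
    return f"{len(state.notes)} sources on {state.question!r}"


report = research_loop("LSM vs B-tree", demo_plan, demo_read, demo_synth)
print(report)  # prints: 3 sources on 'LSM vs B-tree'
```

The point of the sketch is that the number of iterations is decided inside the loop, not by the caller, which is where the variable latency and cost discussed below come from.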
OpenAI’s deep research, available through ChatGPT Pro since February 2025 and exposed to developers via the Responses API, uses the o3 model combined with an extended browsing loop. Google’s equivalent lives inside Gemini’s grounding with Google Search, which routes generation through live search results before producing a response. Perplexity’s Pro Search API offers perhaps the most developer-accessible entry point, exposing deep research as a standard POST endpoint with a flat per-query pricing model.
Each provider has made different tradeoffs around latency, depth, and cost, but the underlying architecture is recognizably the same category of thing.
The Latency Problem
Standard chat completion APIs are synchronous in the happy path. You send a request, you wait seconds to maybe a minute for streaming to complete, you process the response. Your application architecture can be built around that assumption.
Research APIs invalidate it. A full deep research run from OpenAI can take anywhere from five to thirty minutes depending on the complexity of the question. Gemini grounding is faster but still operates on a different time scale from a standard completion call. Perplexity’s deep research mode runs in the three to five minute range for complex queries.
This latency is not a temporary limitation or a scaling problem the provider will eventually optimize away. It is inherent to browsing thirty web pages, reading and reasoning about them, and writing a comprehensive synthesis. The work takes time.
For a Discord bot, this creates an immediate concrete problem. Discord’s interaction model requires you to acknowledge a slash command within three seconds, but allows you to defer the actual response for up to fifteen minutes. A deferred interaction that kicks off a research task might look like this:
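// Acknowledge immediately; the research work happens asynchronously.
await interaction.deferReply();
const query = interaction.options.getString('query');

// runDeepResearch is assumed to be an application-defined helper
// that wraps the provider call and resolves with { summary }.
runDeepResearch(query)
  .then(async (result) => {
    await interaction.editReply({
      // Discord caps message content at 2,000 characters.
      content: result.summary.slice(0, 2000),
    });
  })
  .catch(async (err) => {
    await interaction.editReply({
      content: `Research failed: ${err.message}`,
    });
  });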
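The fifteen-minute window covers many runs, but a research call at the long end of the range will outlast it. Once the interaction token has expired, editReply fails, and the result has to go out as an ordinary channel message or DM instead. You are also holding state across an async boundary with no retry mechanism if the underlying research call fails halfway through. Both cases need handling.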
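One way to bound the wait and retry a failed call is sketched below in Python; the same shape carries over to a discord.js handler. Everything here is an assumption for illustration: `run_research` is a hypothetical coroutine wrapping the provider call, and the safety margin and attempt count are arbitrary. Note that a retry starts a fresh research run, so it is billed again.

```python
import asyncio
import time

INTERACTION_WINDOW_S = 15 * 60  # Discord's deferred-reply window


async def research_with_deadline(
    run_research,              # hypothetical: async (query) -> report text
    query,
    *,
    window_s=INTERACTION_WINDOW_S,
    safety_margin_s=30.0,      # assumed headroom for sending the reply
    max_attempts=2,            # assumed; each attempt is a new paid run
    clock=time.monotonic,
):
    """Return ("ok", report), ("timeout", None), or ("failed", last_error)."""
    deadline = clock() + window_s - safety_margin_s
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            report = await asyncio.wait_for(run_research(query), timeout=remaining)
            return ("ok", report)
        except asyncio.TimeoutError:
            # Out of time: the caller should deliver the result some other
            # way (for example a normal channel message) once it is ready.
            return ("timeout", None)
        except Exception as exc:  # provider or network failure: try again
            last_error = exc
    return ("failed", last_error)
```

The caller can branch on the first element of the returned tuple: edit the deferred reply on "ok", post a follow-up message on "timeout", and report the error on "failed".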
For HTTP applications the picture is more involved. You cannot hold a connection open for twenty minutes reliably. Load balancers, proxies, and client-side timeout settings will kill it. The practical solution is a polling or webhook pattern: submit the task, receive a job ID, poll for status or register a callback URL.
The OpenAI Responses API supports a background mode that decouples submission from retrieval. The pattern looks roughly like:
import time

import openai

client = openai.OpenAI()

# Submit the research task; with background=True the call returns immediately.
response = client.responses.create(
    model="o3",
    input="What are the tradeoffs between LSM-tree and B-tree storage engines for write-heavy workloads?",
    tools=[{"type": "web_search_preview"}],
    background=True,
)
job_id = response.id

# Poll with backoff until the job reaches a terminal state.
delay = 15
while True:
    status = client.responses.retrieve(job_id)
    if status.status == "completed":
        print(status.output_text)
        break
    if status.status not in ("queued", "in_progress"):
        # "failed", "cancelled", and "incomplete" are terminal too; checking
        # only for "failed" would leave the loop spinning on the other two.
        raise RuntimeError(f"Research job ended as {status.status}: {status.error}")
    time.sleep(delay)
    delay = min(delay * 1.5, 60)
That polling loop is the minimum viable implementation. Production use wants a persistent job queue so the polling survives application restarts, a dead-letter mechanism for tasks that stay stuck, and idempotency keys on submission so retries do not double-charge you.
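As a rough illustration of those three pieces, the sketch below keeps job state in SQLite so it survives a restart, deduplicates submissions with a client-side idempotency key, and parks jobs that never finish in a dead-letter status. The table layout, the key derivation, and the `submit` callable are all assumptions made for the example, not part of any provider's API.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS research_jobs (
    idempotency_key TEXT PRIMARY KEY,
    provider_job_id TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    polls INTEGER NOT NULL DEFAULT 0
)
"""


def idempotency_key(user_id: str, query: str) -> str:
    """Same user + same normalized query -> same key (an assumed policy)."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{user_id}\n{normalized}".encode()).hexdigest()


def submit_once(db: sqlite3.Connection, user_id: str, query: str, submit) -> str:
    """Submit via `submit(query) -> job_id` unless this key already has a job."""
    key = idempotency_key(user_id, query)
    row = db.execute(
        "SELECT provider_job_id FROM research_jobs WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    if row and row[0]:
        return row[0]  # retry of an earlier submission: no second charge
    db.execute(
        "INSERT OR IGNORE INTO research_jobs (idempotency_key) VALUES (?)", (key,)
    )
    job_id = submit(query)
    db.execute(
        "UPDATE research_jobs SET provider_job_id = ?, status = 'submitted' "
        "WHERE idempotency_key = ?",
        (job_id, key),
    )
    db.commit()
    return job_id


def record_poll(db: sqlite3.Connection, key: str, finished: bool, max_polls: int = 40) -> str:
    """Count one poll; move jobs that never finish to 'dead_letter'."""
    db.execute(
        "UPDATE research_jobs SET polls = polls + 1 WHERE idempotency_key = ?", (key,)
    )
    (polls,) = db.execute(
        "SELECT polls FROM research_jobs WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if finished:
        status = "done"
    elif polls >= max_polls:
        status = "dead_letter"
    else:
        status = "submitted"
    db.execute(
        "UPDATE research_jobs SET status = ? WHERE idempotency_key = ?", (status, key)
    )
    db.commit()
    return status
```

This version still has a gap: a crash between `submit(query)` and the following `UPDATE` loses the job ID, so the next retry would submit again. Closing that gap needs either provider-side idempotency support or a reconciliation step, both of which are outside this sketch.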
The Cost Model Is Different Too
Chat completion costs are relatively predictable. You pay per token, a given query has a roughly bounded token count, and you can estimate costs before you ship. Research APIs are much harder to budget.
The cost of a research run depends on how many web pages the model decides to read, how long those pages are, how many search queries it issues internally, and how much reasoning it performs between steps. A simple factual question might cost under a dollar. A complex competitive analysis or technical survey can run to twenty dollars or more. The initial prompt you write has far less influence over the final cost than it does in a standard completion.
Perplexity’s flat-rate-per-query pricing buys cost predictability at the price of opacity about actual resource consumption. OpenAI’s token-based pricing exposes the real consumption, which includes all the fetched page content the model processes internally. That content can expand context dramatically on a complex query.
The practical implication: do not expose an unrestricted research endpoint to users without rate limiting or a per-user cost cap. A Discord bot with an open /research command and no constraints can accumulate significant charges in a short time if users discover how to ask expensive questions.
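A per-user guard along those lines might look like the following sketch. It is in-memory only, and the limits (`max_runs`, `max_spend_usd`, `window_s`) are placeholder values rather than recommendations. Because the cost of a run is only known after it finishes, the spend check can only block the next run, so a single expensive query can still overshoot the cap.

```python
import time
from collections import defaultdict, deque


class ResearchBudget:
    """Sliding-window cap on research runs and spend, per user."""

    def __init__(self, max_runs=3, max_spend_usd=20.0, window_s=86_400.0,
                 clock=time.monotonic):
        self.max_runs = max_runs
        self.max_spend_usd = max_spend_usd
        self.window_s = window_s
        self._clock = clock
        self._runs = defaultdict(deque)   # user_id -> start timestamps
        self._spend = defaultdict(deque)  # user_id -> (timestamp, usd)

    def _prune(self, user_id):
        """Drop runs and spend records that have aged out of the window."""
        now = self._clock()
        runs, spend = self._runs[user_id], self._spend[user_id]
        while runs and now - runs[0] >= self.window_s:
            runs.popleft()
        while spend and now - spend[0][0] >= self.window_s:
            spend.popleft()

    def try_start(self, user_id) -> bool:
        """Reserve a run slot; False if the user is over either cap."""
        self._prune(user_id)
        spent = sum(usd for _, usd in self._spend[user_id])
        if len(self._runs[user_id]) >= self.max_runs or spent >= self.max_spend_usd:
            return False
        self._runs[user_id].append(self._clock())
        return True

    def record_cost(self, user_id, usd: float) -> None:
        """Call after a run completes, with the cost computed from its usage."""
        self._spend[user_id].append((self._clock(), usd))
```

In a bot, `try_start` would be called before deferring to the research API and `record_cost` once the provider reports usage. A multi-process deployment would need this state in a shared store rather than in memory.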
Where These APIs Fit
Given the latency and cost profile, it is worth being specific about where research LLM APIs belong in an application architecture rather than where they could theoretically fit.
They work well for tasks that would otherwise occupy a human analyst for an hour or two: competitive intelligence, synthesizing a technical landscape that has shifted since the model’s training cutoff, answering questions that require triangulating across sources. The long time-to-first-token is acceptable precisely when the alternative is human time.
They work poorly for interactive use cases, pipelines where latency compounds, or contexts where predictable cost matters more than thoroughness. If your application needs an answer in under ten seconds, no amount of capability in the underlying model changes the math.
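Those criteria can be written down as a simple routing check. The sketch below is only an illustration of the decision: the two thresholds are assumptions loosely based on the ranges quoted in this post (runs measured in minutes, worst-case costs around twenty dollars), and `TaskProfile` is a hypothetical description of what the caller can tolerate.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CHAT_COMPLETION = "chat_completion"
    RESEARCH = "research"


@dataclass(frozen=True)
class TaskProfile:
    latency_budget_s: float             # how long the caller can wait
    cost_cap_usd: float                 # most the caller will pay per task
    needs_multi_source_synthesis: bool  # does the task need live web sources?


# Assumed thresholds, not provider guarantees.
MIN_RESEARCH_LATENCY_S = 180.0
WORST_CASE_RESEARCH_COST_USD = 20.0


def choose_backend(task: TaskProfile) -> Backend:
    """Route to research only when the task needs it and can afford it."""
    if not task.needs_multi_source_synthesis:
        return Backend.CHAT_COMPLETION
    if task.latency_budget_s < MIN_RESEARCH_LATENCY_S:
        return Backend.CHAT_COMPLETION
    if task.cost_cap_usd < WORST_CASE_RESEARCH_COST_USD:
        return Backend.CHAT_COMPLETION
    return Backend.RESEARCH
```

For example, `choose_backend(TaskProfile(10, 50, True))` falls back to a chat completion because a ten-second budget rules research out regardless of how well it would answer the question.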
The Broader Shift
What research LLM APIs surface is that the chat completion interface, which served the industry well as an early abstraction, does not stretch to cover everything modern LLMs can do. The synchronous request-response model with token-based billing is a good fit for conversational AI and code generation. It strains when the task is inherently multi-step, web-dependent, and variable in duration.
Providers are handling this by building new primitives on top of the existing surface: background mode, polling endpoints, webhook callbacks. This is pragmatic, and it means each provider’s research API currently feels somewhat different from the others. The architectural patterns for building reliably on top of these capabilities are still being worked out in the open.
The shift in mental model that the OpenAI Responses API encodes is worth internalizing. Moving from chat.completions.create to responses.create with background mode is a small change syntactically. Architecturally, you are no longer calling a function that returns a value. You are submitting a job to a system that will work on it and report back when it finishes, whether you learn that by polling or through a callback.
That pattern is well understood in distributed systems. Queues, job IDs, idempotency keys, timeout handling, and dead-letter processing all apply here for the same reasons they apply in any async workflow. The novelty is that the work being queued is reasoning through web search results rather than processing a data pipeline. The engineering requirements are the same either way.