Scaling AI in the Enterprise: Why Governance Beats Model Choice

OpenAI published a guide called “How enterprises are scaling AI,” framing the journey from one-off experiments to compounding business impact. The thesis is that scale comes from trust, governance, workflow design, and quality, not from picking a bigger model. That tracks with what surveys of enterprise AI adoption have been saying for two years, and it is worth pulling apart because the operational details are where most pilots actually die.

The sobering backdrop: an MIT NANDA study released in 2025 found that roughly 95% of enterprise generative AI pilots delivered no measurable return, despite an estimated 30 to 40 billion dollars in enterprise spend. McKinsey’s 2025 State of AI report puts adoption at 78% of organizations using AI in at least one business function, but only a small fraction report material EBIT impact. The bottleneck is not model capability. The bottleneck is the surrounding system.

The maturity ladder is real, and most teams skip rungs

The pattern OpenAI describes maps closely to what Gartner, Deloitte, and a16z have all converged on independently. There are roughly four stages: ad-hoc experimentation, embedded pilots, integrated workflows, and platformized capability. The mistake teams make is treating these as a checklist rather than a dependency graph.

A company at stage one typically has someone in marketing using ChatGPT to draft copy, someone in engineering pasting stack traces into Claude, and a finance analyst running pivot tables through Code Interpreter. Useful, isolated, ungoverned. The leap to stage two requires picking a workflow with a measurable cost baseline, instrumenting it, and committing to a comparison.

The leap from stage two to stage three is where most pilots fail. A pilot that produces a working demo for a single team is a different artifact from a system embedded in a production workflow with SLAs, audit logs, fallbacks, and a human review path. The Anthropic enterprise guidance on production deployment is explicit on this: production means you have defined evals, you have a rollback plan, and you have monitoring on output quality, not just latency.

Evaluation is the unglamorous part nobody budgets for

If you read the OpenAI piece carefully, “quality at scale” is carrying a lot of weight. In practice, that translates to building an eval harness, and most enterprise teams underestimate the effort involved by an order of magnitude.

A serious eval setup looks something like this:

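import json

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

# Golden set: a list of cases, each with 'id', 'input', and 'expected' keys.
with open('eval_cases.json', encoding='utf-8') as f:
    GOLDEN_SET = json.load(f)

def run_eval(model_id: str, system_prompt: str) -> list[dict]:
    """Run every golden case against one model and prompt; return raw rows for grading."""
    results = []
    for case in GOLDEN_SET:
        resp = client.messages.create(
            model=model_id,
            max_tokens=1024,
            # Mark the static system prompt as cacheable so later cases can reuse it.
            # Prompts below the provider's minimum cacheable length are not cached.
            system=[{
                'type': 'text',
                'text': system_prompt,
                'cache_control': {'type': 'ephemeral'}
            }],
            messages=[{'role': 'user', 'content': case['input']}]
        )
        results.append({
            'id': case['id'],
            'expected': case['expected'],
            # Assumes the first content block is text (no tool use in this harness).
            'actual': resp.content[0].text,
            'tokens_in': resp.usage.input_tokens,
            # Can be None or 0 when nothing was read from the cache.
            'cache_read': resp.usage.cache_read_input_tokens or 0,
        })
    return results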

That is the skeleton. The real work is curating the golden set, defining graders that distinguish acceptable from unacceptable outputs, and rerunning the harness every time a prompt or model changes. OpenAI’s own evals framework and the newer OpenAI Evals API exist for exactly this reason. Anthropic ships a similar evaluation tool inside the Console.
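
To make “defining graders” concrete, here is a minimal sketch of two string-based graders and a scoring function over rows shaped like the harness output above (dicts with id, expected, and actual). The names normalize, exact_match, contains_expected, and grade are illustrative assumptions, not part of any vendor SDK, and real graders are often rubric-based or model-assisted rather than string comparisons.

```python
import re
from typing import Callable

# A grader maps (expected, actual) to pass/fail. These are illustrative only.
Grader = Callable[[str, str], bool]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not fail a case."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_match(expected: str, actual: str) -> bool:
    """Strict grader: the whole answer must match after normalization."""
    return normalize(expected) == normalize(actual)


def contains_expected(expected: str, actual: str) -> bool:
    """Lenient grader: the expected phrase must appear somewhere in the answer."""
    return normalize(expected) in normalize(actual)


def grade(results: list[dict], grader: Grader) -> dict:
    """Score harness rows; report the pass rate and which case ids failed."""
    failed = [r["id"] for r in results if not grader(r["expected"], r["actual"])]
    total = len(results)
    passed = total - len(failed)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failed_ids": failed,
    }
```

Keeping the failing ids in the summary matters more than the headline pass rate: it is what lets a reviewer open the specific cases a prompt change broke.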

Most enterprises I have seen treat evals as a one-time benchmark before launch. The teams that compound impact treat evals as a CI artifact: every prompt change runs the suite, every model rotation runs the suite, regressions block deploy. That discipline is what turns a flaky demo into a system you trust to call customer support APIs.
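
One way to read “regressions block deploy” is as a small gate that compares per-case pass/fail maps from the current baseline and the candidate prompt or model. The sketch below assumes both runs have already been graded into dicts of case id to bool; the function names and the 0.9 threshold are hypothetical choices, not a standard.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    min_pass_rate: float = 0.9,  # hypothetical threshold; tune per workflow
) -> tuple[bool, str]:
    """Return (allowed, reason). Any regression or a low pass rate blocks the deploy."""
    regressions = find_regressions(baseline, candidate)
    pass_rate = sum(candidate.values()) / len(candidate) if candidate else 0.0
    if regressions:
        return False, f"blocked: regressions on {regressions}"
    if pass_rate < min_pass_rate:
        return False, f"blocked: pass rate {pass_rate:.2f} below {min_pass_rate:.2f}"
    return True, f"ok: pass rate {pass_rate:.2f}"
```

In a CI job, the boolean would typically decide the exit code, so a failing gate fails the pipeline the same way a failing unit test does.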

Governance is mostly plumbing

“Governance” sounds like a policy document. In practice it is plumbing: identity, audit logs, data residency, retention, and a kill switch. The NIST AI Risk Management Framework provides the vocabulary, and the EU AI Act provides the deadlines, but the implementation looks like:

Every model call is tagged with a user identity, a workflow ID, and a cost center.
Inputs and outputs are logged to a system the security team can query.
PII is detected and either redacted or routed to a region-locked endpoint.
A feature flag exists that disables the AI path and falls back to the prior workflow.

The last point is underrated. OpenAI’s enterprise compliance docs and Anthropic’s trust center both check the SOC 2 and ISO 27001 boxes, but neither vendor can give you the kill switch. That is your job.
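
The four items above can be sketched as a single wrapper around the model call. This is a minimal illustration under stated assumptions: CallContext, governed_call, and redact_emails are hypothetical names, the model call and fallback are injected as plain functions, the flag is a callable you would back with your own feature-flag system, and the email regex stands in for real PII detection, which needs far more than a pattern match.

```python
import json
import re
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Callable

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class CallContext:
    user_id: str
    workflow_id: str
    cost_center: str


def redact_emails(text: str) -> str:
    """Toy PII step: masks email addresses only."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def governed_call(
    ctx: CallContext,
    prompt: str,
    model_call: Callable[[str], str],   # wraps whichever vendor SDK you use
    fallback: Callable[[str], str],     # the prior, non-AI workflow
    ai_enabled: Callable[[], bool],     # the kill switch (feature flag lookup)
    audit_sink: Callable[[str], None],  # e.g. a writer to a queryable log store
) -> str:
    """Tag, redact, route, and log a single model call."""
    safe_prompt = redact_emails(prompt)
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        **asdict(ctx),
        "input": safe_prompt,
    }
    if ai_enabled():
        record["path"] = "ai"
        output = model_call(safe_prompt)
    else:
        # Flag is off: skip the model entirely and use the old path.
        record["path"] = "fallback"
        output = fallback(prompt)
    record["output"] = output
    audit_sink(json.dumps(record))
    return output
```

The flag is checked on every call rather than at startup, so flipping it takes effect without a redeploy, which is the point of a kill switch.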

Workflow design is the real moat

The most consistent finding from companies that have moved past pilots is that the win came from redesigning the workflow, not from dropping AI into the existing one. Klarna’s customer service automation, reported as handling two-thirds of inbound chats with AI, worked not because GPT-4 was good but because the company rebuilt the support flow around AI-first triage, with humans handling escalations rather than the inverse.

The corollary: if you drop a chatbot in front of an unchanged process, you usually get a 10 to 20 percent productivity bump and call it a win. If you redesign the process assuming AI handles the median case, you can get an order-of-magnitude improvement. The redesign requires the business to commit, not just IT, which is why the case studies that compound impact tend to come from companies where the executive sponsor owns the P&L of the function being changed.

What the source guide leaves out

The OpenAI piece is a good map of the territory, but it leaves three things underexplored.

First, cost engineering. At scale, the difference between a careless and a careful implementation is often 5x to 10x in token spend. Prompt caching on Anthropic and OpenAI’s automatic caching can cut input costs by up to 90% on cached tokens, but only if you structure your prompts to keep the static portion at the front. Batch APIs from both vendors offer 50% discounts for non-realtime workloads. None of this is hard, but none of it happens by accident.
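
A rough way to see how these discounts interact is to price a single request with a large static prefix. The sketch below is back-of-envelope arithmetic, not a billing model: the per-token prices in the usage note are placeholders, the 90% and 50% figures are the discounts quoted above, and it assumes the two discounts compose multiplicatively and ignores any cache-write surcharge, both of which should be checked against current vendor pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float                 # dollars per million uncached input tokens
    output_per_mtok: float                # dollars per million output tokens
    cached_read_discount: float = 0.90    # "up to 90%" on cached input tokens
    batch_discount: float = 0.50          # batch API discount


def request_cost(
    pricing: Pricing,
    static_tokens: int,    # the prompt prefix that stays identical across calls
    dynamic_tokens: int,   # the per-request part of the prompt
    output_tokens: int,
    cache_hit: bool = False,
    batch: bool = False,
) -> float:
    """Estimated dollar cost of one request under the assumptions above."""
    static_rate = pricing.input_per_mtok
    if cache_hit:
        # Only the static prefix can be served from the cache.
        static_rate *= 1 - pricing.cached_read_discount
    cost = (
        static_tokens * static_rate
        + dynamic_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    return cost * (1 - pricing.batch_discount) if batch else cost
```

With placeholder prices of 3 dollars per million input tokens and 15 per million output tokens, a request with an 8,000-token static prefix, 500 dynamic tokens, and 300 output tokens comes to about 0.030 dollars uncached, about 0.0084 with a cache hit, and about 0.0042 if it also runs through a batch API. The saving depends almost entirely on how much of the prompt is static and sits at the front.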

Second, model rotation. Enterprises tend to treat model upgrades like database migrations: rare, scary, and planned quarters in advance. The reality is that new model versions ship every few months, and the teams that win are the ones who can rotate from Sonnet 4.5 to 4.6 in an afternoon because their evals tell them whether the new model is a regression. If your evals take a week to run and a human to interpret, you are stuck on whatever model you launched with.

Third, the build-versus-buy question for orchestration. Frameworks like LangChain, LlamaIndex, and the various agent SDKs are useful for prototyping and dangerous for production: they hide the prompts, abstract the retries, and make eval-driven iteration harder. The teams I trust at scale tend to either own the orchestration layer themselves or use a thin, well-understood SDK on top of the raw model API. The OpenAI piece nods at this with its emphasis on workflow design, but it understates how much of the production complexity lives in the orchestration code, not the model call.
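
As an illustration of what “owning the orchestration layer” can look like at its smallest, here is a sketch of an explicit retry wrapper with exponential backoff. Everything in it is an assumption for illustration: call_with_retries is a hypothetical helper, the model call is injected as a plain function, and the choice of which exceptions are retryable would depend on the SDK and transport you actually use.

```python
import time
from typing import Callable


def call_with_retries(
    call: Callable[[str], str],   # a function that sends the prompt to a model
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,      # seconds; doubles after each failed attempt
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> str:
    """Call the model, retrying only on the listed transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call(prompt)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The value is not the dozen lines themselves but that the prompt passes through unchanged and the retry policy is visible in one place, so an eval run exercises exactly what production sends.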

The unglamorous conclusion

The story of enterprise AI in 2026 is not about smarter models. The frontier model gap between vendors has narrowed to the point where most enterprise workloads can run on any of the top three providers without meaningful capability loss. What separates the 5% of pilots that compound from the 95% that stall is whether the team built the boring infrastructure: the eval harness, the audit log, the cost dashboard, the rollback flag, the redesigned workflow.

That is the actual lesson buried in OpenAI’s guide. The vendor advice is to scale through trust, governance, workflow design, and quality. None of that is a model feature. All of it is engineering and organizational work. The companies that figure this out will look, five years from now, less like AI-native disruptors and more like well-run software shops that happen to have a model in the loop.