
What $52 for 76,000 Photos Actually Means for Vision AI

Source: simonwillison

Simon Willison ran a batch job that described 76,000 photos using OpenAI’s new GPT-5.4 mini and GPT-5.4 nano models, and the total bill came to $52. That’s $0.000684 per image, or roughly 0.07 cents a photo. The number itself is striking, but the more interesting question is what falls within economic reach when that’s your unit cost.

The Price Trajectory

To understand where we are, it helps to trace where we’ve been. When GPT-4 Vision launched in late 2023, image inputs were billed at a flat rate that made batch processing prohibitively expensive for most use cases. A high-resolution image could cost several cents just in input tokens because of how OpenAI’s tiling system worked: the model splits images into 512x512 tiles, and each tile consumes a fixed token budget.
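That tiling arithmetic is easy to sketch. The figures below (85 base tokens plus 170 tokens per 512x512 tile) are the publicly documented GPT-4 Turbo high-detail values; the sketch ignores the downscaling step the API applies before tiling, so treat it as illustrative rather than exact:

```python
import math

def image_tokens(width, height, base=85, per_tile=170, tile=512):
    """Approximate GPT-4-era vision token cost: a fixed base charge
    plus a per-tile charge for each 512x512 tile the image spans.
    Illustrative only -- the real API resizes images before tiling."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles

# A 2048x1536 photo spans 4 x 3 = 12 tiles:
print(image_tokens(2048, 1536))  # 85 + 170 * 12 = 2125
```

At GPT-4 Turbo's $10 per million input tokens, those 2,125 tokens alone come to about 2.1 cents, which is where "several cents per image" comes from.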

With GPT-4o in 2024, pricing dropped substantially. At low detail, an image was encoded to approximately 85 tokens, which at GPT-4o mini’s input rate of $0.15 per million tokens translates to roughly $0.000013 per image in input costs alone. But output tokens, which carry the actual description text, were billed at $0.60 per million. A 100-token description adds another $0.00006. Real-world costs for batch image description with GPT-4o mini landed somewhere around $0.0002 to $0.0005 per image depending on output verbosity.
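Making that arithmetic explicit, a minimal per-image cost function using the GPT-4o mini rates quoted above:

```python
def per_image_cost(input_tokens, output_tokens,
                   input_rate=0.15, output_rate=0.60):
    """Dollar cost for one image at per-million-token rates.
    Defaults are the GPT-4o mini rates quoted in the text."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Low-detail image (~85 input tokens) with a 100-token description:
print(f"${per_image_cost(85, 100):.6f}")  # $0.000073
```

The output side dominates: the 100-token description costs more than four times the image encoding itself.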

GPT-5.4 nano appears to push at or below that floor. The $0.000684 figure Willison reported is a blended average: it covers both the mini and nano tiers and includes output as well as input tokens, so it is not directly comparable to the input-only numbers above. Depending on how verbose the descriptions were and how the batch split across the two models, the like-for-like cost with nano alone could represent a severalfold reduction on what was achievable twelve months ago.

The Nano/Mini Split

OpenAI’s model naming has grown more systematic over time. The GPT-5.4 family with explicit mini and nano tiers mirrors what Google has done with Gemini (Flash, Flash-Lite, Nano) and what Anthropic has done with the Haiku/Sonnet/Opus hierarchy. The logic is straightforward: different tasks have different quality requirements, and a single flagship model priced for maximum capability is not the right tool for bulk annotation work.

Nano, in this framing, is a model optimized for throughput and cost above all else. You sacrifice some accuracy and nuance in exchange for dramatically lower latency and token costs. Mini sits one step up: still fast and affordable, but with more capacity for complex reasoning or longer outputs.

For image description tasks, nano is often good enough. Describing what is in a photograph, generating alt text, categorizing product images, or flagging content that requires human review does not require the full depth of a frontier model. The degradation you accept in moving from a flagship model to nano is mostly invisible in these contexts.

What $0.000684 Per Image Unlocks

The useful way to think about these price points is in terms of the threshold they cross. At $0.01 per image, processing a million photos costs $10,000. That’s a project budget decision. At $0.0007 per image, a million photos costs $700. That’s a line item.

A few specific use cases become newly rational at this scale:

Alt text at ingestion time. Any platform that hosts user-uploaded images, from e-commerce to CMS to documentation tools, can afford to generate descriptive alt text for every image as it arrives. At $0.0007 per image and a million uploads per month, the annual cost is $8,400. Most mid-sized platforms spend more than that on a single engineering sprint.

Searchable photo archives. Museums, news organizations, and stock photo agencies hold collections in the millions. Full-text search over image content, powered by AI-generated captions and tags, has been technically possible for years but economically impractical at scale. Processing a 5-million image archive now runs around $3,500.

E-commerce product data enrichment. Generating product descriptions from photos, extracting attributes like color and material, and detecting quality issues in product imagery all become cheap enough to run continuously rather than as one-time batch jobs.

Dataset annotation pipelines. Training computer vision models still requires labeled data. Using a cheap vision model to generate initial annotations that humans then correct is a well-established pattern, but it only makes sense if the AI labels are cheap enough to produce at scale. Nano pricing makes this viable for smaller teams without dedicated labeling budgets.

How Willison Runs These Benchmarks

Willison’s methodology for this kind of batch test is worth understanding. He maintains the llm command-line tool and Python library, which provides a unified interface for calling multiple language model APIs. The tool tracks token usage and costs, which is what makes a clean post-hoc cost summary like “76,000 photos for $52” possible to produce.

The typical pattern is something like:

llm -m gpt-5.4-nano -a photo.jpg "describe this image briefly"

At batch scale, this becomes a Python script iterating over a directory, logging responses and costs to a SQLite database via sqlite-utils, and then querying the database afterward for aggregate statistics. The toolchain is reproducible and the cost accounting is exact, which is why Willison’s benchmarks are trustworthy in a way that vague capability claims from model vendors are not.
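The log-then-aggregate pattern itself is simple to sketch. The schema below is invented for illustration (it is not llm's internal logs.db layout), and the stdlib sqlite3 module stands in for sqlite-utils to keep the example self-contained:

```python
import sqlite3

# Hypothetical log schema -- invented for this example, not llm's own.
db = sqlite3.connect(":memory:")
db.execute("""create table responses (
    photo text, model text, input_tokens int, output_tokens int)""")
db.executemany(
    "insert into responses values (?, ?, ?, ?)",
    [("a.jpg", "gpt-5.4-nano", 85, 40),
     ("b.jpg", "gpt-5.4-nano", 85, 55)],
)

# Aggregate query of the kind that produces "N photos for $X" summaries:
images, tokens = db.execute(
    "select count(*), sum(input_tokens + output_tokens) from responses"
).fetchone()
print(images, tokens)  # 2 265
```

Multiply the token totals by the per-model rates and you have an exact cost figure, with no estimation involved.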

If you want to reproduce this yourself, the llm tool supports the OpenAI vision endpoints out of the box and logs all usage metadata automatically. A minimal batch script looks like:

import llm
import pathlib

# Any vision-capable model ID registered with llm will work here
model = llm.get_model("gpt-5.4-nano")
for path in sorted(pathlib.Path("photos").glob("*.jpg")):
    # Attach the image file to the prompt; llm handles the upload encoding
    response = model.prompt(
        "Describe this image in one sentence.",
        attachments=[llm.Attachment(path=str(path))],
    )
    print(path.name, response.text())

The llm logs command, and the SQLite database it reads from, gives you per-response token usage afterward, which you can aggregate into per-model cost summaries.

The Quality Question

Cost without quality context is incomplete. The question worth asking is: what do you give up with nano compared to a more capable model?

For straightforward image description, the answer is: less than you might expect. Nano models trained on the GPT-5 base perform better at basic visual recognition than GPT-4o did at launch. The capability floor has risen even as the price floor has dropped. Where nano starts to struggle is in tasks that require precise counting, fine-grained spatial reasoning, understanding complex diagrams with text, or producing descriptions that require world-knowledge integration beyond simple visual recognition.

For most bulk annotation work, none of those are the bottleneck. You are generating tags and captions, not writing art criticism. The practical accuracy difference between nano and a flagship model on a dataset of product photos or news images is smaller than the 10x price difference implies.

The right approach is to establish a quality baseline first. Run a few hundred images through both models, evaluate the outputs against whatever quality bar matters for your use case, and then decide whether the cheaper model clears the bar. If it does, the cost savings compound quickly.
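One cheap first-pass screen for that comparison, before any human review, is a word-overlap score between the two models' captions. This is a crude illustrative proxy, not a real evaluation metric; the names and examples below are invented:

```python
def caption_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets -- a rough proxy
    for agreement between two model captions, useful only as a
    first-pass screen before human evaluation."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa or wb):
        return 1.0
    return len(wa & wb) / len(wa | wb)

nano = "a red bicycle leaning against a brick wall"
flagship = "a red bicycle leaning on a brick wall near a door"
print(round(caption_overlap(nano, flagship), 2))  # 0.6
```

Low-overlap pairs are the ones worth routing to a human reviewer; high-overlap pairs suggest nano is telling you roughly the same thing for a tenth of the price.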

The Broader Shift

What GPT-5.4 mini and nano represent is less a breakthrough than a maturation. OpenAI is now running a model family that looks like a real product portfolio: multiple tiers, explicit cost-performance trade-offs, and pricing that maps to actual use cases rather than the prestige of the underlying research.

The competitive pressure here is real. Google’s Gemini Flash series has been aggressive on price and throughput. Anthropic’s Haiku tier covers similar ground. The result is that vision processing costs have been compressed faster than most people expected when GPT-4 Vision shipped two and a half years ago.

For developers building on these APIs, the practical implication is straightforward: budget constraints that ruled out AI vision in a system design six months ago may no longer apply. It is worth revisiting those decisions. The math has changed.
