Prompts as Source Code: What Thoughtworks' SPDD Actually Changes

Martin Fowler’s site just published a write-up from Wei Zhang and Jessie Jie Xia on something Thoughtworks’ internal IT group has been doing: Structured Prompt-Driven Development, or SPDD. The headline idea is small but consequential. Prompts become first-class artifacts. They live in version control next to the code they produce, they get reviewed, and they evolve through the same discipline you would apply to a test suite or a schema migration.

I want to dig into why this matters at the team level, how it compares to the other spec-first frameworks that have appeared in the last year, and what skills the workflow actually demands of developers.

The team-versus-individual gap

Most of the value developers get from Copilot, Cursor, or Claude Code today is intensely personal. You build a private rhythm: which prompts work, which files to reference, when to trust the suggestion and when to throw it out. None of that transfers when a colleague picks up the work three sprints later. The prompt that produced a tricky bit of code is gone the moment the chat window closes, and the resulting code carries no memory of the intent that shaped it.

That’s the gap SPDD is built around. The Thoughtworks example repo shows the structure: a prompts/ directory checked into git alongside src/, with prompts organized by feature and tied to specific user stories. When the requirements shift, you edit the prompt and regenerate, the way you might edit a build script rather than chase down every downstream file by hand.

This is the same instinct that drove GitHub’s spec-kit earlier this year, and the broader category that Birgitta Böckeler has been calling spec-driven development in her ongoing memos on the topic. The frameworks differ in mechanics but agree on a premise: the natural-language artifact that describes what the system should do is more durable than the code that currently implements it, and treating it as ephemeral is a category error.

What sits in the repo

The part of the SPDD layout I find most interesting is the separation of prompts into three layers:

A context prompt that captures domain language, architectural conventions, and constraints. This is the closest analog to a CLAUDE.md or .cursorrules file, but kept per-module rather than per-repo.
A story prompt that describes a single unit of work in business terms. This is what gets attached to a Jira ticket.
A task prompt derived from the story, refined into something the LLM can act on directly.

The layering matters because LLM-assisted work has a recurring failure mode: a prompt that worked once breaks when context changes around it. By splitting durable context from task-specific instructions, you can update one without invalidating the other. It mirrors the way a twelve-factor app separates configuration from code, or the way good test suites separate fixtures from assertions.
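To make the layering concrete, here is a minimal sketch of composing the three layers into a single prompt. The class, the field names, the section headings, and the sample text are my own illustration; the SPDD article does not prescribe this shape.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromptLayers:
    """The three layers described above; names are illustrative only."""
    context: str  # durable: domain language, conventions, constraints
    story: str    # one unit of work, in business terms
    task: str     # the story refined into direct instructions for the model


def compose(layers: PromptLayers) -> str:
    """Join the layers into one prompt, most durable layer first."""
    sections = [
        ("Context", layers.context),
        ("Story", layers.story),
        ("Task", layers.task),
    ]
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)


layers = PromptLayers(
    context="Orders live in the billing module. Money is stored in integer cents.",
    story="Orders support promotional adjustments at the line-item level, with an audit trail.",
    task="Add an adjustments list to the line-item model and record each change.",
)

# Updating the durable context leaves the story and task text untouched.
updated = replace(layers, context=layers.context + " All writes are audited.")
print(compose(updated))
```

Because the layers are separate values, a change to the context can be reviewed on its own, and every story or task that builds on it picks up the change the next time the prompt is composed.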

The three skills

The SPDD article identifies three skills developers need to make this workflow productive. They look simple on the page; in practice they invert a lot of habits.

Alignment means writing prompts the team agrees on before code gets generated. That sounds like a process complaint, but it’s really about where disagreements surface. In a traditional flow, two engineers can write code that compiles and passes tests and still embody contradictory assumptions about what the feature does. The disagreement only emerges in code review or QA. When the prompt is the artifact, the disagreement surfaces earlier, in language that product and business stakeholders can read. The team has to negotiate intent up front because the prompt is what produces the code, not a sketch of it.

Abstraction-first is the harder skill. Developers are trained to write the most specific code that solves the immediate problem; LLMs reward the opposite. A prompt that says “add a discount field to the order endpoint” produces something brittle. A prompt that says “orders support promotional adjustments at the line-item level, with an audit trail” produces something you can extend. The premium is on naming the right abstraction in the prompt itself, because that’s the seed everything downstream grows from. This is why people who have done years of careful API design tend to outperform LLM-native juniors on this kind of work; the underlying skill is the same.

Iterative review acknowledges that no prompt produces the final answer in one shot. The workflow assumes a loop: generate, read, refine the prompt, regenerate. The temptation, especially for senior engineers, is to skip the refine step and hand-edit the output. SPDD argues against this because hand-edits desynchronize the prompt from the code, and the next person to touch the prompt regenerates over your fixes. The discipline is to push corrections back up into the prompt.

This last point is where I think SPDD will face its hardest cultural battle. Editing the output feels productive in the moment, and the cost of letting the prompt rot is invisible until someone tries to regenerate six months later.
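That invisible cost can at least be made visible. A minimal sketch, assuming a manifest that stores a hash of each prompt and of the code generated from it. The manifest shape, the status labels, and the function names are hypothetical, not something taken from the SPDD article or its example repo.

```python
import hashlib


def digest(text: str) -> str:
    """Stable fingerprint of a prompt or a generated file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_generation(manifest: dict, path: str, prompt: str, code: str) -> None:
    """Remember which prompt produced which output (hypothetical manifest)."""
    manifest[path] = {"prompt_sha256": digest(prompt), "code_sha256": digest(code)}


def drift_status(manifest: dict, path: str, prompt: str, code: str) -> str:
    """Classify how the prompt and the code have moved since generation."""
    entry = manifest[path]
    prompt_changed = digest(prompt) != entry["prompt_sha256"]
    code_changed = digest(code) != entry["code_sha256"]
    if code_changed and not prompt_changed:
        return "hand-edited"          # a fix that never reached the prompt
    if prompt_changed and not code_changed:
        return "needs-regeneration"   # prompt moved on, code did not
    if prompt_changed and code_changed:
        return "both-changed"
    return "in-sync"


manifest = {}
prompt = "Orders support promotional adjustments at the line-item level."
code = "def apply_adjustment(order, adjustment):\n    ...\n"
record_generation(manifest, "src/orders.py", prompt, code)

# Someone patches the generated file without touching the prompt.
status = drift_status(manifest, "src/orders.py", prompt, code + "# quick fix\n")
print(status)
```

A check like this could run in CI and flag the "hand-edited" case at review time, when pushing the correction back into the prompt is still cheap.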

How this compares to the alternatives

A few adjacent approaches are worth naming. Amazon Kiro, released into preview last year, ships a spec-first IDE where requirements, design, and tasks are explicit files. Anthropic’s own guidance on agentic coding leans on CLAUDE.md as a persistent context file but treats per-task prompts as ephemeral. Cursor’s rules system sits somewhere in between, with project-level and global rule files but no formal workflow for promoting a chat session into a checked-in artifact.

SPDD’s distinguishing move is that prompts are not just configuration or context; they are the upstream source of the code. The closest historical analog is something like Cog or literate programming in the Knuth sense, where a higher-level document is the canonical artifact and the executable code is a projection. SPDD reaches the same conclusion from the opposite direction, driven by the practical reality that LLMs make regeneration cheap enough for this to be viable for production code.

What worries me

Three things, none of them deal-breakers.

The first is regeneration determinism. The same prompt against the same model is not guaranteed to produce the same code, and model updates can shift behavior in subtle ways. The SPDD article gestures at this but doesn’t fully grapple with it. A serious implementation needs pinned model versions and some equivalent of dependency lockfiles for the prompts themselves. Otherwise “regenerate from the prompt” becomes a coin flip.
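What a lockfile entry for a prompt might look like is easy to sketch. The following assumes each entry pins the model identifier, the sampling temperature, and a hash of the prompt text. The field names and the model identifiers are made up for illustration, and pinning narrows the variance rather than guaranteeing identical output.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationPin:
    """Inputs that should be frozen for a regeneration (hypothetical fields)."""
    model: str          # exact model identifier, not a floating alias
    temperature: float
    prompt_sha256: str


def pin_for(prompt: str, model: str, temperature: float) -> GenerationPin:
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return GenerationPin(model, temperature, prompt_hash)


def to_lock_entry(pin: GenerationPin) -> str:
    """Serialize with sorted keys so lockfile diffs stay stable."""
    return json.dumps(asdict(pin), sort_keys=True)


def changed_fields(locked: GenerationPin, current: GenerationPin) -> list[str]:
    """Name every pinned input that differs from the locked entry."""
    old, new = asdict(locked), asdict(current)
    return [name for name in old if old[name] != new[name]]


# The model names below are placeholders, not real identifiers.
locked = pin_for("Add an audit trail to adjustments.", "example-model-2025-01-01", 0.0)
current = pin_for("Add an audit trail to adjustments.", "example-model-2025-06-01", 0.0)
print(to_lock_entry(locked))
print(changed_fields(locked, current))
```

With something like this checked in, a model upgrade shows up as an explicit lockfile diff that someone has to approve, not as a silent change in what regeneration produces.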

The second is review fatigue. If every change involves both a prompt diff and a code diff, reviewers are looking at roughly twice the surface area. Teams will need tooling that surfaces what actually changed, not the noise of LLM stylistic drift. There’s a real research problem hiding here.
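The easiest slice of that problem can be sketched today. Assuming the generated code is Python, comparing parsed syntax trees instead of text filters out whitespace and comment churn. The function name is mine, and this is deliberately crude: a renamed variable or a reordered function still counts as a change.

```python
import ast


def same_structure(old_source: str, new_source: str) -> bool:
    """True when two Python sources parse to the same syntax tree.

    ast.dump omits line and column attributes by default, so layout
    and comments do not affect the comparison.
    """
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "def total(xs):\n    return sum(xs)\n"

# Regenerated with different spacing and an added comment: stylistic drift only.
restyled = "def total(xs):  # sum the items\n\n    return sum( xs )\n"

# Regenerated with different behavior: a reviewer needs to see this one.
changed = "def total(xs):\n    return sum(xs) + 1\n"

print(same_structure(old, restyled))
print(same_structure(old, changed))
```

Anything beyond formatting, such as semantically equivalent rewrites, is where the research problem actually starts.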

The third is the skill ceiling. SPDD’s three skills are not evenly distributed across an engineering org. The developers who do this well are the ones who already write good design docs and clean APIs; the workflow makes their advantage more visible without obviously helping the rest of the team catch up. That’s not a critique of SPDD specifically, but it does suggest the productivity story will be uneven.

Why I think this direction sticks

The interesting claim buried in the SPDD writeup is that the prompt is more aligned with business needs than the code is. That’s the part I keep coming back to. For decades we’ve accepted that source code drifts from its original requirements the moment it’s written, and we paper over the gap with comments, tickets, and tribal knowledge. If the prompt is genuinely the artifact under change control, the gap closes by construction. You can argue about whether current LLMs are good enough to make that work in production. The premise itself, that intent should be the version-controlled thing, has been right for a long time. SPDD is one of the first proposals that makes it operationally plausible.

For anyone running a team that has plateaued on individual AI productivity and wants to see what comes next, the SPDD article is a useful read, and the example repo is worth cloning to see how the pieces fit together.