· 7 min read ·

Spec-First, Code Later: The Workflow Layer That AI Coding Tools Don't Give You

Source: hackernews

The Problem Hiding in Your AI Coding Session

Every AI coding tool ships with some mechanism for injecting persistent context. Claude Code has CLAUDE.md. Cursor has .cursorrules. GitHub Copilot has .github/copilot-instructions.md. These are good ideas, and they help. But they share a structural limitation: they are documents written by humans for humans, then used by models as best they can.

Get Shit Done (GSD) is a set of Markdown templates and workflow conventions that takes a different approach. Nothing to install, no CLI to learn. It is a discipline layered above whichever AI coding tool you already use, addressing what the project calls the “middle loop”: the space between writing individual lines of code (which AI tools do well) and high-level planning (which teams handle via tickets and specs). The middle loop is where you decompose a feature, decide what actually needs to persist, align the model’s implementation with your actual intent, and recover from sessions where the model silently went in the wrong direction. Most AI-assisted workflows leave this loop to chance.

GSD caught 255 upvotes on Hacker News and 131 comments, which is notable mostly because of what the comment section looked like. Not discovery. Recognition. Senior developers describing variations of all three of GSD’s core patterns that they had independently developed through production failures. That convergence is a stronger signal than any benchmark.

What Models Actually Do With Ambiguity

The core insight behind GSD’s spec-driven approach is specific and worth stating precisely: when a language model encounters ambiguity in a prompt, it resolves that ambiguity silently using the most statistically likely interpretation. It does not ask. It does not flag the gap. It generates code that satisfies what was literally said, which is often not the same as what was meant.

The canonical example in the GSD documentation is a /remind Discord bot command. Here is a version of the spec format GSD recommends:

## Command: /remind
**Purpose**: Schedule a message in the current channel after a specified duration.
**Inputs**:
- `time`: Duration string (e.g., "30m", "2h", "1d")
- `message`: The reminder text
**Constraints**:
- Maximum reminder duration: 7 days
- Maximum active reminders per user: 10
- Times stored in UTC, displayed in user's registered timezone
- All pending reminders must be restored from the database on bot restart
**Error cases**:
- Invalid duration format: reply with examples
- Duration exceeds maximum: reply with the limit
- User at reminder cap: reply with count and list offer

The load-bearing line is “restored from the database on bot restart.” Remove it, and every model tested defaults to in-memory storage using setTimeout. That implementation is correct by the literal text of every other constraint. The reminder schedules, fires, and clears. It just loses all pending reminders every time you deploy. A human developer reading the spec would infer persistence from context. A model generates the simplest implementation that satisfies what was written, and in-memory setTimeout satisfies everything that was written.

This is not a model capability problem. It is a specification problem. GSD’s argument is that a spec written as model input, where the implicit requirements are made explicit and the ruled-out approaches are documented with reasons, makes “simplest correct implementation” coincide with “correct implementation.” Research on SWE-bench is consistent with this: agents working from precise intent specification dramatically outperform those working from implicit context, holding codebase access constant.

The Three Layers, Concretely

Context engineering is the first layer. GSD formalizes a “context anchor”: a Markdown document injected at position zero of every session, the position with highest model attention. Research from Liu et al. (2023) on what they called the “Lost in the Middle” problem established that transformer models reliably underweight content placed in the middle of long contexts. A constraint introduced at message 5 in a long session is significantly underweighted by message 40. The context anchor counteracts this by placing the most critical constraints where they will be attended to.

The key difference from a static CLAUDE.md is that the context anchor captures not just stable project conventions but live session decisions, including ruled-out paths with reasons:

## Ruled out
- node-cron for scheduling: process restarts lose state.
  Using Redis sorted sets with score = fire timestamp instead.
- Storing reminder text encrypted: key management complexity
  not warranted for this use case.

Without the ruled-out section, a model that generated a node-cron solution and was corrected will continue to occasionally propose node-cron variants later in the session. The correction sits in context, but so does the original proposal, and both influence generation. Making the ruled-out list explicit and prominent prevents the model from re-proposing rejected approaches, which is a consistent annoyance in long sessions.

Spec-driven development is the second layer. The spec format is written for model consumption: token-efficient, constraint-complete, error-case explicit. The goal is not documentation for future developers. The goal is removing ambiguity before the model generates any code. GSD treats the spec as analogous to a function interface: it is the contract the implementation must satisfy, and both the human and the model agree to it before implementation begins.

Meta-prompting is the third and most technically interesting layer. Instead of asking the model to implement a feature in a single pass, you ask it to produce a reviewable plan first:

Given this spec, produce an implementation plan covering:
- Sequence of steps in implementation order
- Any ambiguities in the spec that need resolution before coding
- The data model, including what persists across restarts
- Error handling strategy for each error case
- What tests would verify correctness

Do not generate any code yet.

The plan output is a separate artifact. You review it. If the plan describes in-memory storage for a command that requires database persistence, you catch it at the cost of one message exchange rather than a full implementation cycle. If the plan reveals an ambiguity in your spec, you resolve the ambiguity before it propagates into code.

Meta-prompting is also applied to context maintenance. Rather than patching a CLAUDE.md forward as decisions accumulate, you use a meta-prompt to regenerate it from a description of the project’s current state. The result is a coherent document organized for how models process information, not a historical record of how the project evolved.

The Middle Loop Problem

The GSD framing that resonated most in the HN thread is the “middle loop.” AI coding tools have dramatically accelerated the inner loop: write a function, run tests, fix errors, iterate. Some tools have improved parts of the outer loop: ticket writing, planning documents, architectural sketches. The middle loop is the space in between: decomposing a feature into an implementation plan, maintaining coherent context across sessions, aligning model output with original intent across multiple exchanges, systematically catching errors before they propagate.

The middle loop is not glamorous. It is also where most AI-assisted projects quietly accumulate technical debt, because the model’s gap-filling decisions are invisible and unreviewed. GSD makes those decisions visible, reviewable artifacts before any code is written.

The TDD Parallel

GSD makes the comparison to test-driven development explicitly, and it is apt in both the complimentary and the unflattering directions. TDD has decades of evidence that writing the test before the implementation improves design and reduces defect rates. TDD also has consistently low adoption under deadline pressure because it adds upfront friction. Spec-first development with explicit planning passes has the same structure.

The one meaningful structural difference: the feedback loop is same-session. With TDD, the payoff of writing a test first materializes over a sprint or a release cycle, as defects that would have occurred are not observed. With GSD’s planning pass, a well-specified prompt produces visibly better first-pass output before the current session ends. The causal connection between the extra step and the better output is immediate and legible.

Whether that tighter feedback loop is sufficient to change adoption dynamics is an open question. The methodology has no enforcement mechanism. A spec template committed to the repository is not a spec culture on the team. Under deadline pressure, teams skip the spec phase first, which is the phase that generates the most value, which is also the structure that makes TDD hard to sustain.

What It Does Not Replace

GSD is complementary to tools like Aider’s repository map, which uses tree-sitter to auto-generate structural context from your codebase (function signatures, class locations, call graphs). The repository map answers “what does this codebase contain.” The GSD context anchor answers “what decisions have been made and why.” Both belong in the context window; they serve different questions.

It also does not eliminate the fundamental limitation that Andrej Karpathy identified when he coined “context engineering” as a replacement for “prompt engineering”: the context window is finite, the model’s attention within that window is not uniform, and everything you put in has a cost. GSD gives you a principled approach to deciding what goes in and where, but it does not make the problem go away.

Where This Goes

The HN thread’s dominant pattern, experienced practitioners recognizing rather than discovering all three of GSD’s core patterns, suggests these patterns reflect real structural constraints in how current models work rather than one team’s stylistic preferences. Independent convergence across tools, languages, and domains is strong evidence.

The natural endpoint is tool integration. Planning passes as reviewable artifacts, structured spec generation with constraint completeness checking, automated context lifecycle management: these features will eventually be absorbed into the AI coding tools themselves. Claude’s Projects feature is an early step in the direction of persistent context management. Cursor’s rules system is another.

For now, GSD packages the pattern as Markdown files and developer discipline. The Markdown files are the easy part.

Was this interesting?