P2 Spec-First Execution

§0b

Opinion

The single biggest source of waste I have watched on engineering teams is starting work without a written AC. Not vague work, not exploratory work: ordinary feature work, where the engineer reads the ticket, has a chat in the hallway, and starts coding from a mental sketch. The mental sketch always misses two things: the edge case the product owner has not articulated, and the dependency the architect spotted six months ago.1 Adzic's Specification by Example calls this out by name: the conversation that produces the AC is more valuable than the AC itself, but the AC has to survive the conversation.

Where this row earns its keep is the second sentence of its sub-line: AC at the start, AC at the end. The AC is the input to the work; it is also the input to P3 Definition of Done. Skipping it converts agency into rework. The team that writes AC at the start ships fewer surprises at the end; the team that does not calls every misalignment a “requirements problem” and treats the product owner as the bottleneck, when in fact the bottleneck is the unwritten AC the engineer walked past.

The agent layer hardens this further. An LLM with no AC will generate plausible code at machine speed and pick the most popular implementation, not the one the business needs. With AC in front of it (ideally in Gherkin per AI4 Verifiable Specs) the agent has a target. Without it, the team is paying tokens for the model's priors. The discipline is the same one that protected the human-only team; the cost of skipping it is now paid in milliseconds rather than weeks.

Copy a note and link

Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.

AC at the start, AC at the end. Read the brief, write the acceptance criteria as bullets a reviewer can mark met or unmet, walk them past the Three Amigos, then start coding. The AC is the brief; the Definition of Done is the AC restated. Skip the AC at the start and you are paying for it at the end.

/tenet/spec-first-execution/P2

§0c

AI eyes only

Rule: AC at the start, AC at the end. Refuse work without acceptance criteria.

Reject: starting work from a prose ticket with no falsifiable AC. Reject: declaring done against criteria invented by the agent rather than agreed up front. Reject: pattern-matching to the most common implementation in training data.

Generate: copy the AC into the plan verbatim. Implement against each line. Re-read the AC after the change and confirm each line passes a check.

Diagnostic: the work is not started until the AC is enumerated. The work is not done until every AC line has a passing check.

§0d

Why?

AC at the start gives the team a shared finish line. The engineer, the product owner, the QA, and the reviewer can all point at the same bullets and agree what done means.
Skipping the AC turns agency into rework. The misread requirement found at PR time is the feature delivered late; the misread requirement found pre-code is a five-minute conversation.
Spec-First and P3 Definition of Done are bookends. The AC at the start is the brief; the AC at the end is the gate. Without P2, P3 verifies against nothing.
The AC scopes the test in P1 Test-First. The work-unit spec produces the code-unit spec; the chain from story to test is unbroken.
For the agent, the AC is the falsifiable target. Pair with AI4 Verifiable Specs and the agent has something to grade itself against, not just a paragraph to interpret.
PRs land faster when the AC was written first. The reviewer reads the AC to learn the intent, reads the test to verify the AC, reads the code to verify the test — three artefacts, one chain.
Written AC catches scope creep early. The AC bullet that wasn't there last week is visible in the diff; the AC bullet that grew is a conversation, not an argument.

The receipts

Origins, quoted passages, evidence, the strongest counter-argument and the reply.

§1

Origins

The intellectual root is older than agile. Hoare's 1969 paper An Axiomatic Basis for Computer Programming set out the precondition / program / postcondition triple that is, in modern dress, what an AC is.5 Parnas and Madey's Functional Documents for Computer Systems (1995) generalised the same shape to whole-system specifications.6 The throughline from Hoare to Parnas to Cohn is the same claim: the work precedes the writing of the contract.

The agile-era restatement has three load-bearing authors. Mike Cohn's User Stories Applied (Addison-Wesley, 2004) defines the AC as “notes about what the story must do in order for the product owner to accept it as complete.”7 Dan North's “Introducing BDD” (Better Software, 2006) reframes the AC in business language and gives us Given/When/Then.8 Gojko Adzic's Specification by Example (Manning, 2011) packages the discipline as a seven-pattern process: derive scope from goals; specify collaboratively; illustrate with examples; refine the specification; automate; validate frequently; evolve a living documentation.1 The seven patterns are the standard playbook the modern Spec-First row points at.

The Three Amigos — product, dev, QA in one room — is Mike Cohn's framing, sharpened by Lisa Crispin and Janet Gregory's Agile Testing.2 The conversation is the load-bearing artefact; the AC captures what the conversation agreed. Liz Keogh's ordering is the published version of the same point: conversations are more important than capturing conversations, which is more important than automating conversations.4

The 2025–2026 spec-driven-development literature is the AI-era extension. Birgitta Böckeler's Spec-Driven Development essay names three levels of practice — spec-first, spec-anchored, spec-as-source — and explicitly positions SDD as continuous with BDD/SBE rather than as a replacement.9 GitHub spec-kit codifies the same loop into four commands: specify, plan, tasks, implement.10 The artefact is the same one Adzic named in 2011; the consumer has changed.

§2

Quotes

Specifying with examples enables teams to build the right software, in the right way, by forcing the conversation that surfaces the assumption.

Gojko Adzic · Specification by Example (2011)

Acceptance criteria are notes about what the story must do in order for the product owner to accept it as complete.

Mike Cohn · User Stories Applied (2004)

I had discovered something exciting. The first lesson was that the language of testing confused programmers; the second was that test names ought to be sentences.

Dan North · Introducing BDD (2006)

Having conversations is more important than capturing conversations is more important than automating conversations.

Liz Keogh · What is BDD? (2015)

§3

Evidence

Twenty external sources, ranked by author authority. The first five are the canon; expand to see the rest, including the qualifiers and the named opposers. Each links out to its primary source.

01
Specification by ExampleSupports
Gojko Adzic · 2011
The single most-cited modern source. Seven-pattern process; specifications produce living documentation; collaboration is load-bearing.
02
Bridging the Communication GapSupports
Gojko Adzic · 2009
Predecessor to SBE. Establishes that AC and acceptance tests are one artefact in two readings; the BDD claim that the test and the spec collapse.
03
“Introducing BDD”Supports
Dan North · 2006
The shift from “tests” to “behaviour” and the canonical Given/When/Then convention. Originates the form most modern AC are written in.
04
User Stories AppliedSupports
Mike Cohn · 2004
Defines AC as “notes the product owner can accept”. The ancestor of every modern card-with-AC convention.
05
Writing Effective Use CasesSupports
Alistair Cockburn · 2001
Pre-Cohn formulation of the same claim. Use case = scenario = AC. The pre-agile origin of structured AC writing.

Twenty sources, three stances. The supporters are Adzic, North, Cohn, Cockburn: the BDD/SBE/use-case canon. The qualifiers further down push the line that conversation matters more than capture, the steelman the case file has to address. The opposers argue that specification is one artefact among many and should not be promoted to a discipline. The honest debate is between the qualifiers and the opposers, not between the supporters and the sceptics.

§4

Examples

Viewing: TypeScript.

Avoid

Fileadmit-volunteer.ts

// Before: no AC. 'Done' is whatever the engineer remembered.function admitVolunteer(volunteer: Volunteer, sanctuary: Sanctuary): AdmitDecision {  if (volunteer.verified) {    sanctuary.volunteers.push(volunteer);    return return { admitted: true };  }  return { admitted: false };}

Prefer

Fileadmit-volunteer.ts

// Given a sanctuary at capacity and a verified volunteer applying// When admitVolunteer runs// Then it returns { admitted: false, reason: "at-capacity" }function admitVolunteer(volunteer: Volunteer, sanctuary: Sanctuary): AdmitDecision {  if (!volunteer.verified) return { admitted: false, reason: "unverified" };  if (sanctuary.volunteers.length >= sanctuary.capacity) return return { admitted: false, reason: "at-capacity" };  sanctuary.volunteers.push(volunteer);  return return { admitted: true };}it("rejects when at capacity", () => {  expect(admitVolunteer(verified, full)).toEqual<AdmitDecision>({ admitted: false, reason: "at-capacity" });});

§4b

Enforcement

Viewing: TypeScript.

Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.

Rule	Tool	Catches
cucumber/async-then	eslint-plugin-cucumber	Then steps that return a promise without awaiting it — the assertion never fires.
cucumber/expression-type	eslint-plugin-cucumber	Step text that doesn’t match the parameter type — silently skipped at runtime.
cucumber/no-restricted-tags	eslint-plugin-cucumber	@wip / @only / @skip tags committed to main — partial scenarios pretending to pass.
cucumber/no-arrow-functions	eslint-plugin-cucumber	arrow-function step definitions — break Cucumber’s World binding; common subtle bug.
cucumber-js --fail-fast	Cucumber.js	long suites where the first scenario failure is the AC under change; surfaces the failure faster.
playwright-bdd --strict	playwright-bdd	scenarios with undefined or pending steps — protects the AC from quietly missing coverage.

eslint.config.mjsconfiguration snippet

import tseslint from 'typescript-eslint';
import cucumber from 'eslint-plugin-cucumber';

export default tseslint.config({
  files: ['features/**/*.feature.ts', 'src/**/*.steps.ts'],
  plugins: { cucumber },
  rules: {
    'cucumber/async-then': 'error',
    'cucumber/expression-type': 'error',
    'cucumber/no-restricted-tags': ['error', { tags: ['@wip', '@only'] }],
    'cucumber/no-arrow-functions': 'error',
  }
});

§4c

AI rules

Paste destination

File.cursor/rules/p2-spec-first.mdc

---
description: Prickles P2 — Spec-First Execution
globs: "**/*.{ts,tsx,js,jsx,py,java,php,feature}"
alwaysApply: false
---

## Prickles P2 — Spec-First Execution

Read the brief twice. Write the AC as bullets a reviewer can verify as met or unmet — not “works correctly” but “rejects an empty cart with 400 and the message cart cannot be empty”.

Walk the AC past the product owner and the QA in a five-minute Three Amigos. Capture what survives the conversation; discard what doesn't.

Refuse to start coding until the AC is written. Refuse to merge until the AC is verified.

Pair Spec-First with AI4 Verifiable Specs: write the AC in a form a machine can grade — Gherkin, type, contract, schema. Aspirational prose loses to the agent at machine speed.

Repo layout, CI, and ESLint wiring for these paths live on /implementation — not repeated on every tenet.

§5

Counter-argument

Counter

The strongest steelman is Liz Keogh's, drawn from a decade of BDD practice: the conversation matters more than the artefact, and rigid AC discipline ends with engineers polishing scenarios while skipping the talk that would have surfaced the real risk.4 Hoare's formal-methods tradition extends the point: a sufficiently rigorous AC becomes a Hoare triple, which is the most precise and least useful artefact in the practitioner's toolkit.5 Ousterhout's reading is the broadest critique: specification is one artefact among many; promoting it to a discipline crowds out the deep-module thinking that actually shapes the design.

§6

Counter-argument retort

Keogh's point is the reply, not a refutation. Conversation matters more than capture. The reason Spec-First as a tenet survives the point is that the alternative she warns about — AC polished in isolation, conversation skipped — is not Spec-First; it is the cargo-cult of Spec-First. The Three Amigos is the conversation; the AC is what survives the conversation; the test is what survives the AC. Each artefact carries the previous step forward in a form the next step can use.

Hoare's formal-methods objection is real for the upper tail of the rigour spectrum: a Hoare triple sufficient for a proof is too heavy for a feature ticket. The reply is to keep the discipline and lower the rigour: Cohn's “notes the product owner can accept” is enough at the AC level, Adzic's “example” is enough for a scenario, and AI4 Verifiable Specs hardens the form so the agent can grade it. Hoare-strength rigour is a tool, not a default.

Ousterhout's broadest critique — that specification is one artefact among many — is true, but it is also the steelman that most often hides the failure mode. Specification is the cheapest place to make a design mistake recoverable; deep-module thinking happens in code, where the cost of the mistake is the implementation, the test, and the migration. Both disciplines belong; ranking specification below module design ignores the lifecycle position where each lives.

In production work, the row that ships the AC at the start is the row whose Definition of Done can mean anything more than “the engineer thinks it's done.” P3 Definition of Done is the back half of this loop; without P2 at the front, P3 has nothing to verify against. The two rows are bookends, not duplicates.

§7

Notes

[1]Gojko Adzic — Specification by Example (Manning, 2011). The seven-pattern process: derive scope from goals; specify collaboratively; illustrate with examples; refine the specification; automate; validate frequently; evolve a living documentation. The single most-cited modern source for the discipline.
[2]Lisa Crispin & Janet Gregory — Agile Testing: A Practical Guide for Testers and Agile Teams (Addison-Wesley, 2009). The Three Amigos framing — product owner, developer, tester in one room — and the canonical Agile Testing Quadrants placement of acceptance tests as Q2 (tests written before coding starts).
[3]Noah Shinn et al. — “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. HumanEval pass@1 lifts from ~80% to 91% when the agent grounds self-feedback in test runs rather than introspection. Generalised by Madaan et al. (Self-Refine, 2023) and Gou et al. (CRITIC, 2024).

Disagree? Found a hole in the argument? Take issue with this tenet →

Last revised: 2026-04-27