Case file — P1

Test-First Development

The test is the brief. Write it first.

The test is the brief. Write it before the code, watch it fail for the right reason, then make it pass with the smallest change you can. Beck called it red, green, refactor; the broader frame is that the test is the artefact you commit to first because it is the artefact that proves the behaviour ever existed.

By Adam Lewis · Published 3 May 2026 · Reading 12 min · Version v1.0 · Confidence High
§0b

Opinion

I have lost the “tests first or tests after” argument exactly once, and I lost it because I gave up. Every other time the “I'll add the tests later” commit has ended the same way: the tests never appear, the regression ships a quarter later, and the person who would have written them has moved teams. Beck's 2003 book put a name on the cycle and changed who I hired;[1] twenty-three years later, the case is stronger, not weaker, because the same argument now binds the agent in the loop as well as the human.

The bit I want to plant a flag on is that the test is the brief, not a check on the brief. Pair it with T1 Domain-Driven Types and the function signature plus the failing assertion together encode what design notes used to. The common counter is “we don't do TDD because the design isn't clear yet”, and that is exactly when the discipline pays the most. A test that cannot be written cheaply is the design saying the unit is doing too much, which is F1 Single Responsibility arriving in test-shaped clothes.

The agent layer changed the cost calculus. Karpathy's point is that anything resettable, efficient, and rewardable can be optimised by a model;[4] the failing test is the cheapest reward signal a developer ever produced. With tests written first, the agent loop has somewhere to land. Without them, the agent generates plausible code at machine speed and the gap surfaces only when something explodes in production. P1 is the floor under AI4 Verifiable Specs: the AC is the brief at the work scale, the test is the brief at the code scale, and both must be machine-checkable.

[4] Andrej Karpathy, “Verifiability” (karpathy.bearblog.dev, 2025). Software 2.0 automates what you can verify: tasks must be resettable, efficient, and rewardable. The failing test is the cheapest reward signal in the developer’s toolkit; it satisfies all three conditions exactly.

Copy a note and link

Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.

The test is the brief. Write the failing test before the implementation, run it, watch it fail for the reason you predicted, then write the smallest code that makes it green. In typed languages, the type is the first test — define the signature, let the compiler reject the obvious failures, then write the test for the residue.

/tenet/test-first-development/P1
§0c

AI eyes only

Rule: the test is the brief. Write it first; run it failing; then implement.

Reject: writing implementation before a failing test exists. Reject: writing the test after the implementation. Reject: relying on introspection as the design signal.

Generate: write the test that names the requirement, run it, paste the failure into the plan, then implement until the test passes. Each new behaviour earns a new failing test first.

Diagnostic: the test must fail before the implementation exists. If it passes vacuously, the test is wrong; rewrite it before continuing.
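The diagnostic can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `rescueHedgehog` stub, the `RescueResult` and `Sanctuary` types, and the hand-rolled `runCheck` and `expectEqual` helpers, which stand in for a real runner such as Vitest. The first check passes against an empty stub, so it proves nothing; the second names the requirement and fails for the predicted reason.

```typescript
// Hypothetical types for illustration; not taken from the case file.
type RescueResult = { ok: true } | { ok: false; reason: "full" };
type Sanctuary = { capacity: number; intake: string[] };

// Stub only: the real implementation does not exist yet.
function rescueHedgehog(_name: string, _s: Sanctuary): RescueResult {
  return { ok: true };
}

// Tiny stand-in for a test runner: reports whether a check threw.
function runCheck(check: () => void): "pass" | "fail" {
  try {
    check();
    return "pass";
  } catch {
    return "fail";
  }
}

// Structural comparison via JSON, good enough for these small literals.
function expectEqual(actual: unknown, expected: unknown): void {
  const a = JSON.stringify(actual);
  const e = JSON.stringify(expected);
  if (a !== e) throw new Error(`expected ${e}, got ${a}`);
}

const full: Sanctuary = { capacity: 1, intake: ["Spike"] };

// Vacuous: only asserts that a result with an ok flag came back,
// so the stub already satisfies it before any behaviour exists.
const vacuous = runCheck(() => {
  const result = rescueHedgehog("Quill", full);
  if (typeof result.ok !== "boolean") throw new Error("no ok flag");
});

// Honest: names the requirement, so it is red against the stub.
const honest = runCheck(() => {
  expectEqual(rescueHedgehog("Quill", full), { ok: false, reason: "full" });
});
```

Against the stub, `vacuous` comes back as a pass and `honest` as a fail. The pass is the warning sign: a test that is green before the behaviour exists should be rewritten before any implementation work starts.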

§0d

Why?

  • The test is the working specification — readable, runnable, falsifiable. The function signature plus the failing test together carry what the prose docstring used to.
  • Every shipped behaviour leaves a regression test behind it. The bug that ships once does not ship again silently — the suite catches the second occurrence before the diff lands.
  • Writing the test first forces a callable design before any implementation choices have been sunk — a unit you cannot test cheaply is a unit doing too much, and the friction shows up before the cost is locked in.
  • In typed languages the type is the first test; see T1 Domain-Driven Types. A signature that compiles has already ruled out one whole class of failure.
  • Agents iterate on the signal you give them. A failing test is a signal an agent can read, re-run, and grade itself against; a description in chat is not. P1 is what makes AI4 Verifiable Specs operational at the function scale.
  • Tests-first PRs review faster. The reviewer reads the test to learn the intent, reads the code to verify the test, and is done. Tests-after PRs ask the reviewer to reverse-engineer the intent and check the test is honest in the same pass.
  • A green suite is a refactoring licence. Without tests, every cleanup is a leap of faith; with them, the cleanups in P6 Leave it Better stop being leaps and become small, safe steps.
The receipts
Origins, quoted passages, evidence, the strongest counter-argument and the reply.
§1

Origins

The discipline has a date and an author. Kent Beck's Test-Driven Development by Example (Addison-Wesley, 2003) names red, green, refactor and treats the failing test as the working specification.[1] Beck's case rests on a small economic claim: every minute spent writing the test before the code is paid back many times over by the regressions never re-shipped, and the design pressure pushes units toward a size that can be tested at all. The 2003 book framed it as a discipline; Extreme Programming Explained (Beck, 1999) had already placed it inside a wider practice.[6]

[6] Kent Beck, Extreme Programming Explained: Embrace Change (Addison-Wesley, 1999; 2nd ed. with Cynthia Andres, 2004). Places TDD inside the wider XP practice; pair programming as continuous review and the ten-minute build are the surrounding context that made test-first economic for the original XP teams.

The typed-language flavour belongs to Edwin Brady. Type-Driven Development with Idris (Manning, 2017) extends the cycle by adding a “types first” preface: define the type, leave the body as a hole, let the type-checker reject the inconsistencies the test would have caught, then write the test for the residue, then the body.[7] Scott Wlaschin's Domain Modeling Made Functional (Pragmatic Bookshelf, 2018) and the broader F# community make the same case for an industrial typed stack;[8] Yaron Minsky's “Make Illegal States Unrepresentable” pushes the claim into OCaml.[9] The merger of the two traditions is what the new P1 names: in dynamically typed languages the test is the first spec; in typed languages the type is, and the test follows.

[7] Edwin Brady, Type-Driven Development with Idris (Manning, 2017). Extends Beck’s red-green-refactor for dependently-typed languages: define the type first, leave the body as a hole, let the type-checker reject inconsistencies before the test runs. The typed-language flavour the merged P1 names alongside the dynamic-language flavour.

[8] Scott Wlaschin, Domain Modeling Made Functional (Pragmatic Bookshelf, 2018). Industrial F# treatment of type-driven design: model the domain in the type system first, then let the compiler narrow the space the test has to cover. Pairs with T1 Domain-Driven Types in the Prickles canon.

[9] Yaron Minsky, “Effective ML” / “Make Illegal States Unrepresentable” (Jane Street tech talks, 2010 onwards). Pushes the typed-first claim into OCaml: a type that rules out an entire class of failure is one whole branch of the test tree the developer never has to write.
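The types-first preface can be approximated in TypeScript, with the caveat that TypeScript has no typed holes in the Idris sense. The sketch below is illustrative only: the `Admission` union, `assertNever`, and `describeAdmission` are hypothetical names, not part of any cited book. It shows a discriminated union ruling out an illegal state at compile time, with an exhaustiveness check standing in for the first test.

```typescript
// Hypothetical domain model; names are illustrative only.
type Admission =
  | { status: "waiting"; name: string }
  | { status: "admitted"; name: string; pen: number }
  | { status: "released"; name: string; releasedOn: string };

// Exhaustiveness helper: the call below stops compiling
// if a new variant is added and left unhandled.
function assertNever(value: never): never {
  throw new Error(`unhandled variant: ${JSON.stringify(value)}`);
}

function describeAdmission(a: Admission): string {
  switch (a.status) {
    case "waiting":
      return `${a.name} is waiting for a pen`;
    case "admitted":
      return `${a.name} is in pen ${a.pen}`;
    case "released":
      return `${a.name} was released on ${a.releasedOn}`;
    default:
      return assertNever(a);
  }
}

// The compiler rejects the next line: an admitted hedgehog must have a pen.
// const broken: Admission = { status: "admitted", name: "Quill" };

// What the types cannot express (the exact wording) is the residue
// that still needs an ordinary test.
const summary = describeAdmission({ status: "admitted", name: "Quill", pen: 3 });
```

The commented-out `broken` value is the class of failure the type removes; only the behaviour of `describeAdmission` is left for a runtime test to pin down.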

Steve Freeman and Nat Pryce's Growing Object-Oriented Software, Guided by Tests (Addison-Wesley, 2009) is the bridge between the unit-scale and the work-scale.[10] Their outside-in TDD starts with an end-to-end acceptance test (an AC-shaped artefact) and works inward through unit tests until the implementation arrives at the bottom. That is the cleanest in-codebase example of how P1 (test-first at the code scale) and P2 Spec-First Execution (AC-first at the work scale) compose into one discipline.

[10] Steve Freeman & Nat Pryce, Growing Object-Oriented Software, Guided by Tests (Addison-Wesley, 2009). Outside-in TDD: start with an end-to-end acceptance test, work inward through unit tests until the implementation arrives. The cleanest in-codebase example of how P1 (code-scale) and P2 (work-scale) compose into one discipline.

The agent loop is the recent addition. Karpathy's “Verifiability” (2025) names the precondition that makes any of this work for an LLM: the task must be resettable, efficient, and rewardable.[4] A failing test satisfies all three. Anthropic's evaluator-optimizer pattern, the Reflexion paper, Self-Refine, and CRITIC are the published versions of the same observation: agents grounded in machine-checkable feedback improve; agents that grade themselves do not.[2] The 2024 Huang et al. paper is the counter-evidence (without external grounding, asking GPT-4 to self-review decreases accuracy) and is the load-bearing reason the test must be runnable, not aspirational.
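A rough sketch of what grounding in machine-checkable feedback looks like as a loop. Nothing here comes from a real agent framework: `runTest`, `makeProposer`, and `refineUntilGreen` are hypothetical names, and the proposer is a scripted stand-in for a model. The point is only the shape: the test decides when to stop, and its failure message is what gets fed back.

```typescript
// All names are hypothetical; no real agent framework is implied.
type Candidate = (a: number, b: number) => number;
type TestReport = { passed: boolean; failure?: string };

// The runnable check: resettable (pure), efficient (instant), rewardable (pass or fail).
function runTest(candidate: Candidate): TestReport {
  const actual = candidate(2, 3);
  return actual === 5
    ? { passed: true }
    : { passed: false, failure: `add(2, 3) returned ${actual}, expected 5` };
}

// Scripted stand-in for a model: hands out candidates in order.
// A real proposer would read the last failure; this one ignores it.
function makeProposer(candidates: Candidate[]) {
  let next = 0;
  return (_lastFailure: string | undefined): Candidate | undefined => candidates[next++];
}

// Loop until the test is green or the attempt budget runs out.
function refineUntilGreen(
  propose: (lastFailure: string | undefined) => Candidate | undefined,
  maxAttempts: number,
): { attempts: number; passed: boolean; failures: string[] } {
  const failures: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = propose(failures[failures.length - 1]);
    if (candidate === undefined) return { attempts: attempt - 1, passed: false, failures };
    const report = runTest(candidate);
    if (report.passed) return { attempts: attempt, passed: true, failures };
    failures.push(report.failure ?? "unknown failure");
  }
  return { attempts: maxAttempts, passed: false, failures };
}

const outcome = refineUntilGreen(
  makeProposer([(a, b) => a - b, (a, b) => a * b, (a, b) => a + b]),
  5,
);
```

With the three scripted candidates, the loop records two failures and stops on the third attempt. Without `runTest`, the loop would have no stopping condition other than the proposer's own opinion of its work.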

§2

Quotes

Never write a new line of functionality without a failing automated test. Eliminate duplication.

Kent Beck · Test-Driven Development by Example (2003)

We start with a failing acceptance test that describes the feature from the user's point of view, and work our way down through unit tests until we reach an implementation.

Steve Freeman & Nat Pryce · Growing Object-Oriented Software (2009)

Write the type, define the function, refine the type. The type is the first specification and the type-checker is the first test.

Edwin Brady · Type-Driven Development with Idris (2017)

If a task is verifiable — resettable, efficient, rewardable — then it is optimisable, and a neural net can be trained to work extremely well on it.

Andrej Karpathy · Verifiability (2025)
§3

Evidence

Twenty external sources, ranked by author authority. The first five, listed here, are the canon; the rest include the qualifiers and the named opposers.

  1. Kent Beck · 2003
    The book that named red-green-refactor and put a working example next to every claim. The case for the failing test as the working specification is built page by page; Appendix B is the carve-out list (spikes, throwaway demos, declarative config, generated code).
  2. Kent Beck · 1999, 2004
    TDD lives inside XP. Pair programming, ten-minute build, continuous integration, and small releases are the surrounding practices that make test-first economic for the original cohort.
  3. David Astels · 2003
    The early companion volume to Beck. Walks through TDD in Java with a level of operational detail Beck’s book deliberately leaves out; the canonical second source for the discipline.
  4. Steve Freeman & Nat Pryce · 2009
    Outside-in TDD: start with the failing acceptance test, work inward through unit tests, end at the implementation. The bridge book between code-scale TDD and work-scale Spec-First.
  5. Edwin Brady · 2017
    Beck for typed languages. Define the type first, leave the body as a hole, let the type-checker reject inconsistencies before any test runs. The typed-language flavour the merged P1 carries.

Twenty sources, three stances. The supporters are Beck, Astels, Freeman & Pryce, Brady: the canon, both for dynamic and typed languages. The qualifiers further down carry the “TDD is a discipline, not a religion” line. The opposers push back on the brittleness of the suite as a design medium: the steelman the reply has to address.

§4

Examples

Viewing: TypeScript.
Avoid
File: rescue-hedgehog.ts
// Before: implementation first; the test confirms what already shipped.
function rescueHedgehog(hedgehog: Hedgehog, sanctuary: Sanctuary): RescueResult {
  if (sanctuary.intake.length >= sanctuary.capacity) {
    return { ok: false, reason: "full" };
  }
  sanctuary.intake.push(hedgehog);
  return { ok: true };
}

// Written afterwards: covers only the happy path the code already takes.
it("rescues a hedgehog", () => {
  expect(rescueHedgehog(hg, sanctuary).ok).toBe(true);
});
Prefer
File: rescue-hedgehog.spec.ts
// After: failing test first; the smallest code that turns it green.
it("rejects rescue when the sanctuary is at capacity", () => {
  const full: Sanctuary = { capacity: 2, intake: [hg1, hg2] };
  const result = rescueHedgehog(arrival, full);
  expect(result).toEqual({ ok: false, reason: "full" });
});

// Written second: just enough to satisfy the test above.
function rescueHedgehog(arrival: Hedgehog, s: Sanctuary): RescueResult {
  if (s.intake.length >= s.capacity) return { ok: false, reason: "full" };
  s.intake.push(arrival);
  return { ok: true };
}
§4b

Enforcement

Viewing: TypeScript.

Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.

| Rule | Tool | Catches |
| --- | --- | --- |
| vitest/expect-expect | @vitest/eslint-plugin | tests with no assertion; they pass silently because nothing is checked. |
| vitest/no-disabled-tests | @vitest/eslint-plugin | tests left as it.skip / xit on the main branch. |
| vitest/no-focused-tests | @vitest/eslint-plugin | it.only / fit committed accidentally; silently shrinks the suite. |
| vitest/no-conditional-tests | @vitest/eslint-plugin | tests inside a runtime conditional; the assertion may never run depending on environment. |
| vitest/require-top-level-describe | @vitest/eslint-plugin | loose tests with no describe block; makes the spec output unreadable and breaks behaviour-name conventions. |
| vitest/valid-title | @vitest/eslint-plugin | test titles that are not strings, are empty, or repeat the block name as a prefix. |
| tsc --noEmit | typescript | the type-as-first-test phase. Run on save and pre-push; with strict on, a compile failure is a failed test. |
eslint.config.mjs configuration snippet
import tseslint from 'typescript-eslint';
import vitest from '@vitest/eslint-plugin';

export default tseslint.config({
  files: ['**/*.{ts,tsx}'],
  plugins: { vitest },
  rules: {
    'vitest/expect-expect': 'error',
    'vitest/no-disabled-tests': 'error',
    'vitest/no-focused-tests': 'error',
    'vitest/no-conditional-tests': 'error',
    'vitest/require-top-level-describe': 'error',
    'vitest/valid-title': 'error',
  }
});
§4c

AI rules

File: .cursor/rules/p1-test-first.mdc
---
description: Prickles P1 — Test-First Development
globs: "**/*.{ts,tsx,js,jsx,py,java,php}"
alwaysApply: false
---

## Prickles P1 — Test-First Development

Write the failing test before the implementation. Watch it fail for the reason you predicted; then write the smallest code that makes it green.

In typed languages, the type is the first test. Define the signature, leave the body as a hole, let the compiler reject the inconsistencies the unit test would have caught, then write the test for the residue.

Test the public contract, not the internals. The unit's contract is the test's contract; mock cascades through private collaborators are the smell, not the rule.

Beck's exceptions still apply: spikes, throwaway demos, declarative config, and generated code are tested via the surrounding code, not as units in their own right.

Repo layout, CI, and ESLint wiring for these paths live on /implementation — not repeated on every tenet.

§5

Counter-argument

Counter

The honest steelman is Coplien's, sharpened by DHH: the test suite written first becomes a cage. The team optimises for the granular, the unit-shaped, the easy-to-mock; the suite calcifies around an early architecture and refactoring becomes a rewrite of the tests rather than the code.[3] Hillel Wayne extends the point: not all behaviour reduces to a runnable assertion;[5] some properties (concurrency, distributed timing, performance under load) are cheaper to model-check than to test. James Coplien's “Why Most Unit Testing Is Waste” is the spiciest version of the argument and has not been fairly answered by anyone who skips its central claim: that high test mass on low-value internals is real cost, not virtue.

[5] Hillel Wayne, “We Need to Talk About Testing” and the Why Don’t People Use Formal Methods? series (hillelwayne.com, 2018–2024). Argues that not all properties reduce to a runnable assertion: concurrency, distributed timing, and load-shaped behaviour are cheaper to model-check than to test. The complement to TDD, not a replacement.

§6

Counter-argument retort

Reply

Coplien's point lands, but on the wrong target. The brittleness he describes is real, and it is the brittleness of too many low-value unit tests on internals, not of writing tests first. The fix is not to abandon test-first; it is to write the test against the public surface that F6 Encapsulation tells you to expose, and to let coupling pull the rest along. Khorikov's Unit Testing Principles (2020) makes the case in detail: the unit's contract is the test's contract;[11] internal-mock cascades are the smell, not the rule.

[11] Vladimir Khorikov, Unit Testing Principles, Practices, and Patterns (Manning, 2020). The unit’s contract is the test’s contract; internal-mock cascades are the smell, not the rule. The book-length reply to Coplien’s critique that doesn’t require abandoning TDD.
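One way to picture "test the public surface" is a check that survives a change of internals untouched. The sketch below is an assumption-laden illustration: the `Sanctuary` interface, both classes, and `contractHolds` are hypothetical, and the boolean-returning check stands in for a real test case.

```typescript
// Hypothetical contract and implementations, for illustration only.
interface Sanctuary {
  admit(name: string): boolean; // returns false when at capacity
  occupancy(): number;
}

// First implementation: array-backed.
class ArraySanctuary implements Sanctuary {
  private readonly intake: string[] = [];
  private readonly capacity: number;
  constructor(capacity: number) {
    this.capacity = capacity;
  }
  admit(name: string): boolean {
    if (this.intake.length >= this.capacity) return false;
    this.intake.push(name);
    return true;
  }
  occupancy(): number {
    return this.intake.length;
  }
}

// After a refactor: set-backed internals, same public contract.
class SetSanctuary implements Sanctuary {
  private readonly intake = new Set<string>();
  private readonly capacity: number;
  constructor(capacity: number) {
    this.capacity = capacity;
  }
  admit(name: string): boolean {
    if (this.intake.size >= this.capacity) return false;
    this.intake.add(name);
    return true;
  }
  occupancy(): number {
    return this.intake.size;
  }
}

// The check touches only admit() and occupancy(), never the private fields.
function contractHolds(make: (capacity: number) => Sanctuary): boolean {
  const s = make(1);
  const first = s.admit("Quill");
  const second = s.admit("Spike");
  return first && !second && s.occupancy() === 1;
}

const arrayOk = contractHolds((c) => new ArraySanctuary(c));
const setOk = contractHolds((c) => new SetSanctuary(c));
```

The same check passes for both classes, which is the property Coplien's cage lacks: a test that asserted on the private `intake` array would have had to be rewritten alongside the refactor.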

Hillel Wayne's extension is the genuine residue. Not everything reduces to a runnable assertion: concurrency, distributed timing, and load-shaped behaviour are cheaper to model-check than to test, which is why TLA+ exists.[5] The reply is to keep both in the toolkit and not pretend tests cover everything. Property-based testing covers the algebraic residue; model-checking covers the temporal residue; the test-first discipline still applies to the bulk of behaviour-shaped code where it lives.
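For the algebraic residue, a property check states a rule over many generated inputs instead of one hand-picked case. The sketch below is hand-rolled so that it stays self-contained; a real project would more likely reach for a library such as fast-check. The `rescue` function, the `Sanctuary` type, and the seeded generator are all hypothetical stand-ins.

```typescript
// Hand-rolled sketch; names and generator are illustrative assumptions.
type Sanctuary = { capacity: number; intake: string[] };

function rescue(name: string, s: Sanctuary): boolean {
  if (s.intake.length >= s.capacity) return false;
  s.intake.push(name);
  return true;
}

// Small deterministic generator (a linear congruential generator)
// so that every run explores the same inputs.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 0x100000000; // value in [0, 1)
  };
}

// Property: for any capacity and any number of arrivals,
// accepted === min(capacity, arrivals) and occupancy never exceeds capacity.
// Returns a description of the first counterexample, or null if none is found.
function checkCapacityProperty(runs: number, seed: number): string | null {
  const random = makeRandom(seed);
  for (let run = 0; run < runs; run++) {
    const capacity = Math.floor(random() * 10);
    const arrivals = Math.floor(random() * 20);
    const s: Sanctuary = { capacity, intake: [] };
    let accepted = 0;
    for (let i = 0; i < arrivals; i++) {
      if (rescue(`hedgehog-${i}`, s)) accepted++;
      if (s.intake.length > capacity) return `overfilled at capacity=${capacity}`;
    }
    if (accepted !== Math.min(capacity, arrivals)) {
      return `capacity=${capacity}, arrivals=${arrivals}, accepted=${accepted}`;
    }
  }
  return null;
}

const counterexample = checkCapacityProperty(200, 42);
```

The property is still written before the implementation; it simply replaces a single example with a rule. It says nothing about interleavings or timing, which is the part the paragraph above hands to model-checking.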

DHH's critique, that test-driven design is a religion that produces over-extracted hexagons, lands when the practitioner mistakes the discipline for an architectural style. Beck's own framing in Tidy First? (2024) is the calm reply:[12] tidyings and behaviour changes are different commits; the test drives the behaviour change, not the tidying. The hexagon is one option among many; the red-green-refactor cycle is agnostic to the architecture it operates inside.

[12] Kent Beck, Tidy First? A Personal Exercise in Empirical Software Design (O’Reilly, 2024). Beck’s reply to the over-extraction critique: tidyings and behaviour changes belong in different commits. The test drives the behaviour change, not the tidying; the architectural caging only happens when the practitioner conflates them.

The discipline survives the agent. In fact the agent loop hardens it: an LLM with a failing test in front of it is grounded; an LLM without one is generating plausible prose at machine speed. The literature on this is recent and consistent: Reflexion, Self-Refine, CRITIC, and the evaluator-optimizer pattern all show the same shape.[2] The exceptions Beck himself names (spikes, throwaway demos, declarative config, generated code) remain exceptions; they are footnotes on the rule, not replacements for it.

§7

Notes

  [1] Kent Beck, Test-Driven Development by Example (Addison-Wesley, 2003). The canonical statement of red-green-refactor and the case for the failing test as a working specification. Appendix B explicitly frames TDD as “a discipline I follow when I think it pays off”, not a universal law; the carve-outs in this case file are Beck’s.
  [2] Noah Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning”, NeurIPS 2023. HumanEval pass@1 lifts from ~80% to 91% when the agent grounds self-feedback in test runs rather than pure introspection: empirical evidence that machine-checkable rewards drive agent improvement. Generalised by Madaan et al. (Self-Refine, 2023) and Gou et al. (CRITIC, 2024).
  [3] James Coplien, “Why Most Unit Testing Is Waste” (RBCS, 2014). The sharpest published counter to TDD: high test mass on low-value internals is real cost. The reply is not that Coplien is wrong about the failure mode but that the fix is to test the public contract, not to abandon the discipline.
Last revised: 2026-04-27