Verifiable Specs
If the agent can't grade itself, it isn't a spec.
Spec-Driven Development. A spec the agent can grade itself against is a contract. A spec the agent has to interpret is a wish. The first goes through the loop; the second goes through the model.
Opinion
The Spec-Driven Development tradition is the right home for this row. Andrej Karpathy framed it negatively in his Software 2.0 / 3.0 talks: “language models automate what can be verified.”[1] GitHub's spec-kit is the same instinct made operational: specifications turn from passive documentation into executable contracts that constrain what AI agents generate.[2] The principle is that the spec is the verification artefact, not a hand-wave at a developer's good taste. With a human engineer, ambiguous specs survive because the engineer fills the gaps; with an agent, ambiguous specs collapse because the agent fills the gaps with whatever pattern its training rewards.
The literature converges. Birgitta Böckeler's Spec-Driven Development series for Martin Fowler's site frames the move as “specifications turn from passive documentation into executable contracts.”[3] GitHub's spec-kit ships /speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement as slash commands that walk the four moves.[2] The Reflexion / Self-Refine literature shows agents only improve under verifiable feedback; without an external grader, introspection can degrade accuracy.[4] (Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, reports HumanEval pass@1 of 91% with Reflexion vs 80% for the prior SOTA.) Three different angles, one rule: write the spec the agent can run.
The pillar shape that surrounds this tenet matters. P2 Spec-First Execution says read the brief and write the AC before starting. AI4 says the AC must be written in a form the agent can grade. P2 is when; AI4 is how. A team can pass P2 (the AC exists) and fail AI4 (the AC is a bullet point). The agent will do its best; its best will be plausible and wrong.
Copy a note and link
Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.
If the agent can't grade itself against the spec, the spec is vapour. Acceptance criteria → Gherkin → executable check; or contract test, type signature, schema validator. Pick a machine-readable encoding. Aspirational prose loses to ambiguity at machine speed. /tenet/verifiable-specs/AI4
AI eyes only
Rule: every spec is runnable. Refuse to declare done without a passing check.
Reject: declaring done from inspection. Reject: prose-only specs as the verification. Reject: tests written by the same agent run as proof of correctness without an external grader.
Generate: every plan ends with a named verification command (test, type-check, build, or external evaluator). Every execute ends with that command passing. If the spec is prose, rewrite it as a runnable check before starting work.
Diagnostic: name the exact command that proves the change. If you cannot, the work is not ready to start.
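The rule and diagnostic above can be sketched as a small guard. This is a minimal illustration under stated assumptions: the `Plan` and `CheckResult` shapes and both function names are hypothetical and do not come from spec-kit or any other tool; only the idea (no named command, no start; no passing command, no done) comes from the text.

```typescript
// Hypothetical shapes for illustration only; not part of any real tool.
type VerificationKind = "test" | "type-check" | "build" | "external-evaluator";

interface Plan {
  goal: string;
  // The exact command that proves the change, e.g. "npx vitest run".
  verification?: { kind: VerificationKind; command: string };
}

interface CheckResult {
  command: string;
  exitCode: number;
}

// Diagnostic: a plan that names no verification command is not ready to start.
function readyToStart(plan: Plan): boolean {
  const v = plan.verification;
  return v !== undefined && v.command.trim().length > 0;
}

// Rule: refuse to declare done unless the named command ran and exited with 0.
function canDeclareDone(plan: Plan, result: CheckResult | undefined): boolean {
  const v = plan.verification;
  if (v === undefined || v.command.trim().length === 0 || result === undefined) {
    return false;
  }
  return result.command === v.command && result.exitCode === 0;
}
```

A prose-only plan fails `readyToStart`; a plan whose command ran but exited non-zero fails `canDeclareDone`. Inspection alone never satisfies either check.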
Why?
- Names the lineage. Spec-Driven Development — Karpathy's Software 2.0 / 3.0, GitHub's spec-kit, Böckeler's Fowler-bliki essays — the row sits in the right tradition rather than reinventing the wheel.
- A spec the agent can grade itself against is a spec the agent can be measured against in CI. The verification step in the loop becomes a button press, not a judgement call.
- Stops the agent's signature failure mode. Without a runnable spec the agent invents its own grader; the invented grader is generous; the work passes; the work is wrong.
- Compounds with P2 Spec-First Execution. P2 says read the brief; AI4 says write the brief in a form the agent can grade. Two halves of one discipline.
- The runnable spec encodes the AC once and runs it forever. New contributors and new agents inherit the same grading rubric without negotiation.
- Backed by the strongest peer-reviewed cluster in the AI pillar — Reflexion, Self-Refine, CRITIC, Constitutional AI all show that verifiable feedback is what makes agents improve.
- Makes AI1 The Intern Pattern's execute leg safe to delegate. Without a runnable spec, “execute” is “cross your fingers” with extra steps.
Origins
The principle is older than the agent era. Bertrand Meyer's Eiffel baked Design by Contract into the language in 1986; preconditions, postconditions, and class invariants were the early statement that a spec is something the runtime can grade, and the earliest production embedding of that principle.[5] Cucumber and the BDD movement (2003 onward) pushed the same idea up to the acceptance-criteria layer: a Given/When/Then step is a spec the test runner can execute. Specification by Example (Gojko Adzic, 2011) made the methodology explicit: collaborate on examples, formalise them, automate them, use them as living documentation.[6] It is the methodological backbone for the BDD strand of the lineage.
Andrej Karpathy reframed the principle for the agent era. The Software 2.0 / 3.0 talks ground the modern restatement: “language models automate what can be verified.”[1] Three conditions for an agent-verifiable task: it can be run, it can be reset, and it can be graded automatically. Specs that miss any of the three conditions cannot be delegated; the agent has no way to know whether it succeeded.
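The three conditions can be read as an interface. The sketch below is an assumption-laden illustration, not anything Karpathy published: `VerifiableTask`, `verify`, and `sortTask` are hypothetical names, and sorting stands in for any task that can be run, reset, and graded without a human.

```typescript
// Hypothetical interface: a task is delegable only if all three members exist.
interface VerifiableTask<State, Output> {
  reset(): State;                 // return to a known starting state
  run(state: State): Output;      // execute the work
  grade(output: Output): boolean; // automatic pass/fail, no human judgement
}

// Runs the task from a freshly reset state several times; every attempt must pass.
function verify<S, O>(task: VerifiableTask<S, O>, attempts: number = 3): boolean {
  for (let i = 0; i < attempts; i++) {
    const output = task.run(task.reset());
    if (!task.grade(output)) {
      return false;
    }
  }
  return true;
}

// Example task: sort three numbers. The grader checks length and ascending order.
const sortTask: VerifiableTask<number[], number[]> = {
  reset: () => [3, 1, 2],
  run: (xs) => xs.slice().sort((a, b) => a - b),
  grade: (out) =>
    out.length === 3 && out.every((x, i) => i === 0 || out[i - 1] <= x),
};
```

A spec with no `grade` cannot be expressed in this shape at all, which is the point: the missing member is the missing delegation.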
The 2025 / 2026 wave of Spec-Driven Development tooling instantiates the principle at industrial scale. GitHub's spec-kit ships four slash commands: /speckit.specify for business context and success criteria, /speckit.plan for architectural translation, /speckit.tasks for decomposition into testable units, and /speckit.implement for agent execution under those constraints.[2] Birgitta Böckeler's Spec-Driven Development series on Martin Fowler's bliki surveys the wider tooling landscape (Kiro, spec-kit, Tessl) and synthesises the rule: “specifications turn from passive documentation into executable contracts that constrain what AI agents generate.”[3]
The peer-reviewed cluster underneath the operational rule sits in Reflexion, Self-Refine, and CRITIC. Each shows the same finding from a different angle: agents that grade themselves against an external check produce measurably better output than agents that grade themselves against their own confidence. Huang et al. (2024), the canonical caveat, show that without an external grader, introspection can hurt accuracy.[4] The runnable spec is the external grader. Without it, every other AI tenet on the site weakens by the same coefficient.
Quotes
Language models automate what can be verified.
Specifications turn from passive documentation into executable contracts that constrain what AI agents generate.
/speckit.specify captures business context and success criteria. /speckit.plan translates specs into architectural decisions. /speckit.tasks decomposes plans into testable units. /speckit.implement runs AI agents under those constraints.
Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials.
Evidence
Twenty external sources, ranked by author authority. The first five are the canon; expand to see the rest, including the qualifiers and the named opposers. Each links out to its primary source.
- 01 “Language models automate what can be verified.” The single sharpest framing for the row. The autonomy slider runs as far as the verification reaches; verifiable specs are the lever that opens the slider.
- 02 Slash-command toolkit (/speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement) that operationalises the principle. Industrial-scale evidence the practice is shipping, not theoretical.
- 03 Survey of the SDD tooling landscape. Names the move: specifications turn from passive documentation into executable contracts that constrain what AI agents generate.
- 04 NeurIPS paper. Verifiable feedback is what makes agents improve; HumanEval pass@1 jumps from 80% to 91% when the loop closes against a runnable check.
- 05 Companion paper to Reflexion. The same finding from a slightly different angle: external grading is the load-bearing input for agent improvement.
Sixteen sources. Karpathy's Software 3.0 talk supplies the framing; Böckeler on the Fowler bliki supplies the practitioner's view; GitHub's spec-kit is the operational artefact; Reflexion (Shinn) and Self-Refine (Madaan) supply the academic backing for verifiable feedback. The qualifier further down captures the case where pure introspection without an external grader can hurt; that is the case for the rule, not against it.
Examples
// Before: a wish, not a spec. The agent invents whatever pattern its training rewards.
function rescueHedgehog(hedgehog: Hedgehog): RescueResult {
  // TODO: handle the case where the hedgehog is hibernating
  return { status: "rescued" };
}
// After: Given/When/Then. The spec is runnable; the agent grades its own work.
import { expect, test } from "vitest";

// Assumed minimal shapes for illustration; the original elides these types.
type Hedgehog = { state: string };
type RescueResult = { status: string; reason?: string };

// Given: a hedgehog with state == "hibernating"
// When: rescueHedgehog(hedgehog) is called
// Then: returns { status: "deferred", reason: "hibernating" }
test("defers a hibernating hedgehog", () => {
  const hedgehog: Hedgehog = { state: "hibernating" };
  const result = rescueHedgehog(hedgehog);
  expect(result).toEqual({ status: "deferred", reason: "hibernating" });
});

function rescueHedgehog(hedgehog: Hedgehog): RescueResult {
  if (hedgehog.state === "hibernating") {
    return { status: "deferred", reason: "hibernating" };
  }
  return { status: "rescued" };
}
Enforcement
Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.
| Rule | Tool | Catches |
|---|---|---|
| vitest/expect-expect | @vitest/eslint-plugin | tests with no assertion. The spec is the assertion; a test without one is a spec without a grader. |
| cucumber/async-then | eslint-plugin-cucumber | Then steps that await without resolving — the AC compiles but never grades. Common Cucumber pitfall when AI agents author scenarios. |
| cucumber/no-restricted-tags | eslint-plugin-cucumber | @wip / @skip tags committed to main. The runnable AC stops being runnable the moment a tag opts out. |
| AJV — schema validation | AJV (JSON Schema validator) | fixtures and request/response payloads that drift from the JSON Schema. The schema is a verifiable spec at the wire layer. |
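The first row of the table is easiest to see with a stand-in runner. The sketch below is not Vitest; `runTest` is a hypothetical few-line harness that, like most test runners, reports a pass unless the test body throws. `rescue` is an illustrative function with a deliberate bug.

```typescript
// Minimal stand-in for a test runner: a test passes unless its body throws.
function runTest(body: () => void): "pass" | "fail" {
  try {
    body();
    return "pass";
  } catch (e) {
    return "fail";
  }
}

// Illustrative implementation with a deliberate bug: it ignores hibernation.
function rescue(state: string): { status: string } {
  return { status: "rescued" };
}

// No assertion: nothing in the body can throw, so the runner reports a pass.
const vacuous = runTest(() => {
  rescue("hibernating");
});

// With an assertion, the same bug is caught and the runner reports a failure.
const graded = runTest(() => {
  if (rescue("hibernating").status !== "deferred") {
    throw new Error("expected status to be deferred");
  }
});
```

Here `vacuous` is "pass" and `graded` is "fail" for the same buggy code, which is the gap an expect-expect style lint rule is meant to close.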
eslint.config.mjs configuration snippet
import tseslint from 'typescript-eslint';
import vitest from '@vitest/eslint-plugin';
import cucumber from 'eslint-plugin-cucumber';

export default tseslint.config({
  files: ['**/*.{ts,tsx}', '**/*.feature'],
  plugins: { vitest, cucumber },
  rules: {
    'vitest/expect-expect': 'error',
    'vitest/no-disabled-tests': 'error',
    'vitest/no-focused-tests': 'error',
    'cucumber/async-then': 'error',
    'cucumber/no-restricted-tags': ['error', { tags: ['@wip', '@skip'] }],
  }
});
AI rules
.cursor/rules/ai4-verifiable-specs.mdc
---
description: Prickles AI4 — Verifiable Specs
globs: "**/*.{feature,spec.ts,spec.tsx,test.ts,test.tsx}"
alwaysApply: true
---
## Prickles AI4 — Verifiable Specs
Spec-Driven Development is the lineage. Karpathy's Software 2.0 / 3.0 framing and GitHub's spec-kit converge on one rule: specifications turn from passive documentation into executable contracts that constrain what AI agents generate.
If the agent cannot grade itself against the spec, the spec is vapour. Acceptance criteria the AI can run beat acceptance criteria the AI can interpret.
Pick a machine-readable encoding: Gherkin, contract tests, type signatures, schema validators, executable checklists. Aspirational prose loses to ambiguity at machine speed.
P2 says read the spec; AI4 says write it in a form the agent can grade. The two together make the spec executable, not aspirational.
Repo layout, CI, and ESLint wiring for these paths live on /implementation and are not repeated on every tenet.
Counter-argument
The honest steelman is that not every requirement reduces to a runnable check. Performance intent (“feels fast”), aesthetic criteria (“respects the brand voice”), discovery work whose acceptance is “we now know”, and product instincts that don't survive premature formalisation all resist the verifiable-spec rule. Birgitta Böckeler's own Spec-Driven Development survey notes the tension explicitly: the agent is best at running the verifiable parts; the human is best at the unverifiable judgement.[3] Demanding a runnable check for every spec turns design into compliance and overfits the work to what the test runner can score.
Counter-argument retort
The unverifiable-judgement objection is real and the rule absorbs it without breaking. AI4 asks for a runnable check on the parts that compile, run, and ship. The aesthetic and product-instinct parts go through AI1 The Intern Pattern instead — the human approves the plan, the agent executes, the human reviews. The spec is the contract on the verifiable surface; the human is the contract on the surface that resists formalisation. Both gates close before merge.
The deeper response is that “not everything is verifiable” is usually code for “I haven't found the verification yet.” Performance intent reduces to a latency budget. “Respects the brand voice” reduces to a style guide and a small LLM-as-judge check; Anthropic ship constitutional AI as the framework for exactly this. Discovery work reduces to a written hypothesis and a learning-criterion. The verifiable restatement is harder than the prose, which is why the prose tends to win — but the prose is the version the agent will improvise around, and the verifiable restatement is the version that survives the next session and the next contributor.
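As one illustration of the restatement, “feels fast” can be sketched as a percentile check against a budget. Everything named here is an assumption for the example: the 200 ms figure, the choice of p95, and the helper names are hypothetical, and the samples are assumed to come from whatever measurement harness the team already has.

```typescript
// Hypothetical budget: "feels fast" restated as "p95 latency at or under 200 ms".
// The figure is an illustrative assumption, not a recommendation.
const P95_BUDGET_MS = 200;

// Nearest-rank percentile over a non-empty list of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("no samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// The runnable check: true when the 95th percentile fits the budget.
function meetsLatencyBudget(
  samples: number[],
  budgetMs: number = P95_BUDGET_MS
): boolean {
  return percentile(samples, 95) <= budgetMs;
}
```

The number itself remains a human judgement; the check only makes that judgement gradable on every run.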
For the genuinely irreducible residue — the human judgement that no test can replace — the rule does not vanish; it shifts. The spec records the judgement was made; the reviewer records who made it; the audit trail survives. P2 says read the brief; AI4 says grade what you can; AI1 The Intern Pattern closes the loop on what you can't.
How this differs from Spec-First Execution and Test-First Development.
See also: P2 Spec-First Execution. P2 says read the brief and write the AC before you start. AI4 says: whatever AC you wrote, write it in a form the agent can grade. P2 is when; AI4 is how. A team can pass P2 (the AC exists) and fail AI4 (the AC is a bullet point). The agent will do its best; its best will be plausible and wrong.
See also: P1 Test-First Development. P1 says write the failing test before the code. AI4 says the test must be runnable, resettable, and graded automatically — Karpathy's three conditions for a verifiable task. In the human-only loop, P1 is sufficient because the same human reads and runs the test. In the agent loop, P1 holds only if AI4 holds; an unrunnable test is a passing test for an agent that can't tell the difference.
Notes
- [1] Andrej Karpathy, Sequoia Ascent 2026 / Software 3.0 talks. “Language models automate what can be verified.” The Karpathy framing for the row: the autonomy slider runs as far as the verification reaches and no further.
- [2] GitHub, spec-kit: Toolkit to help you get started with Spec-Driven Development. Slash commands /speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement walk the four moves of Spec-Driven Development end-to-end.
- [3] Birgitta Böckeler / Martin Fowler bliki, Spec-Driven Development: Kiro, spec-kit, and Tessl (2026). The synthesis essay: specifications turn from passive documentation into executable contracts that constrain what AI agents generate.