Behaviour Testing
Test the public, not the private.
A test that asserts on a private helper's argument list is not testing the code. It is pinning the code in place. The test passes when the wrong thing changes and fails when nothing real changes: the worst kind of green and the worst kind of red.
Opinion
I've watched too many teams confuse code coverage with confidence. They mock five collaborators, spy on a private helper, and assert that the helper got called with { id: 1 }. The bar turns green and the team feels safe. Then a junior refactors the helper from fetchUser(id) to fetchUser({ id }), the test goes red, and the team blames the junior. The junior didn't break anything; the test was wrong from the moment it was written. The contract was “the user shows up on the page” and the test asserted “a function got called with a particular shape of argument.”
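The refactor story can be reproduced without a test framework. What follows is a minimal sketch, not anyone's production code: `fetchUserV1`, `fetchUserV2`, `renderGreetingV1`, `renderGreetingV2`, and the hand-rolled `helperCalls` spy are all hypothetical names invented for illustration. It shows the same behaviour before and after the helper's signature changes, alongside a pinned-argument check that only holds before.

```typescript
// Hypothetical names throughout; nothing here comes from a real library.
type User = { id: number; name: string };
const users: User[] = [{ id: 1, name: "Spike" }];

// Hand-rolled spy: records every argument list the private helper receives.
const helperCalls: unknown[][] = [];

// Private helper before the refactor: positional argument.
function fetchUserV1(id: number): User | undefined {
  helperCalls.push([id]);
  return users.find((u) => u.id === id);
}

// Private helper after the refactor: options object, same behaviour.
function fetchUserV2(query: { id: number }): User | undefined {
  helperCalls.push([query]);
  return users.find((u) => u.id === query.id);
}

// Public surface: the only thing a caller can see.
function renderGreetingV1(id: number): string {
  const user = fetchUserV1(id);
  return user ? `Hello, ${user.name}` : "User not found";
}

function renderGreetingV2(id: number): string {
  const user = fetchUserV2({ id });
  return user ? `Hello, ${user.name}` : "User not found";
}

// Behaviour check: the rendered output is identical across the refactor.
const before = renderGreetingV1(1);
const after = renderGreetingV2(1);
const behaviourSurvives = before === after;

// Implementation check: "helper was called with 1" holds only before the refactor.
const pinnedBefore = helperCalls[0]?.[0] === 1;
const pinnedAfter = helperCalls[1]?.[0] === 1;
```

In this sketch the behaviour comparison stays true across the refactor while the pinned-argument check flips, which is the red bar the junior got blamed for.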
Adam's punch (brittle tests are the testing equivalent of premature abstraction) is the sharpest framing this principle has had in print. Sandi Metz's 3×3 matrix named which messages to test (incoming queries, outgoing commands) and which to leave alone (sent-to-self).[1] Kent C. Dodds named the failure mode: implementation details are things users don't see, hear, or know about, and tests that assert on them are tests that lock in decisions the design should be free to walk away from.[2] Both arguments converge on the same answer: the test addresses the unit by what callers can perceive, then asks the public surface a question.
The query-as-user half is the operational version. Testing Library's priority list puts getByRole first because the role tree is what a screen reader, a keyboard user, and a sighted user all see in different forms.[3] Playwright ports the same priority to E2E. Same axiom (test the public) on different surfaces. The unit test asserts on the rendered output; the component test asserts on the role tree; the E2E asserts on the page. None of them ask whether the helper function was called. F1 Single Responsibility is the design-side cousin: a unit that does one thing has one public surface, and one public surface is one test.
Copy a note and link
Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.
Test the public, not the private. Render and query like a user — getByRole, getByLabelText, getByText. Don't reach into private state, don't assert on internal helper calls, don't mock collaborators inside the same module. The test should break when behaviour changes, not when an internal name does. /tenet/behaviour-testing/TS6
AI eyes only
Rule: test behaviour, not implementation. Drive the same surface the user uses.
Reject: tests that pin internal collaborators with vi.mock. Reject: tests that break on a refactor that does not change behaviour. Reject: testing-library/no-node-access violations.
Generate: tests that drive the public surface (function call, screen render, HTTP call) and assert observable outcomes. Use screen.getByRole first.
Diagnostic: refactor the implementation without touching the test. If the test breaks but the behaviour is unchanged, the test is wrong.
Why?
- Tests survive refactoring. Renaming a private helper, splitting a function, swapping the internal data structure — the test still passes because the public surface is unchanged.
- When a test fails, it's a real regression. The signal-to-noise ratio rises sharply once implementation-detail tests are gone; the suite stops crying wolf.
- Painful tests become design feedback. Public-surface tests that strain — e.g., the unit needs five mocks — are signalling an SRP failure per F1 Single Responsibility, not a testing failure.
- Accessibility lands automatically. getByRole traverses the role tree a screen reader uses; if the test passes, the page is at minimum role-traversable.
- Reviews focus on behaviour. The reviewer doesn't need to ask “is this asserting the right private detail?” because the assertion is the public outcome.
- Test names describe products, not code. “Submitting an empty form shows an error” beats “handleSubmit calls validate with the form payload”.
- Coding agents stop writing brittle tests. With the rule in CLAUDE.md and Testing Library lint plugins on, the model writes screen.getByRole first because that's the cheapest path through review.
Origins
The principle has the longest authority lineage in the entire Testing pillar. Sandi Metz's Magic Tricks of Testing talk[1] at RailsConf 2013 introduced the canonical 3×3 matrix (incoming query, incoming command, outgoing query, outgoing command, sent-to-self) and named the rule explicitly: test only what crosses the public boundary; ignore sent-to-self entirely. Steve Freeman and Nat Pryce's Growing Object-Oriented Software, Guided by Tests gave it the most-quoted single-line statement in the literature: “only mock types you own.”[5: Growing Object-Oriented Software, Guided by Tests (Addison-Wesley, 2009).]
Kent Beck's Test Desiderata[8: Test Desiderata (kentbeck.github.io, 2019).] distilled it into the “behavioural” property: tests describe behaviours, not procedures. Michael Feathers's seam concept in Working Effectively with Legacy Code identified the right place for the test boundary before identifying that there should be one.[9: Working Effectively with Legacy Code (Prentice Hall, 2004). The seam concept names the right place to put a mock: at the boundary between code you own and code you don’t.] Google's testing blog ran the consensus piece in 2013, Test Behaviour, Not Implementation,[10: “Testing on the Toilet: Test Behaviour, Not Implementation” (testing.googleblog.com, 2013). The FAANG-scale evidence that internal-mock-heavy tests are fragile by construction.] which turned the rule into folklore inside FAANG-scale codebases.
The component-testing flavour is younger but converged independently. Kent C. Dodds's Testing Implementation Details[2] (2018) became the canonical title-essay for the principle in the React tradition, and Testing Library's priority list[3] made it executable. Marcy Sutton showed why it lands accessibility as a side effect: querying by role traverses the same tree a screen reader walks. The PUP standards/unit-testing.md file owns the on-canon formulation, citing Testing Library's guiding principle verbatim: “the more your tests resemble the way your software is used, the more confidence they can give you.”
Quotes
Implementation details are things which users of your code will not typically use, see, or know about.
Test incoming queries by asserting on the value they return. Test incoming commands by asserting on direct public side-effects. Send-to-self messages: do not test.
Only mock types you own.
The more your tests resemble the way your software is used, the more confidence they can give you.
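The second quote above maps fairly directly onto code. The following is a rough sketch under stated assumptions: the `Reserve` class, its methods, and the tiny `check` helper are hypothetical and exist only to show one incoming query, one incoming command, and one sent-to-self message that no test addresses directly.

```typescript
// Hypothetical Reserve class, used only to illustrate the message matrix.
class Reserve {
  private readonly capacity: number;
  private occupants: string[] = [];

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Incoming query: returns a value and changes nothing.
  freeSlots(): number {
    return this.capacity - this.occupants.length;
  }

  // Incoming command: changes publicly observable state.
  admit(hedgehog: string): void {
    if (this.isFull()) throw new Error("reserve is full");
    this.occupants.push(hedgehog);
  }

  // Sent-to-self: private, so no test calls or spies on it.
  private isFull(): boolean {
    return this.occupants.length >= this.capacity;
  }
}

// Minimal assertion helper so the sketch needs no test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

// Incoming query: assert on the returned value.
const reserve = new Reserve(2);
check(reserve.freeSlots() === 2, "a new reserve reports its full capacity");

// Incoming command: assert on the direct public side effect.
reserve.admit("Spike");
check(reserve.freeSlots() === 1, "admitting a hedgehog uses one slot");

// isFull() is exercised only through the behaviour it produces.
reserve.admit("Quill");
let rejected = false;
try {
  reserve.admit("Bramble");
} catch {
  rejected = true;
}
check(rejected, "a full reserve rejects new arrivals");
```

Renaming or inlining `isFull` would leave every check above untouched, which is the point of leaving sent-to-self messages alone.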
Evidence
Twenty external sources, ranked by author authority. The first five are the canon; expand to see the rest, including the qualifiers and the named opposers. Each links out to its primary source.
- 01 “Testing Implementation Details” (Supports). The canonical title-essay for the principle in the React era. Implementation details are anything users wouldn’t see, hear, or know about; tests should not assert on them.
- 02 The canonical 3×3 matrix: incoming query / incoming command / sent-to-self / outgoing query / outgoing command. Test only what crosses the public boundary; ignore sent-to-self entirely.
- 03 Designing Cost-Effective Tests. The book-length treatment of the matrix, with the boundary-rule and the sent-to-self exclusion expressed as design heuristics.
- 04 The London-school reference. “Only mock types you own”: the most-quoted single line in the boundary-mocking literature.
- 05 Test Desiderata (Supports). The Behavioural property: tests describe behaviours, not procedures. Beck’s mature position is the public-surface position.
Seventeen sources, three lineages. The OOP-design tradition (Metz, Freeman & Pryce, Beck) frames it as a design rule. The component-testing tradition (Dodds at the top of the row) frames it as a query rule. The Google testing-blog tradition further down frames it as a maintenance rule. They agree on the answer.
Examples
// Before: pokes private cache and a private helper.
it("caches the lookup and calls the formatter", () => {
  rescueHedgehog({ name: "Spike" });
  expect(rescueHedgehog.__internal.cache).toEqual({ Spike: "burrow-7" });
  expect(formatHedgehogStatus).toHaveBeenCalledWith("Spike", "burrow-7");
});
// After: calls the public function, asserts the public result.
it("schedules a rescue and returns the assigned reserve slot", () => {
  const result = rescueHedgehog({ name: "Spike" });
  expect(result).toEqual({ status: "scheduled", reserveSlot: "burrow-7" });
});

it("returns no slot when the reserve is at capacity", () => {
  fillReserveToCapacity();
  const result = rescueHedgehog({ name: "Spike" });
  expect(result.status).toBe("waitlisted");
});
Enforcement
Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.
| Rule | Tool | Catches |
|---|---|---|
| testing-library/no-node-access | eslint-plugin-testing-library | .firstChild, .parentNode, .children walks inside tests — DOM traversal masquerading as a query. |
| testing-library/no-container | eslint-plugin-testing-library | container.querySelector — the canonical content-coupling escape hatch the rule rules out. |
| testing-library/prefer-screen-queries | eslint-plugin-testing-library | destructured render() helpers — keeps every query traceable through the screen object. |
| testing-library/no-render-in-lifecycle | eslint-plugin-testing-library | render() called inside beforeEach/beforeAll. A signal that the test is asserting on setup state, not behaviour. |
| jest-dom/prefer-in-document | eslint-plugin-jest-dom | expect(...).toHaveLength(1) and similar — pushes assertions toward the public toBeInTheDocument matcher. |
| jest-dom/prefer-to-have-text-content | eslint-plugin-jest-dom | raw .textContent === checks. The matcher keeps the test focused on what users perceive. |
| vitest/expect-expect | eslint-plugin-vitest | tests with no expect() call — the silent green that is worse than a red one. Catches empty placeholder tests left after a refactor. |
eslint.config.mjs configuration snippet
import tseslint from 'typescript-eslint';
import testingLibrary from 'eslint-plugin-testing-library';
import jestDom from 'eslint-plugin-jest-dom';
import vitest from 'eslint-plugin-vitest';
export default tseslint.config({
  files: ['**/*.spec.{ts,tsx}', '**/*.test.{ts,tsx}'],
  plugins: { 'testing-library': testingLibrary, 'jest-dom': jestDom, vitest },
  rules: {
    'testing-library/no-node-access': 'error',
    'testing-library/no-container': 'error',
    'testing-library/prefer-screen-queries': 'error',
    'testing-library/await-async-events': 'error',
    'testing-library/no-render-in-lifecycle': 'error',
    'testing-library/no-debugging-utils': 'error',
    'jest-dom/prefer-in-document': 'error',
    'jest-dom/prefer-to-have-text-content': 'error',
    'vitest/no-conditional-tests': 'error',
    'vitest/no-conditional-in-test': 'error',
    'vitest/expect-expect': 'error',
  },
});
AI rules
.cursor/rules/ts6-behaviour-testing.mdc
---
description: Prickles TS6 — Behaviour Testing
globs: "**/*.{spec,test}.{ts,tsx,js,jsx}"
alwaysApply: false
---
## Prickles TS6 — Behaviour Testing
Test through the public surface only. The signature, the rendered output, the response payload — anything a caller or user can see. Internal helpers and private collaborators stay invisible.
Render and query like a user. Locate by role, label, or text — the same tree a screen reader walks. Implementation-detail selectors (CSS, data-testid as default, DOM traversal) are anti-patterns.
If the test breaks because an internal name changed, the test was wrong. If it breaks because the user-visible behaviour changed, the test was right.
Don't reach into private state to verify. Drive the public action; assert the public outcome.
Repo layout, CI, and ESLint wiring for these paths live on /implementation, not repeated on every tenet.
Counter-argument
The honest pushback comes from the strict London-school reading of GOOS: every interaction with collaborators should be tested with mocks, including internal ones, because that's how tests drive design.[5: Growing Object-Oriented Software, Guided by Tests (Addison-Wesley, 2009).] If you only test the public surface, the argument runs, you defer the design feedback the test was supposed to give you. Tests-as-design-pressure (Beck's Tidy First?, Feathers's seam concept) extends the point: the moment a private helper becomes painful to test through the public surface is the moment the helper wanted to be its own unit. Refusing to mock it loses the signal.[6: Tidy First? (O’Reilly, 2023). Frames painful tests as design pressure: when a private helper is hard to test through the public surface, tidy first by extracting it.]
Counter-argument retort
The London-school reading is half right. The mature classicist response (Fowler in Mocks Aren't Stubs, Seemann's Mocks for Commands, Stubs for Queries, Khorikov's Unit Testing) conceded the point that mocks are design pressure but moved the line: mock at the boundary, not in the middle.[7: “Mocks for Commands, Stubs for Queries” (blog.ploeh.dk, 2013). The classicist refinement of Metz’s matrix: mock outgoing commands, stub outgoing queries, do nothing about sent-to-self.] A mock at the port is a question about the system's shape; a mock on a private helper is an answer the test agreed to before the design did. The classicist consensus, the Google testing-blog material, and Khorikov's book all converge: only mock unmanaged dependencies that cross a process boundary.
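One way to picture "mock at the boundary, not in the middle" is with hand-written doubles at the ports and nothing at all on the private helper. This is only a sketch: `SlotDirectory`, `Notifier`, `scheduleRescue`, and `formatStatus` are hypothetical names, and a real suite would likely use its framework's mocking tools rather than these hand-rolled objects.

```typescript
// Hypothetical ports and service, for illustration only.
interface SlotDirectory {
  nextFreeSlot(): string | null; // outgoing query: stub it
}

interface Notifier {
  send(recipient: string, message: string): void; // outgoing command: record it
}

type RescueResult =
  | { status: "scheduled"; reserveSlot: string }
  | { status: "waitlisted" };

// Private helper: deliberately invisible to the test.
function formatStatus(hedgehog: string, slot: string): string {
  return `${hedgehog} assigned to ${slot}`;
}

function scheduleRescue(
  hedgehog: string,
  slots: SlotDirectory,
  notifier: Notifier,
): RescueResult {
  const slot = slots.nextFreeSlot();
  if (slot === null) return { status: "waitlisted" };
  notifier.send("warden", formatStatus(hedgehog, slot));
  return { status: "scheduled", reserveSlot: slot };
}

// Test doubles live at the ports only.
const sent: { recipient: string; message: string }[] = [];
const recordingNotifier: Notifier = {
  send: (recipient, message) => {
    sent.push({ recipient, message });
  },
};
const stubWithSlot: SlotDirectory = { nextFreeSlot: () => "burrow-7" };
const stubFull: SlotDirectory = { nextFreeSlot: () => null };

const scheduled = scheduleRescue("Spike", stubWithSlot, recordingNotifier);
const sentAfterScheduling = sent.length;
const waitlisted = scheduleRescue("Quill", stubFull, recordingNotifier);
```

The doubles answer questions about what crosses the boundary (was the warden notified, which slot came back); `formatStatus` can be renamed, inlined, or rewritten without any of them noticing.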
The tests-as-design-pressure argument also lands, but on the design side, not the test side. When a private helper becomes painful to test through the public surface, that is real signal: the helper wants to be a unit. Beck's answer in Tidy First? is to tidy first: extract the unit, give it its own public surface, then test it through that. The design moves; the test follows. The wrong move is to leave the helper private and reach in with a spy. That captures the pain without doing anything about it.[6: Tidy First? (O’Reilly, 2023).]
The query-as-user half is what makes the rule stick in component code. Without it, “test the public” is hand-wavy and you get tests that assert on component.state.formData.email. With it, the locator priority list is the enforcement: getByRole first, then getByLabelText, getByText, and getByTestId as the last resort.[3] Pair the rule with TS5 Structural Assertions for the E2E surface and TS7 Test Isolation for the unit boundary. Three rules, three surfaces, one principle: the test exercises what callers can see and nothing else.
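As a loose illustration of the tidy-first move, suppose a slot-label parser had been buried inside a larger function and was awkward to reach through it. The sketch below assumes hypothetical names (`parseSlotLabel`, `describeSlot`, the `area-index` label format); it shows the helper after extraction, with its own public surface and tests that go through that surface rather than through a spy.

```typescript
// Hypothetical helper, extracted so it has a public surface of its own.
type SlotRef = { area: string; index: number };

function parseSlotLabel(label: string): SlotRef | null {
  // Assumed label format: lowercase area, a hyphen, then digits.
  const match = /^([a-z]+)-(\d+)$/.exec(label);
  if (match === null) return null;
  return { area: match[1], index: Number(match[2]) };
}

// The original caller now uses the extracted unit like any other client.
function describeSlot(label: string): string {
  const slot = parseSlotLabel(label);
  return slot === null ? "unknown slot" : `${slot.area} number ${slot.index}`;
}

// The extracted unit is tested through its own signature.
const parsed = parseSlotLabel("burrow-7");
const unparsed = parseSlotLabel("nonsense");

// The caller is still tested through its public result.
const described = describeSlot("burrow-7");
const describedBad = describeSlot("nonsense");
```

Nothing here reaches into private state: the design changed first, and the tests followed it to the new boundary.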
Notes
- [1]Sandi Metz — “The Magic Tricks of Testing”, RailsConf 2013. The canonical 3×3 matrix: incoming query / incoming command / sent-to-self / outgoing query / outgoing command, with the rule “test only what crosses the public boundary.”
- [2]Kent C. Dodds — “Testing Implementation Details” (kentcdodds.com, 2018). “Implementation details are things which users of your code will not typically use, see, or know about.”
- [3]Testing Library maintainers — “About Queries — locator priority”. The order: getByRole, getByLabelText, getByPlaceholderText, getByText, getByDisplayValue, getByAltText, getByTitle, getByTestId.