Case file — P7

Short-Lived Branches with AI+Human Review

Branch for a day, merge to main, review before it lands.

The empirical hazard the DORA research has measured for a decade is parallel divergence, not the branch itself. Pair short-lived branches with AI-first-pass review and mandatory human approval; AI4 Verifiable Specs is one of the things human reviewers specifically check.

By Adam Lewis · Published 3 May 2026 · Reading 14 min · Version v1.0 · Confidence High
§0b

Opinion

The current canonical row reads as “no long-lived branches” and gets misread as “no branches.” That isn't what the literature says. Forsgren et al. measured branch lifetime, not branch absence;[1] Hammant's “Scaled TBD” explicitly permits PR branches under two days;[2] Driessen's 2020 update to GitFlow recommends GitHub Flow for any continuous-delivery team. The empirical hazard is parallel divergence; a four-hour branch with a PR review has all of the integration cadence of pure TBD plus the structured review surface, and zero of the merge-hell tax.

The bit I want to plant a flag on is the AI+human pairing. The 2024–2026 generation of AI reviewers has shifted the cost-benefit calculation Fowler weighed for pre-integration review.[3] CodeRabbit has reviewed 13 million PRs across two million repos; Greptile benchmarks at 82% bug-catch against CodeRabbit's 44% and Copilot's 54%; Cursor's Bugbot reviews two million PRs per month with 76% bug-resolution. The pattern is the same across all five tools: the AI catches the high-volume class of issues (boundaries, validation, dead code, naming, test gaps); humans retain authority on architectural, business-logic, and convention questions the model can't ground. Net effect: the per-PR cost of “good” review is falling, which strengthens (not weakens) the case for the PR-into-main flow.

What human reviewers specifically check (the part that does not reduce to the AI review) is whether AI4 Verifiable Specs was satisfied. The agent will write passing tests against an AC it interpreted loosely; the human reviewer is the one who reads the AC against the test against the code. Bacchelli & Bird's 2013 finding that review's primary benefit is knowledge transfer holds in the AI loop too, but it is sharpened: the human is the verifier of the spec-test-code chain the agent walked alone.[4: Bacchelli & Bird, “Expectations, Outcomes, and Challenges of Modern Code Review” (Microsoft Research, ICSE 2013). 873 programmers, 165 managers, 17 deep interviews, 570 review comments. Review's primary delivered benefit is knowledge transfer; reviewer effectiveness collapses past ~200–400 LOC.]

Branch for a day, merge to main, review before it lands. Long-lived branches are debt; parallel divergence is the hazard, not branch existence. AI does the first review pass; humans approve.

/tenet/short-branches/P7
§0c

AI eyes only

Rule: branch for a day, merge to main, review before it lands. The author and reviewer are not the same agent.

Reject: branches older than two days. Reject: self-approving a PR. Reject: the same agent session writing and reviewing the change.

Generate: short branches with one behaviour change. Open the PR. A separate AI reviewer (or session) reads it and surfaces issues. A human approves before merge.

Diagnostic: count the days the branch has been open and the distinct reviewers involved. Reject if the branch has been open more than two days, or if fewer than two distinct reviewers (at least one of them human) have read the change.
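
The diagnostic can be sketched as a small check. This is a minimal illustration, not a real API: the `PullRequestSnapshot` shape, the field names and the `diagnose` function are all hypothetical, and "one human" is read here as "at least one approving human reviewer".

```typescript
// Hypothetical PR snapshot; field names are illustrative assumptions.
type ReviewerKind = "human" | "ai";

interface Review {
  reviewerId: string;
  kind: ReviewerKind;
  approved: boolean;
}

interface PullRequestSnapshot {
  authorId: string;
  branchCreatedAt: Date;
  reviews: Review[];
}

const MAX_BRANCH_AGE_DAYS = 2;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the reasons to reject; an empty array means the PR passes.
function diagnose(pr: PullRequestSnapshot, now: Date): string[] {
  const reasons: string[] = [];

  const ageDays = (now.getTime() - pr.branchCreatedAt.getTime()) / MS_PER_DAY;
  if (ageDays > MAX_BRANCH_AGE_DAYS) {
    reasons.push(`branch open ${ageDays.toFixed(1)} days (limit ${MAX_BRANCH_AGE_DAYS})`);
  }

  // Reviews left by the author never count: author and reviewer must differ.
  const independent = pr.reviews.filter((r) => r.reviewerId !== pr.authorId);
  if (independent.length < pr.reviews.length) {
    reasons.push("author reviewed their own change");
  }

  const distinctReviewers = new Set(independent.map((r) => r.reviewerId));
  if (distinctReviewers.size < 2) {
    reasons.push(`only ${distinctReviewers.size} distinct reviewer(s); need 2`);
  }

  const humanApproved = independent.some((r) => r.kind === "human" && r.approved);
  if (!humanApproved) {
    reasons.push("no human approval");
  }

  return reasons;
}
```

Returning every reason rather than stopping at the first keeps the output usable as a PR comment: the author sees the whole list in one pass.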

§0d

Why?

  • DORA / Accelerate measured branch lifetime as a predictor of delivery performance for a decade. Elite performers run with three or fewer active branches and merge daily.
  • Short branches keep PRs small. Reviewer effectiveness collapses past roughly 200–400 lines (Bacchelli & Bird); small PRs review in five minutes and merge cleanly.
  • AI reviewers carry the high-volume class of issues. Copilot, Claude Code Review, Bugbot, CodeRabbit, Greptile all ship with mandatory human approval; the cost of “good” review is falling.
  • Human reviewers verify the AC chain. The AI catches the obvious; the human checks that AI4 Verifiable Specs was satisfied — the spec was machine-checkable, the test verified it, the code passed both.
  • Code review's measured second-largest output is knowledge spread across the team. Authors learn from feedback; reviewers learn the codebase by reading it.
  • Branch lifetime under a day means no merge-hell tax. The conflicts that pile up across long-lived branches don't accumulate.
  • main stays shippable. The team that pairs short branches with PR review can cut a release from main at any time; the team that doesn't cuts releases from a release branch and discovers the gap at deploy time.
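
The three Accelerate predictors in the first bullet are concrete enough to check mechanically. A rough sketch, assuming a hypothetical `RepoSnapshot` shape and approximating "merged at least daily" as "no unmerged branch older than 24 hours":

```typescript
// Hypothetical snapshot of a repository's branching state.
interface RepoSnapshot {
  // Age in hours of each unmerged branch (one entry per branch).
  activeBranchAgesHours: number[];
  // True if the team currently runs a code freeze or integration phase.
  hasCodeFreeze: boolean;
}

interface PredictorReport {
  fewActiveBranches: boolean;
  mergedAtLeastDaily: boolean;
  noIntegrationPhase: boolean;
}

function checkPredictors(repo: RepoSnapshot): PredictorReport {
  return {
    // Three or fewer active branches in the repo.
    fewActiveBranches: repo.activeBranchAgesHours.length <= 3,
    // Assumption: a branch older than 24h has missed its daily merge.
    mergedAtLeastDaily: repo.activeBranchAgesHours.every((hours) => hours <= 24),
    // No integration phases or code freezes.
    noIntegrationPhase: !repo.hasCodeFreeze,
  };
}
```

A report of three booleans, rather than a single pass/fail, shows which habit slipped.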
The receipts
Origins, quoted passages, evidence, the strongest counter-argument and the reply.
§1

Origins

The intellectual root is empirical. The DORA / Forsgren research programme has measured branch lifetime as a predictor of delivery performance since 2014. Accelerate (IT Revolution, 2018) names three concrete predictors at the team scale: ≤3 active branches in the repo, branches merged to trunk at least daily, no integration phases or code freezes.[1] Elite performers who hit reliability targets are 2.3× more likely to use TBD; low performers correlate with long-lived branches.

Paul Hammant's trunkbaseddevelopment.com is the canonical site for the branching literature. Hammant's “Scaled TBD” explicitly permits PR branches under two days for teams larger than ~15 engineers;[2] his stricter pure-TBD reading is realistic only for very small, very high-trust teams. Vincent Driessen's 2020 update to his own GitFlow article makes the same point from the opposite direction: for continuous-delivery teams, GitFlow is too heavy; GitHub Flow is the fit.[7: Vincent Driessen, “A successful Git branching model” (2010, with 2020 addendum, nvie.com): “If your team is doing continuous delivery, adopt a much simpler workflow (like GitHub flow) instead.”] Martin Fowler's “Patterns for Managing Source Code Branches” (2020) frames the trade-off as integration frequency and integration friction; high-trust teams can drop pre-integration review entirely.[8: Martin Fowler, “Patterns for Managing Source Code Branches” (martinfowler.com, 2020). High-trust teams can drop pre-integration review entirely, but PR-into-main is the right default for most teams.]

The code-review tradition is the team-cognitive half. Bacchelli and Bird's “Expectations, Outcomes, and Challenges of Modern Code Review” (Microsoft Research, ICSE 2013) is the canonical source.[4] The study drew on 873 programmers, 165 managers, 17 deep interviews and 570 manually classified review comments. Headline finding: developers expect review to be primarily about defect-finding, but in practice review is more about knowledge transfer, alternative solutions, and team awareness. Sadowski et al.'s Google study (ICSE-SEIP 2018) replicates the finding at the nine-million-review scale.[9: Sadowski et al., “Modern Code Review: A Case Study at Google” (ICSE-SEIP 2018). 9M reviewed changes, ~25k developers; 70% of changes commit within 24 hours of being mailed for review; education is named as a first-class motivation alongside maintaining norms, gatekeeping, and accident prevention.]

The AI-review layer is the recent extension. GitHub Copilot code review went GA in April 2025 after a million-developer preview; Anthropic's Claude Code Review (2026) ships multiple specialised agents per diff and reports 84% finding rate on PRs over 1,000 lines.[10: Anthropic, Code Review for Claude Code (2026). Multiple specialised agents per diff (boundary checks, API misuse, cross-file consistency), with a verification pass to filter false positives; 7.5 issues per review on average.] Cursor's Bugbot reviews ~2 million PRs per month for customers including Discord and Airtable, with 76% bug-resolution and ~35% of Bugbot Autofix patches merged directly.[3] CodeRabbit has reviewed 13 million PRs across 2 million repositories; Greptile benchmarks at 82% bug-catch rate. The pattern is the same across all five tools: AI catches the high-volume class of issues; humans retain authority on architectural, business-logic and convention questions.

The AI4 cross-link is load-bearing. Whatever the AI catches, the human still has to verify that the AC was satisfied; the agent will write passing tests against an AC it interpreted loosely. AI4 Verifiable Specs is the row the human reviewer specifically checks during the PR review — the spec was machine-checkable, the test verified it, the code passed both. Without that check, the agent loop closes on itself and ships its priors.
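
The mechanical part of that check, tracing each acceptance criterion to a test and each test back to a criterion, can be sketched in a few lines. The `AcceptanceCriterion` and `TestCase` shapes and the `coversAc` tagging convention are assumptions for illustration. The sketch only finds gaps in the chain; whether a tagged test actually verifies its AC is still the human reviewer's call.

```typescript
// Hypothetical shapes: ACs carry an id, tests declare which ACs they cover.
interface AcceptanceCriterion {
  id: string;
  text: string;
}

interface TestCase {
  name: string;
  coversAc: string[];
  passed: boolean;
}

interface ChainReport {
  untested: string[];    // ACs that no test claims to cover
  failing: string[];     // ACs with at least one failing test
  orphanTests: string[]; // tests tied to no AC, or to an unknown AC id
}

function traceSpecChain(criteria: AcceptanceCriterion[], tests: TestCase[]): ChainReport {
  const knownIds = new Set(criteria.map((c) => c.id));

  const untested = criteria
    .filter((c) => !tests.some((t) => t.coversAc.includes(c.id)))
    .map((c) => c.id);

  const failing = criteria
    .filter((c) => tests.some((t) => t.coversAc.includes(c.id) && !t.passed))
    .map((c) => c.id);

  const orphanTests = tests
    .filter((t) => t.coversAc.length === 0 || t.coversAc.some((id) => !knownIds.has(id)))
    .map((t) => t.name);

  return { untested, failing, orphanTests };
}
```

An empty report means the chain is connected, not that it is correct; it tells the reviewer where to start reading.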

§2

Quotes

High performers have three or fewer active branches in the application's code repository at any time, branches have very short lifetimes (less than a day) before being merged into trunk, and the team never has “code freeze” or stabilisation periods.

Forsgren, Humble & Kim · Accelerate (2018)

The branch should only last a couple of days. Any longer than two days, and there is a risk of the branch becoming a long-lived feature branch.

Paul Hammant · Short-Lived Feature Branches

Although finding defects remains a main motivation for review, reviews are less about defects than expected and instead provide additional benefits such as knowledge transfer, increased team awareness, and creation of alternative solutions to problems.

Bacchelli & Bird · Modern Code Review (ICSE 2013)

The fundamental rule is to integrate often. The longer code stays out of mainline, the more it diverges, the more painful it is to integrate.

Martin Fowler · Patterns for Managing Source Code Branches (2020)
§3

Evidence

Twenty external sources, ranked by author authority. The first five, listed here, are the canon; the rest include the qualifiers and the named opposers.

  1. Forsgren, Humble & Kim · 2018
    Three concrete predictors of delivery performance: ≤3 active branches, branches merged daily, no integration phases. The empirical backstop for short-lived branches.
  2. DORA / Google Cloud · 2024
    Continues to list TBD as one of the technical capabilities that drive software-delivery performance; documentation quality is a 12.8× multiplier on its effect.
  3. Paul Hammant · 2016–
    The canonical site for the branching literature. The strict pure-TBD reading at one end; “Scaled TBD” with PR branches under two days at the other.
  4. Paul Hammant · 2016–
    Hammant's explicit endorsement of PR branches for one developer or one pair, very short lived (a day, two days at most).
  5. Vincent Driessen · 2010
    GitFlow with the 2020 update: for continuous-delivery teams, adopt GitHub Flow instead. The author of the most-popular branching model qualifying its own use.

Twenty-five sources, three stances. The supporters are Forsgren's DORA programme and Hammant's Trunk-Based Development primers: short branches with PR review, measured as the productivity sweet spot. The 2024–2026 AI-reviewer literature sits further down, extending the row. The qualifiers (led by Driessen) push the line that branching is a spectrum, not a binary. The opposers (the strict-Hammant pure-TBD reading) argue that PR branches are debt; that is the steelman the case has to address.

§4b

Enforcement

Viewing: TypeScript.

Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.

| Rule | Tool | Catches |
|---|---|---|
| Required reviews (≥1 human) | GitHub required reviews | merges without a human approval. The non-negotiable second pass after the AI review. |
| Branch protection (require PR) | GitHub branch protection | direct pushes to `main`. Forces every change through the PR review chain. |
| Claude Code Review | Claude Code Review (Anthropic) | the AI first pass. Multiple specialised agents per diff with verification pass; 84% finding rate on PRs over 1,000 lines. |
| GitHub Copilot code review | GitHub Copilot code review | second AI reviewer. Deterministic tool-calling on top of the LLM (ESLint, CodeQL). |
| PR-size action | GitHub Actions | PRs over 400 lines. Reviewer effectiveness collapses past this threshold (Bacchelli & Bird, Cohen). |
| max-lines / max-lines-per-function | ESLint (max-lines, max-lines-per-function) | files and functions that grew past the team's caps. Keeps the per-file diff small enough to read. |
| Required status checks | GitHub required status checks | merges where the quality script (P3) hasn't passed. The DoD becomes the merge gate. |
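
The three GitHub-side rows (required reviews, require PR, required status checks) can be expressed as one branch-protection request body. A minimal sketch, assuming the payload is sent to GitHub's REST "update branch protection" endpoint (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`); the check name `quality` is a placeholder for the P3 quality script, and `dismiss_stale_reviews: true` is an added assumption, not something the table requires.

```typescript
// Subset of the branch-protection request body used in this sketch.
interface BranchProtectionPayload {
  required_status_checks: { strict: boolean; contexts: string[] };
  enforce_admins: boolean;
  required_pull_request_reviews: {
    required_approving_review_count: number;
    dismiss_stale_reviews: boolean;
  };
  restrictions: null;
}

function buildMainProtection(requiredChecks: string[]): BranchProtectionPayload {
  return {
    // Merge is blocked until these checks pass on an up-to-date branch.
    required_status_checks: { strict: true, contexts: requiredChecks },
    // Admins go through the same gate as everyone else.
    enforce_admins: true,
    required_pull_request_reviews: {
      // Stands in for the mandatory human approval.
      required_approving_review_count: 1,
      // Assumption: a new push invalidates earlier approvals.
      dismiss_stale_reviews: true,
    },
    // No per-user push restrictions; every change arrives via a PR.
    restrictions: null,
  };
}

// "quality" is a hypothetical status-check name.
const payload = buildMainProtection(["quality"]);
console.log(JSON.stringify(payload, null, 2));
```

Building the payload in a function keeps the same gate reusable across repositories, with only the list of check names changing.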
eslint.config.mjs configuration snippet
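import tseslint from 'typescript-eslint';

// Size caps that keep each file's diff small enough to review.
export default tseslint.config({
  files: ['**/*.{ts,tsx}'],
  rules: {
    // Cap files at 150 lines, not counting blank lines or comments.
    'max-lines': ['error', { max: 150, skipBlankLines: true, skipComments: true }],
    // Cap functions at 15 lines, not counting blank lines.
    'max-lines-per-function': ['error', { max: 15, skipBlankLines: true }],
  },
});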
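
The ESLint caps bound files and functions; the PR-size row bounds the diff itself. A sketch of the check such an action might run, assuming a list of changed files with `filename`, `additions` and `deletions` fields (the same field names GitHub's "list pull request files" response uses). The ignore list is a hypothetical choice so that generated files don't count against the author.

```typescript
interface ChangedFile {
  filename: string;
  additions: number;
  deletions: number;
}

const MAX_CHANGED_LINES = 400;

// Hypothetical ignore list: generated files nobody reviews line by line.
const IGNORED_SUFFIXES = ["package-lock.json", ".snap"];

interface SizeVerdict {
  changedLines: number;
  withinLimit: boolean;
}

function checkPrSize(files: ChangedFile[], limit: number = MAX_CHANGED_LINES): SizeVerdict {
  const changedLines = files
    .filter((f) => !IGNORED_SUFFIXES.some((suffix) => f.filename.endsWith(suffix)))
    .reduce((total, f) => total + f.additions + f.deletions, 0);

  // "Over 400 lines" fails; exactly 400 still passes.
  return { changedLines, withinLimit: changedLines <= limit };
}
```

Counting additions plus deletions treats a large removal as review work too; a team that disagrees could count additions only.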
§4c

AI rules

File: .cursor/rules/p7-short-branches.mdc
---
description: Prickles P7 — Short-Lived Branches with AI+Human Review
globs: "**/*.{ts,tsx,js,jsx,py,java,php,md}"
alwaysApply: false
---

## Prickles P7 — Short-Lived Branches with AI+Human Review

Trunk is `main`, and `main` is always shippable. Cut a branch for a single piece of work, keep it under a day where you can.

Open a PR. Request the AI review first; address what it flags. Ask a human reviewer for the second pass.

Human reviewers verify two things the AI can't: the architecture the AI couldn't see, and whether AI4 Verifiable Specs was satisfied — the AC was machine-checkable, the test verified it, the code passed both.

Use feature flags or branch-by-abstraction whenever the work is bigger than a branch can carry. Never use a branch as a substitute for shipping.

Repo layout, CI, and ESLint wiring for these paths live on /implementation — not repeated on every tenet.

§5

Counter-argument

Counter

The strongest steelman is the strict-Hammant pure-TBD reading: any branch is debt, the only safe pattern is direct commits to trunk with refinement review afterward.[6: Paul Hammant, trunkbaseddevelopment.com (canonical site). The strict pure-TBD reading: direct commits to trunk, refinement review after merge or pair-programmed before push. Realistic only for very small, very high-trust teams.] The high-trust pair-programming teams Hammant cites operate this way and ship cleanly; the implication is that PR branches are a crutch the discipline doesn't need. Newport's deep-work argument extends the point at the cognitive scale: the PR loop fragments attention and the merge cadence is the lower bound on context-switch frequency.

§6

Counter-argument retort

Reply

The strict-Hammant pure-TBD reading lands when the team is small and high-trust enough that pair-programming functions as continuous review. For everyone else (the median industrial team) the empirical record is unambiguous: PR branches under two days with mandatory review land in the same elite-performer cluster as pure TBD without giving up the structured review surface.[1] The fix is not to abandon PRs; it is to keep the branch short.

Newport's context-switch argument is real for the team that runs many small PRs in parallel without batching. The reply is to tune the cadence: reviewers batch their PR reading into morning and afternoon windows; the AI reviewer fires immediately so the agent can iterate; the human runs the second pass when the diff is stable. The discipline survives; the cognitive cost is paid in batches, not in interruptions.

The genuine residue is reviewer fatigue and rubber-stamping. Bacchelli and Bird's data shows reviewer effectiveness collapses above ~200–400 lines per review.[4] The fix is the small-PR discipline: split the change-set, ship structural tidies in a separate PR per P6 Leave it Better, keep the behavioural diff small enough to read in five minutes. The AI reviewer extends the effectiveness curve to the right (it doesn't fatigue), but the human pass still has to land before the merge button.

In production work, the row that pairs short branches with AI-first-pass and mandatory human approval is the row whose PR queue moves through review in hours rather than days. The empirical record across five years of DORA reports has not moved against the pattern; the AI-reviewer additions of 2024–2026 strengthen the case rather than weaken it. Pair this with P3 Definition of Done and the merge button reads the same checklist the human reviewer signed off on.

§7

Notes

  1. Forsgren, Humble & Kim, Accelerate: The Science of Lean Software and DevOps (IT Revolution, 2018), ch. 4. Three concrete predictors of delivery performance: ≤3 active branches in the repo, branches merged to trunk at least daily, no integration phases. Elite performers who hit reliability targets are 2.3× more likely to use TBD.
  2. Paul Hammant, Short-Lived Feature Branches (trunkbaseddevelopment.com). The branch should only last a couple of days; any longer than two days, and there is a risk of the branch becoming a long-lived feature branch. Hammant's “Scaled TBD” explicitly permits PR branches at this cadence.
  3. Cursor, Bugbot product documentation (cursor.com/bugbot, 2026). Reviews ~2 million PRs per month for customers including Discord and Airtable; 76% bug-resolution; ~35% of Bugbot Autofix patches merged directly.
Last revised: 2026-04-27