Case file — AI5

Verify the API

Don't trust the training.

The model's mental model of next, drizzle, and every other package on disk is stale by definition. Look it up before you write the call.

By Adam Lewis · Published 3 May 2026 · Reading 12 min · Version v1.0 · Confidence High
§0b

Opinion

This is the strongest standalone tenet in the AI pillar. Three threads converge and they all point the same way: knowledge cutoff is structural; hallucinated APIs are the canonical failure mode; the fix is documented and works. The only honest steelman against “look it up before you call it” is that a future generation of models may make the rule redundant.

Every model ships as “a snapshot in time with frozen knowledge at the day its training completes.”[1] Claude 4.7's training data is bounded; GPT-5's is bounded; Gemini's is bounded. APIs change between cutoff and now: minor versions deprecate methods, major versions reshape entire surfaces, and the most popular libraries change the fastest. The model does not know which version of react-hook-form the project installed; it will confidently generate calls for the version it was trained on, which may not be the version on disk.
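A minimal sketch of the version check that paragraph implies, assuming the agent can read the text of the installed package's package.json. The helper name `checkInstalledMajor`, its return shape, and the sample manifest are hypothetical and only illustrate comparing the version on disk with the version a draft was written against.

```typescript
// Illustrative helper; the name and return shape are assumptions, not a library API.
interface VersionCheck {
  installed: string;
  installedMajor: number;
  matchesAssumption: boolean;
}

// Takes the raw text of node_modules/<pkg>/package.json and the major version
// the drafted code was written against.
function checkInstalledMajor(packageJsonText: string, assumedMajor: number): VersionCheck {
  const version = (JSON.parse(packageJsonText) as { version?: unknown }).version;
  if (typeof version !== "string") {
    throw new Error('package.json has no string "version" field');
  }
  const match = /^(\d+)\./.exec(version);
  if (match === null) {
    throw new Error(`Unrecognised version string: ${version}`);
  }
  const installedMajor = Number(match[1]);
  return {
    installed: version,
    installedMajor,
    matchesAssumption: installedMajor === assumedMajor,
  };
}

// Made-up manifest text standing in for node_modules/react-hook-form/package.json.
const manifestText = '{"name":"react-hook-form","version":"7.0.0"}';
const versionCheck = checkInstalledMajor(manifestText, 6);
console.log(versionCheck.matchesAssumption); // false: the draft targets v6, v7 is on disk
```

A mismatch does not say which calls are wrong; it only signals that the docs for the installed major need to be read before the call is written.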

Simon Willison says it without hedging: agents “will absolutely make mistakes — sometimes subtle, sometimes huge ... like hallucinating non-existent libraries or methods.”[2] Amazon Science published the fix in 2024: Documentation Augmented Generation (DAG) “significantly improves performance for low frequency APIs” by checking against an API index before generation.[3] The Context7 MCP server from Upstash is the production version of the same idea: it “pulls up-to-date, version-specific documentation and code examples straight from the source and places them directly into your prompt” for thousands of libraries.[4] Grep against the local package, web fetch against the changelog, MCP query against the database: the surface differs, the rule does not.

The deeper claim is that this is one of the few AI tenets with no good non-AI analogue. A human engineer who guesses at an API call is being lazy; an LLM that does the same thing is behaving exactly as designed. It generates the most likely token; the most likely token reflects training; training is stale. The fix has to live at the agent-rules layer, because the model cannot fix itself: the same probabilities that drive the hallucination drive the agent's confidence in the hallucination.

Copy a note and link

Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.

Don't trust the training. The model's `next` and `drizzle` are stale by definition. Look up the docs, grep the source, query the MCP — then write the call. Grounding beats recall on every API older than the cutoff.

/tenet/verify-the-api/AI5
§0c

AI eyes only

Rule: do not trust the training. Look up before you call.

Reject: emitting an API call without verifying the signature against current docs. Reject: “there is probably a method called”-style guesses. Reject: inventing helpers that “should exist” in the framework.

Generate: for external APIs, query resolve-library-id + query-docs via Context7 MCP (or equivalent) before writing the call. For internal APIs, grep the workspace for the function definition first.
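For the internal-API half of that rule, a rough sketch of "grep for the definition first". It assumes the workspace is available as an in-memory map of path to file contents; `Workspace`, `findDefinitions`, and the sample files are hypothetical, and the match is a text heuristic rather than a parser.

```typescript
// Hypothetical in-memory stand-in for a workspace: path -> file contents.
type Workspace = Record<string, string>;

interface DefinitionHit {
  file: string;
  line: number; // 1-based line number
  text: string;
}

// Finds lines that look like a definition of `name`: a function declaration
// or a const/let binding. A heuristic text search, not a parser.
function findDefinitions(workspace: Workspace, name: string): DefinitionHit[] {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `\\b(?:function\\s+${escaped}\\b|(?:const|let)\\s+${escaped}\\s*=)`
  );
  const hits: DefinitionHit[] = [];
  for (const [file, contents] of Object.entries(workspace)) {
    contents.split("\n").forEach((text, index) => {
      if (pattern.test(text)) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}

// Made-up workspace for illustration.
const workspace: Workspace = {
  "src/users.ts": "export async function getUser(id: string) {\n  return { id };\n}",
  "src/orders.ts": "import { getUser } from './users';\nexport const total = 0;",
};

const realHits = findDefinitions(workspace, "getUser");
const guessedHits = findDefinitions(workspace, "getUserById");
console.log(realHits.length);    // 1: defined at src/users.ts line 1
console.log(guessedHits.length); // 0: no definition found, so the call is a guess
```

An empty result is the useful signal here: the helper the agent was about to call has no definition in the workspace.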

Diagnostic: cite the source for every API call (doc URL, file path, or grep result). No cite means a guess.
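One way the diagnostic could be represented, sketched under the assumption that the agent records each proposed call alongside its evidence. The `Citation` and `ProposedCall` types and the `uncitedCalls` helper are hypothetical names for illustration.

```typescript
// Hypothetical shapes for the three kinds of evidence the diagnostic names.
type Citation =
  | { kind: "doc"; url: string }
  | { kind: "file"; path: string }
  | { kind: "grep"; query: string; hit: string };

interface ProposedCall {
  api: string;         // the call the agent wants to emit
  citation?: Citation; // absent means the call was written from recall
}

// Returns the calls that carry no citation, i.e. the guesses.
function uncitedCalls(calls: ProposedCall[]): string[] {
  return calls.filter((call) => call.citation === undefined).map((call) => call.api);
}

const proposed: ProposedCall[] = [
  { api: "fs.readFileSync", citation: { kind: "doc", url: "https://nodejs.org/api/fs.html" } },
  { api: "getUser", citation: { kind: "grep", query: "function getUser", hit: "src/users.ts:1" } },
  { api: "react.useEffectAsync" }, // no evidence attached
];

console.log(uncitedCalls(proposed)); // [ 'react.useEffectAsync' ]
```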

§0d

Why?

  • Kills the canonical AI failure mode. Hallucinated APIs are among the most commonly reported bug classes in agent-written code; grounding fixes them at the source.
  • Cheap and fast to enforce. One MCP query, one grep, one docs URL — the verification adds seconds, the missed verification can cost an afternoon.
  • The fix is published. Amazon Science's DAG paper, Anthropic's MCP, Upstash's Context7, and the broader retrieval-augmented-generation literature: the rule rests on results, not opinion.
  • Compounds with AI4 Verifiable Specs. The spec runs the call against the actual library; the test fails when the API doesn't exist; the rule pays off in the build, not in review.
  • AI-specific in a way most rules aren't. The failure mode is structural to how LLMs generate text; no human-only equivalent of the rule exists.
  • The rule survives any specific tool. MCP servers, web fetch, vendored docs, repo grep, type-check — pick the surface that fits your stack; the principle is grounding, not a particular product.
  • Pairs with strict types. A type system catches the calls the agent typed against an imagined signature; the lookup catches the API surface the agent imagined into existence. Types and grounding cover different halves of the same bug.
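The "test fails when the API doesn't exist" point in the list above can be sketched as a runtime smoke check. The helper name `missingFunctions` is hypothetical, and the built-in `JSON` object stands in for an imported library namespace so the example runs without dependencies.

```typescript
// Returns the names that are not callable on the given module namespace.
// Illustrative helper, not part of any test framework.
function missingFunctions(moduleNamespace: object, names: string[]): string[] {
  const record = moduleNamespace as Record<string, unknown>;
  return names.filter((name) => typeof record[name] !== "function");
}

// JSON stands in for `import * as lib from "some-library"`.
// `parseSafe` is a plausible-sounding method that does not exist.
const missing = missingFunctions(JSON, ["parse", "stringify", "parseSafe"]);
console.log(missing); // [ 'parseSafe' ]

// In a spec, a non-empty list would fail the build before review.
if (missing.length > 0) {
  console.log(`Unverified API surface: ${missing.join(", ")}`);
}
```

A check like this only covers existence; the type checker still has to cover arity and argument types, which is the other half of the bug the last bullet describes.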
The receipts
Origins, quoted passages, evidence, the strongest counter-argument and the reply.
§1

Origins

The pattern showed up first as folklore: by 2023, Twitter threads about Copilot and ChatGPT routinely included an example of a hallucinated API. fs.readFileSync with imaginary options. useEffect overloads that didn't exist. npm packages whose names looked plausible and whose npm install step failed because the package was never published. The community noticed the failure mode before the literature named it.

Simon Willison's March 2025 essay How I Use LLMs to Help Me Write Code is the single best practitioner statement of both problem and fix.[2] Treat the agent as “an over-confident pair programming assistant who's lightning fast at looking things up”: useful, but liable to make mistakes “sometimes subtle, sometimes huge ... like hallucinating non-existent libraries or methods.” The mitigation: don't outsource the test that the code actually works.

The structural framing came from the cutoff literature. By 2025 the term “knowledge cutoff” was canonical: every LLM is “a snapshot in time with frozen knowledge at the day its training completes.”[1] The Frontiers in Artificial Intelligence 2025 survey, Survey and Analysis of Hallucinations in Large Language Models, names code-API hallucination as one of the highest-prevalence categories,[5] so the rule sits in a research-validated problem space. vLLM's December 2025 work, HaluGate: Token-Level Truth, shows token-level signals can detect hallucinations during generation, not just after,[6] which is useful for building guardrails that intercept the call before it ships.

The fix moved from folklore to documented practice in 2024. Amazon Science's Documentation Augmented Generation paper formalised the “look it up first” pattern: maintain an API index, retrieve from it before generating low-frequency calls, ground the model in current docs.[3] Anthropic's Model Context Protocol shipped the production pipework for the same idea, with Context7 (Upstash) and a constellation of equivalent MCP servers for libraries, databases, repos, and changelogs.[4] Anthropic's Effective Context Engineering and Writing Effective Tools for AI Agents articles position MCP-style retrieval as the load-bearing counter to training staleness.
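The three steps named above (keep an index, retrieve for low-frequency calls, ground the prompt) can be sketched roughly as follows. This is not the paper's implementation: the frequency threshold, the usage counts, the `acme.*` APIs, and every name in the block are assumptions chosen for illustration.

```typescript
interface ApiDoc {
  signature: string;
  summary: string;
}

// Hypothetical API index; in practice it would be built from the installed docs.
const apiIndex = new Map<string, ApiDoc>([
  ["fs.readFileSync", { signature: "readFileSync(path, options?)", summary: "Synchronously reads a file." }],
  ["acme.rotateKeys", { signature: "rotateKeys(keyId: string): Promise<void>", summary: "Rotates one key." }],
]);

// Made-up usage counts standing in for how often each API appears in training data.
const usageCounts = new Map<string, number>([
  ["fs.readFileSync", 120000],
  ["acme.rotateKeys", 3],
]);

const LOW_FREQUENCY_THRESHOLD = 100; // assumed cut-off for triggering retrieval

// Adds index entries to the prompt for low-frequency APIs only, and flags
// any candidate API that the index does not contain.
function augmentPrompt(prompt: string, candidateApis: string[]): string {
  const notes: string[] = [];
  for (const api of candidateApis) {
    const seen = usageCounts.get(api) ?? 0;
    if (seen >= LOW_FREQUENCY_THRESHOLD) continue; // well-known API: skip retrieval
    const doc = apiIndex.get(api);
    notes.push(
      doc
        ? `${api}: ${doc.signature} (${doc.summary})`
        : `${api}: NOT FOUND in API index; do not call it`
    );
  }
  return notes.length === 0 ? prompt : `${prompt}\n\nAPI reference:\n${notes.join("\n")}`;
}

console.log(
  augmentPrompt("Rotate the signing keys.", ["fs.readFileSync", "acme.rotateKeys", "acme.rotateAllKeys"])
);
```

Retrieval is skipped for the high-frequency call and triggered for the two rare ones, one of which the index rejects outright.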

§2

Quotes

An over-confident pair programming assistant who's lightning fast at looking things up. They'll absolutely make mistakes — sometimes subtle, sometimes huge ... like hallucinating non-existent libraries or methods.

Simon Willison · How I Use LLMs to Help Me Write Code (2025)

Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs ... intelligently trigger DAG where you check against an API index.

Nihal Jain et al. (Amazon Science) · Mitigating Code LLM Hallucinations with API Documentation

Up-to-date, version-specific documentation and code examples straight from the source ... thousands of popular libraries.

Upstash · Context7 MCP server

Every LLM is a snapshot in time with frozen knowledge at the day its training completes — known as the “knowledge cutoff”.

Lakera · Guide to Hallucinations in Large Language Models (2026)
§3

Evidence

External sources, ranked by author authority. The first five, listed here, are the canon; the full list also includes the qualifiers and the named opposers. Each links out to its primary source.

  1. Simon Willison · 2025
    Practitioner statement of both problem and fix. Names hallucinated libraries / methods as the canonical failure mode; advocates verification of every external call before commit.
  2. Nihal Jain et al. (Amazon Science) · 2024
    Documents the DAG approach: maintain an API index; retrieve from it before generating low-frequency calls; ground the model. The published industrial fix.
  3. Upstash · 2024–
    Production MCP for version-specific library docs. Exposes the look-it-up rule as a tool the agent can call without a context-switch.
  4. Anthropic · 2024
    The standard that lets agents query live tools (docs, databases, repos, changelogs) rather than relying on training. The pipework underneath the rule.
  5. Anthropic · 2025
    Glob and grep as first-class primitives. The agent has to be told to use retrieval; the rule is the telling.

The supports cluster on the practitioner and tooling line: Simon Willison's essay, Amazon Science's DAG paper (Jain et al.), the Context7 MCP server, and the Model Context Protocol spec from Anthropic. The structural-cutoff papers sit further down. The qualifiers cover the cases where the agent does check but chooses to ignore the result. The opposing voice (the bigger-models-will-fix-it argument) weakens with every published benchmark; the reply addresses why.

§4b

Enforcement

Viewing: TypeScript.

Apply these rules in eslint.config.mjs. The full enforcement across every tenet lives on the implementation page.

| Rule | Tool | Catches |
| --- | --- | --- |
| import/no-unresolved | eslint-plugin-import | Imports of modules that don't exist on disk. Most-direct catch for hallucinated dependencies. |
| import/named | eslint-plugin-import | Named imports that don't exist in the source module. Catches the agent calling `import { useEffectAsync } from 'react'` when no such export exists. |
| import/no-deprecated | eslint-plugin-import | Calls to APIs marked deprecated in the source. The agent's training favours the older API; the rule catches the gap. |
| @typescript-eslint/no-unsafe-call | typescript-eslint | Calls against the `any` type, typically the agent giving up and silently routing around a missing type. Catches the Hail Mary call. |
| tsc --noEmit | tsc | Imaginary signatures, missing fields, wrong arity. The compiler is the cheapest verification of the call surface. |
eslint.config.mjs configuration snippet
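import tseslint from 'typescript-eslint';
import importPlugin from 'eslint-plugin-import';

export default tseslint.config({
  files: ['**/*.{ts,tsx}'],
  // Register both plugins; rules from an unregistered plugin fail to load.
  plugins: { '@typescript-eslint': tseslint.plugin, import: importPlugin },
  languageOptions: {
    parser: tseslint.parser,
    // The no-unsafe-* rules need type information (assumes typescript-eslint v8+).
    parserOptions: { projectService: true },
  },
  // Assumes eslint-import-resolver-typescript is installed so .ts/.tsx imports resolve.
  settings: { 'import/resolver': { typescript: true } },
  rules: {
    // Hallucinated modules, exports, and dependencies
    'import/no-unresolved': 'error',
    'import/named': 'error',
    'import/no-extraneous-dependencies': 'error',
    // Real but stale APIs
    'import/no-deprecated': 'warn',
    // Calls and member access routed through `any`
    '@typescript-eslint/no-unsafe-call': 'error',
    '@typescript-eslint/no-unsafe-member-access': 'error',
  },
});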
§4c

AI rules

File: .cursor/rules/ai5-verify-the-api.mdc
---
description: Prickles AI5 — Verify the API
globs: "**/*.{ts,tsx,js,jsx,py,java,php}"
alwaysApply: true
---

## Prickles AI5 — Verify the API

Knowledge cutoff is structural, not incidental. The model's mental model of every framework, library, and SDK is bounded by its training date.

Don't trust the training. Look up the docs, grep the source, query the MCP — then write the call. Grounding beats recall on every API older than the cutoff.

Hallucinated APIs are the canonical AI failure mode in code. The fix is documented, well-named (Documentation Augmented Generation), and works.

Treat the agent's first plausible API call as a draft. Verify against current source before letting it ship; the most likely token reflects the training, and the training is stale.

Repo layout, CI, and ESLint wiring for these paths live on /implementation — not repeated on every tenet.

§5

Counter-argument

Counter

The honest steelman is that the rule is overhead the next model will retire. Every cycle of larger context windows, web-search-by-default, and tool-augmented inference reduces the gap between training and now. The 2026 frontier models ship with built-in retrieval, browse-on-demand, and live MCP integrations enabled by default. If the next generation of agents structurally grounds every call before writing, the “look it up first” rule becomes a discipline that no longer earns its place: true but no longer instructive.

§6

Counter-argument retort

Reply

The bigger-models-will-fix-it argument concedes the rule for the present and asks for it to retire in the future. The future may very well retire it; the present has not.

Live retrieval shrinks the gap, but it does not close it. The model still has to decide what to retrieve, and that decision still runs through training-shaped probabilities; the agent that doesn't know which docs to look up is left looking up the wrong ones, or none. Anthropic's own Effective Context Engineering for AI Agents article (October 2025) makes this explicit: glob and grep are first-class primitives because the model still needs to be told to use them, even when retrieval is enabled.[7] The default, with no rule, is for the agent to type the most likely call and ground only when prompted.

Empirically, the failure mode persists across model generations. GPT-5, Claude 4.7, and Gemini 2.5 Pro all hallucinate APIs at non-trivial rates on contemporary benchmarks; the rate falls with each generation but does not approach zero. Sleeper-agent research at Anthropic (Hubinger et al., January 2024) is background for the deeper problem: agent output cannot be trusted by default, because a confident, fluent generation can be wrong in ways the generator cannot detect.[8] Verification has to come from outside the model.

The strongest version of the rule survives even the optimistic future: ground every external-API call against current evidence before writing. If the next generation of agents structurally grounds every call by default, the rule has been honoured at the platform layer rather than the team layer; that is a victory, not a refutation. Until then, the team layer has to do the grounding. Pair this with AI4 Verifiable Specs — the spec runs the call against the actual library; the test fails when the API doesn't exist; the rule pays off in the build.

§7

Notes

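  1. [1] Lakera · Guide to Hallucinations in Large Language Models (2026). “Every LLM is a snapshot in time with frozen knowledge at the day its training completes — known as the knowledge cutoff.” The structural framing for why staleness is not incidental.
  2. [2] Simon Willison · How I Use LLMs to Help Me Write Code (March 2025). The most-quoted practitioner essay on the agent-coding loop. Names hallucinated APIs as one of the canonical failure modes.
  3. [3] Nihal Jain et al. (Amazon Science) · On Mitigating Code LLM Hallucinations with API Documentation. “Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs.” Documented industrial fix for the failure mode.
  4. [4] Upstash · Context7 MCP server. “Up-to-date, version-specific documentation and code examples straight from the source ... thousands of popular libraries.” Production tooling for the look-it-up rule.
  5. [5] Frontiers in Artificial Intelligence · Survey and Analysis of Hallucinations in Large Language Models (2025). Code-API hallucination is one of the highest-prevalence categories; the rule sits in a research-validated problem space.
  6. [6] vLLM · HaluGate: Token-Level Truth (December 2025). Token-level signals can detect hallucinations during generation. Useful for building guardrails that intercept the call before it ships, not just after.
  7. [7] Anthropic · Effective Context Engineering for AI Agents (October 2025). Glob and grep are first-class primitives because the model still needs to be told to use them, even when retrieval is enabled.
  8. [8] Hubinger et al. (Anthropic) · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (January 2024). Background context for why agent output cannot be trusted by default and why grounding / verification gates matter.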
Last revised: 2026-04-27