The 37-point production-readiness checklist for AI systems
Every two-week audit I run starts with the same checklist. Thirty-seven checks, grouped into seven themes — evals, resilience, cost, observability, security, data, and rollback. Most demos clear fewer than ten of them. Most production systems clear at least thirty. The gap between “works on the founder’s laptop” and “survives a real Tuesday” is roughly the items below.
Seven themes wrap thirty-seven specific checks. The score is the same shape every time — rollback last, evals first.
What this checklist is — and what it isn’t
This is the running document I have built up over a decade of shipping AI to production and the last two years of running it as a fixed-price engagement. It is not a generic web-app readiness list. It is the part that is specific to systems where a non-deterministic model sits in the request path. The deterministic stuff — CI, IaC, secret rotation, dependency scanning — you already have, or you don’t have a product yet.
The checklist is not a quality bar. It is a triage tool. Most teams I audit pass roughly twenty of the thirty-seven. The job of the audit is not to demand all thirty-seven by Friday — it is to identify the three to five gaps that are most likely to cause an outage or burn the runway, and prioritise those. The rest are tracked and worked down over a quarter.
Theme 01 — Evals (6 checks)
Without evals you cannot say whether a deploy is safe. With them, you can. The minimum viable eval stack covers retrieval (when there is retrieval), generation, and behavioural rubrics.
Eval set under version control. A frozen, hand-curated set of at least 100 cases — covering happy path, edge cases, and known-bad inputs — lives in the repo, not someone’s laptop. Updated through PRs.
Three quantitative metrics. Faithfulness, answer-relevancy, and at least one task-specific metric (resolution rate, classification F1, tool-call accuracy). Numbers, not vibes.
Threshold-gated CI. The eval suite runs on every PR that touches the prompt, the model version, or the retrieval config. Drops below the threshold block the merge.
Judge model isolated from generator. If you use an LLM as judge, it is a different model and a different prompt from the one you are evaluating. Otherwise you are measuring the model rating itself, and LLM judges are known to favour their own outputs.
Drift detection on production traffic. A subset of live traffic is sampled, scored offline against the same metrics, and compared to the eval-set baseline weekly. Drift > 5% pages someone.
Eval cost capped. The full eval run costs less than £20 per execution. If it costs £200, nobody runs it on PRs.
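A minimal sketch of the threshold gate, assuming the eval run has already produced a dictionary of metric scores. The metric names follow the list above, but the floor values, the `Threshold` type, and the `gate` and `exit_code` helpers are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    minimum: float


# Placeholder floors for illustration only; set yours from a measured baseline.
THRESHOLDS = (
    Threshold("faithfulness", 0.85),
    Threshold("answer_relevancy", 0.80),
    Threshold("tool_call_accuracy", 0.90),
)


def gate(scores, thresholds=THRESHOLDS):
    """Return readable failures; an empty list means the merge may proceed."""
    failures = []
    for t in thresholds:
        value = scores.get(t.metric)
        if value is None:
            # A metric that silently disappears is treated as a failure.
            failures.append(f"{t.metric}: missing from eval output")
        elif value < t.minimum:
            failures.append(f"{t.metric}: {value:.3f} is below {t.minimum:.3f}")
    return failures


def exit_code(scores):
    """Non-zero exit code so the CI job fails and blocks the merge."""
    return 1 if gate(scores) else 0
```

A CI step would run the eval suite, pass its scores to `exit_code`, and exit with the result. Treating a missing metric as a failure is a deliberate choice here: a broken eval harness should block the merge just as a regression does.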
Theme 02 — Resilience (6 checks)
The retry-storm story I keep retelling lives here. Six layers wrap every outbound model call.
Error classification. A single function decides retryable vs fatal. Nothing else in the codebase makes that call.
Exponential backoff with jitter. Doubling, randomised, capped at 30–60 seconds. Five attempts max.
Idempotency keys on mutating tool calls. Every send_email, charge_card, create_user path includes a deterministic key.
Circuit breaker per provider. Opens at 50% failure over a 30-second window, cools for 45 seconds, half-opens with a single probe.
Cross-provider fallback. When the primary provider's circuit breaker opens, traffic routes to a secondary provider with its own prompt template. The eval thresholds must pass on the fallback too.
Graceful degradation path. When all providers fail, the system returns a deterministic fallback or a 503 with retry-after, not an unhandled 500.
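The first two layers can be sketched together: one function that classifies errors, and a retry loop that uses it with capped, jittered, doubling delays. This is an illustration under stated assumptions, not a drop-in client. `ProviderError`, the set of retryable status codes, and the helper names are all hypothetical; real provider SDKs raise their own exception types.

```python
import random
import time

# Assumption: these HTTP statuses are worth retrying. Adjust per provider.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical wrapper for an HTTP-level failure from a model provider."""

    def __init__(self, status, message=""):
        super().__init__(message or f"provider returned {status}")
        self.status = status


def is_retryable(exc):
    """The single place that decides retryable vs fatal."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, ProviderError):
        return exc.status in RETRYABLE_STATUS
    return False


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full jitter: uniform between 0 and the capped, doubled delay."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(), retrying retryable errors up to max_attempts in total."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Fatal errors and the final attempt propagate to the caller.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so tests do not wait on real timers. A circuit breaker and provider fallback would wrap `call_with_retry` from the outside; they are left out to keep the sketch short.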
Theme 03 — Cost (5 checks)
The CFO Slack message arrives on day eleven. These five checks are the ones that prevent it.
Per-request token cap. Hard limit on input + output tokens per call. Returns a structured error if exceeded, not a truncated answer.
Per-tenant daily spend cap. A tenant cannot exceed their plan’s budget. The cap is enforced before the call, not reconciled after.
Semantic cache for repeat queries. Identical-or-near-identical prompts hit a cache (vector similarity threshold around 0.97). Hit rate is monitored.
Cheaper model on the easy path. Triage step routes simple queries to a smaller model. Validated against evals to prove no quality regression on the easy class.
Cost-per-conversion tracked. Token spend is attributed to a business outcome — resolved ticket, qualified lead, completed checkout. Not just £/day.
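"Enforced before the call" means reserving the estimated cost first and refusing the request if the reservation would breach the cap. A rough sketch, assuming an in-memory ledger and costs tracked in pence. `SpendLedger`, `BudgetExceeded`, and the method names are hypothetical, and a real system would need a shared, atomic store rather than a dictionary.

```python
from dataclasses import dataclass, field
from datetime import date


class BudgetExceeded(Exception):
    """Raised before the model call when a tenant's daily cap would be breached."""


@dataclass
class SpendLedger:
    daily_cap_pence: dict                      # tenant -> cap in pence
    spent: dict = field(default_factory=dict)  # (tenant, day) -> pence

    def reserve(self, tenant, estimated_pence, today):
        """Reserve the estimated cost, or refuse the call outright."""
        key = (tenant, today)
        already = self.spent.get(key, 0)
        if already + estimated_pence > self.daily_cap_pence[tenant]:
            raise BudgetExceeded(f"{tenant} would exceed the daily cap")
        self.spent[key] = already + estimated_pence

    def settle(self, tenant, estimated_pence, actual_pence, today):
        """Replace the estimate with the real cost once the call returns."""
        key = (tenant, today)
        self.spent[key] = self.spent.get(key, 0) - estimated_pence + actual_pence
```

The caller runs `reserve`, makes the model call, then runs `settle` with the billed amount. Reserving the estimate up front is what stops concurrent requests from each seeing headroom that no longer exists.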
Theme 04 — Observability (5 checks)
If you cannot answer “what did the model do for request req_abc123?” in under sixty seconds, you do not have observability. You have logs.
Correlation IDs across the request graph. A single request ID flows from API ingress, through every model call and tool invocation, into the database write. Searchable from any log line.
Structured logs as JSON. No free-text. Provider, model, prompt-template-id, token-in, token-out, latency, status, request-id — all as fields.
P50/P95/P99 latency per model + provider. Tracked separately. Alerts on P99 > SLO for 10 minutes.
Trace sampling on full conversations. A configurable percentage of conversations are stored end-to-end with prompts, responses, and tool calls for debugging.
Eval drift dashboard. Production faithfulness vs eval-set faithfulness, on the same time axis. Anyone in the team can read it.
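One way to get the first two checks from the standard library alone: carry the request ID in a context variable and emit each model call as a single JSON object. This is a sketch, assuming Python's `logging` and `contextvars` modules. The field names mirror the list above, while `JsonFormatter`, `request_id_var`, and `log_model_call` are hypothetical names.

```python
import contextvars
import json
import logging

# Set once at API ingress; every log line in the same context picks it up.
request_id_var = contextvars.ContextVar("request_id", default="-")

# Fields copied from the log record into the JSON payload when present.
FIELDS = (
    "provider", "model", "prompt_template_id",
    "tokens_in", "tokens_out", "latency_ms", "status",
)


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        for name in FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def log_model_call(logger, **fields):
    """Emit one structured line per model call; values travel via `extra`."""
    logger.info("model_call", extra=fields)
```

With this in place, finding everything the model did for `req_abc123` becomes a single field query in the log store. Tool invocations and database writes would log through the same formatter so the ID appears on every line.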
Theme 05 — Security (6 checks)
This is the theme where the 47-question enterprise questionnaire either gets answered yes or stalls the deal for six weeks. Build it in early.
Prompt-injection defence. Untrusted input (user content, scraped pages, tool outputs) is delimited and the system prompt is hardened. Documented in the architecture doc, tested in evals.
PII redaction at the edge. Names, emails, payment data, health identifiers are detected and either redacted or tokenised before they reach the provider, unless contracts and DPIA say otherwise.
Output validation against schema. Tool calls and structured outputs are parsed against a Pydantic / Zod schema. Failures are caught, retried with a corrective system message, then surfaced — never silently dropped.
Secrets in env vars, not prompts. No API keys, customer keys, or internal URLs are embedded in templates. The prompt store is reviewed for these on every PR.
Provider data-retention setting verified. The zero-retention or 30-day retention setting is confirmed in writing with the provider, and the option that stops the provider training on customer data is switched on.
RBAC on prompt and model changes. Editing a production prompt requires the same review path as editing production code. Not a config UI that anyone with marketing access can hit.
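The output-validation check has a specific control flow: parse, retry with a corrective message on failure, then raise. The sketch below shows that loop using only the standard library. `parse_refund_call` is a hand-rolled stand-in for what a Pydantic or Zod schema would do, and the refund tool, its fields, and every name here are hypothetical.

```python
import json


class OutputValidationError(Exception):
    """Surfaced to the caller when corrections are exhausted."""


def parse_refund_call(raw):
    """Stand-in schema: {"order_id": str, "amount_pence": positive int}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("order_id"), str):
        raise ValueError("order_id must be a string")
    amount = data.get("amount_pence")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_pence must be a positive integer")
    return data


def validated_call(generate, parse, max_corrections=1):
    """generate(correction) returns raw model output; correction is None at first."""
    correction = None
    last_error = None
    for _ in range(max_corrections + 1):
        raw = generate(correction)
        try:
            return parse(raw)
        except ValueError as exc:
            last_error = exc
            # Fed back to the model as a corrective system message.
            correction = f"Your previous output was rejected: {exc}. Return only valid JSON."
    raise OutputValidationError(str(last_error)) from last_error
```

The important property is the last line: after the retries run out, the failure is raised to the caller and never swallowed.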
Theme 06 — Data (5 checks)
Most retrieval problems look like model problems and are not. The fix is upstream.
Indexed corpus version-pinned. The vector index has a build ID that ties to a snapshot of the source data. Re-indexing is a deploy event, not a background script that nobody owns.
Freshness SLO. The lag between source-of-truth update and index update has a target (e.g. < 1 hour) and an alert.
Deduplication and chunk hygiene. Duplicate documents are detected before indexing. Chunk boundaries respect semantic units, not byte counts. Token-length distribution is monitored.
Data residency confirmed. EU customer data does not transit US-region inference, or it does and the contract permits it. Documented per route.
Right-to-erasure path. A user-deletion request results in removal from index, cache, logs, and any judge-model fine-tune set within the documented SLA.
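Version-pinning and deduplication can share one content hash. A sketch under narrow assumptions: documents are plain strings keyed by ID, and "duplicate" means identical after whitespace and case normalisation, which will not catch near-duplicates. The function names and the `chunker_version` parameter are hypothetical.

```python
import hashlib


def normalise(text):
    """Collapse whitespace and case so trivial variants hash the same."""
    return " ".join(text.lower().split())


def content_hash(text):
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def dedupe(docs):
    """Keep the first document (by sorted ID) for each distinct content hash."""
    seen = set()
    kept = {}
    for doc_id in sorted(docs):
        digest = content_hash(docs[doc_id])
        if digest not in seen:
            seen.add(digest)
            kept[doc_id] = docs[doc_id]
    return kept


def build_id(docs, chunker_version):
    """Deterministic ID tying an index build to its source snapshot and chunker."""
    digest = hashlib.sha256(chunker_version.encode("utf-8"))
    for doc_id in sorted(docs):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\0")
        digest.update(content_hash(docs[doc_id]).encode("utf-8"))
    return digest.hexdigest()[:12]
```

Storing the build ID next to the index, and logging it with each retrieval, lets you tell whether a bad answer came from stale data or from the model.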
Theme 07 — Rollback & release (4 checks)
The cheapest theme, the highest leverage, the one teams skip first. If you can revert a bad prompt in under two minutes you have already won every argument about deployment risk.
Feature flags in front of every model and prompt. Rollback is a flag flip, not a deploy.
Canary or shadow deploy for prompt changes. A new prompt serves 5% of traffic, or runs in shadow alongside the current one, before it goes to 100%, with eval comparison live.
Two-minute rollback verified. Quarterly drill, timed. If the team has not done it, they cannot do it.
Release notes in the prompt store. Every prompt version has a who, when, why, and an eval-score-delta line. Not optional.
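A flag in front of a prompt can be as small as a version selector with a percentage dial. The sketch below assumes the flag object lives in whatever flag service you already run. `PromptFlag`, `choose_version`, and `rollback` are hypothetical names, and the hash-based bucketing is one common approach, not the only one.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptFlag:
    stable_version: str
    candidate_version: Optional[str] = None
    candidate_percent: int = 0  # share of traffic on the candidate, 0 to 100


def bucket(request_key):
    """Deterministic bucket in 0..99, so one conversation sticks to one version."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(flag, request_key):
    if flag.candidate_version and bucket(request_key) < flag.candidate_percent:
        return flag.candidate_version
    return flag.stable_version


def rollback(flag):
    """The flag flip: all traffic returns to the stable prompt, no deploy."""
    flag.candidate_percent = 0
```

Setting `candidate_percent` to 5 gives the partial rollout and `rollback` is the two-minute path. Hashing the conversation ID instead of drawing a random number keeps a user on the same prompt across turns.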
How to use it
Print the list. Sit with the engineer who would be on call at 3am. Walk it together. Mark each item green / amber / red, no abstaining. Do not argue about the wording — if there is doubt, the item is amber. The whole pass takes about two hours on a system you know well, four on one you do not.
Then rank the reds by likelihood-of-incident times blast-radius. The top three are this quarter’s remediation work. Everything else goes on the backlog with a date. The goal is not thirty-seven greens. The goal is no surprises.
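The ranking is simple arithmetic. A small sketch, assuming both factors are scored on a 1 to 5 scale. That scale, the `Gap` type, and `top_gaps` are illustrative assumptions; the checklist itself fixes only the product of likelihood and blast radius.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    check: str
    likelihood: int    # assumed scale: 1 (rare) to 5 (expected this quarter)
    blast_radius: int  # assumed scale: 1 (one user) to 5 (every tenant)

    @property
    def risk(self):
        return self.likelihood * self.blast_radius


def top_gaps(gaps, n=3):
    """Highest risk first; ties keep the order they were listed in."""
    return sorted(gaps, key=lambda g: g.risk, reverse=True)[:n]
```

Whatever falls outside the top `n` goes on the backlog with a date.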
Want this scored against your stack?
The two-week production-readiness audit is exactly this checklist run against your code, your prompts, your provider config, and your runbooks — with a written scorecard, ranked gaps, and a 90-day roadmap. Fixed price, paid up front, results delivered whether you hire us for the remediation or not.