What the 5% do differently — patterns from production AI in regulated industries

BCG's October 2025 Widening AI Value Gap, surveying 1,250 senior executives, found that 5% of companies are capturing AI value at scale and 60% are capturing nothing material. MIT NANDA's August 2025 study put the same gap differently: 95% of enterprise GenAI pilots deliver no measurable P&L impact. After ten years shipping production AI in Defence, energy, FinTech, healthcare and UK construction, I can tell you what the 5% share on day one. None of it is exotic, and none of it is the model.

[Figure: The 5%, what they have on day one. Five panels: (1) Bottleneck, named in 15 words before the model. (2) Eval harness in CI, written before the prompt. (3) Boring middle: retry, fallback, audit, cost guards. (4) Topology vs regulator: EU AI Act, SS1/23, JSP 936, DTAC, BSA. (5) Right number: time per case, errors per 1k, audit %. Five patterns. Visible on day one. None of them are the model.]
What separates the 5% from the 95%, by week one of an engagement.

The data is consistent. The split is the question.

Four independent sources point the same way, three of them from 2025. MIT NANDA in August 2025: 95% of enterprise GenAI pilots produce no measurable P&L impact. RAND in 2024 (still the most cited primary source): 80%+ of AI projects fail outright, twice the rate of non-AI IT projects. BCG in October 2025: 5% of companies capture value at scale, 60% capture nothing. S&P Global on top of that: organisations scrapped 46% of their AI POCs in 2025, up from 17% the year before.

The headline rate matters less than the split. There is a 5%, and there is a 60%, and the difference between them is not budget — Gartner's 2026 CIO Agenda has 87% of CIOs raising AI spend year-on-year. It is not access to models either. The split is operational. What follows is what I see, on every audit, on the 5% side of the line.

Pattern 1 — The bottleneck is named in fifteen words, before the model

The 5% projects start with a sentence that any board member could repeat. "Reduce surveyor time on Gateway 2 evidence prep from 11 days to under 4 days, with a regulator-ready audit trail." "Cut analyst time-to-insight on classified document corpora by 60%, with full evidentiary provenance." "Hold faithfulness above 0.85 on customer-policy queries, while removing 70% of tier-1 reviews."

The 95% projects start with the verb explore. Explore AI for compliance. Explore agents for ops. Explore GenAI for the contact centre. RAND's interviews with sixty-five AI practitioners pinned the most-cited cause of failure as miscommunication or misunderstanding of the problem. That is a polite way of saying the project began without a sentence anyone could fail.

Before I let any team I work with write a line of model code, we write that sentence. If the team cannot get it under fifteen words by the end of week one, that is the diagnosis, and the model work stays paused until they can.

Pattern 2 — The eval is written before the prompt

This is the cheapest discriminator on a production audit. I open the repo. I look for an evals/ directory or tests/eval_*.py files, an evaluation set in JSON or YAML, a CI workflow that runs it, and a threshold that blocks merges. On a 5% project I find all of it, and the eval usually predates the first model selection. On a 95% project I find a Postman collection and a screenshot of a good answer.

This matters operationally, not just in theory. Without an eval, you cannot upgrade a model without risking a regression nobody sees. You cannot change a prompt without the same risk. You cannot swap providers when one of them has a bad week. You cannot deploy on a Friday afternoon. The eval is the thing that lets the team move fast in production. Without it, every change is hand-tested by the founder at midnight, and that does not scale past the first hire.

Concretely, the minimum viable evaluation stack on the projects I audit clean is five things: faithfulness, answer-relevancy, context-precision, a domain rubric scored by a judge model, and a CI gate that fails the pipeline below threshold. None of it is exotic. All of it is missing on most pilots.
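The CI gate is the smallest of the five pieces, so here is a minimal sketch of one in Python. It assumes the four scores have already been computed per case by whatever evaluation tooling the team uses. The function names and every threshold are illustrative assumptions, apart from the 0.85 faithfulness floor, which echoes the example sentence in Pattern 1.

```python
# Hypothetical minimum scores; tune per project. Only the 0.85
# faithfulness floor comes from the article's own example.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "domain_rubric": 0.80,
}


def mean_scores(rows):
    """Average each gated metric over all evaluated cases."""
    return {m: sum(r[m] for r in rows) / len(rows) for m in THRESHOLDS}


def failures(means, thresholds=THRESHOLDS):
    """Return the metrics whose mean falls below its threshold."""
    return sorted(m for m, floor in thresholds.items() if means[m] < floor)


def run_gate(rows):
    """Print a short report and return a process exit code (0 = pass)."""
    means = mean_scores(rows)
    failed = failures(means)
    for metric, floor in THRESHOLDS.items():
        status = "FAIL" if metric in failed else "ok"
        print(f"{metric:18s} {means[metric]:.3f} (min {floor:.2f}) {status}")
    return 1 if failed else 0


# Stand-in per-case scores; a real run would load these from the eval output.
sample_rows = [
    {"faithfulness": 0.91, "answer_relevancy": 0.88,
     "context_precision": 0.79, "domain_rubric": 0.84},
    {"faithfulness": 0.87, "answer_relevancy": 0.82,
     "context_precision": 0.77, "domain_rubric": 0.81},
]
exit_code = run_gate(sample_rows)
```

In a pipeline, the last step would pass exit_code to sys.exit so that a failed metric blocks the merge.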

Pattern 3 — The boring middle is architected, not bolted on

Production AI is a sandwich. Judgement at the start, judgement at the end, and a thick layer of unglamorous infrastructure in between. The 5% projects spend two of their first six weeks on that middle. The 95% projects skip it and intend to come back to it later.

The minimum middle layer, in my experience, is seven items. Retry logic with exponential backoff and jitter. Idempotency keys so a transient failure does not double-charge. Cross-provider fallback so an OpenAI outage does not become your outage. A circuit breaker so a degraded provider does not melt your queue. Per-request and per-day cost guardrails so a runaway loop does not produce a five-figure invoice overnight. Structured output validation so a hallucinated JSON does not corrupt downstream state. Correlation-IDed structured logging so you can reconstruct any decision a regulator asks about, six months later.
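The first and third items on that list, retry with backoff and jitter plus cross-provider fallback, can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not a production client: TransientError, the provider callables and all default values are hypothetical, and a real system would map its SDK's timeout and rate-limit errors onto the retryable case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 5xx)."""


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts=4, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller decide what is next
            sleep(backoff_delay(attempt))


def call_with_fallback(providers, max_attempts=4, sleep=time.sleep):
    """Try each (name, callable) in order; move on once retries are exhausted."""
    last_error = None
    for name, fn in providers:
        try:
            return name, call_with_retry(fn, max_attempts, sleep)
        except TransientError as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Passing sleep as a parameter keeps the retry logic testable without real waiting. The jitter matters because many clients retrying on the same schedule will hit a recovering provider at the same instant.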

None of this is hard engineering. A production-minded team can build all seven in four to eight weeks. Almost no demo team builds any of it — and that is the difference between a system that earns revenue and a system that dies the week of the investor demo.
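To show how small some of these pieces are, here is a hedged Python sketch of two of the seven: a cost guardrail and structured output validation. The class name, the required fields and the limits are assumptions made for illustration. The sketch leaves out the daily reset and any persistence, both of which a real deployment would need.

```python
import json


class BudgetExceeded(Exception):
    """Raised when a call would breach a spending cap."""


class CostGuard:
    """Refuses calls that would breach a per-request or per-day cap."""

    def __init__(self, per_request_limit, per_day_limit):
        self.per_request_limit = per_request_limit
        self.per_day_limit = per_day_limit
        self.spent_today = 0.0  # day rollover is omitted in this sketch

    def charge(self, estimated_cost):
        if estimated_cost > self.per_request_limit:
            raise BudgetExceeded("request exceeds per-request cap")
        if self.spent_today + estimated_cost > self.per_day_limit:
            raise BudgetExceeded("request would exceed daily cap")
        self.spent_today += estimated_cost


# Hypothetical output contract for a model response.
REQUIRED_FIELDS = {"decision": str, "confidence": (int, float), "sources": list}


def parse_model_output(raw):
    """Parse raw model text as JSON and check required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    return data
```

The guard is called before the model request, so a runaway loop fails fast instead of accumulating spend. The parser is called after it, so malformed output is rejected before it reaches downstream state.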

Pattern 4 — The deployment topology is picked against the regulator, not the demo

This is the pattern that distinguishes regulated-industry success most sharply. The 5% projects pick the deployment topology before they pick the framework. The topology is determined by the regulatory perimeter — not the engineering preference.

A classified Defence workload under JSP 936's Dependable AI directive is not going to a public hyperscaler. The topology is air-gapped, on-premise inference, deterministic retrieval guards, and an evidence-grade audit trail. The toolchain follows from there, not the other way round.

An FCA-regulated workflow under SS1/23 model risk management is not running on whichever model is fashionable that month. It needs auditable model cards, documented validation, defined accountability, and Consumer Duty-aligned outcome monitoring. Those are architectural constraints, not paperwork.

An NHS deployment needs DTAC, DCB0129 and DCB0160 from day one. A Tier-2 UK construction contractor preparing Gateway 2 submissions needs a Golden Thread schema under the Building Safety Act 2022 — statutory, with personal liability up to two years' imprisonment for mismanaged transfer. An EU enterprise deployment under the AI Act faces fines up to €35M or 7% of global turnover for prohibited practices.

The 5% treat all of this as architecture. The 95% treat it as a slide deck. Deloitte's Q4 2024 survey found 69% of leaders expect full AI governance to take more than a year to stand up. That is exactly how long it takes — after launch — if you did not architect it in.

Pattern 5 — The success number is operational, not technical

The 5% projects measure time per case, error rate per thousand, regulator-ready audit completeness, and cost per successful task. They do not headline tokens, GPU hours, model accuracy on a curated dev set, or the 14-point Likert-scale “helpfulness” score from the demo audience.
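The four operational numbers are simple arithmetic over a case log. The Python sketch below assumes a hypothetical CaseRecord with one row per handled case. The field names are illustrative, and deciding what counts as success or error is left to the team.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    minutes: float        # handling time for the case
    cost_gbp: float       # spend attributed to the case
    succeeded: bool       # task completed without a human redo
    errored: bool         # an error was found in review
    audit_complete: bool  # every required audit field was captured


def operational_summary(cases):
    """Compute the four headline numbers from a non-empty list of cases."""
    n = len(cases)
    successes = sum(c.succeeded for c in cases)
    total_cost = sum(c.cost_gbp for c in cases)
    return {
        "minutes_per_case": sum(c.minutes for c in cases) / n,
        "errors_per_1000": 1000 * sum(c.errored for c in cases) / n,
        "audit_completeness_pct": 100 * sum(c.audit_complete for c in cases) / n,
        # Failed cases still cost money, so divide total spend by successes only.
        "cost_per_successful_task_gbp": (
            total_cost / successes if successes else float("inf")
        ),
    }
```

Dividing total spend by successful tasks, not by all tasks, is the design choice that makes this number honest: a system that fails half its cases doubles its real unit cost.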

I am not arguing the technical metrics are useless. I am arguing they are diagnostic, not directional. The board does not pay you for a faithfulness score. It pays you for the eight minutes of senior surveyor time you give back per Gateway 2 submission, or the 70% reduction in tier-1 review queue depth, or the £280K of Innovate UK funding that an audit-clean platform unlocks at investment committee.

If the success number cannot be written as a unit your CFO understands — minutes, pounds, error rate, throughput — the project is not yet ready to consume budget. It is ready for an audit.

The one-page audit test

I run a free version of this in any first call. I ask the team to send me, before we speak, a single page with five lines: the bottleneck in fifteen words, a link to the eval suite (or "we don't have one"), the deployment topology with the regulatory regime named, the seven items of middle-layer infrastructure with a yes/no next to each, and the operational success number with its current and target value. About one team in twenty sends that page back inside three days. That team is in the 5%. The rest are in the 95%, and the conversation that follows is which of the patterns we work on first.
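For teams that prefer to keep that page in the repo, the five lines map onto a small data structure. The Python sketch below is one possible shape, not a form I actually send out: the field names, the item keys and the gaps helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The seven middle-layer items from Pattern 3, as hypothetical keys.
MIDDLE_LAYER_ITEMS = (
    "retry_with_backoff", "idempotency_keys", "cross_provider_fallback",
    "circuit_breaker", "cost_guardrails", "output_validation",
    "correlation_logging",
)


@dataclass
class IntakePage:
    bottleneck: str
    eval_suite_url: Optional[str]   # None means "we don't have one"
    topology: str
    regulatory_regime: str
    middle_layer: Dict[str, bool]   # yes/no against each of the seven items
    success_metric: str
    current_value: float
    target_value: float


def gaps(page: IntakePage) -> List[str]:
    """List the obvious gaps on the page, in the order the patterns appear."""
    found = []
    if len(page.bottleneck.split()) > 15:
        found.append("bottleneck is longer than fifteen words")
    if not page.eval_suite_url:
        found.append("no eval suite")
    if not page.regulatory_regime.strip():
        found.append("regulatory regime not named")
    for item in MIDDLE_LAYER_ITEMS:
        if not page.middle_layer.get(item, False):
            found.append(f"middle layer missing: {item}")
    return found
```

An empty list from gaps does not prove a project is in the 5%. It only shows the page was filled in completely, which is the first thing the test checks.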

Want to find out which side you are on?

The two-week Production-Readiness Audit measures all five patterns against your specific stack and produces a written scorecard, a ranked risk register, and a 90-day remediation roadmap. Fixed price — £3,500, paid up front, results delivered whether you hire us for the remediation or not.
