Field notes on shipping AI.

Patterns I see over and over — the failure modes, the evals that actually catch regressions, the retry logic that survives 3am. Written for founders who have to ship, and CTOs who have to own it.

Updated monthly · Written by Nitish

Engineering
The 20-step problem — why production AI agents fail, and the architecture that fixes it

A 95% per-step success rate compounds to 36% over 20 steps — and real chains are worse, because the errors are correlated. Gartner puts the 2026 agentic-pilot failure rate at 78%. The compounding-error math behind it, and the five-part architecture that survives it: bounded autonomy, verification gates at the seams, subagent isolation, idempotent replay, and human gates on high-blast-radius actions.
12 min read · 18 June 2026
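
The compounding figure in the teaser above can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability; `chain_success` is a hypothetical helper name, not something taken from the article, and the teaser itself argues that real chains do worse than this model because their errors are correlated.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and share one success rate,
    which is a simplification of real agent chains.
    """
    return per_step ** steps


for steps in (5, 10, 20):
    # Print chain length and end-to-end success rate as a percentage.
    print(steps, f"{chain_success(0.95, steps):.1%}")
# 5 77.4%
# 10 59.9%
# 20 35.8%
```

At 20 steps the result rounds to the 36% quoted above.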
Governance
MCP in production — the security boundary you never provisioned for

97 million monthly downloads, 41% of organisations running MCP servers in production — and security is the number-one named adoption blocker. The Model Context Protocol threat model your REST instincts miss: over-permissioned tools, untrusted servers, prompt injection through tool descriptions, impersonation and the confused deputy. Plus the six-gate MCP production-readiness checklist.
12 min read · 18 June 2026
Strategy
AI is core infrastructure now — JPMorgan just gave you the 2026 budget defence

JPMorgan reclassified its ~$2B/year AI spend out of ‘innovation’ and into ‘core infrastructure’ inside a $19.8B 2026 tech budget. Dimon says it has already paid for itself. 500+ production use cases, 95% AML false-positive cut, 10–11% productivity lift. The procurement-credible reframe every CTO and finance lead will now have to defend their 2026 plan against — with five moves to land the 2027 budget in infrastructure, not innovation.
11 min read · 20 May 2026
Governance
The EU AI Act got delayed — and your transparency deadline got CLOSER, not further

Omnibus VII slipped high-risk AI rules to December 2027 and August 2028. Most teams read the headline and exhaled. Three less-reported clauses move the work in the other direction: the transparency labelling deadline is pulled forward to 2 December 2026, the SME exemption is extended to small mid-caps (which rewires procurement diligence), and a new prohibition on AI-generated non-consensual sexual content and CSAM lands with zero grace period. The three workstreams to accelerate this quarter.
11 min read · 20 May 2026
Strategy
Anthropic just passed OpenAI in business AI — the vendor concentration question your procurement team is about to ask

Ramp AI Index April 2026, 50,000+ companies, real corporate-card spend: Anthropic 34.4%, OpenAI 32.3% — first crossover in the index’s history. Anthropic quadrupled in twelve months, OpenAI essentially held. The engine is a single product: Claude Code. Consumer usage stopped predicting enterprise spend, and the four-year vendor-lock playbook is now untenable. Four moves before the next renewal cycle.
11 min read · 20 May 2026
Engineering
AI-generated code fails the pentest — 92% of vibe-coded apps ship a critical vulnerability

Sherlock Forensics audited 50 production apps built with Cursor, Copilot, ChatGPT and Claude between January and April 2026 — 92% had a critical vulnerability, 78% stored secrets in plaintext, 8.3 exploitable findings on average. The three structural failure patterns, why AI-reviewing-AI does not close the gap, three actions before Monday, and the 90-day governance the cyber insurance carriers are about to start asking about.
11 min read · 19 May 2026
Leadership
The Chief AI Officer trap — 76% have hired one; most just bought a slide deck

IBM’s May 2026 CEO study landed the headline: 76% of large organisations now have a Chief AI Officer, up from 26% in 2025. A near-tripling in twelve months means most hires were not selected — they were grabbed. The four signatures of the mis-hire, the five operating-model moves that make the role actually work, and the eight-question board diagnostic for the next quarterly review.
11 min read · 19 May 2026
Strategy
The £80K question — why construction software is about to get cheap on purpose

A Tier 2 UK main contractor opens the Procore renewal and reads £80,000. About four-fifths of that invoice is not software; it is the services tax that AI-native vertical platforms have stripped out. The BSA Golden Thread teardown — 12 weeks and £15K per HRB project on Procore + Aconex, 1 week and £3.5K on a voice-driven AI-native equivalent. 4x cheaper, 12x faster, the same compliant outcome. Six procurement questions, the demo-to-production catch, and the 2028 prediction every construction CFO and CTO should be planning for.
12 min read · 18 May 2026
Governance
Your LLM is not a security boundary — Microsoft’s Semantic Kernel disclosure is the framework’s SQL-injection moment

Two critical CVEs in Microsoft’s own AI agent framework, disclosed by Microsoft’s own security team. A chat prompt launches arbitrary code on the host. A model-callable helper writes attacker files to Windows Startup, escaping the Azure Container Apps sandbox. Patched in Semantic Kernel Python 1.39.4 and .NET 1.71.0 — but the bug class generalises. The three actions every team running AI agents must take this week.
11 min read · 19 May 2026
Governance
The AI agent kill switch — and the inventory you need before you buy one

A Cursor + Claude agent deleted PocketOS’s production database and every backup in nine seconds, then generated 4,000 fake users to hide it. ServiceNow used the incident to launch the kill switch. The five gates upstream of any control tower — inventory, privilege scoring, named-identity logging, kill-path playbook, and lifecycle.
11 min read · 12 May 2026
Strategy
Why 80% of AI projects never ship — and the 5 failure modes I see on every audit

The stat is real. RAND put the failure rate at 80.3% in 2025 and the MIT GenAI study topped it at 95%. Every one of the failures I have touched matches one of five patterns, and none of them is about model quality.
12 min read · 22 April 2026
Engineering
The five evals that actually matter in production

Most teams ship LLMs blind. The minimum viable evaluation stack is five metrics — three from RAGAS, one custom rubric scored by a judge model, one CI gate. Thresholds and Python included.
11 min read · 22 April 2026
Engineering
What production-grade retry logic looks like for LLM calls

Exponential backoff is table stakes. Real production retry is five layers — error classification, backoff with jitter, idempotency, circuit breakers, and cross-provider fallback. With code.
13 min read · 22 April 2026
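
Two of the five layers named above, error classification and backoff with jitter, fit in a short sketch. It is illustrative only and is not the code from the article: `RetryableError`, `backoff_delay` and `call_with_retry` are hypothetical names, the delay values are assumptions, and idempotency, circuit breakers and cross-provider fallback are left out.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures such as timeouts or rate limits."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts: int = 5, sleep=time.sleep):
    """Call fn, retrying only on RetryableError.

    Any other exception is treated as non-retryable and propagates
    immediately. The last RetryableError is re-raised once attempts run out.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the delays instead of waiting for them.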
Engineering
Seven ways an LLM bill runs sideways (and three controls that stop it)

The CFO Slack message arrives on day 11. A system budgeted at £3k a month is running at £50k. Seven cost traps I see on every audit, plus the three controls to cap them before the next bill arrives.
10 min read · 23 April 2026
Governance
What ISO 27001 and SOC 2 actually require when you add AI to your product

A 47-question enterprise AI questionnaire, translated into the four themes that cover 85% of it. A 12-week path to answering yes without breaking the product team.
11 min read · 23 April 2026
Engineering
The 37-point production-readiness checklist for AI systems

Seven themes, 37 specific checks. The list I run on every two-week audit — evals, resilience, cost, observability, security, data, rollback. Most demos clear ten. Production systems clear at least thirty.
14 min read · 27 April 2026
Engineering
Shipping RAG to production — where retrieval pipelines actually fail

Six predictable break points: chunking, embedding drift, retrieval relevance, context overflow, freshness, and citation faithfulness. The minimum viable production RAG, with the controls that fix each.
13 min read · 27 April 2026
Leadership
How AI agents can reduce operational costs without hiring more staff

For CEOs and COOs absorbing more workload without growing headcount. Five categories of agent-able work, a 90-day pilot roadmap, and the risks to design out before scaling. Cost out comes from reallocating capacity, not cutting headcount.
13 min read · 28 April 2026
Leadership
The CEO’s guide to AI automation in 2026

The four layers of the modern AI stack, where ROI actually lives, the governance that protects it, and a 12-month roadmap from first pilot to operating-rhythm capability. Adoption is no longer the differentiator — competence is.
14 min read · 28 April 2026
Strategy
What the 5% do differently — patterns from production AI in regulated industries

BCG: 5% of companies capture AI value at scale; 60% capture nothing. After 10 years shipping AI in defence, energy, FinTech and UK construction, here are the five patterns the survivors share on day one. None of them is the model.
11 min read · 28 April 2026
Leadership
The invisible operations tax — what manual work actually costs before you automate anything

McKinsey: knowledge workers lose 20% of the week to information search. Unit4: UK finance teams lose 50+ hours weekly to manual work. None of it is on the P&L, which is exactly why it is the most expensive line item your business has. How to measure it before any AI spend.
12 min read · 28 April 2026
Leadership
AI is not the strategy. The bottleneck you remove with it is.

PwC 2026: 56% of CEOs see no financial impact from AI. Deloitte: typical AI payback now 2–4 years. Most boards in 2026 are asking the wrong question. The right one — and the three I would put on your next board agenda instead.
11 min read · 28 April 2026
Strategy
Vertical AI just buried horizontal Copilot in regulated industries

Rogo’s $160M Series D and Microsoft 365 Copilot’s 20M seats landed the same week. They are not telling the same story. The four-question procurement filter that separates a vertical agent from a productivity tool with a vertical decal on it.
11 min read · 28 April 2026
Governance
Your AI agents are ghost users in your IAM stack

10–20 agents per prod AI org. One shared service account. Zero in the quarterly access review. Microsoft just shipped Agent 365 Runtime Protection because the gap is now product-shaped. Five hygiene gates upstream of any runtime tool.
10 min read · 28 April 2026
Engineering
The fallback ladder — surviving a foundation-model outage

Claude went down for 78 minutes on 28 April 2026. Second outage in eight days. The fix is not switching vendors. It is the same five-layer playbook every production system has followed for 30 years — cache, queue, graceful degradation, multi-vendor failover, circuit breaker. Built in order. Each survives a different failure mode.
11 min read · 28 April 2026
Governance
Comment and Control — how a PR comment exfiltrates your secrets through three AI agents

Aonan Guan’s late-April disclosure broke three production AI coding agents with one pattern: Claude Code Security Review (CVSS 9.4), Gemini CLI Action, Copilot SWE Agent. The attack is a pull-request comment. The four-step mechanic, the pull_request_target trap, and the five-step audit checklist if you ship AI agents in CI.
11 min read · 28 April 2026

Want the full production-readiness checklist?

The 37 things I check on every audit — turned into a PDF you can hand to your team tomorrow. Send me an email and I will send it back.

Email for the checklist →
