
AI-generated code fails the pentest — 92% of vibe-coded apps ship a critical vulnerability

Sherlock Forensics — a UK cybersecurity firm running pentests since 2006 — published its 2026 AI Code Security Report in early May, drawn from assessments conducted January to April 2026. They audited 50 production applications built with the four dominant AI coding assistants — Cursor, GitHub Copilot, ChatGPT, and Claude — using a methodology mapped to OWASP Top 10 and MITRE ATT&CK. The headline: 92% of AI-generated codebases contain at least one critical vulnerability. The average vibe-coded application has 8.3 exploitable findings, and 78% store secrets in plaintext. Two corroborating 2026 reports landed in the same window — ProjectDiscovery’s AI Coding Impact Report and Black Duck’s OSSRA — flagging the same pattern with different methodology. The 5× productivity story your investors were sold is also a 5× critical-vulnerability story, and the merge gate is the cheapest fix on the market. Three actions every CTO running AI-assisted coding can run before Monday.

Infographic (Sherlock Forensics 2026, 50 production apps audited): 92% of AI-generated codebases have at least one critical vulnerability; 78% store secrets in plaintext (API keys, DB strings, tokens); 8.3 exploitable findings per application on average. Three patterns: auth-by-default-broken, secrets-by-default-in-source, happy-path-only validation. An AI reviewer and an AI author share the same blind spot, so the human senior reviewer is the gate. The PR template is the cheapest control; the SAST gate is the next-cheapest. Both are missing from most teams.
Sherlock Forensics 2026 AI Code Security Report · 50 production applications audited Jan–Apr 2026 · OWASP Top 10 + MITRE ATT&CK methodology, manual testing alongside automated scanning. The 5× productivity story is also a 5× critical-vulnerability story.

Executive summary

Sherlock Forensics, a UK-based cybersecurity firm with twenty years of penetration-testing engagements behind it, published its 2026 AI Code Security Report drawing on assessments conducted between January and April 2026. The study covers 50 production applications shipped with the four dominant AI coding assistants on the market today: Cursor, GitHub Copilot, ChatGPT, and Claude. The methodology is mapped to OWASP Top 10 and MITRE ATT&CK, with manual testing on every engagement and automated scanning alongside: the same shape of audit a regulated-industry CISO would commission. The single number from the report: 92% of audited codebases contain at least one critical vulnerability. The supporting numbers fill in the operational picture: the average application has 8.3 exploitable findings, and 78% store secrets in plaintext (API keys, database strings, third-party tokens, the lot, sitting in source). Two parallel 2026 reports corroborate the pattern with different methodology and different samples: ProjectDiscovery’s 2026 AI Coding Impact Report flags that AI-generated code is outpacing security teams’ ability to keep up; Black Duck’s 2026 OSSRA report finds open-source vulnerabilities doubling in lockstep with AI-assisted commit volume. The honest reading is that the 5× productivity narrative around AI-assisted coding is structurally accurate, and it is generating critical-vulnerability debt at roughly the same multiple, on the same Monday morning, in the same repositories. This is not an argument against AI-assisted coding; that ship has sailed. It is an argument for putting the merge gate back where the senior reviewer used to stand, with a written checklist, a SAST tool that blocks on critical findings, and a PR template that spells out the three failure patterns by name.
The sections that follow cover the three structural failure patterns, why the AI-reviewing-AI fallback does not close them, the three actions every CTO can run before Monday, and what a credible 90-day remediation looks like.

Why Sherlock’s 92% lands harder than the average security report

Security reports cross my desk every week; most have a quotable number and a methodology that softens it. Three things make Sherlock’s read differently — and make the 92% number worth quoting in a board memo rather than a tweet.

First, the sample is production, not lab. Sherlock did not generate code with a fresh prompt against a benchmark suite. They audited 50 real applications that had already shipped — web apps, APIs, SaaS platforms, internal tools — the exact surfaces founders and CTOs are putting on critical paths because AI coding tools are pitched as 5–10× productivity multipliers. The 92% is what the apps look like after the team called them done.

Second, the methodology is the one your regulator would recognise. OWASP Top 10 plus MITRE ATT&CK, manual testing alongside automated scanning. Not a one-pass SAST run that can be argued down to false positives. Not a benchmark that maps poorly onto real exploit chains. The findings are the same shape of finding a Big Four pentest would produce, which means the 92% number survives the procurement-team scrutiny that a vendor white paper would not.

Third, the corroboration is independent and concurrent. ProjectDiscovery’s 2026 AI Coding Impact Report and Black Duck’s 2026 OSSRA landed in the same window with the same direction of travel, drawn from completely different methodologies. Cyber insurance carriers, in private conversations across the same months, have begun asking about AI-coding policy at 2026 renewal cycles — the underwriting signal that follows when a category becomes an actuarially loss-making line. When the report, the parallel reports, and the insurance market all move the same way in the same quarter, the conservative read is that the number is the floor, not the ceiling.

The three structural patterns Sherlock found in the audits

Sherlock’s report breaks the findings into a long taxonomy. Three patterns account for the bulk of the critical-severity findings, and all three are recognisable on sight to any senior engineer who has reviewed AI-assisted PRs over the last twelve months.

Pattern 1 — auth-by-default-broken. The model writes code that compiles and runs. It also writes code that does not, by default, enforce the access-control constraints the application actually requires. JWTs validated for signature but not for the right scope or audience. Role checks at the route level but not at the data-access level (the classic Insecure Direct Object Reference, IDOR-by-default). Permission middleware applied to four of the five routes that need it because the prompt did not mention the fifth. The senior human reviewer catches these in five minutes; the AI-generated code passes its own unit tests because the unit tests are also AI-generated and share the same blind spot.
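To make the route-level versus data-access-level distinction concrete, here is a minimal Python sketch. The names (`Invoice`, `InvoiceStore`, `get_for_user`) are hypothetical and are not taken from the Sherlock report; the in-memory store stands in for a real database. The only point is where the ownership check lives: inside the data-access function, so a route that forgot its middleware still cannot return another user's record.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a caller asks for a record they do not own."""


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_pence: int


class InvoiceStore:
    """In-memory stand-in for a database table (illustrative only)."""

    def __init__(self, invoices):
        self._by_id = {inv.invoice_id: inv for inv in invoices}

    def get_unchecked(self, invoice_id):
        # IDOR-prone: any authenticated caller can read any invoice by ID.
        return self._by_id[invoice_id]

    def get_for_user(self, invoice_id, user_id):
        # Ownership is enforced at the data-access level, not just the route.
        invoice = self._by_id.get(invoice_id)
        if invoice is None or invoice.owner_id != user_id:
            # Same error for "missing" and "not yours" avoids leaking existence.
            raise Forbidden(f"invoice {invoice_id} not available")
        return invoice
```

In review, the question to ask of an AI-assisted PR is whether any handler still reaches something shaped like `get_unchecked`.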

Pattern 2 — secrets-by-default-in-source. 78% is not a marginal number. It is the dominant pattern. The model produces a working example with a real-looking API key, a database URL, an OAuth client secret — the prompt asked for “a working example,” the model delivered, and the secret stayed in source through the rebase, the merge, and the deploy. Most teams discover the pattern at audit, not at merge. The cost of discovering it at audit is the time-to-rotate every secret the codebase has ever touched, the breach-narrative obligation if any of those secrets reached an external service, and the embarrassment of explaining to an auditor that the production database password lived in a public commit for nine months.

Pattern 3 — happy-path-only validation. The model validates the input it was shown in the prompt. It does not validate the inputs the prompt did not anticipate. Length limits on the username but not on the email field. Regex sanitisation on the search box but not on the comment thread. Type coercion that handles strings and numbers but coerces a JSON object into something that breaks the downstream query in a way the attacker controls. The result is a code path that looks robust to the developer who wrote the prompt and is trivially exploitable to anyone who reads the code and sends a payload the prompt did not name. SAST tools catch some of this. Manual review catches the rest. AI-generated code that reviews its own validation does not catch either.
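A sketch of what validating the inputs the prompt did not mention can look like, in Python. The field names and length limits are assumptions chosen for illustration, not values from the report. The validator rejects non-string values instead of coercing them, applies a limit to every field rather than only the one in the prompt, and refuses fields nobody declared.

```python
MAX_USERNAME = 32
MAX_EMAIL = 254
MAX_COMMENT = 2000


class ValidationError(ValueError):
    pass


def require_text(payload, field, max_len):
    """Return payload[field] only if it is a non-empty string within max_len."""
    value = payload.get(field)
    # Reject objects, lists, numbers and None rather than coercing them.
    if not isinstance(value, str):
        raise ValidationError(f"{field}: expected a string")
    if not 0 < len(value) <= max_len:
        raise ValidationError(f"{field}: length must be 1..{max_len}")
    return value


def validate_signup(payload):
    """Validate every declared field and refuse undeclared ones."""
    if not isinstance(payload, dict):
        raise ValidationError("payload: expected an object")
    limits = {"username": MAX_USERNAME, "email": MAX_EMAIL, "comment": MAX_COMMENT}
    unexpected = set(payload) - set(limits)
    if unexpected:
        raise ValidationError(f"unexpected fields: {sorted(unexpected)}")
    return {name: require_text(payload, name, limit) for name, limit in limits.items()}
```

A payload such as `{"email": {"$ne": ""}}` is refused at the type check, which is the JSON-object coercion case described above.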

Why AI reviewing AI does not close the gap

The instinct, reasonable and currently being tested across the industry, is to fix the AI-author problem with an AI reviewer. Same loop, different prompt. The thinking is that the model is good enough to catch its own mistakes if asked the right question. The data so far says the gap narrows, but by less than the productivity number would suggest.

The structural reason is training-distribution overlap. The AI reviewer and the AI author come from a substantially similar training corpus. Both have read the same OWASP cheat sheets, the same Stack Overflow answers, the same open-source codebases that omit the same edge cases. When the author misses a check because the training distribution did not emphasise it, the reviewer misses the same check for the same reason. The two systems are not statistically independent; they are correlated failures dressed up as a two-person review.
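The independence point can be put into arithmetic. The sketch below uses the standard identity for the joint probability of two correlated yes/no events; the miss rates (20%) and the correlation (0.8) are made-up illustrative inputs, not measurements from Sherlock or anyone else.

```python
from math import sqrt


def joint_miss_probability(p_author, p_reviewer, correlation):
    """P(author misses AND reviewer misses) for two correlated yes/no events.

    For two Bernoulli variables with miss rates p and q and correlation rho:
    P(both miss) = p*q + rho * sqrt(p(1-p) * q(1-q)).
    """
    spread = sqrt(p_author * (1 - p_author) * p_reviewer * (1 - p_reviewer))
    return p_author * p_reviewer + correlation * spread


# Illustrative inputs only: each party misses a given flaw 20% of the time.
independent = joint_miss_probability(0.2, 0.2, 0.0)  # reviewer with unrelated blind spots
correlated = joint_miss_probability(0.2, 0.2, 0.8)   # reviewer sharing the author's blind spots
```

With these inputs the independent reviewer lets 4% of flaws through both stages, while the correlated reviewer lets through about 17%. The individual miss rates are identical in both cases; only the overlap changed.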

The empirical reading from Sherlock’s engagements aligns with the structural prediction. Where AI-reviewer tooling was deployed alongside AI-author tooling, the critical-finding rate dropped, but not to a level a regulator or an insurance underwriter would call adequate. The reliable closing of the gap requires a reviewer that is not training-distribution-correlated with the author — a senior human, a SAST tool with deterministic rules, or both. The two-runner pattern for AI agents applies one layer higher to AI-assisted coding: the reviewer of untrusted output should not share the failure modes of the producer. The author is the AI; the reviewer must not be.

The three actions before Monday

The Sherlock report is dated this quarter; the corroborating reports are public; the cost of opportunistic exploitation is paid by whoever moves slowest. The following three actions each take a half-day to a sprint for a team that already has the basics in place; each is several sprints for a team that does not. Run them in order.

Action 1 — Pull a recent AI-assisted PR and dual-review it against the three patterns. Pick one PR from the last fortnight that a teammate shipped with AI assistance. Print it. Sit with a senior engineer for forty-five minutes and walk it against the three patterns above: auth-by-default-broken, secrets-by-default-in-source, happy-path-only validation. Most teams find at least one critical-severity issue in the first PR they audit this way. Write up the findings — in a shared document, with file paths and line numbers — and circulate to engineering. The first audit is the most valuable; it converts the abstract Sherlock number into a concrete artefact your own team produced, which is the only artefact that changes behaviour.

Action 2 — Grep every repository for hardcoded secrets, then rotate. The 78% number is not a forecast; it is a description of the current state. Run a secrets-scanner across every repository the team has shipped to in the last twelve months — trufflehog, gitleaks, or any equivalent that scans history, not just HEAD. For every secret found, rotate it. The rotation is the expensive half; the grep is the cheap half. The teams that skip the rotation are the teams that explain the breach narrative six months later. The teams that rotate find, in the process, half a dozen credentials that have been valid in production since 2024 because nobody remembered they existed.
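For a sense of what a secrets scanner looks for, here is a deliberately small Python sketch. It is not a substitute for trufflehog or gitleaks: it checks a single block of text against three illustrative rules and does not walk git history, which is the part that matters for the sweep described above. The rule names and the `find_secrets` helper are hypothetical.

```python
import re

# Small, illustrative rule set; real scanners ship far more rules
# and also verify candidates to reduce false positives.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?i)(?:api[_-]?key|secret|token|password)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text):
    """Return (line_number, rule_name) pairs for every suspicious line."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, rule_name))
    return hits
```

Every hit from a real scanner goes onto the rotation list, including the ones that look like leftovers from an example file.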

Action 3 — Add three lines to the PR template and a SAST gate to the merge path. The PR template gets one new section, written in the team’s voice, three bullets long: did you check auth at the data-access level, not just the route level? did the model write any string that looks like a secret? did you validate the inputs the prompt did not explicitly mention? The SAST gate is the deterministic backstop. Configure the tool to block merge on a critical-severity finding; let it run on every PR; budget the false-positive triage time as part of the engineering cost. The PR template costs nothing and catches the obvious half of the findings. The SAST gate costs an afternoon to configure and a few hours a week to triage, and catches the bulk of the rest. Neither replaces the senior human reviewer; both reduce the load on them.
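The block-on-critical rule is simple enough to sketch in Python. The report shape below (a JSON list of objects with `id`, `severity` and `path` keys) is a hypothetical format invented for this example; a real SAST tool has its own output schema and usually its own fail-on-severity setting, so treat this as the logic of the gate rather than an integration.

```python
import json

BLOCKING_SEVERITIES = {"critical"}


def blocking_findings(report_json):
    """Return the findings that should block the merge.

    Assumes a hypothetical report shape: a JSON list of objects with
    "id", "severity" and "path" keys. Adapt the parsing to your tool.
    """
    findings = json.loads(report_json)
    return [f for f in findings if f.get("severity", "").lower() in BLOCKING_SEVERITIES]


def gate_exit_code(report_json):
    """0 lets the merge proceed; 1 fails the CI job."""
    blockers = blocking_findings(report_json)
    for finding in blockers:
        print(f"BLOCKED: {finding['id']} in {finding['path']}")
    return 1 if blockers else 0
```

In a CI job the return value would be passed to `sys.exit`, so that a critical finding fails the pipeline and the merge button stays disabled.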

The 90-day remediation that survives the next disclosure

The three before-Monday actions buy you the floor. A credible 90-day remediation buys you the operating model that survives the next AI-coding security report (and the report after, and the report after that). Five elements, none individually hard, mostly missing in the configurations I review.

One — a written AI-coding policy. One page. Which AI assistants are sanctioned for which classes of work. Which code paths must not be AI-authored without an explicit pairing review (auth, payments, anything that touches a regulated data class). Which artefacts (secrets, IAM policies, schema migrations) are out-of-scope for AI generation full stop. The policy does not need to be long; it needs to be written down and signed off by the CTO, so the question “is this allowed” has an answer that does not depend on the senior engineer who happens to be online.

Two — the SAST gate in CI, with an allow-list for false positives that is itself reviewed. The gate is not the policy; the gate is the enforcement layer. Block merge on critical-severity findings. Maintain a written allow-list of acknowledged false positives with the reviewer’s name and a quarterly re-review date. The allow-list is the failure mode in most SAST deployments — it grows until it is the policy. The quarterly re-review prevents that.
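One way to stop the allow-list from quietly becoming the policy is to give each entry an expiry and have the gate ignore entries past it. A minimal Python sketch, with hypothetical names (`AllowListEntry`, `still_blocking`, `overdue_entries`) and an assumed convention that findings carry stable string IDs:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AllowListEntry:
    finding_id: str
    reviewer: str        # who accepted the finding as a false positive
    re_review_by: date   # quarterly re-review deadline


def still_blocking(critical_ids, allow_list, today):
    """Critical finding IDs not covered by a current allow-list entry."""
    current = {e.finding_id for e in allow_list if e.re_review_by >= today}
    return sorted(set(critical_ids) - current)


def overdue_entries(allow_list, today):
    """Entries whose re-review date has passed and need a human decision."""
    return [e for e in allow_list if e.re_review_by < today]
```

Under this design an entry nobody re-reviews stops suppressing its finding, so neglect fails closed rather than open.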

Three — secrets management as a default behaviour, not a convention. The PR template can ask. The SAST gate can scan. The durable fix is that the developer never has the option to put a secret in source — the local dev experience routes through a secrets manager (Vault, AWS Secrets Manager, Doppler, 1Password CLI) and the model is prompted with the secret-manager pattern in the system prompt, not with a raw secret in an example file.
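A sketch of the application side of that pattern in Python, assuming the secrets manager injects values into the process environment at start-up (a common arrangement, though each of the tools named above has its own mechanism). The `get_secret` helper and the `DATABASE_URL` name are hypothetical. The design choice worth copying is the absence of a default value: there is no literal for the model, or the developer, to fill in.

```python
import os


class MissingSecret(RuntimeError):
    pass


def get_secret(name, environ=None):
    """Fetch a secret injected at runtime; never fall back to a literal."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Failing loudly beats silently shipping a placeholder credential.
        raise MissingSecret(f"{name} is not set; inject it via your secrets manager")
    return value
```

Usage is a single call such as `get_secret("DATABASE_URL")` at start-up, so a missing secret stops the service before it serves traffic.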

Four — the senior-reviewer roster, written down. For every code path that the AI-coding policy flags as “requires human review,” name the reviewers. Three is the minimum — one creates a single point of failure, two leaves no cover for holidays. The roster is reviewed quarterly. The roster is what makes the policy operable on a Tuesday morning when someone needs an approval and the engineer who wrote the policy is on annual leave.
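The roster can live next to the policy as data, which makes both rules checkable: which changes need a rostered reviewer, and whether any protected path has fewer than three names. A Python sketch follows; the path patterns and reviewer names are placeholders, not a recommendation for how to lay out a repository.

```python
from fnmatch import fnmatchcase

# Hypothetical roster: glob patterns for sensitive paths mapped to named reviewers.
ROSTER = {
    "src/auth/*": ["alice", "bob", "chidi"],
    "src/payments/*": ["alice", "dana", "eve"],
}
MIN_REVIEWERS = 3


def required_reviewers(changed_paths, roster=ROSTER):
    """Union of rostered reviewers for every changed path the policy flags."""
    needed = set()
    for path in changed_paths:
        for pattern, reviewers in roster.items():
            if fnmatchcase(path, pattern):
                needed.update(reviewers)
    return sorted(needed)


def understaffed_patterns(roster=ROSTER, minimum=MIN_REVIEWERS):
    """Patterns with fewer distinct named reviewers than the minimum."""
    return sorted(p for p, names in roster.items() if len(set(names)) < minimum)
```

An empty result from `required_reviewers` means the change touches nothing the policy flags; a non-empty result from `understaffed_patterns` is an item for the quarterly roster review.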

Five — an annual external pentest scoped to include AI-coding output. The internal review catches the obvious. The external pentest catches the cultural blind spots — the patterns the team has habituated to and no longer sees. Scope the engagement explicitly to include AI-assisted code paths; share the AI-coding policy with the pentester so they can stress-test it. Budget the engagement as a line item, not as a discretionary spend. The cost of the pentest is materially less than the cost of the breach the pentest prevents.

The procurement and insurance side of the same story

Two market signals make this an executive conversation, not just an engineering one. Cyber insurance carriers have begun asking about AI-coding policy in 2026 renewal questionnaires — not yet on every line, but on enough that brokers are flagging it. The underwriting question is structurally identical to the one carriers added for cloud migration in the mid-2010s and for SaaS-everywhere in the late 2010s: do you have a written policy, an enforcement layer, and an external check? Without the three, the premium goes up; with the three, the premium holds. The CFO will ask the CTO about it at the next renewal. The CTOs with the policy already drafted answer it in one paragraph. The CTOs without spend the next month on it under deadline.

The procurement side is the parallel signal. Enterprise buyers in regulated industries — financial services, healthcare, defence — have begun adding AI-coding-policy questions to vendor due diligence in 2026. The pattern is the same: do you have a written policy, can you describe your enforcement layer, can you produce an external attestation? For a founder selling into those buyers, the answer to those questions is becoming a sales-cycle dependency. The Sherlock report is one of several artefacts the buyer’s security team will reference; the seller’s ability to engage with the data, rather than dismiss it, is the credibility test.

Risks and what to avoid

Don’t ban the tools. The productivity benefit is real; banning the tools loses it, accelerates shadow IT, and pushes AI-assisted coding into channels you cannot audit at all. The right move is the merge gate, not the ban. The 5× productivity story is recoverable; the 5× vulnerability story is the bit you instrument out.

Don’t outsource the audit to AI alone. The training-distribution overlap argument above applies in production, not just in research. AI-reviewing-AI is a useful first pass; it is not the gate. The gate is a senior human plus a SAST tool with deterministic rules. Budget for both. The teams that try to close the loop with a fourth AI in the pipeline will discover, at the postmortem, that the fourth AI missed the same thing the first three did.

Don’t skip the rotation. Grepping for secrets without rotating the ones you find is theatre. The rotation is the cost. The teams that have the most painful rotation cycles are the ones that have skipped the most rotations to date. Pay it now; pay materially more later.

Don’t conflate “our developers are senior” with “we don’t have this problem.” Sherlock’s sample includes applications shipped by experienced teams. The pattern is structural to the tooling, not to the seniority of the user. A senior engineer reviewing AI-assisted output well, against a written checklist, is the working configuration. A senior engineer reviewing AI-assisted output without the checklist drifts towards the same blind spots within a quarter.

What good looks like — one quarter from now

The team has a one-page AI-coding policy signed off by the CTO, naming which assistants are sanctioned for which work and which code paths require human review. A SAST tool runs on every PR with a documented block-on-critical configuration; the allow-list is short and re-reviewed quarterly. A secrets manager is in the local developer workflow by default; trufflehog or equivalent runs in CI; every secret found in the historical sweep has been rotated. The PR template has three new bullets covering the three Sherlock patterns; the engineering team can describe them from memory. The senior-reviewer roster is written down, names at least three reviewers per protected code path, and is reviewed quarterly. An external pentest is scheduled inside the next twelve months, explicitly scoped to include AI-assisted code paths. The CTO can answer, in writing, the question “what is our AI-coding governance,” in two paragraphs. Most CTOs cannot today. The ones who can are the ones whose 2027 renewal cycle is a one-paragraph email rather than a six-week scramble, and whose audit log is a feature log, not a CVE log, when the next funding round runs due diligence on the codebase.

Final thought

The 92% number is uncomfortable in the same way the “most database queries are SQL-injectable” statistic was uncomfortable in 2003. It is uncomfortable because it is true, because the productivity story attached to it is also true, and because the fix is unglamorous engineering discipline rather than a new tool to buy. The teams that wire the merge gate back into the workflow this quarter will keep the productivity benefit and shed the critical-vulnerability debt at the same time. The teams that don’t will keep both. The Sherlock report is the receipts the security team has been waiting for; the executive memo is one page; the engineering work is one sprint. The expensive version of this story is the one that ends with an incident postmortem, an insurance premium uplift, and a due-diligence finding that costs the round. The cheap version is the one that ends with three new bullets in a PR template and a green CI gate. Pick the cheap version. Now, while the calendar still has the room for it.

When did you last pentest your AI-assisted codebase?

Indica Tech’s two-week AI-coding governance audit pulls a representative sample of AI-assisted PRs from the last quarter, runs them against the three Sherlock patterns, sweeps the repository history for secrets with a rotation plan attached, drafts the one-page AI-coding policy tailored to your stack and regulator, configures the SAST gate in CI, and gives you a 90-day remediation roadmap including the senior-reviewer roster. Fixed price £3,500. Written report. Whether you hire us for the remediation or not.

