Field notes / engineering

The fallback ladder — surviving a foundation-model outage

Claude went down for 78 minutes on 28 April 2026. Second outage in eight days. If your product runs on a single foundation-model vendor with no fallback layer, you ran on yellow lights for those 78 minutes — and your end users absolutely noticed. The fix is not switching vendors. The fix is the same playbook every other production system has followed for 30 years. Five layers, each surviving a different failure mode. Built in order, cache first, circuit breaker last.

11 min read 28 April 2026

Nitish Founder, Indica Tech Field notes / Engineering

The five-layer fallback ladder. Build in order — cache first, circuit breaker last. Each layer is independently useful and pays for itself in the first relevant incident.

Executive summary

On 28 April 2026, Claude was unavailable for 78 minutes across claude.ai, the API, and Claude Code. Downdetector saw more than 12,000 reports at peak. It was the second outage in eight days. No public postmortem was published. If your product depends on Claude, GPT, Gemini or any other single foundation-model vendor and has nothing in place when the API stops responding, you do not have a reliability problem — you have an architecture problem. The fix is not switching vendors; switching trades one single point of failure for another. The fix is a five-layer fallback ladder — cache, queue, graceful degradation, multi-vendor failover, circuit breaker plus honest UX — built in that order, on the same engineering principles that have run banks, telcos and CDNs for thirty years. Every layer is independently useful, every layer pays for itself the first time the outage of its shape lands, and every layer is well-trodden. None of it is novel. It is just unevenly distributed across production AI estates in 2026.

The April outages, in numbers

Two foundation-model outages landed inside the same calendar week. Claude went down on 20 April, and again, more visibly, on 28 April. The second one was the louder of the two: 78 minutes of degraded availability across three Anthropic surfaces, more than 12,000 Downdetector reports at peak, and zero public postmortem from the vendor. Eight days between events. For a production AI surface in 2026, this is no longer an edge case — it is a category of incident with a cadence.

For most teams downstream, the 78 minutes was not the worst part. The worst part was that nothing in their stack made the outage less visible to end users. The product simply stopped working. Whatever was on the screen at 14:32 stayed on the screen until the API came back. Whatever was queued silently failed. Whatever was promised in the marketing landed in support tickets. The outage was 78 minutes; the cleanup was the rest of the week.

The reflex inside many engineering Slacks was the same: should we switch vendors? It is the wrong question. A switch from Claude to GPT, or GPT to Gemini, gives you a different vendor with the same shape of risk. What you actually need is the architecture pattern every other class of production system has followed for thirty years — layered fallbacks designed for different failure modes, built once and maintained quarterly. The rest of this post is the layered architecture, in the order it actually pays for itself.

Layer 1 — Semantic cache

The cheapest layer in the ladder, and the one most teams skip first. In any production AI product I have audited, somewhere in the range of 30 to 40 percent of inbound prompts are near-duplicates of recent ones — the same support template filled in with different customer data, the same summarisation request hitting the same source paragraph, the same drafting prompt running across the same campaign. A semantic cache catches all of it.

The implementation is short. Embed inbound prompts as they arrive. Look up against the last sixty minutes of inbound. If cosine similarity is above a sensible threshold — 0.92 is a reasonable starting point and the value most teams converge on after a calibration pass — return the cached answer. Otherwise call the model. Cache the result with a TTL appropriate to the workflow: 60 minutes is a strong default, longer for stable workflows like internal documentation, shorter for anything time-sensitive.

What this layer buys you is invisibility for short outages. A 78-minute vendor outage with a properly populated semantic cache becomes 78 minutes of slightly stale answers for a meaningful chunk of your traffic, and 78 minutes of degraded behaviour for the long tail. From the end user’s perspective, the product still works for most of what they ask. From the operator’s perspective, the support inbox does not light up. Cache is not a substitute for the model; it is a substitute for the assumption that the model is always there.

Layer 2 — Queue plus jittered retry

The second layer protects you from yourself during partial recovery. When a foundation-model API starts returning errors, the typical naive client retries immediately. Then again. Then again. Multiply that across every client in your fleet, every cron, every user-triggered job, and the result is a synchronised retry storm that hammers the recovering vendor at the exact moment you least want to. Your service is now amplifying the outage instead of riding it out.

The fix is async work. Move the model call out of the request path and onto a queue. Apply exponential backoff with jitter on retries — the jitter is the bit teams skip and pay for. Use idempotency keys so duplicate retries are safe at the application layer. Set a per-tenant queue depth limit so one customer’s traffic spike cannot starve every other customer’s queue.

This is not glamorous. It is also the layer that buys minutes of headroom during partial recovery, the layer that keeps your tail latency tolerable as the vendor heals, and the layer that prevents your incident from becoming the vendor’s incident. Every queue worker in production AI in 2026 should have jittered retry, idempotency keys and tenant-fair queue depth as defaults. They are well-trodden patterns. They are also missing from most stacks I review.

Layer 3 — Graceful degradation

The hardest layer of the five. The discipline of writing what “good enough without the model” looks like, per use case, in advance — is more valuable than the failover code itself. Most teams never do it. They have failover code that returns a generic error string, which is not graceful and not a degradation; it is just a different failure.

Define the fallback per shape of work, in writing, before the next outage:

Summarisation: a clear “we’re catching up — we will reply within five minutes” with a returnable promise. Not an error.
Search: keyword-only fallback against the same corpus, with a banner explaining the reduced relevance. Slightly worse search beats no search.
Agent: escalate to the human queue with the conversation context attached. Slow human beats stalled bot.
Drafting: return a template with [PLACEHOLDER] sections and a one-line explanation. The user gets a starting point; the team gets time.

Force the product team to write these specs. Test them once a quarter. The act of writing the spec exposes which workflows have a graceful path and which do not — and the ones that do not are usually the ones nobody had thought through, which is why the outage was painful in the first place. Graceful degradation is a product decision, not an engineering decision. Engineering can wire it once it exists.

Layer 4 — Multi-vendor failover

This is the layer most often skipped on cost grounds, and most often regretted in the postmortem. The architecture is well-known: an abstraction layer over two foundation-model vendors, feature parity at the prompt layer rather than the API layer, an evaluation suite that runs against both monthly to detect prompt drift, and per-tenant routing so you can shift load gradually rather than all-at-once.

The cost is real. About one quarter of focused engineering work for a small team is a fair estimate — the abstraction is straightforward, the eval suite is the part that takes time, and the prompt-engineering parity work has to be paid for in test runs. The payoff is clear: in the first incident the layer prevents, the engineering investment is recovered. In the second, it is recovered with interest.

Two important constraints. First, multi-vendor failover is not for prompt-level parity — the two vendors will respond differently to the same prompt, and the difference is sometimes material. The eval suite is what tells you whether the difference matters for each surface. Second, the abstraction layer should sit at the prompt-and-output level, not at the raw-API level. Vendors change their APIs. They do not change the shape of “summarise this paragraph.” Build the abstraction at the layer that does not move.

Layer 5 — Circuit breaker plus honest UX

The fifth layer is half engineering, half product. The engineering side is a circuit breaker on the model call — an error-rate threshold (a sensible default is greater than 5 percent over a 60-second rolling window), a latency threshold (p95 greater than three times baseline for 60 seconds), and an automatic flip into a degraded mode that the product UI is designed to handle. The product side is a status banner, an ETA, and a contact route, displayed honestly and immediately.

This is the layer where the operating principle is the most important thing in the post. Users forgive a clearly communicated incident. They do not forgive a service that silently lies for an hour, returning fabricated outputs or spinning forever or failing in places they cannot understand. The cheapest reliability feature you will ever ship is the banner that says “our AI features are temporarily degraded — expected back by 14:50 — here is what still works.” Build it before the outage. Test it on a Tuesday. Wire it to your circuit breaker. Done.

Honest UX is the layer that protects the trust account, which is the only account that matters. Every other layer in the ladder buys you minutes of margin. This one buys you a customer who comes back tomorrow. The order of the ladder is not arbitrary — you build cache first because it is cheapest, and you build the circuit breaker last because by the time you reach it, the four layers below have done most of the heavy lifting. But of all five, this is the one whose absence does the most quiet damage.

The full ladder, in one place

Run this as a checklist on a Friday afternoon. Score each layer green / amber / red on your current stack. Most teams score 1/5 or 2/5 on first pass. Three weeks of focused work moves a production-minded team to 4/5; the fifth layer is usually a product conversation more than an engineering one.

Layer 1 — Semantic cache. Embedding-keyed lookup against the last 60 minutes. Threshold around 0.92 cosine similarity. Cache TTL tuned to the workflow.

Layer 2 — Queue plus jittered retry. Async queue, exponential backoff with jitter, idempotency keys, per-tenant queue depth limits.

Layer 3 — Graceful degradation. Per-use-case fallback specs, written, reviewed, tested quarterly.

Layer 4 — Multi-vendor failover. Abstraction at the prompt-and-output layer, two vendors at parity, eval suite running against both monthly, per-tenant routing.

Layer 5 — Circuit breaker plus honest UX. Error-rate and latency thresholds, automatic flip into degraded mode, status banner with ETA shipped before the next outage.

Risks and what to avoid

Don’t treat “switch vendors” as a fallback. Cutting over from Claude to GPT mid-incident replaces one vendor’s outage with another vendor’s prompt-drift surprise, and your eval coverage on the new vendor is by definition zero in the moment you need it. Multi-vendor failover only works as Layer 4 because it is built and exercised before the incident, not during it.

Don’t over-tune the cache threshold. The temptation is to push cosine similarity to 0.97 or higher to avoid stale-answer risk. The result is a cache that almost never hits, and a layer that almost never helps. Start at 0.92 and tune downward in calibration; tune upward only when you have a workload where stale answers cause real harm and you can prove it from logs.

Don’t skip the queue depth limit per tenant. The most common load-shedding failure I see is one customer’s spike consuming the global queue and starving every other customer’s requests. Per-tenant queue depth limits are five lines of config, and they are the difference between a partial outage and a complete one.

Don’t hide the degradation. The temptation when the circuit breaker fires is to fall back to a worse-but-still-AI answer and not tell the user. Two outages later, your support inbox will tell you why this is the most expensive layer to skip. Honest UX is cheap; restoring trust after a silent regression is not.

What good looks like — one quarter from now

The semantic cache catches around a third of inbound prompts on a normal day, and a much higher fraction during incidents. The queue layer absorbs the post-recovery thundering herd without adding load to a healing vendor. Every customer-facing AI surface has a written degraded-mode spec, owned by a product manager, exercised quarterly. Two foundation-model vendors are wired behind a single internal interface, and the eval suite says they are at parity for the prompts that matter. The circuit breaker fires automatically in the first 60 seconds of any breach; the status banner tells the user what to expect; the ETA is honest; the support team has a runbook for the moments it is wrong.

The next 78-minute outage costs your team an afternoon of operational attention and zero customer trust. The customers do not even file tickets, because the system told them honestly what was happening and what to do. That is the bar. It is reachable. The five layers above are how you get there.

Final thought

The April 2026 outages are not the last ones you will see this year. They are the cheap ones — short, recoverable, no data loss. The expensive one will be a multi-hour outage that catches you mid-quarter, with no fallback wired, on the day a major customer is testing your SLA. The teams that will absorb that incident with grace are the teams that built the ladder before they needed it. The teams that will not, are the teams currently arguing about whether to switch vendors. Build the ladder.

Which layer is missing from your stack right now?

Indica Tech’s two-week reliability audit produces a vendor concentration map of your AI stack, scores you against the five-layer fallback ladder, ranks risk gaps by likelihood and blast radius, and gives you a 90-day implementation roadmap with named owners. Fixed price £3,500. Written report. Whether you hire us for the remediation or not.

See the audit engagement →