What production-grade retry logic looks like for LLM calls

Exponential backoff is not retry logic. It is one layer of it. A production LLM system needs five: error classification, backoff with jitter, idempotency keys, circuit breakers, and cross-provider fallback. Any team that tries to ship with fewer is writing their own 3am incident. Here's the full stack, with code.

Figure: RETRY STACK / five layers, outside-in
  01 Error classification (retryable vs fatal)
  02 Exponential backoff + jitter (2^n × rand)
  03 Idempotency keys (retries are safe)
  04 Circuit breaker (stop hammering dead)
  05 Cross-provider fallback (OpenAI → Anthropic → local)
  If 05 fails → return deterministic fallback or 503 with a retry-after.
Five layers wrap every outbound LLM call. Skipping any of them makes the system brittle.

The 3am story

It always sounds the same. OpenAI has a bad Tuesday. Rate-limit responses start coming back. The team's code has a try/except with a three-second sleep and a single retry. The retry also rate-limits. The exception propagates. The FastAPI worker 500s. The user sees an error. The customer success lead is in the founder's DMs by 09:02. The founder pings the engineer. The engineer pushes a "fix" that retries five times with a flat five-second sleep. The retries triple the load on the upstream provider. Now it is not just the original bad Tuesday: the engineer's retry storm is making it worse.

This is not a hypothetical. I have walked into some version of this outage three times this year alone. In each case the fix was not to write better retry logic. It was to write retry logic at all, five layers of it.

Layer 1 — Error classification

The single most important thing you do before deciding to retry is decide whether this specific error is retryable. Retrying a fatal error is worse than not retrying at all, because it burns quota, increases load on the upstream, and usually adds latency to an error the user is getting anyway.

For a typical LLM provider, the error classification looks like this:

  • Retryable: HTTP 429 (rate limit), 500, 502, 503, 504 (server errors), connection timeouts, read timeouts, SSL handshake failures.
  • Not retryable: HTTP 400 (bad request — your prompt is malformed), 401 (auth — your key is wrong), 403 (permission), 404 (model does not exist), 422 (validation error). Retrying these does not fix them.
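  • Needs a decision, not a blind retry: HTTP 409 (conflict) is worth retrying only if you know what conflicted and that it has cleared; model-specific errors that mean "context too long" call for fixing the prompt, not retrying; content-filter refusals call for redesigning the request, not retrying.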

The practical implementation is a small function that looks at the exception, returns True if retryable, False if fatal, and is the only place in the codebase that decides. Everything else asks it.
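
A minimal sketch of that single decision point, using only the standard library. UpstreamHTTPError is a hypothetical wrapper for non-2xx responses, not part of any SDK; in a real codebase you would map your HTTP client's exceptions onto the same two questions. Unknown status codes are treated as fatal here, which is an assumption you may want to revisit.

```python
import ssl

# Status codes from the lists above.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})

# Transport-level failures worth another attempt. Note that ssl.SSLError also
# covers certificate problems, which a retry will not fix; narrow it if needed.
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionError, ssl.SSLError)


class UpstreamHTTPError(Exception):
    """Hypothetical wrapper for a non-2xx response from a provider."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """The only place in the codebase that decides whether to retry."""
    if isinstance(exc, UpstreamHTTPError):
        # Anything not explicitly listed (400, 401, 409, ...) is treated as fatal.
        return exc.status_code in RETRYABLE_STATUS
    return isinstance(exc, RETRYABLE_EXCEPTIONS)
```

Everything else in the retry stack takes is_retryable as an input rather than inspecting status codes itself.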

Layer 2 — Exponential backoff with jitter

If you do retry, you need two properties. First, each retry waits longer than the previous one — exponential, typically doubling. Second, the wait time has random noise added to it, called jitter. The jitter is not a stylistic flourish. It is the single line of code that prevents retry storms.

Here is why. Without jitter, a thousand clients that hit a rate limit at the same time will all retry at exactly the same interval. Every one of them retries at t+2s, then t+4s, then t+8s. The upstream provider gets hit with synchronised thundering herds at predictable intervals. With jitter, each client's retry time is randomised within its backoff window. The herd disperses. The upstream recovers.

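The math is simple: wait = min(2 ** attempt + random.uniform(0, 1), MAX_WAIT). Start at around 1 second, cap at around 60 seconds, give up after about 5 attempts. This form adds up to one second of jitter on top of the exponential delay; the "full jitter" variant, random.uniform(0, min(2 ** attempt, MAX_WAIT)), spreads retries across the whole backoff window and disperses a large herd further. Python's tenacity library does this with one decorator, but it is worth writing once by hand so you understand what it does.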
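
Written by hand, the formula above might look like the sketch below. The names backoff_delay and call_with_backoff are illustrative, and the is_retryable argument stands for whatever classification function your codebase uses. The sleep function is injectable so the loop can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_WAIT = 60.0    # cap on a single wait, in seconds
MAX_ATTEMPTS = 5   # total tries, including the first one


def backoff_delay(attempt: int) -> float:
    """Wait before the retry that follows failed attempt `attempt` (0-based)."""
    return min(2 ** attempt + random.uniform(0, 1), MAX_WAIT)


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[BaseException], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == MAX_ATTEMPTS - 1
            if last_attempt or not is_retryable(exc):
                raise  # fatal error, or out of attempts
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

With these constants the waits are roughly 1-2 s, 2-3 s, 4-5 s and 8-9 s, so a call that never succeeds gives up after about 15 to 19 seconds of waiting.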

Layer 3 — Idempotency keys

Retrying is only safe if the operation is idempotent. For simple LLM completions it usually is: the same prompt may produce a different response, but no state changes on our side. The moment you start using tool calls that mutate external state (create_user, charge_card, send_email), a request that succeeds on the server but times out before the response reaches you, and is then retried, creates a ghost operation. The email gets sent twice. The card gets charged twice. The Slack message posts twice.

The fix is idempotency keys. Every mutating request includes a stable key (typically a UUID generated once at request creation, persisted with the request, and reused on every retry). The downstream service checks for that key, and if it has already processed a request with that key, returns the cached response instead of repeating the operation. Stripe and many other APIs support this pattern via an Idempotency-Key header; confirm in each provider's documentation that the header is honoured before relying on it. Where it is supported, use it. Always. If a tool call does not support it natively, wrap it in a small dedup layer backed by Redis: "has this key been seen in the last 24 hours? If yes, return cached response. If no, call through and cache."
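
The dedup layer described above can be sketched as follows. To stay self-contained, an in-memory dict stands in for Redis, and DedupCache and run_once are hypothetical names. This version is not safe across processes or threads; a shared store would need to claim the key atomically (for example with Redis SET using the NX and EX options) before calling through.

```python
import time
from typing import Any, Callable, Dict, Tuple

TTL_SECONDS = 24 * 60 * 60  # "seen in the last 24 hours"


class DedupCache:
    """In-memory stand-in for a Redis-backed idempotency store."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._seen: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, response)

    def run_once(self, key: str, fn: Callable[[], Any]) -> Any:
        """Run fn at most once per key within the TTL; replay the cached response after."""
        now = self._clock()
        hit = self._seen.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # already processed: return the cached response
        result = fn()      # first sighting (or expired): call through
        self._seen[key] = (now, result)
        return result
```

A retried send_email wrapped as cache.run_once(idem_key, lambda: send_email(...)) then sends once and replays the stored result on the retry.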

Retrying without idempotency is duplicating without knowing.

Layer 4 — Circuit breaker

At some point the right answer is to stop retrying altogether. If the upstream provider has been returning 503 for four minutes, the fifth request is not going to save you. Worse, every additional request you send makes the recovery harder.

The circuit breaker pattern is straightforward. You track failures over a rolling window. When the failure rate crosses a threshold (for example, 50% failure in the last 30 seconds, minimum 10 requests), you flip the circuit to open: requests fail fast without hitting the upstream at all, returning an error to the caller immediately. After a cooldown (30–60 seconds), you flip to half-open: one probe request goes through. If it succeeds, the circuit closes again. If it fails, another cooldown.
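
To make the three states concrete, here is a small state-machine sketch with an injectable clock, followed by a walk-through. The class and parameter names are illustrative, the thresholds are the example values from the paragraph above, and the sketch is single-threaded: a real service would guard the state with a lock.

```python
import time
from collections import deque
from enum import Enum
from typing import Callable, Deque, Tuple


class State(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing fast
    HALF_OPEN = "half_open"  # one probe in flight


class Breaker:
    def __init__(self, threshold: float = 0.5, window: float = 30.0,
                 min_requests: int = 10, cooldown: float = 45.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold, self.window = threshold, window
        self.min_requests, self.cooldown = min_requests, cooldown
        self._clock = clock
        self.state = State.CLOSED
        self._results: Deque[Tuple[float, bool]] = deque()  # (timestamp, ok)
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if the caller may hit the upstream."""
        if self.state is State.OPEN and self._clock() - self._opened_at >= self.cooldown:
            self.state = State.HALF_OPEN  # this caller becomes the single probe
            return True
        return self.state is State.CLOSED

    def record(self, ok: bool) -> None:
        now = self._clock()
        if self.state is State.OPEN:
            return  # late result from a request sent before the trip
        if self.state is State.HALF_OPEN:
            if ok:
                self.state = State.CLOSED
                self._results.clear()
            else:
                self._trip(now)  # probe failed: another cooldown
            return
        self._results.append((now, ok))
        while self._results and now - self._results[0][0] >= self.window:
            self._results.popleft()
        failures = sum(1 for _, k in self._results if not k)
        if (len(self._results) >= self.min_requests
                and failures / len(self._results) >= self.threshold):
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = State.OPEN
        self._opened_at = now


# Walk-through with a fake clock.
t = [0.0]
breaker = Breaker(clock=lambda: t[0])
for _ in range(10):            # ten failures inside the window
    breaker.allow()
    breaker.record(ok=False)
tripped = breaker.state        # State.OPEN: requests now fail fast
t[0] = 46.0                    # cooldown has passed
probe_allowed = breaker.allow()    # True: one probe goes through
second_allowed = breaker.allow()   # False: everyone else still fails fast
breaker.record(ok=True)        # probe succeeded, circuit closes
```

The caller pattern is: if allow() returns False, return an error immediately; otherwise make the request and pass the outcome to record().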

The value of a circuit breaker is not just that it saves the upstream from your hammering. It also saves your service from holding requests open during a provider outage, which is how you end up with a queue of 3,000 waiting clients and a service that falls over because you ran out of file descriptors before the provider came back.

Layer 5 — Cross-provider fallback

The final layer is the one that turns a bad-provider-Tuesday from an incident into a blip. When your circuit breaker opens on OpenAI, your code falls through to Anthropic (or Gemini, or a local Llama, or whatever). The failover is automatic. The user never notices.

Two practical warnings. First, the prompts that work on one model do not necessarily work on another. You need a prompt template per provider, with a shared interface in your code. Do not assume that calling client.chat.create with the same string produces the same quality output across vendors. Run your eval set on each fallback provider independently, and know which ones clear your thresholds.

Second, cost and latency differ. Treat the fallback as a temporary quality compromise, not a permanent substitute. Emit a metric every time a fallback fires (llm_fallback_fires_total, tagged by primary and fallback provider) so the team notices when the fallback is doing 10% of your traffic rather than 0.5%. That is a signal the primary provider has a sustained problem, not a transient one.
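
Both warnings can be folded into the shape of the fallback chain itself. The sketch below is one possible arrangement, not a prescribed design: Provider, ProviderUnavailable and complete_with_fallback are hypothetical names, each provider carries its own prompt template behind a shared interface, and a Counter stands in for a real metrics client.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical: the provider's circuit is open or its retries are exhausted."""


@dataclass(frozen=True)
class Provider:
    name: str
    render: Callable[[str], str]    # provider-specific prompt template
    complete: Callable[[str], str]  # sends the prompt; raises ProviderUnavailable


# Stand-in for a metrics client; keyed by (primary, fallback) provider names.
llm_fallback_fires_total: Counter = Counter()


def complete_with_fallback(question: str, chain: Sequence[Provider]) -> str:
    """Try each provider in order, using that provider's own prompt template."""
    primary = chain[0].name
    for provider in chain:
        try:
            answer = provider.complete(provider.render(question))
        except ProviderUnavailable:
            continue  # fall through to the next provider in the chain
        if provider.name != primary:
            llm_fallback_fires_total[(primary, provider.name)] += 1
        return answer
    # Nothing answered: the caller returns a deterministic fallback or a 503.
    raise ProviderUnavailable("all providers in the chain failed")
```

Only availability failures trigger the fallthrough here; a fatal error such as a malformed request propagates, since another vendor will not fix it.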

The whole thing, as code

Rough shape using tenacity for the backoff layer and a hand-written circuit breaker. You would typically extract this into a single llm_call wrapper that every caller goes through.

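import os
import time
import uuid

import httpx
from tenacity import (
    retry, retry_if_exception_type,
    stop_after_attempt, wait_exponential_jitter,
)

# Assumption: the key comes from the environment; use your own secret store.
KEY = os.environ.get("OPENAI_API_KEY", "")

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

class RetryableUpstream(Exception):
    """The provider answered with a status code that is worth retrying."""

class CircuitOpen(Exception):
    """The breaker is open: fail fast without calling the provider."""

# Layer 1: the one place that says which failures are retryable.
RETRYABLE = (httpx.TimeoutException, httpx.NetworkError,
             httpx.RemoteProtocolError, RetryableUpstream)

class CircuitBreaker:
    # Rough shape only: not thread-safe, and more than one probe can pass
    # once the cooldown has elapsed.
    def __init__(self, fail_threshold=0.5, window=30, cooldown=45, min_requests=10):
        self.fail_threshold, self.window = fail_threshold, window
        self.cooldown, self.min_requests = cooldown, min_requests
        self.results, self.opened_at = [], None  # results holds (timestamp, ok)

    def before(self):
        # Open and still cooling down: fail fast. After the cooldown, requests
        # go through as probes until record() sees a result.
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.cooldown:
            raise CircuitOpen()

    def record(self, ok: bool):
        now = time.monotonic()
        if self.opened_at is not None:  # result of a half-open probe
            self.opened_at = None if ok else now  # close, or start another cooldown
            self.results = []
            return
        self.results = [(t, k) for t, k in self.results if now - t < self.window]
        self.results.append((now, ok))
        if len(self.results) >= self.min_requests:
            rate = sum(1 for _, k in self.results if not k) / len(self.results)
            if rate >= self.fail_threshold:
                self.opened_at = now

openai_cb    = CircuitBreaker()
anthropic_cb = CircuitBreaker()

@retry(
    retry=retry_if_exception_type(RETRYABLE),
    wait=wait_exponential_jitter(initial=1, max=30),
    stop=stop_after_attempt(4),
    reraise=True,
)
def _call_openai(payload: dict, idem_key: str) -> dict:
    openai_cb.before()  # raises CircuitOpen, which is not in RETRYABLE
    try:
        r = httpx.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Idempotency-Key": idem_key, "Authorization": f"Bearer {KEY}"},
            json=payload, timeout=30,
        )
        r.raise_for_status()
    except httpx.HTTPStatusError as e:
        if e.response.status_code in RETRYABLE_STATUS:
            openai_cb.record(ok=False)
            raise RetryableUpstream(str(e)) from e
        raise  # fatal 4xx: not retried, not counted against the provider
    except httpx.TransportError:
        openai_cb.record(ok=False)  # timeouts and connection failures count too
        raise
    openai_cb.record(ok=True)
    return r.json()

def _call_anthropic(payload: dict, idem_key: str) -> dict:
    # Same shape as _call_openai: anthropic_cb, Anthropic's endpoint and headers,
    # and a payload built from that provider's own prompt template.
    raise NotImplementedError

def llm_call(payload: dict) -> dict:
    idem = str(uuid.uuid4())  # one key per logical request, reused on every retry
    try:
        return _call_openai(payload, idem)
    except (CircuitOpen, *RETRYABLE):
        # Fatal errors (400, 401, ...) propagate: another provider will not fix them.
        return _call_anthropic(payload, idem)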

It is not short, but it is not complicated. Five layers. Well under 100 lines of non-trivial Python. Every production LLM service I have ever shipped has some version of this at its edge, and every production LLM service I have ever audited that did not have it had some version of the 3am story.

Observability, briefly

Retry logic without metrics is a black box. At minimum, emit: llm_request_total (tagged by provider, model, status), llm_request_duration_seconds histogram, llm_retry_total tagged by attempt number and error class, llm_circuit_open_total, llm_fallback_fires_total. Alert on circuit open for > 2 minutes. Alert on fallback > 5% of traffic for 10 minutes. Alert on retry rate > 20% for 15 minutes. Most 3am incidents would have been 3pm incidents with these alerts in place.
