The CFO Slack message arrives on day 11 of the month. “Hey, is OpenAI really charging us £42k this month? Can you look at it?” And the founder has to go find out, in real time, why a system they thought would run on £3k a month is running on something closer to fifty.
I have sat on calls like that day-11 CFO call. Sometimes as the auditor. Once, years ago, as the person who had to explain it. The details change, the shape does not. No one intends to overspend. The bill gets big because seven small things combine, and no one put a cap on any of them.
Here are the seven I see most often, ranked by how much money they tend to burn before anyone notices. Then the three controls I now insist on before a team ships.
A user request fails. Your code retries. The retry wraps three tool calls and each tool also retries. The tool calls fan out to a sub-agent that, of course, also retries. Eight seconds later the original timeout fires and the user hits the button again.
You have just charged yourself fourteen model calls for one user intent.
This is the most common cost leak I see, and it is usually invisible on the provider dashboard because the line item says chat.completions whether one call or fourteen happened. If you are using an agent framework with implicit retries (LangChain, LlamaIndex, and most agent SDKs have them by default), count them. Add them up. You will find a hidden multiplier between 2x and 8x on your request volume.
The chat endpoint does not care how many tokens you send it. A user uploads a 400-page PDF, your retrieval layer shoves it all into context “just to be safe,” and the model happily charges you for 380,000 input tokens. Multiply by an hour of traffic. That is the bill.
Input tokens are almost always where the money actually goes, not output. Everyone budgets output because that is the more visible number in the docs. Nobody caps input.
Your primary is a cheap tier: 4o-mini, Haiku, or similar. Something goes wrong. A schema parse fails, a safety filter trips, a token limit blows. Your fallback chain silently promotes the request to Opus or GPT-4.1 and it works. Nobody notices, and your per-request unit cost quietly doubles.
I have seen this one specifically hurt a Series A company. Their dashboard showed request counts by hour. It did not show which model served which request. The fallback was firing on 38% of traffic because of a brittle JSON parser. They thought they were running on 4o-mini. They were running on Opus.
Every user session starts with the same 4,000-token system prompt. You are paying for it on every single request. Anthropic’s prompt caching cuts that to roughly 10% of the un-cached cost on repeated prefixes, if you set the cache breakpoints. OpenAI applies automatic prompt caching on prompts over about 1,024 tokens with recent models, but only if your prompt is structured to hit it, which means the static part has to come first and the user-specific part last.
Most teams have not thought about any of this. On one audit we flipped the cache on and a workflow that was costing roughly £4.20 per thousand sessions dropped to £0.68. Same model, same prompt, same output. Not an optimisation; a switch.
max_tokensYou set max_tokens=4000 “to be safe” because you cannot predict output length. Half your responses are 300 tokens. Fine. But when the model occasionally decides to enumerate all 47 items it found in the context, you are paying for 4,000 output tokens on every one of those outliers. On a pricing tier where output is 3x input cost, the tail is expensive.
Cap aggressively. If the response needs more, chunk it. Any model that was going to give you a good answer at 4,000 tokens will give you the same good answer at 600.
Your eval harness calls the model on every deployment to score regressions. Sensible. Now it also runs on every pull request, every merge to main, and on a cron every hour because someone thought “continuous eval” meant “constantly.” You are spending £600 a month on tests that were supposed to cost £40.
This one is preventable the moment anybody looks at the usage dashboard. Nobody looks at the usage dashboard.
Your provider bill arrives on the 3rd of the month. Finance processes it on the 15th. Engineering finds out on the 20th. You have now had 20 days of whatever-went-wrong running before anyone reacted. In a system where daily spend can 10x overnight, that delta is the difference between a £2k overrun and a £40k one.
Seven traps, three controls. None of these are hard to build. They are hard to prioritise before the first painful bill arrives.
Not a logging line. Not a dashboard alert. A check that runs before the call and raises an exception if the request would exceed a cap. At its simplest:
from tiktoken import encoding_for_model
MAX_INPUT_TOKENS = 25_000 # tune to your workload
MAX_COST_GBP = 0.40 # per-request ceiling
enc = encoding_for_model("gpt-4o-mini")
def guard(payload: dict, price_per_1k_in: float, price_per_1k_out: float) -> None:
n_in = sum(len(enc.encode(m["content"])) for m in payload["messages"])
if n_in > MAX_INPUT_TOKENS:
raise BudgetExceeded(f"input tokens {n_in} > cap {MAX_INPUT_TOKENS}")
projected_out = payload.get("max_tokens", 0)
cost = (n_in / 1000) * price_per_1k_in + (projected_out / 1000) * price_per_1k_out
if cost > MAX_COST_GBP:
raise BudgetExceeded(f"projected cost £{cost:.2f} > cap £{MAX_COST_GBP:.2f}")
Start the cap generous. Tighten it with real data. If a single request is about to cost more than fifty pence, someone should know about it before the call fires, not after.
Do not let the fallback chain silently promote requests to the expensive tier. Wrap model selection in a router that knows the unit cost of each option and logs every fallback loudly.
MODELS = {
"mini": {"gbp_per_1k_in": 0.00012, "gbp_per_1k_out": 0.00048, "provider": openai_mini},
"haiku": {"gbp_per_1k_in": 0.00020, "gbp_per_1k_out": 0.00100, "provider": anthropic_haiku},
"opus": {"gbp_per_1k_in": 0.01200, "gbp_per_1k_out": 0.06000, "provider": anthropic_opus},
}
def route(payload: dict, tier: str = "mini") -> dict:
try:
return MODELS[tier]["provider"](payload)
except ModelFailure as e:
metrics.fallback_fired.labels(from_tier=tier, reason=str(e)).inc()
if tier == "mini":
return route(payload, tier="haiku") # ~2x cost
if tier == "haiku":
if not payload.get("allow_opus"): # guard explicit opt-in
raise
return route(payload, tier="opus") # ~60x cost
raise
The metrics line is what matters. If 38% of your traffic is quietly falling through to Opus, your router should be screaming in Grafana, not politely padding the bill.
Every major provider now supports project-level or organisation-level spending limits. OpenAI has it per project and per API key. Anthropic has organisation-level usage limits. Google’s Vertex AI has quota policies. Turn them on. Set them at roughly 2x your expected daily spend.
Yes, this means on the day something goes wrong your product might throttle. That is the feature, not the bug. The alternative is finding out on the 15th that the “something going wrong” has been running for eleven days.
The goal is not zero overruns. It is to have the overrun conversation on day one rather than day eleven, with a £400 number rather than a £42k one, from an alert you built rather than a bill you opened.
Three controls. Three to five days of engineering for a team that already has observability. Every audit I run includes a review of exactly this. Roughly half the time it is the single biggest findable saving on the engagement.
Production-readiness audits cover cost infrastructure alongside retries, evals, and governance. Two weeks, fixed price, written report with ranked findings. If cost is your only concern, scope it to a cost-only review.
Start a conversation →