You can't ship what you can't measure, and most teams ship their LLMs completely blind. The minimum viable evaluation stack for a production LLM system is five metrics — three from RAGAS, one custom rubric scored by a judge model, and one gate wired into CI. Here are the five, the thresholds I use, and the code.
There are maybe forty named LLM evaluation metrics in serious use right now. Most teams, when they decide to start measuring, either adopt all forty, drown in dashboards, and ship nothing — or pick one vanity metric ("BLEU score is 0.42!") and treat it as a green light. Both failures have the same cause: no clarity on what the metric is supposed to catch.
A production eval stack needs to answer five specific questions, and each question maps to exactly one metric. That is the stack. Five metrics, five questions, five thresholds, one CI gate. Enough to catch real regressions. Few enough that you can run it on every pull request and still trust the result.
Question it answers: is the model making things up?
Faithfulness, as defined by the RAGAS library, is the number of claims in the generated response that are supported by the retrieved context, divided by the total number of claims in the response. It ranges from 0 to 1. A perfect score means every factual claim in the answer can be traced to something that was actually retrieved. A low score means the model is hallucinating or confabulating beyond its source material.
This is the single most important metric for any retrieval-augmented system, because hallucination is the failure mode that ends careers. Wrong numbers in a financial report, wrong dosage in a clinical summary, wrong citation in a legal brief — those do not just annoy users. They get people sued.
Threshold I use: faithfulness ≥ 0.85 on the production eval set. Below 0.75, stop the deploy and investigate before merging.
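The ratio is simple to compute once each claim has a supported/unsupported verdict. A minimal sketch, assuming the claim extraction and verdicts already come from a judge model; `faithfulness_score` and `gate_status` are hypothetical helper names, not RAGAS APIs, and the "warn" label for the band between the two thresholds is my own assumption:

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Supported claims divided by total claims in one response."""
    if not claim_supported:
        raise ValueError("response contains no claims to score")
    return sum(claim_supported) / len(claim_supported)


def gate_status(score: float, target: float = 0.85, hard_stop: float = 0.75) -> str:
    """Map an eval-set score onto the two thresholds used in this article."""
    if score >= target:
        return "pass"
    if score >= hard_stop:
        return "warn"   # assumed label for the band between the thresholds
    return "block"


# Hypothetical verdicts for three responses in an eval set.
per_response = [
    faithfulness_score([True, True, True, True]),    # 1.00
    faithfulness_score([True, True, True, False]),   # 0.75
    faithfulness_score([True, False]),               # 0.50
]
eval_set_mean = sum(per_response) / len(per_response)
print(f"{eval_set_mean:.2f} -> {gate_status(eval_set_mean)}")
```

The thresholds apply to the mean over the eval set, not to a single response, so one bad row does not block a deploy on its own.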
Question it answers: did the model actually answer the question?
Faithfulness tells you the output is grounded. It does not tell you the output addresses the query. Answer relevancy scores the response against the original question, penalising completeness gaps ("you answered half the question") and redundancy ("you answered the question and then answered four other questions the user did not ask").
This is the metric that catches the "technically correct but not what I asked" class of failure, which is especially common in RAG systems where retrieval finds adjacent documents and the model dutifully summarises them instead of addressing the user's intent.
Threshold I use: answer relevancy ≥ 0.80. Below 0.70 is a user-facing quality emergency.
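To make the scoring concrete: as I understand it, RAGAS approximates answer relevancy by having an LLM regenerate candidate questions from the answer and averaging their embedding similarity to the original question. The sketch below assumes that approach and swaps real embeddings for toy 2-d vectors; `answer_relevancy_score` is a hypothetical name, not the library function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(
    question_vec: list[float], regenerated_vecs: list[list[float]]
) -> float:
    """Mean similarity between the original question and the questions
    regenerated from the answer (embeddings are assumed to be precomputed)."""
    if not regenerated_vecs:
        raise ValueError("need at least one regenerated question")
    sims = [cosine(question_vec, vec) for vec in regenerated_vecs]
    return sum(sims) / len(sims)


# Toy vectors: one regenerated question matches the original, one is off-topic.
original = [1.0, 0.0]
regenerated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy_score(original, regenerated))
```

An answer that wanders into questions the user did not ask pulls the mean down, which is the redundancy penalty described above.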
Question it answers: is the retrieval surfacing the right chunks?
Context precision is the counterpart to context recall. Recall asks whether the relevant information is in the retrieved set. Precision asks whether the retrieved set is mostly relevant or mostly noise. In practice, precision is the one that drives production cost: a bloated context with low signal forces larger prompts, higher token spend, and worse downstream faithfulness because the model has more irrelevant text to get distracted by.
When faithfulness is low, context precision is often the upstream culprit. The model is making things up because the retrieval is surfacing nearly-relevant documents it then has to splice together. Fix precision first, and faithfulness tends to follow.
Threshold I use: context precision ≥ 0.80 on the production eval set. Below 0.70, the retrieval layer needs work before the model tuning does.
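A small sketch of how a rank-aware precision score behaves. It assumes each retrieved chunk already carries a relevant/not-relevant verdict and follows the rank-weighted form (average of precision@k at each relevant rank) that RAGAS uses, to my understanding; `context_precision_score` is a hypothetical name.

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Average of precision@k, taken at every rank k that holds a relevant chunk.

    `relevant[i]` is the verdict for the chunk retrieved at rank i + 1.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k   # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Same two relevant chunks, different ranking.
ranked_well = [True, False, True, False]
ranked_badly = [False, False, True, True]
print(round(context_precision_score(ranked_well), 3))
print(round(context_precision_score(ranked_badly), 3))
```

Both retrievals return the same two relevant chunks, but the one that buries them under noise scores roughly half as well, which is the behaviour you want from a metric meant to flag bloated context.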
Question it answers: does the output meet your quality bar?
RAGAS metrics are excellent for generic RAG hygiene. They do not know that your legal-AI agent must refuse to give a final opinion without a human signoff, or that your medical summary tool must always cite the document section, or that your customer support bot is required to offer an escalation path to a human after two failed attempts.
That is where G-Eval comes in. You write a domain rubric in plain language ("score from 1 to 5 on how well the response follows the refusal policy in our guidelines") and a judge model, typically GPT-4 class or Claude class, grades each response against it. The judge should be a different model from the generator, to avoid the obvious conflict of interest. Run the rubric on a curated eval set with known-good and known-bad outputs, and the score becomes a proxy for human judgement at a fraction of the cost.
A rubric typically has three to five criteria, each scored 1 to 5, with anchored definitions ("5 = follows policy exactly, 3 = partially follows, 1 = clearly violates"). The average across criteria is the metric.
Threshold I use: domain rubric ≥ 4.0 / 5.0. Below 3.5, the specific rubric criteria that failed become the next sprint's backlog.
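The aggregation step can be sketched as follows. The judge call itself is left out, so the scores here are hypothetical values a judge model might return; `Criterion`, `rubric_score`, and the three criterion names are illustrative, loosely based on the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    anchors: dict[int, str]  # score -> anchored definition shown to the judge


def rubric_score(judge_scores: dict[str, int], criteria: list[Criterion]) -> float:
    """Average the 1-5 judge scores across all rubric criteria."""
    if not criteria:
        raise ValueError("rubric has no criteria")
    for criterion in criteria:
        score = judge_scores.get(criterion.name)
        if score is None or not 1 <= score <= 5:
            raise ValueError(f"missing or out-of-range score for {criterion.name!r}")
    return sum(judge_scores[c.name] for c in criteria) / len(criteria)


ANCHORS = {5: "follows policy exactly", 3: "partially follows", 1: "clearly violates"}
CRITERIA = [
    Criterion("refusal_policy", ANCHORS),
    Criterion("cites_section", ANCHORS),
    Criterion("offers_escalation", ANCHORS),
]

# Hypothetical judge output for one response.
judge_scores = {"refusal_policy": 5, "cites_section": 4, "offers_escalation": 3}
overall = rubric_score(judge_scores, CRITERIA)
weakest = min(judge_scores, key=judge_scores.get)
print(f"rubric = {overall:.1f}, weakest criterion = {weakest}")
```

Keeping the per-criterion scores around, rather than only the average, is what makes the "failed criteria become the backlog" step possible.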
Question it answers: is this pull request making the system worse?
The first four metrics are values. The fifth metric is a rule. On every pull request, run the eval set through both the main baseline and the proposed change, and compare. If any of the four metrics has regressed by more than a configured tolerance (I use 0.02 absolute), the build fails and the PR cannot merge.
This is the only part of the stack that actually prevents regressions from reaching production. Without it, the other four metrics become dashboard ornaments. With it, they become a quality ratchet: the main branch can only move forwards.
The practical shape of this in GitHub Actions is a single job that runs on every PR, checks out both branches, runs evals against a fixed ~100-row eval set, and writes a comparison comment on the PR with pass/fail for each metric and the delta from baseline. A full cycle takes two to five minutes on a 100-row set, which is about the upper bound of what is tolerable for a PR check.
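The comparison rule itself is a few lines. A minimal sketch, assuming both eval runs have already produced a metric-to-score mapping; `find_regressions` and the example scores are hypothetical:

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than `tolerance`."""
    regressions = {}
    for metric, base_score in baseline.items():
        delta = candidate[metric] - base_score
        if delta < -tolerance:
            regressions[metric] = round(delta, 4)
    return regressions


# Hypothetical scores from the main branch and from the PR branch.
baseline = {"faithfulness": 0.88, "answer_relevancy": 0.82,
            "context_precision": 0.80, "domain_rubric": 4.2}
candidate = {"faithfulness": 0.84, "answer_relevancy": 0.81,
             "context_precision": 0.83, "domain_rubric": 4.2}

failed = find_regressions(baseline, candidate)
print(failed)
if failed:
    # In CI, exit non-zero here (for example `raise SystemExit(1)`) to fail the build.
    print("EVAL GATE: regression detected")
```

Note that a single 0.02 absolute tolerance is applied to every metric here, including the rubric on its 1-to-5 scale; a per-metric tolerance map would be a small extension if that proves too strict or too loose.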
A minimal setup using RAGAS in Python:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
)
from datasets import Dataset

# questions, generated, retrieved, and references are lists you build
# from your own eval set before this point.
eval_dataset = Dataset.from_dict({
    "question": questions,        # list[str]
    "answer": generated,          # list[str], from your system under test
    "contexts": retrieved,        # list[list[str]], what the retriever returned
    "ground_truth": references,   # list[str], the ideal answer
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

thresholds = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.80,
}

# Assumes a RAGAS version where result[metric] is the eval-set mean (a float)
# and the column names above are accepted; newer releases may differ.
for metric, minimum in thresholds.items():
    score = result[metric]
    if score < minimum:
        raise SystemExit(
            f"EVAL FAIL: {metric} = {score:.3f} (min {minimum})"
        )
Wire that script into a GitHub Actions job that runs on every pull request, and you have most of the fifth piece, the CI gate: run it against both the baseline and the PR branch to get the regression comparison as well as the absolute thresholds. Add the G-Eval rubric as the fourth metric, scored by an LLM judge, store the results in a JSONL file committed to the repo (or uploaded to Langfuse / TruLens), and you have a regression-resistant quality floor that beats what most production AI teams have today.
Five metrics is a floor, not a ceiling. Once the floor is holding, the next things to add are usually cost-per-successful-request (because unit economics matter), P95 latency (because users leave), safety-policy adherence scored by a separate judge (because regulators care), and eval-score drift over time (because the model provider will silently update the underlying model and the 0.87 you had last week might be 0.79 this week without a single code change on your side).
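Two of those follow-on numbers are cheap to compute from request logs. A rough sketch, assuming you already log per-request latency, total spend, and a success flag; the function names and sample values are hypothetical, and the percentile uses the simple nearest-rank method:

```python
def cost_per_successful_request(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by the number of requests that met the quality bar."""
    if successes <= 0:
        raise ValueError("no successful requests in the window")
    return total_cost_usd / successes


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil of 0.95 * n
    return ordered[rank - 1]


# Hypothetical log window: 100 requests, 80 judged successful, 12 USD spent.
latencies_ms = [float(ms) for ms in range(1, 101)]
print(cost_per_successful_request(12.0, 80))
print(p95(latencies_ms))
```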
But none of that matters if the floor isn't there. Start with five. Ship them into CI. Fail the build when they regress. The rest follows.