📅 05.01.26 ⏱️ Read time: 8 min
RAG systems fail in ways that are hard to detect. The retrieval step might return the wrong chunks. The model might ignore the retrieved content and hallucinate anyway. The answer might be technically grounded in the retrieved text but still misleading. Or the retrieved content might be accurate but not relevant to what the user actually asked.
None of these failures show up in a simple "did the answer match?" evaluation. RAG evals require measuring the retrieval and generation steps separately — and understanding which component is causing the failures you observe.
A traditional ML model is evaluated against labeled test data. You know the correct answer; you measure how often the model gets it right. Precision, recall, F1, RMSE — the metrics are well-defined.
RAG evaluation is harder for three reasons:
1. There's no single "correct" answer for most questions. A correct answer can be phrased in many valid ways. Exact match doesn't work; string similarity doesn't work well either.
2. The pipeline has two failure modes that need separate measurement. Retrieval failures (wrong chunks retrieved) are different from generation failures (model ignores or misuses retrieved content). Measuring only the final answer conflates both.
3. Ground truth is expensive to produce. Creating a labeled evaluation dataset for a RAG system requires human annotators who can assess: did the retrieved chunks contain the relevant information? Is the generated answer faithful to the retrieved content? Is the answer correct? This is time-consuming and requires domain expertise.
The result: many teams deploy RAG systems without adequate evaluation, discover quality problems in production, and can't diagnose which component is failing.
The RAG evaluation community has converged on four core metrics that cover the most important failure modes:
Faithfulness
What it measures: Does the generated answer stay within what the retrieved context actually says? Or does the model add information not present in the retrieved chunks?
Why it matters: High faithfulness means the model is using the retrieved content correctly and not hallucinating. Low faithfulness means the model is generating answers from its training data despite having retrieved context — the RAG pipeline is not working as intended.
How it's measured: Compare each claim in the generated answer against the retrieved context. What fraction of claims can be directly traced to the context?
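In practice this claim-by-claim check is usually done by an LLM judge. As a toy illustration of the underlying idea, a purely lexical version might look like the sketch below; `naive_faithfulness` is a made-up helper and far weaker than a real judge, since it only tests word coverage rather than actual entailment.

```python
import re

def naive_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer sentences whose content words all appear in the context.

    A crude lexical stand-in for the claim-level check an LLM judge performs.
    """
    stopwords = {"the", "a", "an", "is", "are", "was", "were", "of", "to", "in", "and", "or"}
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower())) - stopwords
        if words and words <= context_words:
            supported += 1
    return supported / len(sentences)

context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was completed in 1850 by aliens."
# The second sentence introduces claims absent from the context, so the score is 0.5.
```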
Answer relevance
What it measures: Does the generated answer actually address the user's question? A faithful, accurate answer that doesn't address the question is still a failure.
Why it matters: Distinguishes between "the answer is grounded in retrieved content" (faithfulness) and "the answer is useful for the user" (relevance).
How it's measured: Score the semantic similarity between the answer and the original question. Often done using an LLM as judge.
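A crude sketch of the similarity scoring, substituting bag-of-words counts for real sentence embeddings; the helper names are illustrative, and a production setup would use an embedding model or an LLM judge instead.

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Stand-in for a real sentence-embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer_relevance(question: str, answer: str) -> float:
    """Higher when the answer shares vocabulary with the question."""
    return cosine_similarity(bow_vector(question), bow_vector(answer))
```

With real embeddings the same cosine comparison captures semantic rather than purely lexical overlap.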
Context precision
What it measures: Of the retrieved chunks, what fraction are actually relevant to answering the question?
Why it matters: High precision means the retrieval step is returning relevant content. Low precision means the model is receiving noisy context — some retrieved chunks are irrelevant — which degrades generation quality.
How it's measured: For each retrieved chunk, assess whether it contains information relevant to the question. The score is the number of relevant chunks divided by the total number retrieved.
Context recall
What it measures: Was all the information needed to answer the question present in the retrieved chunks?
Why it matters: High recall means the retrieval step found everything the model needed. Low recall means relevant information existed in the knowledge base but wasn't retrieved — the retrieval missed it.
How it's measured: Compare the retrieved chunks against the ground-truth answer. What fraction of the information needed to generate the correct answer was present in the retrieved context?
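Given per-chunk relevance judgments (from a human annotator or an LLM judge), both context metrics reduce to simple ratios. The sketch below assumes chunk IDs and labels are already available; the function names are illustrative.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the question."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Fraction of the chunks needed for the ground-truth answer that were retrieved."""
    if not needed:
        return 1.0
    return sum(1 for c in needed if c in set(retrieved)) / len(needed)
```

The hard part in practice is producing the `relevant` and `needed` labels, not computing the ratios.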
Retrieval evaluation focuses on the quality of the chunks returned by the vector search step, independent of generation.
Metrics: hit rate (does the gold document appear in the top-k results?), context precision, and context recall, plus rank-aware measures such as MRR (mean reciprocal rank) when ordering matters.
What you need to run retrieval eval: a set of test questions, each paired with the document(s) or chunk(s) that should be retrieved to answer it.
This can be evaluated without any LLM — purely by checking whether the retrieval step returns the right documents.
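The LLM-free version of this check can be sketched as follows, assuming each test question is labeled with the ID of the document that should be retrieved; `hit_rate` and `mrr` are illustrative helper names, not a library API.

```python
def hit_rate(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

def mrr(results: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank of the gold document across queries (0 if absent)."""
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# results[i] is the ranked list the retriever returned for query i;
# gold[i] is the document that should have been retrieved.
```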
Common retrieval failure modes: relevant information split across chunk boundaries by the chunking strategy, vocabulary mismatch between query and documents that the embedding model fails to bridge, and a top-k setting too small to capture all the chunks an answer needs.
Generation evaluation focuses on what the model does with the retrieved context.
Reference-free evaluation (no human labels needed): Use an LLM as a judge. Present the LLM evaluator with: the original question, the retrieved context, and the generated answer. Ask it to assess faithfulness ("does the answer stay within the context?"), relevance ("does the answer address the question?"), and overall quality.
This is the most scalable approach for continuous evaluation — you don't need human labels for every test case. But LLM-as-judge introduces its own biases (preferring verbose answers, favoring its own output patterns) and is not perfectly reliable.
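A minimal judge setup might look like the following sketch. The prompt wording, the 1-5 scale, and the reply format are all assumptions, and the actual model call is omitted; only prompt construction and reply parsing are shown.

```python
JUDGE_PROMPT = """You are evaluating a RAG answer.

Question: {question}
Retrieved context: {context}
Generated answer: {answer}

Rate each on a 1-5 scale and reply exactly in this format:
faithfulness=<score>
relevance=<score>"""

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the template; the result is sent to whatever judge model you use."""
    return JUDGE_PROMPT.format(question=question, context=context, answer=answer)

def parse_judge_reply(reply: str) -> dict:
    """Extract the key=value scores from the judge's structured reply."""
    scores = {}
    for line in reply.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            scores[key.strip()] = int(value.strip())
    return scores
```

Constraining the judge to a rigid output format, as here, makes its replies machine-parseable and easier to audit for the biases mentioned above.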
Reference-based evaluation (human labels required): Compare generated answers against gold-standard human-written answers using metrics like ROUGE, BERTScore, or semantic similarity. More reliable for correctness evaluation, but requires the upfront investment of creating the reference answers.
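For instance, the unigram variant of ROUGE (ROUGE-1) can be computed directly as below; this sketch ignores stemming and the other refinements real ROUGE implementations apply.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because it is purely lexical, ROUGE rewards surface overlap; BERTScore or embedding similarity are better when valid answers can be phrased differently.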
Hallucination detection: Use an NLI (natural language inference) model or an LLM judge to classify each sentence in the generated answer as: supported by the retrieved context, contradicted by the retrieved context, or not addressable from the context.
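One way to structure this check is to put the NLI model behind a pluggable function, as sketched below. The `toy_nli` stand-in is a crude lexical check for illustration only; a real system would call an actual entailment classifier or an LLM judge there, and the label names are assumptions.

```python
import re
from typing import Callable

SUPPORTED, CONTRADICTED, NOT_ADDRESSABLE = "supported", "contradicted", "not_addressable"

def classify_sentences(
    answer: str,
    context: str,
    nli_fn: Callable[[str, str], str],
) -> dict[str, str]:
    """Label each answer sentence via an NLI judgment against the context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return {s: nli_fn(context, s) for s in sentences}

def toy_nli(premise: str, hypothesis: str) -> str:
    # Placeholder: a real implementation would run an entailment model here.
    words = set(re.findall(r"\w+", hypothesis.lower()))
    return SUPPORTED if words <= set(re.findall(r"\w+", premise.lower())) else NOT_ADDRESSABLE
```

Sentences labeled contradicted or not addressable are the hallucination candidates to surface for review.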
Several open-source frameworks implement RAG evaluation metrics:
RAGAS (Retrieval Augmented Generation Assessment): The most widely used RAG-specific evaluation framework. Implements faithfulness, answer relevance, context precision, and context recall out of the box. Uses LLM-as-judge internally for most metrics. Open source, integrates with LangChain and LlamaIndex.
DeepEval: A broader LLM evaluation framework that includes RAG-specific metrics alongside general LLM evaluation capabilities. Supports custom metrics and CI/CD integration.
TruLens: Focused on LLM app evaluation with particular strength in the "triad" of context relevance, groundedness, and answer relevance. Good observability tooling.
LlamaIndex Evaluation: Evaluation modules built into LlamaIndex that assess retrieval and generation quality within the LlamaIndex ecosystem.
A practical RAG eval setup:
Step 1: Build a test set. Create 50-200 question-answer pairs that cover the range of questions your users actually ask. Include the source document(s) for each question (for retrieval eval). Generate or write reference answers (for generation eval).
The test set is your most valuable eval asset. Invest in making it representative and high-quality.
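A test set entry can be as simple as a small record holding the three pieces described above; the field names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class RagTestCase:
    """One evaluation example: question, reference answer, and gold source docs."""
    question: str
    reference_answer: str
    source_doc_ids: list[str] = field(default_factory=list)

cases = [
    RagTestCase(
        question="What is our refund window?",
        reference_answer="Customers can request a refund within 30 days of purchase.",
        source_doc_ids=["policies/refunds.md"],
    ),
]
```

Keeping the gold source documents on each case is what lets you run the retrieval-only eval from Step 2 on the same test set.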
Step 2: Evaluate retrieval first. Before evaluating generation, verify that your retrieval is working. Measure hit rate and context recall against your test set. Fix retrieval problems (chunking strategy, embedding model, hybrid search) before moving on. Generation quality cannot exceed retrieval quality.
Step 3: Evaluate generation. With retrieval working well, evaluate faithfulness and answer relevance using an LLM-as-judge setup. Flag answers with low faithfulness scores for human review — these are your hallucinations.
Step 4: Run evals continuously. RAG system quality degrades as the knowledge base grows and changes. Run your eval suite on every significant update to the knowledge base or retrieval configuration. Treat RAG eval like unit tests for traditional software.
Step 5: Identify the failure pattern. When overall performance drops, use component-level metrics to diagnose: is it a retrieval problem (context precision or recall dropped) or a generation problem (faithfulness dropped despite good retrieval)?
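This diagnosis can be encoded as a simple rule over the component metrics, as in the sketch below; the 0.05 drop threshold is an arbitrary illustration, and the metric names mirror the four core metrics above.

```python
def diagnose(baseline: dict, current: dict, tolerance: float = 0.05) -> str:
    """Attribute a quality drop to the retrieval or generation component.

    Checks retrieval metrics first, since generation quality cannot
    exceed retrieval quality.
    """
    retrieval_drop = any(
        baseline[m] - current[m] > tolerance
        for m in ("context_precision", "context_recall")
    )
    if retrieval_drop:
        return "retrieval"
    if baseline["faithfulness"] - current["faithfulness"] > tolerance:
        return "generation"
    return "no significant drop"
```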
Evaluation discipline is what separates RAG systems that work reliably in production from those that looked good in demos.
→ See how Aicuflow's RAG pipeline works
→ Learn about AI evaluation concepts more broadly
→ Read about why RAG is evolving, not dying