The Challenges of Retrieving and Evaluating Relevant Context for RAG

A case study on how to measure context relevance in your retrieval-augmented generation system using Ragas, TruLens, and DeepEval, built around a grade 1 text understanding exercise

Published June 10, 2024

A grade one text understanding exercise converted into an example for a retrieval augmented generation question answering problem.

The relevance of your retrieved context to the user input plays a key role in the performance of your Retrieval-Augmented Generation (RAG) pipeline. However, retrieving relevant context comes with its own set of challenges. And what’s more challenging is the question of how to measure context relevance effectively.

This article will explore the challenges of retrieving relevant context and measuring context relevance with the following grade 1 text comprehension example.

text ="""Lisa is at the park. Her dog Bella, is with her.
Lisa rides her bike and plays with Bella. They race each other in the sun.
Then Lisa goes to the pond to see the ducks.
She thinks they are so cute and funny.
"""

questions = [
"Where is Lisa?",
"What is Lisa's dog's name?",
"What does Lisa do in the park?",
"Why does Lisa go to the pond?"
]

Note that state-of-the-art LLMs, such as gpt-3.5-turbo, can easily answer this first-grade exercise if you pass the entire text as context, since it is only six sentences long.

from openai import OpenAI
openai = OpenAI()

for question in questions:
  prompt_template = f"""
  Based on the provided context, answer the following question:

  Context: {text}
  Question: {question}
  Answer:
  """

  response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "user", "content": prompt_template},
    ],
    temperature=0,
  )

  print(f"Question: {question}\nAnswer: {response.choices[0].message.content}\n")
Question: Where is Lisa?
Answer: Lisa is at the park.

Question: What is Lisa's dog's name?
Answer: Lisa's dog's name is Bella.

Question: What does Lisa do in the park?
Answer: In the park, Lisa rides her bike, plays with her dog Bella, races with Bella in the sun, and goes to the pond to see the ducks.

Question: Why does Lisa go to the pond?
Answer: Lisa goes to the pond to see the ducks because she thinks they are cute and funny.

But note that in real-life RAG systems, you will be conducting question-answering tasks on much larger documents, which you can’t necessarily pass to the large language model’s (LLM’s) context window as a whole, or don’t want to. Instead, this simple text comprehension exercise is intended to showcase some core ideas and challenges of the retrieval part of a RAG system.

Chunking

The first important consideration is how to chunk your documents into smaller pieces of information. Although every new LLM with an even larger context window sparks claims about the death of RAG, studies such as “Lost in the Middle: How Language Models Use Long Context” have shown that passing entire documents to LLMs can reduce their effectiveness at answering a question.

For example, to answer the question, “Where is Lisa?”, you only require the information “Lisa is at the park.” The remaining text is irrelevant to the question.

Example of a question that only requires a piece of context that is one sentence long.

However, simply splitting the entire document into single sentences might not necessarily be the best approach because the context may contain too little information. Consider the next question, “What is Lisa’s dog’s name?” as an example.

Although the information about the name is available in the sentence “Her dog Bella, is with her.”, this sentence on its own doesn’t provide any information about who “her” refers to. In this case, it would be preferable to have a larger chunk size of two sentences, such that the retrieved context is “Lisa is at the park. Her dog Bella, is with her.”

Example of why you could need longer document chunks for RAG.

As you can see, the chunk size is a parameter you should tweak to improve your RAG system’s performance. You can read more about this in my guide for production-ready RAG applications.

Additionally, more advanced chunking and retrieval strategies have already emerged. For example, in the sentence window retrieval technique, a document is split into single sentences but stored together with a larger context window, which is used to replace the single sentence after retrieval to provide more context.
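For intuition, here is a minimal sketch of the sentence window idea, reusing the text variable defined above. It only illustrates the concept and is not the implementation offered by frameworks such as LlamaIndex:

# A minimal sketch of the sentence window idea: each sentence is indexed on
# its own but stored together with its neighboring sentences, which replace
# it after retrieval to provide more context.
sentences = [s.strip() + "." for s in text.replace("\n", " ").split(".") if s.strip()]

window_size = 1  # number of neighboring sentences to keep on each side
sentence_windows = []
for i, sentence in enumerate(sentences):
    window = sentences[max(0, i - window_size): i + window_size + 1]
    sentence_windows.append({
        "sentence": sentence,        # what gets embedded and searched
        "window": " ".join(window),  # what gets passed to the LLM after retrieval
    })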

As you can see, there are many different techniques you can experiment with for partitioning your documents. Popular orchestration frameworks, such as LangChain and LlamaIndex, offer various chunking strategies out of the box. But what is the best chunking strategy? This is a topic on its own, as there are many different approaches.

Luckily, you don’t need to perfect the chunking step: as we saw, state-of-the-art LLMs such as gpt-3.5-turbo are able to handle larger contexts containing irrelevant information. For this example, splitting the text into chunks of two sentences would probably be sufficient, as sketched below.
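As a minimal sketch (splitting naively on periods and reusing the text variable from above), such a two-sentence chunking could look like this:

# Split the example text into sentences, then group them into chunks of two
sentences = [s.strip() + "." for s in text.replace("\n", " ").split(".") if s.strip()]

chunk_size = 2  # number of sentences per chunk
chunks = [
    " ".join(sentences[i:i + chunk_size])
    for i in range(0, len(sentences), chunk_size)
]

for chunk in chunks:
    print(chunk)
# Lisa is at the park. Her dog Bella, is with her.
# Lisa rides her bike and plays with Bella. They race each other in the sun.
# Then Lisa goes to the pond to see the ducks. She thinks they are so cute and funny.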

Retrieval

The next big question you must answer when building a RAG pipeline is: How many chunks should you retrieve?

While the question “What is Lisa’s dog’s name?” only requires one piece of context,

Context 1: "Lisa is at the park. Her dog Bella, is with her."

the question “What does Lisa do in the park?” would require two pieces of context, under the assumption that we’ve chunked the text into chunks of two sentences:

Context 1: "Lisa rides her bike and plays with Bella. They race each other in the sun."
Context 2: "Then Lisa goes to the pond to see the ducks. She thinks they are so cute and funny."

An example for why you would need to balance chunk size and number of retrieved contexts in RAG

As you can see, answering the question of how many contexts to retrieve is not trivial. It depends on the question, the available information, and the chunk size. Ideally, you should run a few experiments to find the best balance between your chunk size and the number of contexts to retrieve.

Evaluation

When you run experiments to find the best settings for your RAG pipeline, you need to quantify your system’s performance to judge whether your current experiment is better than your baseline. This section discusses different metrics related to context retrieval in RAG applications. Specifically, it discusses how you can quantify the relevance of a retrieved piece of context to the user input.

Similarity or distance metrics

While context can be stored in and retrieved from different types of databases, the retrieval component most commonly refers to retrieving context with similarity search.

For this, the document chunks are embedded, i.e., converted to vector embeddings with so-called embedding models. At query time, the same is done with the search query. Because vector embeddings capture semantic meaning numerically, you can retrieve data objects similar to the search query by retrieving the data objects closest to it in vector space. The proximity is calculated with common similarity or distance metrics in vector search, such as cosine similarity, dot product, or L1 and L2 distance.
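As an illustration (not the exact code used to produce the figures in this article), the following sketch embeds the two-sentence chunks from the chunking sketch above with OpenAI’s text-embedding-3-small model and ranks them by cosine similarity to a question. Note that the figures below report cosine distance, which is 1 minus cosine similarity:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Embed a list of strings with OpenAI's text-embedding-3-small model
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question = "Why does Lisa go to the pond?"
chunk_vectors = embed(chunks)        # two-sentence chunks from the sketch above
query_vector = embed([question])[0]

# Rank the chunks from most to least similar to the question
ranked = sorted(
    zip(chunks, (cosine_similarity(v, query_vector) for v in chunk_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)
for chunk, similarity in ranked:
    print(f"{similarity:.2f}  {chunk}")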

Below, you can see a 2-dimensional visualization (using PCA) of the questions and document chunks embedded with OpenAI’s text-embedding-3-small embedding model. In this vector space, you can see that the closest context chunk to the question “Why does Lisa go to the pond?” is “Then Lisa goes to the pond to see the ducks. She thinks they are so cute and funny.”, which would be the most relevant context for the user query.

2D PCA visualization of question and document chunk embeddings in vector space

However, “similar” does not necessarily always mean “relevant”. For example, the calculated cosine distances of the document chunks to the question “What does Lisa do in the park?” are shown below:

What does Lisa do in the park?

1. Lisa is at the park. Her dog Bella, is with her. (Distance: 0.40)
2. Then Lisa goes to the pond to see the ducks. She thinks they are so cute and funny. (Distance: 0.45)
3. Lisa rides her bike and plays with Bella. They race each other in the sun. (Distance: 0.51)

You can see that the closest or most similar context doesn’t answer the question at all and is thus irrelevant to it.

Thus, a distance metric alone is insufficient to measure how relevant a piece of context is.

Search and ranking metrics

One straightforward approach to evaluating the performance of your RAG system is to use classical search and ranking metrics, such as:

  • Precision@K: How many of the retrieved contexts are relevant?
  • Recall@K: How many of the total relevant contexts are retrieved? (Helpful for tuning how many contexts to retrieve.)
  • Mean Reciprocal Rank (MRR): How well does the system place the first relevant result in the ranked list? (Great if you know there is only one relevant context.)
  • Normalized Discounted Cumulative Gain (NDCG): Considers the relevance of all results and their positions. (Most popular in search and recommendation systems.)

While these are proven and recommended metrics, one disadvantage is that they require ground truth labels: Is a context relevant or not? In the case of the NDCG metric, you even need some granularity of relevance (How relevant is a context? On what scale do you measure relevance? “not relevant”, “somewhat relevant”, and “relevant”, or a score from 0 to 10?). Collecting ground truth labels for your validation dataset can become expensive as your dataset grows.
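To make the label requirement concrete, here is a minimal sketch of Precision@K and Recall@K for the question “What does Lisa do in the park?”, assuming the three two-sentence chunks are retrieved in the order of the cosine distances shown above and that a human has labeled the two contexts describing Lisa’s activities as relevant:

# Minimal sketch of Precision@K and Recall@K, which both require ground truth
# labels for which contexts are relevant to a question
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved contexts that are relevant
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant contexts that appear in the top-k results
    return sum(1 for c in relevant if c in retrieved[:k]) / len(relevant)

# Retrieval result for "What does Lisa do in the park?", ordered by distance
retrieved = [
    "Lisa is at the park. Her dog Bella, is with her.",
    "Then Lisa goes to the pond to see the ducks. She thinks they are so cute and funny.",
    "Lisa rides her bike and plays with Bella. They race each other in the sun.",
]
relevant = retrieved[1:]  # the two contexts a human would label as relevant

print(precision_at_k(retrieved, relevant, k=2))  # 0.5
print(recall_at_k(retrieved, relevant, k=2))     # 0.5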

Context relevance metric in RAG evaluation frameworks

One interesting development in this space is happening in RAG evaluation frameworks: Many different RAG evaluation frameworks, such as Ragas, TruLens, and DeepEval, include a “reference-free metric” called context relevance.

The context relevance metric measures how relevant the provided context was to the user query. This is a “reference-free” metric in many RAG evaluation frameworks because it doesn’t require any ground truth labels but uses an LLM under the hood to calculate it.
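To build some intuition for what “uses an LLM under the hood” can mean, here is a rough, simplified sketch of such a reference-free score: an LLM is asked which sentences of the context are relevant to the question, and the fraction of relevant sentences is returned. This is not how Ragas, TruLens, or DeepEval actually implement the metric; it only illustrates the principle of replacing human labels with LLM judgments.

from openai import OpenAI

client = OpenAI()

def naive_context_relevance(question: str, context: str) -> float:
    # Split the context into sentences and ask the LLM which ones are relevant
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    prompt = (
        "Given the question below, reply with the numbers of the sentences "
        "that are relevant for answering it, separated by commas. "
        "Reply with 'none' if no sentence is relevant.\n\n"
        f"Question: {question}\n\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.lower()
    if "none" in answer:
        return 0.0
    relevant = {
        token.strip(".")
        for token in answer.replace(",", " ").split()
        if token.strip(".").isdigit()
    }
    return len(relevant) / len(sentences)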

As this metric is fairly new, context relevance is currently calculated differently in the different frameworks. This section aims to give you an initial intuition about the context relevance metric in Ragas, TruLens, and DeepEval.

Disclaimer: This section is not intended as an extensive comparison of the mentioned frameworks, nor is it intended to give a recommendation for which framework to use. It is merely intended to give an overview of the framework landscape and an intuition of the context relevancy metric.

For the following, we will use two small sets of examples. The first DataFrame contains the questions from the example worksheet with the contexts we have defined as relevant.

DataFrame of questions with their relevant contexts

The second DataFrame contains different contexts for the question “Why does Lisa go to the pond?”. The first three contexts all contain the relevant information but with varying amounts of “clutter” information. The last context is entirely irrelevant to the question.

DataFrame of contexts with varying relevance levels for a single question
import pandas as pd

# Replace with your own examples: one row per question, with the list of
# retrieved contexts and the generated answer for that question
df = pd.DataFrame({'question': [...],
                   'contexts': [[...], ...],
                   'answers': [...]})

Also, ensure you have your OpenAI API key set in the environment variables since all of the following frameworks use OpenAI’s LLMs to evaluate the context relevancy metric.

import os
os.environ["OPENAI_API_KEY"] = "sk..."

Ragas uses the following formula to calculate context relevancy, where S is the set of sentences in the retrieved context that the LLM judges as relevant to the question:

\[ \text{context relevancy} = \dfrac{|S|}{\text{Total number of sentences in the retrieved context}} \]
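As a hypothetical worked example (the actual sentence judgments depend on the underlying LLM): if, for the question “What is Lisa’s dog’s name?”, the retrieved context is “Lisa is at the park. Her dog Bella, is with her.” and the LLM marks only the second of the two sentences as relevant, the result is:

\[ \text{context relevancy} = \dfrac{1}{2} = 0.5 \]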

You can calculate the context relevancy metric in Ragas using the following code. The framework requires the fields question and contexts to calculate context relevancy.

# !pip install ragas
from datasets import Dataset
from ragas.metrics import ContextRelevancy

# Bring dataframe into right format
dataset = Dataset.from_pandas(df)

# Set up context relevancy metric
context_relevancy = ContextRelevancy()

# Calculate metric
results = context_relevancy.score(dataset)

Below, you can see the resulting context relevancy metric for the contexts we’ve marked as relevant.

Ragas context relevancy scores for relevant contexts

The context relevancy scores range between 0.5 and 1, although we’ve categorized all of these pieces of context as relevant. Because Ragas divides the number of sentences it deems relevant by the total number of sentences in the provided context, the score drops quickly whenever the underlying LLM doesn’t judge a sentence as relevant.

Below, you can see the resulting context relevancy metric for the irrelevant contexts.

Ragas context relevancy scores for irrelevant contexts

As you can see, for the first three examples, context relevancy decreases for the contexts that contain more cluttered information. Also, the last example with the entirely irrelevant context has a context relevancy score of 0.

DeepEval uses the following formula to calculate context relevancy:

\[ \text{Contextual relevancy} = \dfrac{\text{Number of relevant statements}}{\text{Total number of statements}} \]
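Since this formula counts LLM-extracted statements rather than sentences, its granularity can differ from Ragas. As a hypothetical illustration: if the LLM extracts four statements from the retrieved context and judges two of them relevant to the question, the result is:

\[ \text{Contextual relevancy} = \dfrac{2}{4} = 0.5 \]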

You can calculate the context relevancy metric in DeepEval using the following code. The framework requires the fields question, contexts, and answers to calculate context relevancy.

# ! pip install deepeval
from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

context_relevancy = []

for _, row in df.iterrows():
    # Define metric
    metric = ContextualRelevancyMetric(
        threshold=0.5,
        model="gpt-3.5-turbo",  # alternatively use "gpt-4"
    )

    # Define test case
    test_case = LLMTestCase(
        input=row.question,
        actual_output=row.answers,
        retrieval_context=row.contexts.tolist(),
    )

    # Calculate metric
    metric.measure(test_case)

    context_relevancy.append(metric.score)

Below, you can see the resulting context relevancy metric for the contexts we’ve marked as relevant.

DeepEval context relevancy scores for relevant contexts

You can see that most of the results have a context relevancy of 1, and only one example has a context relevancy of 0.5.

Below, you can see the resulting context relevancy metric for the irrelevant contexts.

DeepEval context relevancy scores for irrelevant contexts

Interestingly, we don’t see any difference between the context relevancy score for the examples with added clutter information. However, the last example with an entirely irrelevant context has a context relevancy score of 0.

As far as I know, TruLens doesn’t provide any details in its documentation about how context relevancy is calculated at the time of this writing.

You can calculate the context relevancy metric in TruLens using the following code. The framework requires the fields question, contexts, and answers to calculate context relevancy.

# !pip install trulens_eval
from trulens_eval import Select, Tru
from trulens_eval.tru_virtual import VirtualApp, TruVirtual, VirtualRecord
from trulens_eval.feedback.provider import OpenAI
from trulens_eval.feedback.feedback import Feedback

# Define virtual app
virtual_app = VirtualApp()
retriever = Select.RecordCalls.retriever
virtual_app[retriever] = "retriever"

# Initialize provider class
provider = OpenAI()

# Define context_call
context_call = retriever.get_context # The selector for a presumed context retrieval component's call to `get_context`.
context = context_call.rets[:] # Select context to be used in feedback. We select the return values of the virtual `get_context` call in the virtual `retriever` component.

# Define feedback function for context relevance
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
)

# Define virtual recorder
virtual_recorder = TruVirtual(
    app_id="SmallTests",
    app=virtual_app,
    feedbacks=[f_context_relevance],
)

# Define test cases
for _, row in df.iterrows():
    record = VirtualRecord(
        main_input=row.question,
        main_output=row.answers,
        calls={
            context_call: dict(
                rets=row.contexts.tolist()
            ),
        },
    )

    virtual_recorder.add_record(record)

# Calculate metric
tru = Tru()
tru.run_dashboard(force=True)
tru.start_evaluator()

Note that TruLens provides the results in a dashboard. The following results are copied back from the dashboard into the DataFrame to keep the result visualization consistent with the other frameworks in this article.

Below, you can see the resulting context relevancy metric for the contexts we’ve marked as relevant.

TruLens context relevancy scores for relevant contexts

You can see that TruLens scores all the relevant contexts with a high score between 0.85 and 1.

Below, you can see the resulting context relevancy metric for the irrelevant contexts.

TruLens context relevancy scores for irrelevant contexts

Interestingly, TruLens doesn’t seem to penalize the additional clutter information: All contexts containing the relevant information score 0.9 or higher. What’s also interesting is that, in contrast to Ragas and DeepEval, TruLens scores the context that doesn’t contain the information needed to answer the question with a relatively high 0.4.

Summary

This article explored the various challenges of retrieving contexts for your RAG application and evaluating their relevance to the question. To showcase the challenges, it explored a simple grade 1 text comprehension exercise with six sentences and four questions.

In the indexing stage, you must decide on a suitable chunking strategy and size for the context pieces. Additionally, you need to balance the chunk size and the number of contexts to retrieve in the retrieval stage. Finally, in the evaluation stage, you must decide on a suitable metric to evaluate the quality of the retrieved contexts.

Additionally, this article explored the reference-free “context relevancy” metric, which is available in many modern RAG evaluation frameworks. The frameworks Ragas, TruLens, and DeepEval were explored for two sets of examples: How do these frameworks score contexts identified as relevant by a human, and how do they score contexts identified as irrelevant by a human?


This blog was originally published on Towards Data Science on Jun 10, 2024 and moved to this site on Jan 31, 2026.
