How to evaluate your RAG pipeline using synthetic data

Evaluating LLM applications, especially those using RAG (Retrieval-Augmented Generation), is critical but often overlooked. Without proper evaluation, it is nearly impossible to confirm whether your system’s retriever is surfacing the right documents, whether the LLM’s answers are grounded in those sources or hallucinated, and whether the context size is appropriate.

Since initial testing lacks the real user data required for a baseline, a practical solution is to generate a synthetic evaluation dataset. This article shows you how to create these realistic test cases using DeepEval, an open-source framework that simplifies LLM evaluation and lets you benchmark your RAG pipeline before it goes live.

Install dependencies

!pip install deepeval chromadb tiktoken pandas

OpenAI API key

Because DeepEval relies on an external language model to compute its evaluation metrics, an OpenAI API key is required to run this tutorial.

  • If you are new to the OpenAI platform, you may need to add billing details and make a small minimum payment (usually $5) to fully activate your API access.
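
A minimal way to supply the key in a notebook session is to set the standard OPENAI_API_KEY environment variable, which both DeepEval and the OpenAI client read. The snippet below is a small sketch of that approach (the interactive prompt is just one option; you can also export the variable in your shell):

import os
from getpass import getpass

# Ask for the key interactively so it is not hard-coded in the notebook.
# DeepEval and the OpenAI client both pick up OPENAI_API_KEY automatically.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")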

Define text

In this step, we manually create a text variable that will serve as the source document for generating the synthetic evaluation data.

The text combines diverse factual content from a number of fields, including biology, physics, history, space exploration, environmental science, medicine, computing, and ancient civilizations, to ensure that the LLM has rich and varied material to work with.

DeepEval’s synthesizer will later:

  • Split this text into semantically coherent chunks,
  • Select meaningful contexts appropriate for question generation, and
  • Generate synthetic “golden” pairs (input, expected output) that simulate real user queries and ideal LLM responses.

After defining the text variable, we save it as a .txt file so that DeepEval can read and process it later. You can use any other text document of your choice, such as a Wikipedia article, research summary, or technical blog post, as long as it contains informative and well-structured content.

text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.

Moving to history, the Library of Alexandria was once the largest center of learning, but much of its collection was
lost in fires and wars, becoming a symbol of human curiosity and fragility. In space exploration, the Voyager 1 probe,
launched in 1977, has now left the solar system, carrying a golden record that captures sounds and images of Earth.

Closer to home, the Amazon rainforest produces roughly 20% of the world's oxygen, while coral reefs -- often called the
"rainforests of the sea" -- support nearly 25% of all marine life despite covering less than 1% of the ocean floor.

In medicine, MRI scanners use strong magnetic fields and radio waves
to generate detailed images of organs without harmful radiation.

In computing, Moore's Law observed that the number of transistors
on microchips doubles roughly every two years, though recent advances
in AI chips have shifted that trend.

The Mariana Trench is the deepest part of Earth's oceans,
reaching nearly 11,000 meters below sea level, deeper than Mount Everest is tall.

Ancient civilizations like the Sumerians and Egyptians invented
mathematical systems thousands of years before modern algebra emerged.
"""
with open("example.txt", "w") as f:
    f.write(text)

Generate synthetic evaluation data

In this code, we use the Synthesizer class from the DeepEval library to automatically generate synthetic evaluation data (also called goldens) from existing documents. The model “gpt-4.1-nano” was chosen for its lightweight nature. We provide the path to the document (example.txt) containing factual and descriptive content on topics as diverse as physics, ecology, and computing. The synthesizer processes this text to create meaningful question-answer pairs (goldens), which can later be used to test and benchmark LLM performance on comprehension or retrieval tasks.

The script successfully generated up to six synthetic goldens. The generated examples are very rich. For example, one input asks about “assessing the cognitive abilities of corvids in facial recognition tasks,” while another explores “the Amazon’s oxygen contribution and its role in the ecosystem.” Each golden contains a coherent expected answer and contextual snippets derived directly from the document, demonstrating how DeepEval automatically generates high-quality synthetic datasets for LLM evaluation.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4.1-nano")

# Generate synthetic goldens from your document
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# Print generated results
for golden in synthesizer.synthetic_goldens[:3]:
    print(golden, "\n")
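
If you want to inspect the goldens in tabular form, their input, expected_output, and context fields can be collected into a pandas DataFrame (pandas was installed earlier). This is a minimal sketch that assumes those field names on DeepEval’s Golden objects:

import pandas as pd

# Collect each golden's question, ideal answer, and source context.
rows = [
    {
        "input": golden.input,
        "expected_output": golden.expected_output,
        "context": " | ".join(golden.context or []),
    }
    for golden in synthesizer.synthetic_goldens
]

df = pd.DataFrame(rows)
print(df.head())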

Control input complexity using EvolutionConfig

In this step, we configure EvolutionConfig to influence how the DeepEval synthesizer generates more complex and diverse inputs. By assigning weights to different evolution types (REASONING, MULTICONTEXT, COMPARATIVE, HYPOTHETICAL, and IN_BREADTH), we guide the model to create questions that vary in reasoning style, context use, and depth.

The num_evolutions parameter specifies how many evolution steps are applied to each generated input, allowing multiple perspectives to be synthesized from the same source material. This approach helps produce a richer evaluation dataset for testing the LLM’s ability to handle nuanced and multifaceted queries.

The output demonstrates how this configuration affects the generated goldens. For example, one input asked about crows’ tool use and facial recognition, prompting the LLM to give a detailed answer covering problem solving and adaptive behavior. Another input compared Voyager 1’s golden record to the Library of Alexandria, requiring reasoning across multiple contexts and their historical significance.

Each golden includes the original context, the type of evolution applied (e.g., hypothetical, in-breadth, reasoning), and an overall quality score. Even with a single document, this evolution-based approach creates a diverse sample of high-quality synthetic test cases for evaluating LLM performance.

from deepeval.synthesizer.config import EvolutionConfig, Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])
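
To reuse the evolved goldens across experiments instead of regenerating them on every run, one option is to wrap them in DeepEval’s EvaluationDataset and save them to disk. The sketch below assumes the dataset’s save_as helper in your DeepEval version supports a JSON file type; the directory name is just an illustrative choice:

from deepeval.dataset import EvaluationDataset

# Wrap the evolved goldens so they can be stored and reloaded later.
dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)

# Persist the dataset to disk for later benchmarking runs.
dataset.save_as(file_type="json", directory="./synthetic_goldens")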

This ability to generate high-quality, complex synthetic data is how we get around the initial obstacle of having no real user interactions. By leveraging DeepEval’s synthesizer, especially under the guidance of EvolutionConfig, we go far beyond simple question-and-answer pairs.

The framework allows us to create rigorous test cases to explore the limitations of RAG systems, covering everything from multi-context comparisons and what-if scenarios to complex reasoning.

This rich, custom dataset provides a consistent and diverse baseline for benchmarking, allowing you to iterate continuously on the retrieval and generation components, building confidence in the underlying capabilities of your RAG pipeline and ensuring it delivers reliable performance before it handles its first live query.

The iterative RAG improvement loop described above uses DeepEval’s synthetic data to establish a continuous, rigorous testing cycle for your pipeline. By computing core RAG metrics such as faithfulness and contextual relevancy, you get the feedback needed to iteratively refine your retriever and generation components. This systematic process leaves you with a proven, high-confidence RAG system that remains reliable through deployment.
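
To make this loop concrete, each golden can be turned into a test case and scored against your own pipeline. The sketch below assumes a hypothetical my_rag_pipeline function that returns a generated answer along with the retrieved chunks, and uses DeepEval’s FaithfulnessMetric and AnswerRelevancyMetric to score groundedness and answer quality:

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = []
for golden in synthesizer.synthetic_goldens:
    # my_rag_pipeline is a placeholder for your own retriever + generator.
    answer, retrieved_chunks = my_rag_pipeline(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            expected_output=golden.expected_output,
            retrieval_context=retrieved_chunks,
        )
    )

# Score how grounded and relevant the pipeline's answers are on the synthetic set.
evaluate(test_cases=test_cases, metrics=[FaithfulnessMetric(), AnswerRelevancyMetric()])

Running this after each change to your chunking strategy, retriever, or prompts closes the feedback loop described above.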

