
# Benchmark

## Overview

The RAGchain `Benchmark` module is a tool for evaluating Retrieval-Augmented Generation (RAG) workflows. It lets developers measure their pipeline's performance on different datasets or on their own questions.

## Supporting Evaluators

The `Benchmark` module supports two types of evaluators: AutoEvaluator and DatasetEvaluator.

* **AutoEvaluator:** This tool lets you evaluate your pipeline with your own questions, without a dataset. It can evaluate retrieved passages and answers without requiring ground-truth answers or ground-truth passages.
* **DatasetEvaluator:** This tool evaluates your pipeline against question-answering datasets.

## Supporting Datasets

* [StrategyQA](https://allenai.org/data/strategyqa)
* [Ko-StrategyQA](https://huggingface.co/datasets/NomaDamas/Ko-StrategyQA)
* [Qasper](https://allenai.org/data/qasper)
* [MS-MARCO](/ragchain-docs/ragchain-structure/benchmark/dataset-evaluator/ms-marco.md)
* [Mr-Tydi](https://github.com/NomaDamas/RAGchain-docs/blob/main/ragchain-structure/benchmark/dataset-evaluator/mr-tydi.md)

## Supporting Metrics

Each evaluator supports different metrics.

✅ means that the evaluator supports the metric. ❌ means that the evaluator does not support the metric. 🚧 means that the evaluator will support the metric in the future.

|                         | [AutoEvaluator](/ragchain-docs/ragchain-structure/benchmark/auto-evaluator.md) | [Qasper](/ragchain-docs/ragchain-structure/benchmark/dataset-evaluator/qasper.md) | [Ko-Strategy-QA](/ragchain-docs/ragchain-structure/benchmark/dataset-evaluator/ko-strategy-qa.md) | [Strategy-QA](/ragchain-docs/ragchain-structure/benchmark/dataset-evaluator/strategy-qa.md) | [ms-marco](/ragchain-docs/ragchain-structure/benchmark/dataset-evaluator/ms-marco.md) | [mr-tydi](https://github.com/NomaDamas/RAGchain-docs/blob/main/ragchain-structure/benchmark/dataset-evaluator/mr-tydi.md) |
| :---------------------: | :----------------------------------------------------------------------------: | :-------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------: |
|            AP           |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|           NDCG          |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|            CG           |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|         Ind\_DCG        |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|           IDCG          |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|          Recall         |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|        Precision        |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|            RR           |                                        ❌                                       |                                         ❌                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|           Hole          |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|         Accuracy        |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|            EM           |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|            F1           |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|   ragas context-recall  |                                        ❌                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
| ragas context-precision |                                        ✅                                       |                                         ✅                                         |                                                 ✅                                                 |                                              ✅                                              |                                           ✅                                           |                                                             ✅                                                             |
|  ragas answer-relevancy |                                        ✅                                       |                                         ✅                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|    ragas faithfulness   |                                        ✅                                       |                                         ✅                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|           BLEU          |                                        ❌                                       |                                         ✅                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
|         ROUGE-L         |                                       🚧                                       |                                         🚧                                        |                                                 🚧                                                |                                              🚧                                             |                                           🚧                                          |                                                             🚧                                                            |
|           KF1           |                                        ❌                                       |                                         ✅                                         |                                                 ❌                                                 |                                              ❌                                              |                                           ✅                                           |                                                             ❌                                                             |
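To make the retrieval metrics in the table concrete, here is a minimal sketch of how precision, recall, and reciprocal rank (RR) can be computed for a single query. It is plain Python written for illustration, not RAGchain's own implementation; the function name `precision_recall_rr` and the passage IDs are hypothetical.

```python
from typing import Dict, Sequence, Set


def precision_recall_rr(retrieved: Sequence[str], relevant: Set[str]) -> Dict[str, float]:
    """Compute precision, recall, and reciprocal rank for one query.

    `retrieved` is the ranked list of passage IDs returned by a retriever;
    `relevant` is the set of ground-truth passage IDs. Both are illustrative
    inputs, not RAGchain data structures.
    """
    hits = [pid for pid in retrieved if pid in relevant]
    # Precision: share of retrieved passages that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant passages that were retrieved.
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    # Reciprocal rank: 1 / (1-based position of the first relevant passage).
    rr = 0.0
    for rank, pid in enumerate(retrieved, start=1):
        if pid in relevant:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "rr": rr}


scores = precision_recall_rr(["p3", "p1", "p7", "p2"], {"p1", "p2", "p9"})
print(scores)
```

In this example two of the four retrieved passages are relevant, two of the three relevant passages are found, and the first relevant passage sits at rank 2.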

## Role of the Evaluator in the Framework

The `Benchmark` module evaluates the performance of a RAG workflow based on its generated answers and retrieved passages. Performance is measured using more than 10 metrics. The evaluators give a holistic view of the workflow's performance and help identify areas for improvement.
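On the answer side, metrics such as EM and F1 compare a generated answer with a ground-truth answer. The sketch below shows one common way to compute them, assuming a deliberately simple normalization (lowercasing and whitespace splitting). The helper names are hypothetical, and RAGchain's actual normalization and scoring may differ.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, truth: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(truth)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the capital is Paris", "Paris"))
```

EM rewards only answers that match completely, while token-level F1 gives partial credit to a longer answer that contains the correct tokens.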

## Advantages of Evaluators

* **Comprehensive Evaluation:** The evaluators assess the RAG workflow from multiple aspects, covering both retrieved passages and generated answers, to give a detailed view of its effectiveness.
* **Flexible:** They allow you to use your own questions or question answering datasets for evaluation. Plus, you can create your own Evaluator for new datasets.
* **Easy to Use:** The evaluators come with an easy-to-use interface. You only need to provide your pipeline, and the evaluators take care of the rest. You don't even have to download the datasets; the evaluators download them automatically.

