BM25 Retrieval

BM25Retrieval Class Documentation

Overview

The BM25Retrieval class performs BM25 retrieval. BM25 is the most popular TF-IDF-style ranking method for retrieval; it reflects how important a word is to a document. It is often called sparse retrieval, in contrast to dense retrieval, which uses an embedding model and similarity search. Dense retrieval finds passages by semantic similarity, while sparse retrieval relies on word counts. If your documents come from a specialized domain, BM25Retrieval can be more useful than VectorDBRetrieval.
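
For reference, this is the standard Okapi BM25 scoring function (the general definition, not code taken from RAGchain). Given a query $Q = (q_1, \dots, q_n)$ and a document $D$:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}$$

Here $f(q_i, D)$ is how often term $q_i$ appears in $D$, $|D|$ is the document length in words, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are free parameters (commonly $k_1 \in [1.2, 2.0]$ and $b = 0.75$). The IDF factor is what makes rare, domain-specific terms count heavily.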

It uses the BM25Okapi algorithm for scoring and ranking the passages. Support for additional BM25 scoring algorithms is planned (see the relevant issue on the project's GitHub).
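
To see what BM25Okapi scoring looks like in isolation, here is a minimal sketch using the rank_bm25 package, which provides a BM25Okapi implementation. The toy corpus and the whitespace tokenizer are illustrative only; RAGchain's BM25Retrieval manages tokenization and storage for you.

from rank_bm25 import BM25Okapi

# A toy corpus; inside BM25Retrieval this role is played by ingested passages.
corpus = [
    "This is the first passage.",
    "This is the second passage.",
    "This is the third passage.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized_corpus)

query = "first passage"
tokenized_query = query.lower().split()

# One BM25 score per document; higher means more relevant to the query.
scores = bm25.get_scores(tokenized_query)
print(scores)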

Usage

Initialize

To create an instance of the BM25Retrieval class, you need to provide the path to the saved BM25 data. The data should be in either .pkl or .pickle format.

You can also specify the name of the tokenizer to be used (optional). By default, the GPT tokenizer is used.

Here's an example of initializing the BM25Retrieval object:

from RAGchain.retrieval import BM25Retrieval

bm25_path = "path/to/bm25_retrieval.pkl"
bm25_retrieval = BM25Retrieval(save_path=bm25_path)

Please make sure to replace "path/to/bm25_retrieval.pkl" with the actual path where your BM25 data is saved, or where you want it to be saved.
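
If you need a specific tokenizer, you can pass its name at construction time. Note that the parameter name tokenizer_name below is an assumption based on the description above and is not confirmed by this page; check the BM25Retrieval signature in your installed version.

# tokenizer_name and "gpt2" are hypothetical values, not confirmed by this page.
bm25_retrieval = BM25Retrieval(save_path=bm25_path, tokenizer_name="gpt2")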

Ingest

Before you can retrieve passages, you need to ingest them into the BM25Retrieval object. Passages should be provided as a list of Passage objects.

Here's an example of ingesting passages:

from RAGchain.schema import Passage

passages = [
    Passage(id="passage_id_1", content="This is the first passage.", filepath="filepath"),
    Passage(id="passage_id_2", content="This is the second passage.", filepath="filepath"),
    Passage(id="passage_id_3", content="This is the third passage.", filepath="filepath")
]
bm25_retrieval.ingest(passages)

Retrieve

To retrieve relevant passages based on a query, use the retrieve method. You can specify the query and the number of top-k passages to retrieve. Here's an example:

query = "Test query?" # replace with your query 
top_k = 5
retrieved_passages = bm25_retrieval.retrieve(query, top_k)
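
The result is a list of Passage objects, so you can read their fields directly. A small usage sketch, assuming retrieve returns the same Passage objects you ingested:

# Print the id and content of each retrieved passage, best match first.
for passage in retrieved_passages:
    print(passage.id, passage.content)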

Retrieve with filter

You can also filter the retrieved passages. Use the retrieve_with_filter method and provide the query, top-k value, and a list of content, filepath, or metadata values to filter by.

Here's an example:

filtered_passages = bm25_retrieval.retrieve_with_filter(query, top_k, filepath=["filepath1", "filepath3"])
# Searches for the top-5 most similar passages whose filepath is "filepath1" or "filepath3"
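
You can filter by content or metadata in the same way; for example (the filter values here are illustrative):

# Only passages whose content matches one of the given values are considered.
filtered_by_content = bm25_retrieval.retrieve_with_filter(
    query, top_k, content=["This is the first passage."]
)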

Internally, this method uses the DB.search method. Please refer to the DB documentation for further information.
