BM25 Retrieval
BM25Retrieval Class Documentation
Overview
The BM25Retrieval class performs BM25 retrieval. BM25 is a widely used TF-IDF-style ranking function that scores how important a query term is to a document. This approach is often called sparse retrieval. It differs from dense retrieval, which uses an embedding model and similarity search: dense retrieval finds passages by semantic similarity, whereas sparse retrieval relies on word counts. If your documents come from a specific domain, BM25Retrieval can be more useful than VectorDBRetrieval.
It uses the BM25Okapi algorithm to score and rank passages. Additional algorithms for BM25Retrieval are planned (see the relevant issue).
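To make the scoring step concrete, here is a minimal, library-independent sketch of Okapi-style BM25 scoring. The function name `bm25_okapi_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for illustration; the scoring inside BM25Retrieval may differ in these details.

```python
import math
from collections import Counter


def bm25_okapi_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens.

    Illustrative sketch only: k1, b, and the IDF smoothing are assumptions.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    ["bm25", "ranks", "passages", "by", "term", "statistics"],
    ["dense", "retrieval", "uses", "embeddings"],
    ["sparse", "retrieval", "counts", "words"],
]
scores = bm25_okapi_scores(["bm25", "term"], corpus)
```

Only the first document contains the query terms, so it receives a positive score while the other two score zero. This is the word-count behavior that separates sparse retrieval from embedding-based search.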
Usage
Initialize
To create an instance of the BM25Retrieval class, you need to provide the path to the saved BM25 data. The data should be in either .pkl or .pickle format.
Optionally, you can also specify the name of the tokenizer to use. By default, the GPT tokenizer is used.
Here's an example of initializing the BM25Retrieval object:
Please make sure to replace "path/to/bm25_retrieval.pkl" with the actual path where you want to save the BM25 retrieval data.
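The save-path requirement can be sketched without the library. The helpers `is_valid_save_path` and `load_or_init_bm25_data`, and the dictionary layout they return, are hypothetical names introduced for illustration; they are not part of the BM25Retrieval API.

```python
import pickle
from pathlib import Path

# The two file formats named in this section.
ALLOWED_SUFFIXES = {".pkl", ".pickle"}


def is_valid_save_path(save_path):
    """Return True if the path ends with .pkl or .pickle."""
    return Path(save_path).suffix in ALLOWED_SUFFIXES


def load_or_init_bm25_data(save_path):
    """Load pickled data if the file exists, otherwise start with an empty store."""
    if not is_valid_save_path(save_path):
        raise ValueError(f"save_path must end with .pkl or .pickle: {save_path}")
    path = Path(save_path)
    if path.is_file():
        with path.open("rb") as f:
            return pickle.load(f)
    # Hypothetical empty layout: tokenized passages plus their IDs.
    return {"tokens": [], "passage_id": []}
```

Checking the extension up front means a mistyped path fails early instead of after passages have been ingested.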
Ingest
Before you can retrieve passages, you need to ingest them into the BM25Retrieval object. Passages should be provided as a list of Passage objects.
Here's an example of ingesting passages:
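As a rough, library-independent sketch of what ingestion involves: each passage is tokenized and stored next to its ID so it can be scored later. The `Passage` dataclass below is a hypothetical stand-in with assumed fields, not the library's Passage class, and the whitespace tokenizer is a placeholder for the GPT tokenizer mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """Hypothetical stand-in for illustration; field names are assumptions."""
    id: str
    content: str
    filepath: str


def tokenize(text):
    # Placeholder tokenizer: lowercase whitespace split, not a GPT tokenizer.
    return text.lower().split()


def ingest(store, passages):
    """Append tokenized passages and their IDs to an in-memory store."""
    for passage in passages:
        store["tokens"].append(tokenize(passage.content))
        store["passage_id"].append(passage.id)
    return store


store = ingest(
    {"tokens": [], "passage_id": []},
    [
        Passage(id="p1", content="BM25 ranks passages by term statistics", filepath="docs/a.txt"),
        Passage(id="p2", content="Dense retrieval uses embeddings", filepath="docs/b.txt"),
    ],
)
```

In a persistent setup, a store like this would then be written to the .pkl or .pickle file given at initialization.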
Retrieve
To retrieve relevant passages based on a query, use the retrieve method. You can specify the query and the number of top-k passages to retrieve.
Here's an example:
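The top-k step itself can be sketched independently of the library: given one score per ingested passage, keep the k highest. The helper `top_k_passages`, its default of 5, and the scores used below are hypothetical and only illustrate the selection logic.

```python
import heapq


def top_k_passages(passage_ids, scores, top_k=5):
    """Return up to top_k (passage_id, score) pairs, highest score first."""
    ranked = heapq.nlargest(top_k, zip(scores, passage_ids), key=lambda pair: pair[0])
    return [(passage_id, score) for score, passage_id in ranked]


# Made-up scores for three passages, for illustration only.
result = top_k_passages(["p1", "p2", "p3"], [0.2, 1.7, 0.9], top_k=2)
```

If top_k exceeds the number of ingested passages, this sketch simply returns all of them in ranked order.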
Retrieve with filter
You can also filter the retrieved passages. Use the retrieve_with_filter method and provide the query, the top-k value, and a list of content, filepath, or metadata values to filter by.
This method uses the DB.search method. Please refer here for further information.
Here's an example:
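To illustrate the filtering idea without the library, the sketch below applies content and filepath filters to an already ranked list. The function `filter_passages`, the dictionary-based passages, and the choice to filter after ranking are all assumptions; this section does not specify how retrieve_with_filter combines ranking with DB.search, and metadata filtering is left out for brevity.

```python
def filter_passages(passages, content=None, filepath=None):
    """Keep passages matching every provided filter (None means no constraint)."""
    kept = []
    for passage in passages:
        if content is not None and passage["content"] not in content:
            continue
        if filepath is not None and passage["filepath"] not in filepath:
            continue
        kept.append(passage)
    return kept


# Hypothetical ranked results, best match first.
ranked = [
    {"id": "p2", "content": "Dense retrieval uses embeddings", "filepath": "docs/b.txt"},
    {"id": "p3", "content": "Sparse retrieval counts words", "filepath": "docs/c.txt"},
    {"id": "p1", "content": "BM25 ranks passages by term statistics", "filepath": "docs/a.txt"},
]
filtered = filter_passages(ranked, filepath=["docs/a.txt", "docs/c.txt"])
```

The filter preserves the ranked order, so the surviving passages remain sorted by relevance.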