File Cache

The purpose of FileCache is to remove duplicate documents since users are likely to load duplicate document files if they ingest files multiple times.

The FileCache is a util that checks the DB for duplicate file check files.

Usage

At first, import FileCache.

from RAGchain.utils.file_cache import FileCache
from RAGchain.DB import PickleDB
from langchain.schema import Document

We will intentionally generate db saved duplicate files to illustrate.

test_passages: List[Passage] = [
    Passage(content="test1", filepath="test1"),
    Passage(content="test2", filepath="test2"),
    Passage(content="test3", filepath="test2")
]

test_documents: List[Document] = [
    Document(page_content="ttt1211", metadata={"source": "test1"}),
    Document(page_content="asdf", metadata={"source": "test2"}),
    Document(page_content="hgh", metadata={"source": "test3"}),
    Document(page_content="egrgfg", metadata={"source": "test4"}),
    Document(page_content="hhhh", metadata={"source": "test4"}),
]

Create an instance and input parameter. At this example, we use PickleDB.

db = PickleDB(save_path = "your-pickle-path.pkl"))
db.creat_or_load()
db.save(test_passages)
file_cache = FileCache(db)

And then, use delete_duplicate() to detect what file is already saved in DB. You can get List[Document] with duplicate document removed!

test_documents = file_cache.delete_duplicate(test_documents)

Last updated