
File Cache

The purpose of FileCache is to remove duplicate documents: if you ingest the same files more than once, duplicate documents will be loaded.

FileCache is a util that checks the DB for files that have already been ingested and filters out the corresponding duplicate documents.

Usage

First, import FileCache along with the other classes used in this example.

from typing import List

from RAGchain.utils.file_cache import FileCache
from RAGchain.DB import PickleDB
from RAGchain.schema import Passage
from langchain.schema import Document

To illustrate, we will intentionally save passages to the DB whose filepaths overlap with the sources of the documents we are about to load.

test_passages: List[Passage] = [
    Passage(content="test1", filepath="test1"),
    Passage(content="test2", filepath="test2"),
    Passage(content="test3", filepath="test2")
]

test_documents: List[Document] = [
    Document(page_content="ttt1211", metadata={"source": "test1"}),
    Document(page_content="asdf", metadata={"source": "test2"}),
    Document(page_content="hgh", metadata={"source": "test3"}),
    Document(page_content="egrgfg", metadata={"source": "test4"}),
    Document(page_content="hhhh", metadata={"source": "test4"}),
]

Create an instance, passing your DB as a parameter. In this example, we use PickleDB.

db = PickleDB(save_path="your-pickle-path.pkl")
db.create_or_load()
db.save(test_passages)
file_cache = FileCache(db)

Then use delete_duplicate() to detect which files are already saved in the DB. It returns a List[Document] with the duplicate documents removed.

test_documents = file_cache.delete_duplicate(test_documents)
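
The idea behind delete_duplicate is to compare each Document's source metadata against the filepaths already stored in the DB and keep only documents from files that have not been ingested yet. Here is a minimal self-contained sketch of that logic, using simplified stand-ins for RAGchain's Passage and LangChain's Document (not the real classes), so the expected result of the example above is easy to see:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical simplified stand-ins, for illustration only.
@dataclass
class Passage:
    content: str
    filepath: str

@dataclass
class Document:
    page_content: str
    metadata: dict

def delete_duplicate(saved: List[Passage], docs: List[Document]) -> List[Document]:
    # Filepaths that are already ingested into the DB.
    known = {p.filepath for p in saved}
    # Keep only documents whose source has not been saved yet.
    return [d for d in docs if d.metadata.get("source") not in known]

saved = [
    Passage(content="test1", filepath="test1"),
    Passage(content="test2", filepath="test2"),
    Passage(content="test3", filepath="test2"),
]
docs = [
    Document(page_content="ttt1211", metadata={"source": "test1"}),
    Document(page_content="asdf", metadata={"source": "test2"}),
    Document(page_content="hgh", metadata={"source": "test3"}),
    Document(page_content="egrgfg", metadata={"source": "test4"}),
    Document(page_content="hhhh", metadata={"source": "test4"}),
]
remaining = delete_duplicate(saved, docs)
# Sources "test1" and "test2" are already in the DB, so only the
# "test3" and "test4" documents remain.
```

With the example data above, the first two documents are filtered out because passages with filepaths "test1" and "test2" were already saved; the three documents with sources "test3" and "test4" survive.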

