RAGChain Docs

Text Splitter

Documentation for Text Splitter Module

Last updated 1 year ago

Overview

The Text Splitter module is a core component of our framework, designed to handle large volumes of text data. It divides the contents of each loaded Document into manageable segments and returns them as a list of Passage objects. This step is essential in the RAG (Retrieval-Augmented Generation) workflow because of the token limits imposed by Large Language Models (LLMs).

Given that not all content within a document is useful or relevant for answering questions, it becomes necessary to split documents into smaller passages. These passages can then be analyzed and retrieved more efficiently when providing responses.
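To make the Document-to-Passage idea concrete, here is a minimal sketch of fixed-size splitting with overlap. The `Document` and `Passage` classes, the `split_document` function, and the `chunk_size` and `overlap` parameters are hypothetical stand-ins written for this illustration; they are not RAGchain's actual classes or splitting logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document (not the framework's class)."""
    page_content: str
    metadata: dict


@dataclass
class Passage:
    """Hypothetical stand-in holding only the fields this sketch needs."""
    content: str
    filepath: str


def split_document(document: Document, chunk_size: int = 40, overlap: int = 10) -> List[Passage]:
    """Split a document into character chunks of at most chunk_size,
    where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    text = document.page_content
    step = chunk_size - overlap
    passages: List[Passage] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        passages.append(Passage(content=chunk, filepath=document.metadata["source"]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return passages


doc = Document(page_content="a" * 100, metadata={"source": "example.txt"})
passages = split_document(doc, chunk_size=40, overlap=10)
```

Real splitters usually cut on separators, headers, or tokens rather than raw character offsets; the character-based rule here only keeps the sketch short.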

Please note that our Text Splitter is not compatible with LangChain's text splitters. We are currently implementing our own versions of all of LangChain's text splitters.

Document to Passage Conversion

The Passage schema has many fields. You must set the 'source' key in the Document metadata; its value is assigned to the Passage's filepath field.

You can also set the content_datetime field in the Document metadata, using either a datetime.datetime object or a str in YYYY-MM-DD HH:MM:SS format. Its value is assigned to the Passage's content_datetime field.

In addition, you can set the importance field in the Document metadata. Its value is assigned to the Passage's importance field.
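The metadata mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: `PassageFields` and `passage_fields_from_metadata` are hypothetical names, and the fallbacks for a missing content_datetime (`None`) and a missing importance (`0`) are assumptions, since the text does not state the defaults.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Format named in the text: YYYY-MM-DD HH:MM:SS
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class PassageFields:
    """Hypothetical container for the three Passage fields discussed here."""
    filepath: str
    content_datetime: Optional[datetime]
    importance: int


def passage_fields_from_metadata(metadata: dict) -> PassageFields:
    """Map Document metadata keys onto the corresponding Passage fields."""
    if "source" not in metadata:
        raise ValueError("Document metadata must contain a 'source' key")
    raw = metadata.get("content_datetime")
    if isinstance(raw, str):
        # Strings are parsed with the documented format.
        content_datetime = datetime.strptime(raw, DATETIME_FORMAT)
    else:
        # Already a datetime, or None when the key is absent (assumed fallback).
        content_datetime = raw
    return PassageFields(
        filepath=metadata["source"],
        content_datetime=content_datetime,
        importance=metadata.get("importance", 0),  # default of 0 is an assumption
    )


fields = passage_fields_from_metadata(
    {"source": "report.pdf", "content_datetime": "2023-10-01 09:30:00", "importance": 2}
)
```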

Supported Text Splitters

  • Recursive Text Splitter
  • Markdown Header Splitter
  • HTML Header Splitter
  • Code Splitter
  • Token Splitter

Roles of the Text Splitter in the Framework

The primary role of the Text Splitter module is to break large Document contents into smaller Passage objects, which makes large amounts of data easier to manage and process.

This matters most in RAG workflows, where LLMs have fixed token limits. Under those limits, using entire documents without filtering or splitting can lead to inefficiency or inaccuracy in retrieval and question answering.

Moreover, because documents often contain irrelevant information ('noise'), splitting them into small passages helps isolate and retrieve the information that is actually relevant to answering a question accurately and promptly.
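The token-limit constraint mentioned in this section can be illustrated with a small sketch: given passages already ranked by relevance, keep as many as fit in a fixed token budget. The `count_tokens` and `fit_to_budget` helpers are hypothetical, and counting whitespace-separated words is only a rough assumption; real LLM tokenizers count differently.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def fit_to_budget(passages: List[str], max_tokens: int) -> List[str]:
    """Keep passages in ranked order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > max_tokens:
            break
        selected.append(passage)
        used += cost
    return selected


ranked = [
    "splitters return passages",
    "each passage is small",
    "this one will not fit in the budget",
]
context = fit_to_budget(ranked, max_tokens=8)
```

With small passages, several relevant pieces fit in the budget; a single unsplit document of the same total length would either fit whole or not at all.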