Text Splitter

Documentation for Text Splitter Module

Overview

The Text Splitter module is a core component of our framework, designed to handle large volumes of text data. It divides the contents of loaded Documents into manageable segments and returns a list of Passage objects. This step is essential in the RAG (Retrieval-Augmented Generation) workflow because Large Language Models (LLMs) have token limits.

Given that not all content within a document is useful or relevant for answering questions, it becomes necessary to split documents into smaller passages. These passages can then be analyzed and retrieved more efficiently when providing responses.
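
As a rough illustration of the idea, the sketch below splits a document's text into fixed-size, overlapping character windows. The Document and Passage dataclasses, the split_document function, and the chunk_size and chunk_overlap parameters are hypothetical stand-ins chosen for this example; they are not the framework's actual classes or API, and real splitters may cut on tokens, sentences, or separators instead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Hypothetical stand-in for a loaded document.
    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Passage:
    # Hypothetical stand-in holding one chunk of a document.
    content: str
    filepath: str


def split_document(doc: Document, chunk_size: int = 200, chunk_overlap: int = 20) -> List[Passage]:
    """Split a document into overlapping character windows (illustrative only)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    text = doc.page_content
    passages: List[Passage] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        passages.append(Passage(content=chunk, filepath=doc.metadata["source"]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return passages
```

With these assumed defaults, each passage shares its first 20 characters with the end of the previous one, so a sentence that straddles a boundary still appears whole in at least one passage when it is short enough.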

Please note that our Text Splitter is not compatible with LangChain's text splitters. We are currently implementing all of LangChain's text splitters.

Document to Passage Conversion

The Passage schema has many fields. You must set the 'source' key in the Document metadata; its value is assigned to the Passage's filepath field.

You can also set the content_datetime field in the Document metadata, using either a datetime.datetime object or a str in YYYY-MM-DD HH:MM:SS format. Its value is assigned to the Passage's content_datetime field.

In addition, you can set the importance field in the Document metadata. Its value is assigned to the Passage's importance field.
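
The mapping described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: PassageFields and passage_fields_from_metadata are hypothetical names, and both the fallback to None for missing optional keys and the int type for importance are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # the YYYY-MM-DD HH:MM:SS format


@dataclass
class PassageFields:
    # Hypothetical container for the three Passage fields discussed above.
    filepath: str
    content_datetime: Optional[datetime]
    importance: Optional[int]  # the int type is an assumption


def passage_fields_from_metadata(metadata: dict) -> PassageFields:
    """Derive Passage field values from Document metadata (illustrative only)."""
    if "source" not in metadata:
        raise KeyError("Document metadata must contain a 'source' key")
    content_datetime = metadata.get("content_datetime")
    if isinstance(content_datetime, str):
        # Accept the string form and normalize it to a datetime object.
        content_datetime = datetime.strptime(content_datetime, DATETIME_FORMAT)
    return PassageFields(
        filepath=metadata["source"],
        content_datetime=content_datetime,
        importance=metadata.get("importance"),  # None when absent (assumption)
    )


fields = passage_fields_from_metadata(
    {"source": "docs/report.txt", "content_datetime": "2023-09-01 12:30:00", "importance": 2}
)
```

Here the string timestamp is parsed into a datetime.datetime, while a datetime.datetime supplied directly would be passed through unchanged.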

Supported Text Splitters

Roles of the Text Splitter in the Framework

The primary role of the Text Splitter module within our framework is to break down extensive Document contents into smaller Passage objects. This lets the framework manage and process large amounts of data more efficiently.

This is particularly important in RAG workflows, where LLMs have token limits. Under these constraints, passing entire documents to the model without filtering or splitting can lead to inefficient or inaccurate information retrieval and question answering.

Moreover, because many documents contain irrelevant information, or 'noise', splitting them into small passages helps isolate and retrieve the information needed to answer questions accurately and promptly.
