Markdown Header Splitter
Overview
The MarkDownHeaderSplitter splits a document into passages based on the document's Markdown headers, which are given as a list of separators. Its behavior is largely similar to Langchain's MarkdownHeaderTextSplitter: the text is split at each header.
The metadata_etc of each Passage contains the header information and the original document's information. The header information in metadata_etc is updated in two cases.
First, whenever a new header appears in the document, its header information is appended to metadata_etc.
Second, when a header at an equivalent level to an existing one appears, the metadata is reset and the newly appeared header is included in it.
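The two update rules can be illustrated with a small plain-Python sketch. This is not RAGchain's implementation. The function name track_headers and the level-to-title dictionary are hypothetical, and the sketch assumes that "reset" means dropping headers at the same or a deeper level while keeping parent headers, which may differ from what the library actually does.

```python
import re

# Matches ATX-style Markdown headers such as "# Title" or "## Section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def track_headers(markdown_text):
    """Return (text_line, headers) pairs; headers maps header level -> title.

    Hypothetical helper for illustration only, not part of RAGchain.
    """
    active = {}   # headers currently in effect
    results = []
    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            # Case 2 (assumed interpretation): a header at an equivalent
            # level replaces the old one, so same-or-deeper entries are dropped.
            active = {lvl: t for lvl, t in active.items() if lvl < level}
            # Case 1: the newly seen header is added to the metadata.
            active[level] = title
        elif line.strip():
            # Body text inherits a copy of the headers in effect.
            results.append((line.strip(), dict(active)))
    return results


sample = "# Guide\nintro\n## Install\nstep one\n## Usage\nstep two\n"
for text, headers in track_headers(sample):
    print(text, headers)
```

In this sketch, the line under "## Usage" keeps the top-level "Guide" header but no longer carries "Install", because the second level-2 header replaced the first.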
Usage
Initialization
First, initialize an instance of MarkDownHeaderSplitter. For example:
from RAGchain.preprocess.text_splitter import MarkDownHeaderSplitter
markdown_header_splitter = MarkDownHeaderSplitter()
Split document
You can split a document using the split_document() method. It returns a list of Passage objects. For example:
passages = markdown_header_splitter.split_document(document)