HTML Header splitter

HTML Header Text Splitter

Overview

The HTMLHeaderSplitter class in the RAGchain library is a text splitter that splits documents based on HTML headers. This class inherits from the BaseTextSplitter class and uses the HTMLHeaderTextSplitter from the langchain library to perform the splitting.

The difference of MarkDownHeaderSplitter is that HTMLHeaderSplitter can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.

Usage

Initialization

First, initialize an instance of HTMLHeaderSplitter, you can provide the following parameters:

  • headers_to_split_on: A list of tuples that specify the headers to split the document on. Each tuple consists of an HTML header and a key for metadata. Allowed header values are h1, h2, h3, h4, h5, h6. The default value is [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")].

  • return_each_element: A boolean that specifies whether to return each element with its associated headers. The default value is False.

For example:

from RAGchain.preprocess.text_splitter import HTMLHeaderSplitter

html_header_splitter = HTMLHeaderSplitter()

Split document

You can split document using split_document() method. It will return list of Passage objects. For example:

passages = html_header_splitter.split_document(document)

Splitter can't recognize header some cases. Please note above hyperlink!

Last updated