HTML Header splitter
The HTMLHeaderSplitter class in the RAGchain library is a text splitter that splits documents based on HTML headers. It inherits from the BaseTextSplitter class and uses the langchain library to perform the splitting.
Unlike MarkDownHeaderSplitter, HTMLHeaderSplitter can either return chunks element by element or combine elements that share the same metadata. The objectives are (a) to keep related text grouped (more or less) semantically and (b) to preserve the context-rich information encoded in the document structure. It can be used with other text splitters as part of a chunking pipeline.
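The chunking pipeline idea can be sketched with a second, size-based stage that runs on the output of a header-based splitter. This is a rough illustration only: the Chunk dataclass, the split_by_size function, and the max_chars parameter are hypothetical names invented for this sketch, not part of RAGchain or langchain.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a header-split chunk (not a RAGchain class)."""
    content: str
    metadata: dict = field(default_factory=dict)


def split_by_size(chunks, max_chars=40):
    """Second pipeline stage: cut long chunks on word boundaries.

    Every resulting piece keeps a copy of the header metadata of the
    chunk it came from, so the structural context is not lost.
    """
    result = []
    for chunk in chunks:
        current = ""
        for word in chunk.content.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > max_chars:
                # Adding this word would overflow: emit what we have so far.
                result.append(Chunk(current, dict(chunk.metadata)))
                current = word
            else:
                current = candidate
        if current:
            result.append(Chunk(current, dict(chunk.metadata)))
    return result


# Pretend these came out of a header-based splitter (the first stage).
header_chunks = [
    Chunk("Short introduction.", {"Header 1": "Guide"}),
    Chunk(
        "A much longer section body that will not fit into a single small chunk.",
        {"Header 1": "Guide", "Header 2": "Details"},
    ),
]

for piece in split_by_size(header_chunks):
    print(piece.metadata, "->", piece.content)
```

The point of the design is that the header stage decides which text belongs together, while the later stage only enforces a size limit and never discards the header metadata.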
First, initialize an instance of HTMLHeaderSplitter. You can provide the following parameters:
headers_to_split_on: A list of tuples that specify the headers to split the document on. Each tuple consists of an HTML header tag and a key for the metadata. Allowed header values are h1, h2, h3, h4, h5, and h6. The default value is [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")].
return_each_element: A boolean that specifies whether to return each element with its associated headers. The default value is False.
For example:
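As a small illustration of the shape headers_to_split_on takes, the sketch below builds the default value and a custom one, then checks them against the allowed header tags. The check_headers helper and the ALLOWED_HEADERS and DEFAULT_HEADERS names are hypothetical and exist only in this sketch; whether the library itself performs such validation is not stated here.

```python
# Allowed header tags, as listed in the parameter description.
ALLOWED_HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Default value given in the parameter description.
DEFAULT_HEADERS = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]


def check_headers(headers_to_split_on):
    """Hypothetical helper: validate (tag, metadata key) pairs before use."""
    for tag, key in headers_to_split_on:
        if tag not in ALLOWED_HEADERS:
            raise ValueError(f"unsupported header tag: {tag!r}")
        if not key:
            raise ValueError(f"empty metadata key for tag {tag!r}")
    return list(headers_to_split_on)


# A custom configuration that splits on h2 and h4 only.
custom = check_headers([("h2", "Section"), ("h4", "Subsection")])
print(custom)
```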
You can split a document using the split_document() method. It returns a list of objects. For example:
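To make the effect of headers_to_split_on and return_each_element concrete, here is a minimal sketch of header-based splitting written with Python's standard html.parser module. It is not the RAGchain or langchain implementation. The HeaderSplitSketch class and split_html function are hypothetical, the chunks are plain dicts rather than the objects the library returns, and headers that contain nested inline tags are not handled.

```python
from html.parser import HTMLParser


class HeaderSplitSketch(HTMLParser):
    """Illustrative only: tags each text element with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.keys = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = sorted(self.keys)        # "h1" < "h2" < ... as strings
        self.active = {}                       # metadata key -> header text
        self.open_header = None                # header tag being read, if any
        self.elements = []                     # one dict per text element

    def handle_starttag(self, tag, attrs):
        if tag in self.keys:
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_header:
            # A new header replaces its own level and clears deeper ones.
            for level in self.levels:
                if level >= self.open_header:
                    self.active.pop(self.keys[level], None)
            self.active[self.keys[self.open_header]] = text
        else:
            self.elements.append(
                {"content": text, "metadata": dict(self.active)}
            )


def split_html(html, headers_to_split_on, return_each_element=False):
    parser = HeaderSplitSketch(headers_to_split_on)
    parser.feed(html)
    parser.close()
    if return_each_element:
        return parser.elements
    # Otherwise combine consecutive elements that share the same metadata.
    merged = []
    for element in parser.elements:
        if merged and merged[-1]["metadata"] == element["metadata"]:
            merged[-1]["content"] += " " + element["content"]
        else:
            merged.append(dict(element))
    return merged


html = """
<h1>Guide</h1>
<p>Intro text.</p>
<h2>Setup</h2>
<p>Install it.</p>
<p>Configure it.</p>
<h2>Usage</h2>
<p>Run it.</p>
"""
headers = [("h1", "Header 1"), ("h2", "Header 2")]

combined = split_html(html, headers)
each = split_html(html, headers, return_each_element=True)
for chunk in combined:
    print(chunk["metadata"], "->", chunk["content"])
```

In this sketch, the two paragraphs under "Setup" come back as separate entries when return_each_element is True and as a single combined chunk otherwise, because they carry identical header metadata.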
The splitter cannot recognize headers in some cases. Please see the hyperlink above for details.