RAGChain Docs

HTML Header splitter

HTML Header Text Splitter

Overview

The HTMLHeaderSplitter class in the RAGchain library is a text splitter that splits documents based on HTML headers. It inherits from the BaseTextSplitter class and uses HTMLHeaderTextSplitter from the langchain library to perform the splitting.

Unlike MarkDownHeaderSplitter, HTMLHeaderSplitter can either return chunks element by element or combine elements that share the same metadata. The goals are (a) to keep related text grouped (more or less) semantically and (b) to preserve the context-rich information encoded in the document structure. It can be used with other text splitters as part of a chunking pipeline.
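To make the two return modes concrete, here is a minimal sketch using only the Python standard library. It is not RAGchain's or langchain's implementation; HeaderChunker and split_by_headers are hypothetical names made up for this illustration, and the sketch assumes the header list is ordered from the highest level to the lowest.

```python
from html.parser import HTMLParser

# Same shape as the header list HTMLHeaderSplitter accepts (tag, metadata key).
HEADERS_TO_SPLIT_ON = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]


class HeaderChunker(HTMLParser):
    """Illustrative only: pairs each text element with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_keys = dict(headers_to_split_on)  # tag -> metadata key
        self.levels = [tag for tag, _ in headers_to_split_on]
        self.active = {}       # metadata key -> current header text
        self.in_header = None  # header tag currently open, if any
        self.elements = []     # (text, metadata) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.header_keys:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header clears its own level and every deeper level.
            idx = self.levels.index(self.in_header)
            for deeper in self.levels[idx:]:
                self.active.pop(self.header_keys[deeper], None)
            self.active[self.header_keys[self.in_header]] = text
        else:
            self.elements.append((text, dict(self.active)))


def split_by_headers(html, headers_to_split_on=HEADERS_TO_SPLIT_ON,
                     return_each_element=False):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    if return_each_element:
        return parser.elements
    # Combine consecutive elements whose metadata is identical.
    chunks = []
    for text, meta in parser.elements:
        if chunks and chunks[-1][1] == meta:
            chunks[-1] = (chunks[-1][0] + "\n" + text, meta)
        else:
            chunks.append((text, meta))
    return chunks


html = "<h1>Intro</h1><p>First.</p><p>Second.</p><h2>Detail</h2><p>Third.</p>"
combined = split_by_headers(html)
each = split_by_headers(html, return_each_element=True)
```

In this sketch, combined holds two chunks (the two paragraphs under "Intro" merged, then the paragraph under "Detail"), while each holds three, one per paragraph, all carrying the header metadata that was active when the text appeared.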

Usage

Initialization

First, initialize an instance of HTMLHeaderSplitter. You can provide the following parameters:

  • headers_to_split_on: A list of tuples that specify the headers to split the document on. Each tuple consists of an HTML header and a key for metadata. Allowed header values are h1, h2, h3, h4, h5, h6. The default value is [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")].

  • return_each_element: A boolean that specifies whether to return each element with its associated headers. The default value is False.

For example:

from RAGchain.preprocess.text_splitter import HTMLHeaderSplitter

html_header_splitter = HTMLHeaderSplitter()

Split document

You can split a document using the split_document() method. It returns a list of Passage objects. For example:

passages = html_header_splitter.split_document(document)

Limitation

The splitter cannot recognize headers in some cases. See the limitations noted in the langchain HTMLHeaderTextSplitter documentation.