Rust Hwp Loader



Overview

The RustHwpLoader is a Python class that loads HWP files using the libhwp library. It works on all operating systems.

This loader extracts all paragraphs and tables from the HWP file and returns them as a list of Document objects. Each Document object includes the content of the paragraph or table and some associated metadata. The first Document contains all paragraphs from the HWP file, including the texts within each table. Subsequent Documents represent the paragraphs within each table.

Note that this loader does not preserve table structure: rows and columns are not distinguished in the extracted text.

The metadata attribute of each Document includes the file path (under the 'source' key) and the page type (under the 'page_type' key, either 'text' or 'table').

While other HWP loaders may offer richer features, RustHwpLoader is a good option for macOS and Linux users because it requires neither an external HWP loader server nor a Windows-only HWP program.

Usage

Initialization

To use the RustHwpLoader, you first need to initialize it with the path to the HWP file:

loader = RustHwpLoader("/path/to/hwp/file")

If the libhwp library is not installed, an ImportError will be raised with a message asking you to install it using pip:

pip install libhwp
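
The same kind of check can be done in calling code, so a missing dependency is reported before any file is opened. The following is a minimal sketch, not the loader's own implementation: the helper name `is_module_available` is hypothetical, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(module_name) is not None


if not is_module_available("libhwp"):
    print("libhwp is not installed; run `pip install libhwp` first.")
```

Checking up front like this lets a script fail early with a clear message instead of raising the ImportError in the middle of a longer ingestion run.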

Loading Documents

The RustHwpLoader provides two methods to load Document objects from the HWP file: load and lazy_load.

load

The load method returns a list of all Documents:

documents = loader.load()

lazy_load

The lazy_load method is a generator that lazily yields Document objects:

for document in loader.lazy_load():
    # process each document as it is yielded
    print(document.metadata["page_type"])

This method is useful when working with large HWP files that could consume a lot of memory if fully loaded into a list.
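
One way to take advantage of lazy loading is to consume the generator in fixed-size batches, so only a few Documents are held in memory at once. The sketch below is illustrative only: `in_batches` and `fake_lazy_load` are hypothetical names, and the stand-in generator takes the place of `loader.lazy_load()` so the example runs without an HWP file.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    """Stand-in generator; with a real loader this would be loader.lazy_load()."""
    for i in range(5):
        yield f"document {i}"


batch_sizes = [len(batch) for batch in in_batches(fake_lazy_load(), 2)]
print(batch_sizes)  # [2, 2, 1]
```

With a real loader, each batch could be passed to an embedding or storage step before the next batch is read.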

Each Document returned by load or yielded by lazy_load contains a page_content string and a metadata dictionary. The metadata includes the 'source' key (the file path) and the 'page_type' key (either 'text' or 'table').
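
Because every Document carries a 'page_type' value, the results can be separated into text and table Documents after loading. The following is a minimal sketch under stated assumptions: `FakeDocument` is a hypothetical stand-in that only mimics the page_content and metadata attributes described above, and `split_by_page_type` is an illustrative helper, not part of RAGchain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_by_page_type(documents: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Separate documents into (text_documents, table_documents)."""
    text_docs: List[Any] = []
    table_docs: List[Any] = []
    for doc in documents:
        # Anything not marked as 'table' is treated as text.
        if doc.metadata.get("page_type") == "table":
            table_docs.append(doc)
        else:
            text_docs.append(doc)
    return text_docs, table_docs


sample = [
    FakeDocument("All paragraphs, including table text",
                 {"source": "example.hwp", "page_type": "text"}),
    FakeDocument("Cells of the first table",
                 {"source": "example.hwp", "page_type": "table"}),
]
text_docs, table_docs = split_by_page_type(sample)
print(len(text_docs), len(table_docs))  # 1 1
```

Since the first Document already includes the text inside each table, keeping only the text Documents avoids indexing table content twice.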
