RAGChain Docs

Nougat Loader

NougatPDFLoader Class Documentation



Overview

The NougatPDFLoader class loads academic documents from PDF files. It uses the Nougat model, developed by Meta, to convert academic papers from PDF format accurately.

Usage

Run Nougat API server

You must run a Nougat API server to use this loader. The server needs CUDA installed to run the Nougat model properly. For more detailed Nougat installation instructions, see the official GitHub repo.

Use Docker (Recommended)

First, clone the facebookresearch/nougat repository to your machine and move to the docker folder.

git clone https://github.com/facebookresearch/nougat.git
cd nougat/docker

Then build and run your Docker container by following the instructions.

Use pip

First, install the API version of the nougat package using pip.

pip install "nougat-ocr[api]"

Then run the API server with this command.

nougat_api
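
Before moving on, it can help to confirm that the server is actually accepting connections. The following is a minimal sketch using only the Python standard library; the helper name is_server_reachable is hypothetical and not part of RAGchain or Nougat, and it only checks that the TCP port is open, not that the Nougat model has finished loading.

```python
import socket
from urllib.parse import urlparse


def is_server_reachable(host_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; not part of RAGchain or Nougat.
    """
    parsed = urlparse(host_url)
    hostname = parsed.hostname
    if hostname is None:
        # The string did not contain a parsable host (e.g. missing scheme).
        return False
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Call it with the same address you plan to pass as nougat_host. A False result usually means the server is not running yet or is listening on a different port.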

Initialization

After starting your Nougat API server, create a loader instance by providing two parameters: file_path and nougat_host.

  • file_path: This is a string representing the path to your PDF file.

  • nougat_host: This is a string representing the host address where your Nougat API server is running.

Example:

from RAGchain.preprocess.loader import NougatPDFLoader

loader = NougatPDFLoader(file_path="path/to/your/file.pdf", nougat_host="http://localhost:5000")

During initialization, the loader checks whether it can connect to the provided Nougat server host. If the connection fails, it raises a ValueError.
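
Because a failed connection surfaces as a ValueError at construction time, calling code may want to catch it rather than crash. The sketch below shows one way to do that; try_create_loader is a hypothetical helper, and the loader class is passed in as an argument so the snippet stays self-contained.

```python
from typing import Any, Callable, Optional


def try_create_loader(
    loader_cls: Callable[..., Any], file_path: str, nougat_host: str
) -> Optional[Any]:
    """Build a loader, returning None if the constructor raises ValueError.

    Hypothetical helper for illustration; not part of RAGchain.
    """
    try:
        return loader_cls(file_path=file_path, nougat_host=nougat_host)
    except ValueError as err:
        # The documentation above states that an unreachable Nougat
        # server host causes a ValueError during initialization.
        print(f"Could not create loader for host {nougat_host}: {err}")
        return None


# Intended usage (requires RAGchain and a running Nougat API server):
#   from RAGchain.preprocess.loader import NougatPDFLoader
#   loader = try_create_loader(
#       NougatPDFLoader, "path/to/your/file.pdf", "http://localhost:5000"
#   )
```

Returning None keeps the example short; in a real application, re-raising with a clearer message or retrying after a delay may be a better fit.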

Loading Documents

The class provides two methods for loading documents: load() and lazy_load().

Both methods accept the following optional parameters:

  • split_section (default True): If set to True, it splits the document by section.

  • split_table (default True): If set to True, it splits the document by table.

  • start, stop (keyword arguments): You can also pass other arguments, such as the start page number (start) or the stop page number (stop), as keyword arguments (kwargs). They specify which pages of your PDF to load.

Example:

documents = loader.load(split_section=True, split_table=False)

or

for doc in loader.lazy_load(split_section=True):
    # process each document here, for example:
    print(doc)

Both methods return Document objects that contain the processed content of your PDF file.
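
When you only need a preview of a long paper, you can stop iterating over lazy_load() early. The sketch below assumes that lazy_load() yields documents one at a time, which the text above does not state explicitly; take_first is a hypothetical helper built on itertools.islice.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def take_first(documents: Iterable[T], limit: int) -> List[T]:
    """Collect at most `limit` items without consuming the rest.

    Hypothetical helper for illustration; not part of RAGchain.
    """
    return list(islice(documents, limit))


# Intended usage (requires a loader created as shown earlier):
#   preview = take_first(loader.lazy_load(split_section=True), 5)
```

If lazy_load() turns out to process the whole PDF before yielding, this still works but saves no time; in that case, restricting the page range with start and stop is the more direct option.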
