
Token splitter



Overview

The TokenSplitter splits a document into passages by token count, breaking the document's text into smaller chunks with a chosen tokenization method. It supports 'tiktoken', 'spaCy', 'SentenceTransformers', 'NLTK', and 'huggingFace'.

Its behavior closely follows LangChain's split-by-token text splitters.

Usage

The available tokenizer splitters are 'tiktoken', 'spaCy', 'SentenceTransformers', 'NLTK', and 'huggingFace'; each splitter is named after the tokenizer it uses. Each tokenizer tokenizes text in a different way, so their usage differs slightly. In every case, you split a document with the split_document() method, which returns a list of Passage objects.

Initialization

Here is the parameter information.

  • tokenizer_name: The name of the tokenizer to use. Choose one of tiktoken, spaCy, SentenceTransformers, NLTK, or huggingFace.

  • chunk_size: Maximum size of chunks to return. Default is 100.

  • chunk_overlap: Overlap in characters between chunks. Default is 0.

  • pretrained_model_name: The HuggingFace pretrained tokenizer to use with the huggingFace token splitter. You can choose from the various pretrained models at https://huggingface.co/models. Default is "gpt2".

  • kwargs: Additional arguments.
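
As a concrete illustration, here is how these parameters fit together in a constructor call. The values below are illustrative only, not recommended defaults.

from RAGchain.preprocess.text_splitter import TokenSplitter

# Illustrative configuration: chunks of up to 512 tokens, with an overlap of 50
# between consecutive chunks so context is preserved across chunk boundaries.
splitter = TokenSplitter(tokenizer_name='tiktoken', chunk_size=512, chunk_overlap=50)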

We provide sample_test_document.txt for testing. If you want to use our test file, try this code!

import os
import pathlib

from langchain.schema import Document  # RAGchain builds on LangChain's Document schema

root_dir = pathlib.PurePath(os.path.dirname(os.path.realpath(__file__))).parent.parent.parent
file_path = os.path.join(root_dir, "resources", "sample_test_document.txt")

with open(file_path) as f:
    state_of_the_union = f.read()

TEST_DOCUMENT = Document(
    page_content=state_of_the_union,
    metadata={
        'source': 'test_source',
        'Data information': '맨까 새끼들 부들부들하구나',
        'What is it?': 'This is token splitter'
    }
)

Tiktoken

First, initialize an instance of TokenSplitter with the parameter tokenizer_name='tiktoken'.

from RAGchain.preprocess.text_splitter import TokenSplitter

tiktoken = TokenSplitter(tokenizer_name='tiktoken', chunk_size=1000, chunk_overlap=0)

Split document(tiktoken)

tiktoken_passages = tiktoken.split_document(TEST_DOCUMENT)
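
As a quick check, you can inspect the first few chunks. This sketch assumes each returned Passage exposes a content attribute, as in RAGchain's schema; adjust it if your version differs.

# Print the length and the start of the first few passages.
# Assumes RAGchain's Passage objects expose a `content` attribute.
for passage in tiktoken_passages[:3]:
    print(len(passage.content), passage.content[:80])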

spaCy

To use the spaCy token splitter, you should install some packages.

!pip install spacy
!python -m spacy download en_core_web_sm

Initialize an instance of TokenSplitter with the parameter tokenizer_name='spaCy'.

from RAGchain.preprocess.text_splitter import TokenSplitter

spaCy = TokenSplitter(tokenizer_name='spaCy', chunk_size=1000, chunk_overlap=0)

Split document(spaCy)

spaCy_passages = spaCy.split_document(TEST_DOCUMENT)

SentenceTransformers

Initialize an instance of TokenSplitter with the parameter tokenizer_name='SentenceTransformers'.

from RAGchain.preprocess.text_splitter import TokenSplitter

sentence_transformers = TokenSplitter(tokenizer_name='SentenceTransformers', chunk_overlap=0)

Split document(SentenceTransformers)

SentenceTransformers_passages = sentence_transformers.split_document(TEST_DOCUMENT)

NLTK

Initialize an instance of TokenSplitter with the parameter tokenizer_name='NLTK'. To use the NLTK token splitter, you should install some packages.

!pip install nltk

from RAGchain.preprocess.text_splitter import TokenSplitter

NLTK = TokenSplitter(tokenizer_name='NLTK', chunk_size=1000)

Split document(NLTK)

NLTK_passages = NLTK.split_document(TEST_DOCUMENT)

Troubleshooting (NLTK)

1. LookupError

If you encounter a LookupError, try the code below. This error occurs because some NLTK data files have not been downloaded.

import nltk
nltk.download('all')

HuggingFace

Initialize an instance of TokenSplitter with the parameter tokenizer_name='huggingFace'. HuggingFace's AutoTokenizer lets you choose from the various pretrained models at https://huggingface.co/models; the default pretrained model is gpt2.

from RAGchain.preprocess.text_splitter import TokenSplitter

huggingFace = TokenSplitter(tokenizer_name='huggingFace', chunk_size=100, chunk_overlap=0, pretrained_model_name="gpt2")

Split document(HuggingFace)

huggingface_passages = huggingFace.split_document(TEST_DOCUMENT)
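
Because the huggingFace splitter loads its tokenizer through AutoTokenizer, you can pass any tokenizer name available on the HuggingFace Hub. Here is a minimal sketch with a different tokenizer; the model name is just an example.

from RAGchain.preprocess.text_splitter import TokenSplitter

# Illustrative: swap the default "gpt2" tokenizer for another HuggingFace model.
bert_splitter = TokenSplitter(tokenizer_name='huggingFace', chunk_size=100,
                              chunk_overlap=0, pretrained_model_name="bert-base-uncased")
bert_passages = bert_splitter.split_document(TEST_DOCUMENT)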


Notice: All of these tokenizers rely on LangChain's text splitter library. If you want to know more about each tokenizer, refer to LangChain's documentation on splitting by tokens.
