Token splitter
Overview
The `TokenSplitter` is used to split a document into passages by token count. It splits the text of a document into smaller chunks using one of several tokenization methods: `tiktoken`, `spaCy`, `SentenceTransformers`, `NLTK`, or `huggingFace`.
Its features are similar to LangChain's Split by token.
Usage
The available token splitters are `tiktoken`, `spaCy`, `SentenceTransformers`, `NLTK`, and `huggingFace`. Each splitter is named after the tokenizer it uses, and because each tokenizer tokenizes text in a different way, their usage differs slightly.
Initialization
Here is the parameter information.

- `tokenizer_name`: The name of the tokenizer to use. Choose one of `tiktoken`, `spaCy`, `SentenceTransformers`, `NLTK`, or `huggingFace`.
- `chunk_size`: Maximum size of chunks to return. Default is 100.
- `chunk_overlap`: Overlap in characters between chunks. Default is 0.
- `pretrained_model_name`: The name of a Hugging Face pretrained tokenizer, used by the `huggingFace` token splitter. You can choose from the various pretrained models listed at https://huggingface.co/models. Default is "gpt2".
- `kwargs`: Additional arguments.
We offer `sample_test_document.txt` for testing. If you want to use our test file, try the code below.
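A minimal sketch for loading the file, assuming it sits in the working directory and that `split_document()` consumes a LangChain `Document` (both are assumptions; adjust to your setup):

```python
from langchain.schema import Document

# Assumed location of the sample file; adjust the path to your checkout.
with open('sample_test_document.txt', 'r', encoding='utf-8') as f:
    test_document = Document(page_content=f.read())
```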
Tiktoken
First, initialize an instance of `TokenSplitter` with `tokenizer_name='tiktoken'`.
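A minimal sketch; the import path below is an assumption and may differ in your installation:

```python
from RAGchain.preprocess.text_splitter import TokenSplitter  # assumed import path

# chunk_size and chunk_overlap fall back to 100 and 0 if omitted.
tiktoken_splitter = TokenSplitter(tokenizer_name='tiktoken', chunk_size=100, chunk_overlap=0)
```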
Split document (tiktoken)
You can split a document using the `split_document()` method. It returns a list of `Passage` objects. For example:
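A minimal sketch, reusing `test_document` from above. The `content` attribute on `Passage` is an assumption about how each passage exposes its text:

```python
passages = tiktoken_splitter.split_document(test_document)

print(len(passages))        # number of chunks produced
print(passages[0].content)  # text of the first chunk (assumed attribute name)
```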
spaCy
To use the spaCy token splitter, you should first install some packages.
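For example (the `en_core_web_sm` pipeline is an assumption based on spaCy's standard small English model; install the model your text needs):

```bash
pip install spacy
python -m spacy download en_core_web_sm
```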
Then initialize an instance of `TokenSplitter` with the parameter `tokenizer_name='spaCy'`.
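A minimal sketch, reusing the `TokenSplitter` import from the tiktoken example:

```python
spacy_splitter = TokenSplitter(tokenizer_name='spaCy', chunk_size=100, chunk_overlap=0)
```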
Split document (spaCy)
You can split a document using the `split_document()` method. It returns a list of `Passage` objects. For example:
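A minimal sketch:

```python
passages = spacy_splitter.split_document(test_document)
print(len(passages))
```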
SentenceTransformers
Initialize an instance of `TokenSplitter` with the parameter `tokenizer_name='SentenceTransformers'`.
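A minimal sketch:

```python
sentence_transformers_splitter = TokenSplitter(tokenizer_name='SentenceTransformers', chunk_overlap=0)
```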
Split document (SentenceTransformers)
You can split a document using the `split_document()` method. It returns a list of `Passage` objects. For example:
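A minimal sketch:

```python
passages = sentence_transformers_splitter.split_document(test_document)
print(len(passages))
```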
NLTK
To use the NLTK token splitter, you should first install some packages. Then initialize an instance of `TokenSplitter` with the parameter `tokenizer_name='NLTK'`.
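A minimal sketch:

```bash
pip install nltk
```

```python
nltk_splitter = TokenSplitter(tokenizer_name='NLTK', chunk_size=100, chunk_overlap=0)
```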
Split document (NLTK)
You can split a document using the `split_document()` method. It returns a list of `Passage` objects. For example:
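A minimal sketch:

```python
passages = nltk_splitter.split_document(test_document)
print(len(passages))
```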
Troubleshooting (NLTK)
1. LookupError
If you encounter a `LookupError`, try the code below. This error occurs because some required NLTK data files have not been downloaded.
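A minimal fix, downloading NLTK's standard `punkt` tokenizer data (if your error message names a different resource, download that one instead):

```python
import nltk

# Fetch the sentence-tokenizer data that the NLTK splitter depends on.
nltk.download('punkt')
```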
HuggingFace
The Hugging Face splitter uses `AutoTokenizer` from the Hugging Face transformers library, so you can choose among various pretrained models (refer to https://huggingface.co/models). The default pretrained model is `gpt2`. Initialize an instance of `TokenSplitter` with the parameter `tokenizer_name='huggingFace'`.
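A minimal sketch; any tokenizer name accepted by `AutoTokenizer.from_pretrained()` should work as `pretrained_model_name`:

```python
huggingface_splitter = TokenSplitter(tokenizer_name='huggingFace', pretrained_model_name='gpt2')
```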
Split document (HuggingFace)
You can split a document using the `split_document()` method. It returns a list of `Passage` objects. For example:
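A minimal sketch:

```python
passages = huggingface_splitter.split_document(test_document)
print(len(passages))
```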
Notice: all tokenizers rely on LangChain's library. If you want to know more about these tokenizers, refer to LangChain's Split by token documentation.