Token splitter
The TokenSplitter is used to split a document into passages by token count. It splits the text of a document into smaller chunks, measured in tokens, using various tokenization methods. The class supports tokenization with 'tiktoken', 'spaCy', 'SentenceTransformers', 'NLTK', and 'huggingFace'.
Most of its features are similar to LangChain's text splitters.
The available tokenizer splitters are 'tiktoken', 'spaCy', 'SentenceTransformers', 'NLTK', and 'huggingFace'. Each splitter is named after the tokenizer it uses. Because each tokenizer tokenizes text in a different way, their usage differs slightly.
Here is the parameter information.

tokenizer_name: The name of the tokenizer to use. You can choose one of tiktoken, spaCy, SentenceTransformers, NLTK, and huggingFace.

chunk_size: Maximum size of chunks to return. Default is 100.

chunk_overlap: Overlap in characters between chunks. Default is 0.

pretrained_model_name: A Hugging Face pretrained tokenizer name, used by the huggingFace token splitter. You can choose from various pretrained models for this parameter. Default is "gpt2". Refer to the pretrained models at https://huggingface.co/models.

kwargs: Additional arguments.
We provide sample_test_document.txt for testing. If you want to use our test file, try this code!
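Here is a minimal sketch of loading the test file. Since these splitters build on LangChain (see the notice at the bottom of this page), a langchain.schema.Document is used for illustration; the exact document type that split_document() expects is an assumption here.

```python
from langchain.schema import Document

# Assumption: sample_test_document.txt is in the current working directory.
with open("sample_test_document.txt", encoding="utf-8") as f:
    text = f.read()

# Wrap the raw text in a Document so it can be handed to split_document().
document = Document(page_content=text, metadata={"source": "sample_test_document.txt"})
```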
First, initialize an instance of TokenSplitter with the parameter tokenizer_name="tiktoken".
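A minimal sketch of the tiktoken splitter; the import path is hypothetical, so adjust it to wherever TokenSplitter lives in your installation:

```python
# Requires: pip install tiktoken
from token_splitter import TokenSplitter  # hypothetical import path

splitter = TokenSplitter(tokenizer_name="tiktoken", chunk_size=100, chunk_overlap=0)

# split_document() returns a list of passage objects.
passages = splitter.split_document(document)
print(len(passages))
```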
To use the spaCy token splitter, you should install some packages first. Then initialize an instance of TokenSplitter with the parameter tokenizer_name="spaCy".
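A minimal sketch, with the required install commands shown as comments (the en_core_web_sm pipeline is what LangChain's spaCy splitter loads by default):

```python
# Requires: pip install spacy
#           python -m spacy download en_core_web_sm
from token_splitter import TokenSplitter  # hypothetical import path

splitter = TokenSplitter(tokenizer_name="spaCy", chunk_size=100, chunk_overlap=0)
passages = splitter.split_document(document)
```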
Initialize an instance of TokenSplitter with the parameter tokenizer_name="SentenceTransformers".
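A minimal sketch, again with a hypothetical import path; the sentence-transformers package must be installed:

```python
# Requires: pip install sentence-transformers
from token_splitter import TokenSplitter  # hypothetical import path

splitter = TokenSplitter(tokenizer_name="SentenceTransformers", chunk_size=100, chunk_overlap=0)
passages = splitter.split_document(document)
```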
Initialize an instance of TokenSplitter with the parameter tokenizer_name="NLTK". To use the NLTK token splitter, you should install some packages first.
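A minimal sketch; the import path is hypothetical:

```python
# Requires: pip install nltk
from token_splitter import TokenSplitter  # hypothetical import path

splitter = TokenSplitter(tokenizer_name="NLTK", chunk_size=100, chunk_overlap=0)
passages = splitter.split_document(document)
```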
1. Lookup error
If you encounter a LookupError, try this code! This error occurs because some NLTK data files have not been downloaded.
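A sketch of the usual fix: the NLTK sentence tokenizer depends on the punkt data package, and downloading it resolves the LookupError.

```python
import nltk

# The NLTK tokenizers look up the "punkt" data package at runtime;
# downloading it resolves the LookupError.
nltk.download("punkt")
```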
HuggingFace splitter
The huggingFace splitter uses the AutoTokenizer from Hugging Face transformers, which lets you choose from various Hugging Face pretrained models via the pretrained_model_name parameter. (Refer to https://huggingface.co/models.) The default pretrained model is gpt2.
You can split the document using the split_document() method. It will return a list of objects. For example:
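A minimal sketch, with the same hypothetical import path as above:

```python
from token_splitter import TokenSplitter  # hypothetical import path

# pretrained_model_name selects which tokenizer AutoTokenizer loads.
splitter = TokenSplitter(
    tokenizer_name="huggingFace",
    chunk_size=100,
    chunk_overlap=0,
    pretrained_model_name="gpt2",
)

passages = splitter.split_document(document)
for passage in passages[:3]:
    print(passage)
```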
Notice: All tokenizers rely on LangChain's library. If you want to know more about these tokenizers, refer to LangChain's text splitter documentation for more information.