The CodeSplitter class in the RAGchain library is a text splitter that splits documents using the language-specific separators defined by langchain's Language enum. It inherits from the BaseTextSplitter class and performs the splitting with the from_language method of langchain's RecursiveCharacterTextSplitter class.
Reference: Split code
To initialize an instance of CodeSplitter, you can provide the following parameters:
language_name: The language of the code to split. Default is PYTHON. Supported values: CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.
chunk_size: Maximum size of chunks to return. Default is 50.
chunk_overlap: Overlap in characters between chunks. Default is 0.
kwargs: Additional arguments to pass to the langchain RecursiveCharacterTextSplitter.
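To make the roles of the separators and chunk_size more concrete, here is a minimal pure-Python sketch of recursive, separator-based splitting. It is not the RAGchain or langchain implementation: the function name recursive_split and the shortened separator list PYTHON_SEPARATORS are hypothetical, and details such as whitespace stripping and chunk_overlap are left out.

```python
from typing import List

# Hypothetical, shortened separator list for Python source, ordered from
# coarse to fine. The real list comes from langchain and may differ.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]

    # Use the first (coarsest) separator that actually occurs in the text.
    for index, sep in enumerate(separators):
        if sep and sep in text:
            remaining = separators[index + 1:]
            parts = text.split(sep)
            # Keep the separator at the start of each later part so no text is lost.
            pieces = [parts[0]] + [sep + part for part in parts[1:]]
            break
    else:
        # No separator matched: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they still fit in one chunk.
        if len(current) + len(piece) <= chunk_size:
            current += piece
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = 'def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()\n'
for chunk in recursive_split(sample, PYTHON_SEPARATORS, chunk_size=50):
    print(repr(chunk))
```

In this sketch, a smaller chunk_size forces the splitter further down the separator list, which is why the examples in this section use different chunk_size values for different languages.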
Let's look at examples of how the code splitter splits documents in various languages.
Note: Do not indent the page_content of the test document to make it more legible. The splitter treats indentation and spaces as part of the text, so pass the raw source without any added spaces or indentation.
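The effect of indenting a triple-quoted string can be checked with plain Python. The sketch below compares a string written flush against the left margin with the same code indented to match a surrounding function body. The names RAW_DOC and make_indented_doc are hypothetical, and using textwrap.dedent to undo the indentation is a suggestion that does not come from the RAGchain documentation.

```python
import textwrap

# Written flush against the left margin: the string holds exactly the source code.
RAW_DOC = """def hello_world():
    print("Hello, World!")
"""


def make_indented_doc() -> str:
    # Indented to match this function body: every line of the string
    # now carries four extra leading spaces.
    doc = """\
    def hello_world():
        print("Hello, World!")
    """
    return doc


indented = make_indented_doc()
print(indented == RAW_DOC)                   # False: the extra spaces are part of the text
print(textwrap.dedent(indented) == RAW_DOC)  # True: dedent removes the common indentation
```

Because the extra spaces are real characters, they count toward chunk_size and can change where a splitter finds its separators.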
PYTHON
First, initialize an instance of CodeSplitter. For example:
from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name='PYTHON', chunk_size=50, chunk_overlap=0)
Split document
You can split a document with the split_document() method, which returns a list of Passage objects. For example:
# Document is langchain's document class. This import path is an assumption
# and may differ between langchain versions.
from langchain.schema import Document

python_doc = Document(
    page_content="""def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc keeps the additional fields from the test document's metadata.
        'Data information': 'test for python code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python',
    },
)
passages = code_splitter.split_document(python_doc)
JS
The process is the same as for the Python splitter above, with language_name set to JS.
from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name='JS', chunk_size=60, chunk_overlap=0)
Split document
JS_doc = Document(
    page_content="""function helloWorld(){
  console.log("Hello, World!");
}

// Call the function
helloWorld();
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc keeps the additional fields from the test document's metadata.
        'Data information': 'test for js code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js',
    },
)
passages = code_splitter.split_document(JS_doc)
CSHARP
The process is the same as for the Python splitter above, with language_name set to CSHARP.
from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name='CSHARP', chunk_size=17, chunk_overlap=0)
Split document
csharp_doc = Document(
    page_content="""using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed

        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc keeps the additional fields from the test document's metadata.
        'Data information': 'test for C# text document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#c',
    },
)
passages = code_splitter.split_document(csharp_doc)