The CodeSplitter class in the RAGchain library is a text splitter that splits source-code documents using the language-specific separators that langchain defines for its Language enum. It inherits from the BaseTextSplitter class and delegates the actual splitting to the from_language method of langchain's RecursiveCharacterTextSplitter class.
Reference: Split code
To initialize an instance of CodeSplitter, you can provide the following parameters:
language_name: The language of the code to split. Default is PYTHON.
Supported values: CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.
chunk_size: Maximum size of the chunks to return. Default is 50.
chunk_overlap: Overlap in characters between consecutive chunks. Default is 0.
kwargs: Additional keyword arguments passed on to langchain's RecursiveCharacterTextSplitter.
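To make the roles of the separators and chunk_size more concrete, here is a simplified, self-contained sketch of recursive separator-based splitting. It is not the langchain or RAGchain implementation: the function recursive_split and the shortened PYTHON_SEPARATORS list are hypothetical, and the sketch neither merges small pieces back together nor applies chunk_overlap.

```python
from typing import List

# Hypothetical, shortened separator list for Python source code.
# The list langchain actually uses for Language.PYTHON may differ.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator found, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        # Small enough already; drop pieces that are only whitespace.
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: cut every chunk_size characters.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            pieces = text.split(sep)
            # Keep the separator at the start of every piece except the first.
            pieces = [pieces[0]] + [sep + p for p in pieces[1:]]
            chunks: List[str] = []
            for piece in pieces:
                # Only finer separators are tried on the sub-pieces.
                chunks.extend(recursive_split(piece, separators[i + 1:], chunk_size))
            return chunks

    # No separator matched and no "" fallback was supplied.
    return [text]


code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = recursive_split(code, PYTHON_SEPARATORS, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With a chunk_size of 30 the sample is cut at the "\ndef " boundary, so each function ends up in its own chunk; with the default of 50 the whole sample fits in a single chunk.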
Let's look at examples of how the code splitter splits documents in various languages.
Notice: Do not indent the content of the test document for legibility. The splitter treats indentation and spaces as part of the text, so pass the raw code without any extra spaces or indentation.
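The following sketch illustrates why the notice matters: a triple-quoted string that is indented to match the surrounding code carries that indentation in its content. If you do prefer to indent the literal, one possible workaround (a suggestion, not something the RAGchain documentation prescribes) is the standard library's textwrap.dedent, which strips the common leading whitespace.

```python
import textwrap

# Indented for legibility: every code line starts with 4 extra spaces.
indented = """
    def hello_world():
        print("Hello, World!")
    """

# dedent removes the whitespace common to all lines, restoring the raw code.
raw = textwrap.dedent(indented)

print(repr(indented.splitlines()[1]))  # still has the 4 extra leading spaces
print(repr(raw.splitlines()[1]))       # starts at column 0
```

Because chunk_size counts characters, the extra leading spaces in the indented version also count toward each chunk's size.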
PYTHON
First, initialize an instance of CodeSplitter. For example:
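A minimal sketch of the initialization, using the parameter names and default values listed above. The import path is an assumption and may differ between RAGchain releases; the guard only exists so that the snippet still runs where RAGchain is not installed.

```python
# Assumption: this import path matches your installed RAGchain release.
try:
    from RAGchain.preprocess.text_splitter import CodeSplitter
except ImportError:
    CodeSplitter = None  # RAGchain is not available in this environment

# Parameter names and defaults as documented above.
splitter_kwargs = {
    "language_name": "PYTHON",
    "chunk_size": 50,
    "chunk_overlap": 0,
}

code_splitter = CodeSplitter(**splitter_kwargs) if CodeSplitter is not None else None
```

For the other languages, only language_name needs to change, for example to 'JS' or 'CSHARP'.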
You can split a document with the split_document() method, which returns a list of Passage objects. For example:
from langchain.schema import Document

python_doc = Document(
    page_content="""
def hello_world():
    print("Hello, World!")
# Call the function
hello_world()
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc contains the extra metadata fields from the test document.
        'Data information': 'test for python code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python'
    }
)
# code_splitter is a CodeSplitter instance created with language_name='PYTHON'.
passages = code_splitter.split_document(python_doc)
JS
JS_doc = Document(
    page_content="""
function helloWorld() {
  console.log("Hello, World!");
}
// Call the function
helloWorld();
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc contains the extra metadata fields from the test document.
        'Data information': 'test for js code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js',
    }
)
# code_splitter is a CodeSplitter instance created with language_name='JS'.
passages = code_splitter.split_document(JS_doc)
CSHARP
csharp_doc = Document(
    page_content="""
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed
        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
""",
    metadata={
        'source': 'test_source',
        # Check whether metadata_etc contains the extra metadata fields from the test document.
        'Data information': 'test for C# text document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#c',
    }
)
# code_splitter is a CodeSplitter instance created with language_name='CSHARP'.
passages = code_splitter.split_document(csharp_doc)