The CodeSplitter class in the RAGchain library is a text splitter that splits source code documents using the language-specific separators defined for langchain's Language enum. It inherits from the BaseTextSplitter class and delegates the splitting to langchain's RecursiveCharacterTextSplitter, which it creates through the from_language method.
Reference: Split code (LangChain documentation)
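To make the idea of splitting on language separators more concrete, here is a minimal pure-Python sketch of recursive, separator-based splitting. It is an illustration only, not the langchain implementation: the PYTHON_SEPARATORS list and the recursive_split helper are hypothetical names invented for this example, and unlike the real splitter it does not merge small neighbouring pieces back together.

```python
from typing import List

# Illustrative separators for Python source, ordered from coarse to fine.
# Assumption for this sketch: not the exact list langchain uses.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    # Keep each separator attached to the piece that follows it,
    # so that joining the chunks reproduces the input exactly.
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + part for part in parts[1:]]

    chunks: List[str] = []
    for piece in pieces:
        if piece:
            chunks.extend(recursive_split(piece, finer, chunk_size))
    return chunks


code = 'def hello_world():\n    print("Hello, World!")\n# Call the function\nhello_world()\n'
chunks = recursive_split(code, PYTHON_SEPARATORS, chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

Trying coarse separators such as class and function boundaries before falling back to newlines and spaces is what keeps related lines of code together where the size limit allows it.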
To initialize an instance of CodeSplitter, you can provide the following parameters:
language_name: The language of the code to split. Default is PYTHON.
(Supported values: CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP)
chunk_size: Maximum size of each chunk to return. Default is 50.
chunk_overlap: Overlap in characters between consecutive chunks. Default is 0.
kwargs: Additional arguments passed on to langchain's RecursiveCharacterTextSplitter.
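The effect of chunk_size and chunk_overlap is easiest to see on a plain string. The sketch below uses a simple fixed-size sliding window; window_chunks is a hypothetical helper written for this illustration. The real splitter cuts at language separators rather than at fixed offsets, so its chunk boundaries will differ.

```python
from typing import List


def window_chunks(text: str, chunk_size: int = 50, chunk_overlap: int = 0) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 2, every chunk begins with the last two characters of the previous one, which is the behaviour the chunk_overlap parameter describes.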
Let's look at examples of how the code splitter splits documents written in various languages.
Notice: Do not indent the content of TEST_DOCUMENT just to make it easier to read. The splitter treats indentation and spaces as part of the text, so pass in the raw text without any extra spaces or indentation.
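The short check below shows why this matters: leading whitespace inside a triple-quoted string is part of the string itself. The variable names are made up for this illustration, and using textwrap.dedent from the standard library is only one possible way to strip indentation that was added for readability.

```python
import textwrap

# Raw content: starts at column 0, exactly as the code would appear in a file.
raw = """
def hello_world():
    print("Hello, World!")
"""

# The same code, indented to line up with the surrounding Python source.
indented = """
    def hello_world():
        print("Hello, World!")
    """

# The extra leading spaces make this a different, longer string.
print(raw == indented)
print(len(raw), len(indented))

# textwrap.dedent removes the whitespace common to all non-blank lines.
restored = textwrap.dedent(indented)
print(restored.strip() == raw.strip())
```

Because chunk sizes are counted against the full text, the extra spaces in the indented version would both consume chunk space and end up inside the resulting passages.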
PYTHON
First, initialize an instance of CodeSplitter. For example:
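A minimal sketch of the initialization, using the parameters and defaults listed above. The import path RAGchain.preprocess.text_splitter is an assumption and may differ in your installed version; the import is wrapped in try/except only so that the snippet still runs where RAGchain is not installed.

```python
# Keyword arguments mirror the parameters documented above (these are the defaults).
splitter_kwargs = {
    "language_name": "PYTHON",
    "chunk_size": 50,
    "chunk_overlap": 0,
}

try:
    # Assumed import path; check it against your RAGchain version.
    from RAGchain.preprocess.text_splitter import CodeSplitter

    code_splitter = CodeSplitter(**splitter_kwargs)
except ImportError:
    code_splitter = None  # RAGchain is not available in this environment
```

To split another language, change language_name to one of the supported values, for example JS or CSHARP.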
You can then split a document using the split_document() method, which returns a list of Passage objects. For example:
from langchain.schema import Document

python_doc = Document(
    page_content="""
def hello_world():
    print("Hello, World!")
# Call the function
hello_world()
""",
    metadata={
        'source': 'test_source',
        # Extra keys, used to check that additional document metadata ends up in metadata_etc.
        'Data information': 'test for python code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python'
    }
)
# Returns a list of Passage objects.
passages = code_splitter.split_document(python_doc)
JS
JS_doc = Document(
    page_content="""
function helloWorld()
{
    console.log("Hello, World!");
}
// Call the function
helloWorld();
""",
    metadata={
        'source': 'test_source',
        # Extra keys, used to check that additional document metadata ends up in metadata_etc.
        'Data information': 'test for js code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js',
    }
)
# Assumes code_splitter was initialized with language_name='JS'.
passages = code_splitter.split_document(JS_doc)
CSHARP
csharp_doc = Document(
    page_content="""
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed
        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
""",
    metadata={
        'source': 'test_source',
        # Extra keys, used to check that additional document metadata ends up in metadata_etc.
        'Data information': 'test for C# text document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#c',
    }
)
# Assumes code_splitter was initialized with language_name='CSHARP'.
passages = code_splitter.split_document(csharp_doc)