Code splitter

Overview

The CodeSplitter class in the RAGchain library is a text splitter that splits documents based on separators of langchain's library Language enum. This class inherits from the BaseTextSplitter class and uses the from_language method of RecursiveCharacterTextSplitter class from the langchain library to perform the splitting. Reference(Split code)

CodeSplitter supports CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.

Usage

Initialization

First, to initialize an instance of CodeSplitter, you can provide the following parameters:

  • language_name: A kind of language to split. Default is PYTHON. (CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP)

  • chunk_size: Maximum size of chunks to return. Default is 50.

  • chunk_overlap: Overlap in characters between chunks. Default is 0.

  • kwargs: Additional arguments to pass to the langchain RecursiveCharacterTextSplitter.

Let's take a look example how code splitter split documents in various language.

Notice: You can't indent TEST_DOCUMENT for legible. Splitter recognize indent and space. Input raw data not space or indent.

PYTHON

First, initialize an instance of CodeSplitter. For example:

from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name= 'PYTHON', chunk_size= 50, chunk_overlap= 0)

Split document

You can split document using split_document() method. It will return list of Passage objects. For example:

python_doc = Document(
page_content="""
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
""",
    metadata={
        'source': 'test_source',
        # Check whether the metadata_etc contains the multiple information from the TEST DOCUMENT metadatas or not.
        'Data information': 'test for python code document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python'
    }
)

passages = code_splitter.split_document(python_doc)
[Passage(id=UUID('8657860f-9a9f-412e-82dd-4fa53d59a893'), content='def hello_world():\n    print("Hello, World!")', filepath='test_source', next_passage_id=UUID('661fd986-4303-4e13-b645-83ab1c183b72'), metadata_etc={'Data information': 'test for python code document', 'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python'}), 
Passage(id=UUID('661fd986-4303-4e13-b645-83ab1c183b72'), content='# Call the function\nhello_world()', filepath='test_source', previous_passage_id=UUID('8657860f-9a9f-412e-82dd-4fa53d59a893'), metadata_etc={'Data information': 'test for python code document', 'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#python'})]

JS

Process is same as python splitter above. Input language_name JS.

from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name= 'JS', chunk_size= 60, chunk_overlap= 0)

Split document

JS_doc = Document(
page_content="""
function helloWorld()
{
    console.log("Hello, World!");
}

// Call the function
helloWorld();
""",
metadata={
    'source': 'test_source',
    # Check whether the metadata_etc contains the multiple information from the TEST DOCUMENT metadatas or not.
    'Data information': 'test for js code document',
    'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js',
}
)

passages = code_splitter.split_document(JS_doc)
[[Passage(id=UUID('bb994a5b-15cf-4d8c-a8fc-43584662fd4a'), content='function helloWorld()\n{\n    console.log("Hello, World!");\n}', filepath='test_source', next_passage_id=UUID('128ef1f2-dd79-44cc-8a66-66ed3a5aeb55'), metadata_etc={'Data information': 'test for js code document', 'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js'}), 
Passage(id=UUID('128ef1f2-dd79-44cc-8a66-66ed3a5aeb55'), content='// Call the function\nhelloWorld();', filepath='test_source', previous_passage_id=UUID('bb994a5b-15cf-4d8c-a8fc-43584662fd4a'), metadata_etc={'Data information': 'test for js code document', 'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#js'})]

C#

Process is same as python splitter above. Input language_name JS.

from RAGchain.preprocess.text_splitter import CodeSplitter

code_splitter = CodeSplitter(language_name= 'CSHARP', chunk_size= 17, chunk_overlap= 0)

Split document

csharp_doc = Document(
page_content="""
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed

        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
"""
    ,
    metadata={
        'source': 'test_source',
        # Check whether the metadata_etc contains the multiple information from the TEST DOCUMENT metadatas or not.
        'Data information': 'test for C# text document',
        'Data reference link': 'https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter#c',
    }
)

passages = code_splitter.split_document(csharp_doc)

-> C# code splitter splits into 33 passages.

Last updated