Code splitter

Overview

The CodeSplitter class in the RAGchain library is a text splitter that splits documents using the language-specific separators that langchain defines for each member of its Language enum. It inherits from the BaseTextSplitter class and performs the splitting with the from_language method of langchain's RecursiveCharacterTextSplitter class. Reference: Split code

CodeSplitter supports CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.
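To make the idea of separator-based splitting concrete, here is a minimal pure-Python sketch. It is not langchain's implementation: the function name recursive_split is hypothetical, the separator list is modelled on the one langchain uses for Python and should be treated as an assumption, and the sketch leaves out the merging of small pieces and the chunk overlap that the real splitter performs.

```python
from typing import List

# Separators in priority order, modelled on langchain's list for Python.
# The exact list is an assumption of this sketch.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text on the first separator that occurs in it, then recurse
    into any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        # Small enough already; drop pieces that are only whitespace.
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: hard cut every chunk_size characters.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            pieces = text.split(sep)
            # Re-attach the separator to every piece after the first, so that
            # keywords such as "def" stay with the code they introduce.
            pieces = [pieces[0]] + [sep + piece for piece in pieces[1:]]
            chunks: List[str] = []
            for piece in pieces:
                # Only lower-priority separators are tried on the pieces.
                chunks.extend(recursive_split(piece, separators[i + 1:], chunk_size))
            return chunks
    return [text]


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
for chunk in recursive_split(source, PYTHON_SEPARATORS, chunk_size=25):
    print(repr(chunk))
```

With this input the text is cut at each top-level "def", so every function ends up in its own chunk. Each language has its own separator list, which is why the language has to be chosen when the splitter is created.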

Usage

Initialization

To initialize an instance of CodeSplitter, you can provide the following parameters:

  • language_name: The language of the code to split. Default is PYTHON. Supported values: CPP, GO, JAVA, KOTLIN, JS, TS, PHP, PROTO, PYTHON, RST, RUBY, RUST, SCALA, SWIFT, MARKDOWN, LATEX, HTML, SOL, CSHARP.

  • chunk_size: Maximum size of chunks to return. Default is 50.

  • chunk_overlap: Overlap in characters between chunks. Default is 0.

  • kwargs: Additional arguments to pass to the langchain RecursiveCharacterTextSplitter.
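Because language_name is given as the upper-case name of an enum member, a misspelled or lower-case value will not match. The sketch below illustrates that kind of lookup with a stand-in enum. The Language class here is a three-member substitute for langchain's enum, resolve_language is a hypothetical helper, and the assumption that CodeSplitter resolves the name in the same way is not confirmed by this page.

```python
from enum import Enum


class Language(Enum):
    """Stand-in for langchain's Language enum, reduced to three members."""
    PYTHON = "python"
    JS = "js"
    CSHARP = "csharp"


def resolve_language(language_name: str = "PYTHON") -> Language:
    """Look a language up by its upper-case member name (hypothetical helper)."""
    try:
        return Language[language_name]
    except KeyError:
        supported = ", ".join(member.name for member in Language)
        raise ValueError(
            f"Unsupported language {language_name!r}; expected one of: {supported}"
        ) from None


print(resolve_language())          # Language.PYTHON
print(resolve_language("CSHARP"))  # Language.CSHARP
```

Passing "csharp" or "C#" to this helper raises ValueError, which is a reminder to use the names exactly as they appear in the supported list.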

Let's look at examples of how the code splitter splits documents in various languages.

Notice: Do not indent TEST_DOCUMENT to make it more legible. The splitter treats indentation and spaces as part of the text, so pass the raw source without any extra spaces or indentation.
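The effect of stray indentation can be checked without the splitter. The sketch below assumes that "\ndef " is one of the separators used for Python (as in langchain's list) and shows that it no longer matches once the document is indented. The variable and function names are illustrative, and textwrap.dedent is one possible way to recover the raw text.

```python
import textwrap

# Written flush left: each top-level "def" starts in column 0.
RAW_DOCUMENT = """
def hello():
    print("hello")

def bye():
    print("bye")
"""


def make_indented_document() -> str:
    # Indented to match the enclosing function: every line gains four spaces.
    return """
    def hello():
        print("hello")

    def bye():
        print("bye")
    """


INDENTED_DOCUMENT = make_indented_document()

# The separator "\ndef " only matches when "def" directly follows a newline.
print("\ndef " in RAW_DOCUMENT)       # True
print("\ndef " in INDENTED_DOCUMENT)  # False

# textwrap.dedent strips the common leading whitespace and restores the match.
DEDENTED_DOCUMENT = textwrap.dedent(INDENTED_DOCUMENT)
print("\ndef " in DEDENTED_DOCUMENT)  # True
```

When the separator fails to match, the splitter falls back to lower-priority separators such as newlines and spaces, so the chunks no longer follow function boundaries.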

PYTHON

First, initialize an instance of CodeSplitter with language_name set to PYTHON.

Split document

You can split a document using the split_document() method. It returns a list of Passage objects.

JS

The process is the same as for the Python splitter above. Set language_name to JS.

Split document

C#

The process is the same as for the Python splitter above. Set language_name to CSHARP.

Split document

The C# code splitter splits the test document into 33 passages.
