RAGChain Docs

Semantic Clustering


Last updated 1 year ago

This class clusters passages based on their semantic content. First, each passage is converted into an embedding vector that represents its semantic information. Second, the embedding vectors are grouped using a clustering algorithm of your choice.

No single clustering algorithm is optimal for every case, so it is worth trying several and comparing the results.
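The two steps described above (embed each passage, then cluster the vectors) can be sketched in plain Python. This is only an illustration of the idea, not RAGchain's implementation: `toy_embed`, `kmeans_labels`, and `cluster_passages` are hypothetical names, the keyword-count "embedding" stands in for a real embedding model, and the tiny k-means loop stands in for an sklearn algorithm.

```python
from collections import defaultdict
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: counts two keyword families."""
    words = text.lower().split()
    animal = sum(w in {"cat", "dog", "pet"} for w in words)
    finance = sum(w in {"stock", "bank", "market"} for w in words)
    return [float(animal), float(finance)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(vectors: Sequence[Vector], n_clusters: int, n_iter: int = 10) -> List[int]:
    """Minimal k-means; the first n_clusters vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:n_clusters]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cluster_passages(
    passages: List[str], embed: Callable[[str], Vector], n_clusters: int
) -> List[List[str]]:
    vectors = [embed(p) for p in passages]       # step 1: vectorize each passage
    labels = kmeans_labels(vectors, n_clusters)  # step 2: cluster the vectors
    groups = defaultdict(list)
    for passage, label in zip(passages, labels):
        groups[label].append(passage)
    return [groups[label] for label in sorted(groups)]


passages = [
    "my cat and dog are pets",
    "the stock market fell",
    "a dog is a loyal pet",
    "the bank sold its stock",
]
clusters = cluster_passages(passages, toy_embed, n_clusters=2)
```

Here plain strings play the role of passages. With a real embedding model the vectors have hundreds of dimensions, which is one reason the choice of clustering algorithm matters.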

Class Initialization

The class is initialized with two parameters:

  • embedding_function: An instance of the Embeddings class. You can easily get an Embeddings instance by using the EmbeddingFactory class.

  • clustering_algorithm: A string that specifies the clustering algorithm to be used. The default value is 'kmeans'. The supported algorithms are: 'affinity_propagation', 'agglomerative_clustering', 'birch', 'dbscan', 'kmeans', 'mean_shift', 'optics', and 'spectral_clustering'. All of these algorithms are from the sklearn.cluster module.
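As an illustration of how the algorithm strings listed above could correspond to `sklearn.cluster` estimators, here is a minimal sketch. The names `SKLEARN_CLASS_NAMES`, `resolve_class_name`, and `build_clusterer` are hypothetical and are not part of RAGchain. The class names on the right are the ones scikit-learn exposes, but how RAGchain performs the lookup internally is an assumption.

```python
import importlib
from typing import Any, Dict

# Supported algorithm strings mapped to sklearn.cluster class names.
SKLEARN_CLASS_NAMES: Dict[str, str] = {
    "affinity_propagation": "AffinityPropagation",
    "agglomerative_clustering": "AgglomerativeClustering",
    "birch": "Birch",
    "dbscan": "DBSCAN",
    "kmeans": "KMeans",
    "mean_shift": "MeanShift",
    "optics": "OPTICS",
    "spectral_clustering": "SpectralClustering",
}


def resolve_class_name(clustering_algorithm: str = "kmeans") -> str:
    """Return the sklearn.cluster class name for a supported algorithm string."""
    try:
        return SKLEARN_CLASS_NAMES[clustering_algorithm]
    except KeyError:
        supported = ", ".join(sorted(SKLEARN_CLASS_NAMES))
        raise ValueError(
            f"Unknown clustering algorithm {clustering_algorithm!r}; supported: {supported}"
        ) from None


def build_clusterer(clustering_algorithm: str = "kmeans", **kwargs: Any) -> Any:
    """Instantiate the estimator; scikit-learn is imported lazily, only when called."""
    module = importlib.import_module("sklearn.cluster")
    return getattr(module, resolve_class_name(clustering_algorithm))(**kwargs)
```

For example, `build_clusterer('kmeans', n_clusters=10)` would return a `KMeans` estimator when scikit-learn is installed. Keyword arguments differ per algorithm: `DBSCAN` takes `eps` and `min_samples` rather than `n_clusters`.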

Usage

To use the SemanticClustering class, you need to first create an instance of the Embeddings class and pass it to the SemanticClustering constructor along with the name of the desired clustering algorithm. Then, you can call the cluster method with a list of Passage objects to get the clusters.

# Build the clusterer with an embedding function obtained from EmbeddingFactory
semantic_clustering = SemanticClustering(embedding_function=EmbeddingFactory('openai').get(), clustering_algorithm='kmeans')

# Cluster the passages
passages = [<your passages>]
clusters = semantic_clustering.cluster(passages, n_clusters=10)