Introduction

R2R’s hybrid search combines traditional keyword-based searching with modern semantic understanding, providing more accurate and contextually relevant results. This approach is particularly effective for complex queries where both specific terms and overall meaning are crucial.

How R2R Hybrid Search Works

  1. Full-Text Search: Utilizes PostgreSQL’s full-text search with ts_rank_cd and websearch_to_tsquery.
  2. Semantic Search: Performs vector similarity search using the query’s embedded representation.
  3. Reciprocal Rank Fusion (RRF): Merges results from full-text and semantic searches using the formula:
    COALESCE(1.0 / (rrf_k + full_text.rank_ix), 0.0) * full_text_weight +
    COALESCE(1.0 / (rrf_k + semantic.rank_ix), 0.0) * semantic_weight
    
  4. Result Ranking: Orders final results based on the combined RRF score.

Key Features

The full-text search component incorporates:

  • PostgreSQL’s tsvector for efficient text searching
  • websearch_to_tsquery for parsing user queries
  • ts_rank_cd for ranking full-text search results

The semantic search component uses:

  • Vector embeddings for storing and querying semantic representations
  • Cosine similarity for measuring the relevance of documents to the query

Configuration

VectorSearchSettings

Key settings for vector search configuration:

class VectorSearchSettings(BaseModel):
    use_hybrid_search: bool
    search_limit: int
    filters: dict[str, Any]
    hybrid_search_settings: Optional[HybridSearchSettings]
    # ... other settings

HybridSearchSettings

Specific parameters for hybrid search:

class HybridSearchSettings(BaseModel):
    full_text_weight: float
    semantic_weight: float
    full_text_limit: int
    rrf_k: int

Usage Example

from r2r import R2RClient

client = R2RClient()

vector_settings = {
    "use_hybrid_search": True,
    "search_limit": 20,
    # Can add logical filters, as shown:
    # "filters":{"category": {"$eq": "technology"}},
    "hybrid_search_settings" : {
        "full_text_weight": 1.0,
        "semantic_weight": 5.0,
        "full_text_limit": 200,
        "rrf_k": 50
    }
}

results = client.search(
    query="Who was Aristotle?",
    vector_search_settings=vector_settings
)
print(results)

Results Comparison

{
  "results": {
    "vector_search_results": [
      {
        "score": 0.780314067545999,
        "text": "Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath...",
        "metadata": {
          "title": "aristotle.txt",
          "version": "v0",
          "chunk_order": 0
        }
      }, ...
    ]
  }
}

Hybrid Search with RRF

{
  "results": {
    "vector_search_results": [
      {
        "score": 0.0185185185185185,
        "text": "Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath...",
        "metadata": {
          "title": "aristotle.txt",
          "version": "v0",
          "chunk_order": 0,
          "semantic_rank": 1,
          "full_text_rank": 3
        }, ...
      }
    ]
  }
}

Best Practices

  1. Optimize PostgreSQL indexing for both full-text and vector searches
  2. Regularly update search indices
  3. Monitor performance and adjust weights as needed
  4. Use appropriate vector dimensions and embedding models for your use case

Conclusion

R2R’s hybrid search offers a powerful solution for complex information retrieval needs, combining the strengths of keyword matching and semantic understanding. Its flexible configuration and use of Reciprocal Rank Fusion make it adaptable to a wide range of use cases, from technical documentation to broad, context-dependent queries.