Vector Store

Through its integration with LangChain, Yellowbrick can store and search vector embeddings using SQL. This tutorial guides you through installing the required LangChain components and using Yellowbrick to support a simple RAG (Retrieval-Augmented Generation) chatbot application.

Using Yellowbrick as the Vector Store for ChatGPT

This tutorial explains how to create a simple Python-based chatbot backed by ChatGPT using Yellowbrick as a vector store to support RAG.

Prerequisites

Ensure you have the following prerequisites:

Access to a Yellowbrick instance
An API key from OpenAI

Tutorial Overview

The tutorial is divided into the following six parts:

Create a baseline chatbot using LangChain without a vector store.
Create an embeddings table in Yellowbrick to represent the vector store.
Load documents (e.g., the Administration chapter of the Yellowbrick Manual).
Create vector embeddings from those documents and store them in Yellowbrick.
Ask the improved chatbot the same questions to compare results.
Add an LSH index to improve retrieval performance.

Part 1: Creating a Baseline Chatbot without a Vector Store

Use LangChain to query ChatGPT without any additional context from a vector store.

Perform the following steps:

1. Install the Required Libraries

python

%pip install --upgrade --quiet langchain langchain-openai langchain-community psycopg2-binary tiktoken

2. Set Up the Credentials and Configuration

python

# Replace these values with your Yellowbrick environment details and OpenAI API key
YBUSER = "[USER]"
YBPASSWORD = "[PASSWORD]"
YBDATABASE = "[DATABASE]"
YBHOST = "myinstance.aws.yellowbrickcloud.com"
OPENAI_API_KEY = "[OPENAI_API_KEY]"
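
Hardcoding credentials in a notebook makes them easy to leak. As an alternative, the following minimal sketch reads the same settings from environment variables. The environment variable names and the `load_settings` helper are assumptions chosen to mirror the variables above; they are not part of Yellowbrick or LangChain.

```python
import os

# Hypothetical helper: the environment variable names below are assumptions
# chosen to mirror the variables used in this tutorial.
REQUIRED_SETTINGS = ("YBUSER", "YBPASSWORD", "YBDATABASE", "YBHOST", "OPENAI_API_KEY")


def load_settings(env=os.environ):
    """Return the tutorial settings as a dict, failing early if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

With the variables exported in your shell, `settings = load_settings()` followed by `YBUSER = settings["YBUSER"]` (and so on) would replace the literal values.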

3. Import the Libraries and Initialize ChatGPT

python

import os
import urllib.parse as urlparse
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate

# Set OpenAI API Key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Define the prompt template
system_template = """If you don't know the answer, make your best guess."""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

# Initialize ChatGPT Model
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256)
chain = LLMChain(llm=llm, prompt=prompt)

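# Helper that sends a question to the chatbot and prints the answer
def print_result_simple(query):
    # invoke() replaces the deprecated chain(query) call style
    result = chain.invoke({"question": query})
    print(f"Question: {query}\nAnswer: {result['text']}")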

# Example queries
print_result_simple("How many databases can be in a Yellowbrick Instance?")
print_result_simple("What's an easy way to add users in bulk to Yellowbrick?")

Part 2: Connecting to Yellowbrick and Creating the Embedding Tables

Create a UTF-8 encoded table in Yellowbrick to store document embeddings by performing the following steps:

python

import psycopg2

# Build the connection string; URL-encode the user name and the password so that
# special characters (such as "@" or ":") do not break the URI
yellowbrick_connection_string = (
    f"postgres://{urlparse.quote(YBUSER, safe='')}:{urlparse.quote(YBPASSWORD, safe='')}"
    f"@{YBHOST}:5432/{YBDATABASE}"
)
embedding_table = "my_embeddings"

# Create the embedding table if it does not already exist.
# Note: TRUNCATE removes any rows already in the table, so the tutorial starts empty.
create_table_query = f"""
CREATE TABLE IF NOT EXISTS {embedding_table} (
    doc_id UUID NOT NULL,
    embedding_id SMALLINT NOT NULL,
    embedding DOUBLE PRECISION NOT NULL
)
DISTRIBUTE ON (doc_id);
TRUNCATE TABLE {embedding_table};
"""

with psycopg2.connect(yellowbrick_connection_string) as conn:
    with conn.cursor() as cursor:
        cursor.execute(create_table_query)
        conn.commit()
        print(f"Table '{embedding_table}' created successfully!")
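
The table above has a scalar `embedding` column plus an `embedding_id`, which suggests that each vector is stored as one row per element rather than as a single array value. The following sketch shows that mapping under that assumption. `vector_to_rows` is a hypothetical helper and the zero-based position is an arbitrary choice; the LangChain integration performs its own inserts, so you do not need this code to follow the tutorial.

```python
import uuid


def vector_to_rows(doc_id, vector):
    """Expand one embedding vector into (doc_id, embedding_id, embedding) rows."""
    # Assumption: embedding_id is the element's position within the vector
    return [(str(doc_id), position, float(value)) for position, value in enumerate(vector)]


doc_id = uuid.uuid4()
rows = vector_to_rows(doc_id, [0.12, -0.5, 0.33])
for row in rows:
    print(row)
```

Under this layout, a 3-element vector becomes three rows that share one `doc_id`, which is consistent with distributing the table on `doc_id`.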

Part 3: Extracting the Documents to Index

Retrieve documents from an existing Yellowbrick table to prepare for embedding by performing the following steps:

python

YB_DOC_DATABASE = "sample_data"
YB_DOC_TABLE = "yellowbrick_documentation"

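# URL-encode the user name and the password so that special characters do not break the URI
yellowbrick_doc_connection_string = (
    f"postgres://{urlparse.quote(YBUSER, safe='')}:{urlparse.quote(YBPASSWORD, safe='')}"
    f"@{YBHOST}:5432/{YB_DOC_DATABASE}"
)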

with psycopg2.connect(yellowbrick_doc_connection_string) as conn:
    with conn.cursor() as cursor:
        cursor.execute(f"SELECT path, document FROM {YB_DOC_TABLE}")
        yellowbrick_documents = cursor.fetchall()
        print(f"Extracted {len(yellowbrick_documents)} documents successfully!")
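
Parts 2 and 3 build nearly identical connection strings that differ only in the database name. A small helper can remove the duplication and percent-encode the credentials in one place. This is a sketch: `build_connection_string` is a hypothetical name, and the default port of 5432 is taken from the connection strings in this tutorial.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database, port=5432):
    """Build a postgres:// URI, percent-encoding the credentials and database name."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


uri = build_connection_string(
    "alice", "p@ss:word", "myinstance.aws.yellowbrickcloud.com", "sample_data"
)
print(uri)
# prints postgres://alice:p%40ss%3Aword@myinstance.aws.yellowbrickcloud.com:5432/sample_data
```

The example user and password are made up; the point is that the `@` and `:` in the password are encoded, so they cannot be mistaken for URI delimiters.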

Part 4: Loading the Documents into the Yellowbrick Vector Store

Split documents into chunks, create embeddings, and insert them into Yellowbrick by performing the following steps:

python

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Yellowbrick

DOCUMENT_BASE_URL = "https://docs.yellowbrick.com/6.7.1/"
chunk_size_limit = 2000
max_chunk_overlap = 200

# Convert fetched documents to LangChain Document format
documents = [
    Document(
        page_content=doc[1],
        metadata={"source": DOCUMENT_BASE_URL + doc[0].replace(".md", ".html")}
    )
    for doc in yellowbrick_documents
]

# Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_limit,
    chunk_overlap=max_chunk_overlap,
    separators=["\n## ", "\n", ",", " ", ""]
)
split_docs = text_splitter.split_documents(documents)

# Create embeddings and store in Yellowbrick
embeddings = OpenAIEmbeddings()
vector_store = Yellowbrick.from_documents(
    documents=split_docs,
    embedding=embeddings,
    connection_string=yellowbrick_connection_string,
    table=embedding_table
)

print(f"Created vector store with {len(split_docs)} document chunks.")
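
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width chunker in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on the listed separators; it only illustrates how consecutive chunks share overlapping characters. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# prints ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. With the tutorial's values (2000 and 200), the overlap helps keep a sentence that straddles a chunk boundary visible in both neighboring chunks.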

Part 5: Creating a Chatbot Using the Yellowbrick Vector Store

Enhance the chatbot by using the populated vector store to improve answers with relevant context.

Perform the following steps:

python

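from langchain.chains import RetrievalQAWithSourcesChain

# The "stuff" chain passes the retrieved chunks to the prompt through the
# {summaries} variable, so the Part 1 prompt (which lacks it) cannot be reused.
# The wording of this template is illustrative; adjust it to your needs.
system_template = """Use the following pieces of context to answer the user's question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2".
Use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, say that you don't know; don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

# Initialize Retrieval Chain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

def print_result_with_sources(query):
    # invoke() replaces the deprecated chain(query) call style
    result = chain.invoke({"question": query})
    print(f"""\nQuestion: {query}\nAnswer: {result['answer']}\nSources: {result['sources']}\n""")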

# Example queries with vector store context
print_result_with_sources("How many databases can be in a Yellowbrick Instance?")
print_result_with_sources("What's an easy way to add users in bulk to Yellowbrick?")
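
Conceptually, a retriever with `k` set to 5 returns the five stored chunks whose embeddings are most similar to the embedding of the question. The sketch below shows a brute-force version of that idea in plain Python, assuming cosine similarity as the ranking metric (this tutorial does not state which metric Yellowbrick uses). The `top_k` helper and the toy 3-dimensional vectors are illustrative only; in the tutorial, the search runs as SQL inside Yellowbrick.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (doc_id, score) pairs most similar to query_vector."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data: made-up document IDs and 3-dimensional "embeddings"
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], stored, k=2)
print([doc_id for doc_id, _ in results])
# prints ['doc-a', 'doc-b']
```

The brute-force scan compares the query against every stored vector, which is the cost that an index is meant to reduce on large document sets.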

Part 6: Introducing an Index to Improve Performance

Use LSH (Locality-Sensitive Hashing) to optimize retrieval time when dealing with large document sets.

Perform the following steps:

python

lsh_params = Yellowbrick.IndexParams(
    Yellowbrick.IndexType.LSH, {"num_hyperplanes": 8, "hamming_distance": 2}
)
vector_store.create_index(lsh_params)

# Query with index enabled
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"index_params": lsh_params, "k": 5}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# Example indexed queries
print_result_with_sources("How many databases can be in a Yellowbrick Instance?")
print_result_with_sources("What's an easy way to add users in bulk to Yellowbrick?")
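
The two index parameters can be read in terms of random-hyperplane LSH: each vector is reduced to a short bit signature (one bit per hyperplane), and only vectors whose signatures differ from the query's signature in at most a few bits are compared exactly. The sketch below illustrates that general technique in plain Python. It is an assumption that Yellowbrick's index follows the same scheme, and all helper names are hypothetical.

```python
import random


def make_hyperplanes(num_hyperplanes, dimensions, seed=0):
    """Draw random hyperplane normals from a standard normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimensions)] for _ in range(num_hyperplanes)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming_distance(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def candidates(query, stored, hyperplanes, max_distance):
    """Keep only stored vectors whose signature is within max_distance bits of the query's."""
    query_sig = signature(query, hyperplanes)
    return [
        doc_id
        for doc_id, vec in stored.items()
        if hamming_distance(query_sig, signature(vec, hyperplanes)) <= max_distance
    ]


planes = make_hyperplanes(num_hyperplanes=8, dimensions=3)
stored = {"same-direction": [2.0, 4.0, 6.0], "opposite": [-1.0, -2.0, -3.0]}
print(candidates([1.0, 2.0, 3.0], stored, planes, max_distance=2))
# prints ['same-direction']
```

In this scheme, more hyperplanes give finer-grained signatures, while a larger Hamming distance admits more candidates, trading speed for recall.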

Next Steps

Modify the code to ask different questions or load your own documents.
Use alternative embeddings (e.g., Hugging Face models or private LLMs, such as Meta's Llama 2).
Explore the flexibility of LangChain to handle various document formats, such as HTML and PDF.

For more information, see the Yellowbrick documentation.
