In our previous article on Retrieval Augmented Generation (RAG), we discussed the need for a Vector Database to retrieve additional information for our prompts. Today, we will dive into the inner workings of a Vector Database to better understand exactly how this technology functions. Let’s go!
What is a Vector Database in Simple Terms?
Vectors (and Word Vectors)
Vector Databases hold information like documents, images, and audio files that do not fit into the tabular format expected by traditional databases. Instead, the storage, retrieval, and management of this data is done in the form of Vectors. In mathematical terms, a vector is an object with a value for magnitude and a value for direction.
Practically speaking, a vector describes the relationship between two points with respect to a given number of dimensions. In the context of word vectors, think of how the word “boy” relates to other nouns on an X, Y plane where X is Gender and Y is Age.
The above vectors quantify the relationship “boy” has to each of the other nouns. For example, from “boy” to “woman,” we see a considerably greater magnitude as well as a different direction compared to the vector from “boy” to “child.”
This makes sense in that a “woman” is considerably older than a “boy” and has the opposite gender, whereas a “child” might have the same relative age as a “boy” but is a gender-neutral noun.
In this comparison, we can say that the word “child” is the most similar to our query of “boy” relative to the other nouns. This similarity is what allows us to conduct searches in our Vector Database.
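To make this concrete, here is a minimal sketch of the idea, assuming toy 2-D coordinates on the (gender, age) plane described above. The coordinates are illustrative assumptions, not real embeddings:

```python
import math

# Toy 2-D word vectors on a (gender, age) plane.
# Coordinates are illustrative assumptions, not real embeddings.
words = {
    "boy":   (-1.0, 0.2),   # male, young
    "child": ( 0.0, 0.2),   # gender-neutral, young
    "woman": ( 1.0, 0.8),   # female, adult
}

def displacement(a, b):
    """Vector from word a to word b: (dx, dy)."""
    (ax, ay), (bx, by) = words[a], words[b]
    return (bx - ax, by - ay)

def magnitude(v):
    """Length (magnitude) of a 2-D vector."""
    return math.hypot(*v)

# "boy" -> "child" is a short hop; "boy" -> "woman" is a long one.
print(magnitude(displacement("boy", "child")))  # 1.0
print(magnitude(displacement("boy", "woman")))  # ~2.09
```

The smaller magnitude from “boy” to “child” is exactly what we mean when we say “child” is the most similar noun to “boy.”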
The core idea behind searching in a vector database is that similar items will have similar vectors. By “similar,” we mean that the vectors are close to each other in the vector space. The distance between two vectors can be measured using various methods, with cosine similarity and Euclidean distance being two of the most common.
Let’s start with the L2 norm, or Euclidean Distance, as this is the most straightforward and easily understood method. You might recognize Euclidean distance from your time in middle school algebra.
As shown above, Euclidean distance measures the “straight-line” distance between two points on an X, Y plane. When used in search, we compute the Euclidean distance between the query vector and a given number of item vectors in the database. The vector with the smallest distance to the query is considered the most similar item.
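The search procedure can be sketched in a few lines of Python, again using hypothetical 2-D vectors (real embeddings have hundreds of dimensions, but the logic is identical):

```python
import math

# Hypothetical 2-D item vectors (gender, age).
items = {
    "child": (0.0, 0.2),
    "woman": (1.0, 0.8),
}
query = (-1.0, 0.2)  # "boy"

# Rank items by straight-line (L2) distance to the query;
# the smallest distance wins.
nearest = min(items, key=lambda w: math.dist(query, items[w]))
print(nearest)  # child
```

`math.dist` computes exactly the “straight-line” distance shown above, and `min` picks the item closest to the query.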
In comparison, Cosine Similarity measures the cosine of the angle between two vectors. Simply put, the smaller the angle, the more similar the query vector is to that particular item.
For example, if the query vector and item vector are identical ("boy" == "boy"), the angle between them is 0° and the cosine similarity score is 1. If the vectors are unrelated ("boy" != "tennis"), they will be close to orthogonal, with an angle approaching 90° and a similarity score approaching 0. Vectors pointing in opposite directions (an angle of 180°) produce a score of -1.
Most Vector Databases use cosine similarity as their default retrieval function, where the cosine similarity between the query vector and a given number of item vectors is computed to determine how similar each item is to the query. The item with the highest cosine similarity score to the query is considered the most similar.
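Cosine similarity is simple to implement directly. The following sketch shows the three landmark cases: identical vectors score 1, orthogonal vectors score 0, and opposite vectors score -1:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

boy = (1.0, 2.0)
print(cosine_similarity(boy, boy))         # 1.0  (identical)
print(cosine_similarity((1, 0), (0, 1)))   # 0.0  (orthogonal)
print(cosine_similarity((1, 0), (-1, 0)))  # -1.0 (opposite)
```

Note that cosine similarity ignores magnitude entirely: it only compares direction, which is often what you want when vector lengths vary.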
What is the Difference Between a Vector Database and a Relational Database?
Data Representation and Use Cases
Relational databases store data in rows and columns (tabular format), with each row representing an entity and each column representing an attribute of that entity. Vector Databases, on the other hand, store data as vectors — sequences of numbers that can represent the essence of an item.
This representation is what makes them appropriate for storing and retrieving non-traditional data sources like documents, images, and audio files.
Relational databases depend on SQL (Structured Query Language) for querying. You might ask for data that meets certain criteria (e.g., “all accounts where the balance is less than 0”). In a Vector Database, your search vector is instead compared against the vectors stored in the database to find the most similar vector(s). As such, you can expect to interact with a Vector Database through a client library rather than an entire language like SQL.
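To illustrate the library-style interaction, here is a minimal in-memory sketch. The class and method names (`add`, `query`) are hypothetical and do not correspond to any real library's API, but most vector database clients expose a similar add-then-query shape:

```python
import math

class TinyVectorStore:
    """A minimal sketch of the interface a Vector Database client
    exposes: add vectors, then query for the most similar items.
    Names are illustrative, not any real library's API."""

    def __init__(self):
        self._items = {}  # item id -> vector

    def add(self, item_id, vector):
        self._items[item_id] = vector

    def query(self, vector, k=1):
        """Return the ids of the k items most similar to the query,
        ranked by cosine similarity (highest first)."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        ranked = sorted(self._items,
                        key=lambda i: cos(vector, self._items[i]),
                        reverse=True)
        return ranked[:k]

store = TinyVectorStore()
store.add("child", (0.1, 0.2))
store.add("woman", (1.0, 0.8))
print(store.query((0.05, 0.2)))  # ['child']
```

Production systems replace the brute-force `sorted` scan with approximate nearest-neighbor indexes so queries stay fast over millions of vectors, but the interface pattern is the same.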
What are Embedding Models and How are They Used in Vector Databases?
Embedding models are what transform your raw data (pdfs, mp3s, etc.) into vector embeddings. The goal of these embeddings is to represent data in such a way that the geometric distance between vectors is meaningful. In other words, similar items in the original data space should be close together in the embedded vector space, and dissimilar items should be far apart.
A common example is word embeddings in natural language processing. Words are transformed into vectors such that words with similar meanings, or words used in similar contexts, are close together in the vector space. For instance, the vectors for “boy” and “child” from our earlier example are closer together in this space than the vectors for “boy” and “woman.”
Your Vector Database will likely have a default embedding model that has been abstracted away, so that you need not choose a model yourself. However, the Generative AI space is rapidly evolving, and you may need to experiment with different models to try to improve performance or search relevance.
If this is the case, you can find the latest developments on the Hugging Face Leaderboard.
What is the Best Open-Source Vector Database?
You should always choose a technology based on how that particular component’s features align with your project’s requirements. If you’re looking for a starting point for prototyping, you might want to investigate some of the following options:
FAISS (Facebook AI Similarity Search): FAISS is adept at indexing, searching, and clustering large collections of high-dimensional vectors, particularly in image recognition and semantic text search. It’s highly efficient in memory consumption and query time, capable of handling hundreds of vector dimensions. Common applications include large-scale image search engines and semantic search systems in text data.
Milvus: Milvus excels in vector indexing and querying, using advanced algorithms for fast retrieval in large datasets. It integrates well with frameworks like PyTorch and TensorFlow. Applications span across various industries, including e-commerce recommendation systems, image and video analysis, and natural language processing for document clustering and semantic search.
Chroma: Chroma was designed from the ground up to be integrated with large language model (LLM) applications and handles multiple data types and formats. It’s particularly effective with audio data, making it suitable for audio-based search engines, music recommendations, and other audio-centric applications.
What are Some Managed Cloud Services for Vector Databases?
Pinecone: Known for its user-friendly interface and focus on large-scale machine learning applications, Pinecone supports high-dimensional Vector Databases for use cases like similarity search, recommendation systems, and semantic search while abstracting away infrastructure responsibilities. It offers real-time data analysis capabilities, useful in cybersecurity for threat detection, as well as integrations with various systems, including GPT models and Elasticsearch.
Snowflake Data Cloud: With their new Snowflake Cortex offering, Snowflake now provides a fully managed platform for LLM app development. These services include quick data analysis and models-as-a-service, including Meta AI’s Llama 2 model. Snowpark Container Services, another key component, allows developers to deploy, manage, and scale custom containerized workloads, including open-source LLMs, using Snowflake-managed infrastructure.
With respect to Vector Databases, Cortex includes vector embedding and similarity search functionality along with Streamlit integration for developing LLM app interfaces with minimal coding. Snowpark Container Services offers further customization options for LLM applications, enabling the deployment of containerized workloads and custom user interfaces. Snowflake emphasizes its focus on security and governance, offering these capabilities within the secure boundaries of its platform.
Today, we learned how Vector Databases offer a unique solution for storing non-tabular data like documents, images, and audio. We explored the two primary similarity metrics of cosine similarity and Euclidean distance and how Vector Databases differ from relational databases in terms of data representation and querying mechanisms.
We also learned the importance of embedding models in transforming raw data into meaningful vector embeddings. We reviewed open-source Vector Databases like FAISS, Milvus, and Chroma and managed cloud services such as Pinecone and Snowflake Cortex.
You should feel confident in your knowledge of this technology and what it can achieve, but you don’t have to journey alone. The engineering team at phData is here to help!
phData: Engineering Excellence
And why stop there?
Free Generative AI Workshop
Looking for guidance on how to best leverage Snowflake’s newest AI & ML capabilities? Attend one of our free generative AI workshops for advice, best practices, and answers.