The pace of development in artificial intelligence (AI) has become an all-out sprint, with breakthroughs released almost daily. For technologists, understanding and keeping up with the latest techniques is vital.
One such development that has proven itself to be of considerable value is Retrieval Augmented Generation (RAG).
By the end of this article, you will have a firm grasp of RAG’s importance, usage, and implementation.
What is Retrieval Augmented Generation?
Simply put, RAG is a prompt engineering technique used to enhance the output of a Large Language Model (LLM). This is achieved by retrieving additional information from a knowledge base external to the LLM and using it to augment the prompt provided to the LLM. Ultimately, these two steps lead to output far superior to what the LLM would have generated on its own.
Understanding Retrieval Augmented Generation (RAG)
Generative Artificial Intelligence (Gen AI) has witnessed significant advancements, including the rise of large language models (LLMs). These models can produce remarkably human-like text. However, they have one considerable drawback: their knowledge is only as fresh as their latest training.
In other words, LLMs are not dynamic but rather static in nature, which prevents them from answering questions about recent events or information. This is the challenge that Retrieval Augmented Generation (RAG) addresses.
At its core, RAG is a technique designed to infuse LLMs with real-time, targeted information by acting as an intermediary, fetching the most relevant and recent data and presenting it to the LLM to optimize the generated response. This is done by creating a store of relevant knowledge, usually in the form of embeddings in a vector database, to supplement additional context for the LLM to consider when formulating a response.
Let’s say we have an HR chatbot that needs to resolve issues for internal staff (similar to ChatGPT). We give it a prompt about a company-specific policy and receive a generic response that at first seems somewhat helpful; however, after further examination, the response either lacks detail, has out-of-date information, or is even outright wrong (hallucination).
What can be done to bridge this gap? Consider the diagram below:
Instead of directly prompting the LLM, we provide our HR chatbot with our organization’s unique document data to consider before responding. This is achieved by adding a retrieval mechanism based on the user’s query.
The original prompt/query is used to search the vector database, and the returned data is included in the prompt to the LLM, ensuring a response relevant to our organization.
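The retrieve-augment-generate flow described above can be sketched in a few lines. This is a minimal, self-contained illustration: the knowledge base, the word-overlap retriever, and the placeholder LLM call are all stand-ins for the real components (a vector database and a hosted model), not a specific vendor API.

```python
# A toy knowledge base standing in for an organization's document store.
KNOWLEDGE_BASE = {
    "pto policy": "Employees accrue 1.5 PTO days per month, capped at 30 days.",
    "remote work": "Staff may work remotely up to 3 days per week with manager approval.",
}

def retrieve(query: str) -> str:
    """Return the stored passage whose key overlaps most with the query.
    (A real system would use vector similarity instead of word overlap.)"""
    scores = {
        key: len(set(query.lower().split()) & set(key.split()))
        for key in KNOWLEDGE_BASE
    }
    best = max(scores, key=scores.get)
    return KNOWLEDGE_BASE[best]

def augment(query: str, context: str) -> str:
    """Combine the retrieved context with the user's original question."""
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

def generate(prompt: str) -> str:
    """Placeholder for the actual LLM call."""
    return f"[LLM response grounded in: {prompt[:40]}...]"

query = "How many days of remote work are allowed?"
print(generate(augment(query, retrieve(query))))
```

Even in this toy form, the chatbot's answer is now grounded in a company-specific policy rather than the model's static training data.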
What is the Impact of RAG?
As demonstrated above, AI systems need to be more than just “generally right”; they must deliver timely and context-specific answers that resonate with the user’s immediate needs.
Here’s a deeper look into why RAG is reshaping the Gen AI paradigm:
Dynamic Data Integration: Traditional LLMs, once trained, don’t update their knowledge. RAG, however, allows for a continuous influx of fresh data, ensuring the system’s responses are based on the latest available information.
Contextual Relevance: By tapping into specific organizational or industry databases, RAG ensures that the generated responses are not merely accurate but also contextually relevant to the problems of that specific company.
Efficiency and Cost Savings: Retraining an LLM consumes massive amounts of compute and, consequently, funding. RAG sidesteps this by leveraging real-time data without altering the core LLM, which not only saves time but also significantly reduces operational costs.
Transparency and Trust: One of the cornerstones of modern AI is transparency. Users want to know the source of the information they receive. RAG’s ability to fetch and present data from specific, verifiable sources means users can trace how the AI arrived at a response, building deeper trust among the user base.
In summary, while LLMs laid the foundation for Gen AI’s capabilities, RAG pushes the paradigm toward a more agile, accurate, and trustworthy solution.
What are Some Examples of Using Retrieval Augmented Generation?
Finance Use Case
To better understand the capabilities of RAG, let’s consider a real-world scenario in the finance sector. Imagine a multinational bank with a large portfolio of services, including personal banking, investment banking, and commercial banking. The bank wants to provide its customers with a chatbot that can respond to queries about their account details and investment options relative to changes in the stock market, interest rates, and global economic events.
FinBot 1.0 - A Limited Success
The first iteration of the chatbot begins with just an open-source LLM. This model might be well-versed in explaining financial concepts, basic banking procedures, or even the history of certain economic events. However, it would be incapable of responding to questions about recent events or new regulatory policies.
Prompts like, “What’s the latest on the Federal Reserve’s interest rate decision?” or “How is the ongoing trade war impacting tech stocks?” would be outside its knowledge base, given that such real-time data would not be part of its last training phase.
As an example, let’s take a look at a response from GPT-4, which only has data available up to January 2022.
GPT-4 proceeds to provide general principles and economic theory on increased interest rates but ultimately fails to answer the question. FinBot would behave no differently.
One solution might be to retrain the underlying LLM. However, regularly retraining the LLM with every shift in the financial landscape would be not only computationally intensive but also cost-prohibitive.
FinBot 2.0 - RAG Enhanced
Now, let’s integrate RAG into this system. Alongside the LLM, the bank has a plethora of real-time data streams: live stock market feeds, databases with global economic indicators, recent news articles on economic events, and analytical reports from financial experts. These sources can be consumed to form a knowledge store for the chatbot to reference at runtime.
So, when a customer asks how the latest trends in the tech stock market might affect their portfolio, all of this contextual, up-to-date information is presented to the LLM, which then crafts a comprehensive response, offering the customer an informed and accurate analysis they can act on.
Legal Firm Use Case
RAG is a paradigm not restricted to any single industry. To illustrate, let’s switch from the domain of financial analysis to legal analysis. A firm wishes to provide a tool that gives associates the ability to research U.S. Code interpretations.
Let’s consider a pipeline with and without RAG. The diagram below shows an analysis pipeline for explaining legal terminology in the context of a given provision of law. This pipeline simply prompts the LLM directly to explain a term of interest for a given provision without any further context.
As you can see, the response is generic, but as we saw previously with the finance use case, we can do better! Let’s examine in the diagram below how this pipeline is improved by taking a RAG-based approach.
The augmented pipeline provides the LLM with additional relevant sentences from case law that mention the specific term specified by the user. The addition of this information retrieval system to supply explanatory sentences is RAG at work.
With this added context, the LLM provides a more accurate explanation of the legal terminology the team is investigating as well as the ability to verify the documents used to create that enriched definition.
How Do You Implement Retrieval Augmented Generation?
Now that we have an understanding of what RAG is and its substantial impact, let’s discuss the steps for implementing a RAG-based system.
1. Build a Knowledge Repository
Start by gathering all the dynamic data sources your system needs. This could range from structured databases to unstructured data like blogs, news feeds, and more. Convert this information into a common document format to create a unified knowledge repository. This store of information is the foundation of any RAG system, ensuring the LLM has access to the most recent and relevant data.
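One common way to unify heterogeneous sources is to normalize each one into a simple document record and split long texts into chunks that fit an embedding model's input window. The sketch below is illustrative: the source names, the record fields, and the 50-word chunk size are all assumptions, not a prescribed schema.

```python
def to_document(source: str, text: str) -> dict:
    """Normalize any raw text (PDF extract, blog post, feed item) into a uniform record."""
    return {"source": source, "text": text.strip()}

def chunk(doc: dict, max_words: int = 50) -> list[dict]:
    """Split a document into word-bounded chunks so each piece can be embedded on its own."""
    words = doc["text"].split()
    return [
        {"source": doc["source"], "text": " ".join(words[i:i + max_words])}
        for i in range(0, len(words), max_words)
    ]

# Hypothetical raw sources feeding the repository.
raw_sources = [
    ("hr_policy.pdf", "Employees accrue paid time off monthly. " * 20),
    ("benefits_blog", "Our dental plan covers two cleanings per year."),
]

repository = []
for name, text in raw_sources:
    repository.extend(chunk(to_document(name, text)))

print(len(repository), "chunks in the repository")
```

Keeping the `source` field on every chunk is what later enables the traceability benefit discussed above: each retrieved passage can be linked back to the document it came from.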
2. Build the Vector Database
Once your knowledge repository is set up, the next step is to convert this data into numerical representations using embedding models. These models convert textual data into vectors, making it easily searchable and retrievable. These vector embeddings are then stored in a vector database. Embedding models are typically pretrained and open source, making them easy to adopt and their performance verifiable. For the latest benchmarks, check the Hugging Face leaderboard.
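To make the embed-and-store pattern concrete without depending on a pretrained model, the sketch below uses a bag-of-words count vector over a tiny fixed vocabulary as a stand-in embedding; a real system would swap in an open-source embedding model and a proper vector database, but the storage shape is the same.

```python
# Toy vocabulary standing in for a learned embedding space.
VOCAB = ["interest", "rate", "stock", "policy", "dental", "remote"]

def embed(text: str) -> list[float]:
    """Map text to a fixed-length vector of per-term counts.
    (Stand-in for a pretrained embedding model's encode() call.)"""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

# A minimal in-memory "vector database": (vector, document) pairs.
documents = [
    "The central bank raised the interest rate today.",
    "Our dental policy covers two cleanings per year.",
]
vector_store = [(embed(doc), doc) for doc in documents]

print(vector_store[0][0])
```

The key property to notice is that every document, regardless of origin, ends up as a fixed-length vector, which is what makes similarity search possible in the next step.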
3. Dynamic Retrieval Mechanism
As users present their queries to the system, RAG uses the same embedding model to convert each query into a vector, which is used to search the index for documents similar to that question. Once the most similar ‘K’ embeddings have been found, they are returned as a list of documents for the LLM to consume (K being the desired number of documents).
4. Integrating with the LLM
With the contextual data retrieved, it’s merged with the user’s original prompt and presented to the LLM. The LLM then crafts a response utilizing the data provided by the RAG component as additional context.
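The merge step usually amounts to a prompt template that places the retrieved passages ahead of the user's question. The template wording below is an assumption for illustration; production systems tune this phrasing carefully.

```python
def build_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Merge retrieved passages with the user's original question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Use only the context below to answer the question.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How does the latest rate decision affect my portfolio?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```

The resulting string is what actually gets sent to the LLM; instructing the model to answer only from the provided context is a common guard against hallucination.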
5. System Testing
Once implemented, we must always conduct testing. This not only helps identify potential areas of improvement but also ensures the system meets its intended objectives. To test the vector database, we can use Recall (relevant documents retrieved / total relevant documents) and Precision (relevant documents retrieved / total documents retrieved) as metrics to evaluate its performance.
Additionally, we need a data set representing the ground truth of what was originally embedded in the database. By pushing these expected ground truth associations through a test harness, we can then evaluate how well the actual returned data matches the expected result.
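The evaluation described above can be sketched directly from the two formulas. The ground-truth mapping of queries to document ids below is hypothetical test data.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision = relevant retrieved / total retrieved;
    Recall = relevant retrieved / total relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Ground truth: which document ids SHOULD come back for each test query.
ground_truth = {"q1": {"d1", "d2", "d3"}}
# What the vector database actually returned for the same query.
results = {"q1": {"d1", "d2", "d4"}}

p, r = precision_recall(results["q1"], ground_truth["q1"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Running every ground-truth query through a harness like this and averaging the scores gives a repeatable measure of retrieval quality as the knowledge base evolves.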
And just like that, you have reviewed the basic steps for a RAG-enhanced LLM. Not too bad, right? If you would like to dive deeper into implementation, one of the leading experts in AI, Andrew Ng, offers a free course on LLMs with RAG using the LangChain library, titled LangChain: Chat with Your Data. The course takes only an hour or so to complete and offers an excellent practical overview of the concepts presented in this article.
Snowflake and Gen AI
Next Generation AI App Development
Now that we have a good understanding of RAG, we need to select development tools that enable effective implementation. As such, a platform that focuses on Data and AI will give us the best results. This is where the Snowflake Data Cloud can offer tremendous value.
Building Knowledge Repositories
Snowflake’s architecture decouples compute and storage so that data is centrally located and shared amongst independently scaled compute clusters. For our RAG system, we can use an internal stage as a single source of truth to store our documents. With these centralized knowledge repositories, we can choose which document sources to include in our vector database.
Hosting a Vector Database
For our RAG system, we need a vector database for retrieval of our embedded knowledge. With Snowpark Container Services, we can easily host the vector database of our choice directly in Snowflake. Additionally, as Snowflake expands the integration of GPUs into its offerings, we can expect our vector searches to be highly performant.
In addition to our vector database, we need to consider how to incorporate an LLM for generating our final outputs. Given the uncertain legal environment around third-party LLM APIs, an emerging safe practice is self-hosting an open-source LLM.
For example, we could host Llama-2 on a Snowpark container enabled with GPUs to consume retrievals from our Vector Database, thus completing our RAG system.
phData: Engineering Excellence
And why stop there?