Large Language Models (LLMs) represent a major breakthrough in artificial intelligence (AI), allowing computers to generate and respond to text in a way that feels remarkably human. One appealing use case is an LLM-powered question-answering chatbot; however, general-purpose LLMs don't know anything about your business context.
One way around this is to provide that information inside the prompt. That’s where Retrieval Augmented Generation, or RAG, comes in. For each question, the system finds relevant text and then provides it to the LLM as context, prompting the LLM to answer the question using the context.
For a deeper dive, check out our full blog on RAG.
Snowpark Container Services lets you build and deploy containers in a Kubernetes-based cluster, allowing you to create services with entirely custom software dependencies. You can even leverage nodes with GPUs to host open-source LLMs inside the Snowflake environment.
Use Case: HR Policy Chatbot
For this use case, let’s imagine a company’s HR team getting bombarded with employee questions about different aspects of the company’s HR policies. The team frequently pulls up the relevant policy, determines how it applies to the situation, and converses with employees to understand the policy language.
To lighten the load on the HR team, we want to develop a chatbot that can answer as many of these questions as possible. A traditional implementation would respond with canned answers or scripted interactions triggered by specific keywords. These interactions can be unsatisfying: employees who don't know the exact keywords get nowhere, and even when they do, they are still left to interpret the policy language themselves. The result feels less like a conversation and more like a search engine.
An AI chatbot like ChatGPT gives employees a much more natural experience. The chatbot appears to understand questions and follow instructions, but it can only respond accurately when the subject of the question is part of the data the model was trained on.
When given a question like, “How many sick days do we get?” a general LLM might respond in unpredictable ways. If you’re lucky, it might say that it has no way of knowing, but some LLMs create a completely fictional response instead.
RAG is a strategy to provide new data to the model cost-effectively without building custom models. LLMs have been shown to work very well at extracting information from text and summarizing longer text with a specific purpose. RAG leverages these strengths by first finding relevant text (Retrieval), combining it with the question (Augmented), and instructing the LLM to give a response only using the provided text (Generation).
When we put this all together, we end up with a powerful chatbot that is both aware of our HR policies and can respond to the intent behind each question instead of just looking for keywords.
RAG In Action
The video below is an example of an interaction with our HR Policy Bot. The employee asks the chatbot a question that it currently does not have relevant information to answer, “Should I call in for work?”
You can see it crafts a human-like response but directs the employee to consult the policy since it could not find any helpful information. This initial response is akin to searching the internet for your question.
To demonstrate how the chatbot uses the policy information to answer this question, the specific HR policy is added to the RAG knowledge base. After the policy is added, the exact same question gets a very thorough response. When I asked, “How many sick days do I get?” the answer was direct, concise, and accurate.
To build our HR Policy Bot, we will leverage the flexible compute framework, Snowpark Container Services. This will allow us to run our entire product within the Snowflake environment.
We have three services running in Snowpark Container Services that can communicate with each other and respond to requests from our Snowflake environment. All these services communicate over REST APIs, which can be developed and deployed quickly using Python's FastAPI framework.
Our central service is a FastAPI server, which uses LangChain to orchestrate the communication between the Chroma service and the LLM service. The Chroma service hosts the Chroma VectorDB. Each document is assigned an embedding, which can be used to determine how relevant that document is to a given query. The LLM Server uses FastChat to provide a unified interface for interacting with many different models, even allowing us to switch to OpenAI APIs.
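The retrieval step hinges on embedding similarity. As a rough illustration of what the vector database does under the hood (with tiny toy vectors standing in for real embedding-model output, which typically has hundreds of dimensions), cosine similarity ranks documents against a query:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; a real vector DB stores model-generated vectors.
doc_embeddings = {
    "sick_leave_policy": [0.9, 0.1, 0.0],
    "travel_expense_policy": [0.1, 0.8, 0.3],
}
query_embedding = [0.8, 0.2, 0.1]  # embedding of "How many sick days do I get?"

# Retrieval: pick the document whose embedding is closest to the query's.
best_doc = max(
    doc_embeddings,
    key=lambda d: cosine_similarity(query_embedding, doc_embeddings[d]),
)
print(best_doc)  # -> sick_leave_policy
```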
The RAG logic starts by taking a query from an API request and retrieving relevant portions of documents by querying the Chroma vector database. A prompt for the LLM is crafted using the initial question and all the relevant text retrieved from Chroma.
The LLM is then asked to answer the question given the provided information, but only if the information is present. The response from the LLM server is then passed back to the chat interface via the LangChain server.
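Stripped of the service plumbing, the retrieve-augment-generate loop looks roughly like this sketch. The retriever and LLM calls are stubbed out here; in our application they are the Chroma query and the FastChat request, and the function names are illustrative:

```python
def retrieve(question: str) -> list[str]:
    # Stub for the Chroma query: return the most relevant policy snippets.
    return ["Full-time employees accrue 8 sick days per year."]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # "Augmented": retrieved text is injected into the prompt, and the LLM is
    # instructed to answer only from that text.
    context = "\n".join(context_chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def generate(prompt: str) -> str:
    # Stub for the LLM call (in our case, a request to the FastChat server).
    return "You accrue 8 sick days per year."

def answer(question: str) -> str:
    return generate(build_prompt(question, retrieve(question)))
```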
When the Python server initializes, the policy text will be read from a Snowflake Table or Stage and stored in the ChromaDB. The FastAPI server exposes two endpoints: one for asking questions and one for updating ChromaDB. Additional text can be sent to the ChromaDB to expand the chatbot’s knowledge.
Streamlit provides a rapid way to develop a chat interface. The Streamlit application could be a part of the central Python service, a Snowflake Streamlit object, or even hosted in its own container. The REST API endpoints can be easily wrapped by a UDF, making the whole application accessible via a Snowflake query.
The LLM Server container is based on the NVIDIA image nvcr.io/nvidia/nemo:23.06, built and stored in a Snowflake Image Repository. To launch this service, we need to create a specification file that defines the image, resource requirements, volumes, and any endpoints.
Our LLM requires a GPU, so that goes into the resource requirements section. In order for our other services to interact with this server, we define an API endpoint. Finally, we connect our model stage to provide our LLM files.
```yaml
spec:
  containers:
    - name: vicunaservice
      image: <image_repository>/nemo:23.06   # illustrative; use your repository path
      resources:
        requests:
          nvidia.com/gpu: 1   # the LLM requires a GPU node
      volumeMounts:
        - name: models
          mountPath: /models
  endpoints:
    - name: api
      port: 8001
  volumes:
    - name: models
      source: "@model_stage"   # stage holding the LLM files; name illustrative
```
This service runs a FastChat server that uses the same API framework as OpenAI, giving us the flexibility to swap our service with their proprietary models for testing or exploration.
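Because FastChat speaks OpenAI's API, switching between our self-hosted model and OpenAI's only changes the base URL and model name. A sketch of building such a request (the internal DNS name and model names are illustrative):

```python
def chat_request(question: str, use_openai: bool = False) -> dict:
    """Build an OpenAI-style chat-completion request. FastChat accepts the
    same request shape, so swapping backends changes only the base URL and
    model name. The internal service DNS name below is illustrative."""
    base = "https://api.openai.com/v1" if use_openai else "http://vicunaservice:8001/v1"
    return {
        "url": base + "/chat/completions",
        "json": {
            "model": "gpt-3.5-turbo" if use_openai else "vicuna",
            "messages": [{"role": "user", "content": question}],
        },
    }
```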
```shell
echo "RUNNING SERVICE START"

# Start the FastChat controller that coordinates model workers.
nohup python -m fastchat.serve.controller > controller.out 2> controller.err &
echo "CONTROLLER RUNNING"

# Start a vLLM worker serving Vicuna-7B on a single GPU.
nohup python -m fastchat.serve.vllm_worker --model-names "vicuna" \
    --model-path lmsys/vicuna-7b-v1.5 --num-gpus 1 > worker.out 2> worker.err &
echo "MODEL_WORKER RUNNING"

# Expose an OpenAI-compatible REST API on port 8001.
nohup python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8001 \
    > api_server.out 2> api_server.err &
```
Our Python service runs a lightweight FastAPI server that is configured to handle data requests from Snowflake UDFs. In our specification file, a udf endpoint exposes our two API calls, /qa and /update, each paired with a corresponding UDF.
With this, we can access the whole application with simple SQL statements. The UDF for the /qa endpoint is created with a service function like this (the service and endpoint names must match those in your service specification; the names here are illustrative):

```sql
CREATE FUNCTION query_udf (text varchar)
  RETURNS varchar
  SERVICE = rag_python_service
  ENDPOINT = api
  AS '/qa';

-- Ask a question directly from SQL:
SELECT query_udf('How many sick days do I get?');
```
When the server initializes, it creates a LangChain chain designed for RAG, which defines the embeddings to use and which LLM model to call.
```python
from langchain.chains import VectorDBQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# `client` is the chromadb client connected to our Chroma service (created elsewhere).
embedding = HuggingFaceEmbeddings()  # uses a sentence-transformers model by default
vectordb = Chroma(embedding_function=embedding, client=client)
qa = VectorDBQA.from_chain_type(
    llm=ChatOpenAI(model_name="vicuna"), chain_type="stuff", vectorstore=vectordb
)
```
Default prompt template:

```text
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
```
Here is an example of a specification file for the Python service.
```yaml
spec:
  containers:
    - name: rag_python_service
      image: <image_repository>/rag_python:latest   # illustrative repository path
      env:
        OPENAI_API_BASE: http://vicunaservice:8001/v1   # DNS of the LLM service; illustrative
      secrets:
        - snowflakeSecret: OPENAI_API_KEY
          envVarName: OPENAI_API_KEY
          secretKeyRef: secret_string
      volumeMounts:
        - name: code
          mountPath: /code
        - name: policy_docs
          mountPath: /policy_docs
  endpoints:
    - name: api
      port: 8000   # illustrative
  volumes:
    - name: policy_docs
      source: "@policy_stage"   # stage holding the HR policy documents; illustrative
    - name: code
      source: "@code_stage"     # stage holding the server code; illustrative
```
In our spec file, the OPENAI_API_BASE environment variable is set to the DNS name of the LLM service so that LangChain calls our self-hosted LLM instead of OpenAI's API. The spec also shows how to pull in a secret from Snowflake if you want the service to talk to OpenAI instead.
We also add a volume for our server code, which is stored in a stage. Keeping the code in a stage allows for more flexibility and faster development: images can be reused, and there is no need to rebuild an image with every iteration. Another volume holds our policy documents; when the service initializes, these documents are loaded into ChromaDB.
This HR Policy Bot application is a great example of how cutting-edge technology can be harnessed to solve real business problems.
Retrieval Augmented Generation is a very powerful technique for building LLM applications that are tailored to your business. When paired with Snowpark Container Services, your business unlocks a unique opportunity to build RAG-enabled AI applications entirely within your Snowflake environment.
We hope this use case has helped you see how LLMs and Snowpark Container Services could transform your business.
If you ever have questions big or small, don’t hesitate to reach out to us at phData.
Free Generative AI Workshop
Looking for guidance on how to best leverage Snowflake's newest AI & ML capabilities? Attend one of our free generative AI workshops for advice, best practices, and answers.