With Cortex Analyst from the Snowflake AI Data Cloud, business users can transform plain-English questions into SQL queries, enabling self-service analytics and making data insights more accessible. This is especially powerful for users who have pressing questions about their data but lack the SQL-writing experience to answer them.
Traditional data analysis often requires writing SQL to derive insights. For non-technical users, that requirement creates a bottleneck: they must rely on IT or data teams for even simple queries, which slows decision-making and can leave valuable findings undiscovered.
Snowflake Cortex Analyst allows users to find answers in their data by simply asking questions in natural language. In this blog, we’ll explore Cortex Analyst, explain its different components, and discuss some examples of how it can be fine-tuned to be useful for business users.
What is Snowflake Cortex Analyst?
Cortex Analyst from Snowflake is a fully managed, LLM-powered feature for quickly building applications that let users ask questions about their data in natural language. It produces accurate text-to-SQL responses and insights that close the gap between non-technical users and their data. And because it is exposed as a REST API, it can be incorporated into business tools such as Streamlit applications or other chat interfaces.
What Are The Components of Cortex Analyst?
However you choose to set up your Cortex Analyst application, it comprises three main components:
The Data Set. Each Cortex Analyst application is pointed at either a single table or a view composed of one or more tables.
The Semantic Model Template. The semantic model template describes the table or view that Cortex Analyst reads and analyzes. A semantic model is a descriptive representation of your data that provides essential context for Cortex Analyst, helping the language model generate valid SQL queries from natural language inputs. Think of it as a dictionary that defines the meaning of the terminology in your data.
The Chat Interface. In our example, a Streamlit app is created and will act as a chat tool for end-users to ask questions about the data set. Cortex Analyst will take the question, answer it with help from information provided by the semantic model, generate a SQL query, and then return that information back to the user via the Streamlit app or the interface of your choosing.
With that in mind, let’s dive into how we can set up a Cortex Analyst application.
How to Build a Cortex Analyst Application
The Data Set
Our test data set is a small amount of data from an Airbnb property. It ranges from June 2023 to June 2024. Here are the key columns:
Confirmationcode: Unique identifier for each stay.
Status: The status of the stay.
NumAdults/NumChildren/NumInfants: Number of adults/children/infants. Summing these gives the total number of guests.
StartDate/EndDate: Start and end date of the stay.
Booked: Date the stay was booked.
Earnings: Total amount of money earned by the stay.
The Semantic Model
A semantic model bridges the gap between human communication and machine interpretation. Semantic models are structured representations of knowledge that capture the meaning and relationships between concepts in a specific domain. They are particularly important for generative AI (GenAI) tasks such as natural language processing, question answering, and content generation.
For Cortex Analyst, the semantic model serves to provide information about the columns and tables you’re pointing at. This is especially useful when the column names aren’t already in natural language. The semantic model file will describe metadata about the columns including dimensions, measures, time, and any filters that end users may reference. For instance, let’s pretend we have a table with the following columns:
| SLSDT | SLSCD | CAT | ADDR | CITY | ST | ZIP | AMT |
|---|---|---|---|---|---|---|---|
| 2024-05-06 | 90AS | BMX | 123 TEST ST | BEVERLY HILLS | CA | 90210 | $200 |
| 2024-05-07 | 65HY | MOUNTAIN BIKE | 456 TEST ST | MINNEAPOLIS | MN | 55111 | $250 |
Now, as human readers, we can probably guess what these columns refer to, especially given the sample data underneath. However, tools relying on natural language processing might have difficulty determining what some columns mean without a little help. This is where our semantic model comes in.
Example Model
Below is what a semantic model for this data might look like:
# Name and description of the semantic model.
name: Sales Data
description: This semantic model can be used for asking questions over the sales data.
tables:
  # A logical table on top of the 'sd_data' base table.
  - name: sales_data
    description: A logical table capturing daily sales information across different store locations and product categories.
    # The fully qualified name of the base table.
    base_table:
      database: sales
      schema: public
      table: sd_data
    # Dimension columns in the logical table.
    dimensions:
      - name: product_category
        synonyms:
          - "item_category"
          - "product_type"
        description: The category of the product sold.
        expr: cat
        data_type: TEXT
        unique: false
        sample_values:
          - "BMX"
          - "MOUNTAIN BIKE"
      - name: address
        description: The address where the item is being shipped.
        expr: addr
        data_type: TEXT
        unique: false
        sample_values:
          - "123 TEST ST"
      - name: city
        description: The city where the item is being shipped.
        expr: city
        data_type: TEXT
        unique: false
        sample_values:
          - "MINNEAPOLIS"
          - "LOS ANGELES"
    # Time dimension column in the logical table.
    time_dimensions:
      - name: sales_date
        synonyms:
          - "transaction_date"
        description: The date when the sale occurred.
        expr: slsdt
        data_type: DATE
        unique: false
    # Measure column in the logical table.
    measures:
      - name: sales_amount
        synonyms:
          - "revenue"
          - "total_sales"
        description: The total amount of money generated from the sale.
        expr: amt
        data_type: NUMBER
        default_aggregation: sum
    # A table can define commonly used filters over it. These filters can then be referenced in user questions directly.
    filters:
      - name: west
        synonyms:
          - "W"
          - "west_coast"
        description: "A filter to restrict only to west coast states"
        expr: st IN ('CA', 'OR', 'WA', 'NV', 'AZ')
We’ll get into how we can iterate on our specific semantic model to help our application out later.
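Before we do, here is a trimmed-down sketch of what a semantic model for the Airbnb data set described earlier might look like. The logical names, base-table location, and descriptions below are illustrative rather than a verbatim copy of the airbnb_description.yml file used later:
name: Airbnb Stays
description: Semantic model for asking questions about stays at our Airbnb property.
tables:
  - name: stays
    description: One row per stay at the property, June 2023 through June 2024.
    # Illustrative location; point these at your own database, schema, and table.
    base_table:
      database: sandbox
      schema: avigneau
      table: airbnb_stays
    dimensions:
      - name: confirmation_code
        description: Unique identifier for each stay.
        expr: confirmationcode
        data_type: TEXT
        unique: true
      - name: status
        description: The status of the stay.
        expr: status
        data_type: TEXT
    time_dimensions:
      - name: start_date
        description: The date the stay begins.
        expr: startdate
        data_type: DATE
      - name: end_date
        description: The date the stay ends.
        expr: enddate
        data_type: DATE
      - name: booked
        description: The date the stay was booked.
        expr: booked
        data_type: DATE
    measures:
      - name: num_adults
        description: Number of adults on the stay.
        expr: numadults
        data_type: NUMBER
        default_aggregation: sum
      - name: num_children
        description: Number of children on the stay.
        expr: numchildren
        data_type: NUMBER
        default_aggregation: sum
      - name: num_infants
        description: Number of infants on the stay.
        expr: numinfants
        data_type: NUMBER
        default_aggregation: sum
      - name: earnings
        description: Total amount of money earned by the stay.
        expr: earnings
        data_type: NUMBER
        default_aggregation: sum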
The Chat Interface
Ultimately, we need a chat interface for our users to ask questions and receive insights about the data they are analyzing. For our example, we’re going to set up a Streamlit application using code that is provided by Snowflake.
Setup
To set up our application, we navigate to the Snowsight menu, find Streamlit under the Projects tab, and then click the + Streamlit App button.
Next, you put in the database, schema, and warehouse details to tell your application where to find the data.
Then, upload the semantic model .yml file you created into a stage. This YAML file gets picked up by the application when we point to it in the code below. Finally, we paste in the following Python code provided by Snowflake to get our application started; the only things we had to update were the DATABASE, SCHEMA, STAGE, and FILE values, along with the connection details.
Streamlit Code
from typing import Any, Dict, List, Optional

import pandas as pd
import requests
import snowflake.connector
import streamlit as st
from snowflake.snowpark.context import get_active_session

DATABASE = "SANDBOX"
SCHEMA = "AVIGNEAU"
STAGE = "SALES"
FILE = "airbnb_description.yml"
HOST = ""  # Snowflake account host, e.g. "<account_identifier>.snowflakecomputing.com"

if 'CONN' not in st.session_state or st.session_state.CONN is None:
    st.session_state.CONN = snowflake.connector.connect(
        user="",
        password="",
        account="",
        host=HOST,
        port=443,
        warehouse="",
        role="",
    )


def send_message(prompt: str) -> Dict[str, Any]:
    """Calls the REST API and returns the response."""
    request_body = {
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        "semantic_model_file": f"@{DATABASE}.{SCHEMA}.{STAGE}/{FILE}",
    }
    resp = requests.post(
        url=f"https://{HOST}/api/v2/cortex/analyst/message",
        json=request_body,
        headers={
            "Authorization": f'Snowflake Token="{st.session_state.CONN.rest.token}"',
            "Content-Type": "application/json",
        },
    )
    request_id = resp.headers.get("X-Snowflake-Request-Id")
    if resp.status_code < 400:
        return {**resp.json(), "request_id": request_id}  # type: ignore[arg-type]
    else:
        raise Exception(
            f"Failed request (id: {request_id}) with status {resp.status_code}: {resp.text}"
        )


def process_message(prompt: str) -> None:
    """Processes a message and adds the response to the chat."""
    st.session_state.messages.append(
        {"role": "user", "content": [{"type": "text", "text": prompt}]}
    )
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        with st.spinner("Generating response..."):
            response = send_message(prompt=prompt)
            request_id = response["request_id"]
            content = response["message"]["content"]
            display_content(content=content, request_id=request_id)  # type: ignore[arg-type]
    st.session_state.messages.append(
        {"role": "assistant", "content": content, "request_id": request_id}
    )


def display_content(
    content: List[Dict[str, str]],
    request_id: Optional[str] = None,
    message_index: Optional[int] = None,
) -> None:
    """Displays a content item for a message."""
    message_index = message_index or len(st.session_state.messages)
    if request_id:
        with st.expander("Request ID", expanded=False):
            st.markdown(request_id)
    for item in content:
        if item["type"] == "text":
            st.markdown(item["text"])
        elif item["type"] == "suggestions":
            with st.expander("Suggestions", expanded=True):
                for suggestion_index, suggestion in enumerate(item["suggestions"]):
                    if st.button(suggestion, key=f"{message_index}_{suggestion_index}"):
                        st.session_state.active_suggestion = suggestion
        elif item["type"] == "sql":
            with st.expander("SQL Query", expanded=False):
                st.code(item["statement"], language="sql")
            with st.expander("Results", expanded=True):
                with st.spinner("Running SQL..."):
                    df = pd.read_sql(item["statement"], st.session_state.CONN)
                    if len(df.index) > 1:
                        data_tab, line_tab, bar_tab = st.tabs(
                            ["Data", "Line Chart", "Bar Chart"]
                        )
                        data_tab.dataframe(df)
                        if len(df.columns) > 1:
                            df = df.set_index(df.columns[0])
                        with line_tab:
                            st.line_chart(df)
                        with bar_tab:
                            st.bar_chart(df)
                    else:
                        st.dataframe(df)


st.title("Cortex Analyst")
st.markdown(f"Semantic Model: `{FILE}`")

if "messages" not in st.session_state:
    st.session_state.messages = []
    st.session_state.suggestions = []
    st.session_state.active_suggestion = None

for message_index, message in enumerate(st.session_state.messages):
    with st.chat_message(message["role"]):
        display_content(
            content=message["content"],
            request_id=message.get("request_id"),
            message_index=message_index,
        )

if user_input := st.chat_input("What is your question?"):
    process_message(prompt=user_input)

if st.session_state.active_suggestion:
    process_message(prompt=st.session_state.active_suggestion)
    st.session_state.active_suggestion = None
After that, you’re all set! You now have your chat interface and are ready to ask questions of your data. While this is a simple example meant to demonstrate how quickly we can spin up an application, Cortex Analyst can also be integrated with other business tools to create a seamless experience for users as they ask their questions.
Evaluation
To gain confidence in our application, it’s helpful to set up a litmus test in the form of a series of test questions. The questions range from relatively simple to slightly more complex, and they test the tool’s ability to interpret natural language and generate the SQL for us.
Example 1: How Many Bookings Were Made in 2023?
This is the simplest question in the set and it’s just looking for a basic count from this dataset. Cortex Analyst will provide its natural language interpretation of my question, the SQL query it generates, and the answer from running it. Pretty cool, right?
But when we count the number of bookings made, we generally want to exclude cancellations from our calculations. We wouldn’t be counting cancellations towards our revenue, so why would we count them when looking at raw numbers? So, let’s augment the question to see how it does.
Example 1a: How Many Bookings Were Made in 2023, Excluding Cancellations?
Notice something here? I asked the question two different times and got two separate answers. One filtered on the reservation’s start date, while the other used the booking date, which leads to different results. This is important to understand when working with Cortex Analyst, or any GenAI tool: be prepared to help your tool help you (see the “Help the Helper” section later on).
Going back to our Cortex Analyst example, we can ensure that our tool knows what we’re asking for by being a little more specific in our questioning.
Example 1b: How Many Stays Were Booked in 2023, Excluding Cancellations?
There’s a key distinction between this version of the question and the previous one: we specifically asked about stays booked in 2023, rather than “How many bookings were made in 2023?” By asking when the stay was booked, our application looked for the “booked” date in the semantic model we provided and used that as its date column. Specificity matters. We do see one minor detail, though.
The query attempts to filter out stays where the status is “Canceled by guest.” That’s a pretty good guess, considering we didn’t give it any context for determining canceled stays. However, some records have the status “Canceled by Airbnb,” which this filter would miss.
With this in mind, we can do something else to help Cortex Analyst: define a filtering condition in the semantic model .yml file. Cortex Analyst lets us define named filters that the application can reference when answering user questions. In this case, we can define “cancelled” as any record whose status contains the word “canceled” (status ilike '%canceled%' in Snowflake syntax).
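In the semantic model file, that might look something like the fragment below, added under the logical table’s filters section (the filter name and synonyms are illustrative):
    filters:
      - name: cancelled
        synonyms:
          - "canceled"
          - "cancellations"
        description: "Stays that were canceled, whether by the guest or by Airbnb."
        expr: status ILIKE '%canceled%'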
Example 1b: How Many Stays Were Booked in 2023, Excluding Cancellations? (Take 2)
With that in mind, we can try asking the question again. This time, Cortex Analyst knows what we’re talking about and returns the correct answer. Even though the result is the same, the SQL query that comes back is the one we expect, which is what we’re ultimately testing for.
Example 2: What is the Average Number of Guests in Bookings 3 Days or Less?
I chose this question for a couple of reasons. First, I wanted to force Cortex Analyst to perform multiple operations in one question. Here, it has to work out the length of each stay and the number of guests on it (which it calculates correctly), and then average the guest count across the stays that meet the three-day filter.
I also wanted to see how it would handle the number of guests aspect of the question without any context. Cortex Analyst found the answer without me having to prompt it.
“Help the Helper”
When working with GenAI tools like Cortex Analyst for data querying, it’s crucial to approach the process with a healthy dose of skepticism and a focus on validating the results through careful iteration. Don’t just assume that the tool you build or choose will immediately grasp the nuances of your specific data set or completely understand your query’s intent on the first attempt.
When working with GenAI, prepare for a process of refinement and experimentation. You’ll likely need to fine-tune your prompts, adjust your semantic model, and gradually shape your tool’s understanding of your unique dataset. In other words, be prepared to “help the helper.” This iterative approach allows you to build more robust and accurate GenAI applications over time, tailoring the responses to your specific needs and data context.
The questions you ask your application should ultimately simulate what your end users would be expected to ask. They should simulate the language and terminology used and test your chosen application’s ability to deal with complex operations. You can then use the feedback you receive to enhance your semantic model.
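For example, if testers keep saying “revenue” or “payout” when they mean the earnings column, a small, illustrative addition of synonyms to the semantic model can close that gap:
    measures:
      - name: earnings
        synonyms:
          - "revenue"
          - "payout"
          - "income"
        description: Total amount of money earned by the stay.
        expr: earnings
        data_type: NUMBER
        default_aggregation: sum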
Ultimately, the goal of your question set is to give you confidence that other users will be able to ask questions about your data set and feel confident in the feedback they receive.
But there is one other aspect of this tool, and GenAI tools in general, that can always be improved: the prompts. Prompt engineering and testing are important techniques for enhancing the business user’s experience with tools like Cortex Analyst.
Related Resources
If you do run into limitations or issues with Snowflake Cortex Analyst, check out our TTYD (Talk To Your Data) solution.
If you want more information on our strategy around GenAI monitoring and validation, check out our blog post.
If you’re looking for an approach to prompt engineering, utilizing a framework from Stanford, click here.
Conclusion
Snowflake Cortex Analyst is a powerful tool for users to gain deeper insights into their data. Before it can truly be released to end users, however, some preparation is needed to ensure it can be used with confidence. We can refine and improve the tool with a well-written semantic model that translates what our data refers to into natural language, and we can evaluate the application by posing relevant business questions that assess how effectively it handles terminology, aggregations, and other key data elements critical to the end user.
With a little assistance and setup, GenAI tools like Cortex Analyst can help users unlock insights about their data by simply asking questions.