May 31, 2023

How to Talk to Your Data with ChatGPT, Snowflake, & dbt

By Nick Pileggi

AI models, like ChatGPT, have provided a lot of efficiencies for individual contributors, but generally not regarding company-specific information. There is a huge opportunity in leveraging AI to provide business users with quick insights into their company data. 

One of our manufacturing customers recently asked if they could use AI to allow non-technical business users to ask,

“What are the top performing productions sites in North America in the past 6 months,” and then translate that into a query and then illustrate the results. This post will explore how to create a Streamlit application that allows business users to input a question about their data and query their Snowflake Data Cloud tables to provide an answer. 

While vendors such as ThoughtSpot have well-established products in this space, this example aims to demonstrate the art of the possible in a world where large language models (LLMs) are rapidly accelerating AI adoption.

Business Problem

Organizations are still struggling to enable business users to easily access data. Even with complex data processes, some users struggle to leverage technical tools to answer their basic questions.

Using an AI-based approach allows users to simply ask a question using natural language and get a result nearly instantly. 

Having an application like this can be an effective solution to enable business users to gain quick insights into their data.

Solution Overview

To prototype this process, we’ve created a Streamlit application that allows users to ask a question. From there, that question is fed into ChatGPT along with dbt data models that provide information about the fields in the various tables. From there, ChatGPT generates a SQL query which is then executed in the Snowflake Data Cloud, and the results are brought back into the application in a table format. 

The user has the option to select a graphical interface to view the data outputs. In this case, after the SQL query is executed on Snowflake, it is converted into a Python dataframe, and basic graphic code is executed to generate the image.

This entire process only takes a few seconds to provide the user with the result they want to see.

A simple diagram showing how the solution works.

Data

For this demo, we used the Tasty Bytes data provided by Snowflake. This dataset contains information for a food truck business and has various fact and dimension tables, such as menu information, sales data, and location information.

This dataset was chosen because it easily replicates the data models that can be found at a variety of organizations.

Let’s Run Through an Example

The best way to show how the entire process works is by running through an example. Let’s input the question, ”What is the monthly trend of sales in 2020?” into the Streamlit application.

A screenshot of a popup screen that has the phrase, "Enter Question" at the top with a text bar below.

Once you press “Ask Question,” there is a call to the ChatGPT API that inputs that question along with information on the dbt data model.

The dbt data model then provides information to ChatGPT about the specific fields in the data and what they mean. Providing this information as part of prompt engineering helps specify to ChatGPT which fields are relevant to the question being asked. 

Currently, the number of tables and columns in the database is limited by the number of tokens the model can use. As ChatGPT continues growing with more tokens available for input, the larger the database model it can handle. 

SQL Returned:
				
					SELECT DATE_TRUNC('month', order_ts) AS month, SUM(order_total) AS total_sales
FROM order_header
WHERE EXTRACT(YEAR FROM order_ts) = 2020
GROUP BY 1
ORDER BY 1 ASC;
				
			

After the query is generated, it is sent over to Snowflake using the Python JDBC Connector. Next, Snowflake executes that query and provides the results as a pandas dataframe.

If the query fails for a compilation issue, the prompt can be resent with the SQL query it generated, and the failure message. This allows ChatGPT to correct the query and Streamlit to rerun the corrected query.

This dataframe can be easily turned into a table in the Streamlit app. As we can see, the results below show the global sales from the order_header table broken out by month for the year 2020.

An example Streamlit application

Sometimes it is useful to have the data summarized in paragraph form. Normally a data analyst will take the data from the SQL query results and generate a paragraph talking about trends and what they see in the data.

In this case, ChatGPT can be used with prompt engineering and the data and the question to output a summary of the data.

For the example question above, the summary of the 12 months of data is below.

A screenshot of the output that describes data points to support the initial question that was posed using ChatGPT

If the user wants to see the results in a graphical format, all they have to do is check the box. When this box is checked, the result of the Snowflake query is passed to ChatGPT with a prompt to generate the graph code for the Streamlit app.

Once the code is returned, it can be executed to generate the graph. From there, the final graph is then displayed in the Streamlit app.

Simple, right? All of this happens within seconds!

A simple graph of data created by ChatGPT that's titled, "Total Sales by Month" and shows a steady growth over months.

Conclusion

Due to the advancement of technology like the tools we used in this demo, it’s easier than ever to build text-based applications with your data to get answers to business users faster.

AI has the power to provide endless possibilities to organizations, and this demo showcases a singular use case to help you get started.

At phData, we believe that every organization has the ability to integrate AI into its processes, and we’re excited to participate in the AI revolution. If this is something you’re interested in, phData is offering a free workshop to help explore the potential of generative AI within the modern enterprise.

In a 90-minute session, your leaders can sit down with one of our Principal Data & AI Architects to start turning ideas into a plan of action. 

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit