Imagine you are building out a routine sales report in Snowflake AI Data Cloud when you come across a requirement for a field called “Is Platinum Customer.” You discover that the list of who is considered a Platinum customer is maintained in an Excel spreadsheet on the VP of Sales laptop. This scenario is all too common to analytics engineers. The data we get from the source systems is often incomplete and needs to be augmented with external data. Generally, this kind of data is called Reference Data.
What is Reference Data?
Although reference data is a broad category that can include many types of data, we will be focusing on the following specific type of reference data in this blog article:
Supplemental data for enriching a data warehouse
Data not captured in existing systems
Data that needs to be maintained by a human
Nearly all data warehouse projects will eventually need this kind of reference data. Typically, this data is scattered across Excel files on business users’ desktops. They usually operate outside any data governance structure; often, no documentation exists outside the user’s mind.
Let’s look at an example:
A sales team bases commissions on a complex set of rules and parameters. The current process involves analysts opening Excel files and copying and pasting values. It is extremely labor intensive, and the team wants to automate it using Snowflake and Tableau.
Sales team demographic information is pulled from the CRM.
Financial data is pulled from the ERP.
Commission payout % tiers are kept in an Excel spreadsheet.
Some sales executives are guaranteed a minimum commission. An accounting associate adjusts these numbers manually. The list of sales executives with guaranteed commissions is kept in a SharePoint list.
Commissions may be adjusted up or down for numerous reasons. Analysts currently track these rules and apply them on a case-by-case basis.
Loading the CRM and ERP data can be handled by services like Fivetran or Matillion. However, the remaining data is kept in sources that are not as easily accessible. To fully automate this solution, we’ll need to find a reliable way to get this data into Snowflake.
Approaches To Loading Reference Data into Snowflake
Extend Existing Systems
Most ERP and CRM systems allow users to add custom fields and tables. For example, Salesforce has a system for adding custom fields. The advantage of this approach is that the data will fall under the existing system’s security and governance.
The downside is that some admins may not allow the creation of new fields or may be reluctant to grant write access to the system. These custom attributes usually do not allow for advanced logic or data validation.
dbt Seeds
The seed functionality in dbt is pretty simple. Upload a CSV file to your dbt project and then upload it with the dbt seed
command.
The advantage of this method is that your reference data can be checked into source control. The downside is that dbt/source control can be challenging for non-technical users. Also, it doesn’t help if your company isn’t using dbt.
Cloud Storage Upload
Snowflake can easily upload files from cloud storage (AWS S3, Azure Storage, GCP Cloud Storage). Uploading a CSV file and running a COPY INTO command is simple. You can also set up a Snowpipe object to automatically ingest files.
Uploading to the cloud requires access to the cloud console or a command-line script, both of which can be challenging to business users. Multi-person collaboration is difficult because users have to download and then upload the file every time changes are made.
Upload via the Snowflake UI
Snowflake allows users to load data directly from the web UI. This is a good option for small data sets that are updated infrequently. Some limitations with this method are that each file is capped at 50 MB, files must be in a supported file type (Excel files are not supported), and the Snowflake role must have ownership privileges on the table.
Host in SharePoint or Google Docs
A simple and common option is to leave the data in a spreadsheet but host it in a document management service. This allows for easy sharing and collaboration on the data. Plus, it is a familiar interface for business users.
The downside is that spreadsheets have few controls on entering or modifying data. Governance can be difficult as these files are often spread across multiple sites/folders. Snowflake can not natively read files on these services, so an ETL service is needed to upload the data. Once an ETL process is set up, it is easy for users to break the pipeline by adding fields or modifying the source file in unexpected ways. ETL applications are often expensive and require some level of expertise to run.
Low-Code App Builders
Microsoft Power Apps and Retool are examples of low-code app builders. Their goal is to allow users to quickly create simple applications with minimal coding. Low code is a great route to choose if your data needs custom validation or advanced logic.
However, not all low-code platforms support Snowflake. Most low-code platforms charge per user monthly, which can add up quickly. Additionally, Snowflake does not like small inserts/updates and prefers data to be entered in larger batches. Hosting the data in Snowflake’s hybrid tables may be a better choice for these applications.
Custom Application
For sufficiently advanced use cases, requesting a custom app from your development team may make sense. If that is not an option, then more data engineering-friendly routes exist. Streamlit is an open-source Python library that can be used to build custom web applications inside Snowflake.
While Streamlit is primarily built to display data, it can be used to write data using input widgets. Streamlit apps require an active virtual warehouse, which will consume Snowflake credits.
Master Data Management Solution
Master data management solutions are designed to create master records for entities such as customers, places, emails, etc. They also make a great place to store reference data. However, MDM solutions tend to be expensive and complicated to set up. Using them just to maintain a few reference tables is most likely overkill. Some examples of MDM solutions are Reltio and Informatica MDM.
Reference Data Implementation
Returning to our example of the sales commission process. In our example, we have:
Multiple users (often concurrent) will be updating the reference data.
Some of the data needs custom validation before being uploaded.
The users are not technically advanced.
In this case, the line is blurred between reference data and the line of business apps. Given the more advanced requirements, an MDM app or custom / low-code application would probably be the best bet. Deciding between an MDM and a custom app usually comes down to strategy. Investing in an MDM application is a good choice for a broader deployment strategy. Custom apps make sense when the scope is narrow.
Conclusion
There is no one-size-fits-all solution to loading reference data into Snowflake. When evaluating solutions, you must consider data volume, business logic complexity, and availability of technical resources.
Need advice on which approach to take?