August 13, 2022

Snowflake & Dataiku: Getting Started, Use Cases, and More

By Nick Goble

Two of the most common areas of innovation within technology are simplification and the building of tools. Think of it like making a cake: mixing ingredients by hand is possible, but using a stand mixer is certainly easier for most people.  

In some situations it may still make sense to mix ingredients by hand, but for the general use case, you can leverage something that saves you time and energy.

Some of the more complex topics within the data ecosystem are data science, machine learning, and data engineering. These disciplines fundamentally rely on your organization understanding not only how data flows through your systems and what that data is, but also what that data could ultimately be used for. This data drives business intelligence and, ultimately, business decisions by key members of the company.

Today, we’re going to talk about one of these tools: Dataiku.

What is Dataiku?

Dataiku is a low/no-code tool aimed at being one central solution for the design, deployment, and management of AI applications. This type of tool does two things: it makes developers more efficient, and it lowers the barrier to entry for engineers and data scientists. Now, that’s a lot, so let’s break it down a little bit.

Design in Dataiku

Dataiku gives you the ability to perform tasks like data preparation, visualization, and processing within a centralized platform delivered through a user interface. This means that users are able to visually click through a web application and not only define what needs to happen to their data but preview what those changes look like before they’re fully executed against your system.  

You can perform data preparation, wrangling, and cleansing tasks at a column level and Dataiku will automatically execute those tasks against data as it goes through your “flow”.  

Dataiku also gives you the ability to save these tasks as a reproducible recipe, much like our cake example! You no longer need to reinvent the wheel to perform the same set of tasks across multiple datasets.

These tasks may also look like:

  • Binning
  • Concatenation
  • Currency Conversions
  • Date Conversions
  • Filtering
  • Splitting

While Dataiku has 90 pre-built transformers that you can use, you also have the ability to create custom transformers, as sketched below!
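
To make this concrete, here’s a minimal sketch of what a custom prep step might look like as a code-based Python recipe using Dataiku’s dataset API. The dataset and column names (raw_orders, order_date, and so on) are hypothetical:

```python
import dataiku  # available inside a Dataiku Python recipe
import pandas as pd

# Hypothetical input and output datasets -- replace with datasets from your Flow.
input_dataset = dataiku.Dataset("raw_orders")
output_dataset = dataiku.Dataset("prepared_orders")

df = input_dataset.get_dataframe()

# Date conversion: parse a string column into a proper datetime.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Binning: bucket order totals into labeled ranges.
df["order_size"] = pd.cut(
    df["order_total"],
    bins=[0, 50, 200, 1000, float("inf")],
    labels=["small", "medium", "large", "xl"],
)

# Filtering: drop rows whose dates failed to parse.
df = df[df["order_date"].notna()]

output_dataset.write_with_schema(df)
```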

We mentioned that you can preview the data as well, but what does that actually look like? Dataiku has a number of built-in charts, graphs, and statistical analysis tools that allow you to understand your data better.

Examples of charts:

  • Bar charts
  • Line charts
  • Curves
  • Stacked area charts
  • Pie charts
  • Donut charts
  • Box plots
  • Scatter plots

Examples of statistical tests:

  • Fit distributions
  • Fit curve
  • Correlation matrix
  • PCA
  • Shapiro-Wilk normality test
  • Two-sample Mood test
  • Two-sample Kolmogorov-Smirnov test
  • One-way ANOVA test
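
Dataiku runs these tests for you from the UI, but for intuition, here’s roughly what a few of them compute, sketched with scipy on synthetic data:

```python
import numpy as np
from scipy import stats

# Two synthetic samples with slightly different distributions.
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=0.0, scale=1.0, size=500)
sample_b = rng.normal(loc=0.3, scale=1.2, size=500)

# Shapiro-Wilk: is sample_a plausibly drawn from a normal distribution?
w_stat, w_p = stats.shapiro(sample_a)

# Two-sample Kolmogorov-Smirnov: do the two samples share a distribution?
ks_stat, ks_p = stats.ks_2samp(sample_a, sample_b)

# One-way ANOVA: do the group means differ significantly?
f_stat, f_p = stats.f_oneway(sample_a, sample_b)

print(f"Shapiro-Wilk p={w_p:.4f}, KS p={ks_p:.4f}, ANOVA p={f_p:.4f}")
```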

Did you know?
With Dataiku, you also have the ability to create dashboards based on these charts and tests!

Finally, Dataiku gives you the ability to perform both manual and automated machine learning. It implements a concept known as “AutoML” (automated machine learning), where tasks are performed automatically for users to reduce complexity.

This includes things like automatically cleansing data for feature engineering and automating the model training process with built-in guardrails to allow business analysts to build and compare multiple production-ready models.
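
The same AutoML workflow can also be driven programmatically through Dataiku’s public API client. Here’s a minimal sketch; the host, API key, project key, dataset, and target names are all placeholders:

```python
import dataikuapi

# Placeholder host, API key, and project key -- substitute your own.
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("CHURN_DEMO")

# Create a visual AutoML task; Dataiku guesses preprocessing and candidate models.
mltask = project.create_prediction_ml_task(
    input_dataset="prepared_orders",
    target_variable="churned",
)

# Train the candidate models, then compare their performance metrics.
mltask.start_train()
mltask.wait_train_complete()

for model_id in mltask.get_trained_models_ids():
    details = mltask.get_trained_model_details(model_id)
    print(model_id, details.get_performance_metrics())
```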

Dataiku also fully supports deep learning with Keras and TensorFlow. These models are treated just like any other model created and managed by Dataiku. This again reduces the complexity required for utilizing these types of models.

For engineers who are used to notebook-style development, Dataiku also supports a variety of Jupyter-based notebooks for code-driven experimentation and model development in Python, R, and Scala.

Deployment and Management in Dataiku

Dataiku gives you the ability to automate deployments within an MLOps and DataOps context. This gives engineers and analysts the ability to go through the design phase (outlined above) and when they’re ready to deploy, they can do so at the click of a button.

So what does this actually mean and look like?  

Traditionally, when you wanted to make updates to a data pipeline or machine learning model, this required engineers, and sometimes dedicated infrastructure engineers, to get involved in deploying those changes.

As more modern tooling such as cloud computing and infrastructure-as-code came around, this started to change: deployments could be semi- or fully automated when changes were merged into a Git repository.

Within Dataiku, this goes a step further. Dataiku further reduces the complexity of this release process by only requiring you to indicate that you want to promote your changes to an environment.

Dataiku also gives you the ability to define data quality expectations and tests that automatically assess the before and after values of data flowing through your system. When a check fails, an error is raised, prompting investigation and reducing time to resolution.
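
Conceptually, a check is a small rule evaluated as data moves through the flow. The sketch below uses plain Python (not Dataiku’s checks API) to show the idea:

```python
import pandas as pd

def check_row_count_drift(before: pd.DataFrame, after: pd.DataFrame,
                          max_drop_ratio: float = 0.05) -> None:
    """Fail the pipeline if a step silently dropped too many rows."""
    before_n, after_n = len(before), len(after)
    dropped = (before_n - after_n) / max(before_n, 1)
    if dropped > max_drop_ratio:
        raise ValueError(
            f"Row count fell {dropped:.1%} (from {before_n} to {after_n}); "
            f"threshold is {max_drop_ratio:.0%} -- investigate upstream."
        )
```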

For machine learning oriented workloads, Dataiku gives you the ability to perform batch scoring with automation nodes. This gives you the ability to automatically retrain models and update data. Dataiku also monitors your ML models for data and prediction drift to ensure reliable results.
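
This kind of automation is typically packaged as a Dataiku scenario, which can also be triggered from the API client. A minimal sketch, where the connection details and scenario id are placeholders:

```python
import dataikuapi

# Placeholder connection details -- substitute your own.
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("CHURN_DEMO")

# A scenario might rebuild datasets, retrain the model, and batch-score new rows.
# run_and_wait() blocks until the run finishes and raises if the run fails.
scenario = project.get_scenario("retrain_and_score")
scenario.run_and_wait()
```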

Dataiku integrates with other DevOps tools such as Jenkins, GitLab CI, Travis CI, or Azure Pipelines if you have external dependencies for your release process.

How Do I Use Dataiku with Snowflake?

Dataiku has the ability to connect to a number of different sources and targets. This gives you the ability to load data from one system, perform any cleansing and transformations that are necessary, load it into a target system, and then build/train model(s) to create an enriched dataset.

One of the common data sources and targets for this work is the Snowflake Data Cloud. To start using Snowflake with Dataiku, you’ll need to do the following:

  • Create a role within Snowflake that has the appropriate access
  • Create a user within Snowflake for Dataiku to authenticate with
  • Assign the role to the user
  • Optional but recommended
    • Create a warehouse specifically for Dataiku
    • Create a database and schema specifically for Dataiku

If you do not specify a default warehouse, database, and schema for Dataiku to use, you will need to specify these when configuring your reads and writes in each flow.
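
As a sketch of that setup, the statements below could be run through the Snowflake Python connector. All object names are illustrative, and using ACCOUNTADMIN is a shortcut; in practice you’d scope privileges more tightly:

```python
import snowflake.connector

# Placeholder admin credentials -- substitute your own account details.
conn = snowflake.connector.connect(
    account="your_account",
    user="ADMIN_USER",
    password="...",
    role="ACCOUNTADMIN",
)

statements = [
    "CREATE ROLE IF NOT EXISTS DATAIKU_ROLE",
    "CREATE USER IF NOT EXISTS DATAIKU_USER PASSWORD = '<strong-password>' "
    "DEFAULT_ROLE = DATAIKU_ROLE",
    "GRANT ROLE DATAIKU_ROLE TO USER DATAIKU_USER",
    # Optional but recommended: dedicated compute and storage for Dataiku.
    "CREATE WAREHOUSE IF NOT EXISTS DATAIKU_WH "
    "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60",
    "CREATE DATABASE IF NOT EXISTS DATAIKU_DB",
    "CREATE SCHEMA IF NOT EXISTS DATAIKU_DB.DATAIKU_SCHEMA",
    "GRANT USAGE ON WAREHOUSE DATAIKU_WH TO ROLE DATAIKU_ROLE",
    "GRANT USAGE ON DATABASE DATAIKU_DB TO ROLE DATAIKU_ROLE",
    "GRANT ALL ON SCHEMA DATAIKU_DB.DATAIKU_SCHEMA TO ROLE DATAIKU_ROLE",
]

cur = conn.cursor()
try:
    for stmt in statements:
        cur.execute(stmt)
finally:
    cur.close()
    conn.close()
```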

The setup page at the time of writing looks like this:

Once you’ve configured your Snowflake connection in Dataiku, you’re ready to get started! To verify that your connection is able to read the correct data, you have a few options.  

You can either log in to Snowflake directly with the user you assigned to the Snowflake connection, or you can go to “datasets” in the top navigation bar.

In the above image, we have a Snowflake connection called “phData,” and Dataiku shows us some metadata about that connection for each table: the origin (a SQL import), the named connection it came from, and when the table was last modified. In our case, this is one of the sample datasets that Snowflake gives you out of the box.

When you create your own datasets that are written to Snowflake, you’ll see them in this list as well, along with any tags assigned to them. In this case, we’ve created a “custom” tag. You can also use Snowpark with Dataiku, as sketched below.
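
Outside of Dataiku’s own connection handling, a standard Snowpark for Python session against objects like those created above might look like this (credentials are placeholders; within Dataiku, the Snowflake connection supplies them):

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder credentials -- in Dataiku, the Snowflake connection supplies these.
session = Session.builder.configs({
    "account": "your_account",
    "user": "DATAIKU_USER",
    "password": "...",
    "warehouse": "DATAIKU_WH",
    "database": "DATAIKU_DB",
    "schema": "DATAIKU_SCHEMA",
}).create()

# The filter and aggregate below are lazily translated to SQL and executed on
# Snowflake compute; no rows leave Snowflake until an action like show() runs.
orders = session.table("ORDERS")
summary = (
    orders
    .filter(col("ORDER_TOTAL") > 100)
    .group_by("REGION")
    .count()
)
summary.show()
```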

How does Dataiku Complement Snowflake?

Dataiku provides a platform for creating AI/ML applications, while Snowflake provides a scalable storage and compute platform for large volumes of data. Snowflake can handle any large-scale data storage or processing workload, but composing and orchestrating those workloads is typically restricted to data engineering teams.

The modern data landscape is driven by data scientists, business intelligence analysts, and analytics engineers who can make use of low/no-code tools more effectively than traditional SQL/code-driven development. 

Dataiku integrates deeply with Snowflake to push visual SQL and other transformation steps down to run on Snowflake compute. This means that data scientists and business users who have no experience writing SQL can gain access to data in Snowflake and seamlessly scale their transformations to massive datasets by using Dataiku with Snowflake.

Use Cases for Dataiku and Snowflake

So now that you know what Dataiku is, what the tool allows you to do, and how to integrate it with Snowflake, let’s talk about how you might leverage this toolset. There are a lot of different use cases for this setup including data pipelining, machine learning, and analytics.

Dataiku provides a number of articles around specific use cases that their tool unlocks, allowing quite a wide range of common problems that companies run into to be tested and solved.

Dataiku and Snowflake: A Good Combo?

Yes! By reading this blog, hopefully you now understand the utility that Dataiku provides and how Snowflake integrates with Dataiku. Dataiku enables your organization to perform machine learning and MLOps functions in a low/no-code environment.

This, paired with the computational power of Snowflake, enables you to perform model training, data transformations, and data analysis using a best-in-class data cloud.

Looking to Succeed with Snowflake & Dataiku?

phData partners closely with both platforms and has helped organizations of all sizes consistently succeed with Snowflake and Dataiku. If you have any questions or are looking for expert help, feel free to reach out today!

FAQs

Can I use Dataiku as an ETL tool?
Dataiku is capable of reading data from one source, performing transformations, and then writing the result to a destination. However, it’s generally recommended that you take an “ELT”-style approach: load data into your target data warehouse “raw” from the source, then perform your transformations in the target warehouse. Tools like dbt are great for basic data cleansing and formatting, but performing those steps in Dataiku is possible and can be practical when you or your teams lack experience with dbt. Since Dataiku pushes SQL operations down into Snowflake, this still matches the ELT paradigm. If additional logic is needed strictly for your machine learning models, performing it in Dataiku makes the most sense.

Where does Dataiku run?
Dataiku provides a number of different ways to run their control plane. You can see the full list of available options here, but generally speaking you’ll either use Dataiku Online (software as a service), a cloud provider VM, or install it yourself on a Linux machine.
