This blog was originally written by Keith Smith and updated for 2022 by Nick Goble.
In 2021, the overall amount of data generated in the world was estimated to be around 79 zettabytes. To put that in perspective, most of us are used to talking about data in terabytes. A zettabyte is a terabyte multiplied by 1,024 three more times (1TB × 1024 × 1024 × 1024) — three powers of 1,024 larger than a terabyte!
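The arithmetic is easy to check for yourself; here is a quick sketch in Python using binary units (the decimal tera/zetta prefixes work the same way with steps of 1,000):

```python
# Rough sense of scale, in binary units.
TB = 1024 ** 4        # bytes in a terabyte (binary)
ZB = TB * 1024 ** 3   # three more powers of 1,024: tera -> peta -> exa -> zetta

print(ZB // TB)       # 1073741824 terabytes per zettabyte
print(79 * ZB)        # the 2021 estimate (~79 ZB), expressed in bytes
```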
In 2022, this number is likely to grow by a significant amount.
As data growth explodes (along with the options for processing that data), we continually see customers standardize in a couple of key areas, regardless of their industry. Data and engineering teams are consistently choosing the Snowflake Data Cloud as the standard for data lakes, data warehouses, machine learning, and data strategy. In turn, innovative teams often build up a codebase of business logic and processes that is critical to serving a wide variety of needs.
The upshot? Snowflake has announced the public preview of Snowpark. As one of just a few fortunate partners to get early access via the Snowpark Accelerated program, we’re excited to share our unique perspective.
In this blog, we’ll cover the basics of Snowpark and why we think it’s going to be such a game-changer in the data engineering and machine learning space.
What is Snowflake’s Snowpark?
Snowpark is the latest product offering from Snowflake. Snowpark allows developers to bring their favorite tools and deploy them in a serverless manner to Snowflake’s virtual warehouse compute engine.
Take a look at this short video to see Dr. Robert Coop and me illustrate how Snowpark enables both your data engineering and machine learning applications.
Why is Snowpark Exciting to Us?
phData has been working in data engineering since the company’s inception back in 2015. Until now, we’ve had to treat data warehousing and data engineering as separate entities. We have seen customers transform their data analytics with Snowflake, and transform their data engineering and machine learning applications with Spark, Java, Scala, and Python. By offering native data engineering functionality inside of Snowflake virtual warehouses, we see a thriving data ecosystem being built around Snowflake.
The move makes our customers’ lives simpler by unifying their data lake and data warehouse into a complete data platform.
Snowflake has empowered our customers to democratize their data — but they’ve remained limited to the functionality of the SQL language. Snowpark will allow for native integration with the data inside of Snowflake, whether it’s via DataFrames or other coding constructs that are now available.
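To make that concrete, here is a hypothetical sketch of the kind of business logic this unlocks, written in Python for readability (Snowpark launched with Java and Scala, with other languages on the roadmap). The function and names are illustrative, not Snowflake’s API; the point is that the logic is ordinary code, unit-testable on its own, that could then be registered to run next to the data:

```python
# Illustrative business logic that could be registered as a Snowpark
# user-defined function (UDF) and executed inside a virtual warehouse.
# The function itself is plain Python, so it can be tested locally.

def tiered_discount(order_total: float) -> float:
    """Apply a hypothetical tiered discount to an order total."""
    if order_total >= 1000:
        return order_total * 0.90   # 10% off large orders
    if order_total >= 100:
        return order_total * 0.95   # 5% off mid-sized orders
    return order_total              # no discount otherwise

# Against a live Snowflake session, registration would look roughly like:
#   session.udf.register(tiered_discount, name="tiered_discount")
# after which the logic is callable from SQL or the DataFrame API.

print(tiered_discount(200.0))
```

Because the logic lives in code rather than embedded SQL, it can be versioned, reviewed, and unit-tested like any other application code.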
What are the Capabilities of Snowpark?
By supporting Java and Scala (with other languages on the roadmap), Snowpark lets customers leverage already-built code bases and easily migrate their business logic. Additionally, an industry-standard DataFrame API makes accessing data programmatically far easier.
We are also working on the following solutions:
- Consistent ingestion and integration system. Users can combine different types of data, performance metrics, and computation in one place.
- Standardization of the Data Engineering experience. Data Pipelines are testable, which allows for true CI/CD and unit testing. Data Pipelines are more straightforward and readable.
- Access to third-party libraries, including data science and machine learning packages.
- Machine Learning. Operationalize your MLOps platform by unlocking the ability to store, track, and serve your models.
What Does All This Mean?
In practice, Snowpark opens the door to building applications that interact with Snowflake natively. Previously, code development and deployment required separate infrastructure to build, run, and maintain. Now, Snowpark allows both of these tasks to be handled within Snowflake, right inside a virtual warehouse.
Snowpark also gives you the tools to compute, report, and store statistics or derived results that can be served in real time, which should make reading IoT sensor data or streaming financial data much more straightforward. This can include ML models, web session analysis, or alert detection from streaming data events.
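As a hypothetical sketch of the alert-detection case: the logic below flags readings whose rolling average crosses a threshold. In a Snowpark deployment this kind of check would run inside a virtual warehouse against streamed data; here it is shown as plain Python so the logic stands on its own (all names and thresholds are invented for illustration):

```python
from collections import deque

def alerts(readings, window=3, threshold=100.0):
    """Yield True whenever the rolling mean of the last `window`
    readings exceeds `threshold`, else False."""
    buf = deque(maxlen=window)
    for r in readings:
        buf.append(r)
        yield len(buf) == window and sum(buf) / window > threshold

# Simulated sensor stream: the average drifts above the threshold.
stream = [90, 95, 120, 130, 140]
print(list(alerts(stream)))  # [False, False, True, True, True]
```

A rolling window keeps the check robust to single noisy readings, which is usually what you want for sensor or market data.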
What are the Potential Gaps Snowpark Solves?
After building and managing workloads at scale for the past six years, we recognize a handful of potential issues that arise when developing against large datasets:
- Long startup time for distributed resources.
- Systems like Hadoop or Spark require a cluster of nodes to be ready to do work. On most platforms, this can be a 5-10 minute process before being able to execute any code.
- Snowpark will solve this by tapping into Snowflake’s readily available virtual warehouses.
- Efficient coding practices over large datasets.
- Allowing developers access can introduce a learning curve that requires learning best practices for optimization; this is especially true on large datasets.
- This is something Snowpark is looking to directly solve, since Snowflake provides a single layer for accessing data and opens up the DataFrame API. This means you don’t have to access files directly or worry about issues like small files.
- Managing and troubleshooting partitions and garbage collection.
- This is the most common problem with Hadoop or other file-based data lakes and tends to be the most difficult to solve.
- Generally, troubleshooting requires deeper knowledge of data skew and of how the code executes across compute nodes.
If your data team does not have the internal programming skills required (Python, Java, or Scala), there tends to be a steep learning curve. But once the basics are understood, handling and manipulating data remain the same.
In short, Snowpark is an exciting step into the data application world, and we are proud to partner with Snowflake on the journey. As we learn more about Snowpark, we will share tips and tricks to help educate you along the way.