Modern business challenges typically come down to solving complex problems efficiently.
As data volumes explode (and the options for processing that data multiply along with them), we continually see customers standardize in a few key areas, regardless of the industry. Data and engineering teams are consistently choosing the Snowflake Data Cloud as the standard for data lakes, data warehouses, and data strategy. In turn, innovative teams often choose Apache Spark as the standard engine for distributed data processing.
The upshot? Snowflake has announced its intent to integrate with Apache Spark.
Snowflake + Spark = Snowpark. See what they did there?
Why is Snowpark exciting to us?
phData has been working with both technologies since the inception of the company back in 2015. Until now, we’ve had to treat them as different entities. We have seen customers transform their data analytics with Snowflake and transform their data engineering and machine learning applications with Spark. By combining these into one offering, we see a thriving data ecosystem being built around Snowflake.
The move makes our customers' lives simpler by unifying their data lake into a complete data platform.
Snowflake has empowered our customers to democratize their data — but they’ve remained limited to the functionality of the SQL language. Spark on Snowflake will allow for native integration with the data inside of Snowflake, whether it’s via the Spark DataFrame API or the Spark SQL API.
What are the capabilities of Snowpark?
Spark has matured considerably since its inception and now offers an interface that should come naturally to those who work with SQL. It provides both a SQL interface and a structured DataFrame interface whose operations map closely to the clauses of a standard SQL statement. And once the data is loaded and defined inside the Spark process, the true power of Spark's distributed processing engine shines.
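To make the comparison concrete, here is a toy sketch in plain Python (no Spark required) of how a chainable, structured interface can express the same logic as a SQL statement. The `ToyFrame` class and the sample data are invented for illustration; real code would use Spark's DataFrame API instead.

```python
# Toy illustration of a DataFrame-style, chainable API expressing the
# same logic as a SQL statement. Invented for illustration only; this
# is not the Spark or Snowpark API.

class ToyFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, one dict per row

    def filter(self, predicate):
        # Keep only rows matching the predicate (like a SQL WHERE clause)
        return ToyFrame([r for r in self.rows if predicate(r)])

    def select(self, *columns):
        # Project down to the named columns (like a SQL SELECT list)
        return ToyFrame([{c: r[c] for c in columns} for r in self.rows])

    def collect(self):
        return self.rows

orders = ToyFrame([
    {"id": 1, "region": "east", "amount": 120},
    {"id": 2, "region": "west", "amount": 80},
    {"id": 3, "region": "east", "amount": 45},
])

# Structurally equivalent to:
#   SELECT id, amount FROM orders WHERE region = 'east'
east_orders = (
    orders
    .filter(lambda r: r["region"] == "east")
    .select("id", "amount")
    .collect()
)
```

The point of the structured style is that each clause becomes a composable, inspectable method call rather than a string, which is what makes programmatic pipeline construction possible.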
Spark allows for:
- A consistent ingestion and integration system. Users can combine different types of data and compute metrics across them.
- A standardized data engineering experience. Pipelines are testable, which enables true CI/CD and unit testing, and the resulting code is more straightforward and readable.
- Streaming and near-real-time processing and serving.
- Access to third-party libraries for data science and machine learning workloads.
- In-memory processing, which is optimal for iterative tasks that don't require flushing intermediate results to a new storage layer.
- Support for multiple programming languages, including Java, Scala, and Python (to come).
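The testability point above is worth making concrete: when transformation logic lives in pure functions rather than SQL strings, a pipeline step can be unit tested in any ordinary CI system. A minimal Python sketch (the function and sample data are invented for illustration, not a Snowpark API):

```python
# A pipeline step written as a pure function is trivially unit-testable,
# which is what makes CI/CD practical for data pipelines. The function
# and data here are illustrative only.

def normalize_amounts(rows, rate):
    """Convert each row's amount into a common currency at the given rate."""
    return [{**row, "amount": round(row["amount"] * rate, 2)} for row in rows]

# A unit test for the step, runnable in any CI environment:
sample = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 2.5}]
result = normalize_amounts(sample, rate=1.2)
assert result == [{"id": 1, "amount": 12.0}, {"id": 2, "amount": 3.0}]
```

Because the step takes plain data in and returns plain data out, the same function can be exercised by tests locally and then applied at scale inside the engine.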
What does all this mean?
In practice, Snowpark opens the door to building applications that interact with Snowflake natively.
Reading IoT sensor data or streaming financial data becomes straightforward, because Spark gives you the tools to compute, report, and store results that can be served in real time. Examples include ML model scoring, web session analysis, and alert detection over streaming data events.
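As one hedged sketch of the alert-detection pattern described above: a rolling statistic computed over incoming events, with an alert emitted when it crosses a threshold. This is plain Python with no streaming framework; in a real deployment the same logic would run inside a streaming job, and the sensor data, window size, and threshold are all invented for illustration.

```python
# Sketch of threshold-based alert detection over a stream of sensor
# events. Window size, threshold, and data are illustrative only.
from collections import deque

def detect_alerts(events, window=3, threshold=100.0):
    """Yield an alert whenever the rolling mean of the last `window`
    readings exceeds `threshold`."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event["value"])
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield {"sensor": event["sensor"], "rolling_mean": mean}

readings = [
    {"sensor": "a", "value": 90.0},
    {"sensor": "a", "value": 110.0},
    {"sensor": "a", "value": 120.0},  # rolling mean exceeds 100 here
    {"sensor": "a", "value": 60.0},
]
alerts = list(detect_alerts(readings))
```

The same shape generalizes: swap the rolling mean for a model score and the alert becomes real-time ML inference over the stream.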
While we are excited for the move, there are still several unknowns for the new Snowflake + Spark combination:
- How will Snowpark run?
- How will warehouse sizes be affected?
- How will user and data management be affected?
- How will credit consumption be affected?
- What is the level of customization and API integration that it will allow for?
What are the potential issues with Snowpark?
After building and managing Spark workloads for the past 6 years, we recognize there are a handful of potential issues with implementing Spark on Snowflake:
- Long startup time.
  - Spark requires a cluster of nodes to be ready to do work; on most platforms this can be a 5-10 minute process before any code can be executed.
  - Snowpark will solve this by tapping into Snowflake's readily available virtual warehouses.
- The shift from programming a single system to programming a distributed one makes getting started difficult.
  - This is something we are hoping Snowflake will help solve, since they provide a single layer to access their data.
- Managing and troubleshooting partitions and garbage collection.
  - This is the most common problem with Spark, and it tends to be the most difficult to solve.
  - Generally, troubleshooting requires deeper knowledge of data skew and of how the code is executed across compute nodes.
- Knowing when the process has failed.
  - It's difficult to determine when a Spark process has stopped processing data and needs to be restarted. This problem isn't isolated to Spark, but it can be harder to solve with this technology.
  - This is an opportunity to ensure you have proper logging and monitoring in place.
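The last point above, knowing when a process has failed, usually comes down to instrumenting every pipeline stage so an external monitor can alert when a stage stops reporting success. A minimal sketch using Python's standard logging module (the wrapper and stage names are invented for illustration):

```python
# Minimal instrumentation pattern: log each stage's outcome explicitly
# so a monitoring system can alert when success messages stop arriving.
# The wrapper and stage names are illustrative only.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def run_stage(name, fn, *args):
    """Run one pipeline stage, logging success or failure explicitly."""
    try:
        result = fn(*args)
        logger.info("stage=%s status=success", name)
        return result
    except Exception:
        logger.exception("stage=%s status=failure", name)
        raise

# Example: wrapping a (trivial) load stage.
rows = run_stage("load", lambda: [1, 2, 3])
```

Structured, machine-parseable log lines like `stage=load status=success` are what let an alerting system notice the absence of progress, not just explicit errors.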
If your data team does not have the internal programming skills required (Python, Java, or Scala), there tends to be a steep learning curve. But once the basics are understood, the patterns for handling and manipulating data remain consistent.
In short, Snowpark is an exciting step into the data application world, and we are glad to partner with Snowflake on the journey. As we learn more, we will share tips and tricks to help you on your own journey.