Case Study

Marketing Company Migrates Hadoop to Snowflake With Snowpark

Customer's Challenge

A leading marketing company was working to modernize its Data Application product by migrating from Hadoop to the Snowflake Data Cloud. By moving to Snowflake, the company was looking to increase performance and simplify the process of collecting data provided by its customers. Snowflake would help achieve these goals via Snowpark, but the company lacked familiarity with Snowflake and Snowpark and needed an experienced partner to help.

phData's Solution

phData helped the client migrate one of its core products from Hadoop to Snowflake while delivering a framework for future migration work, along with a framework for automated testing and deployment.

Results

Using the patterns established during the engagement, the client can migrate any other data application to Snowflake and Snowpark using the existing code as a template. This should significantly reduce the time required to move other applications in the future.

By the end of the engagement, the client was processing customer data through the new pipeline without its customers needing to change anything on their end. Customers who are also on Snowflake now deliver their data directly via Snowflake Shares, which makes the process even simpler for them.

The Full Story

A major marketing company was looking to modernize its existing data products by moving from Hadoop to Snowflake. At the time, the client was using either an on-premises Hadoop cluster or an AWS-hosted one. Multiple technologies were used to power this data product, including HBase, Hive, and MapReduce.

The existing process had multiple ways to run, with normal operation being a batch process run via MapReduce. There was also a streaming process that ran micro-batches. Finally, the process had to scale from a few thousand records on the low end to over 100 million records on the high end.

phData stepped in based on recommendations from Snowflake and reviewed the client's existing Java code base and architecture. phData recommended a new architecture using Snowpark as the foundation for the data application in Snowflake.

Snowpark would allow the client to reuse their existing in-house Java libraries and classes, which would significantly reduce time to market. phData was one of the first companies outside of Snowflake to start working with Snowpark, and that experience helped guide both the recommendation and the development.

phData transformed the existing MapReduce code into a Snowpark stored procedure that can be called for any volume of data. This stored procedure is automatically tested and deployed via automation, giving the client a template for future work that follows current best practices.
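
As a rough sketch (not the client's actual implementation), a Snowpark Java stored procedure handler for this kind of batch processing could look like the following; the class name, table parameters, and transform step are hypothetical stand-ins for the client's in-house libraries.

import com.snowflake.snowpark_java.DataFrame;
import com.snowflake.snowpark_java.SaveMode;
import com.snowflake.snowpark_java.Session;

// Hypothetical handler for a Snowpark stored procedure. Snowflake passes the
// active Session as the first argument when the procedure is called.
public class BatchProcessorHandler {

    public String run(Session session, String inputTable, String outputTable) {
        // Read the batch of records staged in the input table.
        DataFrame input = session.table(inputTable);
        long recordCount = input.count();

        // Placeholder for the in-house Java transformation logic that was
        // previously executed as a MapReduce job.
        DataFrame transformed = transform(input);

        // Append the processed records to the shared output table.
        transformed.write().mode(SaveMode.Append).saveAsTable(outputTable);

        return "Processed " + recordCount + " records from " + inputTable;
    }

    private DataFrame transform(DataFrame input) {
        // The real implementation would call the existing Java libraries;
        // this sketch simply passes the records through unchanged.
        return input;
    }
}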

Because the Snowpark stored procedure could be called by a task, phData was able to set up a Snowflake stream and task to handle not only the client's batch data needs but also the streaming/micro-batch needs with the same deployment and code base. This significantly streamlined and standardized the processes.
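
The wiring between the stream, the task, and the stored procedure comes down to a handful of SQL statements. The sketch below issues them through Snowpark's session.sql; the object names (input_records, process_batch, orchestration_wh, and so on) are assumptions for illustration only.

import com.snowflake.snowpark_java.Session;

// Minimal sketch of wiring a stream and a scheduled task to the stored
// procedure so batch and micro-batch loads flow through one code path.
public class StreamTaskSetup {

    public static void configure(Session session) {
        // Capture new rows landing in the input table.
        session.sql(
            "CREATE STREAM IF NOT EXISTS input_stream ON TABLE input_records"
        ).collect();

        // A task that wakes up on a schedule but only runs when the stream has
        // new data, then calls the Snowpark stored procedure for the micro-batch.
        session.sql(
            "CREATE TASK IF NOT EXISTS process_batch_task " +
            "  WAREHOUSE = orchestration_wh " +
            "  SCHEDULE = '5 MINUTE' " +
            "  WHEN SYSTEM$STREAM_HAS_DATA('INPUT_STREAM') " +
            "AS CALL process_batch('INPUT_RECORDS', 'OUTPUT_RECORDS')"
        ).collect();

        // Tasks are created suspended; resume to start processing.
        session.sql("ALTER TASK process_batch_task RESUME").collect();
    }
}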

Advantages of Snowpark

Besides reusing existing libraries, Snowpark offered two other advantages: the application could increase or decrease the size of the warehouse based on the number of records it needed to process, and each batch could be processed asynchronously from the previous batch.

This allows for significantly faster processing of batches of records, dropping the time it takes to process 100 million records from more than 12 hours in Hadoop to less than four hours in Snowflake.

Snowpark Performance Details

For the asynchronous process, an input table and stream are created first. Any time data is added to the input table, the orchestration task kicks off. This task creates a dedicated warehouse, a batch-specific input table, and a task to run the batch of records. Once these are created, the orchestration task executes the new task and logs the resources in a tracking table.
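
A simplified sketch of that orchestration step, assuming a tracking table named batch_tracking and a stored procedure named process_batch (both names are illustrative), might look like this:

import com.snowflake.snowpark_java.Session;

// Hypothetical per-batch orchestration: provision dedicated resources,
// kick off the processing task, and record what was created so it can be
// cleaned up on a later run.
public class BatchOrchestrator {

    public static void launchBatch(Session session, String batchId) {
        String warehouse = "BATCH_WH_" + batchId;
        String inputTable = "INPUT_BATCH_" + batchId;
        String task = "BATCH_TASK_" + batchId;

        // Dedicated warehouse for this batch; resized later from the record count.
        session.sql("CREATE WAREHOUSE IF NOT EXISTS " + warehouse +
                    " WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60").collect();

        // Batch-specific input table holding just this batch's records.
        session.sql("CREATE TABLE IF NOT EXISTS " + inputTable +
                    " LIKE input_records").collect();

        // One-off task that calls the Snowpark stored procedure for this batch.
        session.sql("CREATE TASK IF NOT EXISTS " + task +
                    " WAREHOUSE = " + warehouse +
                    " AS CALL process_batch('" + inputTable + "', 'OUTPUT_RECORDS')").collect();

        // Run the task immediately and log the resources for later cleanup.
        session.sql("EXECUTE TASK " + task).collect();
        session.sql("INSERT INTO batch_tracking (batch_id, warehouse_name, task_name, status) " +
                    "VALUES ('" + batchId + "', '" + warehouse + "', '" + task + "', 'RUNNING')").collect();
    }
}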

The Snowpark job looks at the number of records in the input table and decides how large the warehouse should be before processing the records. Once the task has completed, it is logged as finished in the tracking table and the results are written to the shared output table.
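
The size decision itself can be as simple as mapping a row count to a warehouse size before the batch runs. The thresholds and names in the sketch below are illustrative assumptions, not the client's actual values.

import com.snowflake.snowpark_java.Session;

// Illustrative only: pick a warehouse size from the number of staged records,
// then resize the batch's dedicated warehouse before processing.
public class WarehouseSizer {

    public static void resizeForBatch(Session session, String warehouse, String inputTable) {
        long rowCount = session.table(inputTable).count();

        String size;
        if (rowCount < 100_000) {
            size = "XSMALL";
        } else if (rowCount < 10_000_000) {
            size = "MEDIUM";
        } else {
            size = "XLARGE";   // e.g. the 100-million-record batches
        }

        session.sql("ALTER WAREHOUSE " + warehouse +
                    " SET WAREHOUSE_SIZE = '" + size + "'").collect();
    }
}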

When the next batch runs, the orchestration task deletes any completed tasks, warehouses, and temporary input tables from previous runs.
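
That cleanup can be driven by the same tracking table. The sketch below drops the per-batch task, warehouse, and temporary input table for every batch marked finished; again, the table and column names are hypothetical.

import com.snowflake.snowpark_java.Row;
import com.snowflake.snowpark_java.Session;

// Hypothetical cleanup pass run at the start of the next batch: drop the
// per-batch resources for every batch recorded as finished.
public class BatchCleanup {

    public static void dropFinishedResources(Session session) {
        Row[] finished = session.sql(
            "SELECT batch_id, warehouse_name, task_name " +
            "FROM batch_tracking WHERE status = 'FINISHED'").collect();

        for (Row row : finished) {
            String batchId = row.getString(0);
            session.sql("DROP TASK IF EXISTS " + row.getString(2)).collect();
            session.sql("DROP WAREHOUSE IF EXISTS " + row.getString(1)).collect();
            session.sql("DROP TABLE IF EXISTS INPUT_BATCH_" + batchId).collect();

            // Mark the batch as cleaned so it is skipped on subsequent runs.
            session.sql("UPDATE batch_tracking SET status = 'CLEANED' " +
                        "WHERE batch_id = '" + batchId + "'").collect();
        }
    }
}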

Take the next step with phData.

Looking into modernizing your data-driven decision-making process with Snowflake? Learn how phData can help solve your most challenging problems. 
