The number of physical devices that are connected to the web has dramatically increased in the last few years. There are expected to be 50 billion connected Internet of Things (IoT) devices in use by 2030, up from roughly 20 billion at the end of 2020.
If you’ve been shopping for home appliances recently, you’ve probably seen refrigerators that let you connect over the internet and view what’s on the shelves, or dishwashers and washing machines you can start remotely. These are just a few examples of the IoT devices out there.
These devices can be all kinds of things such as:
- Smart Home Devices
- Security Systems
- Wearable Medical Equipment
- Inventory Trackers
When it comes to IoT, the amount of data you capture and the frequency at which you capture it add up quickly. For example, if you have 50 devices in your home that are connected to the internet, and each of them sends one update per second, you’re looking at 4,320,000 records a day from a single household, and that’s just personal devices!
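The arithmetic above can be checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope estimate of daily IoT record volume
# for the single-household example above.
devices = 50                      # connected devices in one home
updates_per_second = 1            # one update per device per second
seconds_per_day = 24 * 60 * 60    # 86,400

records_per_day = devices * updates_per_second * seconds_per_day
print(f"{records_per_day:,} records per day")  # 4,320,000 records per day
```

Scale that to an enterprise fleet of a million devices and the same math yields over 86 billion records a day.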
If you’re working with IoT devices within your enterprise, you likely have thousands to millions of devices capturing and sending data to your system, dramatically increasing the amount of data you have at your disposal. As the number of devices you offer increases, your user base grows, or you start integrating with other vendors, the volume of data can quickly scale to staggering numbers.
This means you’ll need an architecture that not only captures this information reliably within your enterprise but also lands it in a format and location (the Snowflake Data Cloud, in this blog) that allows you to analyze it and put it to use.
How Do I Capture IoT Data?
Generally speaking, IoT data is going to be submitted in a streaming architecture, where updates are sent to your system as they occur on the device. Your enterprise may also be receiving data from vendors as events happen within various devices and systems you’re integrated with.
While there are a few different streaming data platforms, we will be focusing on the two most common in data engineering today: Apache Kafka and AWS Kinesis.
Apache Kafka is an open-source distributed event streaming platform, written in Java and Scala, that was initially developed and released by LinkedIn. Kafka uses a publish/subscribe (pub/sub) architecture that provides a high-throughput, low-latency, and durable platform for streaming data. Kafka can be configured and run on-premises, and several companies and cloud providers also offer managed Kafka services.
Producers package event data into messages, which are batched together to reduce overhead and published to a topic. The messages in these topics are then read by subscribers. There are many different types of subscribers available, and consumers can even run inside Kafka itself (for example, via Kafka Streams), enabling processors to transform your data and publish the results to new topics.
Kafka also provides functionality to ensure that events are processed exactly once and are always consumed in order within a partition. Data is partitioned to provide scalability and durability.
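To make the producer flow concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, and message schema are all assumptions for illustration:

```python
import json
import time

def encode_reading(device_id, temperature_c):
    """Serialize a device reading as a JSON message payload (bytes)."""
    return json.dumps({
        "device_id": device_id,
        "temperature_c": temperature_c,
        "ts": int(time.time()),
    }).encode("utf-8")

if __name__ == "__main__":
    # Requires a running Kafka broker (pip install kafka-python).
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumption: local broker
        linger_ms=10,  # wait briefly so messages are batched together
    )
    # Keying by device ID keeps each device's readings in one partition,
    # which preserves per-device ordering.
    producer.send(
        "iot-readings",  # hypothetical topic name
        key=b"thermostat-1",
        value=encode_reading("thermostat-1", 21.5),
    )
    producer.flush()
```

The `linger_ms` setting is what triggers the message grouping described above: the producer waits a few milliseconds so multiple messages can ship in one batch.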
While Apache Kafka itself is free to use apart from hosting costs, there is auxiliary functionality you’ll want alongside your Kafka cluster: the ability to monitor the cluster, view its logs, scale it, and perform maintenance on its nodes. Services like Apache ZooKeeper integrate with Kafka to handle cluster coordination and metadata, but Kafka doesn’t provide native tooling for monitoring, logging, or scaling.
Amazon Kinesis, part of Amazon Web Services (AWS), is also a pub-sub messaging architecture. There are three different types of Kinesis services:
- Data Streams
- Data Firehose
- Data Analytics
For this blog post, we will be focusing on Kinesis Data Streams and Kinesis Data Firehose. Both are fully managed services provided by AWS and, unlike Kafka, cannot be run on-premises. Internally, Kinesis breaks data up into shards, similar to partitions within Kafka, to provide scalability and durability for your streaming data.
Unlike Kafka, which generally requires a lot of setup and configuration in order to maximize performance, setting up Kinesis is very easy! Creating a new Kinesis data stream in the AWS console takes just a couple of inputs:
All you need to configure for AWS Kinesis is the name of your stream and the desired throughput. This works in a pay-as-you-go model where you are charged for each shard (throughput) and each streaming message. With two inputs and one click, AWS provides you with a data stream, monitoring, auto-scaling, and logging.
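The same stream can also be created and written to programmatically. Here is a sketch using boto3, where the stream name, region, and record shape are assumptions:

```python
import json

def encode_record(reading):
    """Kinesis record data is raw bytes; serialize the reading as JSON."""
    return json.dumps(reading).encode("utf-8")

if __name__ == "__main__":
    # Requires AWS credentials (pip install boto3).
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # The two inputs mentioned above: a stream name and a shard count
    # (throughput). An on-demand capacity mode is also available.
    kinesis.create_stream(StreamName="iot-readings", ShardCount=1)

    # Publish one reading; the partition key determines the target shard.
    kinesis.put_record(
        StreamName="iot-readings",
        Data=encode_record({"device_id": "thermostat-1", "temperature_c": 21.5}),
        PartitionKey="thermostat-1",
    )
```

As with Kafka's message keys, using the device ID as the partition key keeps each device's readings on a single shard, preserving per-device ordering.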
Like Both Streaming Data Platform Options?
Kinesis Firehose allows you to subscribe to and receive streaming data from multiple platforms, including Kafka! There are many use cases for collecting data from Kafka into Kinesis Firehose:
- Event data is published to Kafka and consumed by multiple systems or clouds
- Vendors are sending you IoT data from their own streaming data platform
- Your IoT data needs to be consumed by multiple systems, clients, or clouds
How Can I Use Kafka and/or Kinesis with Snowflake?
Now that you’ve been able to capture your IoT data into a streaming platform, you might be asking yourself, “How do I get that data into the Snowflake Data Cloud?” This varies depending on whether you’ve decided to use Kafka or Kinesis, so let’s take a look at each.
Kafka Connector and Snowflake
Snowflake provides the Snowflake Connector for Kafka, which simplifies connecting Kafka to your Snowflake instance. This requires using Kafka Connect, which runs as a standalone process on a single machine or as a dedicated cluster, and it provides native, direct integration between Kafka and Snowflake.
In order to configure Kafka Connect to submit data to Snowflake, you will need to install this connector on your Kafka Connect instance or cluster and configure properties to point to your Snowflake account with the correct credentials. This includes providing a user for the connector to run as and assigning the correct permissions to a role in Snowflake. These permissions include:
- Create Table
- Create Stage
- Create Pipe
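Once the role is in place, the connector itself is driven by a properties file (standalone mode) or a JSON payload posted to the Kafka Connect REST API (distributed mode). Here is a sketch of the latter; all names and values are placeholders, so consult Snowflake’s Kafka connector documentation for the full property list:

```json
{
  "name": "snowflake-iot-sink",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max": "4",
    "topics": "iot-readings",
    "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
    "snowflake.user.name": "kafka_connector_user",
    "snowflake.private.key": "<private key for key-pair authentication>",
    "snowflake.database.name": "IOT_DB",
    "snowflake.schema.name": "RAW",
    "key.converter": "com.snowflake.kafka.connector.records.SnowflakeJsonConverter",
    "value.converter": "com.snowflake.kafka.connector.records.SnowflakeJsonConverter"
  }
}
```

The connector authenticates with key-pair authentication rather than a password, which is why the configured user needs a private key registered in Snowflake.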
When the Snowflake Kafka Connector is used, Snowflake replicates Kafka’s high-availability, low-latency approach. Each partition in your Kafka topic has a dedicated pipe, which dictates how the data is copied from the stage into the target table. This also gives you the ability to perform light processing, such as data type conversions, on your data as it’s ingested.
In the above image, we have the following:
- Data is published to Kafka from an external system
- The Kafka Connector then takes the data published to the Kafka topic and stages it internally
- Snowflake creates a pipe for each partition in the original topic and loads the data into a table
AWS Kinesis and Snowflake
Unlike Apache Kafka, AWS Kinesis and Snowflake do not have a direct integration. There are a few different ways to ingest data from Kinesis into Snowflake, as Snowflake provides an array of functionality and connectors.
We will cover two paths: Snowpipe and Kinesis Client Library (Java) with the Snowflake JDBC Driver.
Snowpipe is used in combination with Snowflake’s stages to load data as it becomes available within your stage. Data is loaded in micro-batches, so it can be processed and made available to your end users faster than with a larger batch-processing approach.
When data is processed by Snowpipe, a few things occur:
- Snowflake executes a COPY command against the data in your stage
- Snowflake performs any transformations defined in the command
- Snowflake loads the data into the target table, another stage, or an external location
In order to connect Snowpipe and Kinesis, you will need to create a consumer for your Kinesis stream. That’s where Kinesis Firehose comes into play. Kinesis Firehose doesn’t have a direct connection to Snowflake, so instead, we’ll need to use an external stage: Amazon Simple Storage Service (S3).
This is great for architectures where you want to simultaneously store data into your data lake and trigger an ingestion of data into Snowflake.
As Kinesis Firehose outputs new files to S3, Snowpipe detects the changes, triggers COPY commands against the files, and loads the data into your defined table.
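A sketch of the Snowflake objects involved might look like the following, with the bucket path, table name, and credentials all as placeholders:

```sql
-- External stage pointing at the S3 location Firehose writes to
-- (bucket path and credentials are placeholders).
CREATE STAGE iot_stage
  URL = 's3://my-iot-bucket/firehose-output/'
  CREDENTIALS = (AWS_KEY_ID = '<key-id>' AWS_SECRET_KEY = '<secret-key>');

-- AUTO_INGEST = TRUE lets Snowpipe load new files automatically as
-- S3 event notifications arrive.
CREATE PIPE iot_pipe AUTO_INGEST = TRUE AS
  COPY INTO iot_readings
  FROM @iot_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

With `AUTO_INGEST` enabled, you configure the S3 bucket to send event notifications to the pipe’s notification channel so new Firehose output is picked up without any polling on your side.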
Kinesis Client Library
The AWS Kinesis Client Library (KCL) allows you to build a Kinesis consumer as a Java application and define how each record is processed as it’s streamed. Since Snowflake provides a JDBC driver, you can write a single Java application that receives stream data from Kinesis Data Streams, performs any processing of your data, and then loads it into Snowflake using the JDBC driver.
The Java code will require a compute resource to run on, likely an EC2 instance or a container in Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS). While this gives you more manual control over how your data is processed, you will need to manage the scaling of your Java compute instances yourself.
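The path described above uses the Java KCL with the JDBC driver. Purely to illustrate the same consume-process-load loop, here is a simplified Python sketch using boto3 and the Snowflake Python connector instead; the stream name, table, and credentials are placeholders, and a real consumer would poll continuously and checkpoint its position:

```python
import json

def transform(raw_bytes):
    """Decode a Kinesis record and do any light per-record processing."""
    reading = json.loads(raw_bytes)
    reading["temperature_f"] = reading["temperature_c"] * 9 / 5 + 32
    return reading

if __name__ == "__main__":
    # Requires AWS credentials and a Snowflake account
    # (pip install boto3 snowflake-connector-python).
    import boto3
    import snowflake.connector

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    conn = snowflake.connector.connect(
        account="myaccount", user="loader", password="<placeholder>",
        database="IOT_DB", schema="RAW", warehouse="LOAD_WH",
    )

    # Read from the first shard of a hypothetical stream.
    shard = kinesis.describe_stream(StreamName="iot-readings")[
        "StreamDescription"]["Shards"][0]
    iterator = kinesis.get_shard_iterator(
        StreamName="iot-readings",
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    # Fetch one batch, process each record, and load into Snowflake.
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    rows = [transform(r["Data"]) for r in batch["Records"]]
    conn.cursor().executemany(
        "INSERT INTO iot_readings (device_id, temperature_f) VALUES (%s, %s)",
        [(r["device_id"], r["temperature_f"]) for r in rows],
    )
```

The KCL adds what this sketch omits: lease management across shards, checkpointing, and automatic rebalancing when consumers come and go, which is why it is the recommended approach for production consumers.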
Which IoT Integration Options Should You Choose With Snowflake?
Whether you integrate Snowflake with Kafka or Kinesis Firehose and Snowpipe will largely be determined by your specific use case, data governance controls, enterprise goals, and architectures.
There are a few cases where Kinesis and/or Kinesis Firehose would be recommended:
- Ingesting data from multiple stream providers into AWS
- You do not want to manage a Kafka Connect instance or cluster
Kinesis and Kinesis Firehose allow you to ingest that data into Snowflake via Snowpipe and S3 or the Kinesis Client Library. These managed services are very simple to set up and require little maintenance.
Many customers will have only one data stream provider to integrate with. If you’re already using Kafka or Amazon Managed Streaming for Apache Kafka (MSK), then the Kafka Connector is recommended. Kafka is a well-established tool with an extensive track record, and it integrates seamlessly with many different vendors and applications.
Common Questions About Streaming IoT Data with Snowflake
How do I get streaming data into Snowflake?
It depends on what tooling you’re using for capturing streaming data. If you’re using Kafka, Snowflake provides a connector that you can install on your Kafka Connect instance or cluster. With some minor configuration in Kafka and Snowflake, you’re up and running! If you’re using AWS Kinesis, you can use AWS Kinesis Firehose to publish data to S3 and use Snowpipe to load data into Snowflake.
How do I capture data from my IoT devices?
You will want to select an event streaming platform (AWS Kinesis or Kafka) and configure your devices to either publish directly to your streaming platform or send messages to an API that then publishes to your streaming platform. Once that data has been put into the streaming platform, there are a number of tools provided by Snowflake to ingest it. Both of these streaming platforms have multiple libraries and tooling available for common programming languages to simplify data publishing.