April 3, 2024

Mastering Avro File Handling for Seamless Data Operations With Snowflake

By Priya Chauhan

Organizations often store data in Avro files when they need to exchange data between programs regardless of the language each program is written in. This can make processing challenging for many traditional databases, but not for the Snowflake Data Cloud.

In this blog, we will help you master handling Avro files to allow for seamless data operations with Snowflake.

What is Avro?

Avro is a data serialization framework used to exchange data between systems and languages. Avro files are typically used in distributed computing systems and are efficient for storing large datasets due to their compact binary nature.

Avro stores the data definition in JSON format, which makes it easy to read and interpret. Avro files include markers that can be used to split large data sets into subsets.

The key feature of Avro is its excellent support for data schemas that change over time, known as schema evolution. Avro handles schema changes such as missing, added, and changed fields gracefully. Data serialized with Avro can be exchanged between programs written in different languages, even between a compiled language like C and a scripting platform like Apache Pig.

Two Types of Avro File Extensions

file_name.avro:
These files contain serialized data records along with the Avro schema, which describes the structure of the data in a compact binary format.

file_name.avsc:
These files contain Avro schemas. Avro schemas are written in JSON format and define the structure of the data stored in Avro files. They describe the fields, data types, and structure of the data.

.avsc files hold these schema definitions in a human-readable format separate from the data itself. The purpose of keeping a separate schema file is to maintain version control, ease of modification, and reusability across different datasets or applications using Avro serialization.

Sample Avro Data: user.avro

Obj
{"type":"record","name":"example","fields":[{"name":"id","type":"int"},{"name":"name","type":"string"}]}
1Alice2Bob

What Does It Mean?

  • Obj is the marker at the start of every Avro object container file, indicating binary Avro data.

  • The following section is the Avro schema header:

{
    "type":"record",
    "name":"example",
    "fields":[
        {"name":"id", "type":"int"},
        {"name":"name", "type":"string"}
    ]
}

  • The data records (1Alice and 2Bob) correspond to the actual data. In this case, it represents two records: one with id=1 and name=Alice, and another with id=2 and name=Bob.

How to Load Avro Files Into Snowflake Tables

In this example, let’s say you have a file named part-1.avro in an S3 bucket, and your task is to load the Avro file into a Snowflake table.

Step 1: Create a storage integration in Snowflake as per the Snowflake documentation.
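
A minimal sketch of what that integration might look like is shown below; the integration name s3_int matches the stage definition in the next step, while the IAM role ARN is a placeholder you would replace following the Snowflake documentation.

CREATE STORAGE INTEGRATION s3_int
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    -- Placeholder role ARN; configure the IAM trust relationship per Snowflake's docs
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_s3_access'
    STORAGE_ALLOWED_LOCATIONS = ('s3://phdata-blog/avro-files/');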

Step 2: Create an external stage in Snowflake that points to the bucket where your Avro file resides.

CREATE STAGE s3_dev_stg
    STORAGE_INTEGRATION = s3_int
    URL = 's3://phdata-blog/avro-files/'
    FILE_FORMAT = (TYPE = AVRO);

In the above code, we’ve defined the file format inline on the stage. You can either include the file format at stage creation, as shown, or create a separate named file format for your Avro files (see the sketch below).
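
For example, a separate named file format might look like the sketch below (the name avro_ff is illustrative); a stage or COPY command can then reference it by name.

-- Named file format for Avro files (illustrative name)
CREATE FILE FORMAT avro_ff
    TYPE = AVRO;

-- A stage or COPY command can then reference it with:
-- FILE_FORMAT = (FORMAT_NAME = 'avro_ff')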

Step 3: Load the data into a table.

Consider the table definition as follows:

CREATE TABLE dev_db.phdata_blog.customer (
    id NUMBER(38,0),
    name VARCHAR(100),
    load_date DATE
);

Running the COPY INTO command:

COPY INTO dev_db.phdata_blog.customer FROM (
    SELECT
        $1:id::NUMBER,
        $1:name::VARCHAR,
        current_date AS load_date
    FROM @dev_db.phdata.s3_dev_stg/customer/
);

Scenarios Where More Granularity May Be Required

Scenario 1: Getting Date from Filename in a Snowflake Table

Suppose a vendor performs daily file dumps in S3 and appends a date to the end of each filename. The organization wants to track each file’s date in the table’s load_date column.

Filenames may look as follows: Part-1-20231201.avro, Part-1-20231202.avro, etc.

Here’s what the COPY command would look like:

COPY INTO dev_db.phdata_blog.customer FROM (
    SELECT
        $1:id::NUMBER,
        $1:name::VARCHAR,
        TO_DATE(SUBSTR(METADATA$FILENAME, -13, 8), 'YYYYMMDD')
    FROM @dev_db.phdata.s3_dev_stg/customer/
);

METADATA$FILENAME returns the staged file’s full path, and SUBSTR extracts the date portion from it. You can preview this logic before loading, as shown below.
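
Here is a sketch of a quick sanity check run directly against the stage before the COPY.

-- Preview each staged file path and the date extracted from it
SELECT DISTINCT
    METADATA$FILENAME AS file_path,
    TO_DATE(SUBSTR(METADATA$FILENAME, -13, 8), 'YYYYMMDD') AS extracted_date
FROM @dev_db.phdata.s3_dev_stg/customer/;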

Scenario 2: Filtering Only Avro Files

In some instances, an organization may find .SUCCESS files accompanying Avro files. These .SUCCESS files serve as indicators, signaling that the process that produced the dataset completed successfully, as demonstrated below:

Part-1-20231201.avro

Part-1-20231202.avro

.SUCCESS (0-byte file)

In this case, the above COPY command will fail when it encounters the 0-byte .SUCCESS file, which contains no Avro data.

To fix this, modify the COPY command with a PATTERN option so that only files ending in .avro are loaded:

COPY INTO dev_db.phdata_blog.customer FROM (
    SELECT
        $1:id::NUMBER,
        $1:name::VARCHAR,
        TO_DATE(SUBSTR(METADATA$FILENAME, -13, 8), 'YYYYMMDD')
    FROM @dev_db.phdata.s3_dev_stg/customer/
)
PATTERN = '.*\\.avro';

Scenario 3: Loading Avro Data into VARIANT Columns for Direct Querying in Snowflake

We are using the userdata.avro example file.

The schema looks as follows:

{
    "type":"record",
    "name":"User",
    "fields":[
        {"name":"birthdate", "type":"string"},
        {"name":"cc", "type":"long"},
        {"name":"comments", "type":"string"},
        {"name":"country", "type":"string"},
        {"name":"email", "type":"string"},
        {"name":"first_name", "type":"string"},
        {"name":"gender", "type":"string"},
        {"name":"id", "type":"int"},
        {"name":"ip_address", "type":"string"},
        {"name":"last_name", "type":"string"},
        {"name":"registration_dttm", "type":"string"},
        {"name":"salary", "type":"double"},
        {"name":"title", "type":"string"}
    ]
}

Below, we create a user data table with a single VARIANT column, user_info, and load the data using the COPY command (see the sketch below).
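
Here’s a sketch of what that might look like; the table name user_data_variant and the userdata/ stage path are illustrative.

-- Table with a single VARIANT column (illustrative name)
CREATE TABLE dev_db.phdata_blog.user_data_variant (
    user_info VARIANT
);

-- No transformation needed: each Avro record lands in the VARIANT column
COPY INTO dev_db.phdata_blog.user_data_variant
FROM @dev_db.phdata.s3_dev_stg/userdata/;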

The file format is Avro, defined at the stage level.

Snowflake reads the Avro data into the single VARIANT column. Now, we can query the data in the VARIANT column user_info just as we would JSON or any other semi-structured data, as shown below.
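
For example, individual Avro fields can be pulled out of user_info with the colon-path syntax (a sketch against the illustrative table above).

-- Query Avro fields from the VARIANT column just like JSON
SELECT
    user_info:id::INT AS id,
    user_info:first_name::STRING AS first_name,
    user_info:country::STRING AS country,
    user_info:salary::DOUBLE AS salary
FROM dev_db.phdata_blog.user_data_variant
LIMIT 10;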

Best Practices

  1. The field names referenced in the COPY INTO command (for example, $1:id) must match the names in the Avro schema file or the Avro file header; otherwise, the load will fail.

  2. Snowflake does provide the MATCH_BY_COLUMN_NAME option when loading Avro files (shown below), but it doesn’t work when you run the COPY command with a transformation.

MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE

Let’s take the same userdata.avro file from Scenario 3 and create a new table that matches the user data schema.

CREATE TABLE user_data (
    birthdate STRING,
    cc BIGINT,
    comments STRING,
    country STRING,
    email STRING,
    first_name STRING,
    gender STRING,
    id INT,
    ip_address STRING,
    last_name STRING,
    registration_dttm STRING,
    salary FLOAT,
    title STRING
);

Running the COPY command without any transformation, using the MATCH_BY_COLUMN_NAME option, works smoothly (see the sketch below).
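
A sketch of that plain COPY, assuming the same stage and an illustrative userdata/ path:

COPY INTO user_data
FROM @dev_db.phdata.s3_dev_stg/userdata/
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;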

But if we use the COPY command with a transformation, it will throw an error (see the sketch below).
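
The sketch below illustrates the combination described next: the table gets an extra filename column, and the COPY uses a SELECT to cast each field and capture METADATA$FILENAME. The column list and stage path are illustrative.

-- Hypothetical extra column for tracking the source file
ALTER TABLE user_data ADD COLUMN filename STRING;

-- With MATCH_BY_COLUMN_NAME set, Snowflake rejects this COPY because the SELECT counts as a transformation
COPY INTO user_data FROM (
    SELECT
        $1:birthdate::VARCHAR,
        $1:cc::BIGINT,
        $1:comments::VARCHAR,
        $1:country::VARCHAR,
        $1:email::VARCHAR,
        $1:first_name::VARCHAR,
        $1:gender::VARCHAR,
        $1:id::INT,
        $1:ip_address::VARCHAR,
        $1:last_name::VARCHAR,
        $1:registration_dttm::VARCHAR,
        $1:salary::DOUBLE,
        $1:title::VARCHAR,
        METADATA$FILENAME
    FROM @dev_db.phdata.s3_dev_stg/userdata/
)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- removing this option lets the COPY succeed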

Here, even $1:birthdate::VARCHAR counts as a transformation for Snowflake. We have also added an extra filename column to the table and populated it using the METADATA$FILENAME feature available in Snowflake. The command would work smoothly if we omitted MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE.

  3. The file format can be declared at stage creation, in a separate named file format, or even on the COPY INTO command itself (see the sketch below). It’s totally up to you.
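
For instance, if the stage had been created without a file format (or to override the stage-level setting), the format could be supplied directly on the COPY command, as in this sketch reusing the earlier customer example.

COPY INTO dev_db.phdata_blog.customer FROM (
    SELECT
        $1:id::NUMBER,
        $1:name::VARCHAR,
        current_date AS load_date
    FROM @dev_db.phdata.s3_dev_stg/customer/
)
FILE_FORMAT = (TYPE = AVRO);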

Conclusion

Snowflake’s support for Avro is powerful, allowing for effortless data loading, querying, and analysis. The native compatibility and optimized performance with Avro files empower organizations to manage complex datasets efficiently within Snowflake’s cloud data platform. Snowflake can handle different Avro file processing scenarios with ease.

Need support or additional guidance with Snowflake or Avro files?

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit