October 20, 2023

How to Load and Analyze Semi-Structured Data in Snowflake

By Deepa Ganiger

Snowflake Data Cloud has a number of powerful features that empower organizations to make more data-driven decisions for both structured and semi-structured data. 

In this blog, we will explore how to load, unload, and analyze semi-structured data in detail. We will look at semi-structured data formats and how to parse XML & JSON using different methods in Snowflake. Lastly, we’ll dabble in leveraging third-party platforms like Alteryx to parse and load semi-structured data.

What is Semi-Structured Data?

Semi-structured data, also called partially structured data, is a form of data that does not adhere to the conventional tabular structure found in relational databases or other data tables. However, it does include metadata and tags that help to distinguish semantic elements and create hierarchies of fields and records.

Semi-structured data can come from many sources, including applications, sensors, and mobile devices. To support these diverse data sources, semi-structured data formats have become popular standards for transporting and storing data.

What are the Supported File Formats for Semi-Structured Data?

Various semi-structured data formats, including JSON, Avro, Parquet, ORC, and XML, have emerged with the rise of big data and IoT. These data formats are extensively used in various industries to share data. Snowflake has developed a user-friendly and straightforward method to manage these data formats.

The following are the semi-structured data formats supported by Snowflake.

JSON

JavaScript Object Notation, commonly known as JSON, is an open standard file format and data interchange format that employs human-readable text to store and transfer data objects, which are made up of attribute-value pairs and arrays or other serializable values. 

JSON is a lightweight data format that has gained popularity among developers due to its human-readable text, which requires minimal coding, is easy to process, and has faster processing times.

In JSON, data is represented using two basic structures: objects and arrays. Objects are enclosed in curly braces {} and consist of key-value pairs separated by commas, where the key is a string, and the value can be a string, number, boolean, null, an array, or another object. 

Arrays are enclosed in square brackets [] and consist of a list of values separated by commas, where each value can be a string, number, boolean, null, array, or object.

Here is an example of a simple JSON object:

{
   "name": "John",
   "age": 30,
   "city": "New York"
}

In this example, "name", "age", and "city" are keys, and "John," 30, and "New York" are their corresponding values.

XML

XML, which stands for Extensible Markup Language, is a markup language that establishes regulations for defining any type of data. To define data in XML, you utilize markup symbols, called tags, which enable complex data encoding and facilitate the integration of information flows across various systems. 

XML elements can include text, attributes, and other nested elements. Although Snowflake does not offer the same native dot-notation querying for XML as it does for JSON, you can use the XMLGET and GET functions to flatten the data.

In XML, data is represented using tags, which are enclosed in angle brackets < >. Tags can be nested within each other to represent complex data structures. XML documents consist of a hierarchy of tags with a single root element at the top.

Here is an example of a simple XML document:

<dept>
  <dept_id>1</dept_id>
  <dept_name>Scientists</dept_name>
  <employee>
    <emp_id>1</emp_id>
    <emp_firstname>Mike</emp_firstname>
    <emp_lastname>Bills</emp_lastname>
    <emp_designation>Jr Scientist</emp_designation>
    <address>
      <street_address1>234 Octopus Avenue</street_address1>
      <street_address2></street_address2>
      <city>Stamford</city>
      <state>CT</state>
      <zipcode>60429</zipcode>
      <start_date>2000-05-01</start_date>
      <end_date>2000-12-01</end_date>
    </address>
  </employee>
</dept>

Parquet

Parquet is a file format for storing big data in a columnar storage format. It is specifically designed to work seamlessly with Hadoop and other big data processing frameworks. 

Parquet utilizes a highly compressed and efficient columnar format, unlike traditional databases that use a row-based storage format. This format is optimized for both large-scale batch processing and interactive querying, making it highly efficient for data processing. 

Parquet supports advanced data types, including complex nested and custom data structures, making it highly flexible. Additionally, Parquet offers features such as predicate pushdown and column pruning, which improve query performance. Due to its efficient storage and querying of complex data types, high compression, and optimization for big data processing frameworks, Parquet has become a popular and widely used file format for big data processing.

When Parquet data is stored, it is divided into row groups, each corresponding to a contiguous set of rows in the data. Data is further divided into column chunks within each row group, where each column chunk corresponds to a column in the data. Column chunks are compressed and encoded to minimize storage space and improve data access performance.

ORC

The ORC (Optimized Row Columnar) file format is designed for processing large amounts of data and storing it in a columnar format rather than the traditional row-based approach used by conventional databases. This file format is optimized for use with Hadoop and other big data processing frameworks and is highly compressed, offering excellent performance for batch processing and interactive querying. 

ORC’s advanced capabilities, such as column pruning and predicate pushdown, enhance query performance. Due to its impressive performance and numerous features, ORC is a favored option for big data storage and processing, especially in scenarios that demand high-performance capabilities.

AVRO

Avro is an open-source data serialization system that facilitates data exchange between systems, programming languages, and processing frameworks. It is a storage format based on rows, enabling efficient and effective data storage and retrieval. 

With Avro, users can seamlessly transfer data between different platforms, applications, and programming languages, ensuring that the data remains consistent throughout. As an open-source system, Avro provides a flexible and customizable solution for data serialization, enabling users to optimize their data processing capabilities.

In Avro files, data is serialized in a binary format with the schema built in. The Avro schema is defined in JSON, which applications use to parse the data. Here’s an example of an Avro schema that describes a simple record:

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" }
  ]
}

In this example, the Avro schema defines a record called "User" that contains three fields: "id", "name", and "email". The "type" field specifies the data type of each field, which can be a primitive type like "int" or "string" or a complex type like "record" or "array". The "name" field specifies the name of each field, and the "namespace" field specifies the namespace for the schema.

Overview of the Data Lifecycle of Semi-Structured Data within Snowflake

What are the Semi-Structured Data Types?

Snowflake supports three different data types to store and process semi-structured data.

Variant

The VARIANT data type is used for ingesting raw data into Snowflake. A VARIANT value can hold any other data type, such as String, Integer, or Date, including OBJECT and ARRAY.

Typically, a VARIANT is used when:

  • You want to create hierarchical data by explicitly defining a hierarchy with two or more ARRAYs or OBJECTs.

  • You want to load JSON, Avro, ORC, or Parquet data without having to specify the data’s hierarchical structure.

A VARIANT can be up to 16 MB in length. You can use the CAST function, the TO_VARIANT function, or the :: operator to explicitly cast a value to or from the VARIANT data type (example: expression::variant).
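As a quick sketch, here are a few ways to produce VARIANT values (the literal values are arbitrary):

-- Cast values of different types to VARIANT
SELECT TO_VARIANT(42)                AS v_from_int,
       'hello'::VARIANT              AS v_from_string,
       CAST(CURRENT_DATE AS VARIANT) AS v_from_date,
       PARSE_JSON('{"a": 1}')        AS v_from_json;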

ARRAY

In Snowflake, the ARRAY data type is used to store ordered lists of values. Each element is stored as a VARIANT, so the elements of a single array can hold different types of values, including strings, numbers, and booleans; an ARRAY value, like any VARIANT, is limited to 16 MB.

Example of array data: ['apple', 'banana', 'cherry']

This can be inserted into a Snowflake table with a column defined as the ARRAY data type. Note that JSON itself can contain arrays of elements, so such data can also be loaded into VARIANT columns.
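A minimal sketch, with made-up table and column names:

-- Create a table with an ARRAY column and load the example list
CREATE OR REPLACE TABLE fruit_demo (fruits ARRAY);
INSERT INTO fruit_demo
  SELECT ARRAY_CONSTRUCT('apple', 'banana', 'cherry');

-- Access individual elements by zero-based index
SELECT fruits[0]::STRING AS first_fruit FROM fruit_demo;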

OBJECT

In Snowflake, the OBJECT data type is used to store structured data in key-value pairs. An OBJECT column can contain multiple keys, each with a different data type. This data type is often used for storing semi-structured data, such as JSON documents.

An OBJECT looks exactly like a JSON object: a set of key-value pairs, as in the example below.

{
   "name": "John",
   "age": 30,
   "city": "New York"
}

In Snowflake, an OBJECT is comparable to a JSON “object,” consisting of a set of key-value pairs. The keys are strings, and the values can be any JSON data type or value (such as Number, String, Boolean, Array, Object, or null), separated by a colon, with each entry separated by commas. Because VARIANT can store any data type, values in different key-value pairs may have different underlying data types. 

For instance, an OBJECT could contain a VARCHAR and an INTEGER, representing a person’s name and age, respectively. Snowflake does not currently support explicitly typed objects. The key in a key-value pair must not be an empty string; neither the key nor the value can be NULL. OBJECTs can be used to create hierarchical data structures. 
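As a short sketch, an OBJECT can be built and queried with OBJECT_CONSTRUCT (the values are arbitrary):

-- Build an OBJECT whose values have different underlying types,
-- then extract and cast individual values
SELECT person:name::STRING AS name,
       person:age::INT     AS age
FROM (SELECT OBJECT_CONSTRUCT('name', 'John', 'age', 30, 'city', 'New York') AS person);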

Loading Semi-Structured Data

Now that we have seen the semi-structured data formats and data types supported by Snowflake, let’s see how we can load data into Snowflake.

Prerequisites

  • Snowflake account, warehouse, database, and schema available to load data.

  • Snowflake CLI – SnowSQL installed and configured on your local machine for loading sample files into the stage.

  • In this example, we will use an internal stage. If you prefer to use an external stage, ensure that it is configured for access and that a stage is created pointing to the S3 location. If you need to learn how to configure the S3 bucket, refer to Snowflake’s documentation on creating external stages.

Setup

This tutorial will load and unload data using the Internal Stage.

  1. Set up the Snowflake context to the correct database and schema to load data.

USE SCHEMA TRAINING.DEMO;
  2. Create the internal stages if they don’t exist.

CREATE STAGE JSON_STAGE;
CREATE STAGE XML_STAGE;
  3. Create a file format by choosing the correct format for the use case (JSON, XML, Avro, Parquet, ORC).

Snowflake supports creating named file formats, which are database objects that encapsulate all of the required format information. Named file formats can then be used as input in all the same places where you can specify individual file format options, thereby helping to streamline the data loading process for similarly formatted data.

Named file formats are optional but recommended when you regularly load similarly-formatted data. You can create a file format using either the web interface or SQL.

SQL Method

Create a simple file format for both JSON and XML.

CREATE OR REPLACE FILE FORMAT JSON_FORMAT TYPE = 'JSON';
CREATE OR REPLACE FILE FORMAT XML_FORMAT TYPE = 'XML';

Web Interface Method

  1. Click on the Databases icon at the top and select the database where you want the file format to be created. In this example, select the TRAINING database.
  2. Go to the File Formats tab and click on the Create button. Fill in the required information like Format Name (JSON_FORMAT) and Format Type (JSON/XML). Leave the rest of the options as they are for now.

  4. Create a sample file with the contents shown below and save it locally as nested_json_file.json. Also save the XML sample shown under “Steps for Loading XML Data” as catalog.xml.

[
  {
    "MainId":1111,
    "firstName":"Sherlock",
    "lastName":"Homes",
    "categories":[
      {
        "CategoryID":1,
        "CategoryName":"Example"
      }
    ]
  },
  {
    "MainId":122,
    "firstName":"James",
    "lastName":"Watson",
    "categories":[
      {
        "CategoryID":2,
        "CategoryName":"Example2"
      }
    ]
  }
]
  5. Load the data into the internal stage using SnowSQL. Log in to the SnowSQL client on your laptop and navigate to the TRAINING.DEMO schema.

  6. Use the PUT command to load the files into the Snowflake stage through the SnowSQL utility. Assuming the files are in the data folder in the home directory, run the following commands in SnowSQL.

PUT file://~/data/nested_json_file.json @json_stage/data/;
PUT file://~/data/catalog.xml @xml_stage/data/;

Note: For Windows laptops, adjust the path accordingly, as shown below.

PUT file://c:/data/nested_json_file.json @json_stage/data/;
PUT file://c:/data/catalog.xml @xml_stage/data/;
  7. Verify that the data is loaded into the stage by running the ls command as shown below. This will list the files in the stage location.

ls @json_stage;

Steps for Loading JSON Data

Step 1: Examine the Raw JSON data to make sure the JSON is not malformed.

The example data (nested_json_file.json) contains a two-level nested structure.

  • The outermost layer is a JSON array (represented by square brackets).

  • The JSON array has 2 elements with 2 different main IDs (1111 and 122).

  • "categories" is the inner array within each outer array element.

  • The categories array has inner elements like CategoryID and CategoryName.


First, let’s load the file as is into a VARIANT column.

Step 2: Create a table using the Variant column to ingest the JSON data in its raw format.

CREATE TABLE nested_json_table (data VARIANT);

Step 3: A quick validation can be done while copying the data by using the VALIDATION_MODE option. This verifies that the data being loaded is valid and not malformed, without actually loading it.

copy into nested_json_table
    from '@json_stage/data/'
        file_format = (format_name = json_format)
        validation_mode = 'RETURN_ERRORS';

Step 4: Load the data into the table (nested_json_table) using the Snowflake COPY command, providing the format_name that was created.

copy into nested_json_table
     from '@json_stage/data/'
         file_format = (format_name = json_format);

You can specify one or more copy options (separated by blank spaces, commas, or newlines). For example, ON_ERROR = CONTINUE | SKIP_FILE | SKIP_FILE_<num> | 'SKIP_FILE_<num>%' | ABORT_STATEMENT is a string constant that specifies the action to perform when an error is encountered while loading data from a file.
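For instance, a minimal sketch of a COPY that skips any file containing bad records rather than aborting the load:

copy into nested_json_table
    from '@json_stage/data/'
    file_format = (format_name = json_format)
    on_error = 'SKIP_FILE';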

Note: Variant supports data up to 16 MB uncompressed. If a file is larger than 16 MB, we should look for options to split it.

Step 5: There is a second option for loading the data. Since this is a small file, you can use the Load Table dialog in the WebUI (from the Databases tab, select the Tables tab, click on the created table, and click on the Load Data button; follow the prompts to upload a file and select the Warehouse, File Format, etc., to load the data).

Step 6: Validate the raw JSON data in nested_json_table. The table will have one variant column and one data row.

select * from nested_json_table;

As you can see, the entire file is loaded into a single row in the variant column. Since the file has 2 array elements, it can be split into multiple rows while loading. To do this, recreate the file format with the strip_outer_array option, or specify the option directly in the COPY command.

CREATE OR REPLACE FILE FORMAT JSON_FORMAT TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE;

Rerun the copy command after truncating the table.
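Concretely, the rerun looks like this:

truncate table nested_json_table;
copy into nested_json_table
    from '@json_stage/data/'
    file_format = (format_name = json_format);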

Note: This option works only if the JSON file has an outermost array, not a root object.

Steps for Loading XML Data

Following are the steps for loading XML data. Ensure you have completed the steps listed under Setup.

Step 1: Examine the Raw XML data (below is the sample XML data used for loading).

The data contains a three-level nested structure: catalog → book → notes. The catalog has multiple books; each book contains notes from the publisher and the author.

<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <notes>
      <author_note>Extensive guide on XML</author_note>
      <publisher_note>Published hardcover and online</publisher_note>
    </notes>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <notes>
      <author_note>Fiction</author_note>
      <publisher_note>To be published worldwide</publisher_note>
    </notes>
  </book>
</catalog>

Step 2: Create a table with a variant data type.

create or replace table catalog_xml
( xmldata variant );

Step 3: Load the XML data.

copy into catalog_xml
     from '@xml_stage/data/'
         file_format = (format_name = xml_format);

Step 4: Validate the raw XML data.

The sample file has only one root element, <catalog>, so it is loaded into a single row.

select * from catalog_xml;

But as you can see, even though there are 2 books under <catalog>, they are loaded into a single row.

If we want to load the data at the book level, we can use the strip_outer_element option in the file format. Let’s recreate the file format and load the data again, as shown below.

create or replace file format xml_format type = xml strip_outer_element = true;
truncate table catalog_xml;
copy into catalog_xml from '@xml_stage/data/' file_format = (format_name = xml_format);

Now, let’s query the data again.

select * from catalog_xml;

We can see that the data is loaded at the book level instead of the catalog level.

Steps for Loading ORC, Avro, and Parquet Semi-Structured Data

For ORC, Avro, and Parquet semi-structured data, we need to create the corresponding file formats and use them in the COPY command. The data will still look like JSON when queried from the corresponding tables, and any functions used for querying JSON (e.g., lateral flatten) can be used for querying Avro, ORC, and Parquet data.
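As a sketch of that pattern for Parquet (the stage, table, and format names here are illustrative; Avro and ORC follow the same pattern with TYPE = 'AVRO' or TYPE = 'ORC'):

CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT TYPE = 'PARQUET';
CREATE OR REPLACE TABLE parquet_raw (data VARIANT);

-- Each Parquet record lands as one object in the VARIANT column
COPY INTO parquet_raw
    FROM '@parquet_stage/data/'
    FILE_FORMAT = (FORMAT_NAME = PARQUET_FORMAT);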

Additional information:

What happens if the schema changes in raw semi-structured data?

Because the data gets loaded into a single variant column, changes to the schema do not affect the variant data load. However, when the data is queried or flattened from the variant column, the additional fields need to be handled accordingly.
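For example, a field that exists only in newer records simply returns NULL for older rows when queried (newField below is a hypothetical late-arriving field):

-- Rows loaded before the schema change return NULL for the new field
SELECT data:MainId::STRING   AS mainid,
       data:newField::STRING AS new_field  -- hypothetical field
FROM nested_json_table;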

Unloading Semi-Structured Data

What is Unloading of Data, and When is it Needed?

Unloading is the process of converting the flattened data into its original or required file format (e.g., converting flattened data into JSON or Parquet formatted files). Snowflake currently supports the following file formats for loading and unloading data. We can unload the data into an internal stage or an external stage.

Data Format | Loading Into Table | Unloading to Stage
------------|--------------------|-------------------
JSON        | Yes                | Yes
XML         | Yes                | Not supported
Parquet     | Yes                | Yes
Avro        | Yes                | Not supported
ORC         | Yes                | Not supported

The example below shows unloading JSON data into the internal stage. The object_construct function can be used to export the data in JSON format; it constructs a JSON element using column names as keys. We can also unload using a programming language if needed, via Snowpark or cloud-native applications like Glue, Spark, etc.

COPY INTO '@json_stage/unload/'
   from (with data as (select mainid,
                              object_construct('CategoryID', categoryid, 'CategoryName', categoryname) as categories,
                              firstname,
                              lastname
                         from flattened_json_table)
        select array_agg(object_construct(*)) from data)
   file_format = (format_name = json_format);

> ls @json_stage/unload/;
json_stage/unload/data_0_0_0.json.gz

Unloading the table data into an internal stage in Parquet format is similar to JSON, but we do not need the object_construct function. Snowflake automatically exports the data in Parquet format using the same fields as in the table.

COPY INTO '@parquet_stage/unload/'
from flattened_json_table file_format = (type = parquet);

Once the file is in the stage, it can be downloaded onto a local laptop or any machine using the GET command, as shown below.

GET @parquet_stage/unload file://c:/tmp/;

+----------------------------+------+------------+---------+
| file                       | size | status     | message |
|----------------------------+------+------------+---------|
| data_0_0_0.snappy.parquet  | 1234 | DOWNLOADED |         |
+----------------------------+------+------------+---------+
1 Row(s) produced. Time Elapsed: 2.01s

Unload should be done per table, but multiple tables can be joined, and the result can be unloaded. Files can be unloaded into both internal and external stages. Snowflake will automatically split the files into smaller-sized files while unloading. If a single file is desired during unloading, the option single=true can be applied to the COPY command.

COPY INTO '@parquet_stage/unload/'
from parquet_table_silver file_format = (type = parquet) single = true;

This command will generate one single file by merging all small files together. This may run longer than the regular COPY command as it needs to merge the files. 

Querying and Flattening

Why Query Semi-structured Data

Semi-structured data has become far more common in recent years as businesses and organizations embrace applications and systems that output data in semi-structured file formats like JSON, XML, Avro, ORC, and Parquet. The problem is that few storage platforms make it easy and fast to store and analyze semi-structured data in the same place. Snowflake lets you load the data in its raw format and, using built-in functions, flatten and ingest it into structured tables.

The following sections show how to query semi-structured data and what the flattened data looks like.

Querying and Flattening of JSON-Formatted Semi-Structured Data

The elements inside JSON data loaded in variant columns can be queried using two different methods.

Dot Notation

  • Data can be queried using the notation:
    <column>:<level 1 element>.<level 2 element>

  • Note the usage of : in the first level and . for accessing the next level.

Now, let’s query the loaded JSON data using dot notation:

SELECT data:MainId::string as mainid,
       data:firstName::string as firstname,
       data:lastName::string as lastname,
       cat.value:CategoryID::string as categoryid,
       cat.value:CategoryName::string as categoryname
  FROM nested_json_table tbl,
       lateral flatten( input => data:categories) cat;

Square Bracket Notation

  • Data can be accessed using the notation:
    <column>['level 1 element']['level 2 element']

  • Note that the elements inside the square bracket are case-sensitive.

The following example shows querying data using bracket notation.

SELECT data['MainId']::string as mainid,
       data['firstName']::string as firstname,
       data['lastName']::string as lastname,
       cat.value['CategoryID']::string as categoryid,
       cat.value['CategoryName']::string as categoryname
FROM nested_json_table tbl,
    lateral flatten( input => data:categories) cat;

The JSON file is loaded as is into the VARIANT field. While we can query the VARIANT field using dot or bracket notation, in most cases, the JSON elements need to be split up (flattened) for consumption. There are different ways to consume the flattened data.

We can create a view or materialized view to consume the flattened data, or the data can be split and loaded into a separate table for consumption. If a table is created for consuming flattened data, a pipeline will need to be built to load new data as it arrives. Let’s look at a simple use case of flattening the JSON data and loading it into a table.
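As a sketch of the view option (it reuses the same flattening query shown in Step 2 below; the view name is illustrative):

CREATE OR REPLACE VIEW flattened_json_view AS
SELECT data:MainId::string as mainid,
       data:firstName::string as firstname,
       data:lastName::string as lastname,
       cat.value:CategoryID::string as categoryid,
       cat.value:CategoryName::string as categoryname
  FROM nested_json_table,
       lateral flatten( input => data:categories) cat;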

Step 1: Create the transformed (flattened) data table with its data types.

CREATE TABLE flattened_json_table
    (mainid        string,
     firstname     string,
     lastname      string,
     categoryid    string,
     categoryname  string );

Step 2: Insert the data into the flattened/transformed table.

Because the file was loaded with STRIP_OUTER_ARRAY = TRUE, each row of nested_json_table already holds one element of the outer array (one person object). We therefore only need to apply lateral flatten once, on the inner categories array.

  • We flatten data:categories and assign it the alias cat. The elements inside categories can then be accessed as cat.value.

The resulting query looks like the one below. We extract all the elements and load the table flattened_json_table.

INSERT INTO flattened_json_table
 (
 SELECT data:MainId::string as mainid,
        data:firstName::string as firstname,
        data:lastName::string as lastname,
        cat.value:CategoryID::string as categoryid,
        cat.value:CategoryName::string as categoryname
   FROM nested_json_table tbl,
        lateral flatten( input => data:categories) cat);

Step 3: Validate the flattened data.

Verify the data that is loaded into the flattened table.

select * from flattened_json_table;

You can also query JSON data using the GET_PATH function in two ways: from the table or directly from the staged file.

-- data extraction using the table
SELECT GET_PATH(data, 'MainId')::string mainid
from nested_json_table;

-- data extraction using the stage file
SELECT GET_PATH(sf.$1, 'MainId')::string mainid
from @json_stage/test/complex_json1_0_0_0.json.gz (file_format => json_format) as sf;

Querying and Flattening of XML-Formatted Semi-Structured Data

Unlike JSON, querying XML data in a VARIANT requires the XMLGET function. This function takes three arguments:

  • Name of a variant (either the variant column itself or something that evaluates to a variant);

  • Name of the tag (the node or attribute name);

  • An optional instance number, used as an “index” into the XML structure.

Once XMLGET has extracted an element, a second accessor retrieves its contents:

  • The argument '$' returns the element’s value.

  • The argument '@' returns the element’s name.

  • Passing '@attrName' returns the value of an attribute of the element. 

XMLGET offers a simplified notation to extract data from XML. A second way is to combine GET with XMLGET; there is a slight difference in the way the data is extracted.

Consider the following XML sample.

<catalog>
  <book id="bk101">
     <author>Gambardella, Matthew</author>
     <title>XML Developer's Guide</title>
     <genre>Computer</genre>
  </book>
  <book id="bk102">
     <author>Ralls, Kim</author>
     <title>Midnight Rain</title>
     <genre>Fantasy</genre>
  </book>
</catalog>

The elements in the XML can be extracted in one of two ways, as shown below.

XMLGET( xmldata, 'author' ):"$"::STRING AS AUTHOR

GET(XMLGET( xmldata, 'title' ), '$')::STRING as TITLE

GET with XMLGET requires $ to be quoted in single quotes, whereas the XMLGET-only notation requires $ to be quoted in double quotes. For simplicity, we will use XMLGET in the queries going forward. Once the data is extracted, it can be explicitly cast to any data type you need with the double-colon operator. In the example above, both AUTHOR and TITLE are cast as STRING.

Understanding Element Values vs. Attributes

XML documents can have both elements and attributes. Elements are most frequently used in XML documents, but attributes are also used. In the example below, id is the attribute, and author, title, etc. are all elements. 

Snowflake has different methods for working with element values and attribute values. We used "$" with GET and XMLGET to retrieve element values. For attributes, we use '@attribute_name'. In the example above, each <book> tag has the attribute id, which can be extracted as shown below.

GET(xmldata, '@id')::STRING as ID

Now, putting all these together, we can extract all the elements and attributes from the XML data, as shown below. Similar to the JSON example, we apply the lateral flatten function to extract the inner elements. Here, we apply it at the <notes> level; since the data is already at the <book> level, we do not need to flatten on <book>.

Note that to extract the author, title, etc., we directly use the column xmldata from the table. In contrast, for author notes and publisher notes, we refer to notes.value, the alias produced by flattening the XML data and filtering at the <notes> level.

SELECT
   GET(xmldata, '@id')::STRING as ID,
   XMLGET( xmldata, 'author' ):"$"::STRING AS AUTHOR,
   GET(XMLGET( xmldata, 'title'), '$')::STRING as TITLE,
   XMLGET( xmldata, 'genre' ):"$"::STRING AS GENRE,
   XMLGET( notes.value, 'author_note' ):"$"::STRING AS AUTHOR_NOTES,
   XMLGET( notes.value, 'publisher_note' ):"$"::STRING AS PUBLISHER_NOTES
FROM
   catalog_xml cat,
   lateral FLATTEN( xmldata:"$" ) notes
WHERE
   notes.value like '<notes>%';

Note: 

  • The FLATTEN() function creates an internal array structure that contains 6 fields: SEQ, KEY, PATH, INDEX, THIS, and VALUE. You can see this structure by adding the expression notes.* to the SELECT list above, as shown in the sketch after this list. While all 6 fields are useful, we’re only interested in the VALUE field.

  • The array structure created by the FLATTEN() function is aliased with the name notes, so we can refer to it in the SELECT list expressions.

  • The flattened <book> structure contains a row for every subnode of <book>: not just the <notes> nodes but also <author>, <title>, and <genre>. The WHERE clause keeps only the <notes> rows; without it, the result set would have too many rows. To see this, comment out the WHERE clause and re-execute the query.

  • We use the shorthand :"$" syntax to access the element’s value and GET to access attributes.
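A quick way to inspect that structure:

-- Shows the SEQ, KEY, PATH, INDEX, THIS, and VALUE fields produced by FLATTEN()
SELECT notes.*
FROM catalog_xml,
     lateral FLATTEN( xmldata:"$" ) notes;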

To load this flattened data into a table, we can create a table with all required fields and load them using the INSERT command.

CREATE OR REPLACE TABLE CATALOG_XML_FLATTENED
   (
       ID STRING,
       AUTHOR STRING,
       TITLE STRING,
       GENRE STRING,
       AUTHOR_NOTES STRING,
       PUBLISHER_NOTES STRING
   );

INSERT INTO CATALOG_XML_FLATTENED SELECT
   GET(xmldata, '@id')::STRING as ID,
   XMLGET( xmldata, 'author' ):"$"::STRING AS AUTHOR,
   GET(XMLGET( xmldata, 'title'), '$')::STRING as TITLE,
   XMLGET( xmldata, 'genre' ):"$"::STRING AS GENRE,
   XMLGET( notes.value, 'author_note' ):"$"::STRING AS AUTHOR_NOTES,
   XMLGET( notes.value, 'publisher_note' ):"$"::STRING AS PUBLISHER_NOTES
FROM
   catalog_xml cat,
   lateral FLATTEN( xmldata:"$" ) notes
WHERE
   notes.value like '<notes>%';

Now validate the flattened data.

Querying AVRO, ORC, and Parquet data

As noted in the loading section, ORC, Avro, and Parquet data loaded into VARIANT columns still looks like JSON when queried, so the same functions used for querying JSON (e.g., lateral flatten, dot and bracket notation) work for Avro, ORC, and Parquet data as well.
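For instance, a hedged sketch of querying a staged Parquet file directly, without loading it first ($1 refers to the staged file’s record; the stage path, format name, and field names are assumptions):

SELECT $1:id::INT      AS id,
       $1:name::STRING AS name
FROM @parquet_stage/unload/ (file_format => 'PARQUET_FORMAT');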

Summary of Snowflake Semi-structured Functions

Below are some frequently used functions for querying and flattening semi-structured data split by category.

Parsing           | Extraction      | Casting     | Type Predicates
------------------|-----------------|-------------|----------------
CHECK_JSON        | XMLGET          | TO_JSON     | IS_OBJECT
PARSE_JSON        | GET / GET_PATH  | TO_VARIANT  | IS_DECIMAL
CHECK_XML         | :               | TO_XML      | IS_INTEGER
PARSE_XML         | OBJECT_KEYS     | TO_OBJECT   | IS_NULL_VALUE
STRIP_NULL_VALUE  | [ ]             | TO_ARRAY    | IS_CHAR
lateral flatten   | etc.            | ::          | etc.
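As a brief illustration of a few of these functions together (the literals are arbitrary):

SELECT CHECK_JSON('{"a": 1}')            AS json_error,  -- NULL means the JSON is valid
       PARSE_JSON('{"a": 1}'):a::INT     AS a_value,
       IS_OBJECT(PARSE_JSON('{"a": 1}')) AS is_obj,
       TO_JSON(PARSE_JSON('{"a": 1}'))   AS round_trip;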

Closing

As demonstrated in this blog, Snowflake supports loading and parsing semi-structured data like JSON and XML; organizations generally do not require any external tools to process it. With Snowpark now generally available (GA), parsing JSON/XML has become even more versatile. When the JSON/XML is not very deeply nested, simple flattening queries can be written to explode nested structures into a relational format.

At phData, we have expertise in building complex data pipelines to process structured, semi-structured, and unstructured data to build insights.
If you have any additional questions on loading and analyzing semi-structured data in Snowflake, contact our team of experts.
