March 16, 2023

Protecting Your Pipeline From Malformed JSON/XML With Snowflake

By Dakota Kelley

So, your transformation pipeline is up, running, and automated. Then one night, a piece of malformed or invalid JSON/XML flows into your pipeline and breaks it. You wake up to a call from the operations team, or first thing in the morning your end users report missing data and other problems while you scramble to investigate what happened and what went wrong.

As you review, you realize that a single file stopped your entire pipeline and caused an unnecessary outage.

How do you address this? It’s a good thing you found this article, as we’ll tackle how to build a simple validation layer that only processes valid XML/JSON, along with monitoring and reporting measures that help you troubleshoot problematic files.

Why Should You Monitor for Malformed XML/JSON?

Malformed XML/JSON that makes it into your transformation pipeline has the potential to inject partial data and make clean-up difficult, especially in a pipeline that lacks idempotency or the ability to backfill and recreate your data.

It is often better to prevent the bad XML/JSON from processing through your pipeline at all, allowing you to keep your data models clean and up to date while you work on any fixes needed for the malformed files.

Preventing the XML/JSON from breaking your entire pipeline also has a secondary effect: it helps you build trust in your data within your organization by being proactive about data quality instead of reactive.

No longer will malformed XML/JSON break your entire pipeline and force you to spend nights and weekends on call fixing it.

Validating XML or JSON with Snowflake

Thankfully, the Snowflake Data Cloud provides two simple functions for validating your XML or JSON: CHECK_XML and CHECK_JSON. These functions attempt to parse your XML/JSON, and if there are any parsing errors, they return the parser’s error message.

Valid XML/JSON, on the other hand, results in a NULL value from CHECK_XML/CHECK_JSON.
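For example, a quick query against a couple of literal values shows the behavior. The exact wording of the error message varies, so treat the comments here as illustrative rather than exact output:

```sql
-- CHECK_JSON/CHECK_XML return NULL for parseable input
-- and a parse-error message otherwise.
SELECT CHECK_JSON('{"id": 1, "name": "test"}') AS valid_json,   -- NULL
       CHECK_JSON('{"id": 1, "name": ')        AS invalid_json, -- error message
       CHECK_XML('<root><id>1</id></root>')    AS valid_xml;    -- NULL
```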

Putting it Together

So what does this look like, then? As an example, let’s say we are landing XML or JSON responses, along with some file metadata, in a table that looks like this:

CREATE TABLE IF NOT EXISTS xml_data (
    file_name STRING,
    file_load_date TIMESTAMP,
    file_data STRING
);

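As one hypothetical way to populate such a table, a COPY INTO from an external stage can land each staged file as a single row, with the whole file contents in one column. The stage name @xml_stage and the file format settings here are assumptions for illustration:

```sql
-- Hypothetical COPY INTO: lands each staged file as one row.
-- @xml_stage is an assumed stage name; RECORD_DELIMITER = NONE
-- treats the entire file as a single record.
COPY INTO xml_data (file_name, file_load_date, file_data)
FROM (
    SELECT METADATA$FILENAME,   -- name of the staged file
           CURRENT_TIMESTAMP(), -- load timestamp
           $1                   -- entire file contents
      FROM @xml_stage
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = NONE RECORD_DELIMITER = NONE);
```

Wrapping the same COPY statement in a CREATE PIPE … AUTO_INGEST = TRUE statement would turn this into a Snowpipe that loads files automatically as they arrive.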
Maybe you have Snowpipe loading XML/JSON data, or a Java application populating the table with XML/JSON data from a web service. Next, we can use these functions to validate our file data and generate a boolean flag indicating which data is valid. This allows us to load only valid data and store the error message for any invalid XML/JSON.

That looks something like this:

WITH validate_xml AS (
    SELECT file_name,
           file_load_date,
           file_data,
           CHECK_XML(file_data) AS file_check_msg
      FROM xml_data
)

SELECT file_name,
       file_load_date,
       file_data,
       CASE
           WHEN file_check_msg IS NULL THEN TRUE
           ELSE FALSE
       END AS is_valid,
       file_check_msg
  FROM validate_xml

Now, with the above query, we can store a validation boolean so that we only transform valid XML/JSON (just use CHECK_JSON for JSON instead). With something like dbt, we can easily turn this into an incremental model so we are only validating new XML/JSON before transforming, which would look something like this:

{{
    config(
        materialized = 'incremental',
        unique_key = ['file_name', 'file_load_date']
    )
}}

WITH validate_xml AS (
    SELECT file_name,
           file_load_date,
           file_data,
           CHECK_XML(file_data) AS file_check_msg
      FROM {{ source('src_data', 'xml_data') }}
)

SELECT file_name,
       file_load_date,
       file_data,
       CASE
           WHEN file_check_msg IS NULL THEN TRUE
           ELSE FALSE
       END AS is_valid,
       file_check_msg
  FROM validate_xml
{% if is_incremental() %}
 WHERE file_load_date > (SELECT MAX(file_load_date) FROM {{ this }})
{% endif %}

With this done, we now have an efficient base table to work from, and we know that our JSON/XML is valid before we begin transforming it. If we’re using dbt, we can then build a custom test to monitor and alert our team on any new invalid XML, allowing our team to respond before bad data makes it to our end users or a bad piece of XML/JSON breaks our transformation pipeline.
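The custom test mentioned above can be as simple as a dbt singular test: a SQL file in the tests/ directory that returns the rows that should fail. The model name stg_xml_data below is a placeholder for whatever you named the incremental model:

```sql
-- tests/assert_no_invalid_xml.sql
-- dbt singular test: the test fails if this query returns any rows.
-- stg_xml_data is a hypothetical name for the incremental model above.
SELECT file_name,
       file_load_date,
       file_check_msg
  FROM {{ ref('stg_xml_data') }}
 WHERE NOT is_valid
```

Running dbt test would then surface any newly landed invalid files, so your team hears about them from dbt rather than from end users.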

Closing

Put all of this together and you will have a clean validation layer to monitor for malformed XML/JSON, allowing your team to be proactive and focused on value-added activities instead of being reactive.

Need help getting started with Snowflake or dbt? Our team of data experts is happy to assist. Reach out today!
