June 23, 2023

How to Choose a Language for a Snowflake UDF: SQL, Snowpark, and Beyond.

By Ahmad Aburia

The Snowflake Data Cloud was built natively for the cloud. When we think about cloud data transformations, one crucial building block is the User-Defined Function (UDF). When writing UDFs in Snowflake, it can be overwhelming to choose the right language for your business and team needs.

In this blog, we will highlight what each language offers to help you decide which one best fulfills your needs.

What Are Snowflake UDFs?

A UDF, or user-defined function, is a way to establish a reusable block of code to fulfill the requirements of its writer. Within Snowflake, there are two types of UDFs – Scalar and Tabular. A Scalar UDF takes one or more inputs and returns a single value, while a Tabular UDF (also called a UDTF) returns zero or more rows.
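
To make the distinction concrete, here is a minimal sketch of each type (the function names and logic are illustrative for this post, not taken from Snowflake's documentation):

CREATE OR REPLACE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS
$$
    pi() * radius * radius
$$;

CREATE OR REPLACE FUNCTION split_to_rows(input STRING)
RETURNS TABLE (part STRING)
AS
$$
    SELECT value AS part FROM TABLE(SPLIT_TO_TABLE(input, ','))
$$;

SELECT area_of_circle(2);
// OUTPUT: approximately 12.566

SELECT * FROM TABLE(split_to_rows('a,b,c'));
// OUTPUT: three rows (a, b, c)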

Which Languages can be Used to Implement Snowflake User-Defined Functions?

Snowflake offers quite a bit of flexibility in writing UDFs by allowing users to choose one of five possible languages: 

  • Java
  • JavaScript
  • Python
  • Scala
  • SQL

What Each Snowflake UDF Language Offers

Each language has limitations and advantages. In this section, we will explore what those limitations and advantages are. One consideration to keep in mind is your team’s baseline knowledge of each language, since they will need to maintain, support, and develop the UDFs.

Java

Some advantages of using Java include reusing previously written or compiled Java code and easy access to native Java libraries. For a developer or team with strong Java experience, this allows for quicker development with less of a learning curve. Another advantage is the ability to use pre-compiled JARs if you have already developed code you would like to use in Snowflake.
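
As a rough sketch of what that can look like, a UDF can reference a JAR that has been uploaded to a stage (the stage, JAR, class, and method names below are placeholders, not from an actual project):

CREATE OR REPLACE FUNCTION my_jar_udf(input STRING)
RETURNS STRING
LANGUAGE JAVA
IMPORTS = ('@my_stage/my_library.jar')
HANDLER = 'com.example.MyClass.myMethod';

Because the handler class lives in the imported JAR, no inline AS body is needed.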

On the other hand, there are some limitations to Java UDFs. For example, Java UDFs cannot write files, cannot be shared (this also applies to database objects such as functions and views that reference Java UDFs), and cannot be created as SECURE UDFs.

Additionally, once the USAGE privilege on a Java UDF is granted to a role, the contents of any files imported by that UDF are accessible and readable by other Java UDF calls in the same statement.

JavaScript

JavaScript allows developers to intuitively manipulate VARIANT objects within Snowflake UDFs. As one of the most widely used programming languages, it can be easier for developers with JavaScript experience to hit the ground running and develop the necessary code to get the job done. Finally, JavaScript does allow for the creation of shared UDFs.

While it may be easier to get started with JavaScript, there are some limitations identified by Snowflake to be taken into consideration when deciding to use it. These limitations include:

  • Supported Types – Integers are not supported as a parameter or return type for JavaScript UDFs (see the sketch after this list).
  • Code Size – Snowflake does not publish a hard maximum for source code size, but recommends keeping it under 100 KB (compressed).
  • Memory – The exact memory allocation is not statically defined, but consuming too much memory will cause your UDF to fail.
  • Time – If your UDF runs for too long, it will be killed, and an error will be returned to the user.
  • Stack Depth – Recursion that results in excessive stack depth will result in an error.
  • Returned Variant Size and Depth – If a returned object is too large (several MB, but subject to change) or too deep (nesting depth limit of 1000), an error is returned when the UDF is called.
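
As a small, illustrative workaround for the integer limitation, numeric parameters and return values can be declared as DOUBLE (or FLOAT) instead (the function name below is made up for this sketch):

CREATE OR REPLACE FUNCTION js_add_one(v DOUBLE)
RETURNS DOUBLE
LANGUAGE JAVASCRIPT
AS
$$
    // Arguments are referenced in uppercase inside JavaScript UDFs
    return V + 1;
$$;

select js_add_one(41);
// OUTPUT: 42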

Snowpark Python

Enabling a development team to use third-party packages can significantly reduce the need to reinvent the wheel. By using Python for UDFs, you’re doing just that, allowing the team to focus on addressing the business need rather than rewriting something that has already been implemented, proven, and tested.
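
For example, a Python UDF can pull in a third-party package from the Snowflake Anaconda channel through the PACKAGES clause. The function below is only an illustrative sketch using numpy; the name and logic are not from the original examples:

CREATE OR REPLACE FUNCTION text_similarity(a STRING, b STRING)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
PACKAGES = ('numpy')
HANDLER = 'similarity'
AS
$$
import numpy as np

def similarity(a, b):
    # Compare simple character-frequency vectors with cosine similarity
    va, vb = np.zeros(256), np.zeros(256)
    for ch in a:
        va[ord(ch) % 256] += 1
    for ch in b:
        vb[ord(ch) % 256] += 1
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(np.dot(va, vb) / denom) if denom else 0.0
$$;

select text_similarity('snowflake', 'snowpark');
// OUTPUT: a similarity score between 0 and 1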

Python UDFs share the same limitations as Java regarding cloning and sharing.

Snowpark Scala

Scala is the newest addition to the list of supported languages. Those who are familiar with big data may have previous experience with Scala. By combining functional programming principles and the power of the JVM, Scala can be written to be both scalable and performant.

Similar to both Python and Java, Scala has some substantial sharing limitations. Additionally, Scala is the only supported language that does not currently allow writing tabular UDFs.

SQL

For basic implementations and use cases, SQL UDFs are perfect. A SQL UDF body is simply an expression or query, which keeps functions straightforward to write and allows for quick development turnaround. Additionally, SQL functions do not have the same sharing limitations as the more robust languages.

On the other hand, SQL functions tend to be less performant and lack the robustness of more structured programming languages such as Java and Scala.

UDFs in Action

Java

The following Java UDF example demonstrates a simple class definition with public and private methods.

CREATE OR REPLACE FUNCTION getCountOfSubstr(input STRING, target STRING, allCases boolean)
RETURNS NUMBER
LANGUAGE JAVA
HANDLER='MyClass.getCount'
AS
$$
    class MyClass {
        public int getCount(String input, String target, boolean allCases) {
            StringCounter counter = new StringCounter(input, target, allCases);
            return counter.getOccurrences();
        }
        class StringCounter{
            private int countOfOccurrences = 0;
            private String mainString = null;
            private String targetSubstring = null;
            
            public StringCounter(String input, String target, boolean allCases){
                if (input == null || input.isEmpty() || target == null || target.isEmpty()) {
                    throw new IllegalArgumentException("input and target must be non-empty strings");
                }
                if (allCases){
                    mainString = input.toLowerCase();
                    targetSubstring = target.toLowerCase(); 
                } else {
                    mainString = input;
                    targetSubstring = target;
                }
                calculateOccurrences();
            }
            //Private Methods
            private void calculateOccurrences() {
                // Count non-overlapping occurrences with indexOf; this avoids the regex and
                // trailing-match pitfalls of String.split()
                int idx = mainString.indexOf(targetSubstring);
                while (idx != -1) {
                    countOfOccurrences++;
                    idx = mainString.indexOf(targetSubstring, idx + targetSubstring.length());
                }
            }
            //Public Methods
            public int getOccurrences() {
                return countOfOccurrences;
            }
        }
    }

$$;
select getCountOfSubStr('Hello My Name Is Ahmad', 'M', true);
// OUTPUT: 3

JavaScript

Here’s an example of using JavaScript to traverse an object, find the numbers, and return their sum. Note that because JavaScript UDFs cannot accept or return an integer type, the return value is converted to a string.

CREATE OR REPLACE FUNCTION detectAndAddNumbers(v VARIANT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
    // Recursively collect every number found in the input (scalars, arrays, and nested objects)
    function addAllNumbers(input) {
        var numberArray = [];
        if (typeof input === "number"){
            numberArray = [input]
        }
        if (Array.isArray(input)) {
            for (var i = 0; i < input.length; i++) {
                numberArray.push.apply(numberArray, addAllNumbers(input[i]));
            }
        } else if (typeof input === "object") {
            for (var key in input) {
                numberArray.push.apply(numberArray, addAllNumbers(input[key]));                    
            }
        } 
        return numberArray
    }
    return addAllNumbers(V).reduce((F, input) => F + input, 0).toString();
$$;

set inputObject='[
  {"key1" : ["this", "example", 2], "key11" : ["helps", "show"]},
  {"key2" : ["how easy","it is", "to work"]},
  {"key3" : ["with", "variants","in", "snowflake", 5]}
  ]';

select  detectAndAddNumbers(parse_json($inputObject));
//output: 7

Python

This UDF example highlights the ability to use a Python library, hashlib, to generate a hash value with some exception handling.

CREATE OR REPLACE FUNCTION MY_HASH_FUNCTION(input STRING, algorithm STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'generate'
AS
$$
import hashlib as hl
def generate(input, algorithm):
    try:
        return hl.new(algorithm.lower(), input.encode('utf_8')).hexdigest()
        
    except ValueError:
        raise Exception(f"error encoding '{input}' with '{algorithm}'. please check algorithm")
    
$$;

select MY_HASH_FUNCTION('hello', 'MD5');
//OUTPUT 5d41402abc4b2a76b9719d911017c592

select MY_HASH_FUNCTION('hello', 'SHA256');
//OUTPUT 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

select MY_HASH_FUNCTION('hello', 'sha512');
//OUTPUT 9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043

select MY_HASH_FUNCTION('hello', 'MD');
//OUTPUT:
Traceback (most recent call last):
  File "/usr/lib/python_udf/ad85a51ba4343add6dcb35033bd3134b7a8f8669d7e1b4c02b3eb8af25f75458/lib/python3.8/hashlib.py", line 157, in __hash_new
    return _hashlib.new(name, data)
ValueError: unsupported hash type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "_udf_code.py", line 5, in generate
  File "/usr/lib/python_udf/ad85a51ba4343add6dcb35033bd3134b7a8f8669d7e1b4c02b3eb8af25f75458/lib/python3.8/hashlib.py", line 163, in __hash_new
    return __get_builtin_constructor(name)(data)
  File "/usr/lib/python_udf/ad85a51ba4343add6dcb35033bd3134b7a8f8669d7e1b4c02b3eb8af25f75458/lib/python3.8/hashlib.py", line 120, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type md

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "_udf_code.py", line 8, in generate
Exception: error encoding 'hello' with 'MD'. please check algorithm
in function MY_HASH_FUNCTION with handler generate

Scala

This example shows the flexibility that Scala offers by defining a class, a nested object, and some public methods.

CREATE OR REPLACE FUNCTION endOfYearRaises(SALARY DOUBLE, RAISETYPE STRING)
RETURNS DOUBLE
LANGUAGE SCALA
RUNTIME_VERSION = 2.12
HANDLER = 'Raise.giveRaise'
AS
$$
    class Raise {
        object Raises {
            def smallRaiseAmount: Double = 1.025;
            def normalRaiseAmount: Double = 1.05;
            def maxRaiseAmount: Double = 1.08;
            def promotionBaseRaiseAmount: Double = 1.10;
            def promotionBigRaiseAmount: Double = 1.15;
            def getSalaryAfterRaise(input: Double, customPercent: Double): Double = input * customPercent;
        }
        def getRaiseAmount(raise: String): Double = {
            raise.toLowerCase() match {
                case "small" => Raises.smallRaiseAmount;
                case "normal" => Raises.normalRaiseAmount;
                case "max" => Raises.maxRaiseAmount;
                case "normalpromotion" => Raises.promotionBaseRaiseAmount;
                case "bigpromotion" => Raises.promotionBigRaiseAmount;
    
            }
        }
        def giveRaise(salary: Double, raisetype: String): Double = {
            var baseSalary: Double = salary.toDouble;
            var raiseArray: Array[String] = raisetype.split(':');
            var raiseType: String = raiseArray(0);
            var raiseAmount: Double =
                if (raiseArray.length > 1)
                    1 + (raiseArray(1).toDouble/100);
                else
                    getRaiseAmount(raiseType);
            Raises.getSalaryAfterRaise(baseSalary, raiseAmount).toDouble;      
        }
    }
$$;
select ENDOFYEARRAISES(50000, 'custom:25');
//OUTPUT 62500
select ENDOFYEARRAISES(50000, 'small');
//OUTPUT 51250
select ENDOFYEARRAISES(50000, 'bigPromotion');
//OUTPUT 57500

SQL

As mentioned, SQL functions are a great way to implement a light transformation that isn’t computationally heavy. An example would be normalizing a date, given a pre-determined set of formats.

CREATE OR REPLACE FUNCTION NORMALIZE_DATE(INPUT_DATE VARCHAR)
    RETURNS DATE
    AS
    $$
        SELECT coalesce(TRY_TO_DATE(INPUT_DATE, 'yyyy,mm,dd,hh,mi,ss,ff9'),
                        TRY_TO_DATE(INPUT_DATE, 'yyyy,mm,dd,hh,mi,ss'),
                        TRY_TO_DATE(INPUT_DATE, 'yyyy,mm,dd,hh,mi'),
                        TRY_TO_DATE(INPUT_DATE, 'yyyy,mm,dd,hh'),
                        TRY_TO_DATE(INPUT_DATE, 'yyyy,mm,dd'))
    $$;
SELECT NORMALIZE_DATE('1999,12,31,12,30,30');

// OUTPUT: 1999-12-31

Closing

When writing your UDF, you will have a wide range of options. Some languages better suit your needs than others regarding capabilities, sharing, performance, and data handling. Whether it’s a simple UDF to normalize a date or something more complex for a Streamlit app on Snowflake, there are trade-offs to keep in mind, but one of the available languages will meet your needs.

Need additional help selecting which Snowflake UDF language to use? The experts at phData can help! Reach out to us today for advice, best practices, and Snowflake consulting services.
