As Q4 comes to an end and we begin signing off meetings with “See you next year!” there is one mythical figure mentally preparing for the busiest night of his year.
Now we’ve all questioned the physics of Mr. Claus by asking: “How does he fit down the chimney?” or the transportation safety of the Christmas Eve operation by asking: “How can he see where his sleigh is going if it’s cloudy?” (The answers are: still unclear, and Rudolph, respectively).
But have you ever wondered how Santa, or the route planners at Santa’s Workshop Inc., plan his flight path to every chimney around the world? Have you ever wondered if he managed to take the most optimal path?
Or better yet, have you ever wondered which Data Cloud he might use to solve this problem? Well, I have, and now you have too.
In this festive blog, we’re going to explore how Santa could plan the most optimal flight path for his famous route on Christmas Eve by building a data application built on the Snowflake Data Cloud.
But first, a little background information on the mechanics of this grandiose operation.
Traveling Salesman Problem
The traveling salesman problem, lesser known as the Santa flight path problem, is an exercise to find the most optimal path between a set of given cities, visiting each city once and returning to the origin city.
Given one origin and a set of 3 cities, there are 6 potential paths, with 10 cities, there are 3.6 million potential paths, and with 20 cities, 2.4 quintillion. The brute force approach would be to go through all path permutations, a time complexity of O(n!)
.
Exact algorithms, like the Held-Karp algorithm, solve the problem in O(n^2 * 2^n)
time. This would solve a 20-city problem in a still completely unreasonable (but vastly improved) ~400 million permutations.
More often, heuristic approaches are used, which may not find the most optimal path but are able to return a reasonable path in a much more reasonable time. As you read further, we will walk through implementing a genetic algorithm in Snowflake and later compare its results to a baseline nearest-neighbor approach.
Genetic Algorithms
A genetic algorithm attempts to mimic natural selection. This is accomplished by using an initial population and producing children from the population using crossover and mutation techniques to create a new generation.
This process is iterated over many generations while optimizing toward a fitness score, and higher-scoring children contribute more heavily to subsequent generations. The common parameters for a genetic algorithm are p
for the population size, g
for the number of generations, and m
for the mutation rate (the rate individuals in a population are mutated).
Let’s unpack the algorithm in the context of Santa’s flight path problem:
1. Given a set of locations, generate a population p
of random flight paths that start at the north pole, travel to each location, and end at the north pole.
2. Select p
flight paths from the original population giving paths with lower total flight distances and higher probabilities of selection (sampling with replacement).
3. Randomly sample two paths from the new population set and perform a crossover. A crossover is done by using a random index range, taking the index splice from both parent paths, using the index splice with the shorter travel distance as the seed for the child path, and filling in the remaining indices with the other parent.
4. Perform mutation on paths at a mutation rate of m
. Mutation is done by randomly swapping two nodes in a path. The purpose of this step is to introduce randomness and help the population escape from local minima.
5. Keep track of the path with the lowest travel distance. Repeat the flight selection → crossover → mutation steps for g
generations and return the most optimal path found.
How to Create a Flight Path Data Application on Snowflake
Genetic algorithms are really popular when introducing people to the traveling salesman problem and can be a fun coding exercise to implement from scratch. What’s even more fun is implementing the genetic algorithm on Snowflake and building a data application on top of it.
Snowflake Setup
1. Create a 30-day free Snowflake trial account.
2. On Snowsight, under Admin → Billing & Terms, enable Anaconda so we can use Python packages in our stored procedures.
3. Open VSCode and install the Snowflake extension. This extension allows us to send queries directly to Snowflake while in our code editor.
4. Clone this repo to follow along:
git clone https://github.com/LwrncLiu/traveling_santa.git
5. Create a conda virtual environment:
conda env create -f conda-env.yml
6. Activate the virtual environment
conda activate santa_env
7. Open the file src/setup.sql
, replace {user}
with your username, and run the entire file to create our route_planner
role and traveling_santa
database.
8. Go to the Marketplace in Snowsight and add the Worldwide Address Data by Starschema to our account, indicating that the route_planner
role has access to the new database. We will use this dataset to pull random coordinates from around the world.
Creating our Algorithms in Snowflake
9. The src/utils.sql
file has SQL commands to:
Create the Python stored procedure
get_n_random_locs(num_locs, table_name)
. This procedure queries theopenaddress
table from the Marketplace dataset, samplesnum_locs
number of rows, and using the longitude and latitude columns, create a new table{table_name}
with a single geography datatype column.Create a SQL UDF
points_equal(point1, point2)
that returns true if two geography datatype points are equivalent.Create a SQL UDF
tbl_exist(tbl)
that returns true if a given table exists in theroutes
schema.
Run the file to create the stored procedure and user-defined functions.
10. The src/naive_algorithm.sql
file has a command to create the Python stored procedure naive_path_finder(source_table)
.
This procedure uses Snowpark Python to query the
source_table
(the table that was populated byget_n_random_locs()
) and creates a distance lookup between coordinate pairs.Our coordinates are stored in a geography datatype, so we can use Snowflake’s
st_distance
function to find the distance in kilometers between two points.Then, starting from the north pole, find the next closest point, and repeat until all points have been traversed, and we end up back at the north pole.
The output path and total distance of the path are stored in the table
f'{source_table}_routes'
.
Run the file to create the naive algorithm stored procedure.
11. The src/genetic_algorithm.sql
file has a command to create the python stored procedure genetic_path_finder(source_table, population_size, generations, mutation_rate)
Python stored procedures require the
main()
method, but you can create other methods or even classes inside the stored procedure.The genetic algorithm is more complex than the naive method. To mimic Path entities and encapsulate the logic of evolution, we have a
Path
class and aGeneticAlgorithm
class.The
Path
class contains information about its path, methods to calculate its distance, and a method to perform mutation.The
GeneticAlgorithm
class contains information about the population size, number of generations, mutation rate, all the points, and the entire path population. It has methods to generate a population, perform selection on a population, crossover paths, and mutate paths.When called, this stored procedure will query the
source_table
to create a distance lookup between coordinate pairs and then run the genetic algorithm that we’ve previously described to find an optimal path.The output path and total distance of the path are stored in the table
f'{source_table}_routes'
.
Run the file to create the genetic algorithm stored procedure.
12. The src/test.sql
file runs the procedures we’ve created. It first calls get_n_random_locs
to pull 20 random locations from the address dataset. Then it calls naive_path_finder
and genetic_path_finder
for both algorithms to generate a flight path from the North Pole to the selected locations. My file run resulted in the genetic algorithm producing a flight path 3km shorter than the naive method. But this will likely not always be the case depending on the algorithm’s parameters, the randomness the algorithm introduces, the location set, and the number of locations.
Visualizing our Paths using Streamlit
We can quickly visualize the flight path differences using Streamlit.
13. Inside the src/streamlit
folder, create another folder .streamlit
, and inside create a secrets.toml
file. The file should reside in src/streamlit/.streamlit/secrets.toml
. Inside the .toml
file, copy the following contents and replace the account, user, and password fields to reflect your account, user, and password.
[connections.snowflake]
account = ACCOUNT
user = USER
password = PASSWORD
role = "ROUTE_PLANNER"
warehouse = "COMPUTE_WH"
database = "TRAVELING_SANTA"
This is the connection Streamlit will use to connect to Snowflake.
14. Navigate into the streamlit
folder and run streamlit run main.py
. This will open a streamlit app locally, and you should see two globes, one with the nearest neighbor flight path and the other with the genetic algorithm flight path.
From the two flight paths, we can tell that while the genetic algorithm produced a shorter path, it’s still not optimal. The star shape created through the four points in the Midwest region for both paths could be straightened out a bit. Maybe a different population size, number of generations, or mutation rate could have nudged the algorithm to produce a better result.
One caveat to be aware of is that the Plotly 3D plots that create the globes are a bit hefty. Depending on your machine, your mileage may vary.
Summary
So what did we do? Did we create a solution that finds the most optimal flight path for all the deserving chimneys in the world? Definitely not. Did we build a production-ready application that I can use to optimally plan my tour of the 44 European countries? Not necessarily.
We utilized Snowflake’s platform and went from raw data to a data application POC in ~15 minutes. Having the ability to write Python on top of your data in Snowflake and having the code executed in Snowflake alleviates local computing limitations.
As an added bonus, Streamlit is a great tool for those familiar with Python to quickly develop an application.
If you made it to the end and are reading this in 2023, Thank you for reading this post during the holiday season. I will see you next year!