Most data analyses require some form of data blending, or combining information from multiple sources in a controlled process. Even if all of your business data is warehoused in a single location such as the Snowflake Data Cloud, in depth analytics will require the combination of information across several tables, views, and perhaps a few CSV files too!
In this tutorial, we will discuss the required conditions, correct configuration and performance optimization best practices for three methods of data blending in Sigma Computing.
Before we dive in, let’s define each of these methods:
- Lookup: A function added to a base table that uses a lookup key, or identical value in at least two different sources, to match values across sources into the base table.
- Joins: Joins combine two sources at a time, using one or more join keys to align matching information, resulting in a new table.
- Union: Another method of combining multiple tables where two or more data sources are stacked on top of each other based on the configuration and alignment of fields and field types, also resulting in a new data source as a separate table.
Starting Data Source: MLB_STADIUM_SALES
The data used in this demonstration is included in your Sigma Sample Database tables under EXAMPLES, MLB_STADIUM_SALES. This is NOT a baseball test!
The sample data we are using is for concession stand sales.
Has rows of sales transactions within the stadium at the product level. Each row has an order number, keys to identify the stand and product, and some pre-aggregated values such as the unit cost, unit price, and quantity.
Includes the product type, product line, and unit price.
Includes the stand type and stadium level.
Contains orders by game and season ticket holder ID.
Using Lookups in Sigma
Similar to the VLOOKUP, function in Excel, the Lookup function in Sigma keeps all of the values from the base table plus any matching values from another data source; a game of data-go-fish!
The Function is looking at the local key, and asking a different table if it has any matches. Where more than one match is found, the function will return an * instead of the matching value unless an aggregation level is set.
When the number of rows after a lookup is complete, it will always match the original base table with no new tables being created. The lookup function is executed as follows:
Lookup (formula, local key 1…, external key 1…, local key 2…, external key 2…, …)
- The formula is the value you want to add to the base data source from the target data source. This can be a string, a number, or an aggregated value.
- The local key indicates which field in the base data source the target data will be matching to.
- The external key indicates which field the target data source used to match with the base data source.
The lookup widget in the Element configuration window is the easiest way to ensure your lookup works as intended. Lookups can be added right in the element configuration. Just click + to add a new column, then select “add column via lookup” and a new configuration window will open.
It is important to note that while the local and external keys can have different names, they must be the same field type (string, number, date, ect…) to execute a lookup.
LOOKUP_TABLE_1 is a duplicate of the F_POINT_OF_SALES sample data source and will be our base table.
In a new column, the following function was added to bring in the Product Type (the formula) from the D_STADIUM_PRODUCT sample data source. Both tables contain a field called Product Key, which will be used for the local and external keys.
Start typing the lookup formula and a pop-up guide will walk you through the correct configuration and provide links to more how-to content.
Join Types and Configuration
Joins are used to combine two or more data sources into a single table. Data is joined by identifying matching fields from both the base (or primary) and target (or secondary. tertiary, ect..) sources.
Start by creating a new table, then select Joins in the NEW tab.
Next choose the base data source and click “select” in the bottom right.
On the join configuration menu, add a new source to be joined by clicking on the “+” symbol next to the Sources tab at the top. Here you will be able to select the key field(s) being matched.
*Note that the fields do not have to have the same name in both sources, but they do require the same field type; string to string, value to value, date to date, ect. You can edit a field type in the sources before the joining process if they do not match.
Inner Join returns only the rows that match from both the base and joined data. If there are unmatched rows, the resulting row count will be less than the base data unless 100% of the records match.
There are concession stands on all three levels of this stadium, but the second level only has room for 15 vendors while the first and third levels both have 20. For this example, the Inner Join Base has a filter on Stadium Level 200, as our goal will be to find just the sales for the middle stadium level in the Inner Join Target, which does not have Stadium Level as a column. This Inner Join is set using the Stand Key as the Join Key.
The resulting Inner Join Table below includes only the 890,598 rows of sales data for Stadium Level 200 from the 4,584,628 rows of data in the Inner Join Target, F_POINT_OF_SALES Inner Join Example.
Could we have joined the base and target table, then filtered to the second level?
Yes, but that would require needlessly processing over 3.5 million rows of data through the join. By limiting base and source tables with filters as much as possible before the join, we can achieve optimal performance.
Left Outer Join
The most commonly used and default join type configuration, a Left Outer Join returns the rows from the base data and any matching rows from the joined data. Where more than one match is found, the base data will be duplicated to add each unique row from the joined data.
This example uses the Inner Join Table (from the previous example) as the Left Outer Join Base Table and F_STAND as the Left Outer Join Target with Order Number as the join key.
Note how the row count of the final join table matches the base source. If the row count is higher than the base table (sometimes called “row bloat”), the cause is multiple rows from the target source matching with base source row level data. That is important to understand before creating aggregated fields and metrics to share on your workbooks.
Right Outer Join
Right Outer Join returns the rows from the joined data and any matching rows from the base data. Where more than one match is found, the base data will be duplicated to add each unique row from the joined data.
Right Outer Joins are rarely used in practice and are best described as the inverse of a left outer join. For this example, the base and target tables have been switched from the left out join example and the resulting join table has the same data.
Returns all rows from the base and joined data while also aligning any matches. Where more than one match is found, the base and/or joined data can be duplicated.
To best display how an outer join works, this example uses the base (STAND_KEY pre-filtered to the second stadium level) and target data (Point of Sales for the whole stadium) just like the first Inner Join Table example.
Where the inner join returns only the rows from each source with matching keys, the resulting outer join table has the same row count as the largest table (in this case the target source) because all rows in the base source matched and were joined. If an outer join table has more rows than the largest source table, the delta is equal to the number of unmatched rows.
Using Unions in Sigma
Where Lookups and Joins are used to match data rows, Unions are used to align data columns before combining multiple sources. Imagine the business application being that we only have access to the data for each level of the stadium or each vendor from separate sources. They are in similar formats and we need to stack them on top of each other based on columns (or fields), rather than comparing and keeping/excluding rows of data.
To create a new union, add a Table element and select Unions under the NEW tab.
Next, select the first data source to be unioned then click “select” in the bottom right. In the Union configuration screen, add a new data source by clicking on the “+” symbol next to the Sources tab at the top and note the alignment.
Sigma will automatically make the best possible matches between both data sources, but you should still review the matches closely. Note that the fields do not have to have the same name in both sources, but they do require the same field type; string to string, value to value, date to date, ect.
You can edit a field type in the sources before the joining process if they are not the same type. Once configured, you can change the resulting field name in the resulting table. Any unmatched columns will be included in the final union table unless you deselect them in the Union configuration.
For this example, we are going to split up the STADIUM_KEY source by level using filters, then recombine them using a union.
Start by creating a new element, select table, and select Union. Using the three example tables above, Add each new source by selecting the + symbol at the top of the Create Union window. The fields will align with the best possible match, but you can manually override any field by using the drop down arrow to the right of the field name.
Once you select Done, the final table will be labeled “Union of # Sources” and can be used anywhere else in your workbook or data model. To edit the union, highlight the union table then look in the bottom left corner of the Elements Properties tab to find the drop down next to the union table name, then select “Edit Union.”
To show what happens to unmatched fields, I created three duplicates on the D_STAND tables above and added a custom field to each one; a random number, date, and text value all with different names.
In this configuration, you can see the custom fields unmatched and listed in the Dropped Columns below each source. Because they are all unique field types, if you try to force the alignment, an error message will appear.
We can still add these columns to the final union by selecting “Add All” in the Dropped Columns section, but they will only populate for the rows from the same source table and appear null on rows from the added sources like those found in the final union table to the right.
Remember that lookups are a quick way to supplement a data source with a single field from another table. Joins are more efficient than lookups when blending multiple fields from different sources, or multiple sources. Unions allow you to stack data sources on top of each other and configure how the columns align.
Want more great Sigma how-to information? We’ve got you covered. Choose your adventure:
phData is Sigma’s largest and most trusted consulting partner, so reach out today if you have any questions or need support!
Frequently Asked Questions
A: Absolutely! Just add another source in the join configuration menu and select a join type. You can set a series of elements to join, but keep in mind this process is completed sequentially and not all at once. For example, when joining 3 tables (A,B, and C) C would be joined to the resulting join table of A and B, not directly to A and/or B. The join configuration window and lineage tab offer a process map view as well.
A: No, not at this time, so it is important to set the correct field types in the source elements before you add them as sources in the Union configuration menu.