April 17, 2024

How to Automate Database and Schema Names in dbt

By Dinesh Eswaramoorthy

dbt is a data transformation tool that helps data analysts and data engineers transform data in cloud data warehouses with a rich feature set and plenty of flexibility. By bringing software engineering practices to SQL, dbt helps you transform data in the warehouse for analytics and reporting purposes.

Automating database and schema names is important because, when you are working with multiple teams across environments like DEV, TEST, and PRD, you have to:

  • Automate the environment prefixes and suffixes, as each environment has its own deployment logic.

  • Set up schemas for each database in each environment.

  • Set up deployment configurations for the respective databases holding clean and transformed data.

  • Quickly provision new user profiles, databases, and schemas for different environments.

  • Spend more time on transformations rather than on managing complex environments.

Do you need help achieving all of this? No worries. Let’s take a deeper dive so you get a clear understanding of what I am talking about and why it is so important to automate database and schema names in your dbt project.

First, let’s answer the question below:

Why Is Automating Database and Schema Names Important?

Consider a scenario where two developers, A and B, are running their models in the dev environment under a common schema (develop), working on the same table in different feature branches. Developer A changes the model and deploys it to the shared schema. After that, developer B makes other changes to the same model in his branch and deploys them under the same schema.

The latest changes made by developer B will be available in the shared schema, but they will overwrite the changes deployed by developer A. That is why development changes must go to each developer's individual schema during development. If deployed to a shared schema, they might conflict with the work of another developer working on the same model.

On the other hand, when the work moves to the test and prod environments, it can go to a standard schema, as development is frozen at that stage. As you can see, the deployment logic varies by environment. To put it plainly, your database and schema names have to change depending on the environment to which your models are deployed.

Now, you are probably thinking, “OK, I understand why it is important to have an individual schema per developer. Let me go ahead and hard-code my schema names in the model configs” – right?

Here comes another hiccup:

→ Consider two more developers, C and D, who are onboarded to the project and hard-code the schema names in the model configs as they do their development. Meanwhile, developer B is allocated to another project, so developer B's hard-coded schema name config has to be removed. Imagine this happening multiple times: you end up spending more time maintaining model configs than writing the business logic in your transformations.

Once your models are promoted to higher environments, you cannot manually change the model configs in those environments (well, you might think, “There may be ways to achieve it,” but I would ask AT WHAT COST?). That is why automating database and schema names per environment is so important.

In this blog, we will walk you through automating database and schema names in your dbt project with some examples and show how and when it benefits you. First, let us set up a solid foundational understanding for the topics that follow.

What is Database and Schema Naming?

Definition and Role of Databases

In the data world, a database is a collection of information about objects or events, organized for easy access and management later on.

An organization can have different data streams from various sources (like web analytics, third parties, publicly available data, etc.). Those streams are logically grouped into physical databases, and the data in those individual databases is then transformed per business needs and used by downstream applications.

Definition and Role of Schemas

A schema is a logical representation of a database that shows how it is constructed.
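For example, in Snowflake, a table is addressed by its database, schema, and table name together; the names below are hypothetical:

-- <database>.<schema>.<table>
select id, name
from prd_sales.cleansed.orders;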

Let me save you tons of time at work. Curious how? It helps to understand the challenges before approaching a problem, so read on.

Common Challenges in Database and Schema Naming

As an organization grows, it looks for more opportunities, and more opportunities mean more complex systems to handle. Those systems need technical experts to take care of them, and as more systems get involved, data grows in both volume and business scope. With that many systems, it is often a significant challenge to name databases and schemas so that they are:

  • Self-descriptive.

  • Easy to understand for different teams.

  • Consistent across systems.

  • Aligned with the company’s policy.

  • Prefixed or suffixed with the environment.

Considering all these constraints, creating database and schema names can be challenging, time-consuming, and complex, and it only gets harder as the organization grows.

Now you have an understanding of the challenges you might face. To make your transformations efficient, we next need to understand:

What are the Best Practices for Naming Conventions?

Below are the standard practices to be followed for naming conventions:

  1. Create self-explanatory names.

  2. Avoid very generic names like front_end, back_end, or data_for_analytics.

  3. Follow a standard format across different teams.

  4. Use underscores for readability.

  5. Do not use special characters.

  6. Include the environment as a prefix or suffix.

  7. Keep names short and meaningful.
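For instance, here is how these rules might play out (all names are hypothetical):

-- Good: short, self-explanatory, underscores only, environment prefix
DEV_MARKETING_CAMPAIGNS
PRD_PARTS_INVENTORY

-- Bad: generic, no environment, special characters
data_for_analytics
sales&marketing-db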

Now that you know the issues you may encounter and have a checklist to consider when naming databases and schemas, it is time to reveal a game-changing tool that will help with almost all of the use cases and problems we will encounter during our transformation process. That tool is dbt (data build tool).

How dbt Can Be Used to Automate Database and Schema Naming

Wouldn’t it be much better if you could develop transformations in SQL where you could apply logic, version control them, run assertions, and stop the transformation if an assertion on warehouse data fails – just like the control we have over the process in the software development lifecycle? Well, there is a way to achieve it.

dbt is a transformation tool that helps organizations:

  • Build their transformation logic.

  • Version control them.

  • Test them with custom assertions and built-in test packages.

  • Generate documentation on transformations and assertions.

  • Check the freshness of source data.

  • Stop transformation if source freshness or test on source data fails.

  • Deploy them in the data warehouse.

  • And much more.

dbt supports Jinja templating, a programmatic way of applying logic to create database and schema names. Let’s say you want schemas named <developer_name>_contract for developers working as contract employees, and schemas with just the developer’s name for regular employees. In that case, you need to check whether an employee falls in the list of contract employees. This can be easily achieved by tweaking the built-in macros generate_database_name and generate_schema_name a little bit. These two macros are where you centralize your model deployment logic for the development, test, and production environments.
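As a minimal sketch of that idea – the contract_employees list and the usernames in it are hypothetical – the schema macro could look like this:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {# Hypothetical list of contract employees' warehouse usernames #}
    {%- set contract_employees = ['alice', 'bob'] -%}
    {%- set developer = target.user | lower -%}
    {%- if developer in contract_employees -%}
        {{ developer }}_contract
    {%- else -%}
        {{ developer }}
    {%- endif -%}
{%- endmacro %}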

Note: In dbt, you don’t need to create a schema manually if it is not present in the target database during deployment; dbt creates the target schema if it does not exist. You only have to ensure a custom schema name is passed from the model configs. You can also version control the changes you make to your .sql and .yml files, so you don’t have to worry about the different versions of your code, as dbt supports version control.

Now, let us get hands-on and understand how generate_database_name and generate_schema_name work.

How to Set up dbt for Naming Automation

To show how to automate database and schema names, I will use:

  • dbt Core.

  • A Python virtual environment.

  • The dbt-snowflake connector.
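As a rough setup sketch (the project name is a placeholder), this might look like:

# Create and activate a Python virtual environment
python -m venv dbt-venv
source dbt-venv/bin/activate    # on Windows: dbt-venv\Scripts\activate

# Install dbt Core together with the Snowflake adapter
pip install dbt-snowflake

# Scaffold a new dbt project
dbt init my_dbt_project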

Now, let’s work on the actual automation for databases, using the rule that the database name should be prefixed per environment: DEV_<database_name> for development, TST_<database_name> for the test environment, and PRD_<database_name> for production. To achieve this, we will use the built-in generate_database_name macro. For reference, the default schema naming behavior is: if a custom schema is provided, the generated schema will be <target_schema>_<custom_schema>; otherwise, it is just <target_schema>.
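The environment names come from the targets defined in profiles.yml. A minimal sketch, assuming hypothetical Snowflake account and credential values (the prd target follows the same pattern as tst), might look like this:

my_dbt_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "<account_identifier>"  # placeholder
      user: din                        # hypothetical developer username
      password: "<password>"           # placeholder
      role: transformer
      warehouse: transforming
      database: analytics_db           # hypothetical default database
      schema: develop
      threads: 4
    tst:
      type: snowflake
      account: "<account_identifier>"
      user: svc_dbt                    # hypothetical service account
      password: "<password>"
      role: transformer
      warehouse: transforming
      database: analytics_db
      schema: cleansed
      threads: 4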

Below are a few examples of how generate_schema_name and generate_database_name can serve our needs.

Automating Database Naming with dbt

You can override the existing database name generation logic by creating a new macro named generate_database_name under the macros folder and writing your logic inside it. Please remember that the database must already exist before you deploy your changes, or dbt will throw errors.

After creating the macros, when you pass a custom schema name and database name, dbt generates the database and schema names per the logic written inside those macros.

Below are the default and overridden macros with explanations.

Default generate_database_name:

{% macro generate_database_name(custom_database_name=none, node=none) -%}

    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}

{%- endmacro %}

For the default generate_database_name, the output resolves to the target database from profiles.yml when no custom database name is passed, or to the custom database name as-is when one is.
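For instance, with no custom database configured, a hypothetical model named src_db1 under the dev target from the profile sketch above would compile to something like:

create or replace transient table analytics_db.develop.src_db1 as
(select
    id,
    name
from sampledb_other.sampleschema_other.sampletable_other
);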

Now, let us alter generate_database_name with the following logic: if a custom database name is given, it should be prefixed with the current environment (the target name); otherwise, it should fall back to the database name given in profiles.yml.

Now, the generate_database_name macro looks like the below:
{% macro generate_database_name(custom_database_name=none, node=none) -%}

    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- else -%}
        {{ target.name }}_{{ custom_database_name }}
    {%- endif -%}

{%- endmacro %}

Now, if you run the model, it will generate the database name dev_googleanalytics_webdata:

create or replace transient table dev_googleanalytics_webdata.user_schema.src_db2 as
(select
    id,
    name
from sampledb_other.sampleschema_other.sampletable_other
);

As you can see in the compiled SQL above, DEV_GOOGLEANALYTICS_WEBDATA is used.
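For reference, a model file along these lines would produce that compiled SQL; this is a sketch, and the schema that actually appears depends on the generate_schema_name macro in place:

-- models/src_db2.sql
{{ config(
    materialized='table',
    database='googleanalytics_webdata',
    schema='user_schema'  -- resolved by generate_schema_name
) }}

select
    id,
    name
from sampledb_other.sampleschema_other.sampletable_other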

Now, if you pass the target by running dbt run -s src_db2.sql --target tst, then the database name will be TST_GOOGLEANALYTICS_WEBDATA.

The same goes for production too.

Automating Schema Naming with dbt

The same goes for schema name generation: overriding generate_schema_name under the macros folder with our own logic will generate our schemas.

Below are the default and overridden macros.

Default generate_schema_name:

{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}

{%- endmacro %}

With that default, models with no custom schema land in the target schema (develop, in the profile sketch above), while models with a custom schema land in <target_schema>_<custom_schema> (for example, develop_user_schema).

Now we override the macro with our own logic: prefix the schema with DBT_<username> for DEV, and use the cleansed schema for TST and PRD.

After tweaking, the macro looks like the below:

{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.user -%}
    {%- set target_name = target.name -%}
    {%- if target_name == 'dev' -%}
        DBT_{{ default_schema }}
    {%- elif target_name == 'tst' or target_name == 'prd' -%}
        cleansed
    {%- endif -%}

{%- endmacro %}

I ran the dbt commands below:

dbt run -s src_db2.sql --target dev
dbt run -s src_db2.sql --target tst
dbt run -s src_db2.sql --target prd

The generated schema for DEV was:

create or replace transient table dev_googleanalytics_webdata.DBT_din.src_db2 as
(select
    id,
    name
from sampledb_other.sampleschema_other.sampletable_other
);

Similarly, for TST and PRD, cleansed schemas are generated.

Wow… that sure was a lot. Are you curious to know how to implement it in real life?

Gear up and read below:

Strategies to Handle Complex Schema Naming Scenarios

If you have complex schema naming conventions, the generate_database_name and generate_schema_name macros are the perfect place to handle them. You can use the Jinja templating functions covered in the dbt documentation; with Jinja and dbt together, you can implement almost any naming convention, as the sketch below shows. Also, make sure to have dedicated schemas for staging, fact, dimension, intermediate, and cleansed models to simplify the transformation process.
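As one illustration, here is a sketch that keys the schema off the model’s folder via the node argument, assuming a hypothetical models/staging, models/intermediate, models/marts layout:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {# Hypothetical mapping from a model's top-level folder to a dedicated schema #}
    {%- set folder_to_schema = {'staging': 'staging', 'intermediate': 'intermediate', 'marts': 'fact'} -%}
    {# node.fqn is [project, folder, ..., model]; grab the folder when the model is nested #}
    {%- set folder = node.fqn[1] if node and node.fqn | length > 2 else none -%}
    {%- if folder in folder_to_schema -%}
        {{ folder_to_schema[folder] }}
    {%- elif custom_schema_name is not none -%}
        {{ target.schema }}_{{ custom_schema_name | trim }}
    {%- else -%}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}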

Real-world Applications and Examples

Let us assume a car manufacturing company has many databases, categorized by business area as below:

  • Production and Manufacturing:
    production_data, quality_control, manufacturing_logs, vehicle_specs, vehicle_testing, production_planning.

  • Supply Chain and Inventory:
    External_supply_chain_management, External_sales_and_inventory, External_parts_inventory, External_vendor_management.

  • Business Operations:
    employee_records, finance_and_accounting, customer_relationships, fleet_management, regulatory_compliance, environmental_compliance.

  • Research and Development:
    research_and_development.

  • Sales and Marketing:
    marketing_campaigns, after_sales_service.

Note: dbt will not create databases that do not exist in the target warehouse. So the explanations below assume that, before running the dbt commands, the databases have been created in the target data warehouse per the best practices for naming conventions section.

If you look at the Supply Chain and Inventory category, those databases carry the External_ prefix because the source data comes from external sources.

Now, let us create database names with the environment prefixed and a category short form appended (say, SAM for sales and marketing), removing the External_ keyword from the database name if it is passed from the model config. For schema name generation, we will create a user schema for the dev environment and a cleansed schema for the TST and PRD environments.

{% macro generate_database_name(custom_database_name=none, node=none) -%}

    {%- set default_database = target.database -%}
    {%- set target_env = target.name -%}
    {%- set production_and_manufacturing = ['production_data','quality_control','manufacturing_logs','vehicle_specs','vehicle_testing','production_planning'] -%}
    {# supply chain databases arrive with an External_ prefix; this list documents them with the prefix stripped #}
    {%- set supply_chain_and_inventory = ['supply_chain_management','sales_and_inventory','parts_inventory','vendor_management'] -%}
    {%- set business_operations = ['employee_records','finance_and_accounting','customer_relationships','fleet_management','regulatory_compliance','environmental_compliance'] -%}
    {%- set research_and_development = 'research_and_development' -%}
    {%- set sales_and_marketing = ['marketing_campaigns','after_sales_service'] -%}

    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- else -%}
        {%- set value_1 = custom_database_name.split('_')[0] -%}
        {%- if value_1 | lower == 'external' -%}
            {# strip the External_ prefix case-insensitively and tag with the SCI category #}
            {{ target_env }}_{{ custom_database_name | lower | replace('external_', '') }}_SCI
        {%- elif custom_database_name in production_and_manufacturing -%}
            {{ target_env }}_{{ custom_database_name }}_PM
        {%- elif custom_database_name in business_operations -%}
            {{ target_env }}_{{ custom_database_name }}_BO
        {%- elif custom_database_name == research_and_development -%}
            {{ target_env }}_{{ custom_database_name }}_RAD
        {%- elif custom_database_name in sales_and_marketing -%}
            {{ target_env }}_{{ custom_database_name }}_SAM
        {%- else -%}
            {# assume anything else is a supply chain database passed without the External_ prefix #}
            {{ target_env }}_{{ custom_database_name }}_SCI
        {%- endif -%}
    {%- endif -%}

{%- endmacro %}

Above is how generate_database_name now looks:

  1. First, we define lists of database names per category.

  2. {%- if custom_database_name is none -%} – we check whether a custom database name is supplied; if not, we fall back to the target database.

  3. {%- if value_1 | lower == 'external' -%} – this condition checks whether the name starts with the External_ keyword used by the supply chain and inventory databases. The else fallback at the bottom handles the case where the External_ keyword is not passed.

  4. Otherwise, we check which category list the custom database name belongs to and build the database name accordingly.

  5. Notice that we append _<category> as a suffix to every generated name so each database can be identified by its category.
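To make the behavior concrete, here is roughly what the macro generates in the dev environment for a few sample config values (Snowflake uppercases the unquoted names on creation):

-- External_sales_and_inventory  ->  dev_sales_and_inventory_SCI
-- production_data               ->  dev_production_data_PM
-- employee_records              ->  dev_employee_records_BO
-- research_and_development      ->  dev_research_and_development_RAD
-- after_sales_service           ->  dev_after_sales_service_SAM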

Feel free to try out different databases; only three database sources are used here for illustration purposes.

Below is a sketch of three model files passing these databases from their configs (the model names are hypothetical):
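-- models/stg_production_data.sql (hypothetical model)
{{ config(database='production_data') }}
select id, name from sampledb_other.sampleschema_other.sampletable_other

-- models/stg_sales_and_inventory.sql (hypothetical model)
{{ config(database='External_sales_and_inventory') }}
select id, name from sampledb_other.sampleschema_other.sampletable_other

-- models/stg_employee_records.sql (hypothetical model)
{{ config(database='employee_records') }}
select id, name from sampledb_other.sampleschema_other.sampletable_other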

If the above three models are run, we get our expected output.

Output for DEV: databases prefixed with dev_ (for example, DEV_PRODUCTION_DATA_PM).

Output for TST: databases prefixed with tst_ (for example, TST_PRODUCTION_DATA_PM).

Models are created under the databases we passed in the model configurations, and the respective schemas are created as well. I hope you had fun learning something new today.

Closing

Use generate_database_name and generate_schema_name as the system grows larger.

dbt is flexible because of its Jinja templating, built-in functions, version control mechanism, and assertions, including the ability to stop a transformation if an assertion fails.

dbt creates the schema resolved by generate_schema_name if it does not exist.

We can use conditional logic, functions, and looping in macros as we do in other programming languages. Always have a dedicated schema per developer to avoid conflicts in work, and dedicated schemas like cleansed, final, and transformed in the test and production databases.

Prefix database names with the environment: DEV_, TST_, or PRD_.

dbt is designed to streamline and simplify data transformation processes with robust logic-handling capabilities. With dbt, you can efficiently address your transformation use cases without spending more effort managing deployment configurations for different environments.

Now that you understand how to automate database and schema naming, are you curious to know how to automate the development and deployment of the transformations you do in dbt?

FAQs

What is the difference between development and deployment in dbt?

Development in dbt involves creating and refining the logic for data transformations, while deployment moves those transformations into a live environment to generate analytics-ready datasets.

Can you document macros in dbt?

Yes, absolutely. Documentation is one of the core parts of dbt. To document macros, see the official dbt docs site.
