Data storage is a vital aspect of any Snowflake Data Cloud database. Within Snowflake, data can either be stored locally or accessed from other cloud storage systems. Each data storage location, whether internal or external, is referred to as a ‘stage’. Uploading files to these stages is a process termed ‘staging’.
Designed for optimal use with major cloud platforms like Amazon Web Services, Google Cloud Platform, and Microsoft Azure, Snowflake uses unique methods for loading and unloading table data, distinct from traditional relational databases.
Snowflake’s ‘staging area’ is a specific storage location where raw files are first loaded before they’re imported into the Snowflake database.
In this post, we’ll clarify the main differences between Snowflake’s internal and external stages, and the ideal scenarios for using each type of stage in Snowflake.
What are the Different Storage Layers Available in Snowflake?
In Snowflake, there are three different storage layers available, Database, Stage, and Cloud Storage.
The Snowflake architecture’s database storage layer organizes data into multiple tiny partitions, which are compressed and optimized internally. Snowflake stores and manages data in the cloud using a shared disk approach, which simplifies data management.
The shared-nothing architecture ensures that users don’t have to worry about distributing data across multiple cluster nodes. Snowflake hides user data objects and makes them accessible only through SQL queries through the compute layer.
The database storage layer in Snowflake is responsible for storing all the database objects, including tables, views, procedures, and other schema objects. It handles the metadata related to these objects, access control configurations, and query optimization statistics.
This layer is used to store data that is loaded into Snowflake from external sources or to store data that is being exported from Snowflake. There are two types of stage storage available: internal and external stages.
Internal stages store the files internally within Snowflake.
External stages store the files in an external location (e.g. S3 bucket) that is referenced by the stage. An external stage specifies location and credential information, if required, for the S3 bucket. The external stage area includes Microsoft Azure Blob storage, Amazon AWS S3, and Google Cloud Storage.
Snowflake leverages the cloud’s native object storage services (e.g. Amazon S3 for AWS, Azure Blob Storage for Azure, or Google Cloud Storage for GCP) to store the actual data files in micro-partitions. Object storage provides virtually infinite storage capacity and ensures the durability and availability of data. Snowflake’s data storage in object storage is fully managed, and users do not need to manage the storage infrastructure.
Snowflake’s ability to scale storage and compute resources independently offers organizations a highly scalable and cost-efficient solution for managing data.
Internal Stages in Snowflake
Internal stages provide a way to store data within the Snowflake environment temporarily. After data has been transformed and processed within Snowflake, internal stages typically come into play to produce a file format that other systems can consume.
Snowflake internal stages are further classified into:
Each user has a Snowflake stage allocated to them by default for storing files, and these cannot be altered or dropped. These stages are unique to the user, meaning no other user can access the stage. User stages are not a suitable option if files need to be accessed by multiple users.
Snowflake stores all the worksheets created by the user in the user stage of the user.
User stages are referenced using @~
The following statement lists all the files contained in the user stage:
Each table has a Snowflake stage allocated to it by default for storing files, and these cannot be altered or dropped. Multiple users can access table stages but can only load data into the table it is allocated to. Table stages are unsuitable if the data must be loaded into multiple tables.
Table stages are referenced using @% and have the same name as the table. The following statement lists all the files contained in the stage of table EMPLOYEE:
Internal Named Stage
Named stages are database objects that provide the greatest degree of flexibility for data loading. They overcome the limitations of both the user and table stages.
Named stages are accessible by all the users with appropriate privileges.
The data from named stages can be loaded into multiple tables.
Numerous users can access the internal named stages, which can be utilized to load multiple tables. If you want to share files with different users and load them into multiple tables, the internal named stage is the way to go.
The following statement lists all the files contained in the internal named stage:
There are several scenarios where internal stages can be used, including:
Data Staging: If you need to temporarily store data within Snowflake for processing or analysis, internal stages can be used as a staging area.
Exporting Data: Snowflake can export data into other systems’ file formats through an internal stage.
Data Processing: If you need to transform or process data within Snowflake before outputting it to a file format, an internal stage can be used to temporarily store the transformed data.
Snowflake Data Loading/Unloading with Internal Stages
The commands below are used in Snowflake for loading data into and unloading data from Snowflake stages.
The Snowflake web interface’s Worksheet tab page does not support the GET and PUT commands. Instead, upload data files from your local computer into Snowflake Stages using the SnowSQL command-line interface.
How to Load Data From Local Folder into Snowflake Stages Using PUT Command
The following statement uploads a file named input.csv on your local machine to a Named internal stage called my_internal_stage.
put file://C:\SourceFiles\input.csv @my_internal_stage;
How to Load Data from Snowflake Stages into Tables Using COPY INTO Command
The below statement loads data into the table EMPLOYEE from the input.csv file in your internal named stage called my_internal_stage.
COPY INTO EMPLOYEE FROM @my_internal_stage/input.csv;
By providing a file format, the following line imports data from a file called input.csv in your internal named stage named my_internal_stage into a table named EMPLOYEE. Note that we’re not specifying any location information because internal stages are managed by Snowflake and do not require any external credentials or URLs.
copy into EMPLOYEE from @my_internal_stage/input.csv file_format = (type = csv field_delimiter = ',' skip_header = 0);
After you verify that you successfully copied data from your stage into the tables, you can remove data files from the internal stage using the REMOVE command to save on data storage.
The following statement will delete all files with the csv pattern in the file name of the internally named stage named my_internal_stage.
REMOVE @my_internal_stage PATTERN='.*.csv.gz';
How to Unload Data Into Snowflake Stages From Tables Using COPY INTO Command
The statement that follows loads data from the table named EMPLOYEE into your internal named stage called my_internal_stage.
copy into @my_internal_stage from EMPLOYEE;
How to Unload Data Into Local Folder from Snowflake Stages Using GET Command
The data from the internal named stage named “my_internal_stage” is unloaded onto the local machine by the following statement:
get @my_internal_stage file://C:\OutputFiles\NamedInternal;
Summing Up Internal Stages
In summary, internal stages are integral to Snowflake’s data manipulation and integration capabilities, enabling companies to transfer data quickly between multiple systems and accounts, and emit data in a way that can be accessed by other systems.
If you want to store data files in Snowflake, then internal stages are a great option. They are flexible, secure, and provide exceptional performance.
External Stages in Snowflake
External stages provide a way to access and load data from external cloud storage platforms, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. External stages are used to define the connection information and access credentials required to access data stored in these external platforms.
The external stage is then used to load the data into Snowflake using the COPY INTO command. Once the data is loaded into Snowflake, it can be further processed and transformed using SQL queries or other tools within the Snowflake environment. This includes tasks such as data cleansing, enrichment, and aggregation.
In addition to providing organizations with the ability to ingest data from an array of external sources, including on-premises systems, cloud storage, and third-party platforms, external stages are an essential part of Snowflake’s integration capabilities.
Here’s an example of the CREATE STAGE command to create an external stage in Snowflake:
CREATE OR REPLACE STAGE my_external_stage URL = 's3://my-bucket/my-folder/' CREDENTIALS = (AWS_KEY_ID = 'my-access-key' AWS_SECRET_KEY = 'my-secret-key');
In this example, we’re creating an external stage called my_external_stage and specifying the URL for the stage, which is the location of the data files in an S3 bucket (s3://my-bucket/my-folder/).
We’re also specifying the AWS access key and secret key credentials that are required to access the S3 bucket (AWS_KEY_ID = ‘my-access-key’ AWS_SECRET_KEY = ‘my-secret-key’).
Note that Snowflake also supports other cloud storage providers, such as Azure Blob Storage and Google Cloud Storage, and the syntax for creating an external stage may differ based on the provider being used.
PUT and GET do not support uploading files to external stages. To upload files to external stages, use the utilities provided by the cloud service. By enabling SQL queries on the data contained in files, the external stages also allow you to create external tables on top of the files already available in their locations. The external tables created on top of the data files can also be used to join with other Snowflake tables, create views and materialized views.
Loading From External Stage Into Table
The following statement loads data from a file named input.csv in an external stage named my_external_stage into a table named EMPLOYEE.
copy into EMPLOYEE from @my_external_stage/input.csv;
Data can be loaded directly from external locations into Snowflake without using steps. The following statement loads data from a file named input.csv in your s3 bucket location into a table named EMPLOYEE.
COPY INTO EMPLOYEE FROM s3://my-bucket/my-folder credentials=(AWS_KEY_ID='my-access-key' AWS_SECRET_KEY='my-secret-key') FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1);
Some common scenarios where external stages can be used include:
Large Data Volumes: If you have large amounts of data stored in external storage systems, external stages can be used to load this data into Snowflake for analysis or processing.
Backup and Recovery: If you need to backup or restore data stored in Snowflake, external stages can be used to unload and reload data to and from external storage systems.
Cost Savings: If you want to reduce the cost of storing data in Snowflake, you can use external stages to unload data from Snowflake and store it in lower-cost external storage systems.
Overall, external stages in Snowflake offer a flexible and efficient way to integrate data from external sources into the Snowflake data warehouse and enable seamless data movement between Snowflake and external storage systems.
When using external stages in Snowflake, it’s important to ensure that appropriate security measures are in place. This includes:
Securing the Location of the External Data Files: Ensure that the external data files are stored in a secure location only accessible by authorized users.
Securing the External Stage Object: Limit access to the external stage object to only those users who need to load data from the external files.
Validating the Data: Before loading data from external files into Snowflake, ensure that the data is valid and free from malicious code or other security risks.
Monitoring and Logging: Monitor activity around the use of external stages and log any suspicious activity for auditing purposes.
Snowflake Stages Use Case Examples
Here are a few examples of how Snowflake stages can be used:
Data Ingestion from Amazon S3: Snowflake stages can be used to ingest data from Amazon S3 into Snowflake. The external stage can create a location in S3 where files are stored, and the Snowflake COPY command can be leveraged to load the data from the external stage into a Snowflake table.
Real-Time Data Processing: Snowflake stages can process real-time data from streams. The Snowflake STREAMS feature can create a stream, and an internal stage can be leveraged to store the data as it is being ingested. The data can then be processed using Snowflake’s SQL capabilities.
Integrating Snowflake with Other Data Pipelines: Snowflake stages can be integrated with other data pipelines to enable the flow of data between Snowflake and other data processing tools. For example, an external stage can be created to define the location of data in a Kafka stream, and Snowflake can be configured to consume the data from that stream and load it into a table for further analysis.
Data Unloading: Stages can also be used to unload data from Snowflake to a variety of destinations, such as files, databases, and other cloud storage services. For example, you could use a stage to unload data from a Snowflake table to a CSV file.
Data Warehousing: Stages can be used to store data in Snowflake for data warehousing purposes. For example, you could use a stage to store data from a variety of sources in Snowflake so that it can be analyzed later.
Data Loading: Stages can be used to load data into Snowflake from a variety of sources, such as files, databases, and other cloud storage services. For example, you could use a stage to load data from a CSV file into a Snowflake table.
To summarize, understanding the distinction between internal and external stages in Snowflake is essential for efficiently managing data in the cloud-based data warehousing platform. Users may effectively process data within the Snowflake ecosystem by utilizing internal stages, while external stages provide flexibility and cost-effectiveness for handling huge volumes of data stored in external cloud storage.
Organizations may optimize their data processing, storage, and analysis by integrating internal and external stages, enabling safe and efficient data management workflows. Data can be safeguarded and accessed safely in both internal and external stages thanks to Snowflake’s strong security capabilities, ensuring compliance with data governance and privacy regulations.
In the end, Snowflake’s internal and external stages provide data-driven enterprises with scalable, flexible, and affordable data management solutions, enabling them to make superior insights and decisions in the ever-evolving digital environment.
Contact our team of Snowflake experts!
Internal stages are created and managed using the CREATE STAGE and DROP STAGE commands in Snowflake.
External stages are created and managed using the CREATE STAGE and DROP STAGE commands in Snowflake. They require the user to provide credentials to access the external cloud storage service.
External stages are commonly used for accessing large datasets that cannot be stored within Snowflake.
The primary distinction between internal and external stages is that internal stages are controlled by Snowflake and used to store data within Snowflake's storage infrastructure, whereas external stages are references to data saved in cloud storage services outside of Snowflake.
In the same Snowflake environment, you can use both internal and external stages. You have the freedom to select the ideal stage type in accordance with your needs for data processing and storage.