Across industries, the difficulty of extracting value from raw data is pervasive. The Medallion Architecture systematically solves this problem by dividing the data into three progressive tiers: Bronze (raw), Silver (validated), and Gold (enriched). Incremental value is added, and the data’s quality and structure are refined at each stage, assuring data trustworthiness and a “single source of truth” for analytics.
Combined with a unified platform like Microsoft Fabric, this pattern becomes even more powerful. Fabric is Microsoft’s unified analytics platform, integrating data engineering, data science, and business intelligence into a single, coherent ecosystem.
In this blog, we will explain how to use both and walk you through a step-by-step process for implementing a Medallion Architecture in Microsoft Fabric. You will see how to bring raw data into the Bronze layer, clean it up in the Silver layer, and prepare it for in-depth analysis in the Gold layer, all within Fabric’s integrated environment.
What is Medallion Architecture?
The Medallion Architecture is a layered design pattern for data management, used most often in data lakehouses. It partitions data processing into three sequential logical layers, referred to as the Bronze, Silver, and Gold layers.
Bronze Layer: Stores data ingested from source systems in its original format. It acts as the system’s landing zone, preserving data for modification history and auditing.
Silver Layer: Handles data cleansing, validation, and enrichment. Typical steps at this level include removing duplicate records, standardizing data formats, and merging multiple datasets. It yields trusted, structured data ready for advanced analytical work.
Gold Layer: The final stage of data refinement. Here, business rules are applied and curated data models are built for consumption by BI tools, reports, and machine learning processes.
The Medallion Architecture becomes especially valuable as data volumes grow, particularly within lakehouse systems that bring together the best of data lakes and data warehouses. The layered approach improves data quality, governance, and scalability, and it empowers teams to build reusable, consistent analytics pipelines across the organization.
Overview of Microsoft Fabric
Microsoft Fabric is a cloud-native, end-to-end data platform that streamlines the entire data journey from ingestion to insight by combining these critical tools: Data Factory, Synapse, Power BI, and OneLake. Its tight integration of data ingestion, storage, transformation, and modeling forms a natural backbone for a Medallion Architecture implementation.
Fabric enables frictionless data flows into the Bronze layer, rich transformations in the Silver layer, and composable semantic models in the Gold layer, with flexibility provided by both lakehouse and warehouse patterns. With built-in scalability, governance, and support for varied workloads from batch processing to machine learning, Fabric helps organizations develop dependable, shareable analytics pipelines at scale.
How to implement Medallion Architecture in Microsoft Fabric
Here, we’ll go through the entire end-to-end process of putting the Medallion Architecture into action in Microsoft Fabric—configuring your lakehouse and ingesting raw data, through cleansing and organizing it into analytics-ready models. With our clinic appointments dataset, we’ll guide you through each phase of the Bronze → Silver → Gold pipeline, demonstrating how to get your data ready for robust reporting and insights.
1. Creating a Workspace
Before we can begin constructing the Medallion Architecture in Microsoft Fabric, we will first create a workspace where all our notebooks, lakehouses, and reports will reside. A workspace is your project’s hub or home page, where all the related assets are stored together.
How to create your workspace:
Sign in to Microsoft Fabric.
Click Workspaces on the left navigation menu.
Choose New workspace and assign it a simple, descriptive name, for instance, Clinic_Analytics.
Click Apply to complete. When your new workspace loads, it should be blank, as in the image below.
2. Creating the Lakehouse & Uploading a Raw File
Once your workspace is created, click on + New item. A panel will appear on the right side. In the search bar, type Lakehouse to quickly find the option. When you select Lakehouse, it will prompt you to enter a name. Name it Clinic_Appts_Lakehouse and click Create, as shown in the image below.
We’ll use a sample .csv file from our local machine for this blog and ingest it into the lakehouse. To do this, go to the Files section, click on the three dots, and create a subfolder named Raw. Once the folder is created, click on the three dots next to it, choose Upload files, browse to the file’s location on your system, select it, and hit Upload.
After the upload is complete, the file will appear inside the Raw folder of the lakehouse, as shown in the image below.
3. Ingesting Raw Data into the Bronze Layer
In the Bronze layer, we store the raw appointment data in a managed Delta table named bronze_appointments. In a PySpark notebook (running on the Spark runtime), we load the source .csv file as-is, keeping the original column names unchanged.
To start, open your Lakehouse, click on Open notebook, and create a new notebook as shown below. Once it opens, rename it as needed (for example, Bronze Layer Load).
Next, in the notebook, use PySpark to read the uploaded raw .csv file from the Raw folder in the lakehouse and write it into a managed Delta table named bronze_appointments. Before saving, preview the first two rows of the DataFrame to confirm the file is read correctly and all columns are loaded as expected (see image below). This way, the ingested data stays immutable and can be reused for future transformations without changing the original content.
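For reference, here is a minimal PySpark sketch of what that cell might look like. The file name clinic_appointments.csv and the CSV options are assumptions; adjust them to match the file you uploaded to the Raw folder.

```python
# A minimal sketch of the Bronze load cell (file name and CSV options are assumptions).
from pyspark.sql import SparkSession

# A Spark session is already available in a Fabric notebook; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Read the raw CSV exactly as uploaded, keeping the original column names.
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/Raw/clinic_appointments.csv")  # hypothetical file name
)

# Preview the first two rows to confirm the file is read correctly.
raw_df.show(2)

# Persist the data as-is into a managed Delta table for the Bronze layer.
raw_df.write.format("delta").mode("overwrite").saveAsTable("bronze_appointments")
```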
After running the notebook cell, the bronze_appointments table will appear in the Tables section of the lakehouse, confirming that the Bronze layer load is complete.
4. Transforming Data for the Silver Layer
To keep things consistent, we created a notebook called Silver Layer Load. We started by reading data from the bronze_appointments table and formatting the date columns for accurate analysis. We then cleaned the data by filling missing values in key fields, such as setting Charges to 0 and Treatment to “N/A”. The OpenSince column was dropped since it was not required. We also standardized the column names, making the data cleaner, easier to read, and ready for downstream processing. Finally, we saved the cleaned and renamed data as a new table called silver_appointments.
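A minimal sketch of that Silver transformation is shown below. The source column names (AppointmentDate, Charges, Treatment, OpenSince), the date format, and the renaming convention are assumptions; substitute the actual columns in your dataset.

```python
# A minimal sketch of the Silver load cell (column names and date format are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze_df = spark.read.table("bronze_appointments")

silver_df = (
    bronze_df
    # Cast the date column to a proper date type for accurate time-based analysis.
    .withColumn("AppointmentDate", F.to_date("AppointmentDate", "yyyy-MM-dd"))
    # Fill missing values in key fields.
    .fillna({"Charges": 0, "Treatment": "N/A"})
    # Drop the column that is not required downstream.
    .drop("OpenSince")
)

# Standardize column names, e.g. to lower snake_case.
silver_df = silver_df.toDF(*[c.strip().lower().replace(" ", "_") for c in silver_df.columns])

# Save the cleaned data as the Silver table.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_appointments")
```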
After executing the notebook cell, you’ll see the silver_appointments table appear in your lakehouse’s Tables section, confirming that the Silver layer load is complete.
5. Creating Fact and Dimension Tables in the Gold Layer
We create fact and dimension tables in the Gold layer to organize the data for efficient analysis and reporting. The fact table holds the key measurable events, while the dimension tables add descriptive context. This structure improves query performance, makes data exploration easier, and allows flexible slicing and dicing of information.
Keeping consistency in our process, we have created the Gold Layer Load notebook, where we prepare the dim_patient table by capturing the details of each patient from the silver_appointments table, selecting key patient information, and saving it as a Delta table named dim_patient, as shown in the image below.
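Below is a minimal sketch of how that cell could look; the patient columns selected here are assumptions about the Silver schema.

```python
# A minimal sketch of building dim_patient (the selected columns are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver_df = spark.read.table("silver_appointments")

dim_patient = (
    silver_df
    .select("patient_id", "patient_name", "patient_gender", "patient_age")
    .dropDuplicates(["patient_id"])  # keep one row per patient
)

dim_patient.write.format("delta").mode("overwrite").saveAsTable("dim_patient")
```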
In the same way as dim_patient, we will create the dimension tables dim_doctor and dim_clinic, as shown in the image below.
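A sketch of the corresponding cells might look like the following, again with assumed column names.

```python
# Minimal sketches of dim_doctor and dim_clinic (column names are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver_df = spark.read.table("silver_appointments")

dim_doctor = (
    silver_df
    .select("doctor_id", "doctor_name", "specialty")
    .dropDuplicates(["doctor_id"])  # keep one row per doctor
)
dim_doctor.write.format("delta").mode("overwrite").saveAsTable("dim_doctor")

dim_clinic = (
    silver_df
    .select("clinic_id", "clinic_name", "clinic_city")
    .dropDuplicates(["clinic_id"])  # keep one row per clinic
)
dim_clinic.write.format("delta").mode("overwrite").saveAsTable("dim_clinic")
```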
We created dim_date to maintain a consistent and comprehensive date reference table. This table enables efficient time-based filtering, aggregation, and analysis across all fact tables without relying on incomplete or inconsistent date fields from source data.
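One common way to build such a table is to generate a continuous calendar between two boundary dates, as in the sketch below; the date range and the derived attributes are assumptions.

```python
# A minimal sketch of dim_date built from a generated calendar (date range is an assumption).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Generate one row per day between the assumed boundary dates.
dim_date = (
    spark.sql(
        "SELECT explode(sequence(to_date('2020-01-01'), to_date('2030-12-31'), interval 1 day)) AS date"
    )
    .withColumn("date_key", F.date_format("date", "yyyyMMdd").cast("int"))
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day", F.dayofmonth("date"))
    .withColumn("day_of_week", F.date_format("date", "EEEE"))
)

dim_date.write.format("delta").mode("overwrite").saveAsTable("dim_date")
```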
Now, we will create the fact_appointments table, which will act as the central transactional table storing detailed records of each appointment. This table will serve as the foundation for analysis by linking to all dimension tables (patient, doctor, clinic, and date) and enabling comprehensive reporting on appointment trends, performance metrics, and revenue insights.
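A minimal sketch of the fact table build is shown below; the key and measure columns are assumptions about the Silver schema.

```python
# A minimal sketch of fact_appointments (column names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

silver_df = spark.read.table("silver_appointments")

fact_appointments = (
    silver_df
    # Derive the surrogate date key used to join to dim_date.
    .withColumn("date_key", F.date_format("appointment_date", "yyyyMMdd").cast("int"))
    .select(
        "appointment_id",  # identifies the appointment itself
        "patient_id",      # -> dim_patient
        "doctor_id",       # -> dim_doctor
        "clinic_id",       # -> dim_clinic
        "date_key",        # -> dim_date
        "charges",         # measure
        "treatment",
    )
)

fact_appointments.write.format("delta").mode("overwrite").saveAsTable("fact_appointments")
```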
After executing all notebook cells, the dimension and fact tables appear in your lakehouse’s Tables section, confirming that the Gold layer transformation is complete.
6. Keeping Dimension Tables Fresh with Delta MERGE and Slowly Changing Dimensions (SCD)
In a production data pipeline, it’s critical to ensure that dimension tables remain accurate and reflect the latest business information. Over time, attributes such as patient details, clinic names, or doctor contact information may change. To handle this, we use incremental updates instead of recreating tables from scratch.
Delta Lake provides the MERGE operation, which makes it easy to upsert (update or insert) records efficiently. Depending on requirements, you can choose an approach such as:
Simple Upsert (SCD Type 1) – Overwrite the latest attributes whenever a change is detected.
Here, we show the example only for the Patients dimension. In this use case, we implement the MERGE logic in the same Gold Layer Load notebook: we select Spark SQL from the notebook’s language dropdown and enter the code as shown in the image below.
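For reference, a MERGE along these lines could look like the sketch below. The staging query and column names are assumptions; in the notebook, the same statement can be entered directly in a Spark SQL cell.

```python
# A minimal sketch of an SCD Type 1 upsert for dim_patient (column names are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO dim_patient AS target
    USING (
        SELECT DISTINCT patient_id, patient_name, patient_gender, patient_age
        FROM silver_appointments
    ) AS source
    ON target.patient_id = source.patient_id
    WHEN MATCHED THEN UPDATE SET
        target.patient_name   = source.patient_name,
        target.patient_gender = source.patient_gender,
        target.patient_age    = source.patient_age
    WHEN NOT MATCHED THEN INSERT *
""")
```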
Best Practices
- Parameterize notebooks for easy reuse and flexibility across different data sources and environments. This reduces repetitive work and simplifies maintenance (see the sketch after this list).
- When moving data from bronze to silver, use Delta Lake’s MERGE command for incremental data updates. This approach updates only the changed records instead of reprocessing entire datasets, improving speed and resource efficiency.
- Model your gold layer using a star schema to create a clean, simplified structure that supports fast, efficient queries. This is critical for analytics performance.
- Take advantage of Power BI’s Direct Lake mode, which allows users to run fast, interactive reports directly on the Lakehouse’s Delta files without moving or duplicating data.
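To illustrate the first practice above, a small parameters cell can drive the notebook instead of hard-coded values; the parameter names and defaults below are assumptions for illustration.

```python
# Parameters cell: tag this cell as a parameters cell in the Fabric notebook (or override
# the values from a pipeline run). Names and defaults are assumptions for illustration.
raw_folder = "Files/Raw"
source_file = "clinic_appointments.csv"
bronze_table = "bronze_appointments"
silver_table = "silver_appointments"

# Downstream cells then reference these variables instead of hard-coded strings, e.g.:
# spark.read.option("header", "true").csv(f"{raw_folder}/{source_file}")
```

The same notebook can then be pointed at a different source file or environment without touching the transformation logic.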
Closing
Ultimately, implementing the Medallion Architecture in Microsoft Fabric provides a single, reliable foundation for all your data needs. Structuring your data through the bronze, silver, and gold layers creates a trusted asset ready for any challenge. With this solid groundwork, you can move beyond traditional reporting and confidently explore advanced analytics, machine learning, and even generative AI applications, all within one unified platform.
Ready to unlock the full value of your data?
At phData, we help organizations implement modern data architectures and leverage platforms like Microsoft Fabric to unlock the full value of their data. Connect with our experts to see how we can help you build a scalable, future-ready data strategy.
FAQs
Do I really need all three layers (Bronze, Silver, and Gold)?
Yes, if you want your project to scale. The Bronze layer keeps the raw data so you can replay transformations if issues come up. The Silver layer provides a clean, single source of truth that avoids inconsistent data across teams. The Gold layer is built on top of this reliable base for BI and analytics. Skipping layers may save time initially, but it leads to a fragile and harder-to-maintain system later.
Should I use a Lakehouse or a Data Warehouse for my Gold layer?
Choose a Lakehouse if your main goal is high performance with Power BI’s Direct Lake mode. Opt for a Data Warehouse when running complex T-SQL stored procedures or leveraging advanced enterprise SQL capabilities. Both layers integrate seamlessly: T-SQL in the warehouse can directly query Lakehouse tables, and Lakehouse PySpark can access warehouse data via Fabric libraries, allowing cross-platform data access without physical data movement or relying on Fabric Shortcuts.