May 7, 2025

How Querying Apache Iceberg Metadata Can Elevate Your DataOps Strategy

By Hiresh Roy

Unlock the full potential of Apache Iceberg’s metadata tables to enhance your data platform’s reliability, observability, and operational efficiency. Through practical examples with AWS Glue jobs, this blog demonstrates how to query Iceberg’s metadata tables using SQL and turn hidden information into actionable insights. You’ll learn to monitor data file changes, audit data growth patterns, and reduce troubleshooting time—without adding new tools or unnecessary complexity. Whether you’re a data engineer, architect, or platform owner, this approach can help you shift from reactive firefighting to proactive, intelligent data management.

The ETL Pipeline Failed in PROD. Root Cause: Unknown

For the third time this quarter, a critical AWS Glue job failed without warning. By morning, broken dashboards and unanswered tickets had triggered yet another round of reactive firefighting.

Post-incident analysis revealed an untracked schema change—one that wasn’t documented and went unnoticed until it disrupted a critical workflow.

What’s worse? This wasn’t the first time. And if nothing changed, it wouldn’t be the last.

That’s when the data platform owner started wondering:

“Isn’t there a native way to track these changes? To understand what changed, when it changed, and why—without adding yet another monitoring layer or writing custom code?”

You’re Probably Not Using Iceberg’s Best Feature

Most data engineering teams treat Apache Iceberg as just a high-performance open table format supporting ACID transactions, hidden partitioning, efficient pruning, and schema evolution. But Iceberg is much more than that. It’s a rich, self-describing open table format that exposes a goldmine of operational metadata—fully queryable, deeply insightful, and often overlooked. With its built-in metadata tables, Iceberg makes your data not only accessible but also introspectable, offering powerful observability into the health, structure, and behavior of your data.

In this blog, we’ll explore how these metadata tables help you:

  • Detect schema and partition changes before they break your pipelines

  • Track data file spikes that could point to duplicates or performance issues

  • Improve discoverability and auditability, and reduce TCO

  • And most importantly, bring observability and accountability to your data platform without extra tools or complexity.

What Are Apache Iceberg Metadata Tables?

Apache Iceberg isn’t just a modern open table format for the data lake—it’s a metadata-first architecture built for reliability, evolution, and scale. One of its hidden capabilities is queryable metadata tables.

These aren’t just logs or command-line utilities. They’re the equivalent of system tables that can be queried with SQL, providing detailed visibility into your data. You can think of them as Iceberg’s modern replacement for commands like SHOW PARTITIONS in Hive, but with full SQL support and far more flexibility.

Unlike Hive or traditional file-based formats, where gaining visibility into schema, partitioning, and file layout often depends on engine-specific logic or limited tooling, Iceberg exposes this information as structured, queryable tables. That means you can:

  • Run joins, filters, and aggregations

  • Monitor data drift, duplication, and layout skew

  • Detect slow growth, sudden spikes, or stale snapshots

  • Even “time travel” across versions to understand what changed, when, and why

In short, metadata tables act like the observability layer for your data platform, backed by SQL, and always up to date.
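For a taste of what that looks like in practice, here’s a minimal sketch. It assumes a Spark session (spark) with an Iceberg catalog named glue_catalog (we’ll show that configuration later in this post) and a hypothetical orders table; a single aggregation over the files metadata table surfaces layout skew across partitions:

# Minimal sketch: spot layout skew by aggregating the files metadata table.
# 'glue_catalog.phdata_demo_db.orders' is a hypothetical table for illustration.
spark.sql("""
    SELECT partition,
           COUNT(*)                AS file_count,
           AVG(file_size_in_bytes) AS avg_file_bytes
    FROM glue_catalog.phdata_demo_db.orders.files
    GROUP BY partition
    ORDER BY file_count DESC
""").show(truncate=False)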

What Everyone Knows:

  • Iceberg stores rich metadata to speed up query planning

  • It tracks schema, partition layout, and snapshot versions

What Very Few Realize:

This metadata is exposed as first-class, queryable tables, and you can use them to:

  • Monitor pipeline health without third-party tools

  • Detect schema and file anomalies before things break

  • Audit data changes across time and contributors

  • Flag cost spikes due to duplicate or oversized files

  • Track table usage, freshness, and staleness metrics

  • Automate checks and alerts within your platform

And the best part: no extra tools, no custom plugins, no external catalogs. It’s all built in.

Meet the Queryable Iceberg Metadata Tables

Here are the Iceberg metadata tables you can query using an AWS Glue job (Spark) or any execution engine that supports the Apache Iceberg open table format.

| # | Iceberg Metadata Table | Purpose |
|---|------------------------|---------|
| 1 | table_name.history | Tracks which snapshots were made current and when, like a version control log for the table. |
| 2 | table_name.metadata_log_entries | Logs every metadata file created during table updates. |
| 3 | table_name.snapshots | Lists every valid snapshot with its operation type, summary, and manifest list. |
| 4 | table_name.files | Lists current data files with details like location, format, and partition info. |
| 5 | table_name.manifests | Shows manifest files that group data files for snapshot management. |
| 6 | table_name.partitions | Displays how data is organized by partitions, making it easier to optimize queries and improve performance. |
| 7 | table_name.all_data_files | Provides a complete view of all data files across valid snapshots. |
| 8 | table_name.all_manifests | Offers detailed insights into all manifest files across valid snapshots. |
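Querying any of these follows the same pattern: append the metadata table name to the fully qualified table identifier. A quick sketch, using the order_shipping table we’ll build in the next section:

# Suffix the metadata table name onto the table identifier.
spark.sql("SELECT * FROM glue_catalog.phdata_demo_db.order_shipping.history").show(truncate=False)
spark.sql("SELECT * FROM glue_catalog.phdata_demo_db.order_shipping.snapshots").show(truncate=False)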

Let’s See It in Action: A Practical Use Case

To understand how Apache Iceberg metadata tables work in practice, we’ll walk through a realistic example using AWS Glue jobs. A single AWS Glue job performs a series of operations—such as inserts, updates, deletes, and a bulk load—on an order_shipping table. After each operation, we’ll query the metadata tables to observe how Iceberg captures these changes. This demonstrates how Iceberg enables built-in observability, traceability, and debugging without relying on external tools.

import sys
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql.functions import to_date, current_timestamp, lit
from awsglue.context import GlueContext

# Initialize Spark & Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Set Iceberg configs
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://phdata-demo-source/iceberg-warehouse/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")

# Step 1: Create order_shipping table (partitioned by 'day' - hidden), with 'location' field
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.phdata_demo_db.order_shipping (
        order_id STRING,
        customer_id STRING,
        location STRING,
        order_date DATE,
        shipment_date DATE,
        delivery_date DATE,
        load_timestamp TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_date))
    TBLPROPERTIES (
        'format-version'='2',
        'write.format.default'='parquet'
    )
""")
print("Step 1: Table created (Snapshot s0)")

# Step 2: Insert 10 records (Snapshot s1)
df_s1 = spark.createDataFrame([
    Row(order_id=f"ORD00{i}", customer_id=f"CUST10{i}", location="Bangalore", 
        order_date="2023-01-01", shipment_date="2023-01-02", delivery_date="2023-01-05") 
    for i in range(1, 11)
]).withColumn("order_date", to_date("order_date")) \
  .withColumn("shipment_date", to_date("shipment_date")) \
  .withColumn("delivery_date", to_date("delivery_date")) \
  .withColumn("load_timestamp", current_timestamp())

df_s1.writeTo("glue_catalog.phdata_demo_db.order_shipping").append()
print("Step 2: Inserted 10 records (Snapshot s1)")

# Step 3: Insert another 10 records (Snapshot s2)
df_s2 = spark.createDataFrame([
    Row(order_id=f"ORD01{i}", customer_id=f"CUST11{i}", location="Bangalore", 
        order_date="2023-01-03", shipment_date="2023-01-04", delivery_date="2023-01-07") 
    for i in range(1, 11)
]).withColumn("order_date", to_date("order_date")) \
  .withColumn("shipment_date", to_date("shipment_date")) \
  .withColumn("delivery_date", to_date("delivery_date")) \
  .withColumn("load_timestamp", current_timestamp())

df_s2.writeTo("glue_catalog.phdata_demo_db.order_shipping").append()
print("Step 3: Inserted 10 more records (Snapshot s2)")

# Step 4: Re-insert the same 10 records from Step 3 to simulate a need for delete (Snapshot s3)
df_s2.writeTo("glue_catalog.phdata_demo_db.order_shipping").append()
print("Step 4: Re-inserted same 10 records to simulate duplicates (Snapshot s3)")

# Step 5: Delete the last 10 records based on order_id (Snapshot s4)
order_ids_to_delete = [f"ORD01{i}" for i in range(1, 11)]
condition = "order_id IN ({})".format(",".join([f"'{oid}'" for oid in order_ids_to_delete]))

spark.sql(f"""
    DELETE FROM glue_catalog.phdata_demo_db.order_shipping
    WHERE {condition}
""")
print("Step 5: Remove duplicate records by order_id (Snapshot s4)")

# Step 6: Update a few records (Snapshot s5)
spark.sql("""
    UPDATE glue_catalog.phdata_demo_db.order_shipping
    SET location = 'Hyderabad'
    WHERE customer_id IN ('CUST102', 'CUST103')
""")

print("Step 6: Updated a few records (Snapshot s5)")

# Step 7: Insert 50 records to generate file volume (Snapshot s6)
df_bulk = spark.createDataFrame([
    Row(order_id=f"ORD2{i:02}", customer_id=f"CUST20{i:02}", location="Mumbai", 
        order_date="2023-02-01", shipment_date="2023-02-02", delivery_date="2023-02-05") 
    for i in range(1, 51)
]).withColumn("order_date", to_date("order_date")) \
  .withColumn("shipment_date", to_date("shipment_date")) \
  .withColumn("delivery_date", to_date("delivery_date")) \
  .withColumn("load_timestamp", current_timestamp())

df_bulk.writeTo("glue_catalog.phdata_demo_db.order_shipping").append()
print("Step 7: Inserted 50 records to show partition file size behavior (Snapshot s6)")

# Done!
print("All operations completed. You can now query Iceberg metadata tables for observability.")

On completion, you can explore the S3 bucket, specifically the metadata and data folders. For each write or append operation, Iceberg creates a new metadata JSON file that captures a snapshot of the table. This snapshot represents the exact state of the data, enabling powerful capabilities like time travel, rollback, and historical tracking.

If you download the final metadata JSON file generated by Iceberg, you’ll notice it contains six distinct entries, each corresponding to one of the write operations we performed on the order_shipping table using AWS Glue. This file serves as a detailed log, capturing comprehensive information about each operation, including inserts, updates, deletes, and schema changes. It essentially records the operational history of the table, which is directly driven by our Glue job executions.

Now, imagine your data platform team needs to manually open and interpret these multi-level JSON metadata files just to understand what operations your AWS Glue jobs performed. With multiple operations across different snapshots, this quickly becomes inefficient and error-prone. This is exactly where metadata tables shine. In the next section, we’ll see how querying these metadata tables with SQL can give you a clear, structured view of table activity, making auditing, debugging, and insight generation far more efficient.

Extracting Operational Insights Using Iceberg Metadata Tables

History Metadata Table

Let’s begin our deep dive with the history metadata table. This table acts like a version control log for your Iceberg table, capturing a high-level timeline of changes. It’s useful when you want to audit how your table has evolved and when.

As shown in the diagram above (AWS Glue Notebook & Metadata JSON file), the history metadata table contains the following four columns that help us understand the evolution of our Iceberg table:

  • made_current_at: Timestamp when the snapshot becomes active—marks exactly when a change was committed.

  • snapshot_id: Unique identifier for each snapshot—lets you reference or trace individual table states.

  • parent_id: ID of the parent snapshot—builds the lineage and evolution chain of the table.

  • is_current_ancestor: Boolean flag indicating if the snapshot is part of the current state lineage—helps identify active vs. rolled-back snapshots.

If you have a use case that requires rolling back to a specific point in time, the history metadata table is invaluable. Using snapshot IDs, you can restore the table to an earlier state, helping you recover from errors and minimize data loss with precision.
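Here’s a minimal sketch of that workflow, assuming the order_shipping table from our Glue job and a Spark session with Iceberg’s SQL extensions registered (required for CALL procedures). The snapshot ID below is a placeholder, not a real value:

# Inspect the snapshot timeline to find a known-good state.
spark.sql("""
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM glue_catalog.phdata_demo_db.order_shipping.history
    ORDER BY made_current_at
""").show(truncate=False)

# Roll back to a snapshot picked from the query above (placeholder ID).
spark.sql("""
    CALL glue_catalog.system.rollback_to_snapshot('phdata_demo_db.order_shipping', 1234567890123456789)
""")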

Log Entries Metadata Table

The next metadata table is metadata_log_entries, which tracks changes to the table by logging metadata files generated during updates. Each record reflects the table’s state at a specific point in time, with the timestamp field marking when the metadata was updated.

When working with evolving table schemas, it’s common to run into situations where a recent schema change needs to be reversed or revisited. This is where the metadata_log_entries metadata table becomes incredibly useful.

As shown in the diagram above, the metadata_log_entries metadata table contains the following five columns that help us understand the evolution of our Iceberg table:

  • timestamp – When the metadata file was created.

  • file – The full S3 path to the metadata JSON file.

  • latest_snapshot_id – The snapshot ID associated with the metadata file.

  • latest_schema_id – The version of the schema used in that snapshot.

  • latest_sequence_number – The sequence number showing the order of operations.

If a recent schema change caused issues, the metadata_log_entries table can help you roll back. It tracks all metadata updates, letting you pinpoint the last known good schema version using its snapshot history.
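As a sketch against the same order_shipping table, listing the most recent entries alongside latest_schema_id shows exactly when the schema version changed:

# List metadata updates, newest first; a change in latest_schema_id marks a schema change.
spark.sql("""
    SELECT timestamp, latest_snapshot_id, latest_schema_id, latest_sequence_number
    FROM glue_catalog.phdata_demo_db.order_shipping.metadata_log_entries
    ORDER BY timestamp DESC
""").show(truncate=False)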

Snapshots Metadata Table

The next metadata table is snapshots, which tracks every version of the dataset, capturing the exact state of the table at a given point in time. The operation field indicates what kind of change was made, e.g., an append, overwrite, or delete, making it easy to audit and understand how the table has evolved.

As shown in the output above, the snapshots metadata table contains the following key columns that help us track the version history and state of an Iceberg table:

  • committed_at – Timestamp indicating when the snapshot was committed.

  • snapshot_id – Unique identifier for the snapshot.

  • parent_id – Snapshot ID of the parent snapshot, indicating lineage.

  • operation – Type of change made in the snapshot (e.g., append, delete, overwrite).

  • manifest_list – S3 path to the Avro file that lists all manifest files used in this snapshot.

  • summary – Key-value metadata summarizing what was added, deleted, or changed during the snapshot (e.g., number of records, files, and partition changes).

The snapshots metadata table plays a crucial role in understanding the historical evolution of an Iceberg table. One of its key use cases is tracking the pattern of data additions over time. By analyzing snapshot metadata, such as timestamps, number of records added, and file sizes, users can gain insights into how their dataset is growing. This information is valuable for capacity planning, detecting usage trends, and forecasting future storage needs. Additionally, understanding data growth patterns can help optimize partition strategies and tune performance for large-scale analytical workloads.
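Here’s a sketch of such a growth audit against our order_shipping table. The summary column is a string-to-string map, so its standard keys (such as 'added-records') can be read directly in SQL:

# Track how much data each commit added over time.
spark.sql("""
    SELECT committed_at,
           operation,
           summary['added-records']    AS added_records,
           summary['added-files-size'] AS added_bytes,
           summary['total-records']    AS total_records
    FROM glue_catalog.phdata_demo_db.order_shipping.snapshots
    ORDER BY committed_at
""").show(truncate=False)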

Files Metadata Table

The files metadata table provides a detailed view of all active data files in an Iceberg table. It offers rich details about each file, including its storage location, file format, partition values, and the data it holds. This table is essential for understanding how data is physically organized and distributed across files, which is critical for optimizing query performance and storage management.

As shown in the figure above, the files metadata table contains rich details for each active data file, including file location, format, partition info, record and file size stats, per-column metrics (sizes, value counts, nulls), min/max bounds, sort order, and additional metadata for optimizing storage and query performance.
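For instance, a simple filter over the files table flags undersized files that are candidates for compaction. A minimal sketch with a hypothetical 10 MB threshold:

# Find small files that may be hurting scan performance (threshold is illustrative).
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM glue_catalog.phdata_demo_db.order_shipping.files
    WHERE file_size_in_bytes < 10 * 1024 * 1024
    ORDER BY file_size_in_bytes
""").show(truncate=False)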

Manifests Metadata Table

The manifests metadata table provides detailed insights into each of the current manifest files associated with a table. It contains valuable information that helps track the structure of the table and how it evolves, making it easier to analyze data organization and historical changes.

As shown in the above result, the manifests metadata table captures key details about each manifest file in the Iceberg table. It includes the file’s path and size, the partition specification it follows, and the snapshot in which it was added. Additionally, it records counts of added, existing, and deleted data and associated deleted files, helping to track changes at the file level over time. The partition_summaries column further summarizes partition-level details for quick insights into the data distribution within each manifest file.
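A short sketch that surfaces those counts per manifest for the order_shipping table:

# Summarize what each manifest added, kept, or deleted.
spark.sql("""
    SELECT path, length, added_snapshot_id,
           added_data_files_count, existing_data_files_count, deleted_data_files_count
    FROM glue_catalog.phdata_demo_db.order_shipping.manifests
""").show(truncate=False)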

Partitions Metadata Table

The partitions metadata table offers insight into how a table’s data is segmented into non-overlapping partitions. Each row in this table corresponds to a unique partition and provides aggregated metadata about the files and records stored within that partition. This structure enables efficient querying and data pruning by isolating changes and updates to specific data segments over time.

As shown in the image above, the partitions metadata table provides detailed information about how the Iceberg table is divided into non-overlapping logical segments. Each row corresponds to a distinct partition (e.g., a specific date) and includes the associated spec_id used to define the partitioning strategy. It captures metrics like the total number of records (record_count), number of associated data files (file_count), and their cumulative size in bytes (total_data_file_size_in_bytes). It also tracks delete operations, including the count of records and files affected by both positional and equality delete operations. Additionally, it records when the partition was last updated (last_updated_at) and the corresponding snapshot_id, offering a historical view of data changes at the partition level.
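Here’s a minimal sketch that ranks partitions by stored bytes, a quick way to spot skewed or unexpectedly heavy partitions:

# Rank partitions by total stored bytes to spot skew.
spark.sql("""
    SELECT partition, record_count, file_count,
           total_data_file_size_in_bytes, last_updated_at
    FROM glue_catalog.phdata_demo_db.order_shipping.partitions
    ORDER BY total_data_file_size_in_bytes DESC
""").show(truncate=False)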

All Data Files Metadata Table

The all_data_files metadata table in Apache Iceberg offers detailed insights into every data file that exists across all valid snapshots within the table.

As shown in the above diagram, this table helps in understanding the state and organization of your data at a granular level.

It is important to understand that the all_data_files metadata table can return multiple rows for the same data file, as a single file may be associated with multiple table snapshots.

The all_data_files metadata table supports several practical use cases, including identifying the largest snapshot, aggregating file sizes over time, and evaluating partition distribution trends across versions.
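As a sketch, grouping by file_path shows how often each data file is referenced across snapshots and lets you aggregate file sizes over the table’s full history:

# Count how many snapshots reference each data file.
spark.sql("""
    SELECT file_path,
           COUNT(*)                AS snapshot_appearances,
           MAX(file_size_in_bytes) AS file_size_in_bytes
    FROM glue_catalog.phdata_demo_db.order_shipping.all_data_files
    GROUP BY file_path
    ORDER BY snapshot_appearances DESC
""").show(truncate=False)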

All Manifests Metadata Table

The all_manifests metadata table in Apache Iceberg offers detailed visibility into each manifest file associated with all valid snapshots of a table.

Keep in mind that the all_manifests metadata table can return multiple rows for a single manifest file, as a file might be associated with multiple table snapshots. It provides a broader view of data organization compared to the all_data_files table.
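Here’s a sketch that aggregates manifest counts and sizes per referencing snapshot; it assumes your Iceberg version exposes the reference_snapshot_id column in all_manifests:

# Manifest footprint per snapshot that references it.
spark.sql("""
    SELECT reference_snapshot_id,
           COUNT(*)    AS manifest_count,
           SUM(length) AS total_manifest_bytes
    FROM glue_catalog.phdata_demo_db.order_shipping.all_manifests
    GROUP BY reference_snapshot_id
""").show(truncate=False)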

A Paradigm Shift for DataOps and Platform Teams

These metadata tables represent more than just internal logs—they serve as powerful, queryable assets that DataOps and platform engineering teams can actively automate against. By leveraging them, teams can transition from reactive data troubleshooting to proactive data platform management.

With this knowledge at their fingertips:

  • You can build schema evolution dashboards to track how and when changes were introduced.

  • Set up alerts for large file generation that might violate performance best practices.

  • Perform automated audits of partitions and data changes, ensuring data integrity and compliance with governance policies.
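As a sketch of the alerting idea from the list above, a scheduled Glue job could run a check like the following. The table name and thresholds are illustrative, and the print is a placeholder for your own SNS, Slack, or CloudWatch hook:

# Minimal alerting sketch: warn when too many small files accumulate.
small_files = spark.sql("""
    SELECT COUNT(*) AS n
    FROM glue_catalog.phdata_demo_db.order_shipping.files
    WHERE file_size_in_bytes < 128 * 1024 * 1024
""").first()["n"]

if small_files > 100:
    # Replace this print with your own notification mechanism.
    print(f"WARNING: {small_files} small files detected; consider compaction.")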

Ultimately, Iceberg’s metadata tables enable a shift toward intelligent automation and data transparency, which are crucial for maintaining robust, scalable data platforms in modern analytics ecosystems.

Conclusion

There’s still a lot to explore when it comes to fully leveraging Apache Iceberg’s metadata tables. From setting up detailed audits and real-time data quality monitoring to fine-tuning performance, the opportunities to improve how you manage and work with your data are significant.


Trying to Level Up Your Data Platform with Iceberg?

If you’re looking to bring these capabilities into your production environment or simply want to strengthen your data platform’s observability and control, the phData team is here to help, with hands-on experience and proven solutions using Apache Iceberg.
