April 25, 2025

Native Apache Iceberg Support in AWS Glue: What You Need to Know (and Probably Missed)

By Hiresh Roy

If you’ve ever managed large Parquet or CSV datasets on Amazon S3 — especially using AWS Glue — you’ve likely faced data consistency, schema evolution, and query performance challenges. Traditional data lakes in Glue tend to be file-centric, where operations like deletes, updates, or even simple appends often result in broken pipelines, corrupted data, or unexpected job failures, particularly when multiple jobs concurrently interact with the same dataset.

Apache Iceberg flips that model on its head by bringing database-like capabilities to your data lake. It enables ACID-compliant transactions, versioned schemas, and time travel — all on top of your existing S3 storage — and is fully queryable from engines like Apache Spark, AWS Athena, and Snowflake. As of Glue 4.0, AWS Glue natively supports Iceberg, allowing teams to modernize their Parquet-based pipelines without completely rewriting them.

In this post, we’ll walk through what native Iceberg support in Glue means, why it matters for teams already using AWS Glue, and how you can start taking advantage of it.

If your team is building lakehouse pipelines using AWS Glue with Parquet datasets, this blog will show you what you might be missing without Apache Iceberg.

Why Upgrade to AWS Glue 4.x?

With native Apache Iceberg support built into AWS Glue 4.0 and above, upgrading your Glue version isn’t just about unlocking advanced features like schema evolution, hidden partitioning, or time travel. It’s also a practical cost-saving decision.

Even if your current pipelines don’t need Iceberg’s full set of capabilities, the shift from file-centric Parquet writes to table-aware Iceberg writes can significantly reduce your query execution costs, especially when used with Athena or Spark SQL. Iceberg uses metadata-driven planning, which avoids scanning unnecessary files and reduces I/O overhead.
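You can see this metadata directly: every Iceberg table exposes metadata tables that record the per-file statistics the planner uses to skip files. Here is a minimal sketch, assuming a Spark session with an Iceberg catalog named glue_catalog and a hypothetical iceberg_default.sales table (the catalog setup is shown later in this post):

# Sketch: inspect the per-file statistics Iceberg's planner prunes with.
# "glue_catalog" and iceberg_default.sales are hypothetical names.
spark.sql("""
    SELECT file_path, record_count, lower_bounds, upper_bounds
    FROM glue_catalog.iceberg_default.sales.files
""").show(truncate=False)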

In short, migrating to Glue 4.0 and adopting Iceberg tables can lower your total cost of ownership, improve performance, and deliver better user experience and faster insights for downstream consumers, without requiring a complete rewrite of your existing workflows.

AWS Glue + Iceberg: A Quick Timeline

AWS Glue’s support for Apache Iceberg has matured significantly in recent versions. While it’s technically possible to use Iceberg with AWS Glue 3.0 via manual setup, the process is often error-prone. AWS Glue 4.x and above offer native support, making it far easier to manage Apache Iceberg tables while unlocking better performance, lower query costs, and improved data reliability — even if you don’t need advanced features like time travel or schema evolution immediately.

Here’s a quick look at how AWS Glue evolved to support Apache Iceberg natively across its major versions:

| AWS Glue Version | Apache Iceberg Support | Supported Iceberg Version | Notes |
| --- | --- | --- | --- |
| AWS Glue 3.0 | Not supported natively | 0.13.1 (via custom connector) | Requires custom connectors; setup is manual, brittle, and hard to scale |
| AWS Glue 4.0 | Native support introduced | 1.0.0 | Built on Spark 3.3; supports Iceberg v1 & v2 with Glue Catalog integration |
| AWS Glue 5.0 | Latest and most advanced | 1.7.1 | Built on Spark 3.5.2 for further performance gains |

Enabling Apache Iceberg in AWS Glue 3.0 Jobs

If you’re using AWS Glue 3.0 and want to enable Iceberg support, you’ll need to manually configure job parameters and Spark settings, which can be error-prone and hard to scale.

Start by adding --datalake-formats with the value iceberg to signal the Glue job to enable support for Iceberg tables. Then, configure Spark to recognize and use the Iceberg catalog and file formats by setting the --conf parameters in your Glue job. These settings register the glue_catalog with Apache Iceberg and ensure Spark can read and write to your S3-based warehouse properly.

--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Figure 1: Job Parameter Configuration With AWS Glue 3.0

The snapshot.json file generated by AWS Glue 3.0 jobs omits several key metadata entries under the summary section, most notably the engine-version tag. As shown in Figure 2, the summary tag includes operational metadata like added records and data files, but lacks the engine metadata that newer Glue versions embed automatically. This omission can impact downstream observability, auditing, and compatibility tracking in multi-engine environments.

Figure 2: Iceberg Table Metadata JSON file

Unless you’re using Glue job templates, infrastructure-as-code tools like AWS CloudFormation and Terraform, or automating job creation and updates through scripts or the AWS CDK, managing these configurations manually across multiple jobs can become a significant challenge. This not only increases operational overhead but also makes it harder to ensure consistency and avoid configuration drift during large-scale migrations to Iceberg.
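For illustration, here is a minimal sketch of codifying those parameters with the AWS CDK so every Glue 3.0 job is created with identical settings. It uses the CDK’s low-level CfnJob construct; the role ARN, bucket names, and job name are placeholders, not values from this post, and chaining the settings into a single --conf value is one common pattern for passing multiple Spark confs through Glue job arguments:

from aws_cdk import App, Stack
from aws_cdk import aws_glue as glue
from constructs import Construct

class IcebergGlueJobStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Chain every Spark setting into one --conf value so all jobs
        # receive identical Iceberg configuration.
        iceberg_conf = " --conf ".join([
            "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.glue_catalog.warehouse=s3://my-warehouse-bucket/",
            "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
        ])

        glue.CfnJob(
            self, "IcebergGlueJob",
            role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
            glue_version="3.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://my-scripts-bucket/iceberg_job.py",  # placeholder
            ),
            default_arguments={
                "--datalake-formats": "iceberg",
                "--conf": iceberg_conf,
            },
        )

app = App()
IcebergGlueJobStack(app, "IcebergGlueJobStack")
app.synth()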

How Native Iceberg Support in Glue 4.0+ Makes Adoption Effortless

One of the most compelling reasons to move to AWS Glue 4.0 or later is the simplified experience of working with Apache Iceberg. Unlike Glue 3.0, where enabling Iceberg required manual configurations and job parameters, Glue 4.0+ supports Iceberg natively — no setup overhead. 

You can now create and append to Iceberg tables directly using Spark’s writeTo() API, seamlessly integrating into the AWS Glue Data Catalog. Whether you’re building with DynamicFrames, using Spark SQL, or designing in Glue Studio Visual Jobs, native Iceberg support is built-in and production-ready. 

Features like tableProperty("format-version", "2") and built-in support for compression, partitioning, and schema evolution come out of the box, allowing you to modernize your pipelines without friction. This level of integration significantly reduces onboarding effort and enables you to focus on delivering faster, more reliable data workflows.

The following example demonstrates a simplified AWS Glue 5.0 Spark job that checks whether a target Iceberg table exists and either appends to it or creates it from scratch. It shows how easy it is to get started with Iceberg in a Glue-native way.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://curated-iceberg-bucket/iceberg_default/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")


# Load source Iceberg table from Glue Data Catalog
source_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="iceberg_default",
    table_name="sales",
    transformation_ctx="source_dynamic_frame",
)

# Convert to Spark DataFrame for Iceberg operations
source_df = source_dynamic_frame.toDF()

# Check if target Iceberg table already exists
target_table_name = "sales_iceberg"
tables_in_catalog = spark.catalog.listTables("iceberg_default")
table_exists = target_table_name in [t.name for t in tables_in_catalog]

# Table properties applied when the table is first created
iceberg_table_properties = {
    "format-version": "2",
    "location": "s3://curated-iceberg-bucket/iceberg_default/sales_iceberg",
    "write.parquet.compression-codec": "gzip",
}

# Write to Iceberg table (append if it exists, otherwise create it)
writer = source_df.writeTo(f"glue_catalog.iceberg_default.{target_table_name}")
if table_exists:
    writer.append()
else:
    # tableProperty() sets Iceberg table properties at creation time
    for key, value in iceberg_table_properties.items():
        writer = writer.tableProperty(key, value)
    writer.create()

job.commit()

With AWS Glue version 4.0 and above, working with Apache Iceberg becomes almost invisible, in the best possible way. You don’t need to set file format configuration flags like --datalake-formats; AWS Glue handles it under the hood.

This native behavior means:

  • No format declarations.

  • No custom JARs.

  • Seamless support with DynamicFrames, Spark SQL, and Glue Studio.

Note: The spark.sql.catalog configuration is still required unless the AWS Glue script is generated through a Glue Studio visual ETL job.

Whether you’re appending to an existing table or creating one from scratch, the .writeTo("glue_catalog.db.table") API is now Iceberg-aware by default. For developers, this means faster setup, cleaner code, and fewer surprises—while still benefiting from Iceberg’s powerful features like versioning, schema evolution, and partition pruning.
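As a quick illustration of the versioning features mentioned above, here is a hedged sketch of Iceberg time travel in Spark SQL. The catalog, database, and table names are assumptions, and the VERSION AS OF / TIMESTAMP AS OF syntax requires Spark 3.3+ (i.e., Glue 4.0+):

# List the table's snapshots via Iceberg's snapshots metadata table
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM glue_catalog.iceberg_default.sales.snapshots"
).show()

# Query the table as it existed at a point in time ...
spark.sql(
    "SELECT * FROM glue_catalog.iceberg_default.sales "
    "TIMESTAMP AS OF '2025-01-01 00:00:00'"
).show()

# ... or as of a specific snapshot id (placeholder value)
spark.sql(
    "SELECT * FROM glue_catalog.iceberg_default.sales "
    "VERSION AS OF 1234567890123456789"
).show()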

Figure 3: Table Created Using AWS Glue 5.0 Spark API.

The Iceberg snapshot also records the Glue 5.0 (Spark 3.5) engine under the summary tag, as shown in Figure 4.

Figure 4: Iceberg Table Snapshot File Created by Glue 5.0 Jobs

Performance Comparison: Parquet vs Apache Iceberg

To validate the performance efficiency of Apache Iceberg, a sales order dataset was processed using AWS Glue and stored in two formats—traditional Parquet-based partitions and modern Iceberg tables. The dataset spans 5 years of order records, partitioned by year/month, as illustrated in Figure 5.

Each partition contains approximately 100k–150k rows with 20 columns. Unlike the Parquet format, the Iceberg table uses hidden partitioning to distribute the data intelligently (see the DDL sketch after Figure 5). Each table holds around 7 million records and is registered in the Glue Data Catalog, making it queryable through Amazon Athena, as shown in Figure 6.

Figure 5: Sales Data Storage In S3 (Parquet Vs Iceberg)
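For context, hidden partitioning is declared with a partition transform in the table DDL rather than with explicit year/month columns. Here is a minimal sketch assuming a simplified schema (the actual benchmark table has 20 columns):

# Sketch: hidden partitioning via a transform on sale_date.
# Queries filter on sale_date directly; Iceberg maps the predicate
# to partitions without exposing year/month columns.
spark.sql("""
    CREATE TABLE glue_catalog.sales_database.sales_table_iceberg (
        order_id   BIGINT,
        country    STRING,
        quantity   INT,
        unit_price DECIMAL(10, 2),
        sale_date  DATE
    )
    USING iceberg
    PARTITIONED BY (months(sale_date))
""")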

To measure query performance, we ran an aggregation query (grouping sales by country for a single month) on both data formats using Spark SQL in an AWS Glue job (the full Glue job script is shown below). The performance improvement was significant and measurable.

  • The Parquet query took 18.87 seconds.

  • The Iceberg query completed in just 12.4 seconds.

Figure 6: AWS Glue Catalog View For Both Tables


Figure 7: Query Execution Result

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from datetime import datetime
from pprint import pprint

# Get job args
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Set up Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Config
glue_database = "sales_database"
parquet_table = "sales_table_parquet"
iceberg_table = "sales_table_iceberg"

# Iceberg catalog config
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://icebergdemo-s3/iceberg_data/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")

# ------------ PARQUET READ -------------------
start_parquet = datetime.now()
print(f"\n⏱️ Start loading Parquet: {start_parquet}")

parquet_sql = f"""
    SELECT country, SUM(quantity * unit_price) AS total_sales
    FROM {glue_database}.{parquet_table}
    WHERE sale_date BETWEEN date('2023-07-01') AND date('2023-07-31')
    GROUP BY country
    ORDER BY total_sales DESC
"""

# The Iceberg table is addressed through the glue_catalog configured above
iceberg_sql = f"""
    SELECT country, SUM(quantity * unit_price) AS total_sales
    FROM glue_catalog.{glue_database}.{iceberg_table}
    WHERE sale_date BETWEEN date('2023-07-01') AND date('2023-07-31')
    GROUP BY country
    ORDER BY total_sales DESC
"""

print(f"\nThe parquet query: {parquet_sql}\n")
print(f"\nThe iceberg query: {iceberg_sql}\n")

df_parquet = spark.sql(parquet_sql)
pprint(df_parquet.collect())  # collect() triggers execution

end_parquet = datetime.now()
print(f"✅ Finished loading Parquet: {end_parquet}")
print(f"🕒 Time taken for Parquet load: {(end_parquet - start_parquet).total_seconds()} seconds\n")

# ------------ ICEBERG READ -------------------
start_iceberg = datetime.now()

print(f"\n⏱️ Start loading Iceberg: {start_iceberg}")

df_iceberg = spark.sql(iceberg_sql)
pprint(df_iceberg.collect())  # collect() triggers execution

end_iceberg = datetime.now()
print(f"✅ Finished loading Iceberg: {end_iceberg}")
print(f"🕒 Time taken for Iceberg load: {(end_iceberg - start_iceberg).total_seconds()} seconds\n")

job.commit()

This improvement demonstrates how Iceberg’s fine-grained partitioning, automatic metadata tracking, and query-pruning capabilities make it a more efficient choice for large-scale analytics. Iceberg also simplifies and optimizes small-file handling, an area where Parquet-based models often struggle at scale.
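If small files do accumulate, compaction can be run from the same Glue job using Iceberg’s rewrite_data_files stored procedure. Here is a minimal sketch; the table name and target file size are assumptions, and the procedure requires the Iceberg SQL extensions available natively in Glue 4.0+:

# Sketch: compact small data files into ~128 MB files using Iceberg's
# rewrite_data_files procedure (table name is a placeholder).
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'sales_database.sales_table_iceberg',
        options => map('target-file-size-bytes', '134217728')
    )
""").show()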

Conclusion: Why Iceberg on Glue 4 & Above is a Game-Changer

Apache Iceberg isn’t just another table format—it’s a foundational upgrade for any modern data lake. And with AWS Glue 4.0 and above, enabling Iceberg is no longer a complex process. Native support means developers can now write directly to Iceberg tables without setting up JARs, manual configs, or special job parameters. 

With Glue handling all the heavy lifting under the hood, Iceberg becomes a zero-friction, production-ready solution right out of the box.

This shift brings three core advantages to the table:

  1. Operational Simplicity – Glue 4.0+ supports Iceberg natively, whether you’re using Spark SQL, DynamicFrames, or Glue Studio visual jobs. You get cleaner code, faster onboarding, and fewer moving parts.

  2. Performance + Cost Efficiency – Iceberg’s support for hidden partitioning, metadata pruning, and schema evolution can dramatically reduce query scan sizes and runtime, leading to significant cost savings. For large data workloads, this can translate to a 60–90% drop in query costs on engines like Athena or Spark.

  3. Future-Proof Architecture – With built-in support for time travel, table versioning, and advanced governance, Iceberg is well-aligned with the evolving demands of modern analytics and compliance-driven use cases. It also integrates seamlessly with AWS Lake Formation for fine-grained access control.

Continuing with Glue 3.0 means missing out on significant performance gains, operational simplicity, and cost optimization opportunities already available in Glue 4.0+.

Upgrading to Glue 4.0 & above unlocks a powerful, fully managed path to adopt Iceberg—helping you modernize your data lake with minimal friction and maximum return.


Need help upgrading to Glue 4.0 or adopting Iceberg?

The experts at phData can help! Reach out to us today with questions, insights, and actionable advice.
